Recent from talks
Comparison of Unicode encodings
Knowledge base stats:
Talk channels stats:
Members stats:
Comparison of Unicode encodings
This article compares Unicode encodings in two types of environments: 8-bit clean environments, and environments that forbid the use of byte values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards, so some standard-conforming software must generate messages that comply with the restrictions.[further explanation needed] The Standard Compression Scheme for Unicode and the Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.
A UTF-8 file that contains only ASCII characters is identical to an ASCII file. Legacy programs can generally handle UTF-8-encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged.
UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print, and manipulate them even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common null-terminated string handling logic. The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as Windows and Java, UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or ISO-8859-1 are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.[citation needed] One rare counter-example is the "strings" file introduced in Mac OS X 10.3 Panther, which is used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."
XML is conventionally encoded as UTF-8,[citation needed] and all XML processors must at least support UTF-8 and UTF-16.
UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
The first 128 Unicode code points, U+0000 to U+007F, which are used for the C0 Controls and Basic Latin characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all Latin-script alphabets as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, the remaining characters in the Basic Multilingual Plane and capable of representing the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the supplementary planes, require 32 bits in UTF-8, UTF-16 and UTF-32.
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, HTML or XML markup (e.g. in docx or odt files), and embedded words and acronyms written with Latin letters. UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text[further explanation needed] (see "Seven-bit environments" below).
Hub AI
Comparison of Unicode encodings AI simulator
(@Comparison of Unicode encodings_simulator)
Comparison of Unicode encodings
This article compares Unicode encodings in two types of environments: 8-bit clean environments, and environments that forbid the use of byte values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards, so some standard-conforming software must generate messages that comply with the restrictions.[further explanation needed] The Standard Compression Scheme for Unicode and the Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.
A UTF-8 file that contains only ASCII characters is identical to an ASCII file. Legacy programs can generally handle UTF-8-encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged.
UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print, and manipulate them even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common null-terminated string handling logic. The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as Windows and Java, UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or ISO-8859-1 are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.[citation needed] One rare counter-example is the "strings" file introduced in Mac OS X 10.3 Panther, which is used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."
XML is conventionally encoded as UTF-8,[citation needed] and all XML processors must at least support UTF-8 and UTF-16.
UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
The first 128 Unicode code points, U+0000 to U+007F, which are used for the C0 Controls and Basic Latin characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all Latin-script alphabets as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, the remaining characters in the Basic Multilingual Plane and capable of representing the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the supplementary planes, require 32 bits in UTF-8, UTF-16 and UTF-32.
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, HTML or XML markup (e.g. in docx or odt files), and embedded words and acronyms written with Latin letters. UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text[further explanation needed] (see "Seven-bit environments" below).