UTF-1
| MIME / IANA | ISO-10646-UTF-1 |
|---|---|
| Language | International |
| Current status | Obscure, of mainly historical interest. |
| Classification | Unicode Transformation Format, extended ASCII, variable-width encoding |
| Extends | US-ASCII |
| Transforms / Encodes | ISO/IEC 10646 (Unicode) |
| Succeeded by | UTF-8 |
UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
Design
Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (the code points U+0080 through U+009F are also single bytes).
UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. With these 66 protected byte values, the design aimed for ISO/IEC 2022 compatibility.
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, plus a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6 free bits; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
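These range boundaries fall straight out of the modulo-190 arithmetic. A quick check in Python (purely illustrative; the constants come from the byte ranges documented in this article) confirms that the available lead bytes and modulo-190 trailing values account exactly for the documented boundaries at U+4016 and U+38E2E:

```python
# 190 "safe" trailing-byte values: 0x21-0x7E (94 bytes) plus 0xA0-0xFF (96 bytes).
safe = (0x7E - 0x21 + 1) + (0xFF - 0xA0 + 1)
assert safe == 190 == 256 - 66

# Two-byte sequences: lead bytes 0xA1-0xF5 (85 values), one modulo-190 trailer.
two_byte_span = (0xF5 - 0xA1 + 1) * 190
assert 0x100 + two_byte_span == 0x4016        # first three-byte code point

# Three-byte sequences: lead bytes 0xF6-0xFB (6 values), two trailers.
three_byte_span = (0xFB - 0xF6 + 1) * 190**2
assert 0x4016 + three_byte_span == 0x38E2E    # first five-byte code point
```

In other words, 85 two-byte lead bytes times 190 trailing values cover exactly U+0100 through U+4015, and 6 three-byte lead bytes times 190² cover U+4016 through U+38E2D.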
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|---|---|---|---|---|---|---|
| U+0000 | U+009F | 00–9F | ||||
| U+00A0 | U+00FF | A0 | A0–FF | |||
| U+0100 | U+4015 | A1–F5 | 21–7E, A0–FF | |||
| U+4016 | U+38E2D | F6–FB | 21–7E, A0–FF | 21–7E, A0–FF | ||
| U+38E2E | U+7FFFFFFF | FC–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF |
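The ranges in the table above can be turned into a complete single-code-point encoder. The sketch below is illustrative (the names `_t` and `utf1_encode` are our own, not from the standard); it implements the offset-plus-modulo-190 scheme described in this section:

```python
def _t(z):
    # Map a modulo-190 digit (0..189) onto the 190 byte values UTF-1
    # allows in trailing positions: 0..93 -> 0x21..0x7E (printable ASCII),
    # 94..189 -> 0xA0..0xFF.
    return z + 0x21 if z < 94 else z + 0x42

def utf1_encode(cp):
    """Encode one code point (0..0x7FFFFFFF) as a UTF-1 byte sequence."""
    if cp < 0xA0:                        # single byte: identity mapping
        return bytes([cp])
    if cp < 0x100:                       # two bytes, lead 0xA0
        return bytes([0xA0, cp])
    if cp < 0x4016:                      # two bytes, leads 0xA1..0xF5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _t(y % 190)])
    if cp < 0x38E2E:                     # three bytes, leads 0xF6..0xFB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2, _t(y // 190 % 190), _t(y % 190)])
    y = cp - 0x38E2E                     # five bytes, leads 0xFC..0xFF
    return bytes([0xFC + y // 190**4,
                  _t(y // 190**3 % 190), _t(y // 190**2 % 190),
                  _t(y // 190 % 190), _t(y % 190)])
```

For example, `utf1_encode(0xFFFF).hex()` yields `f765af`, matching the F7 65 AF entry in the comparison table below.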
| code point | UTF-8 | UTF-1 |
|---|---|---|
| U+007F | 7F | 7F |
| U+0080 | C2 80 | 80 |
| U+009F | C2 9F | 9F |
| U+00A0 | C2 A0 | A0 A0 |
| U+00BF | C2 BF | A0 BF |
| U+00C0 | C3 80 | A0 C0 |
| U+00FF | C3 BF | A0 FF |
| U+0100 | C4 80 | A1 21 |
| U+015D | C5 9D | A1 7E |
| U+015E | C5 9E | A1 A0 |
| U+01BD | C6 BD | A1 FF |
| U+01BE | C6 BE | A2 21 |
| U+07FF | DF BF | AA 72 |
| U+0800 | E0 A0 80 | AA 73 |
| U+0FFF | E0 BF BF | B5 48 |
| U+1000 | E1 80 80 | B5 49 |
| U+4015 | E4 80 95 | F5 FF |
| U+4016 | E4 80 96 | F6 21 21 |
| U+D7FF | ED 9F BF | F7 2F C3 |
| U+E000 | EE 80 80 | F7 3A 79 |
| U+F8FF | EF A3 BF | F7 5C 3C |
| U+FDD0 | EF B7 90 | F7 62 BA |
| U+FDEF | EF B7 AF | F7 62 D9 |
| U+FEFF | EF BB BF | F7 64 4C |
| U+FFFD | EF BF BD | F7 65 AD |
| U+FFFE | EF BF BE | F7 65 AE |
| U+FFFF | EF BF BF | F7 65 AF |
| U+10000 | F0 90 80 80 | F7 65 B0 |
| U+38E2D | F0 B8 B8 AD | FB FF FF |
| U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 |
| U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A |
| U+100000 | F4 80 80 80 | FC 21 37 B2 7B |
| U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C |
| U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 |
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
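Decoding reverses the same arithmetic. Below is a minimal sketch (function names are our own; no error handling for truncated or malformed input) that recovers one code point from the front of a byte string:

```python
def _t_inv(b):
    # Inverse digit mapping: bytes 0x21-0x7E -> digits 0..93,
    # bytes 0xA0-0xFF -> digits 94..189.
    return b - 0x21 if 0x21 <= b <= 0x7E else b - 0x42

def utf1_decode_one(buf):
    """Decode one code point from the start of buf; return (code_point, bytes_used)."""
    b0 = buf[0]
    if b0 < 0xA0:                        # single byte: identity
        return b0, 1
    if b0 == 0xA0:                       # two bytes: U+00A0..U+00FF
        return buf[1], 2
    if b0 <= 0xF5:                       # two bytes: U+0100..U+4015
        return 0x100 + (b0 - 0xA1) * 190 + _t_inv(buf[1]), 2
    if b0 <= 0xFB:                       # three bytes: U+4016..U+38E2D
        return (0x4016 + (b0 - 0xF6) * 190**2
                + _t_inv(buf[1]) * 190 + _t_inv(buf[2])), 3
    y = b0 - 0xFC                        # five bytes: U+38E2E..U+7FFFFFFF
    for b in buf[1:5]:
        y = y * 190 + _t_inv(b)
    return 0x38E2E + y, 5
```

Note that because trailing bytes reuse printable ASCII values, a decoder that starts reading in the middle of a sequence cannot detect that fact; this is the lack of self-synchronization noted in the introduction.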
References
- "The Unicode Standard: Appendix F FSS-UTF" (PDF). Version 1.1. Unicode, Inc.
- ISO/IEC JTC 1/SC2/WG2 (1993-01-21). "ISO IR 178: UCS Transformation Format One (UTF-1)" (PDF) (1st ed.). Registration number 178. Archived from the original on 2015-03-18.
- Czyborra, Roman (1998-11-30). "Unicode Transformation Formats: UTF-8 & Co". Archived from the original on 2016-06-07. Retrieved 2016-06-07.
- Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629.
UTF-1
Overview
Definition
UTF-1 is an obsolete variable-width encoding scheme for transforming code points defined in ISO/IEC 10646—also known as the Universal Character Set and the basis for Unicode—into sequences of bytes suitable for 8-bit systems.[4] Its purpose was to enable a compact encoding of the full repertoire of Unicode characters, representing each code point with 1 to 5 bytes depending on its value.[5] UTF-1 was designed to be compatible with ISO 2022 mechanisms: it encodes every ASCII character as its identical single byte and keeps the 66 control and space byte values (0x00–0x20, 0x7F–0x9F) out of multi-byte sequences, but it reuses the printable ASCII byte values (0x21–0x7E) inside those sequences. Multi-byte sequences employ lead bytes of 0xA0 and above to represent characters beyond the ASCII range.[6] This format was specified in Annex G of the original ISO/IEC 10646 standard published in 1993, emerging from early Unicode drafts around 1992–1993 as one of the initial proposed transformation formats for the emerging universal character encoding system. It was removed from the standard by Amendment 4 in 1996.[2]

Historical Context
In the early 1990s, the development of Unicode as a 16-bit fixed-width encoding standard, known as UCS-2, addressed the growing need for a universal character set to support multiple scripts beyond the limitations of ASCII and regional encodings.[6] However, UCS-2's two-octet structure posed challenges for integration with prevalent 8-bit byte-oriented systems and protocols, necessitating variable-length transformation formats to map the Universal Character Set (UCS) into byte streams without disrupting existing infrastructure.[7] This era saw Unicode and ISO/IEC 10646 emerging in parallel, with the latter defining UCS as a 31-bit standard, further highlighting the demand for efficient encodings that could handle the full repertoire while aligning with legacy environments.[8] Amid competing standards like ISO 2022, which used escape sequences for shifting between character sets in 7-bit and 8-bit contexts, there was a strong push for a unified transformation format compatible with ASCII-based systems to enable seamless global text interchange.[9] UTF-1 emerged as one such proposal, defined in the initial 1993 edition of ISO/IEC 10646 as a multi-byte encoding based on ISO 2022 mechanisms, allowing it to operate within escape-based frameworks for multilingual data.[9] It was designed to support the expansive UCS repertoire, including provisions for characters beyond the initial 65,536 code points (later formalized as the Basic Multilingual Plane).[7] UTF-1 formed part of a series of early UTF proposals aimed at optimizing efficiency, compatibility, and safety in diverse environments, alongside formats like UTF-7 for mail-safe 7-bit encoding and FSS-UTF for file system constraints.[7] These efforts underscored the transitional challenges of adopting a universal character set, prioritizing designs that attempted ASCII invariance—though UTF-1 protected only the control bytes and space from reuse in multi-byte sequences, allowing printable ASCII byte values to reappear there and thus falling short of full compatibility with the vast installed base of ASCII-centric software and networks.[6]

History
Development
UTF-1 was developed in the early 1990s by the ISO/IEC JTC1/SC2/WG2 committee, which was synchronizing the emerging ISO/IEC 10646 Universal Character Set (UCS) with the Unicode Standard from the Unicode Consortium, established in 1991. UTF-1 emerged as an experimental proposal during the initial standardization phase and was registered as ISO IR 178 in 1992. It was formally specified as an informative Annex G in the first edition of ISO/IEC 10646-1, published in 1993.[8][10] A core design decision for UTF-1 involved the use of modulo-190 arithmetic (derived from 256 minus 66 disallowed byte values) to transform UCS code points into sequences of safe bytes, thereby avoiding conflicts with ISO 2022 escape sequences and ensuring compatibility with legacy 8-bit systems. This arithmetic mapping reserved 66 byte values—specifically the C0 controls (0x00–0x1F), space (0x20), DEL (0x7F), and C1 controls (0x80–0x9F)—to prevent misinterpretation as control codes or formatting shifts in environments like email or terminal emulators. The approach prioritized protection of existing protocols over encoding efficiency, reflecting the era's emphasis on backward compatibility during the transition from ASCII-based systems.[10] UTF-1 was engineered to support code points up to 2^31 − 1 (U+7FFFFFFF), encompassing the complete UCS repertoire defined in early ISO 10646, which targeted a 31-bit code space for global character coverage. The encoding used variable-length sequences of 1 to 5 bytes per character, with initial refinements in the 1993 specification focusing on these control protections to facilitate integration with ISO 2022 multi-byte environments; it was registered as ISO IR 178 for invocation via the escape sequence ESC 2/5 4/2 (0x25 0x42). This capacity and these protective mechanisms positioned UTF-1 as a bridge between 7-bit ASCII and the expansive UCS, though its complexity limited practical implementation from the outset.[10][11]

Standardization and Decline
UTF-1 was initially included as an encoding form in the first edition of ISO/IEC 10646-1, published in 1993, where it was specified in Annex G as a variable-length transformation format for the Universal Character Set (UCS).[12] However, it was never fully ratified as a standard encoding beyond this provisional status in early drafts.[8] By 1996, UTF-1 was removed from ISO/IEC 10646 through Amendment 4, which deleted Annex G due to identified design flaws that rendered it unsuitable for ongoing use.[8] This removal aligned the standard with more robust alternatives, and UTF-1 received its last formal mention in the documentation for Unicode Standard Version 2.0, published that same year, where it was noted solely for historical purposes.[1] The decline of UTF-1 accelerated with the introduction of UTF-8 in 1993, designed by Ken Thompson and Rob Pike as a superior, ASCII-compatible encoding that addressed UTF-1's limitations in efficiency and synchronization.[13] UTF-8's adoption by the Internet Engineering Task Force (IETF) and integration into ISO/IEC 10646 via Amendment 2 further marginalized UTF-1, relegating it to historical status without widespread implementation.[1] Although considered in some early Unix-related efforts by X/Open around 1992, any potential use was brief and quickly abandoned following the UTF-8 proposal, as vendors shifted to the new format to avoid compatibility issues.[14]

Technical Specification
Encoding Ranges
UTF-1 divides the Unicode code point space into distinct ranges, each associated with a specific number of bytes for encoding, to support variable-length representation while aiming for compatibility with ASCII. The range from U+0000 to U+009F is encoded using a single byte, directly mapping to byte values 0x00 through 0x9F.[15] For code points in the range U+00A0 to U+4015, UTF-1 employs two-byte encodings; for instance, the no-break space at U+00A0 is represented as the byte sequence 0xA0 0xA0. Code points from U+4016 to U+38E2D require three bytes, while the remaining range from U+38E2E to U+7FFFFFFF uses five bytes to accommodate the full 31-bit code space of ISO/IEC 10646.[15] The following table summarizes the encoding ranges and byte lengths in UTF-1:

| Code Point Range | Byte Length | Example Encoding |
|---|---|---|
| U+0000–U+009F | 1 | U+007F → 0x7F |
| U+00A0–U+4015 | 2 | U+00A0 → 0xA0 0xA0 |
| U+4016–U+38E2D | 3 | U+4016 → 0xF6 0x21 0x21 |
| U+38E2E–U+7FFFFFFF | 5 | U+38E2E → 0xFC 0x21 0x21 0x21 0x21 |
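As a worked check of the two-byte formula (an illustrative computation using the offset and modulo-190 digit mapping described in this section), the comparison table earlier in this document gives U+0FFF → B5 48:

```python
# Two-byte encoding of U+0FFF (expected: B5 48).
y = 0x0FFF - 0x100                       # offset into the two-byte range
lead = 0xA1 + y // 190                   # lead byte selects a block of 190
digit = y % 190                          # modulo-190 trailing digit
# Digits 0..93 map to 0x21..0x7E; digits 94..189 map to 0xA0..0xFF.
trail = digit + 0x21 if digit < 94 else digit + 0x42
assert (lead, trail) == (0xB5, 0x48)
```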
