from Wikipedia
UTF-1
MIME / IANA: ISO-10646-UTF-1
Language: International
Current status: Obscure, of mainly historical interest.
Classification: Unicode Transformation Format, extended ASCII, variable-width encoding
Extends: US-ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Succeeded by: UTF-8

UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.

Design


Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (the code points U+0080 through U+009F are also single bytes).

UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. With these 66 protected characters, the design aimed for ISO/IEC 2022 compatibility.

UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 26 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).


UTF-1
First code point   Last code point   Byte 1   Byte 2          Byte 3          Byte 4          Byte 5
U+0000             U+009F            00–9F
U+00A0             U+00FF            A0       A0–FF
U+0100             U+4015            A1–F5    21–7E, A0–FF
U+4016             U+38E2D           F6–FB    21–7E, A0–FF    21–7E, A0–FF
U+38E2E            U+7FFFFFFF        FC–FF    21–7E, A0–FF    21–7E, A0–FF    21–7E, A0–FF    21–7E, A0–FF

Code point    UTF-8               UTF-1
U+007F        7F                  7F
U+0080        C2 80               80
U+009F        C2 9F               9F
U+00A0        C2 A0               A0 A0
U+00BF        C2 BF               A0 BF
U+00C0        C3 80               A0 C0
U+00FF        C3 BF               A0 FF
U+0100        C4 80               A1 21
U+015D        C5 9D               A1 7E
U+015E        C5 9E               A1 A0
U+01BD        C6 BD               A1 FF
U+01BE        C6 BE               A2 21
U+07FF        DF BF               AA 72
U+0800        E0 A0 80            AA 73
U+0FFF        E0 BF BF            B5 48
U+1000        E1 80 80            B5 49
U+4015        E4 80 95            F5 FF
U+4016        E4 80 96            F6 21 21
U+D7FF        ED 9F BF            F7 2F C3
U+E000        EE 80 80            F7 3A 79
U+F8FF        EF A3 BF            F7 5C 3C
U+FDD0        EF B7 90            F7 62 BA
U+FDEF        EF B7 AF            F7 62 D9
U+FEFF        EF BB BF            F7 64 4C
U+FFFD        EF BF BD            F7 65 AD
U+FFFE        EF BF BE            F7 65 AE
U+FFFF        EF BF BF            F7 65 AF
U+10000       F0 90 80 80         F7 65 B0
U+38E2D       F0 B8 B8 AD         FB FF FF
U+38E2E       F0 B8 B8 AE         FC 21 21 21 21
U+FFFFF       F3 BF BF BF         FC 21 37 B2 7A
U+100000      F4 80 80 80         FC 21 37 B2 7B
U+10FFFF      F4 8F BF BF         FC 21 39 6E 6C
U+7FFFFFFF    FD BF BF BF BF BF   FD BD 2B B9 40

Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.

from Grokipedia
UTF-1 is an obsolete transformation format for encoding the Universal Coded Character Set (UCS) of ISO/IEC 10646 into a stream of bytes. It was specified in a non-required annex of the first edition of ISO/IEC 10646-1, published in 1993, but was removed from the standard by Amendment 4 in 1996. As a variable-length encoding based on the ISO 2022 character code structure and extension techniques, UTF-1 aimed to represent the 31-bit UCS code points in a compact byte sequence suitable for transmission and storage. However, UTF-1 failed to achieve widespread adoption due to its design limitations and was quickly superseded by more efficient and practical alternatives, including UTF-8, which became the dominant encoding for UCS/Unicode. Today, UTF-1 holds only historical significance and is not implemented or used in any modern software or standards.

Overview

Definition

UTF-1 is an obsolete variable-width encoding scheme for transforming code points defined in ISO/IEC 10646—also known as the Universal Character Set and the basis for Unicode—into sequences of bytes suitable for 8-bit systems. The purpose of UTF-1 was to enable a compact encoding of the full repertoire of characters, allowing representation using 1 to 5 bytes per character depending on the value. UTF-1 was designed to be compatible with ISO 2022 mechanisms: it encodes the byte values 0x00 through 0x9F directly as the corresponding code points, but it reuses printable ASCII byte values inside multi-byte sequences while avoiding the protected byte values (0x00–0x20, 0x7F–0x9F) there. Multi-byte sequences employ higher byte values to represent characters beyond the ASCII range. This format was specified in Annex G of the original ISO/IEC 10646 standard published in 1993, emerging around 1992–1993 as one of the initial proposed transformation formats for the new universal character encoding system. It was removed from the standard by Amendment 4 in 1996.

Historical Context

In the early 1990s, the development of Unicode as a 16-bit fixed-width encoding standard, known in ISO terms as UCS-2, addressed the growing need for a universal character set to support multiple scripts beyond the limitations of ASCII and regional encodings. However, UCS-2's two-octet structure posed challenges for integration with prevalent 8-bit byte-oriented systems and protocols, necessitating variable-length transformation formats to map the Universal Character Set (UCS) into byte streams without disrupting existing infrastructure. This era saw Unicode and ISO/IEC 10646 emerging in parallel, with the latter defining UCS as a 31-bit standard, further highlighting the demand for efficient encodings that could handle the full repertoire while aligning with legacy environments. Amid competing standards like ISO 2022, which used escape sequences for shifting between character sets in 7-bit and 8-bit contexts, there was a strong push for a unified transformation format compatible with ASCII-based systems to enable seamless global text interchange. UTF-1 emerged as one such proposal, defined in the initial 1993 edition of ISO/IEC 10646 as a multi-byte encoding based on ISO 2022 mechanisms, allowing it to operate within escape-based frameworks for multilingual data. It was designed to support the expansive UCS repertoire, including provisions for characters beyond the initial 65,536 code points (later formalized as the Basic Multilingual Plane). UTF-1 formed part of a series of early UTF proposals aimed at optimizing efficiency, compatibility, and safety in diverse environments, alongside formats like UTF-7 for mail-safe 7-bit encoding and FSS-UTF for file-system safety. These efforts underscored the transitional challenges of adopting a universal character set, prioritizing designs that preserved ASCII invariance—though UTF-1 reused printable ASCII byte values inside its multi-byte sequences, falling short of full transparency with the vast base of existing ASCII-centric software and networks.

History

Development

UTF-1 was developed in the early 1990s by the ISO/IEC JTC1/SC2/WG2 committee, which was synchronizing the emerging ISO/IEC 10646 Universal Character Set (UCS) with the Unicode Standard from the Unicode Consortium, established in 1991. UTF-1 emerged as an experimental proposal during the initial standardization phase and was registered as ISO IR 178 in 1992. It was formally specified as an informative Annex G in the first edition of ISO/IEC 10646-1, published in 1993. A core design decision for UTF-1 involved the use of modulo 190 arithmetic (derived from 256 minus 66 disallowed byte values) to transform UCS code points into sequences of safe bytes, thereby avoiding conflicts with ISO 2022 escape sequences and ensuring compatibility with legacy 8-bit systems. This arithmetic mapping reserved 66 byte values—specifically the C0 controls (0x00–0x1F), space (0x20), DEL (0x7F), and the C1 controls (0x80–0x9F)—to prevent misinterpretation as control codes or formatting shifts in environments such as terminal emulators. The approach prioritized protection of existing protocols over encoding efficiency, reflecting the era's emphasis on backward compatibility during the transition from ASCII-based systems. UTF-1 was engineered to support code points up to 2³¹ − 1, encompassing the complete UCS repertoire defined in early ISO 10646, which targeted a 31-bit code space for global character coverage. The encoding used variable-length sequences of 1 to 5 bytes per character, with initial refinements in the 1993 specification focusing on these control protections to facilitate integration with ISO 2022 multi-byte environments; its ISO IR 178 registration allowed invocation via the escape sequence ESC 2/5 4/2 (ESC 0x25 0x42). This capacity and these protective mechanisms positioned UTF-1 as a bridge between 7-bit ASCII and the expansive UCS, though its complexity limited practical implementation from the outset.

Standardization and Decline

UTF-1 was initially included as an encoding form in the first edition of ISO/IEC 10646-1, published in 1993, where it was specified in Annex G as a variable-length transformation format for the Universal Character Set (UCS). However, it was never fully ratified as a standard encoding beyond this provisional status in early drafts. By 1996, UTF-1 was removed from ISO/IEC 10646 through Amendment 4, which deleted Annex G due to identified design flaws that rendered it unsuitable for ongoing use. This removal aligned the standard with more robust alternatives, and UTF-1 received its last formal mention in standards documentation published that same year, where it was noted solely for historical purposes. The decline of UTF-1 accelerated with the introduction of UTF-8, designed by Ken Thompson and Rob Pike and presented in 1993 as a superior, ASCII-compatible encoding that addressed UTF-1's limitations in efficiency and synchronization. UTF-8's adoption by the Internet Engineering Task Force (IETF) and its integration into ISO/IEC 10646 via Amendment 2 further marginalized UTF-1, relegating it to historical status without widespread implementation. Although considered in some early Unix-related efforts by X/Open around 1992, any potential use was brief and quickly abandoned following the UTF-8 proposal, as vendors shifted to the new format to avoid compatibility issues.

Technical Specification

Encoding Ranges

UTF-1 divides the Unicode space into distinct ranges, each associated with a specific number of bytes for encoding, to support variable-length representation while aiming for compatibility with ASCII. The range from U+0000 to U+009F is encoded using a single byte, directly mapping to byte values 0x00 through 0x9F. For code points in the range U+00A0 to U+4015, UTF-1 employs two-byte encodings; for instance, the no-break space at U+00A0 is represented as the byte sequence 0xA0 0xA0. Code points from U+4016 to U+38E2D require three bytes, while the remaining range from U+38E2E to U+7FFFFFFF uses five bytes to accommodate the full 31-bit code space of ISO/IEC 10646. The following table summarizes the encoding ranges and byte lengths in UTF-1:
Code point range       Byte length   Example encoding
U+0000–U+009F          1             U+007F → 0x7F
U+00A0–U+4015          2             U+00A0 → 0xA0 0xA0
U+4016–U+38E2D         3             (specific sequences via transformation)
U+38E2E–U+7FFFFFFF     5             (specific sequences via transformation)
A core aspect of UTF-1's design is the restriction of trailing bytes in multi-byte sequences to 190 values, achieved through a modulo 190 operation and a transformation function T that maps onto safe bytes in the ranges 0x21–0x7E and 0xA0–0xFF, excluding the control ranges as well as other protected bytes such as DEL (0x7F), ensuring safer transmission in certain environments. This mapping applies to trailing bytes in multi-byte sequences beyond the initial single-byte range.
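The range boundaries above determine the encoded length directly; a minimal sketch follows (the function name is illustrative).

```python
# Encoded length of a code point under UTF-1, per the range table above.
def utf1_length(cp: int) -> int:
    if cp <= 0x009F:
        return 1          # single byte, identical to the code point
    if cp <= 0x4015:
        return 2
    if cp <= 0x38E2D:
        return 3
    return 5              # U+38E2E .. U+7FFFFFFF

print(utf1_length(0x00A0), utf1_length(0x4016), utf1_length(0x10FFFF))  # 2 3 5
```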

Byte Transformation Process

The byte transformation process in UTF-1 encodes Unicode code points into variable-length byte sequences using arithmetic operations involving division and modulo by 190, combined with offsets and a transformation function T to ensure bytes fall within safe values. The function T(z), for z in 0 to 189, is defined as: if z < 94 then z + 0x21, else z + 0x42. This maps onto the 190-symbol alphabet of safe bytes (0x21–0x7E and 0xA0–0xFF), skipping control characters and protected bytes for compatibility with existing systems.

For the single-byte range (U+0000 to U+009F), the encoding is the code point value itself (0x00 to 0x9F).

For the two-byte range (U+00A0 to U+4015), there are two subranges. For U+00A0 to U+00FF, the encoding is 0xA0 followed by the low byte of the code point (0xA0 to 0xFF). For U+0100 to U+4015, let y = U − 0x0100; the lead byte is 0xA1 + ⌊y / 190⌋, and the trailing byte is T(y mod 190). For example, encoding U+00A1: since it is in U+00A0–U+00FF, it is 0xA0 0xA1. For a higher example, U+0100 (y = 0): lead 0xA1 + 0 = 0xA1, trail T(0) = 0 + 0x21 = 0x21, so 0xA1 0x21.

For the three-byte range (U+4016 to U+38E2D), let y = U − 0x4016; the lead byte is 0xF6 + ⌊y / 190²⌋, the second byte is T(⌊y / 190⌋ mod 190), and the trailing byte is T(y mod 190).

For the five-byte range (U+38E2E to U+7FFFFFFF), let y = U − 0x38E2E; the lead byte is 0xFC + ⌊y / 190⁴⌋, followed by four trailing bytes: T(⌊y / 190³⌋ mod 190), T(⌊y / 190²⌋ mod 190), T(⌊y / 190⌋ mod 190), and T(y mod 190).

Decoding reverses these operations by subtracting the offsets, applying the inverse of T (which requires identifying the byte range to compute z), and combining with multiplication by powers of 190 plus the range base to recover U.
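A compact Python sketch of this transformation is shown below; the function names are illustrative rather than a standard API, and the spot checks come from the code-point table earlier in the article.

```python
# Minimal UTF-1 encoder sketch following the ranges and the T function
# described above.

def T(z: int) -> int:
    """Map a value 0..189 onto the 190 'safe' byte values (0x21-0x7E, 0xA0-0xFF)."""
    return z + 0x21 if z < 94 else z + 0x42

def utf1_encode(cp: int) -> bytes:
    if cp <= 0x9F:                       # single byte: U+0000..U+009F
        return bytes([cp])
    if cp <= 0xFF:                       # two bytes: U+00A0..U+00FF
        return bytes([0xA0, cp])
    if cp <= 0x4015:                     # two bytes: U+0100..U+4015
        y = cp - 0x100
        return bytes([0xA1 + y // 190, T(y % 190)])
    if cp <= 0x38E2D:                    # three bytes: U+4016..U+38E2D
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2, T(y // 190 % 190), T(y % 190)])
    y = cp - 0x38E2E                     # five bytes: U+38E2E..U+7FFFFFFF
    return bytes([0xFC + y // 190**4,
                  T(y // 190**3 % 190), T(y // 190**2 % 190),
                  T(y // 190 % 190), T(y % 190)])

# Spot checks against rows of the comparison table above.
assert utf1_encode(0x4015) == bytes([0xF5, 0xFF])
assert utf1_encode(0xD7FF) == bytes([0xF7, 0x2F, 0xC3])
assert utf1_encode(0x10FFFF) == bytes([0xFC, 0x21, 0x39, 0x6E, 0x6C])
```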

Properties

Compatibility Features

UTF-1 maintains full backwards compatibility with US-ASCII by encoding characters in the range U+0000 to U+007F as single bytes identical to their ASCII values (0x00–0x7F), allowing seamless processing of basic Latin text in legacy systems without alteration. This design ensures that ASCII-only applications can interpret the initial bytes of UTF-1 streams correctly, preserving interoperability for common Western scripts. A key compatibility aspect involves protecting 66 byte values—specifically 0x00 through 0x20 (including the space character) and 0x7F through 0x9F (encompassing DEL and the C1 controls)—to prevent conflicts with ISO/IEC 2022 escape sequences and control functions. By design, UTF-1 avoids these protected bytes in all non-initial positions of multi-byte sequences, safeguarding against unintended interpretation as protocol controls in environments supporting ISO 2022-based encodings such as EUC or ISO 2022-JP. The encoding algorithm achieves this through modulo 190 arithmetic, derived from the 256 possible byte values minus the 66 protected ones, which restricts subsequent bytes to 190 safe values mapped to the ranges 0x21–0x7E (94 values aligned with the 7-bit printable characters) and 0xA0–0xFF (96 values). Because extension bytes fall only into these non-control ranges, the structural integrity of a UTF-1 stream is maintained when it passes through systems expecting ISO 2022 compliance. This protection means that non-ASCII characters are encoded in a manner resembling escape-initiated sequences without employing actual escape bytes, allowing UTF-1 data to traverse ISO 2022-aware channels without corruption or misinterpretation as controls.
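Assuming the utf1_encode sketch from the transformation section above, the protected-byte guarantee can be checked mechanically:

```python
# Verify that no protected byte (0x00-0x20, 0x7F-0x9F) appears past the
# first byte of any multi-byte UTF-1 sequence (uses the sketch above).
protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
for cp in range(0x00A0, 0x40000):                  # sample the lower planes
    assert not protected.intersection(utf1_encode(cp)[1:])
print("no protected bytes in trailing positions")
```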

Synchronization and Error Handling

UTF-1's encoding scheme lacks self-synchronization, meaning that the boundaries between character sequences cannot be reliably determined without parsing the preceding stream. This property arises from its use of variable-length byte sequences in which the trailing bytes of a multi-byte character are drawn from the printable Latin-1 range, with no distinct lead or continuation byte patterns to signal where a sequence starts or ends. As a result, an error in a single byte, such as bit corruption during transmission, can misalign the decoder's state, causing all following characters to be misinterpreted until a natural synchronization point such as a control character (e.g., newline or tab) is encountered. The absence of fixed patterns for sequence initiation exacerbates parsing: a decoder must maintain a stateful process and re-parse from the beginning of the input upon detecting an invalid sequence. In UTF-1, multi-byte encodings involve transformations based on modulo 190 arithmetic that map Unicode code points into safe byte values, but this provides no cues for resynchronization after errors. Consequently, error propagation is severe, as damaged bytes cannot easily be skipped or isolated, leading to widespread corruption in the decoded output. In noisy transmission channels, such as early network protocols or storage media prone to bit errors, UTF-1's lack of self-synchronization results in poor error recovery, where partial or invalid sequences demand full stream retransmission or manual intervention rather than local correction. This limitation stems directly from the encoding's reliance on sequential decoding without built-in markers for boundary detection, making it unsuitable for robust data interchange compared with more resilient formats.
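A small illustration of the problem, again assuming the utf1_encode sketch from above: once a byte is lost, the remaining bytes still look like perfectly valid UTF-1, so the decoder has nothing to resynchronize on.

```python
# U+0100 encodes as A1 21 (see the table above). Drop the lead byte and the
# leftover 0x21 is a valid single-byte character ('!') -- nothing flags the loss.
damaged = utf1_encode(0x0100)[1:]
assert damaged == b"\x21"            # would decode silently as U+0021 '!'

# The ASCII slash can likewise appear inside a multi-byte sequence:
assert b"/" in utf1_encode(0xD7FF)   # F7 2F C3 contains 0x2F ('/')
```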

Limitations and Abandonment

Performance Drawbacks

One major performance drawback of UTF-1 stems from its reliance on modulo 190 arithmetic to encode values beyond the initial byte range: the scheme reserves 66 byte values (0x00–0x20, 0x7F, and 0x80–0x9F) for control, formatting, and non-printing purposes, leaving 190 possible values per byte for data. This non-power-of-2 modulus requires division and modulo operations during both encoding and decoding, which were computationally expensive on hardware of the early 1990s compared with the simple bit shifts and masking used in other encodings. Decoding UTF-1 further exacerbates this inefficiency, as each byte in a multi-byte sequence necessitates division and modulo computations to reconstruct the original code point, making the process significantly slower than fixed-width alternatives like UCS-2, which rely on direct byte alignment without such arithmetic. These operations contributed to overall processing times that were notably higher in contemporary implementations, hindering adoption in performance-sensitive applications. UTF-1 also exhibits storage inefficiency for high code points, employing up to 5-byte sequences to represent values in the full 31-bit UCS space, which increases overhead for extended character sets relative to more compact variable-length schemes. Additionally, the reuse of ASCII printing characters (0x21–0x7E) within multi-byte sequences introduces conflicts in environments like Unix, where such bytes can mimic path separators (e.g., "/") or command delimiters, complicating filename handling and shell processing without additional validation. The lack of self-synchronization in UTF-1 amplifies these slowdowns, since error recovery or substring searching requires rescanning from the start of a sequence, though this is addressed in detail elsewhere.
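The contrast in arithmetic can be made concrete; the sketch below is illustrative only (not a benchmark) and simply shows that UTF-1 trailing values require division and modulo by 190, whereas UTF-8 continuation payloads fall out of shifts and masks by powers of two.

```python
cp = 0x0800  # any code point needing multiple bytes in both encodings

# UTF-1 style: repeated division/modulo by 190 (a non-power of 2)
y = cp - 0x0100
utf1_digits = [y // 190, y % 190]             # quotient/remainder, no bitwise shortcut

# UTF-8 style: payload extracted with shifts and masks
utf8_payload = [(cp >> 6) & 0x3F, cp & 0x3F]  # 6-bit groups, power-of-2 arithmetic

print(utf1_digits, utf8_payload)
```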

Reasons for Replacement

The cumulative flaws of UTF-1, including its inefficiency in encoding longer sequences with up to five bytes per character, poor searchability that necessitated full decoding for matching, and heightened error-proneness arising from its protected-byte rules and complex byte transformations, rendered it unsuitable for practical deployment. These issues compounded to make UTF-1 less reliable for text processing than emerging alternatives. A strategic shift in encoding design favored UTF-8, which was proposed in 1992 and adopted in 1993, offering superior self-synchronization to recover from byte errors without full redecoding, preservation of ASCII compatibility without reusing printing characters for non-ASCII purposes, and simpler arithmetic based on bit shifts and masking operations for variable-length encoding. This made UTF-8 more efficient and versatile for internet protocols and file systems, where rapid parsing and error resilience were critical. In 1996, UTF-1 was deprecated entirely through Amendment 4 to ISO/IEC 10646-1, which removed Annex G defining the format, citing its overall unsuitability for modern applications including network transmission and storage systems. No modern software supports UTF-1 natively; while rare legacy conversion tools persist for historical data migration, they are confined to specialized archival contexts.

Comparisons

To UTF-8

UTF-1 and UTF-8 both provide variable-width encodings for Unicode code points while maintaining compatibility with ASCII for the basic 128 characters, but they diverge in their approaches to byte efficiency and structural design. UTF-1 encodes code points using up to 5 bytes, whereas UTF-8 limits encodings to 1–4 bytes; this results in variable storage overhead for UTF-1, with 3 bytes for many supplementary characters (e.g., U+10000 to U+38E2D) versus 4 bytes in UTF-8, but 5 bytes for higher code points (e.g., U+38E2E to U+10FFFF) versus 4 bytes in UTF-8. A representative example illustrates this variability: the ASCII character U+0041 ('A') is represented as the single byte 0x41 in both formats, preserving direct compatibility. However, the emoji U+1F600 (grinning face, decimal 128512) requires 3 bytes in UTF-1 compared to 4 bytes in UTF-8, while a higher supplementary character such as U+10A000 requires 5 bytes in UTF-1 versus 4 in UTF-8, highlighting UTF-1's inefficiency in the upper ranges. In terms of synchronization, UTF-8 employs distinct bit patterns for lead bytes (0xC0–0xFD in the original design) and continuation bytes (0x80–0xBF), allowing decoders to self-synchronize by skipping invalid sequences and locating the next valid lead byte after an error. UTF-1 lacks this robust self-synchronization because its continuation bytes draw from a broader range, including values that overlap with single-byte interpretations. Both encodings map ASCII code points to their single-byte ASCII values, but UTF-8 further ensures transparency by excluding printable ASCII bytes (0x21–0x7E) from multi-byte continuation roles, avoiding misinterpretation in file systems or protocols where such bytes denote filenames or paths. UTF-1 reuses printable ASCII characters in multi-byte sequences, potentially causing compatibility issues in these environments. These choices contributed to UTF-8's efficiency and reliability, driving its universal adoption, while UTF-1's drawbacks led to its rapid abandonment in favor of UTF-8.
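These byte counts can be checked with a short sketch, reusing the utf1_length function from the encoding-ranges section and Python's built-in UTF-8 encoder.

```python
# Compare encoded lengths for the examples discussed above.
for cp in (0x0041, 0x1F600, 0x10A000):
    utf8_len = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: UTF-1 {utf1_length(cp)} bytes, UTF-8 {utf8_len} bytes")
# U+0041:   UTF-1 1 byte,  UTF-8 1 byte
# U+1F600:  UTF-1 3 bytes, UTF-8 4 bytes
# U+10A000: UTF-1 5 bytes, UTF-8 4 bytes
```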

To UCS-2 and UTF-16

UTF-1 employs a variable-width encoding scheme that represents code points using sequences of 1 to 5 bytes, allowing greater compactness for ASCII-range characters (U+0000 to U+009F), which fit in a single byte, while extending to up to 5 bytes for the full 31-bit repertoire (up to U+7FFFFFFF). In comparison, UCS-2 utilizes a fixed-width 2-byte (16-bit) format that directly maps code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), limiting it to approximately 65,536 characters with no support for supplementary planes. UTF-16 extends UCS-2 by introducing surrogate pairs, resulting in a variable-width scheme of 2 or 4 bytes per character, but it requires additional logic to handle the surrogates for code points beyond the BMP. This makes UTF-1 more space-efficient for predominantly ASCII text, as it avoids the constant 2-byte overhead of UCS-2 and the surrogate complexity of UTF-16 for basic characters, though it trades the surrogate mechanism for longer byte sequences at higher code points. The simplicity of UCS-2 lies in its straightforward 16-bit direct mapping, where each code unit corresponds one-to-one with a BMP code point without requiring computational transformations during encoding or decoding. UTF-1, however, incorporates arithmetic operations such as integer division and modulo 190 to distribute code point values across byte positions, followed by the transformation function T(z) to map remainders into the safe byte ranges (avoiding control codes). These modulo operations add decoding overhead absent in the fixed-width UCS-2 or the bit-shift-based surrogate handling in UTF-16, contributing to UTF-1's relative complexity and slower performance in software implementations. A key limitation of UCS-2 is its restriction to the BMP, necessitating the adoption of UTF-16 extensions to cover the complete code space, whereas UTF-1 natively supports the entire 31-bit range without surrogate mechanisms. Despite this, UTF-1's design suits 8-bit byte channels better than UCS-2's 16-bit unit requirement, which often demands byte-order handling and a potential byte-order mark for unambiguous transmission. However, UTF-1's decoding process is slower due to the arithmetic computations involved, contrasting with the more efficient fixed-width access of UCS-2 for BMP content.
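For a rough sense of the size trade-off on ASCII-heavy text, a sketch reusing the utf1_length function from the encoding-ranges section (the sample string is arbitrary):

```python
text = "UTF-1 is obsolete."                          # ASCII-only sample
utf1_size = sum(utf1_length(ord(c)) for c in text)   # 1 byte per ASCII character
utf16_size = len(text.encode("utf-16-le"))           # 2 bytes per BMP character, no BOM
print(utf1_size, utf16_size)                         # 18 vs 36
```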