List of binary codes
This is a list of some binary codes that are (or have been) used to represent text as a sequence of binary digits "0" and "1". Fixed-width binary codes use a set number of bits to represent each character in the text, while in variable-width binary codes, the number of bits may vary from character to character.
Five-bit binary codes
Several different five-bit codes were used for early punched tape systems.
Five bits per character only allows for 32 different characters, so many of the five-bit codes used two sets of characters per value referred to as FIGS (figures) and LTRS (letters), and reserved two characters to switch between these sets. This effectively allowed the use of 60 characters.
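The shift mechanism can be illustrated with a short Python sketch. The LTRS and FIGS code values below are the ITA2 values cited later in this article, but the two-entry character tables are hypothetical placeholders for illustration, not a real ITA2 code chart.

    # Toy LTRS/FIGS shift decoder; the character tables are made-up stand-ins.
    LTRS_SHIFT = 0b11111   # switch to the letters set
    FIGS_SHIFT = 0b11011   # switch to the figures set
    LETTERS = {0b00011: "A", 0b11001: "B"}   # hypothetical subset of the letters table
    FIGURES = {0b00011: "-", 0b11001: "?"}   # hypothetical subset of the figures table

    def decode(codes):
        table, out = LETTERS, []
        for code in codes:
            if code == LTRS_SHIFT:
                table = LETTERS          # the shift persists until overridden
            elif code == FIGS_SHIFT:
                table = FIGURES
            else:
                out.append(table.get(code, "?"))
        return "".join(out)

    print(decode([0b00011, FIGS_SHIFT, 0b00011, LTRS_SHIFT, 0b11001]))  # A-B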
Standard five-bit codes are:
- International Telegraph Alphabet No. 1 (ITA1) – Also commonly referred to as Baudot code[1]
- International Telegraph Alphabet No. 2 (ITA2) – Also commonly referred to as Murray code[1][2]
- American Teletypewriter code (USTTY) – A variant of ITA2 used in the USA[2]
- DIN 66006 – Developed for the presentation of ALGOL/ALCOR programs on paper tape and punch cards
The following early computer systems each used its own five-bit code:
- J. Lyons and Co. LEO (Lyons Electronic Office)
- English Electric DEUCE
- University of Illinois at Urbana-Champaign ILLIAC
- ZEBRA
- EMI 1100
- Ferranti Mercury, Pegasus, and Orion systems[3]
The steganographic code, commonly known as Bacon's cipher, uses groups of 5 binary-valued elements to represent letters of the alphabet.
Six-bit binary codes
Six bits per character allows 64 distinct characters to be represented.
Examples of six-bit binary codes are:
- International Telegraph Alphabet No. 4 (ITA4)[4]
- Six-bit BCD (Binary Coded Decimal), used by early mainframe computers.
- Six-bit ASCII – A subset of the original seven-bit ASCII
- Braille – Braille characters are represented using six dot positions, arranged in a rectangle. Each position may contain a raised dot or not, so Braille can be considered to be a six-bit binary code.
See also: Six-bit character codes
Seven-bit binary codes
Examples of seven-bit binary codes are:
- International Telegraph Alphabet No. 3 (ITA3) – derived from the Moore ARQ code, and also known as the RCA
- ASCII – The ubiquitous ASCII code was originally defined as a seven-bit character set. The ASCII article provides a detailed set of equivalent standards and variants. In addition, there are various extensions of ASCII to eight bits (see Eight-bit binary codes)
- CCIR 476 – Extends ITA2 from 5 to 7 bits, using the extra 2 bits as check digits[4]
- International Telegraph Alphabet No. 4 (ITA4)[4]
Eight-bit binary codes
- Extended ASCII – A number of standards extend ASCII to eight bits by adding a further 128 characters, such as ISO/IEC 8859 and Windows-1252.
- EBCDIC – Used in early IBM computers and current IBM i and System z systems.
10-bit binary codes
- AUTOSPEC – Also known as Bauer code. AUTOSPEC repeats a five-bit character twice, but if the character has odd parity, the repetition is inverted.[4]
- Decabit – A datagram of electronic pulses commonly transmitted through power lines. Decabit is mainly used in Germany and other European countries.
16-bit binary codes
- UCS-2 – An obsolete encoding capable of representing the Basic Multilingual Plane of Unicode
32-bit binary codes
- UTF-32/UCS-4 – A four-bytes-per-character representation of Unicode.
Variable-length binary codes
- UTF-8 – Encodes characters in a way that is backward compatible with ASCII but can also encode the full repertoire of Unicode characters with sequences of up to four 8-bit bytes.
- UTF-16 – Extends UCS-2 to cover the whole of Unicode with sequences of one or two 16-bit elements
- GB 18030 – A full-Unicode variable-length code designed for compatibility with older Chinese multibyte encodings
- Huffman coding – A technique for expressing more common characters using shorter bit strings than are used for less common characters
Data compression systems such as Lempel–Ziv–Welch can compress arbitrary binary data. They are therefore not binary codes themselves but may be applied to binary codes to reduce storage needs.
Other
- Morse code is a variable-length telegraphy code, which traditionally uses a series of long and short pulses to encode characters. It relies on gaps between the pulses to provide separation between letters and words, as the letter codes do not have the "prefix property". This means that Morse code is not necessarily a binary system, but in a sense may be a ternary system, with a 10 for a "dit" or a "dot", a 1110 for a dash, and a 00 for a single unit of separation. Morse code can be represented as a binary stream by allowing each bit to represent one unit of time. Thus a "dit" or "dot" is represented as a 1 bit, while a "dah" or "dash" is represented as three consecutive 1 bits. Spaces between symbols, letters, and words are represented as one, three, or seven consecutive 0 bits. For example, "NO U" in Morse code is "—. ——— ..—", which could be represented in binary as "1110100011101110111000000010101110". If, however, Morse code is represented as a ternary system, "NO U" would be represented as "1110|10|00|1110|1110|1110|00|00|00|10|10|1110".
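The binary-stream view described above can be reproduced with a brief sketch. It uses the units from the ternary description ("10" for a dit, "1110" for a dah, an extra "00" between letters and "000000" between words, which yields the three- and seven-zero gaps once each element's own trailing zero is counted); the small Morse table covers only the letters needed for this example.

    # Build the timed bit stream for "NO U" from the unit conventions above.
    MORSE = {"N": "-.", "O": "---", "U": "..-"}   # only the letters used here
    UNIT = {".": "10", "-": "1110"}               # element plus its trailing space unit

    def to_bits(text):
        words = []
        for word in text.split(" "):
            letters = ["".join(UNIT[e] for e in MORSE[ch]) for ch in word]
            words.append("00".join(letters))      # two extra zeros -> three-unit letter gap
        return "000000".join(words)               # six extra zeros -> seven-unit word gap

    print(to_bits("NO U"))  # 1110100011101110111000000010101110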
References
[edit]- ^ a b Alan G. Hobbs (1999-03-05). "Five-unit codes". NADCOMM Museum. Archived from the original on 1999-11-04.
- ^ a b Gil Smith (2001). "Teletypewriter Communication Codes" (PDF).
- ^ "Paper Tape Readers & Punches". The Ferranti Orion Web Site. Archived from the original on 2011-07-21.
- ^ a b c d "Telecipher Devices". John Savard's Home Page.
Fundamentals of Binary Codes
Definition and Classification
Binary codes are numerical representations of information using sequences of binary digits, known as bits, where each bit is either 0 or 1, serving as the fundamental unit of data in digital systems.[8] These codes encode characters, numbers, symbols, or instructions to facilitate storage, processing, and transmission in computers and other digital devices, aligning with the binary states of electronic circuits such as on and off.[9] In essence, binary codes translate abstract data into a machine-readable format that underpins all modern computing operations.[8]

Binary codes are broadly classified into fixed-length and variable-length types based on the number of bits assigned to each symbol. Fixed-length codes allocate a constant number of bits to every symbol, such as 7 or 8 bits per character in the ASCII standard, enabling straightforward encoding and decoding.[9] Variable-length codes, in contrast, assign varying bit lengths to symbols, typically shorter codes to more frequent symbols for greater efficiency, as seen in Huffman coding or Morse code representations.[10] Fixed-length codes offer advantages in simplicity, as both encoders and decoders know the exact bit count per symbol in advance, facilitating easy alignment, parallel transmission, and reduced risk of synchronization errors without needing additional separators.[10] However, they can waste space for infrequently used symbols, as all symbols receive equal bit allocation regardless of usage frequency, leading to inefficiency in data compression.[11] Variable-length codes address this by optimizing average code length but introduce complexity in decoding, requiring mechanisms like prefix properties or stop bits to identify symbol boundaries and avoid framing errors.[10]

Common bit widths for binary codes range from 4 bits, used for basic hexadecimal representations, to 32 bits for more complex data types in early computing architectures.[9] In applications like telegraphy, 5-bit fixed-length codes such as Baudot enabled efficient transmission of text over limited bandwidth, while variable-length schemes like Morse code (up to 4-5 bits per character) minimized signaling time.[12] These widths laid the groundwork for digital communication and computing systems, balancing simplicity with the need for representing diverse symbols.[12]
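The trade-off can be made concrete with a small sketch. The four-symbol alphabet, the skewed sample message, and the specific codewords below are illustrative assumptions rather than anything drawn from the cited sources; the point is only that a prefix-free variable-length code shortens skewed data while remaining uniquely decodable.

    # Fixed-length 3-bit code versus a toy variable-length prefix code.
    fixed  = {"e": "000", "t": "001", "a": "010", "z": "011"}
    prefix = {"e": "0",   "t": "10",  "a": "110", "z": "111"}  # no codeword prefixes another

    msg = "eeeetetaeez"                       # 'e' is far more frequent than the rest
    fixed_bits  = "".join(fixed[c]  for c in msg)
    prefix_bits = "".join(prefix[c] for c in msg)
    print(len(fixed_bits), len(prefix_bits))  # 33 vs 17 bits on this skewed message

    def decode(bits, code):
        # Prefix property: emit a symbol as soon as the accumulated bits match a codeword.
        rev, out, cur = {v: k for k, v in code.items()}, [], ""
        for b in bits:
            cur += b
            if cur in rev:
                out.append(rev[cur])
                cur = ""
        return "".join(out)

    assert decode(prefix_bits, prefix) == msg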
Historical Evolution
The development of binary codes traces its roots to 19th-century telegraphy, where Samuel Morse introduced Morse code in the 1830s as a variable-length binary signaling system using dots and dashes to represent letters and numbers for electrical transmission over wires.[13] This precursor to fixed binary encodings enabled efficient long-distance communication but lacked standardization for mechanical processing. In the 1870s, Émile Baudot advanced the field with his 5-bit code for telegraphic multiplexing, assigning fixed combinations of five binary units (mark or space) to characters, which supported simultaneous transmission of multiple messages and marked the shift toward compact, machine-readable formats.[14]

The transition to computing in the late 19th and early 20th centuries built on these foundations through Herman Hollerith's punched cards for the 1890 U.S. Census, which used 12 binary positions (hole or no hole) per column to encode demographic data, enabling automated tabulation and processing of over 60 million cards.[15] By the 1940s, early electronic computers like ENIAC (completed in 1945) used decimal representations with ring counters for numerical handling. Subsequent machines, such as EDVAC, adopted binary representations, with many employing 6- to 7-bit codes derived from binary-coded decimal (BCD) for character and numerical handling, facilitating programmable operations in scientific and military applications.[16]

Post-World War II efforts focused on standardization to support interoperable systems. In 1963, the American Standards Association published ASCII (ASA X3.4-1963) as a 7-bit code encoding 128 characters, including letters, digits, and controls, to unify data exchange across teleprinters, computers, and networks.[17] Concurrently, IBM developed the 8-bit EBCDIC in the 1960s for its System/360 mainframes, extending BCD principles to 256 characters while prioritizing punch-card compatibility and internal efficiency.[18]

In the modern era, the limitations of fixed-length codes for global languages prompted the Unicode Consortium's formation in 1991, introducing a universal standard with variable- and fixed-width encodings up to 21 bits per character to encompass over 140,000 symbols across scripts.[19] This evolved into UTF-8 in 1992, a backward-compatible variable-length encoding using 1 to 4 bytes, designed by Ken Thompson and Rob Pike to optimize ASCII storage while supporting multilingual text.[20] Key events included ARPANET's adoption of ASCII in the early 1970s for protocols like Telnet, enabling standardized text transmission across its growing nodes.[21] Meanwhile, EBCDIC persisted in IBM mainframes through the 1970s and beyond, sustaining legacy applications in enterprise computing despite the rise of ASCII-based systems.[22]

Short Fixed-Length Binary Codes
Five-Bit Binary Codes
Five-bit binary codes represent one of the earliest forms of structured binary encoding, utilizing 32 possible combinations (2^5) to encode a limited set of characters, primarily for alphanumeric representation in telegraphy and steganography. These codes emerged in the 17th century as conceptual tools and evolved into practical systems by the late 19th century, enabling efficient transmission over wire-based networks despite the constraints of only 32 distinct symbols, which necessitated mechanisms like shifting to access additional characters.[23]

An early precursor to modern binary codes is Francis Bacon's bilateral cipher, introduced in his 1623 work De Augmentis Scientiarum, which employed a 5-bit binary system for steganographic purposes. In this method, each letter of the Roman alphabet (excluding J and V, with I/J and U/V combined to fit 24 symbols) is assigned a unique 5-bit sequence using 'A' for 0 and 'B' for 1, hidden within innocuous text by varying typefaces or patterns to conceal the message. For example, 'A' is encoded as AAAAA (00000 in binary), while 'B' is AAAAB (00001), allowing the secret to blend seamlessly with carrier text for covert communication. This approach marked an innovative use of binary principles in cryptography long before electrical transmission.[24]

The practical application of 5-bit codes in communication began with Émile Baudot's printing telegraph system, patented in 1874 and detailed in 1877 publications, which introduced uniform-length binary sequences for letters, figures, and controls over telegraph lines. Standardized as International Telegraph Alphabet No. 1 (ITA1) in 1929 by the International Telegraph Union (now ITU), it supported 32 symbols including 26 uppercase letters, basic punctuation, and shift controls, with four positions reserved for national variants to accommodate limited non-Latin characters. This was superseded by International Telegraph Alphabet No. 2 (ITA2), also known as the Baudot-Murray code, adopted internationally in 1931 following Donald Murray's 1901 improvements to Baudot's design, which rearranged codes for more efficient printing on paper tape.[23][25]

ITA2's structure relies on a 5-bit serial transmission, where shift mechanisms double the effective character set beyond 32 symbols: the Letters shift (LTRS, binary 11111) selects alphabetic mode for the subsequent characters until overridden, while the Figures shift (FIGS, binary 11011) switches to numerals, punctuation, and symbols. Controls such as Null (NUL, 00000), Line Feed (LF, 11110), and Carriage Return (CR, 01011) occupy dedicated codes, ensuring compatibility with early teletypewriters and allowing transmission speeds up to 100 baud on shared lines. A representative encoding in ITA2 letters mode assigns 'A' the binary sequence 00011, transmitted as five sequential marks or spaces on a 5-unit tape or wire signal.[25]

In the mid-20th century, 5-bit codes like ITA2 were adapted for early computing peripherals, particularly punch tape systems that stored and input data for machines such as the UNIVAC series in the 1950s, where 11/16-inch tape accommodated five-hole patterns for the 32 combinations. This limited encoding supported only the basic Latin alphabet and numerals, restricting applications to text-based input/output without diacritics or extended symbols, and was common in systems like the IAS computer family for interfacing with teleprinters. The inherent 32-symbol constraint highlighted the need for more bits in later codes but provided a reliable medium for binary data transfer in resource-limited environments.[26][27]

Telegraph transmissions using these 5-bit codes faced notable error rates due to noise on long-distance lines, with historical measurements on circuits up to 100 baud showing character error probabilities often exceeding 1 in 1,000 under adverse conditions, necessitating parity checks or manual retransmissions for accuracy. This vulnerability underscored the codes' legacy in prompting advancements in error detection, while their simplicity facilitated widespread adoption in global telegraph networks until the 1960s.[28][29]
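The Baconian encoding described at the start of this subsection can be sketched in a few lines of Python; this assumes the classic 24-letter form with I/J and U/V merged, as stated above.

    # Bacon's bilateral cipher: letter index -> five binary digits -> A/B sequence.
    ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"   # 24 letters, J and V folded into I and U

    def bacon_encode(ch):
        ch = ch.upper().replace("J", "I").replace("V", "U")
        bits = format(ALPHABET.index(ch), "05b")       # e.g. 'A' -> 00000
        return bits.replace("0", "A").replace("1", "B")

    print(bacon_encode("A"), bacon_encode("B"))  # AAAAA AAAAB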
Six-Bit Binary Codes
Six-bit binary codes emerged in the mid-20th century as an advancement over five-bit encodings, providing 64 possible character combinations to support expanded alphanumeric representations in computing and communication systems. These codes facilitated the inclusion of uppercase letters, digits, and a range of symbols without relying on shift mechanisms, which were common limitations in earlier five-bit telegraph codes like the International Telegraph Alphabet No. 2. This increased capacity proved essential for handling more complex data in military and early commercial applications, where efficiency in transmission and storage was paramount.[30][31]

The Fieldata code, developed by the U.S. Army Signal Corps in the late 1950s, exemplifies early six-bit encodings designed for military use. It encoded 64 characters, encompassing uppercase letters, digits 0-9, and various punctuation and mathematical symbols, to standardize data collection and transmission in secure communications. Fieldata was integral to projects like the MOBIDIC (MObile DIgital Computer), a portable system deployed in the 1950s for field army operations, including the processing and relay of battlefield intelligence. Its adoption across compatible hardware from multiple manufacturers ensured interoperability in data links, supporting applications such as radar data transmission where rapid, error-resistant encoding was critical.[30][32]

In the 1960s, variants like Transcode extended six-bit principles for teletypewriter and remote data entry systems, incorporating additional control characters for efficient message handling. Transcode, utilized in IBM's communication protocols such as those for the 2780 Remote Job Entry station, optimized transmission by reducing overhead compared to longer codes, transmitting alphanumeric data with 26 uppercase letters, 10 digits, and 28 symbols plus controls. Similarly, Control Data Corporation's (CDC) six-bit display code, introduced with the CDC 6000 series in 1964, served as a Transcode-like variant for peripherals and teletype interfaces in CDC computers, enabling direct solenoid control in print heads for six-bit alphabets. These codes prioritized compactness, allowing six-bit words to represent full characters in early peripherals like punch card readers and teleprinters.[33][18]

IBM's early six-bit variants, known as Binary Coded Decimal Interchange Code (BCDIC), were tailored for punch card systems before the transition to eight-bit standards in the late 1960s. Employed in machines like the IBM 1401 from 1959, BCDIC used six bits per character to encode zones and numeric values, supporting business data processing with uppercase letters, digits, and basic symbols. For instance, the letter 'A' was encoded as 100001 in binary, corresponding to the B (zone) and 1 (numeric) bits in the 1401's bit weighting scheme. This format bridged punched card mechanics—where 12-row columns mapped to six bits—with binary computation, enhancing throughput in early peripherals without the need for multi-column representations. The advantages of six-bit codes over five-bit predecessors were evident in these applications: they doubled the symbol set to 64, accommodating radar telemetry and peripheral I/O for military logistics and commercial tabulation without introducing complex shifting, thereby improving data density and operational speed.[34][35][32]

Seven-Bit Binary Codes
Seven-bit binary codes utilize seven bits to represent 128 unique combinations (2^7), serving as a foundational standard for encoding text characters and control signals in early computing and data transmission systems.[36] These codes emerged as an advancement over earlier six-bit encodings used in systems like IBM's BCD, providing sufficient capacity for basic alphanumeric and punctuation sets while conserving bandwidth in telegraphic and computing applications.[17]

The American Standard Code for Information Interchange (ASCII), first published in 1963 by the X3.2 subcommittee of the American Standards Association (ASA), standardized 128 codes for the English alphabet, digits, punctuation, and special symbols.[17] It allocates codes 0–31 for non-printable control characters (such as NULL, BEL, and ESC), 33–126 for 94 printable characters (including uppercase and lowercase letters, digits 0–9, and symbols like ! and @), and 127 for the delete (DEL) character.[37] This structure ensured interoperability among diverse hardware from manufacturers like IBM, Honeywell, and RCA, facilitating data exchange in telecommunications and computing.[17]

In 1967, the International Organization for Standardization (ISO) introduced Recommendation 646 (later formalized as ISO/IEC 646) as an international adaptation of ASCII, maintaining core compatibility while permitting national variants to replace certain invariant symbols with locale-specific characters.[38] For instance, the British variant (ISO-IR-4) substitutes the pound sterling symbol (£) for the number sign (#) at code 35, and the backslash (\) for the vertical bar (|) at code 124, allowing better support for regional currencies and typography without altering the fundamental 128-code framework.[39] These variants, such as ISO-IR-6 for French (replacing certain symbols with è, à, and û) and ISO-IR-8 for German (with § and ß), numbered over 20 by the 1980s, promoting global standardization while accommodating linguistic diversity.[39]

In terms of encoding, seven-bit codes assign each character a unique binary sequence across bit positions b6 (most significant) to b0 (least significant), with no inherent parity but often paired with an eighth bit for even or odd parity in transmission to detect single-bit errors.[40] A representative example is the uppercase letter 'A', encoded as 1000001 in binary (bit positions b6=1, b5=0, ..., b0=1), which corresponds to decimal value 65.[37] This binary-to-decimal mapping enabled straightforward implementation in hardware like teleprinters and early computers.
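The parity arrangement mentioned above can be sketched as follows; placing the parity bit ahead of the seven data bits is an assumption made only for illustration, since real systems differed in both bit position and parity sense.

    # 7-bit ASCII code plus an even-parity check bit.
    def ascii7_with_even_parity(ch):
        code = ord(ch)
        assert code < 128, "7-bit ASCII only"
        bits7 = format(code, "07b")                  # e.g. 'A' -> 1000001 (decimal 65)
        parity = str(bits7.count("1") % 2)           # make the total count of 1 bits even
        return parity + bits7

    print(ascii7_with_even_parity("A"))  # 01000001: 'A' already has an even number of 1 bits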
The enduring legacy of seven-bit codes lies in their role as the bedrock for protocols in email (e.g., SMTP) and the internet, where ASCII ensures reliable text handling in headers and basic content, paving the way for subsequent eight-bit extensions in systems requiring multilingual support.[41]
Medium Fixed-Length Binary Codes
Eight-Bit Binary Codes
Eight-bit binary codes represent an expansion of the 7-bit ASCII framework, utilizing all 256 possible combinations to encode extended character sets including accented letters, symbols, and control characters for diverse linguistic needs.[42]

The Extended Binary Coded Decimal Interchange Code (EBCDIC), developed by IBM in 1963–1964 and announced alongside the System/360 mainframe computers, stands as one of the pioneering 8-bit encoding schemes.[42] EBCDIC organizes its 256 code points into distinct zones for numeric digits (hex F0–F9), uppercase letters (hex C1–C9, D1–D9, and E2–E9), lowercase letters (hex 81–89, 91–99, and A2–A9), and punctuation, featuring a non-contiguous alphabetic ordering that evolved from earlier 6-bit BCDIC systems.[42][43] For instance, the uppercase 'A' is assigned the binary code 11000001 (hex C1), separating letters from numerals to facilitate legacy punched-card compatibility.[43] IBM produced over 57 national variants of EBCDIC to accommodate regional scripts, though its irregular layout posed challenges for interoperability with ASCII-based systems.[42]

In response to the need for international standardization, the ISO/IEC 8859 series emerged starting in 1987, providing 8-bit extensions to ASCII by reserving the upper 128 code points (hex 80–FF) for language-specific characters while preserving the lower 128 for basic ASCII compatibility.[42][44] The inaugural part, ISO/IEC 8859-1 (Latin-1), targets Western European languages and incorporates diacritics such as á, ç, and ñ, enabling support for up to 191 graphic characters across French, German, Spanish, and similar tongues.[42][44] Subsequent parts in the series, like 8859-2 for Central European languages, followed this model, promoting uniform data exchange in computing and telecommunications.[44]

Microsoft's Windows-1252, introduced in the early 1990s as part of its code page system, serves as a proprietary extension of ISO/IEC 8859-1, filling undefined gaps in Latin-1 with additional printable symbols such as curly quotes, em dashes, and the euro sign (€ at hex 80).[45] This encoding became the default for Windows applications and web content in Western locales until the widespread adoption of UTF-8 in the late 2000s, supporting legacy text processing in environments like email and HTML.[45]

These 8-bit codes found primary applications in mainframe computing (EBCDIC on IBM systems), early personal computers, and internationalized software, where their 256 slots facilitated the inclusion of diacritics, line-drawing graphics, and box characters essential for regional text rendering and terminal displays.[42] Despite their limitations in handling non-Latin scripts, they bridged the gap between 7-bit ASCII and modern universal encodings, influencing data interchange in legacy systems through the 1990s.[45]
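The differences between these 8-bit encodings can be observed with Python's standard codecs; 'cp037' is used here as one common EBCDIC code page, an assumption made only to have a concrete variant to query.

    # Comparing byte values across 8-bit encodings.
    print("A".encode("cp037").hex())     # c1 -> EBCDIC 'A' = 0xC1, as noted above
    print("é".encode("latin-1").hex())   # e9 -> ISO/IEC 8859-1
    print("€".encode("cp1252").hex())    # 80 -> Windows-1252 places the euro sign at 0x80
    # "€".encode("latin-1") raises UnicodeEncodeError: Latin-1 has no euro sign.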
Ten-Bit Binary Codes
Ten-bit binary codes emerged as a specialized extension in mid-20th-century data transmission, bridging the gap between shorter fixed-length codes and emerging standards by providing enhanced redundancy for reliability in noisy channels. These codes were particularly suited to radioteleprinter systems, where the additional bits enabled error detection without resorting to more complex variable-length schemes. Unlike the 32-symbol capacity of 5-bit International Telegraph Alphabet No. 2 (ITA2) codes or the 128 symbols of 7-bit ASCII, ten-bit formats allowed for structured repetition to support industrial applications requiring robust communication over long distances.[46]

A key example is the Bauer code, developed in the 1960s and also referred to as AUTOSPEC or Autospec-Bauer. This synchronous code was classified under the U10 family of ten-unit radioteleprinter systems by the International Radio Consultative Committee (CCIR) in 1966, emphasizing its role in single-channel data transmission for automatic reception testing. It was primarily used in high-frequency (HF) radio links, operating at baud rates such as 62.3, 68.5, 102.7, or 137 Bd with frequency-shift keying (FSK) modulation and a 270 Hz shift.[46][47]

The structure of the Bauer code consists of 10 bits per character: the first five bits encode an ITA2 symbol, while the second five bits repeat this sequence, inverted if the original has odd parity to enable forward error correction. For instance, an even-parity ITA2 character like the letter "A" (binary 11000) would be transmitted as 11000 followed by 11000; an odd-parity character like "B" (binary 10011) becomes 10011 followed by 01100 (inverted). This repetition-based design improved transmission integrity in industrial settings without introducing variable complexity.[48]

In practice, AUTOSPEC facilitated industrial data transmission, notably by British coastal stations communicating with North Sea oil rigs starting in the late 1960s, where reliable messaging was critical for operational specifications and coordination. The code's fixed 10-bit length offered advantages over 8-bit predecessors by incorporating built-in redundancy for error-prone environments like offshore radio paths, achieving higher effective density for specialized symbols. However, its adoption remained limited due to the widespread shift toward 7- and 8-bit standards like ASCII in the 1970s, which better supported general computing and international interoperability.[47]
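The repetition rule described above is easy to express as a sketch; it simply mirrors the worked "A" and "B" examples from this subsection.

    # AUTOSPEC/Bauer framing: five ITA2 bits, then the same five bits,
    # repeated unchanged for even parity or inverted for odd parity.
    def autospec(bits5):
        assert len(bits5) == 5 and set(bits5) <= {"0", "1"}
        odd = bits5.count("1") % 2 == 1
        second = "".join("1" if b == "0" else "0" for b in bits5) if odd else bits5
        return bits5 + second

    print(autospec("11000"))  # 1100011000 (even parity: repeated as-is)
    print(autospec("10011"))  # 1001101100 (odd parity: second half inverted)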
Long Fixed-Length Binary Codes
Sixteen-Bit Binary Codes
Sixteen-bit binary codes emerged to overcome the constraints of eight-bit encodings, which could only represent 256 characters and struggled with diverse non-Latin scripts requiring thousands of glyphs.[49] These 16-bit schemes expanded capacity to 65,536 code points, enabling broader multilingual support in computing systems.

The Universal Character Set-2 (UCS-2), formalized in 1993 under ISO/IEC 10646-1, provides a straightforward fixed-length 16-bit encoding for the initial 65,536 code points of the Universal Character Set. Each character is represented by two bytes, with options for big-endian or little-endian byte serialization to ensure interoperability across platforms.

UTF-16, introduced in 1996 as part of Unicode 2.0, builds on UCS-2 by adding surrogate pairs to access the complete Unicode space of 1,114,112 code points. It encodes most characters in a single 16-bit unit but uses two 16-bit units (32 bits total) for others via high surrogates (U+D800 to U+DBFF) paired with low surrogates (U+DC00 to U+DFFF), allowing representation of characters in supplementary planes without altering legacy UCS-2 processing for the Basic Multilingual Plane.[50] This mechanism ensures backward compatibility while extending coverage.[50]

UCS-2 and UTF-16 found prominent applications in the 1990s Windows NT kernel for internal text handling and in Java's string implementations, where UTF-16 serves as the native format for efficient manipulation of international text.[51]
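The surrogate-pair arithmetic works as sketched below; U+1F600 is used purely as an example of a supplementary-plane code point.

    # Split a code point above U+FFFF into a UTF-16 high/low surrogate pair.
    def utf16_surrogates(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        offset = cp - 0x10000                 # 20-bit offset into the supplementary planes
        high = 0xD800 + (offset >> 10)        # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
        return high, low

    print([hex(u) for u in utf16_surrogates(0x1F600)])  # ['0xd83d', '0xde00']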
Thirty-Two-Bit Binary Codes
Thirty-two-bit binary codes represent a category of fixed-length encodings capable of addressing vast character spaces, particularly suited for comprehensive coverage of the Unicode standard without the need for surrogate pairs required in shorter formats. These codes allocate 32 bits (4 bytes) per character, enabling direct mapping of any Unicode code point to a single code unit and simplifying internal processing tasks. Unlike 16-bit encodings that pair units for extended ranges, 32-bit codes offer straightforward indexing and uniform memory usage, though at the cost of higher storage demands.[52]

The primary example is UTF-32, a Unicode encoding form that directly encodes all Unicode scalar values from U+0000 to U+10FFFF using a single 32-bit unsigned integer per code point, formalized in Unicode Technical Report #19 and incorporated into the Unicode Standard starting with version 3.1 in 2001.[52] This approach ensures a one-to-one correspondence between code points and code units, making it ideal for applications requiring fixed-width access, such as string manipulation in programming languages. However, its fixed 4-byte length per character results in significant memory overhead, approximately four times that of ASCII for Latin text, limiting its use primarily to in-memory representations rather than storage or transmission.[49] As of Unicode 17.0 (released September 2025), the standard continues to expand within the defined code space.

In the 1980s, ISO working groups began developing a 32-bit character code as an extension of national and regional standards, predating the formalization of UCS-4, the 31-bit (padded to 32-bit) precursor defined in ISO/IEC 10646-1:1993.[53][54]

In practice, UTF-32 finds application in internal software processing, such as XML parsers that convert variable-length inputs to fixed-width formats for efficient parsing and validation.[55] Endianness poses a key consideration, with UTF-32BE (big-endian) placing the most significant byte first and UTF-32LE (little-endian) reversing this order; a byte order mark (BOM, U+FEFF) is often prepended to resolve ambiguity in cross-platform environments.[49] Languages like Python 3 and C (via char32_t) leverage UTF-32 internally for Unicode string operations, prioritizing simplicity over compactness.[51]
The theoretical capacity of a 32-bit code is 4,294,967,296 distinct values (2^32), far exceeding Unicode's current limit of 1,114,112 possible code points. For example, the Latin capital 'A' (U+0041) encodes as the 32-bit value 0x00000041, or in binary:

00000000 00000000 00000000 01000001
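The byte-order options can be seen with Python's built-in codecs, reusing the 'A' example above; the BOM-first output of the plain "utf-32" codec shown in the last comment assumes a little-endian platform.

    # UTF-32 byte order for 'A' (U+0041).
    print("A".encode("utf-32-be").hex())  # 00000041 (big-endian, no BOM)
    print("A".encode("utf-32-le").hex())  # 41000000 (little-endian, no BOM)
    print("A".encode("utf-32").hex())     # fffe000041000000 on little-endian machines: BOM first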
Variable-Length Binary Codes
Prefix and Huffman Codes
Prefix codes are a class of variable-length binary codes designed for instantaneous decodability, meaning that no codeword is a prefix of another codeword in the set. This property allows the decoder to identify the end of a codeword as soon as it is fully received, without needing to look ahead at subsequent bits. The validity of such codes is governed by the Kraft inequality, which states that for a prefix code with codeword lengths l_i (where i indexes the symbols), the sum of 2^(-l_i) over all symbols must be at most 1; this condition ensures that the code can be represented by a binary tree without overlaps.[56]

Huffman coding, introduced by David A. Huffman in 1952, provides an optimal method for constructing prefix codes that minimize the average codeword length for a given symbol probability distribution. The algorithm builds a binary tree by starting with leaf nodes for each symbol weighted by its frequency, then repeatedly merging the two nodes with the lowest frequencies into a parent node until a single root remains; bits 0 and 1 are assigned to the branches leading to the children. This greedy approach yields codewords whose lengths shrink as symbol probabilities grow, approximately -log2 of each symbol's probability, achieving the theoretical minimum average length bounded by the entropy. The resulting code is uniquely decodable and satisfies the Kraft inequality with equality for complete trees.[57]

For example, in encoding English text using letter frequencies, Huffman coding typically assigns shorter codes to common letters like 'e' (appearing about 12.7% of the time) and longer codes to rare letters such as 'z' or 'q', achieving an average of approximately 4.3 bits per letter compared to 5 bits in fixed-length coding, reducing redundancy by closely approximating the source entropy.[58] Arithmetic coding serves as a related variant that encodes an entire sequence into a single fractional number between 0 and 1, allowing even finer granularity than integer-bit prefix codes like Huffman.[59]

Huffman codes find widespread application in lossless data compression, such as in the DEFLATE algorithm used by ZIP files, where dynamic Huffman trees encode literals and distances to achieve efficient reduction in file sizes. They are also integral to fax transmission standards, like ITU-T T.4 Group 3, which employs fixed Huffman tables for run-length encoding of black-and-white images to minimize bandwidth over telephone lines. Unlike fixed-length codes, which waste bits on infrequent symbols, prefix codes like Huffman adapt to source statistics for substantial savings in storage and transmission.
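The merge procedure described above fits in a short sketch; the symbol frequencies are made-up values chosen only to show that frequent symbols end up with shorter codewords.

    import heapq

    # Huffman construction: repeatedly merge the two lowest-weight nodes.
    def huffman(freqs):
        # Heap entries: (weight, tie-breaker, {symbol: codeword-so-far}).
        heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (w1 + w2, count, merged))
            count += 1
        return heap[0][2]

    codes = huffman({"e": 0.127, "t": 0.091, "a": 0.082, "z": 0.001})
    print(codes)  # the most frequent symbol 'e' receives the shortest codeword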
Unicode Variable Encodings
Unicode variable encodings are standardized methods for representing the Unicode character set using variable-length binary sequences, enabling efficient storage and transmission of multilingual text while optimizing for common scripts like ASCII. These encodings prioritize backward compatibility, security, and compactness, with byte structures that allow decoders to determine sequence lengths from lead bytes. They form the backbone of global text handling in computing, particularly on the web and in internationalized software.

UTF-8, devised by Ken Thompson in September 1992 at Bell Labs, encodes Unicode code points using 1 to 4 bytes per character, with the number of bytes determined by the code point's value.[60] It maintains full ASCII compatibility by using a single 8-bit byte (ranging from 0x00 to 0x7F) for the first 128 Unicode code points (U+0000 to U+007F), ensuring seamless integration with legacy systems.[60] Subsequent bytes in multi-byte sequences are continuation bytes, always starting with the bit pattern 10xxxxxx to distinguish them from lead bytes. For example, the Euro sign (€, U+20AC) is encoded as the three-byte sequence 11100010 10000010 10101100 (hexadecimal E2 82 AC).[60]
UTF-8 decoding relies on the lead byte to specify the sequence length: bytes starting with 0xxxxxxx are single-byte (1 byte total), 110xxxxx indicate two bytes, 1110xxxx three bytes, and 11110xxx four bytes, followed by the appropriate number of continuation bytes.[60] To enhance security, overlong encodings—where a code point uses more bytes than necessary, such as representing U+0000 as a two-byte sequence—are strictly forbidden, preventing potential exploits like buffer overflows or canonicalization attacks in string processing.[60] By 2025, UTF-8 dominates web content, used by 98.8% of websites with known character encodings due to its efficiency and universal support.[61]
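The byte layout above translates directly into a shortest-form encoder sketch (so overlong sequences are never produced); the result is checked against Python's own UTF-8 codec for the Euro sign example.

    # Encode a single code point into UTF-8 bytes using the lead/continuation patterns above.
    def utf8_encode(cp):
        if cp < 0x80:                                   # 1 byte: 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                                  # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:                                # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18,                  # 4 bytes: 11110xxx then three continuations
                      0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    print(utf8_encode(0x20AC).hex())                   # e282ac, the Euro sign example above
    print(utf8_encode(0x20AC) == "€".encode("utf-8"))  # True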
UTF-16, defined in 1996 and standardized in RFC 2781 (2000), is a variable-length encoding that represents Unicode code points using one or two 16-bit units (2 or 4 bytes). Code points from U+0000 to U+FFFF are encoded directly in a single 16-bit unit (BMP or Basic Multilingual Plane). For code points beyond U+FFFF (Supplementary Planes), UTF-16 uses surrogate pairs: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), allowing representation of the full Unicode range up to 1,114,112 code points. UTF-16 is widely used internally in many programming languages (e.g., Java strings) and operating systems (e.g., Windows), balancing efficiency for common characters with support for rare ones, though it lacks ASCII byte-level compatibility.[50]
GB 18030, introduced as a national standard in China in 2000 (GB 18030-2000) and made mandatory for operating systems sold in China starting September 1, 2001, is a variable-length encoding that extends the earlier GBK standard to provide full support for the Unicode character set, particularly emphasizing Chinese ideographs. The current version, GB/T 18030-2022 (published 2022, effective August 1, 2023), aligns with Unicode 11.0 and supports over 87,000 Han ideographs.[62][63] It uses 1 to 4 bytes per character: single bytes for ASCII, double bytes for legacy GBK characters, and four-byte sequences (two 16-bit words) for additional Unicode mappings not covered by GBK, ensuring compatibility while expanding coverage.
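The 1-, 2- and 4-byte cases can be checked with Python's built-in gb18030 codec; the sample characters are chosen only as representatives of ASCII, the legacy GBK range, and a code point beyond the BMP.

    # Byte lengths under GB 18030 for three representative characters.
    for ch in ("A", "中", "\U00020000"):   # ASCII, a common Han character, CJK Extension B
        print(f"U+{ord(ch):04X} -> {len(ch.encode('gb18030'))} byte(s)")
    # Expected: 1 byte, 2 bytes, 4 bytes respectively.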
UTF-7, developed in the mid-1990s as a 7-bit-safe transformation format for Unicode, encodes text to use only ASCII characters, making it suitable for protocols like email that traditionally restrict to 7-bit data.[64] It employs a base64-like scheme for non-ASCII characters, embedding them within ASCII-safe sequences delimited by '+' and '-' characters, while direct ASCII (A-Z, a-z, 0-9, and select symbols) remains unencoded.[64] Standardized in RFC 2152 (1997), UTF-7 was designed for human readability and transport safety but has largely been superseded by UTF-8 in modern applications due to its added complexity.[64]
