Code point
A code point, codepoint or code position is a particular position in a table, where the position has been assigned a meaning. The table may be one-dimensional (a column), two-dimensional (like cells in a spreadsheet), three-dimensional (sheets in a workbook), and so on, in any number of dimensions.
Technically, a code point is a unique position in a quantized n-dimensional space, where that position has been assigned a semantic meaning. The positions in the table are discrete, non-negative whole numbers (0, 1, 2, 3, and so on), never fractions.
Code points are used in a multitude of formal information processing and telecommunication standards.[1][2] For example, ITU-T Recommendation T.35[3] contains a set of country codes for telecommunications equipment (originally fax machines) which allow equipment to indicate its country of manufacture or operation. In T.35, Argentina is represented by the code point 0x07, Canada by 0x20, Gambia by 0x41, etc.
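As a toy illustration of a code point being nothing more than a table position with an assigned meaning, the following Python sketch encodes the three T.35 entries cited above as a lookup table; the helper name `country_for` and the fallback string are illustrative only.

```python
# Three entries from the ITU-T T.35 country-code table, as cited above;
# the real table contains many more assignments.
T35_COUNTRY_CODES = {
    0x07: "Argentina",
    0x20: "Canada",
    0x41: "Gambia",
}

def country_for(code_point: int) -> str:
    """Return the meaning assigned to a T.35 code point (illustrative helper)."""
    return T35_COUNTRY_CODES.get(code_point, "unknown or unassigned")

print(country_for(0x20))  # Canada
```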
In character encoding
Code points are commonly used in character encoding, where a code point is a numerical value that maps to a specific character. In character encoding, a code point usually represents a single grapheme (a letter, digit, punctuation mark, or whitespace character), but it can also represent a symbol, control character, or formatting function.[4] The set of all possible code points within a given encoding or character set makes up that encoding's codespace.[5][6]
For example, the character encoding scheme ASCII comprises 128 code points in the range 0x00 to 0x7F, Extended ASCII comprises 256 code points in the range 0x00 to 0xFF, and Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode codespace is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode codespace is 17 × 65,536 = 1,114,112.
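The arithmetic above can be checked directly. A minimal Python sketch (the helper `plane_of` is an illustrative name, not part of any standard API) computes the size of the Unicode codespace and the plane in which a given code point falls:

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
NUM_PLANES = 17             # the Basic Multilingual Plane plus 16 supplementary planes
CODESPACE_SIZE = NUM_PLANES * PLANE_SIZE

assert CODESPACE_SIZE == 1_114_112          # 17 x 65,536
assert CODESPACE_SIZE - 1 == 0x10FFFF       # highest Unicode code point

def plane_of(code_point: int) -> int:
    """Return the plane (0 = BMP, 1-16 = supplementary) containing a code point."""
    return code_point // PLANE_SIZE

print(plane_of(0x0041))     # 0  (Basic Multilingual Plane)
print(plane_of(0x10FFFF))   # 16 (last supplementary plane)
```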
In Unicode
For Unicode, a code point is represented by a sequence of one or more code units, the fixed-size bit patterns of a particular encoding form: in the UCS-4 encoding, every code point is encoded as a single 4-byte (four-octet) value, while in the UTF-8 encoding, code points are encoded as sequences of one to four bytes, forming a self-synchronizing code. See comparison of Unicode encodings for details. Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode codespace is unassigned) or given other designated functions.[citation needed]
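A short Python sketch (standard library only) makes the contrast concrete: a fixed-width form such as UTF-32, the modern equivalent of UCS-4, always spends four bytes per code point, while UTF-8 spends one to four depending on the code point's value.

```python
# Compare how many bytes each code point occupies in UTF-8 versus UTF-32.
for ch in ["A", "é", "€", "𝄞"]:            # U+0041, U+00E9, U+20AC, U+1D11E
    utf8 = ch.encode("utf-8")               # 1 to 4 bytes per code point
    utf32 = ch.encode("utf-32-be")          # always 4 bytes per code point
    print(f"U+{ord(ch):04X}: {len(utf8)} byte(s) in UTF-8, {len(utf32)} in UTF-32")
```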
The distinction between a code point and the corresponding abstract character is not pronounced in Unicode but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.[citation needed]
History
The concept of a code point dates to the earliest standards for digital information processing and digital telecommunications.
In Unicode, code points are part of the solution to a conundrum faced by character encoding developers in the 1980s.[7] Adding more bits per character to accommodate larger character sets would have been an unacceptable waste of then-scarce computing resources for Latin-script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zeroed out for such users.[8] The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.
References
- ^ ETSI TS 101 773 (section 4), https://www.etsi.org/deliver/etsi_ts/101700_101799/101773/01.02.01_60/ts_101773v010201p.pdf
- ^ RFC 4190 (section 1), https://datatracker.ietf.org/doc/html/rfc4190
- ^ "T.35 : Procedure for the allocation of ITU-T defined codes for non-standard facilities".
- ^ "The Unicode® Standard Version 11.0 – Core Specification" (PDF). Unicode Consortium. 30 June 2018. p. 23. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018.
Format: Invisible but affects neighboring characters; includes line/paragraph separators
- ^ Unicode. "Glossary of Unicode Terms". unicode.org. Retrieved 20 March 2023.
- ^ "The Unicode® Standard Version 11.0 – Core Specification" (PDF). Unicode Consortium. 30 June 2018. p. 22. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018.
On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters. The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
- ^ Constable, Peter (13 June 2001). "Understanding Unicode™ - I". NRSI: Computers & Writing Systems. Archived from the original (html) on 16 September 2010. Retrieved 25 December 2018.
By the early 1980s, the software industry was starting to recognise the need for a solution to the problems involved with using multiple character encoding standards. Some particularly innovative work was begun at Xerox. The Xerox Star workstation used a multi-byte encoding that allowed it to support a single character set with potentially millions of characters.
- ^ Mark Davis; Ken Whistler (23 March 2001). "Unicode Technical Standard #10 UNICODE COLLATION ALGORITHM". Unicode Consortium. Archived from the original (html) on 25 August 2001. Retrieved 25 December 2018.
6.2 Large Weight Values
Code point
Fundamentals
Definition
A code point is a numerical index or value, typically an integer, used in a character encoding scheme to uniquely identify an abstract character from a predefined repertoire.[4][5] Code points serve as mappings between human-readable characters and machine-readable binary data, enabling the systematic representation and processing of text in computing systems.[3] For example, in the ASCII encoding, the code point 65 represents the uppercase letter 'A'.[6] In Unicode, the same character is identified by the code point U+0041.[7] A code point designates an abstract character, which is a semantic unit independent of its specific encoded form or visual appearance.[5]
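The 'A'/65 correspondence mentioned above can be observed directly in Python, whose built-in `ord` and `chr` convert between a character and its code point:

```python
# The uppercase letter 'A' and its code point, written three equivalent ways.
assert ord("A") == 65          # decimal code point
assert chr(0x41) == "A"        # hexadecimal code point back to the character
assert "\u0041" == "A"         # the U+0041 notation as a string escape
print(f"U+{ord('A'):04X}")     # U+0041
```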
Relation to Characters and Glyphs
An abstract character represents a unit of information with semantic value, serving as the smallest component of written language independent of any particular encoding scheme or visual rendering method. For example, the abstract character for the letter 'é' embodies the concept of an 'e' with an acute accent, regardless of how it is stored digitally or displayed. In the Unicode Standard, each such abstract character is uniquely identified by a code point, which is a non-negative integer assigned within the Unicode codespace.[8][1]
In contrast to abstract characters, a glyph is the particular visual image or shape used to depict a character during rendering or printing. Glyphs are defined by font technologies and can differ widely; for instance, the abstract character 'A' might be rendered as a serif glyph in one typeface or a sans-serif variant in another. The Unicode Standard specifies that glyphs are not part of the encoding model itself but result from the interpretation of code points by rendering engines.[8][5]
A code point generally maps to a single abstract character, but the transition from abstract character to glyph introduces variability based on context and presentation rules. One abstract character may correspond to multiple glyphs, such as the positional forms in cursive scripts like Arabic, where the shape adapts to initial, medial, or final position. Conversely, ligatures can combine multiple abstract characters, each with its own code point, into a single composite glyph, as seen with 'fi' forming a unified shape in many fonts to improve readability.[8][5]
Unicode normalization addresses scenarios where distinct code point sequences encode the same abstract character, enabling consistent text processing across systems. For example, the precomposed 'é' (U+00E9) is canonically equivalent to the sequence 'e' (U+0065) followed by a combining acute accent (U+0301), allowing normalization forms like NFC (which favors precomposed characters) or NFD (which decomposes them) to standardize representations without altering semantic meaning. This equivalence ensures that applications can interchange text reliably while preserving the underlying abstract character.[9][8]
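The canonical equivalence described above can be demonstrated with Python's standard `unicodedata` module; the sketch below normalizes between the precomposed and decomposed forms of 'é'.

```python
import unicodedata

precomposed = "\u00E9"      # 'é' as a single code point
decomposed = "e\u0301"      # 'e' followed by a combining acute accent (U+0301)

print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True: NFC favors precomposed characters
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True: NFD decomposes them
print([f"U+{ord(c):04X}" for c in decomposed])                   # ['U+0065', 'U+0301']
```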
Representation in Encodings
Code Units vs. Code Points
In character encodings, a code unit is the smallest fixed-size unit of storage or transmission for text data, typically defined by the bit width of the encoding form. For instance, UTF-8 employs 8-bit code units (bytes), while UTF-16 uses 16-bit code units.[1] These code units serve as the basic building blocks for representing sequences of text, allowing computers to process and interchange Unicode data efficiently across different systems.[10]
The primary distinction between code points and code units lies in their roles and granularity: a code point is a single numerical value (from 0 to 10FFFF in hexadecimal) assigned to an abstract character in the Unicode standard, whereas code units are the encoded bits that collectively represent one or more code points.[1] In fixed-width encodings like UTF-32, each code point corresponds directly to one code unit (a 32-bit value), simplifying access. In variable-width encodings such as UTF-8 and UTF-16, however, a single code point often requires multiple code units, particularly for characters beyond the Basic Multilingual Plane. This multi-unit representation enables compact storage but complicates the parsing of text streams.
A concrete example illustrates the difference: the Unicode code point U+1F600, which maps to the grinning face emoji (😀), is encoded as four 8-bit code units in UTF-8 (hexadecimal F0 9F 98 80, or bytes 240, 159, 152, 128) and as two 16-bit code units in UTF-16 (hexadecimal D83D DE00, forming a surrogate pair).[11] In UTF-16, the first unit (D83D) is a high surrogate and the second (DE00) a low surrogate; together they represent the full code point, while treating them separately would yield invalid or unintended characters.
When processing text, algorithms must correctly decode sequences of code units into complete code points to ensure accurate interpretation of abstract characters. Failing to handle multi-unit code points properly, for example by assuming each code unit is an independent character, can result in errors such as mojibake, where encoded text is misinterpreted and rendered as garbled symbols when decoded with an incompatible scheme.[4] This underscores the need for encoding-aware software to normalize and validate input, preventing data corruption in applications ranging from web browsers to file systems.
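The U+1F600 example above can be reproduced with Python's built-in codecs (Python 3 strings are themselves sequences of code points, so the emoji has length 1):

```python
emoji = "\U0001F600"                  # grinning face, code point U+1F600

utf8 = emoji.encode("utf-8")          # four 8-bit code units
utf16 = emoji.encode("utf-16-be")     # two 16-bit code units (a surrogate pair)

print(utf8.hex(" "))                  # f0 9f 98 80
print(utf16.hex(" "))                 # d8 3d de 00  (high surrogate D83D, low surrogate DE00)
print(len(emoji))                     # 1 code point
print(len(utf8), len(utf16) // 2)     # 4 UTF-8 code units, 2 UTF-16 code units
```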
Fixed-Width Encodings
Fixed-width encodings are character encoding schemes in which each code point is represented by a fixed number of code units, so every character occupies a sequence of the same length. This direct one-to-one mapping between code points and fixed-size code units simplifies representation, since no variable-length sequences are needed to encode different characters.[12]
The primary advantages of fixed-width encodings are ease of implementation and processing, because no decoding logic is needed to determine character boundaries. They also enable efficient random access to individual characters within a text stream: the position of the nth character can be computed in constant time by simple arithmetic on code unit offsets. These properties make fixed-width encodings particularly suitable for applications with small character repertoires, where simplicity outweighs storage efficiency.[13][14]
However, fixed-width encodings have significant limitations due to their uniform sizing, which caps the number of representable code points at the power of two corresponding to the width (e.g., 128 for 7 bits or 256 for 8 bits). This restricts their ability to accommodate large or diverse character sets, such as those required for multilingual text, and historically led to multiple incompatible variants for different languages. For the full Unicode range, UTF-32 is a fixed-width encoding that uses 32-bit code units, providing a direct mapping for all 1,114,112 possible code points without surrogates or variable lengths, though it consumes more storage for ASCII-range text than variable-width forms.[2]
Prominent examples include ASCII, a 7-bit encoding supporting 128 code points from 0x00 to 0x7F, primarily for English-language text and control characters. ISO/IEC 8859-1, an 8-bit extension of ASCII, provides 256 code points for Western European languages, with the first 128 matching ASCII.[15] EBCDIC, another 8-bit scheme developed by IBM, uses a different assignment of bit patterns to characters and remains in use on mainframe systems.[16] Windows-1252, a Microsoft variant of ISO/IEC 8859-1, also uses 8 bits but assigns additional printable characters in the upper range for enhanced Western European support.[17]
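The constant-time random access property can be sketched in Python using UTF-32 as the fixed-width form: because every code point occupies exactly four bytes, the nth character starts at byte offset 4n. The helper `nth_code_point` is illustrative only.

```python
text = "naïve 😀"                     # mixes 1-, 2-, and 4-byte UTF-8 characters
fixed = text.encode("utf-32-be")      # fixed width: 4 bytes per code point

def nth_code_point(buf: bytes, n: int) -> int:
    """Constant-time lookup: the nth code point starts at byte offset 4*n."""
    offset = 4 * n
    return int.from_bytes(buf[offset:offset + 4], "big")

print(hex(nth_code_point(fixed, 6)))  # 0x1f600, found without scanning the string
# In a variable-width encoding such as UTF-8, locating the nth character
# requires walking the byte sequence from the beginning.
```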
Variable-Width Encodings
Variable-width encodings represent Unicode code points using a varying number of code units, allowing more efficient storage of text dominated by low-range characters while still supporting the full range up to U+10FFFF.[2] This approach contrasts with fixed-width encodings like UTF-32, which allocate the same space regardless of the code point value.[2] By adjusting the number of code units to the code point's magnitude, these encodings optimize space for common scripts such as Latin and Cyrillic, which fit into fewer units, while extending to rarer or higher-range characters with additional units.[2]
UTF-8, a widely used variable-width encoding, employs 8-bit code units and determines the sequence length from the leading bits of the first byte.[2] Code points in the range U+0000 to U+007F (basic Latin and ASCII) are encoded in a single byte, ensuring compatibility with legacy ASCII systems.[2] Code points from U+0080 to U+07FF (e.g., extended Latin, Greek, Cyrillic, Arabic) use two bytes; U+0800 to U+FFFF (including most of the Basic Multilingual Plane, or BMP) require three bytes; and U+10000 to U+10FFFF (the supplementary planes) use four bytes.[2] Continuation bytes (always 10xxxxxx in binary) follow the lead byte, which specifies the total length. This gives UTF-8 a self-synchronizing property: a parser can locate sequence boundaries efficiently, even after data corruption, by examining at most four bytes backward.[2]
| Code Point Range | Bytes in UTF-8 | Example Characters |
|---|---|---|
| U+0000–U+007F | 1 | Basic Latin (A–Z) |
| U+0080–U+07FF | 2 | Extended Latin, Greek, Cyrillic, Arabic |
| U+0800–U+FFFF | 3 | Devanagari, Thai, BMP Han |
| U+10000–U+10FFFF | 4 | Emoji, supplementary CJK |
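A hand-rolled encoder for a single code point, following the length rules summarized in the table, is a useful way to see the lead-byte and continuation-byte structure. The sketch below is illustrative only (it omits the rejection of surrogate code points U+D800 to U+DFFF that a conforming encoder must perform) and checks itself against Python's built-in codec.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point using the UTF-8 length rules from the table above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:                       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# One code point per row of the table, checked against the standard codec.
for cp in (0x41, 0x3B1, 0x0E01, 0x1F600):        # 'A', Greek alpha, Thai ko kai, 😀
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```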
