ISO/IEC 2022
| Language | Various. |
|---|---|
| Standard | |
| Classification | Stateful system of encodings (with stateless pre-configured subsets) |
| Transforms / Encodes | US-ASCII and, depending on implementation: |
| Succeeded by | ISO/IEC 10646 (Unicode) |
| Other related encodings | Stateful subsets: Pre-configured versions: |
ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35,[1][2] the ANSI standard ANSI X3.41[3] and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.[4]
ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes (0x00–1F and 0x7F–9F) to be used for non-printing control codes[5] for formatting and in-band instructions (such as line breaks or formatting instructions for text terminals), rather than graphical characters. It also specifies a syntax for escape sequences, multiple-byte sequences beginning with the ESC control code, which can likewise be used for in-band instructions.[6] Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429, portions of which are implemented by ANSI.SYS and terminal emulators.
ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets (for example, between ASCII and the Japanese JIS X 0208) so that several of them can be used in a single document,[7] effectively combining them into a single stateful encoding (a feature less important since the advent of Unicode). It is designed to be usable in both 8-bit environments and 7-bit environments (those where only seven bits are usable in a byte, such as e-mail without 8BITMIME).[8]
Encodings and conformance
The ASCII character set supports the ISO Basic Latin alphabet (equivalent to the English alphabet), and does not provide good support for languages which use additional letters, or which use a different writing system altogether. Other writing systems with relatively few characters, such as Greek, Cyrillic, Arabic or Hebrew, as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet, have historically been represented on personal computers with different 8-bit, single byte, extended ASCII encodings, which follow ASCII when the most significant bit is 0 (i.e. bytes 0x00–7F, when represented in hexadecimal), and include additional characters for a most significant bit of 1 (i.e. bytes 0x80–FF). Some of these, such as the ISO 8859 series, conform to ISO 2022,[9][10] while others such as DOS code page 437 do not, usually due to not reserving the bytes 0x80–9F for control codes.
Certain East Asian languages, specifically Chinese, Japanese, and Korean (collectively "CJK"), are written using far more characters than the maximum of 256 which can be represented in a single byte, and were first represented on computers with language-specific double-byte encodings or variable-width encodings; some of these (such as the Simplified Chinese encoding GB 2312) conform to ISO 2022, while others (such as the Traditional Chinese encoding Big5) do not. Control codes in ISO 2022 are always represented with a single byte, regardless of the number of bytes used for graphical characters. CJK encodings used in 7-bit environments which use ISO 2022 mechanisms to switch between character sets are often given names starting with "ISO-2022-", most notably ISO-2022-JP, although some other CJK encodings such as EUC-JP also make use of ISO 2022 mechanisms.[11][12]
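The relationship between a 7-bit "ISO-2022-" profile and an 8-bit EUC form can be sketched with Python's built-in `iso2022_jp` and `euc_jp` codecs. The choice of Python and of the sample text is an assumption made for this illustration, not something the standard prescribes.

```python
# The same JIS X 0208 text in a 7-bit ISO 2022 profile and in an 8-bit one.
text = "こんにちは"

seven_bit = text.encode("iso2022_jp")
eight_bit = text.encode("euc_jp")

# ISO-2022-JP: ESC $ B designates JIS X 0208 to G0, ESC ( B restores ASCII.
assert seven_bit == b"\x1b$B$3$s$K$A$O\x1b(B"

# EUC-JP: the same two-byte codes, but with the high bit set on every byte,
# so no escape sequences are needed for this text.
payload = seven_bit[3:-3]  # strip the two three-byte escape sequences
assert bytes(b | 0x80 for b in payload) == eight_bit

print(seven_bit)
print(eight_bit.hex(" "))
```

The 7-bit form is stateful (a decoder must remember the last designation), whereas the 8-bit form identifies the set of each byte from its high bit alone.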
Since the first 256 code points of Unicode were taken from ISO 8859-1, Unicode inherits the concept of C0 and C1 control codes from ISO 2022, although it adds other non-printing characters besides the ISO 2022 control codes. However, Unicode transformation formats such as UTF-8 generally deviate from the ISO 2022 structure in various ways, including:
- Using 8-bit bytes, but not representing the C1 codes in their single-byte forms specified in ISO 2022 (most UTFs, one exception being the obsolete UTF-1)
- Representing all characters, including control codes, with multiple bytes (e.g. UTF-16, UTF-32)
- Mixing bytes with the most significant bit set and unset within the coded representation for a single code point (e.g. UTF-1, GB 18030)
ISO 2022 escape sequences do, however, exist for switching to and from UTF-8 as a "coding system different from that of ISO 2022",[13] which are supported by certain terminal emulators such as xterm.[14]
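The three deviations listed above can be observed directly with Python's standard codecs. This is only an illustrative sketch; the sample characters (U+0085 NEXT LINE, a line feed, and U+0080) are chosen for the example.

```python
nel = "\u0085"  # NEXT LINE, a C1 control code

# ISO 8859-1 follows the ISO 2022 layout: the C1 code is one byte in 0x80-0x9F.
print(nel.encode("latin-1"))      # b'\x85'

# UTF-8 uses 8-bit bytes but represents the same C1 code with two bytes.
print(nel.encode("utf-8"))        # b'\xc2\x85'

# UTF-16 represents every character, including C0 controls, with wider units.
print("\n".encode("utf-16-be"))   # b'\x00\n'

# GB 18030 mixes bytes with the high bit set and unset inside one character:
# here a four-byte code whose second and fourth bytes are ASCII digits.
mixed = "\u0080".encode("gb18030")
print(mixed.hex(" "))
```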
Overview
Elements
ISO/IEC 2022 specifies the following:
- An infrastructure of multiple character sets with particular structures which may be included in a single character encoding system, including multiple graphical character sets and multiple sets of both primary (C0) and secondary (C1) control codes,[15]
- A format for encoding these sets, assuming that 8 bits are available per byte,[16]
- A format for encoding these sets in the same encoding system when only 7 bits are available per byte,[17] and a method for transforming any conformant character data to pass through such a 7-bit environment,[8]
- The general structure of ANSI escape codes,[6] and
- Specific escape code formats for identifying individual character sets,[7] for announcing the use of particular encoding features or subsets,[18] and for interacting with or switching to other encoding systems.[18]
Code versions
A specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation. Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system.[19] In particular, 7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP (or JIS encoding), which has primarily been used in Japanese-language e-mail. 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873 (ECMA-43), which is in turn conformed to by ISO/IEC 8859,[9][10] and Extended Unix Code, which is used for East Asian languages.[11] More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.[3]
Designation escape sequences
The escape sequences for switching to particular character sets or encodings are registered with the ISO-IR registry (except for those set apart for private use, the meanings of which are defined by vendors, or by protocol specifications such as ARIB STD-B24) and follow the patterns defined within the standard. Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences.
Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of a line. Furthermore, the escape sequences declaring the national character sets may be absent if a specific ISO-2022-based encoding permits or requires this, and dictates that particular national character sets are to be used. For example, ISO-8859-1 states that no defining escape sequence is needed.
Multi-byte characters
To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that a seven-bit character representation will normally be able to represent 94 graphic (printable) characters (in addition to space and 33 control characters); if only the C0 control codes (narrowly defined) are excluded, this can be expanded to 96 characters. Using two bytes, it is thus possible to represent up to 8,836 (94×94) characters; and, using three bytes, up to 830,584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW's unregistered G2 does, as does the similarly unregistered CCCII).
For the two-byte character sets, the code point of each character is normally specified in so-called row-cell or kuten[a] form, which comprises two numbers between 1 and 94 inclusive, specifying a row[b] and cell[c] of that character within the zone. For a three-byte set, an additional plane[d] number is included at the beginning.[20] The escape sequences do not only declare which character set is being used, but also whether the set is single-byte or multi-byte (although not how many bytes it uses if it is multi-byte), and also whether each byte has 94 or 96 permitted values.
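As a minimal sketch of the row-cell arithmetic, the hypothetical helper below maps a kuten pair to the two bytes of a 94×94 set, either in their 7-bit (GL) form or with the high bit set (GR). The function name and the `gr` parameter are inventions for this example; the check against Python's Japanese codecs assumes JIS X 0208 as the set.

```python
def kuten_to_bytes(row: int, cell: int, gr: bool = False) -> bytes:
    """Map a row-cell (kuten) pair to its two-byte code in a 94x94 set."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    offset = 0xA0 if gr else 0x20  # GR bytes are the GL bytes with bit 8 set
    return bytes([offset + row, offset + cell])


# Capacity of two-byte and three-byte 94-sets, as quoted in the text.
assert 94 * 94 == 8836 and 94 ** 3 == 830584

# Row 4, cell 19 of JIS X 0208 is the hiragana "こ".
gl = kuten_to_bytes(4, 19)
gr = kuten_to_bytes(4, 19, gr=True)
print(gl, (b"\x1b$B" + gl + b"\x1b(B").decode("iso2022_jp"))
print(gr.hex(" "), gr.decode("euc_jp"))
```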
Code structure
Notation and nomenclature
ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated"[21] into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked"[22] to interpret bytes in the stream.
Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash.[23] Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself.[24] They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters,[25] although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
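Converting between column-line notation and byte values is simple arithmetic: the column is the high hexadecimal digit and the line is the low one. The two helper names below are hypothetical, and the zero-padded output format is a choice made for this sketch.

```python
def from_column_line(notation: str) -> int:
    """Convert column-line notation such as '2/5' or '01/11' to a byte value."""
    column, line = (int(part) for part in notation.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must each be in 0..15")
    return column * 16 + line


def to_column_line(byte: int) -> str:
    """Convert a byte value to zero-padded column-line notation."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


print(hex(from_column_line("1/11")))                      # the ESC byte
print(to_column_line(0x2F))                               # last code of column 02
print(bytes(map(from_column_line, "2/5 4/0".split())))    # b'%@'
```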
Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ("graphics right").[5] The terms "CL" (0x00–0x1F) and "CR" (0x80–0x9F) are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas the CR range always either invokes the secondary (C1) controls or is unused.[5]
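The four ranges can be summarised in a small classifier. The function name is hypothetical; the boundaries are the ones given in the paragraph above.

```python
def region(byte: int) -> str:
    """Return the ISO 2022 code-table area (CL, GL, CR or GR) of a byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    if byte <= 0x1F:
        return "CL"  # primary (C0) controls
    if byte <= 0x7F:
        return "GL"  # graphics left
    if byte <= 0x9F:
        return "CR"  # secondary (C1) controls, or unused
    return "GR"      # graphics right (8-bit environments only)


for sample in (0x1B, 0x41, 0x8E, 0xC1):
    print(f"0x{sample:02X} -> {region(sample)}")
```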
Fixed coded characters
The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are designated "fixed" coded characters[26] and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.[27]
General syntax of escape sequences
Sequences using the ESC (escape) character take the form ESC [I...] F, where the ESC character is followed by zero or more intermediate bytes[28] (I) from the range 0x20–0x2F, and one final byte[29] (F) from the range 0x30–0x7E.[30]
The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, F bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.[31]
Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function "Control Sequence Introducer", which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in the range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence".[32]
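The byte ranges above translate directly into regular expressions. The sketch below is an illustration only: the names are hypothetical, and the control-sequence pattern assumes that the Control Sequence Introducer is represented in its 7-bit escape-sequence form, ESC [.

```python
import re

# ESC, zero or more intermediate bytes (0x20-0x2F), one final byte (0x30-0x7E).
ESCAPE_SEQUENCE = re.compile(rb"\x1b([\x20-\x2f]*)([\x30-\x7e])")

# CSI (assumed here to be ESC [), then bytes in 0x30-0x3F, then bytes in
# 0x20-0x2F, then a single byte in 0x40-0x7E.
CONTROL_SEQUENCE = re.compile(rb"\x1b\[([\x30-\x3f]*)([\x20-\x2f]*)([\x40-\x7e])")


def parse_escape(data: bytes):
    """Return (intermediates, final, private) for a sequence at the start of data."""
    match = ESCAPE_SEQUENCE.match(data)
    if match is None:
        raise ValueError("not a well-formed escape sequence")
    intermediates, final = match.group(1), match.group(2)
    private = 0x30 <= final[0] <= 0x3F  # F bytes reserved for private use
    return intermediates, final, private


print(parse_escape(b"\x1b$)A"))                        # (b'$)', b'A', False)
print(parse_escape(b"\x1b#6"))                         # (b'#', b'6', True)
print(CONTROL_SEQUENCE.fullmatch(b"\x1b[1;31m").groups())
```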
Graphical character sets
Each of the four working sets G0 through G3 may be a 94-character set or a 94ⁿ-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96ⁿ-character set.
In a 96- or 96ⁿ-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94ⁿ-character set, the bytes 0x20 and 0x7F are not used.[33] When a 96- or 96ⁿ-character set is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available until a 94- or 94ⁿ-character set (such as the G0 set) is invoked in GL.[5] 96-character sets cannot be designated to G0.
Registration of a set as a 96-character set does not necessarily mean that the 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434,[34] the box drawing set from ISO/IEC 10367,[35] and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT).[36]
Combining characters
Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question.[37] ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC)[37] (CSI 0x20 (SP) 0x5F (_)).[38]
Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43[39] and by ISO/IEC 8859,[40][41] on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.[42]
Control character sets
Control character sets are classified as "primary" or "secondary" control code sets,[43] respectively also called "C0" and "C1" control code sets.[44]
A C0 control set must contain the ESC (escape) control character at 0x1B[45] (a C0 set containing only ESC is registered as ISO-IR-104),[46] whereas a C1 control set may not contain the escape control whatsoever.[33] Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set.[44]
If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the ASCII control codes, appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations.[45] Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB),[47] or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.[45][33]
A C0 control set is invoked over the CL range 0x00 through 0x1F,[48] whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in a 7-bit or 8-bit environment),[43] but not both. Which style of C1 invocation is used must be specified in the definition of the code version.[49] For example, ISO/IEC 4873 specifies CR bytes for the C1 controls which it uses (SS2 and SS3).[50] If necessary, which invocation is used may be communicated using announcer sequences.
In the latter case, single control functions from the C1 control code set are invoked using "type Fe" escape sequences,[33] meaning those where the ESC control character is followed by a byte from columns 04 or 05 (that is to say, ESC 0x40 (@) through ESC 0x5F (_)).[51]
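The correspondence between the two styles of C1 invocation is a fixed offset of 0x40 between the CR byte and the final byte of the type Fe sequence. The helper names below are hypothetical and serve only to illustrate that arithmetic.

```python
ESC = 0x1B


def c1_to_escape(byte: int) -> bytes:
    """Return the type Fe escape sequence for a C1 byte in 0x80-0x9F."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a CR-range byte")
    return bytes([ESC, byte - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Return the CR-range byte for a type Fe escape sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return seq[1] + 0x40


print(c1_to_escape(0x8E))            # b'\x1bN'
print(hex(escape_to_c1(b"\x1bO")))   # 0x8f
```

A given code version uses one representation or the other, so a converter like this would only be applied at the boundary between a 7-bit and an 8-bit environment.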
Other control functions
Additional control functions are assigned to "type Fs" escape sequences (in the range ESC 0x60 (`) through ESC 0x7E (~)); these have permanently assigned meanings rather than depending on the C0 or C1 designations.[51][52] Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2.[52] Other single control functions may be registered to type "3Ft" escape sequences (in the range ESC 0x23 (#) [I...] 0x40 (@) through ESC 0x23 (#) [I...] 0x7E (~)),[53] although no "3Ft" sequences are currently assigned (as of 2019).[54] Some of the type "Fs" functions are specified in ECMA-35 (ISO 2022 / ANSI X3.41), others in ECMA-48 (ISO 6429 / ANSI X3.64).[55] ECMA-48 refers to these as "independent control functions".[56]
| Code | Hex | Abbr. | Name | Effect[54] |
|---|---|---|---|---|
| ESC ` | 1B 60 | DMI | Disable manual input | Disables some or all of the manual input facilities of the device. |
| ESC a | 1B 61 | INT | Interrupt | Interrupts the current process. |
| ESC b | 1B 62 | EMI | Enable manual input | Enables the manual input facilities of the device. |
| ESC c | 1B 63 | RIS | Reset to initial state | The device's display and input subsystems revert to the same state as when it has just been powered on.[57] Connections to clients are unaffected. |
| ESC d | 1B 64 | CMD | Coding method delimiter | Used when interacting with an outer coding / representation system, see below. |
| ESC n | 1B 6E | LS2 | Locking shift two | Shift function, see below. |
| ESC o | 1B 6F | LS3 | Locking shift three | Shift function, see below. |
| ESC \| | 1B 7C | LS3R | Locking shift three right | Shift function, see below. |
| ESC } | 1B 7D | LS2R | Locking shift two right | Shift function, see below. |
| ESC ~ | 1B 7E | LS1R | Locking shift one right | Shift function, see below. |
Escape sequences of type "Fp" (ESC 0x30 (0) through ESC 0x3F (?)) or of type "3Fp" (ESC 0x23 (#) [I...] 0x30 (0) through ESC 0x23 (#) [I...] 0x3F (?)) are reserved for single private use control codes, by prior agreement between parties.[58] Several such sequences of both types are used by DEC terminals such as the VT100, and are thus supported by terminal emulators.[14]
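The escape sequence types named so far (Fe, Fs, Fp, 3Ft and 3Fp) differ only in the range of the final byte and in whether the first intermediate byte is 0x23 (#). A rough classifier, with a hypothetical function name and an "other" catch-all for every remaining pattern, might look like this:

```python
def escape_type(seq: bytes) -> str:
    """Classify an escape sequence as Fe, Fs, Fp, 3Ft, 3Fp or 'other'."""
    if len(seq) < 2 or seq[0] != 0x1B:
        raise ValueError("sequence must start with ESC")
    *intermediates, final = seq[1:]
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate bytes must be in 0x20-0x2F")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")
    private = final <= 0x3F  # private-use final bytes
    if not intermediates:
        if private:
            return "Fp"
        return "Fe" if final <= 0x5F else "Fs"
    if intermediates[0] == 0x23:  # ESC # ...
        return "3Fp" if private else "3Ft"
    return "other"  # designations, announcements and so on


for sample in (b"\x1bN", b"\x1bc", b"\x1b7", b"\x1b#6", b"\x1b(B"):
    print(sample, escape_type(sample))
```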
Shift functions
By default, GL codes specify G0 characters and GR codes (where available) specify G1 characters; this may be otherwise specified by prior agreement. The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below.[59]
An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using Shift In and Shift Out to switch between the sets (e.g. JIS X 0201),[60] although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set (e.g. T.51).[61]
The codes shown in the table below are the most common encodings of these control codes, conforming to ISO/IEC 6429. The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below,[54] whereas the others are part of a C0 or C1 control code set (as shown below, SI (LS0) and SO (LS1) are C0 controls and SS2 and SS3 are C1 controls), meaning that their coding and availability may vary depending on which control sets are designated: they must be present in the designated control sets if their functionality is used.[48][49] The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.
Alternative encodings of the single-shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in T.51[61] and T.61.[62] This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3,[63] and may also be used for SS2 only,[64] although older code sets with SS2 at 0x1C also exist,[65][66][67] and were mentioned as such in an earlier edition of the standard.[68] The 0x8E and 0x8F coding of the single shifts as shown below is mandatory for ISO/IEC 4873 levels 2 and 3.[69]
| Code | Hex | Abbr. | Name | Effect |
|---|---|---|---|---|
| SI | 0F | SI, LS0 | Shift In, Locking shift zero | GL encodes G0 from now on[70][71] |
| SO | 0E | SO, LS1 | Shift Out, Locking shift one | GL encodes G1 from now on[70][71] |
| ESC n | 1B 6E | LS2 | Locking shift two | GL encodes G2 from now on[70][71] |
| ESC o | 1B 6F | LS3 | Locking shift three | GL encodes G3 from now on[70][71] |
| CR area: SS2; escape code: ESC N | CR area: 8E; escape code: 1B 4E | SS2 | Single shift two | GL or GR (see below) encodes G2 for the immediately following character only[72] |
| CR area: SS3; escape code: ESC O | CR area: 8F; escape code: 1B 4F | SS3 | Single shift three | GL or GR (see below) encodes G3 for the immediately following character only[72] |
| ESC ~ | 1B 7E | LS1R | Locking shift one right | GR encodes G1 from now on[73] |
| ESC } | 1B 7D | LS2R | Locking shift two right | GR encodes G2 from now on[73] |
| ESC \| | 1B 7C | LS3R | Locking shift three right | GR encodes G3 from now on[73] |
Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts,[12] and they may simply be viewed as prefix bytes (i.e. the first bytes in a multi-byte sequence),[11] since they do not require the encoder to keep the currently active set as state, unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as the single-shift area. This must be specified in the definition of the code version.[72] For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area.[74][75] If necessary, which single-shift area is used may be communicated using announcer sequences.
The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.[59]
The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously.[76]
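The difference between locking and single shifts can be sketched as a small state machine. The code below is a simplified illustration under several assumptions that are not part of the standard: an 8-bit environment, single-byte sets only, SS2 and SS3 represented as the CR bytes 0x8E and 0x8F, GL as the single-shift area, and all other control codes, SP and DEL ignored. The names `LOCKING_SHIFTS`, `SINGLE_SHIFTS` and `resolve_shifts` are hypothetical.

```python
LOCKING_SHIFTS = {
    b"\x0f": ("GL", 0),   # SI / LS0
    b"\x0e": ("GL", 1),   # SO / LS1
    b"\x1bn": ("GL", 2),  # LS2
    b"\x1bo": ("GL", 3),  # LS3
    b"\x1b~": ("GR", 1),  # LS1R
    b"\x1b}": ("GR", 2),  # LS2R
    b"\x1b|": ("GR", 3),  # LS3R
}
SINGLE_SHIFTS = {0x8E: 2, 0x8F: 3}  # SS2, SS3 as CR bytes


def resolve_shifts(data: bytes):
    """Return (working set, 7-bit code) for each graphic byte in data."""
    invoked = {"GL": 0, "GR": 1}  # default invocation: G0 in GL, G1 in GR
    pending = None                # set selected by a single shift, if any
    result = []
    i = 0
    while i < len(data):
        width = 2 if data[i] == 0x1B else 1
        token = data[i:i + width]
        i += width
        if token in LOCKING_SHIFTS:
            area, working_set = LOCKING_SHIFTS[token]
            invoked[area] = working_set          # stays in effect until changed
        elif token[0] in SINGLE_SHIFTS:
            pending = SINGLE_SHIFTS[token[0]]    # affects the next character only
        elif token[0] == 0x1B:
            raise ValueError(f"unsupported escape sequence {token!r}")
        elif 0x21 <= token[0] <= 0x7E:
            working_set = invoked["GL"] if pending is None else pending
            pending = None
            result.append((f"G{working_set}", token[0]))
        elif 0xA1 <= token[0] <= 0xFE:
            result.append((f"G{invoked['GR']}", token[0] & 0x7F))
        # anything else (other controls, SP, DEL) is ignored in this sketch
    return result


print(resolve_shifts(b"A\x0eB\x0fC"))   # SO ... SI switch GL between G1 and G0
print(resolve_shifts(b"\x8eAB\xc1"))    # SS2 affects only the following byte
```

Only the locking shifts modify the `invoked` table, which is why single shifts can be treated as prefix bytes that need no persistent decoder state.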
Registration of graphical and control code sets
The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence, and a unique registry entry number to identify it.[77][78] For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165.
Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set.[79]
ISO-IR registered escape sequences are also used encapsulated in a Formal Public Identifier to identify character sets used for numeric character references in SGML (ISO 8879). For example, the string ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0 can be used to identify the International Reference Version of ISO 646-1983,[80] and the HTML 4.01 specification uses ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6 to identify Unicode.[81] The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.[80]
Character set designations
Escape sequences to designate character sets take the form ESC I [I...] F. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement).
Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned.[82] At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical.
As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes,[31] in this case for private-use character set definitions (which might include unregistered sets defined by protocols such as ARIB STD-B24[83] or MARC-8,[3] or vendor-specific sets such as DEC Special Graphics).[84] However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20 (space), the set denoted is a "dynamically redefinable character set" (DRCS) defined by prior agreement,[85] which is also considered private use.[31] A graphical set being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters.[86] The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40 (@);[87] however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.[88]
There are also three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set.[89]
There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.[90]
A table of escape sequence I bytes and the designation or other function which they perform is below.[91]
| Code | Hex | Abbr. | Name | Effect | Example |
|---|---|---|---|---|---|
| ESC SP F | 1B 20 F | ACS | Announce code structure | Specifies code features used, e.g. working sets (see below).[92] | ESC SP L (ISO 4873 level 1) |
| ESC ! F | 1B 21 F | CZD | C0-designate | F selects a C0 control character set to be used.[93] | ESC ! @ (ASCII C0 codes) |
| ESC " F | 1B 22 F | C1D | C1-designate | F selects a C1 control character set to be used.[94] | ESC " C (ISO 6429 C1 codes) |
| ESC # F | 1B 23 F | - | (Single control function) | (Reserved for sequences for control functions, see above.) | ESC # 6 (private use: DEC Double Width Line)[95] |
| ESC $ ( F | 1B 24 28 F | GZDM4 | G0-designate multibyte 94-set | F selects a 94ⁿ-character set to be used for G0.[89] | ESC $ ( C (KS X 1001 in G0) |
| ESC $ ) F | 1B 24 29 F | G1DM4 | G1-designate multibyte 94-set | F selects a 94ⁿ-character set to be used for G1.[89] | ESC $ ) A (GB 2312 in G1) |
| ESC $ * F | 1B 24 2A F | G2DM4 | G2-designate multibyte 94-set | F selects a 94ⁿ-character set to be used for G2.[89] | ESC $ * B (JIS X 0208 in G2) |
| ESC $ + F | 1B 24 2B F | G3DM4 | G3-designate multibyte 94-set | F selects a 94ⁿ-character set to be used for G3.[89] | ESC $ + D (JIS X 0212 in G3) |
| ESC $ , F | 1B 24 2C F | - | (not used) | (not used)[f] | - |
| ESC $ - F | 1B 24 2D F | G1DM6 | G1-designate multibyte 96-set | F selects a 96ⁿ-character set to be used for G1.[89] | ESC $ - 1 (private use) |
| ESC $ . F | 1B 24 2E F | G2DM6 | G2-designate multibyte 96-set | F selects a 96ⁿ-character set to be used for G2.[89] | ESC $ . 2 (private use) |
| ESC $ / F | 1B 24 2F F | G3DM6 | G3-designate multibyte 96-set | F selects a 96ⁿ-character set to be used for G3.[89] | ESC $ / 3 (private use) |
| ESC % F | 1B 25 F | DOCS | Designate other coding system | Switches coding system, see below. | ESC % G (UTF-8) |
| ESC & F | 1B 26 F | IRR | Identify revised registration | Prefixes designation escape to denote revision.[g] | ESC & @ ESC $ B (JIS X 0208:1990 in G0) |
| ESC ' F | 1B 27 F | - | (not used) | (not used) | - |
| ESC ( F | 1B 28 F | GZD4 | G0-designate 94-set | F selects a 94-character set to be used for G0.[89] | ESC ( B (ASCII in G0) |
| ESC ) F | 1B 29 F | G1D4 | G1-designate 94-set | F selects a 94-character set to be used for G1.[89] | ESC ) I (JIS X 0201 Kana in G1) |
| ESC * F | 1B 2A F | G2D4 | G2-designate 94-set | F selects a 94-character set to be used for G2.[89] | ESC * v (ITU T.61 RHS in G2) |
| ESC + F | 1B 2B F | G3D4 | G3-designate 94-set | F selects a 94-character set to be used for G3.[89] | ESC + D (NATS-SEFI-ADD in G3) |
| ESC , F | 1B 2C F | - | (not used) | (not used)[h] | - |
| ESC - F | 1B 2D F | G1D6 | G1-designate 96-set | F selects a 96-character set to be used for G1.[89] | ESC - A (ISO 8859-1 RHS in G1) |
| ESC . F | 1B 2E F | G2D6 | G2-designate 96-set | F selects a 96-character set to be used for G2.[89] | ESC . B (ISO 8859-2 RHS in G2) |
| ESC / F | 1B 2F F | G3D6 | G3-designate 96-set | F selects a 96-character set to be used for G3.[89] | ESC / b (ISO 8859-15 RHS in G3) |
Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94ⁿ-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)
Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).
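The graphical-set rows of the table follow a regular pattern that a short parser can decode: an optional `$` marks a multi-byte set, and the next intermediate byte selects both the set size and the working set. The sketch below covers only graphical designations; the function name and the shape of the returned dictionary are inventions for this example, and it does not consult the ISO-IR registry to name the set.

```python
def parse_designation(seq: bytes) -> dict:
    """Interpret a graphical-set designation sequence ESC I [I...] F."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("expected ESC I [I...] F")
    intermediates, final = seq[1:-1], seq[-1]
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate bytes must be in 0x20-0x2F")
    multibyte = intermediates[0] == 0x24  # "$" marks a multi-byte set
    if multibyte:
        intermediates = intermediates[1:]
        if not intermediates:
            # Legacy forms ESC $ @, ESC $ A and ESC $ B designate to G0.
            if final not in b"@AB":
                raise ValueError("missing working-set selector")
            intermediates = b"("
    selector = intermediates[0]
    if 0x28 <= selector <= 0x2B:      # ( ) * +  select G0..G3, 94-sets
        size, working_set = 94, selector - 0x28
    elif 0x2D <= selector <= 0x2F:    # - . /    select G1..G3, 96-sets
        size, working_set = 96, selector - 0x2C
    else:
        raise ValueError("not a graphical set designation")
    return {
        "set": f"G{working_set}",
        "size": size,
        "multibyte": multibyte,
        "id": (intermediates[1:] + bytes([final])).decode("ascii"),
        "private": final <= 0x3F,
    }


print(parse_designation(b"\x1b(B"))    # ASCII in G0
print(parse_designation(b"\x1b$)A"))   # GB 2312 in G1
print(parse_designation(b"\x1b-A"))    # ISO 8859-1 right-hand side in G1
```

Because the same `id` means different things for different sizes and byte counts, a registry lookup would have to be keyed on the whole tuple rather than on the final byte alone.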
Interaction with other coding systems
The standard also defines a way to specify coding systems that do not follow its own structure.
A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise (as of 2019) various Videotex formats, UTF-8, and UTF-1.[99] A second I byte of 0x2F (/) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 (such as a different or padded sequence) or none at all.[100] All existing registrations of the latter type (as of 2019) are either transparent raw data, Unicode/UCS formats, or subsets thereof.[101]
| Code | Hex | Abbr. | Name | Effect |
|---|---|---|---|---|
| ESC % @ | 1B 25 40 | DOCS | Designate other coding system ("standard return") | Return to ISO/IEC 2022 from another encoding.[100] |
| ESC % F | 1B 25 F | DOCS | Designate other coding system ("with standard return")[99] | F selects an 8-bit code; use ESC % @ to return.[100] |
| ESC % / F | 1B 25 2F F | DOCS | Designate other coding system ("without standard return")[101] | F selects an 8-bit code; there is no standard way to return.[100] |
| ESC d | 1B 64 | CMD | Coding method delimiter | Denotes the end of an ISO/IEC 2022 coded sequence.[102] |
Of particular interest are the sequences which switch to ISO/IEC 10646 (Unicode) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8 (which does not reserve the range 0x80–0x9F for control characters), its predecessor UTF-1 (which mixes GR and GL bytes in multi-byte codes), and UTF-16 and UTF-32 (which use wider coding units).[99][101]
Several codes were also registered for subsets (levels 1 and 2) of UTF-8, UTF-16 and UTF-32, as well as for three levels of UCS-2.[101] However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated.[103] ISO/IEC 10646 stipulates that the big-endian formats of UTF-16 and UTF-32 are designated by their escape sequences.[104]
| Unicode Format | Code(s) | Hex[103] | Deprecated codes | Deprecated hex[99][101][103] |
|---|---|---|---|---|
| UTF-1 | (UTF-1 not in current ISO/IEC 10646.) | | ESC % B | 1B 25 42 |
| UTF-8 | ESC % G, ESC % / I | 1B 25 47,[13] 1B 25 2F 49[105] | ESC % / G, ESC % / H | 1B 25 2F 47, 1B 25 2F 48 |
| UTF-16 | ESC % / L | 1B 25 2F 4C[106] | ESC % / @, ESC % / C, ESC % / E, ESC % / J, ESC % / K | 1B 25 2F 40, 1B 25 2F 43, 1B 25 2F 45, 1B 25 2F 4A, 1B 25 2F 4B |
| UTF-32 | ESC % / F | 1B 25 2F 46 | ESC % / A, ESC % / D | 1B 25 2F 41, 1B 25 2F 44 |
Of the sequences switching to UTF-8, ESC % G is the one supported by, for example, xterm.[14]
Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding (e.g. 001B 0025 0040 for UTF-16), so the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.[107]
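The padding can be illustrated with a short Python sketch, assuming big-endian code units (the byte order that ISO/IEC 10646 ties to these escape sequences). The helper name `padded_return` is hypothetical.

```python
STANDARD_RETURN = b"\x1b%@"  # ESC % @

def padded_return(unit_bytes: int) -> bytes:
    """Pad each byte of ESC % @ to a big-endian code unit of the given size."""
    return b"".join(b.to_bytes(unit_bytes, "big") for b in STANDARD_RETURN)

# Two-byte units (UTF-16) and four-byte units (UTF-32).
print(padded_return(2).hex())
print(padded_return(4).hex())
```

Padding each byte this way gives the same bytes as encoding the three characters ESC, % and @ in the big-endian form of the respective encoding, which is why the result no longer matches the byte sequence 1B 25 40 that an ISO/IEC 2022 parser would look for.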
For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.[108]
Code structure announcements
The sequence "announce code structure" (ESC SP (0x20) F) is used to announce a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations (specifically, using locking shift announcements 16–23 with announcements 1, 3 and 4) are prohibited by the standard, as is using additional announcements on top of ISO/IEC 4873 level announcements 12–14[92] (which fully specify the permissible structural features). Announcement sequences are as follows:
| Number | Code | Hex | Code version feature announced[92] |
|---|---|---|---|
| 1 | ESC SP A | 1B 20 41 | G0 in GL, GR absent or unused, no locking shifts. |
| 2 | ESC SP B | 1B 20 42 | G0 and G1 invoked to GL by locking shifts, GR absent or unused. |
| 3 | ESC SP C | 1B 20 43 | G0 in GL, G1 in GR, no locking shifts, requires an 8-bit environment. |
| 4 | ESC SP D | 1B 20 44 | G0 in GL, G1 in GR if 8-bit, no locking shifts unless in a 7-bit environment. |
| 5 | ESC SP E | 1B 20 45 | Shift functions preserved during 7-bit/8-bit conversion. |
| 6 | ESC SP F | 1B 20 46 | C1 controls using escape sequences. |
| 7 | ESC SP G | 1B 20 47 | C1 controls in CR region in 8-bit environments, as escape sequences otherwise. |
| 8 | ESC SP H | 1B 20 48 | 94-character graphical sets only. |
| 9 | ESC SP I | 1B 20 49 | 94-character and/or 96-character graphical sets. |
| 10 | ESC SP J | 1B 20 4A | Uses a 7-bit code, even if an eighth bit is available for use. |
| 11 | ESC SP K | 1B 20 4B | Requires an 8-bit code. |
| 12 | ESC SP L | 1B 20 4C | Complies with ISO/IEC 4873 (ECMA-43) level 1. |
| 13 | ESC SP M | 1B 20 4D | Complies with ISO/IEC 4873 (ECMA-43) level 2. |
| 14 | ESC SP N | 1B 20 4E | Complies with ISO/IEC 4873 (ECMA-43) level 3. |
| 16 | ESC SP P | 1B 20 50 | SI / LS0 used. |
| 18 | ESC SP R | 1B 20 52 | SO / LS1 used. |
| 19 | ESC SP S | 1B 20 53 | LS1R used in 8-bit environments, SO used in 7-bit environments. |
| 20 | ESC SP T | 1B 20 54 | LS2 used. |
| 21 | ESC SP U | 1B 20 55 | LS2R used in 8-bit environments, LS2 used in 7-bit environments. |
| 22 | ESC SP V | 1B 20 56 | LS3 used. |
| 23 | ESC SP W | 1B 20 57 | LS3R used in 8-bit environments, LS3 used in 7-bit environments. |
| 26 | ESC SP Z | 1B 20 5A | SS2 used. |
| 27 | ESC SP [ | 1B 20 5B | SS3 used. |
| 28 | ESC SP \ | 1B 20 5C | Single-shifts invoke over GR. |
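In every row of the table, the announcement number equals the final byte minus 0x40 (for example, L is 0x4C, giving 12). The following Python sketch relies on that observation and on the two combination restrictions stated above. The function names are illustrative, and the check covers only those two restrictions rather than every rule in the standard.

```python
ESC_SP = b"\x1b "  # ESC followed by SP (0x20)

def announcement_number(seq: bytes) -> int:
    """Map an 'announce code structure' sequence ESC SP F to its number."""
    if len(seq) != 3 or seq[:2] != ESC_SP:
        raise ValueError("expected ESC SP F")
    final = seq[2]
    # Only the final-byte range covered by the table (A through backslash).
    if not 0x41 <= final <= 0x5C:
        raise ValueError("final byte outside the range listed in the table")
    return final - 0x40  # e.g. 'L' (0x4C) -> 12

def combination_permitted(numbers) -> bool:
    """Apply the two restrictions described in the text (not exhaustive)."""
    chosen = set(numbers)
    uses_locking_shift = any(16 <= n <= 23 for n in chosen)
    # Locking-shift announcements 16-23 contradict announcements 1, 3 and 4.
    if uses_locking_shift and chosen & {1, 3, 4}:
        return False
    # ISO/IEC 4873 level announcements 12-14 may not be supplemented.
    if chosen & {12, 13, 14} and len(chosen) > 1:
        return False
    return True

announced = [announcement_number(s) for s in (b"\x1b C", b"\x1b Z", b"\x1b [", b"\x1b \\")]
print(announced, combination_permitted(announced))
```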
ISO/IEC 2022 code versions
Six 7-bit ISO 2022 code versions (ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2 and ISO-2022-KR) are defined by IETF RFCs, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past.[109] A number of other variants are defined by vendors, including IBM.[110] Although UTF-8 is the preferred encoding in HTML5, legacy content in ISO-2022-JP remains sufficiently widespread that the WHATWG Encoding Standard retains support for it.[111] By contrast, it maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT[112] entirely to the replacement character,[113] due to concerns about code injection attacks such as cross-site scripting.[111][113]
8-bit code versions include Extended Unix Code.[11][12] The ISO/IEC 8859 encodings also follow ISO 2022, in a subset stipulated in ISO/IEC 4873.[9][10]
Japanese e-mail versions
ISO-2022-JP
ISO-2022-JP is a widely used encoding for Japanese, in particular in e-mail. It was introduced for use on the JUNET network and later codified in IETF RFC 1468, dated 1993.[114] It has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220.[115] It starts in ASCII and includes the following escape sequences:
- ESC ( B to switch to ASCII (1 byte per character)
- ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
- ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
- ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983.[114] Also, because they were registered before it became possible to designate multi-byte sets to anything other than G0, the escapes for JIS X 0208 do not include the second I-byte ( (0x28).[89]
The RFC notes that some existing systems did not distinguish ESC ( B from ESC ( J, or did not distinguish ESC $ @ from ESC $ B, but stipulates that the escape sequences should not be changed by systems simply relaying messages such as e-mails.[114] The WHATWG Encoding Standard referenced by HTML5 handles ESC ( B and ESC ( J distinctly, but treats ESC $ @ the same as ESC $ B when decoding, and uses only ESC $ B for JIS X 0208 when encoding.[116] The RFC also notes that some past systems had made erroneous use of the sequence ESC ( H to switch away from JIS X 0208, which is actually registered for ISO-IR-11 (a Swedish variant of ISO 646 and World System Teletext).[114][i]
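As an illustration of the profile, the sketch below encodes a mixed ASCII and Japanese string with Python's built-in `iso2022_jp` codec and lists the designation escapes found in the output. The `ESCAPES` table and the `designation_escapes` helper are hypothetical names written for this example. The scan relies on the fact that the ESC byte (0x1B) never occurs as a data byte in this 7-bit profile.

```python
# The four designations listed for ISO-2022-JP.
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}

def designation_escapes(data: bytes) -> list:
    """List the character sets designated in an ISO-2022-JP byte stream."""
    found = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            found.append(ESCAPES.get(data[i:i + 3], "unknown"))
            i += 3  # every escape in this profile is three bytes long
        else:
            i += 1
    return found

text = "ABC日本語"
data = text.encode("iso2022_jp")

print(data)
print(designation_escapes(data))
print(all(b < 0x80 for b in data))  # the stream is 7-bit clean
```

The stream begins in ASCII without any escape, switches to JIS X 0208 for the kanji, and returns to ASCII at the end.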
Versions with halfwidth katakana
Use of ESC ( I to switch to the JIS X 0201-1976 Kana set (1 byte per character) is not part of the ISO-2022-JP profile,[114] but is also sometimes used. Python allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below, completing coverage of EUC-JP);[117][118] this is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with ESC $ ( 0 to complete the coverage of Super DEC Kanji.[119] The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.[116] Microsoft's code page for ISO-2022-JP with JIS X 0201 kana additionally permitted is Code page 50221.[115]
Other, older variants known as JIS7 and JIS8 build directly on the 7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X 0201 kana from G1 without escape sequences, using Shift Out and Shift In or setting the eighth bit (GR-invoked), respectively.[120] They are not widely used;[120] JIS X 0208 support in extended 8-bit JIS X 0201 is more commonly achieved via Shift JIS. Microsoft's code page for JIS X 0201-based ISO 2022 with single-byte katakana via Shift Out and Shift In is Code page 50222.[115]
ISO-2022-JP-2
ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones. The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:[121]
- ESC $ A to switch to GB 2312-1980 (2 bytes per character)
- ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
- ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
- ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
- ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by RFC 2237, dated 1997.[122]
IBM Japanese TCP
IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets".[123] CCSID 9148 is the standard (RFC 1468) ISO-2022-JP.[124]
| Code page / CCSID | ACRI definition number | Escape sequences for ACRI[110] |
|---|---|---|
| 956[125] | TCP-01 | |
| 957[126] | TCP-02 | |
| 958[127] | TCP-03 | |
| 959[128] | TCP-04 | |
| 5052[129] | TCP-05 | |
| 5053[130] | TCP-06 | |
| 5054[131] | TCP-07 | |
| 5055[132] | TCP-08 | |
| 9148[124] | TCP-16 | |
JIS X 0213
The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:
- ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
- ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
- ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
- ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)
Other 7-bit versions
ISO-2022-KR is defined in RFC 1557, dated 1993.[133] It encodes ASCII and the Korean double-byte KS X 1001-1992,[134][135] previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.[133]
ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions (to shift between G0 and G1), and of the 7-bit escape code forms of the single-shift functions SS2 and SS3 (to access G2 and G3).[136] They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).
The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643 (due to these two planes being sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix):[136]
- ESC $ ) A to switch to GB 2312-1980 (2 bytes per character) [designated to G1]
- ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character) [designated to G1]
- ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character) [designated to G2]
The ISO-2022-CN-EXT profile permits the following additional sets and planes.[136]
- ESC $ ) E to switch to ISO-IR-165 (2 bytes per character) [designated to G1]
- ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character) [designated to G3]
- ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character) [designated to G3]
- ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character) [designated to G3]
- ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character) [designated to G3]
- ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character) [designated to G3]
The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:[136]
- GB 12345 in G1
- GB 7589 or GB 13131 in G2
- GB 7590 or GB 13132 in G3
The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set to which it is designated. In the above examples, the character ( (0x28) designates a 94-character set to G0, whereas ), * or + (0x29–0x2B) designate one to G1, G2 or G3 respectively.
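The structure of these designation sequences can be sketched as a small Python parser. It is a simplified illustration and not a complete ISO/IEC 2022 parser: it handles one I byte plus one F byte, the 96-character forms ESC - through ESC / for G1 to G3, and the historical two-byte-set forms ESC $ @, ESC $ A and ESC $ B that designate to G0 without a second I byte. The name `parse_designation` and the returned tuple layout are assumptions made for this example.

```python
def parse_designation(seq: bytes):
    """Return (working_set, set_type, final_char) for a designation escape."""
    if seq[:1] != b"\x1b":
        raise ValueError("designation must start with ESC")
    body = seq[1:]
    multibyte = body[:1] == b"$"
    if multibyte:
        body = body[1:]
    # Historical forms: ESC $ @, ESC $ A, ESC $ B designate to G0.
    if multibyte and len(body) == 1 and body[0] in (0x40, 0x41, 0x42):
        return ("G0", "94^n", chr(body[0]))
    if len(body) != 2:
        raise ValueError("expected one I byte and one F byte")
    i_byte, final = body
    if 0x28 <= i_byte <= 0x2B:      # ( ) * +  -> 94-sets to G0..G3
        size, g = 94, i_byte - 0x28
    elif 0x2D <= i_byte <= 0x2F:    # - . /    -> 96-sets to G1..G3
        size, g = 96, i_byte - 0x2C
    else:
        raise ValueError("unsupported I byte")
    set_type = f"{size}^n" if multibyte else str(size)
    return (f"G{g}", set_type, chr(final))

for s in (b"\x1b(B", b"\x1b$B", b"\x1b$)C", b"\x1b.A", b"\x1b$+I"):
    print(s, parse_designation(s))
```

The final byte is returned uninterpreted because, as noted earlier, its meaning depends on the type of set being designated.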
ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder,[112] which maps all input to the replacement character (�), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server.[113] Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.[111]
In April 2024, a security flaw[137] was found in the implementation of ISO-2022-CN-EXT in glibc, which led to recommendations to disable the encoding entirely on Linux systems.[138]
ISO/IEC 4873
A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1.[9][10]
ISO/IEC 4873 / ECMA-43 defines three levels of encoding:[139]
- Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
- Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.[69]
- Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.
Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO/IEC 646 invariant positions were preserved, that the other positions were assigned to spacing (not combining) characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤.[140] For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set (ASCII).[141][142][143]
The use of the ISO/IEC 646 IRV (synchronised with ASCII since 1991) at ISO/IEC 4873 Level 1 with no C1 or G1 set, i.e. using the IRV in an 8-bit environment in which shift codes are not used and the high bit is always zero, is known as ISO 4873 DV, in which DV stands for "Default Version".[144]
In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in.[145] For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.[143]
ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873.[9][10] ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859 (i.e. those which existed as of 1991, when it was published), and some supplementary sets.[146]
Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively (but omitting G2 and G3 designations for level 1), with an F-byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, which are as follows:[147]
| Code | Hex | Announcement |
|---|---|---|
| ESC SP L | 1B 20 4C | ISO 4873 Level 1 |
| ESC SP M | 1B 20 4D | ISO 4873 Level 2 |
| ESC SP N | 1B 20 4E | ISO 4873 Level 3 |
Extended Unix Code
Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are (if present) invoked using the single shifts SS2 and SS3, which are used as CR bytes (i.e. 0x8E and 0x8F respectively) and invoke over GR (not GL).[11] Locking shift codes are not used.[12]
The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman (KS X 1003) or JIS-Roman (the lower half of JIS X 0201).[11] Hence, 0x5C (backslash in US-ASCII) is used to represent a Yen sign in some versions of EUC-JP and a Won sign in some versions of EUC-KR.
G1 is used for a 94x94 coded character set represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes (i.e. SS3 plus two bytes) whereas a single character in EUC-TW can take up to four bytes (i.e. SS2 plus three bytes).
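Because EUC uses no locking shifts, the length of each character follows from its first byte alone. The Python sketch below shows this for EUC-JP, assuming the usual EUC-JP layout in which G2 (after SS2) holds single-byte JIS X 0201 katakana and G3 (after SS3) holds two-byte JIS X 0212. The helper names are hypothetical, and the input is assumed to be well formed.

```python
SS2, SS3 = 0x8E, 0x8F

def euc_jp_char_length(lead: int) -> int:
    """Number of bytes in the EUC-JP character starting with `lead`."""
    if lead == SS2:
        return 2   # SS2 + one GR byte (G2)
    if lead == SS3:
        return 3   # SS3 + two GR bytes (G3)
    if 0xA1 <= lead <= 0xFE:
        return 2   # two GR bytes (G1)
    return 1       # GL or control byte (G0 / C0)

def split_euc_jp(data: bytes) -> list:
    """Split well-formed EUC-JP bytes into per-character byte strings."""
    chars, i = [], 0
    while i < len(data):
        n = euc_jp_char_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars

# ASCII letter, halfwidth katakana (via SS2), and a JIS X 0208 kanji.
print(split_euc_jp("Aｱ日".encode("euc_jp")))
```

This is the practical difference from the 7-bit code versions: a decoder can resynchronise after inspecting at most a few bytes, without tracking designation or locking-shift state.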
The EUC code itself does not make use of the announcer or designation sequences from ISO 2022; however, it corresponds to the following sequence of four announcer sequences, with meanings breaking down as follows.[148]
| Individual sequence | Hexadecimal | Feature of EUC denoted |
|---|---|---|
| ESC SP C | 1B 20 43 | ISO-8 (8-bit, G0 in GL, G1 in GR) |
| ESC SP Z | 1B 20 5A | G2 accessed using SS2 |
| ESC SP [ | 1B 20 5B | G3 accessed using SS3 |
| ESC SP \ | 1B 20 5C | Single-shifts invoke over GR |
Compound Text (X11)
The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989.[149] This uses only four control codes: HT (0x09), NL (newline, coded as LF, 0x0A), ESC (0x1B) and CSI (in its 8-bit representation 0x9B),[150] with the SDS (CSI … ]) CSI sequence being used for bidirectional text control.[151] It is an 8-bit code using G0 and G1 for GL and GR, and follows ISO-8859-1 in its initial state.[152] The following F-bytes are used:
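| Escape sequence type | Final byte | Graphical set |
|---|---|---|
| GZD4, G1D4 (for 94-character sets) | B (0x42) | ASCII |
| GZD4, G1D4 (for 94-character sets) | I (0x49) | JIS X 0201 katakana |
| GZD4, G1D4 (for 94-character sets) | J (0x4A) | JIS X 0201 Roman |
| G1D6 (for 96-character sets) | A (0x41) | ISO-8859-1 high part |
| G1D6 (for 96-character sets) | B (0x42) | ISO-8859-2 high part |
| G1D6 (for 96-character sets) | C (0x43) | ISO-8859-3 high part |
| G1D6 (for 96-character sets) | D (0x44) | ISO-8859-4 high part |
| G1D6 (for 96-character sets) | F (0x46) | ISO-8859-7 high part |
| G1D6 (for 96-character sets) | G (0x47) | ISO-8859-6 high part |
| G1D6 (for 96-character sets) | H (0x48) | ISO-8859-8 high part |
| G1D6 (for 96-character sets) | L (0x4C) | ISO-8859-5 high part |
| G1D6 (for 96-character sets) | M (0x4D) | ISO-8859-9 high part |
| GZDM4, G1DM4 (for 2-byte sets) | A (0x41) | GB 2312 |
| GZDM4, G1DM4 (for 2-byte sets) | B (0x42) | JIS X 0208 |
| GZDM4, G1DM4 (for 2-byte sets) | C (0x43) | KS C 5601 |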
For specifying encodings by labels, X11 Compound Text defines five private-use DOCS sequences: ESC % / 0 (1B 25 2F 30) for variable-length encodings, and ESC % / 1 through ESC % / 4 for fixed-length encodings using one through four bytes respectively. Rather than using another escape sequence to return to ISO 2022, the two bytes following the initial escape sequence specify the remaining length in bytes, coded in base-128 using bytes 0x80–FF. The encoding label is included in ISO 8859-1 before the encoded text, and terminated with STX (0x02).[108]
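The length-prefixed layout described above can be sketched in Python as follows. The sketch assumes that the two length bytes count everything after them (label, STX and encoded text) and that the high-order byte comes first. The function names and the label `example-encoding` are made up for illustration and are not registered encoding names.

```python
STX = b"\x02"

def build_extended_segment(label: bytes, payload: bytes, final: bytes = b"0") -> bytes:
    """Build ESC % / F M L label STX payload, with the length in base 128."""
    rest = label + STX + payload
    n = len(rest)
    if n >= 128 * 128:
        raise ValueError("segment too long for a two-byte base-128 length")
    m, l = 0x80 + n // 128, 0x80 + n % 128   # both bytes fall in 0x80-0xFF
    return b"\x1b%/" + final + bytes([m, l]) + rest

def parse_extended_segment(segment: bytes):
    """Return (label, payload) from a segment produced as above."""
    if segment[:3] != b"\x1b%/" or segment[3:4] not in (b"0", b"1", b"2", b"3", b"4"):
        raise ValueError("not a Compound Text extended segment")
    n = (segment[4] - 0x80) * 128 + (segment[5] - 0x80)
    rest = segment[6:6 + n]
    label, _, payload = rest.partition(STX)
    return label, payload

segment = build_extended_segment(b"example-encoding", b"x" * 200)
print(segment[:6].hex(" "))
print(parse_extended_segment(segment)[0])
```

Because the length is explicit, a reader that does not recognise the label can still skip the segment and resume ISO 2022 processing afterwards.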
Comparison with other encodings
Advantages
- As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. In practice, this 7-bit compatibility matters mainly for backwards compatibility with older systems, since the vast majority of modern computers use 8 bits for each byte.
- As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues[citation needed] associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.
Disadvantages
- Since ISO/IEC 2022 is a stateful encoding, a program cannot jump in the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump in the middle of the text may require a backup to the previous escape sequence before the bytes following the escape sequence can be interpreted.
- Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
- Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 (e.g. "ISO 2022 IR 100"[154]) in addition to supporting several other encodings.[155] This type of variation makes it difficult to portably transfer text between computer systems.
- UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
- Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string (such as cross-site scripting) is masked until it is decoded to Unicode, which may allow it to bypass sanitisation.[156] Use of this encoding is thus treated as suspicious by malware protection suites,[157][better source needed] and 7-bit ISO 2022 data (except for ISO-2022-JP) is mapped in its entirety to the replacement character in HTML5 to prevent attacks.[112][113] Restricted ISO 2022 8-bit code versions which do not use designation escapes or locking shift codes, such as Extended Unix Code, do not share this problem.
- Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state.[114] This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. This has the consequence that if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape codes are generated switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character ("�") to prevent them from being used to mask malicious sequences such as cross-site scripting.[158] Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected "�" characters being generated where two ISO-2022-JP streams have been concatenated.[156]
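The concatenation issue in the last item can be reproduced with Python's built-in `iso2022_jp` codec. This is a small demonstration only: it shows the empty pair of escape sequences appearing at the join, and does not implement the replacement-character behaviour recommended by Unicode Technical Report #36. The variable names are arbitrary.

```python
TO_ASCII = b"\x1b(B"      # ESC ( B
TO_JIS_X_0208 = b"\x1b$B"  # ESC $ B

first = "日本".encode("iso2022_jp")   # ends by returning to ASCII
second = "語".encode("iso2022_jp")    # starts by leaving ASCII again
joined = first + second
whole = "日本語".encode("iso2022_jp")  # same text encoded in one pass

# The join contains a switch to ASCII immediately followed by a switch away.
print(TO_ASCII + TO_JIS_X_0208 in joined)
# Both byte strings decode to the same text but differ in length.
print(joined.decode("iso2022_jp") == whole.decode("iso2022_jp"))
print(len(joined), len(whole))
```

The two byte strings are different encodings of the same text, which also illustrates why byte-wise comparison of ISO 2022 data is unreliable.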
Footnotes
- ^ Japanese: 区点, romanized: kuten; simplified Chinese: 区位; traditional Chinese: 區位; pinyin: qūwèi; Korean: 행렬; Hanja: 行列; RR: haeng-nyeol
- ^ traditional Chinese: 區; simplified Chinese: 区; pinyin: qū; Japanese pronunciation: ku; lit. 'zone'; Korean: 행; Hanja: 行; RR: haeng
- ^ Japanese: 点, romanized: ten, lit. 'point'; Chinese: 位; pinyin: wèi; lit. 'position'; Korean: 열; Hanja: 列; RR: yeol
- ^ Japanese: 面, romanized: men, lit. 'face'
- ^ a b Specified for F bytes 0x40 (@), 0x41 (A) and 0x42 (B) only, for historical reasons.[89] Some implementations, such as the SoftBank 2G emoji encoding, use additional escapes of this form for non-ISO-2022-compliant purposes.[96]
- ^ Listed by MARC-8.[3] See footnote for ESC , F below for background.
- ^ F, adjusted to the range 1-63, indicates which (upwardly compatible) revision of the immediately-following registration is needed, so that old systems know that they are old.[97]
- ^ In earlier editions, 96-character sets did not exist, and the escape codes now used for 96-character sets were reserved as space for additional 94-character sets. Accordingly, the ESC , (0x1B 0x2C) sequence was defined in early editions of the standard as designating further 94-character sets to G0.[98] Since 96-character sets cannot be designated to G0, this first I byte is not used by the current edition of the standard. However, it is still listed by MARC-8.[3]
- ^ See also, for instance, Printronix (2012), OKI® Programmer's Reference Manual (PDF), p. 26, archived from the original (PDF) on 2019-09-25, retrieved 2019-09-25 for a more recent system which uses ESC ( H to switch to ASCII from a DBCS.
References
- ^ ECMA-35 (1994), Brief History
- ^ ECMA-35 (1994), p. 51, annex D
- ^ a b c d e "Technique 2: Using standard alternate graphic character sets". MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Library of Congress. 2007-12-05. Archived from the original on 2020-07-22. Retrieved 2020-07-19.
- ^ "ECMA-35: Character code structure and extension techniques (web page)". Ecma International. Archived from the original on 2022-04-25. Retrieved 2022-04-27.
- ^ a b c d ECMA-35 (1994), pp. 15–16, chapter 8.1
- ^ a b ECMA-35 (1994), chapter 13
- ^ a b ECMA-35 (1994), chapters 12, 14
- ^ a b ECMA-35 (1994), chapter 11
- ^ a b c d e ISO/IEC FDIS 8859-10 (1998), p. 1, chapter 1 ("Scope")
- ^ a b c d e ECMA-144 (2000), p. 1, chapter 1 ("Scope")
- ^ a b c d e f Lunde (2008), pp. 242–245, Chapter 4 ("Encoding Methods"), section "EUC encoding"
- ^ a b c d Lunde (2008), pp. 253–255, Chapter 4 ("Encoding Methods"), section "EUC versus ISO-2022 encodings".
- ^ a b ISO-IR-196 (1996)
- ^ a b c Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Controls beginning with ESC". XTerm Control Sequences. Archived from the original on 2019-10-10. Retrieved 2019-10-04.
- ^ ECMA-35 (1994), chapters 6, 7
- ^ ECMA-35 (1994), chapter 8
- ^ ECMA-35 (1994), chapter 9
- ^ a b ECMA-35 (1994), chapter 15
- ^ Lunde (2008), pp. 228–234, Chapter 4 ("Encoding Methods"), section "ISO-2022 encoding"
- ^ Lunde (2008), pp. 19–20, Chapter 1 ("CJKV Information Processing Overview"), section "What are Row-Cell and Plane-Row-Cell?"
- ^ ECMA-35 (1994), p. 4, definition 4.11
- ^ ECMA-35 (1994), p. 5, definition 4.18
- ^ See, for instance, ISO-IR-14 (1975), defining the G0 designation of the JIS X 0201 Roman set as ESC 2/8 4/10.
- ^ ECMA-35 (1994), p. 5, chapter 5.1
- ^ See, for instance, RFC 1468 (1993), defining the G0 designation of the JIS X 0201 Roman set as ESC ( J.
- ^ ECMA-35 (1994), p. 7, chapter 6.2
- ^ ECMA-35 (1994), p. 10, chapter 6.3.2
- ^ ECMA-35 (1994), p. 4, definition 4.17
- ^ ECMA-35 (1994), p. 4, definition 4.14
- ^ ECMA-35 (1994), p. 28, chapter 13.1
- ^ a b c ECMA-35 (1994), p. 33, chapter 13.3.3
- ^ ECMA-48 (1991), pp. 24–26, chapter 5.4
- ^ a b c d ECMA-35 (1994), p. 11, chapter 6.4.3
- ^ ISO-IR-208 (1999)
- ^ ISO-IR-155 (1990)
- ^ ISO-IR-164 (1992)
- ^ a b ECMA-35 (1994), p. 10, chapter 6.3.3
- ^ Google Inc. (2014). "ansi.go, line 134". ANSI escape sequence library for Go. Archived from the original on 2022-04-30. Retrieved 2019-09-14.
- ^ ECMA-43 (1991), p. 5, chapter 7 ("Specification of the characters of the 8-bit code")
- ^ ISO/IEC FDIS 8859-10 (1998), p. 3, chapter 6 ("Specification of the coded character set")
- ^ ECMA-144 (2000), p. 3, chapter 6 ("Specification of the coded character set")
- ^ ECMA-43 (1991), p. 19, annex C ("Composite graphic characters")
- ^ a b ECMA-35 (1994), p. 10, chapter 6.4.1
- ^ a b ECMA-35 (1994), p. 11, chapter 6.4.4
- ^ a b c ECMA-35 (1994), p. 11, chapter 6.4.2
- ^ ISO-IR-104 (1985)
- ^ ISO-IR-1 (1975)
- ^ a b ECMA-35 (1994), p. 19, chapter 8.5.1
- ^ a b ECMA-35 (1994), p. 19, chapter 8.5.2
- ^ ECMA-43 (1991), p. 8, chapter 7.6 ("C1 set")
- ^ a b ECMA-35 (1994), p. 29, chapter 13.2.1
- ^ a b ECMA-35 (1994), p. 12, chapter 6.5.1
- ^ ECMA-35 (1994), p. 12, chapter 6.5.2
- ^ a b c ISO-IR, p. 19, chapter 2.7 ("Single control functions")
- ^ ECMA-35 (1994), p. 12, chapter 6.5.4
- ^ ECMA-48 (1991), chapter 5.5
- ^ ISO/TC 97/SC 2 (1976-12-30). Reset to Initial State (RIS) (PDF). ITSCJ/IPSJ. ISO-IR-35.
- ^ ECMA-35 (1994), p. 12, chapter 6.5.3
- ^ a b ECMA-35 (1994), p. 14, chapter 7.3, table 2
- ^ ISO-IR-14 (1975)
- ^ a b ITU-T (1995-08-11). Recommendation T.51 (1992) Amendment 1. Archived from the original on 2020-08-02. Retrieved 2019-12-25.
- ^ ISO-IR-106 (1985)
- ^ ECMA-35 (1994), p. 15, chapter 7.3, note 23
- ^ ISO-IR-140 (1987)
- ^ ISO-IR-7 (1975)
- ^ ISO-IR-26 (1976)
- ^ ISO-IR-36 (1977)
- ^ ECMA-35 (1980), p. 8, chapter 5.1.7
- ^ a b ISO-IR-105 (1985)
- ^ a b c d ECMA-35 (1994), p. 17, chapter 8.3.1
- ^ a b c d ECMA-35 (1994), p. 23, chapter 9.3.1
- ^ a b c ECMA-35 (1994), p. 19, chapter 8.4
- ^ a b c ECMA-35 (1994), p. 17, chapter 8.3.2
- ^ ECMA-35 (1994), pp. 23–24, chapter 9.4
- ^ ECMA-35 (1994), p. 27, chapter 11.1
- ^ ECMA-35 (1994), p. 17, chapter 8.3.3
- ^ ECMA-35 (1994), p. 47, annex B
- ^ ISO-IR, p. 2, chapter 1 ("Introduction")
- ^ ISO/IEC 2375 (2003)
- ^ a b "Handling of the SGML declaration in SP". SP: an SGML System Conforming to International Standard ISO 8879.
- ^ "20: SGML Declaration of HTML 4". HTML 4.01 Specification. W3C.
- ^ ISO-IR, p. 10, chapter 2.2 ("94-Character graphic character set with second Intermediate byte")
- ^ ARIB STD-B24 (2008), p. 39, part 2, Table 7-3
- ^ Mascheck, Sven; Le Breton, Stefan; Hamilton, Richard L. "About the 'alternate linedrawing character set'". ~sven_mascheck/. Archived from the original on 2019-12-29. Retrieved 2020-01-08.
- ^ ECMA-35 (1994), p. 36, chapter 14.4
- ^ ECMA-35 (1994), p. 36, chapter 14.4.2, note 48
- ^ ECMA-35 (1994), p. 36, chapter 14.4.2, note 47
- ^ ETS 300 706 (1997), p. 103, chapter 14 ("Dynamically Re-definable Characters")
- ^ a b c d e f g h i j k l m n o p q ECMA-35 (1994), pp. 35–36, chapter 14.3.2
- ^ ISO/IEC 10646 (2017), pp. 19–20, chapter 12.4 ("Identification of control function set")
- ^ ECMA-35 (1994), p. 32, table 5
- ^ a b c ECMA-35 (1994), pp. 37–41, chapter 15.2
- ^ ECMA-35 (1994), p. 34, chapter 14.2.2
- ^ ECMA-35 (1994), p. 34, chapter 14.2.3
- ^ Digital. "DECDWL—Double-Width, Single-Height Line". VT510 Video Terminal Programmer Information. Archived from the original on 2020-08-02. Retrieved 2020-01-17.
- ^ Kawasaki, Yusuke (2010). "Encode::JP::Emoji::Encoding". Encode-JP-Emoji. Line 268. Archived from the original on 2022-04-30. Retrieved 2020-05-28.
- ^ ECMA-35 (1994), pp. 36–37, chapter 14.5
- ^ ECMA-35 (1980), pp. 14–15, chapter 5.3.7
- ^ a b c d ISO-IR, p. 20, chapter 2.8.1 ("Coding systems with Standard return")
- ^ a b c d ECMA-35 (1994), pp. 41–42, chapter 15.4
- ^ a b c d e ISO-IR, p. 21, chapter 2.8.2 ("Coding systems without Standard return")
- ^ ECMA-35 (1994), p. 41, chapter 15.3
- ^ a b c ISO/IEC 10646 (2017), p. 19, chapter 12.2 ("Identification of a UCS encoding scheme")
- ^ ISO/IEC 10646 (2017), pp. 18–19, chapter 12.1 ("Purpose and context of identification")
- ^ ISO-IR-192 (1996)
- ^ ISO-IR-195 (1996)
- ^ ISO/IEC 10646 (2017), p. 20, chapter 12.5 ("Identification of the coding system of ISO/IEC 2022")
- ^ a b Scheifler (1989), § Non-Standard Character Set Encodings
- ^ Lunde (2008), pp. 229–230, Chapter 4 ("Encoding Methods"), section "ISO-2022 encoding" "Those encodings that have been extensively used in the past, or continue to be used today for some purposes, have been highlighted."
- ^ a b "Additional Coding-related Required Information". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2015-01-07.
- ^ a b c WHATWG Encoding Standard, section 2 ("Security background")
- ^ a b c WHATWG Encoding Standard, chapter 4.2 ("Names and labels"), anchor "replacement"
- ^ a b c d WHATWG Encoding Standard, section 14.1 ("replacement")
- ^ a b c d e f RFC 1468 (1993)
- ^ a b c "Code Page Identifiers". Windows Dev Center. Microsoft. Archived from the original on 2019-06-16. Retrieved 2019-09-16.
- ^ a b WHATWG Encoding Standard, section 12.2 ("ISO-2022-JP")
- ^ Chang, Hye-Shik. "Modules/cjkcodecs/_codecs_iso2022.c, line 1122". cPython source tree. Python Software Foundation. Archived from the original on 2022-04-30. Retrieved 2019-09-15.
- ^ "codecs — Codec registry and base classes § Standard Encodings". Python 3.7.4 documentation. Python Software Foundation. Archived from the original on 2019-07-28. Retrieved 2019-09-16.
- ^ "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq.[dead link]
- ^ a b Lunde (2008), pp. 236–238, Chapter 4 ("Encoding Methods"), section "The predecessor of ISO-2022-JP encoding—JIS encoding"
- ^ RFC 1554 (1993)
- ^ RFC 2237 (1997)
- ^ "PQ02042: New Function to Provide C/370 iconv() Support for Japanese ISO-2022-JP". IBM. 2021-01-19. Archived from the original on 2022-01-04. Retrieved 2022-01-04.
- ^ a b "CCSID 9148". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
- ^ "CCSID 956". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
- ^ "CCSID 957". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-30.
- ^ "CCSID 958". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-01.
- ^ "CCSID 959". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
- ^ "CCSID 5052". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
- ^ "CCSID 5053". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
- ^ "CCSID 5054". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
- ^ "CCSID 5055". IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
- ^ a b RFC 1557 (1993)
- ^ "KS X 1001:1992" (PDF). Archived (PDF) from the original on 2007-09-26. Retrieved 2007-07-12.
- ^ ISO-IR-149 (1988)
- ^ a b c d RFC 1922 (1996)
- ^ "CVE-2024-2961".
- ^ "GLIBC Vulnerability on Servers Serving PHP".
- ^ ECMA-43 (1991), pp. 9–10, chapter 8 ("Levels")
- ^ ECMA-43 (1985), pp. 7–11, chapter 7.3 ("The G0 set")
- ^ ECMA-43 (1991), pp. 6–8, chapter 7.4 ("G0 set")
- ^ ECMA-43 (1991), p. 11, chapter 10.3 ("Identification of a version")
- ^ a b ECMA-43 (1991), p. 23, annex E ("Main differences between the second edition (1985) and the present (third) edition of this ECMA Standard")
- ^ IPTC (1995). The IPTC Recommended Message Format (PDF) (5th ed.). IPTC TEC 7901. Archived (PDF) from the original on 2022-01-25. Retrieved 2020-01-14.
- ^ ECMA-43 (1991), p. 10, chapter 9.2 ("Unique coding of characters")
- ^ van Wingen, Johan W (1999). "8. Code Extension, ISO 2022 and 2375, ISO 4873 and 10367". Character sets. Letters, tokens and codes. Terena. Archived from the original on 2020-08-01. Retrieved 2019-10-02.
- ^ ECMA-43 (1991), pp. 10–11, chapter 10 ("Identification of version and level")
- ^ IBM. "Character Data Representation Architecture (CDRA)". IBM. pp. 157–162. Archived from the original on 2019-06-23. Retrieved 2020-06-18.
- ^ Scheifler (1989)
- ^ Scheifler (1989), § Control Characters
- ^ Scheifler (1989), § Directionality
- ^ Scheifler (1989), § Standard Character Set Encodings
- ^ Scheifler (1989), § Approved Standard Encodings
- ^ "DICOM PS3.2 2016d - Conformance; D.6.2 Character Sets; D.6 Support of Character Sets". Archived from the original on 2020-02-16. Retrieved 2020-05-21.
- ^ "DICOM ISO 2022 variation". Archived from the original on 2013-04-30. Retrieved 2009-07-25.
- ^ a b Sivonen, Henri (2018-12-17). "(UNSUBMITTED DRAFT) No U+FFFD Generation for Zero-Length ASCII-State Content between ISO-2022-JP Escape Sequences" (PDF). Archived (PDF) from the original on 2019-02-21. Retrieved 2019-02-21.
- ^ "935453 - Gather telemetry about HZ and other encodings we might try to remove". Archived from the original on 2017-05-19. Retrieved 2018-06-18.
- ^ Davis, Mark; Suignard, Michel (2014-09-19). "3.6.2 Some Output For All Input". Unicode Technical Report #36: Unicode Security Considerations (revision 15). Unicode Consortium. Archived from the original on 2019-02-22. Retrieved 2019-02-21.
Standards and registry indices cited
- ARIB (2008). ARIB STD-B24: Data Coding and Transmission Specification for Digital Broadcasting (PDF) (ARIB Standard). 5.2-E1. Vol. 1. Archived (PDF) from the original on 2017-07-10. Retrieved 2017-07-10.
- ECMA (1980). ECMA-35: Extension of the 7-bit Coded Character Set (PDF) (ECMA Standard) (2nd ed.).
- ECMA (1994). ECMA-35: Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.).
- ECMA (1985). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (2nd ed.).
- ECMA (1991). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (3rd ed.).
- ECMA (1991). ECMA-48: Control Functions for Coded Character Sets (PDF) (ECMA Standard) (5th ed.).
- ECMA (2000). ECMA-144: 8-Bit Single-Byte Coded Graphic Character sets: Latin Alphabet No. 6 (PDF) (ECMA Standard) (3rd ed.).
- European Broadcasting Union (1997). ETS 300 706: Enhanced Teletext specification (PDF) (European Telecommunications Standards). ETSI.
- ISO/IEC JTC 1/SC 2 (2003). ISO/IEC 2375:2003: Information technology — Procedure for registration of escape sequences and coded character sets. ISO.
- ISO/IEC JTC 1/SC 2 (1998-02-12). ISO/IEC FDIS 8859-10: Information Technology — 8-bit single-byte coded graphic character sets — Part 10: Latin alphabet No. 6 (PDF) (Final Draft International Standard).
- ISO/IEC JTC 1/SC 2 (2017). ISO/IEC 10646: Information technology — Universal Coded Character Set (UCS) (ISO Standard) (5th ed.). ISO.
- ISO-IR: ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences (PDF) (Registry Index). ITSCJ/IPSJ. Archived from the original (PDF) on 2023-05-12. Retrieved 2023-05-12.
- Scheifler, Robert W. (1989). Compound Text Encoding (X Consortium Standard). X Consortium.
- van Kesteren, Anne. WHATWG Encoding Standard (WHATWG Living Standard). WHATWG.
Registered code sets cited
- ISO/TC 97/SC 2 (1975-12-01). ISO-IR-1: The set of control characters of the ISO 646 (PDF). ITSCJ/IPSJ.
- Sveriges Standardiseringskommission (1975-12-01). ISO-IR-7: NATS Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
- Japanese Industrial Standards Committee (1975-12-01). ISO-IR-14: The Japanese Roman graphic set of characters (PDF). ITSCJ/IPSJ.
- IPTC (1976-03-25). ISO-IR-26: Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
- ISO/TC 97/SC 2 (1977-10-15). ISO-IR-36: The set of control characters of ISO 646, with IS4 replaced by Single Shift for G2 (SS2) (PDF). ITSCJ/IPSJ.
- ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-104: Minimum C0 set for ISO 4873 (PDF). ITSCJ/IPSJ.
- ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-105: Minimum C1 Set for ISO 4873 (PDF). ITSCJ/IPSJ.
- ITU (1985-08-01). ISO-IR-106: Teletex Primary Set of Control Functions (PDF). ITSCJ/IPSJ.
- Úřad pro normalizaci a měřeni (1987-07-31). ISO-IR-140: The C0 Set of Control Characters of ISO 646, with EM replaced by SS2 (PDF). ITSCJ/IPSJ.
- Korea Bureau of Standards (1988-10-01). ISO-IR-149: Korean Graphic Character Set for Information Interchange (KS C 5601:1987) (PDF). ITSCJ/IPSJ.
- ISO/IEC/JTC1/SC2/WG3 (1990-04-16). ISO-IR-155: Basic Box-Drawings Set (PDF). ITSCJ/IPSJ.
- CCITT (1992-07-13). ISO-IR-164: Hebrew Supplementary Set of Graphic Characters (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-192: UCS Transformation Format (UTF-8), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-195: UCS Transformation Format (UTF-16), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
- ECMA (1996-04-22). ISO-IR-196: UCS Transformation Format (UTF-8), with standard return (PDF). ITSCJ/IPSJ.
- National Standards Authority of Ireland (1999-12-07). ISO-IR-208: Ogham coded character set for information interchange (PDF). ITSCJ/IPSJ.
Internet Requests For Comment cited
- Murai, J.; Crispin, M.; van der Poel, E. (1993). "RFC 1468: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1468.
- Ohta, M.; Handa, K. (1993). "RFC 1554: ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP". Requests for Comments. IETF. doi:10.17487/rfc1554.
- Choi, U.; Chon, K.; Park, H. (1993). "RFC 1557: Korean Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1557.
- Zhu, HF.; Hu, DY.; Wang, ZG.; Kao, TC.; Chang, WCH.; Crispin, M. (1996). "RFC 1922: Chinese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1922.
- Tamaru, K. (1997). "RFC 2237: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc2237.
Other published works cited
- Lunde, Ken (2008). CJKV Information Processing (2nd ed.). O'Reilly Media. ISBN 9780596514471.
Further reading
- Lunde, Ken (1998). CJKV Information Processing. Cambridge, Massachusetts: O'Reilly & Associates. ISBN 1-56592-224-7.
External links
- ISO/IEC 2022:1994
- ISO/IEC 2022:1994/Cor 1:1999
- ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable.
- International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
- History of Character Codes in North America, Europe, and East Asia from 1999, rev. 2004
- Ken Lunde's CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022.
ISO/IEC 2022
Introduction and Overview
Purpose and Scope
ISO/IEC 2022 is an international standard published in 1994 by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), specifying the structure of 7-bit and 8-bit codes for the coding of character sets in information technology systems.[1] It cancels and replaces the third edition from 1986, with the 1994 version being technically almost identical but completely rearranged and rewritten for improved usability and clarity.[8] The standard focuses on providing a uniform framework for code extension techniques, enabling the representation of diverse character sets within a single coding environment.
The primary purpose of ISO/IEC 2022 is to facilitate the interchange of multilingual text and data between information-processing systems by allowing the designation and invocation of multiple graphical character sets and control function sets within a single coded character stream.[8] This is achieved through structured methods for extending elementary 7-bit and 8-bit codes, ensuring compatibility across different environments while supporting the integration of extended character repertoires.[1] By standardizing these extension mechanisms, the standard promotes efficient data processing and communication, particularly in scenarios requiring support for non-Latin scripts or specialized symbols.
The scope of ISO/IEC 2022 encompasses the definition of code structures and extension techniques for coded character sets but explicitly excludes the assignment of specific characters, which is addressed by separate standards such as ISO/IEC 646 for basic 7-bit codes.[8] It emphasizes 8-bit codes due to their broader adoption but includes provisions for 7-bit environments, with techniques designed primarily for sequential forward processing of data streams.[1] Conformance to the standard is defined at two levels: full conformance, which requires support for all specified code elements and extension techniques, and limited conformance, which permits subsets under defined conditions.[8] Systems claiming conformance must handle escape sequences appropriately to switch between designated sets, ensuring no use of reserved code positions and compatibility with both 7-bit and 8-bit transmission modes.[8]
Historical Development
The development of ISO/IEC 2022 traces its origins to the early 1970s, when ECMA International and ISO began collaborative work on standardizing international character sets to address limitations in early computing environments. This effort built directly on the 7-bit coded character set defined in ISO/IEC 646 (first published in 1973), which provided a foundation for basic information interchange but lacked mechanisms for extending to multilingual support.[9] ECMA's Technical Committee 1 (TC1) played a pivotal role, publishing the inaugural edition of ECMA-35 in 1971, which introduced techniques for code structure and extension to accommodate diverse graphic characters while maintaining compatibility with 7-bit channels.[3]
Subsequent revisions to ECMA-35 in 1980, 1982, and 1985 incorporated advancements in character encoding, with the 1985 edition serving as the technical basis for the first ISO adoption.[3] ISO published its initial edition, ISO 2022:1986, as the third international edition, formalizing these extension techniques under ISO/IEC JTC 1/SC 2.[10] The standard was revised in 1994 as ISO/IEC 2022, primarily reorganizing content for improved clarity and usability without introducing major technical alterations; this fourth edition remains the current version, last reviewed and confirmed in 2020.[1]
Influences on ISO/IEC 2022 include CCITT (now ITU-T) recommendations for telematic services, such as those outlined in T.61 (1988), which required robust code extension for international data transmission in services like Teletex and Videotex.
The standard was later integrated into Internet protocols, notably through RFC 1468 (1993), which defined ISO-2022-JP as a variant for encoding Japanese characters in email and news messages, enabling 7-bit compatibility for non-Latin scripts.[11]
ISO/IEC 2022 supersedes its 1986 predecessor and complements ISO/IEC 10646 (first edition 1993), the universal character set later harmonized with Unicode, by providing escape-based mechanisms to invoke subsets of its repertoire, though it predates that standard's development.[12]
Key drivers for adoption stemmed from the necessity to encode East Asian languages, particularly complex ideographic systems like those in Japanese, Chinese, and Korean, within constrained 7-bit telecommunications channels prevalent in the 1980s and 1990s.[13] This led to national variants, such as the Japanese Industrial Standard JIS X 0202 (1991), which adopted ISO/IEC 2022's framework to support JIS character sets like JIS X 0208 for kanji and kana in applications ranging from email to terminal communications.[14]
Core Principles
Basic Elements
ISO/IEC 2022 establishes a framework for character encoding through distinct code elements that separate control functions from graphical representations, enabling flexible multilingual text processing. The core elements include two control sets, C0 and C1, each comprising 32 positions for non-printing functions such as line feeds and tabs, and four graphical sets, G0 through G3, which hold printable characters like letters and symbols. C0 is positioned in bit combinations 00/00 to 01/15 (decimal 0–31), while C1 occupies 08/00 to 09/15 (128–159) in 8-bit environments.[3]
Graphical sets G0–G3 are invoked into the remaining code table areas: GL at 02/00–07/15 (32–127) and, in 8-bit codes, GR at 10/00–15/15 (160–255). Each set contains 94 characters in its basic form, though sets designated to G1–G3 may instead contain 96; a 94-character set does not use the first and last positions of its area (02/00 and 07/15 in GL, which carry SPACE and DELETE). Code table positions are written in column/row notation: the high-order bits of a bit combination give the column (0–15 in an 8-bit code, 0–7 in a 7-bit code) and the low-order four bits give the row (0–15), so 01/11 is byte 0x1B. A 7-bit code has only columns 0–7, that is 128 positions in total, half of the 256 positions of an 8-bit code.[3]
A key distinction in ISO/IEC 2022 lies between designation and invocation: designation assigns a specific coded character set to one of the G0–G3 slots using escape sequences, while invocation activates a designated set for immediate use in the GL (graphics left, columns 2–7) or GR (graphics right, columns 10–15) areas. This separation allows pre-loading multiple sets before switching.
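The column/row notation and the four code table areas described above can be illustrated with a short Python sketch. The function names `col_row` and `area` are hypothetical helpers chosen for this example and are not defined by the standard; the byte ranges are the ones stated in the text.

```python
def col_row(byte: int) -> str:
    """Format a byte value in column/row notation, e.g. 0x1B -> '01/11'."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0-255)")
    # High four bits give the column, low four bits give the row.
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


def area(byte: int) -> str:
    """Name the area of the 8-bit code table that a byte falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0-255)")
    if byte <= 0x1F:
        return "C0"   # columns 0-1
    if byte <= 0x7F:
        return "GL"   # columns 2-7 (02/00 and 07/15 carry SPACE and DELETE)
    if byte <= 0x9F:
        return "C1"   # columns 8-9
    return "GR"       # columns 10-15


for b in (0x1B, 0x41, 0x8E, 0xE9):
    print(f"0x{b:02X} = {col_row(b)} in {area(b)}")
```

In a 7-bit code only the first two branches of `area` apply, since columns 8 to 15 do not exist there.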
In 7-bit mode, invocation relies on the shift functions SI (Shift In), which invokes G0, and SO (Shift Out), which invokes G1; each shift remains in effect until the next one, so a run of characters from the alternate set needs only one shift at each end. 8-bit mode supports locking shifts into either area (e.g., LS0 for G0 in GL), and single-shift functions give non-persistent access to one character from G2 or G3 without changing the locked state.[3]
Basic conformance to ISO/IEC 2022 mandates support for control functions defined in ISO/IEC 6429, ensuring interoperability for essential operations like character spacing and formatting across coded environments.[15]
Escape Sequences and Designation
ISO/IEC 2022 employs escape sequences as the primary mechanism for designating and invoking character sets, enabling the dynamic selection and activation of different coded character sets within a single code space. These sequences begin with the ESCAPE character, represented by the hexadecimal value 1B (bit combination 01/11 in the standard notation), followed by zero or more intermediate bytes and a mandatory final byte. The intermediate bytes, if present, are taken from column 2 (02/00 to 02/15) and specify the type of function, while the final byte determines the specific action or set identification. For example, the sequence ESC ( B (ESC 02/08 04/02) designates the ISO/IEC 646 (ASCII) set to the G0 slot.[3]
Designation in ISO/IEC 2022 involves assigning identified sets of graphic or control characters to specific code elements, such as G0 through G3 for graphics or C0 and C1 for controls. The final byte of the escape sequence identifies the registered set, in a notation such as ESC 02/08 F, where F is the registry-assigned final byte. For instance, the sequence ESC - A (ESC 02/13 04/01) designates the 96-character right-hand part of ISO/IEC 8859-1 (Latin-1) to G1. This process allows for the extensible use of multiple character sets, with designations referencing unique identifiers from the ISO International Register of Coded Character Sets to ensure unambiguous assignment.[3]
Invocation functions activate the designated sets for use in the code stream, distinguishing between single shifts, which affect one character, and locking shifts, which persist. In 7-bit environments, Shift In (SI, 00/15) and Shift Out (SO, 00/14) invoke the G0 and G1 sets, respectively, and remain in effect until the next shift. In 8-bit environments the same two controls are named LS0 (00/15) and LS1 (00/14) and invoke G0 and G1 into the GL area, while LS2 (ESC 06/14) and LS3 (ESC 06/15) do the same for G2 and G3. A locking shift keeps its set in GL or GR until another locking shift for that area occurs.
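As an illustration of the escape sequence syntax described above (ESC, then zero or more intermediate bytes from column 2, then one final byte), the following Python sketch splits a byte string into escape sequences and ordinary bytes. The function name `split_escape_sequences` and the token shapes are hypothetical choices for this example; the sketch assumes a final byte in the range 03/00 to 07/14 and does not interpret what each sequence means.

```python
ESC = 0x1B


def split_escape_sequences(data: bytes):
    """Yield ('esc', intermediates, final) or ('byte', value) tokens."""
    i = 0
    while i < len(data):
        if data[i] != ESC:
            yield ("byte", data[i])
            i += 1
            continue
        j = i + 1
        # Intermediate bytes: column 2, i.e. 0x20-0x2F.
        while j < len(data) and 0x20 <= data[j] <= 0x2F:
            j += 1
        # Final byte: columns 3-7 except 07/15, i.e. 0x30-0x7E.
        if j >= len(data) or not 0x30 <= data[j] <= 0x7E:
            raise ValueError(f"malformed escape sequence at offset {i}")
        yield ("esc", data[i + 1:j], data[j])
        i = j + 1


# ESC ( B, the text "Hi", then ESC $ B (used by ISO-2022-JP in RFC 1468).
for token in split_escape_sequences(b"\x1b(BHi\x1b$B"):
    print(token)
```

A real decoder would go on to look up each sequence in the ISO-IR registry and update its designation and shift state; this sketch stops at tokenizing.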
The single shifts SS2 and SS3 invoke G2 and G3 for one following character only; they are C1 controls at 08/14 and 08/15 in 8-bit codes and are represented as ESC 04/14 and ESC 04/15 in 7-bit codes.[3]
The standard defines four G-set slots, G0, G1, G2, and G3, for holding designated graphic character sets, each capable of accommodating 94- or 96-character sets (or multiples thereof). G0 serves as the default for the GL area in unshifted states, while G1 is typically invoked via shifts; G2 and G3 are reached through dedicated locking-shift or single-shift functions rather than SI and SO. Active set selection follows rules where the GL area (columns 02-07 across all rows, excluding controls) uses the invoked G0-G3 set, and the GR area (columns 10-15 across all rows, excluding controls) defaults to G1 but can be shifted to others. These slots, building on the basic elements of the code structure, allow flexible switching without fixed encodings.[3]
Designations draw from the ISO registry procedure outlined in ISO 2375, which assigns unique final bytes and escape sequence codes to prevent conflicts and ensure international interoperability. This registry, maintained by ISO, catalogs coded character sets and their associated escape sequences, with updates reflecting new registrations for emerging standards.[3]
Code Versions and Conformance
ISO/IEC 2022 defines implementation levels for its code structure, each representing a progressive extension of capabilities for handling character sets in 7-bit and 8-bit environments.[3]
Level 1 provides the most basic implementation, limited to a 7-bit or 8-bit code using only the G0 graphic set invoked to the GL area, without support for C1 control characters, and restricted to the ISO/IEC 646 invariant character set for compatibility with simple telegraphic systems.[3] This level relies solely on C0 controls and the fixed G0 set, ensuring minimal overhead for environments like early international teleprinters.[3]
Level 2 extends Level 1 by incorporating C1 control functions, typically represented via escape sequences in 7-bit codes, and allowing designation of user-defined graphic sets to G1 for invocation via shifting.[3] It supports both 7-bit shifting with SI/SO and initial 8-bit designation to GL/GR without locking shifts, enabling more flexible multilingual text while maintaining backward compatibility with basic 7-bit streams.[3] User-defined sets must be registered or predefined, but this level introduces the capability for national variants beyond ISO/IEC 646.[3]
Level 3 builds on Level 2 by adding full support for additional graphic sets G2 and G3, and introducing multi-national modes through invocation functions like single shifts SS2 and SS3, along with locking shifts.[3] This allows for richer character repertoires in 8-bit environments, such as those requiring multiple 94- or 96-character sets, while optional C1 controls enhance formatting capabilities.[3] Level 3 emphasizes designation-based switching combined with persistent locking for efficiency in data streams.[3]
Level 4 extends Level 3 by permitting redesignation of character sets at any point during the data interchange, supporting dynamic adjustments in complex multilingual applications.[3]
Conformance to ISO/IEC 2022 is structured into three levels for both 7-bit and 8-bit codes, defining mandatory and optional features to ensure interoperability.[3] Level 1 offers simple conformance, mandating only C0 controls, the G0 set (94 characters), and basic invocation to GL, with no requirement for additional sets or shifts; optional elements include initial designation of G0.[3] This level suits basic national or invariant codes like ISO/IEC 646. Level 2 extends Level 1 by mandating support for user-defined areas, including designation and invocation of G1 (and optionally G2/G3 in 8-bit), SI/SO in 7-bit, or SS2/SS3 for single characters, allowing multi-set usage but without locking shifts.[3] Optional features here encompass C1 controls and 96-character sets. Level 3 provides full extensions, requiring all shift functions including locking shifts (LS0-LS3, LS1R-LS3R in 8-bit; LS2/LS3 in 7-bit), redesignation at any time, and support for all graphic sets, with variants for single-shift emphasis (e.g., Level 3A using GR for shifts).[3]
Mandatory features across levels include the use of defined code tables, no private use of reserved escape sequences, and proper handling of fixed characters like SPACE and DELETE.[3] Optional features, such as advanced controls or additional sets, must be declared in conformance statements if used.[3]
Implementations claim conformance to a specific level and version via optional announcement sequences at the start of a data stream, such as the Announcement of Code Structure (ACS) escape sequence (e.g., ESC 02/00 F followed by facility identifiers), which declares the adopted version, invoked sets, and shift states.[3] These announcements facilitate receiver preparation but are not required for basic streams.[3]
Code Structure Details
Notation and Nomenclature
In ISO/IEC 2022, the bits of a bit combination are denoted from most significant to least significant as b7 to b1 in a 7-bit code and b8 to b1 in an 8-bit code.[3] Control characters are distinguished from graphic characters by their position; in 7-bit codes, the C0 control characters occupy columns 0 and 1 (bytes 00 to 1F hexadecimal).[3] Escape sequences for designation and invocation are notated as ESC I F, where ESC is the escape control character (01/11), I represents zero or more intermediate bytes (from column 2, rows 0-15), and F is the final byte (from columns 3-7, excluding 7/15).[3] Code table positions are referenced using a slash-separated column/row notation, such as 2/5 or 3/13, where the first number indicates the column (0-7 for the left half of the table, 8-15 for the right half) and the second the row (0-15).[3]
Key terms in the standard include "coded character set" (CCS), defined as a set of unambiguous rules establishing a character set and the one-to-one correspondence between its characters and their bit combinations.[3] A "code element" refers to a bit combination representing a single character from the CCS.[3] The term "character" denotes a member of a set of elements used for data organization, control, or representation, encompassing both graphic and control types.[3] "Repertoire" specifies the collection of characters available for representation by one or more bit combinations in the CCS.[3]
Control function sets are named C0 (the primary set, occupying code table columns 0 and 1) and C1 (the supplementary set, occupying columns 8 and 9, or represented via escape sequences in 7-bit environments).[3] Graphic character sets are held in the elements G0 through G3; any of them can be invoked into GL (columns 2-7), and G1-G3 can also be invoked into GR (columns 10-15) in 8-bit codes.[3] Fixed positions refer to invariant locations like C0 (always in columns 0-1) and the default G0 (initially in columns 2-7), while variable positions allow dynamic assignment of G1-G3 or C1 via escape sequences.[3]
Character sets are classified as 94-character or 96-character based on their tabular structure: 94-character sets use the 94 positions 2/1 through 7/14 (or 10/1 through 15/14 in the right half), deliberately excluding the space position (2/0) and the delete position (7/15), and likewise 10/0 and 15/15, to avoid conflicts with those fixed characters.[3] In contrast, 96-character sets employ all 96 positions 2/0 through 7/15 (or 10/0 through 15/15), including those excluded positions for fuller repertoire coverage.[3] This distinction ensures compatibility in multi-national environments by aligning graphic characters away from control ranges.[3]
Fixed Coded Characters
In ISO/IEC 2022, the code structure reserves specific invariant positions for control functions to ensure compatibility and consistent behavior across implementations. Positions 0 through 31 (bit combinations 00/00 to 01/15) and position 127 (bit combination 07/15) are permanently assigned to the C0 control set, as defined in ISO/IEC 6429. These positions must always represent the corresponding C0 control functions, such as NUL (null, position 0), ESC (escape, position 27 or 01/11), and DEL (delete, position 127), regardless of any code extensions or designations.[3] The DEL function, in particular, serves as a fixed deletion or padding character and is mandatory in all conforming codes.[3]
In 8-bit implementations of ISO/IEC 2022, positions 128 through 159 (bit combinations 08/00 to 09/15) are designated for the supplementary C1 control set when the C1 facility is selected, providing additional control functions compatible with ISO/IEC 6429. If the C1 facility is not invoked, these positions remain unused, may be treated as undefined, or could be assigned to graphic characters such as space in certain code versions, but they cannot be repurposed for arbitrary controls.[3] This structure allows for controlled extension while preserving the integrity of the core control repertoire.
Unlike control positions, there are no fixed graphic characters in the variable G0 through G3 positions of the code table; all such positions are assignable to different character sets via designation sequences, enabling flexibility for multilingual text. The only invariant graphic element is the space character (SP) at position 32 (bit combination 02/00), which must be present and identical across all graphic character sets.
By default, the G0 set is designated as ISO/IEC 646 (International Reference Version), but this default does not impose fixed assignments beyond SP in the graphic areas.[3]
Conformance to ISO/IEC 2022 requires that implementations never redefine or reassign the fixed control positions, ensuring interoperability and preventing conflicts with established control behaviors; unregistered or reserved codes in these areas are prohibited. The DEL character, as a fixed element, must be supported universally to maintain code robustness.[3] In code version 0, which restricts usage to the basic 7-bit structure, only the fixed C0 set in positions 0-31 and 127 is permitted, excluding C1 and any graphic set extensions.[3] This version emphasizes the invariant 7-bit controls for minimal, ASCII-compatible environments. For broader control capabilities, higher versions may incorporate swappable C0 or C1 sets, but the fixed positions remain unaltered.
Graphical Character Sets
In ISO/IEC 2022, graphical character sets consist of printable characters arranged in a structured code table to facilitate consistent encoding across different implementations. These sets are divided into those containing 94 characters and those containing 96 characters. A 94-character set occupies positions 02/01 through 07/14 in the code table (or equivalently 10/01 through 15/14 for the right-hand side), deliberately avoiding the fixed space position at 02/00 (graphic) and the fixed delete position at 07/15 (control). In contrast, a 96-character set fills positions 02/00 through 07/15 (or 10/00 through 15/15), incorporating additional characters in those boundary positions to support fuller repertoires, particularly in environments where control functions are handled separately. This arrangement ensures that graphical sets align with the 94 positions available in the GL (positions 02-07) and GR (10-15) areas of the 8-bit code structure, promoting compatibility with 7-bit codes via shifting mechanisms. Note that 96-character sets may only be designated to G1, G2, or G3, not G0.[16] Designation of graphical character sets occurs through escape sequences that assign specific sets to the four possible slots: G0, G1, G2, and G3. The G0 set is invoked by default as the International Reference Version (IRV) defined in ISO/IEC 646, which provides a basic 94-character set of Latin-based graphics. Subsequent designations use escape sequences such as ESC 02/08 F to designate a 94-character set to G0 (G0 supports only 94-character sets); for G1, ESC 02/09 F for 94-character or ESC 02/13 F for 96-character; for G2, ESC 02/10 F for 94-character or ESC 02/14 F for 96-character; for G3, ESC 02/11 F for 94-character or ESC 02/15 F for 96-character. 
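The designation sequences listed above all have the shape ESC, one intermediate byte, one final byte F. As a rough sketch under that assumption (single-byte sets only; the helper name `designate` is hypothetical), they can be generated like this:

```python
ESC = 0x1B

# Intermediate byte for each (slot, set size) pair, as listed in the text:
# 02/08..02/11 for 94-character sets, 02/13..02/15 for 96-character sets.
_INTERMEDIATE = {
    ("G0", 94): 0x28, ("G1", 94): 0x29, ("G2", 94): 0x2A, ("G3", 94): 0x2B,
    ("G1", 96): 0x2D, ("G2", 96): 0x2E, ("G3", 96): 0x2F,
}


def designate(slot: str, size: int, final: str) -> bytes:
    """Build the escape sequence designating a single-byte set to a slot."""
    key = (slot, size)
    if key not in _INTERMEDIATE:
        # Covers the rule that 96-character sets cannot go to G0.
        raise ValueError(f"cannot designate a {size}-character set to {slot}")
    if len(final) != 1 or not 0x30 <= ord(final) <= 0x7E:
        raise ValueError("final byte must lie in 03/00..07/14")
    return bytes([ESC, _INTERMEDIATE[key], ord(final)])
```

With final byte `B` (the registered final byte for ASCII), `designate("G0", 94, "B")` yields the familiar `ESC ( B`; asking for a 96-character set in G0 raises an error.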
National character sets typically employ 94-character versions, which leave SP (02/00) and DEL (07/15) untouched and so integrate safely in 7-bit environments, whereas 96-character sets are mostly used in the right-hand area of 8-bit codes, where the two boundary positions are free for graphic use.[16] The repertoire of characters in these graphical sets must form subsets of the Universal Character Set (UCS) defined in ISO/IEC 10646, guaranteeing that no character is ambiguously represented across sets to prevent encoding errors in mixed environments. Specifically, rules prohibit overlapping assignments where the same UCS code point could map to different glyphs depending on the active set, with resolution prioritizing the lowest designated set (G0 over G1, etc.) if ambiguities arise, though unique coding is preferred.
Invocation of a designated set for output rendering happens through shift functions: locking shifts like LS0 (to G0 in GL) or LS1 (to G1 in GL) establish a persistent active set, while single-shift functions such as SS2 (to G2 in GL) or SS3 (to G3 in GL) apply temporarily to the next character only. The active set thus determines the graphical interpretation of code positions in the GL or GR areas during data processing and display.[16]
Control Character Sets
In ISO/IEC 2022, control character sets consist of two swappable sets, C0 and C1, which provide non-printable functions for text formatting, device control, and data transmission without incorporating any graphical characters in their positions.[3] These sets enable flexible handling of control operations in both 7-bit and 8-bit environments, ensuring compatibility across coded character sets while reserving specific code positions exclusively for controls.[3]
The C0 set occupies 32 positions from code values 0 to 31 (bit combinations 00/00 to 01/15), and it is mandatory for conformance to the standard.[3] It aligns with the control functions defined in ISO/IEC 6429 (equivalent to ECMA-48), including essential characters such as LF (Line Feed, 00/10) for advancing to the next line, CR (Carriage Return, 00/13) for returning to the line start, and ESC (Escape, 01/11) for initiating code extensions.[17] Other examples include format effectors like HT (Horizontal Tabulation, 00/09) for spacing and VT (Vertical Tabulation, 00/11) for line positioning, as well as transmission controls such as NAK (Negative Acknowledgment, 01/05) for error signaling.[17] No graphical characters are permitted in C0 positions to maintain their role in basic text manipulation and protocol handling.[3]
The C1 set comprises another 32 positions from 128 to 159 (bit combinations 08/00 to 09/15) in 8-bit codes and is optional, supporting advanced device-oriented functions from ISO/IEC 6429.[17] Examples include CSI (Control Sequence Introducer, 09/11) for parameterized commands like cursor movement and NEL (Next Line, 08/05) for combined line feed and carriage return.[17] Like C0, C1 positions exclude graphics, focusing instead on extended formatting (e.g., partial line down PLD at 08/11) and area definition (e.g., end of guarded area EPA at 09/07).[3][17]
Designation of alternative C0 or C1 sets occurs via escape sequences, allowing substitution of the default ISO 6429 sets with registered alternatives.[3] For C0, the sequence is ESC 02/01 F (where F designates the specific set), and for C1, it is ESC 02/02 F; for instance, ESC 02/02 04/03 (ESC " C) designates the ISO 6429 C1 set.[3] In version 1 and later of ISO/IEC 2022, C1 support is explicitly allowed in 8-bit codes, while 7-bit environments represent C1 functions using the two-byte ESC Fe sequence (where Fe is the C1 byte value minus 0x40, in the range 04/00 to 05/15, such as ESC 05/11 for CSI).[3] This emulation ensures backward compatibility without expanding the code space.[3]
Shift Functions
Shift functions in ISO/IEC 2022 provide mechanisms to invoke designated graphical character sets (G0 through G3) into specific positions within the code table, enabling the representation of characters from multiple sets in a single coded character set.[3] These functions are divided into single-shift functions, which temporarily invoke a set for the immediately following character, and locking-shift functions, which keep a set invoked until another locking shift occurs.[3] Invocation occurs into the left half (GL, bit combination positions 02/01 through 07/14 or equivalent) or, in 8-bit codes, the right half (GR, positions 10/01 through 15/14).[3] The standard defines distinct behaviors for 7-bit and 8-bit code versions to ensure compatibility and efficient set switching.
In 7-bit codes, which use only the GL area (bit combinations 21 through 7E hexadecimal), every set must be invoked into GL: G0 and G1 through SI and SO, and G2 and G3 through escape sequences.[3] SI (Shift In, code 0F hexadecimal or 00/15) and SO (Shift Out, code 0E hexadecimal or 00/14), also named LS0 and LS1, are locking shifts: SI invokes the G0 set into GL, terminating any prior G1 invocation, while SO invokes the G1 set into GL, replacing G0.[3] These shifts remain in effect until the next locking shift, allowing persistent use of the invoked set without reverting after each character.[3] For temporary access to G2 or G3 in 7-bit codes, SS2 (Single Shift Two, coded as ESC 4E, or as the single byte 8E in 8-bit codes) invokes the G2 set into GL for the next graphic character only, after which the previous shift state resumes; similarly, SS3 (Single Shift Three, ESC 4F or 8F) does the same for G3.[3]
8-bit codes extend this capability by utilizing both GL (21-7E hexadecimal) and GR (A1-FE hexadecimal, or A0-FF for 96-character sets), supporting invocation of G1, G2, or G3 into GR for broader repertoire access.[3] Locking-shift functions include LS0 (code 0F, invokes G0 into GL), LS1 (code 0E, invokes G1 into GL), LS2 (ESC 6E, invokes G2 into GL), and LS3 (ESC 6F, invokes G3 into GL), all of which establish a persistent invocation until overridden.[3] For the right half, locking shifts LS1R (ESC 7E, invokes G1 into GR), LS2R (ESC 7D, invokes G2 into GR), and LS3R (ESC 7C, invokes G3 into GR) provide analogous persistent switching, exclusive to 8-bit environments.[3] Single shifts SS2 and SS3 in 8-bit codes invoke G2 or G3 into GL temporarily, mirroring their 7-bit behavior, with no direct single-shift mechanism for GR; instead, GR access relies on prior locking shifts.[3] The following table summarizes the primary shift functions, their codes, and invocation targets:
| Function | Code (Hexadecimal) | Type | Invocation Target | Code Version |
|---|---|---|---|---|
| SI/LS0 | 0F | Locking | G0 into GL | 7-bit/8-bit |
| SO/LS1 | 0E | Locking | G1 into GL | 7-bit/8-bit |
| SS2 | ESC 4E (8E in 8-bit) | Single | G2 into GL (next char) | 7-bit/8-bit |
| SS3 | ESC 4F (8F in 8-bit) | Single | G3 into GL (next char) | 7-bit/8-bit |
| LS2 | ESC 6E | Locking | G2 into GL | 7-bit/8-bit |
| LS3 | ESC 6F | Locking | G3 into GL | 7-bit/8-bit |
| LS1R | ESC 7E | Locking | G1 into GR | 8-bit only |
| LS2R | ESC 7D | Locking | G2 into GR | 8-bit only |
| LS3R | ESC 7C | Locking | G3 into GR | 8-bit only |
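To make the difference between locking and single shifts concrete, here is a small sketch of a 7-bit state tracker. It assumes the ECMA-35 codings ESC 4/14 (ESC N) for SS2, ESC 4/15 (ESC O) for SS3, ESC 6/14 (ESC n) for LS2 and ESC 6/15 (ESC o) for LS3, and it deliberately ignores designation sequences and all other controls. The function name `tag_gl_bytes` is hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Second byte of the escape-coded shifts (assumed ECMA-35 codings).
_ESC_SHIFTS = {
    0x4E: ("single", "G2"),  # SS2 = ESC N
    0x4F: ("single", "G3"),  # SS3 = ESC O
    0x6E: ("lock", "G2"),    # LS2 = ESC n
    0x6F: ("lock", "G3"),    # LS3 = ESC o
}


def tag_gl_bytes(data: bytes) -> list[tuple[str, int]]:
    """Pair each graphic byte (0x21..0x7E) with the set it is read from."""
    locked = "G0"   # set currently locked into GL
    single = None   # set that applies to the next graphic byte only
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            locked = "G0"
        elif b == SO:
            locked = "G1"
        elif b == ESC and i + 1 < len(data) and data[i + 1] in _ESC_SHIFTS:
            kind, slot = _ESC_SHIFTS[data[i + 1]]
            if kind == "lock":
                locked = slot
            else:
                single = slot
            i += 1  # skip the second byte of the escape sequence
        elif 0x21 <= b <= 0x7E:
            out.append((single or locked, b))
            single = None  # a single shift covers one character only
        i += 1
    return out
```

In the stream `A SO B ESC-N C E SI D`, the byte `C` is read from G2, but `E` falls back to the locked G1 set, and `D` returns to G0 after SI.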
Combining Characters
In ISO/IEC 2022, combining characters, also known as non-spacing diacritical marks, are handled as a subset of graphic characters within the designated G sets, allowing for the composition of accented or modified letters without dedicating unique code positions to every possible combination. This mechanism distinguishes between spacing characters, which occupy their own positions and advance the cursor, and non-spacing (combining) marks, which are applied sequentially to preceding or following base characters to form composite glyphs. The primary example is ISO/IEC 6937, a coded character set for European languages that can be designated via escape sequences in ISO/IEC 2022; here, non-spacing marks such as acute, grave, circumflex, and tilde accents are encoded in specific positions (e.g., column 12 of the code table) and transmitted before the base letter, with the receiving system responsible for rendering the final form by superimposing the mark on the base.[18][2] The application of combining marks occurs sequentially within the character stream, enabling multiple diacritics to modify a single base character, though the order of transmission determines the resolution of any ambiguities in rendering, as the standard does not specify stacking rules beyond basic superposition. There is no dedicated combining set separate from the G0–G3 graphical sets; instead, combining capability depends on the designated set, such as ISO/IEC 6937's use of 13 non-spacing marks (with three positions unassigned) to represent up to 333 distinct characters using only 181 bit combinations in an 8-bit environment compatible with ISO/IEC 2022 level 1. 
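The "mark first, base second" order described above is the reverse of Unicode, where combining characters follow their base. The sketch below shows one way a receiver might reorder them. It covers only a handful of ISO/IEC 6937 non-spacing marks, assumes ASCII for the lower half, and assumes the listed byte-to-mark pairings; the name `decode_6937_subset` is hypothetical.

```python
import unicodedata

# A few non-spacing marks from column 12 and the Unicode combining
# characters assumed to correspond to them (illustrative subset only).
NON_SPACING = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
}


def decode_6937_subset(data: bytes) -> str:
    """Decode ASCII plus a few non-spacing marks that precede their base."""
    out = []
    pending = ""  # combining character waiting for its base letter
    for b in data:
        if b in NON_SPACING:
            pending = NON_SPACING[b]
        elif b < 0x80:
            # Unicode places the combining character after the base.
            out.append(chr(b) + pending)
            pending = ""
        else:
            raise ValueError(f"byte 0x{b:02X} is outside this sketch's subset")
    if pending:
        raise ValueError("non-spacing mark without a following base character")
    # NFC composes base + mark into a precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

A real decoder would also need the rest of the ISO/IEC 6937 table and special handling for a mark followed by a space, which this sketch leaves as a space plus combining character.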
Some implementations exhibit dead-key behavior during input, where a diacritic is entered first and "awaits" a base, but transmission follows the sequential model, and standalone non-spacing marks must be followed by a space to avoid unintended combination with subsequent characters.[18][2]
Support for combining characters is optional in basic ISO/IEC 2022 versions and is not universal across all registered sets, but it becomes essential for comprehensive Latin script extensions requiring extensive diacritics, as seen in ISO/IEC 6937's integration for multilingual European text interchange. Limitations include the absence of support for multi-level combining beyond simple sequences (e.g., no provision for enclosing or complex ligature marks) and reliance on the receiver's rendering engine to compute the composite form, which can lead to inconsistencies if the designated set lacks the necessary marks. This approach prioritizes efficient repertoire expansion over fixed precomposed encodings, aligning with ISO/IEC 2022's flexible designation framework.[2]
Other Control Functions
In ISO/IEC 2022, transmission control functions within the C0 set manage data integrity and synchronization during interchange, including NAK (Negative Acknowledge, coded as 01/05), which signals an error or rejection of received data to the sender.[8] SYN (Synchronous Idle, coded as 01/06) maintains timing alignment in synchronous transmission systems by inserting idle characters when no data is being sent.[8] DLE (Data Link Escape, coded as 01/00) precedes special bit combinations to alter their interpretation, enabling supplementary controls for data transparency in communication protocols.[8] These functions, drawn from the C0 control character set, support error handling and linking in 7-bit or 8-bit codes as defined in ISO/IEC 6429.[15]
Device control functions in ISO/IEC 2022 incorporate sequences introduced by CSI (Control Sequence Introducer, coded as 09/11 in the C1 set or ESC 05/11 in 7-bit form), which initiates parameterized commands for operations such as cursor positioning (e.g., via the CUP sequence) on presentation devices.[17] This aligns with ISO/IEC 6429 specifications for coded representations in extended codes, allowing flexible device management without altering the core character sets.[15]
Format effectors beyond the fixed C0 positions include C1 functions like NEL (Next Line, coded as 08/05 or 0x85 in C1, or ESC 04/05 in 7-bit form), which advances the active position to the beginning of the following line, combining the effects of carriage return and line feed for efficient text formatting.[17] Such effectors belong to the supplementary C1 set and are coded directly in 8-bit environments or as escape sequences in 7-bit environments per the ISO/IEC 2022 structure.[8]
ISO/IEC 2022 and ISO/IEC 6429 also leave room for private use. The C1 functions PU1 (09/01) and PU2 (09/02) have no standardized meaning, escape sequences whose final byte lies in the 0x30 to 0x3F range are reserved for private use, and control sequences with final bytes in the 0x70 to 0x7E range are likewise available for user-defined functions, allowing customization without conflicting with standard designations.[8] These private extensions follow the general escape format (ESC followed by intermediate and final bytes) and are intended for application-specific controls as outlined in ISO/IEC 6429.[15]
The inclusion of these other control functions varies by version in ISO/IEC 2022: they are optional in Version 0 (basic 7-bit code with limited extensions), but fully supported in higher versions (1 through 4) that incorporate 8-bit codes, C1 sets, and parameterized sequences for comprehensive interchange.[8]
Specific Implementations
ISO-2022-JP
ISO-2022-JP is a 7-bit character encoding method defined in RFC 1468 for transmitting Japanese text in Internet messages, including email and network news, in compliance with protocols like SMTP and NNTP.[11] It adheres to the principles of ISO/IEC 2022 by starting in the ASCII state and employing escape sequences to invoke specific character sets for Japanese characters, ensuring compatibility with 7-bit transport mechanisms.[11] This encoding is registered for use in MIME as the charset parameter "iso-2022-jp" and supports text consisting primarily of ASCII with embedded Japanese elements.[11]
The structure of ISO-2022-JP defaults to the ASCII set (ISO-IR 6), from which it shifts using escape sequences to other designated sets without employing C1 control characters or locking shift functions like SO (Shift Out) or SI (Shift In).[11] Specifically, the encoding invokes the Roman subset of JIS X 0201 (ISO-IR 14) via the sequence ESC ( J for Latin characters with modifications such as yen sign for backslash, and switches to the JIS X 0208 (ISO-IR 87) set via ESC $ B for double-byte representations of kanji, hiragana, full-width katakana, and symbols.[11] JIS X 0201's Kana subset, which includes half-width katakana, is explicitly excluded to maintain simplicity and avoid compatibility issues.[11] Sequences like ESC ( B return to ASCII, and all lines must end in either ASCII or JIS X 0201 Roman state before a carriage return or line feed.[11] This design distinguishes full-width katakana (from JIS X 0208) but omits half-width variants, prioritizing full-width forms for standard Japanese typography in text.[11]
In practice, ISO-2022-JP has been widely adopted for Japanese email and MIME-encoded content due to its efficient use of 7-bit channels and support for the approximately 7000 characters in JIS X 0208, encompassing over 6000 kanji along with hiragana, full-width katakana, and other graphical elements essential for Japanese writing.[11][19] However, it is limited to these core sets and does not include support for the supplementary characters in JIS X 0212, making it unsuitable for less common or specialized kanji.[11] The encoding operates in a manner akin to ISO/IEC 2022 Version 0, relying solely on G0 designation without multi-national or extended features, which ensures straightforward implementation but restricts flexibility for additional languages or revisions.[11]
ISO-2022-JP-2
ISO-2022-JP-2 is a multilingual extension of the ISO-2022-JP encoding scheme, designed to support a broader range of characters for use in electronic mail and network news messages. Defined in RFC 1554, it incorporates the JIS X 0212-1990 standard, which provides supplementary kanji characters not covered in the base ISO-2022-JP that relies on JIS X 0208 and JIS X 0201.[20] This extension maintains compatibility with existing Japanese network practices while enabling the representation of rarer kanji forms essential for specialized texts.[20] The encoding operates as a 7-bit scheme with ASCII as the invariant base, utilizing escape sequences to designate and shift between character sets. For JIS X 0212, the designation sequence is ESC $ ( D (where ESC is the escape character, followed by dollar sign, left parenthesis, and capital D), which designates the supplementary kanji set to G0 for invocation in the GL area.[20] JIS X 0208 remains designated using the standard ESC $ B sequence from ISO-2022-JP, allowing separate designations for the two kanji sets to avoid conflicts in multi-set environments.[20] RFC 1554 also adds GB 2312 (ESC $ A) and KS C 5601 (ESC $ ( C) in G0, as well as the right-hand parts of ISO 8859-1 (ESC . A) and ISO 8859-7 (ESC . F), which are designated to G2 and accessed one character at a time with the single shift ESC N; the scheme excludes C1 control characters to ensure simplicity and 7-bit safety.[20]
In practice, ISO-2022-JP-2 is employed in scenarios requiring access to the expanded kanji repertoire, such as email handling historical or technical Japanese documents with uncommon characters, while remaining backward compatible with ISO-2022-JP through subset usage—texts without JIS X 0212 characters decode identically under the base scheme.[20] This addition increases the total kanji coverage to approximately 13,000 characters, significantly enhancing expressiveness without altering the underlying ISO/IEC 2022 framework.[20] The encoding requires explicit shifts back to ASCII at the end of text to maintain interoperability.[20]
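The escape-sequence structure of these encodings is easy to inspect with Python's built-in `iso2022_jp` and `iso2022_jp_2` codecs. The sketch below is illustrative only; the helper `escape_sequences` is a hypothetical name, and its pattern simply assumes the general form ESC, optional intermediate bytes (0x20 to 0x2F), then one final byte.

```python
import re

# ESC, optional intermediate bytes (02/00..02/15), then one final byte.
_ESCAPE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def escape_sequences(data: bytes) -> list[bytes]:
    """List the escape sequences found in an ISO-2022-JP style byte stream."""
    return _ESCAPE.findall(data)


sample = "abc 日本語"
encoded = sample.encode("iso2022_jp")

# The stream switches to JIS X 0208 for the kanji and back to ASCII at the end.
found = escape_sequences(encoded)

# Every byte fits in 7 bits, so the text survives 7-bit mail transport.
seven_bit = all(b < 0x80 for b in encoded)

# The ISO-2022-JP-2 decoder accepts plain ISO-2022-JP input unchanged.
decoded = encoded.decode("iso2022_jp_2")
```

Here `found` contains the designation of JIS X 0208 (`ESC $ B`) followed by the return to ASCII (`ESC ( B`), which matches the requirement that text end in the ASCII state.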
Halfwidth Katakana Variants
Halfwidth katakana variants of ISO/IEC 2022 are specialized implementations that incorporate the 7-bit katakana code table from JIS X 0201 (ISO-IR 13) into the G1 character set, allowing single-byte encoding of these characters for compatibility with early text processing systems.[21] These variants enable the invocation of halfwidth katakana via the Shift Out (SO) control function, which activates the G1 set without requiring escape sequences for designation in some configurations, though escape sequences like ESC ( I may be accepted in tolerant decoders.[22] This approach supports a 7-bit environment where ASCII remains in the G0 set, and katakana characters occupy positions corresponding to bytes 0x21 through 0x5F when shifted, providing a compact representation distinct from fullwidth forms in larger sets like JIS X 0208.[23]
These variants emerged as extensions to the base ISO-2022-JP framework, particularly in systems like Windows code page 50221 (csISO2022JP), which explicitly allows 1-byte katakana alongside JIS X 0201 Roman and JIS X 0208 support.[23] Unlike the standard ISO-2022-JP defined in RFC 1468, which excludes JIS X 0201 katakana to avoid legacy errors from erroneous escape usage, these variants restore compatibility for halfwidth forms, often using SO/SI locking shifts for invocation in 7-bit streams.
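The SO/SI mechanism for halfwidth katakana can be sketched as a two-state decoder. The example assumes ASCII in G0, the 63 JIS X 0201 katakana positions at 0x21 through 0x5F in G1, and a linear mapping of those positions onto the Unicode halfwidth block starting at U+FF61. The function name `decode_so_si_katakana` is hypothetical and the sketch ignores escape sequences entirely.

```python
SO, SI = 0x0E, 0x0F


def decode_so_si_katakana(data: bytes) -> str:
    """Decode a 7-bit stream with ASCII in G0 and JIS X 0201 katakana in G1."""
    in_g1 = False  # False: G0 (ASCII) is in GL; True: G1 (katakana) is in GL
    out = []
    for b in data:
        if b == SO:
            in_g1 = True
        elif b == SI:
            in_g1 = False
        elif b >= 0x80:
            raise ValueError("8-bit byte in a 7-bit stream")
        elif in_g1 and 0x21 <= b <= 0x7E:
            if b > 0x5F:
                raise ValueError(f"0x{b:02X} is unassigned in the katakana set")
            # 0x21 maps to U+FF61, 0x22 to U+FF62, and so on.
            out.append(chr(0xFF61 + (b - 0x21)))
        else:
            # Controls, space, and all G0 bytes pass through as ASCII.
            out.append(chr(b))
    return "".join(out)
```

For example, the bytes `A SO 0x31 0x32 SI B` decode to `A`, two halfwidth katakana letters, and `B`.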
They were commonly employed in early Japanese computer terminals from the late 1960s onward, where JIS X 0201 provided the initial standard for phonetic katakana input and display on limited hardware.[24]
In their simplest form, these variants maintain a strictly 7-bit operation with no provision for fullwidth characters or multi-byte kanji, limiting the repertoire to ASCII plus the 63 halfwidth katakana glyphs from JIS X 0201.[21] This simplicity facilitated space-saving in katakana-heavy text, such as scientific notation or foreign terms, by using single bytes instead of the escape-shifted double bytes required for fullwidth equivalents.[23] Their use persisted in pre-Unicode email and messaging systems during the 1980s and 1990s, where partial compatibility with ISO-2022-JP allowed interoperation, though modern decoders often map halfwidth katakana to fullwidth Unicode equivalents for consistency.
Compared to the full ISO-2022-JP, the simplest variants omit kanji and hiragana support to focus solely on ASCII and katakana, which reduced complexity in resource-constrained environments like 7-bit teletext or terminal emulations. While shift functions like SO provide the primary invocation mechanism, variants may tolerate additional ISO/IEC 2022 controls for flexibility in mixed environments.[23] Today, they serve primarily as legacy support in encoding libraries, ensuring round-trip conversion for historical data without introducing modern extensions.[22]
JIS X 0213 and IBM Extensions
JIS X 0213, first published in 2000 and amended in 2004, represents a significant expansion of the JIS X 0208 character set standard for Japanese text encoding. It introduces approximately 4,000 additional characters, predominantly kanji, distributed across two planes to accommodate a broader repertoire of ideographs and symbols while maintaining compatibility with prior standards. Plane 1 revises JIS X 0208 by adding 1,249 new kanji and 659 other characters, along with glyph adjustments for 168 existing kanji, utilizing unused code space. Plane 2, meanwhile, incorporates 2,436 new kanji, many sourced from the supplementary JIS X 0212 set, enabling support for specialized and historical characters in modern applications.[25][26]
The ISO-2022-JP-3 encoding scheme adapts ISO/IEC 2022 principles to incorporate JIS X 0213, functioning as a stateful 7-bit format that switches between character sets via escape sequences. It retains the structure of earlier variants like ISO-2022-JP but adds support for the new planes: the sequence ESC $ ( O selects Plane 1 and ESC $ ( P selects Plane 2. This allows seamless integration with ASCII and legacy JIS sets, using escape sequences to designate and switch to specific planes during text processing. As a successor to ISO-2022-JP-2, which handled JIS X 0212, ISO-2022-JP-3 facilitates updated Japanese text handling in environments requiring 7-bit compatibility.[27][26]
IBM's Japanese TCP encoding provides a proprietary extension to ISO-2022 for Japanese environments, employing 7-bit codes with shift mechanisms that emulate EUC-JP layouts while adhering to ISO-2022 escape and locking shift protocols. Designated under CCSID 290 in IBM systems, it supports extended kanji sets beyond standard JIS X 0208, including user-defined characters and variants for enterprise data processing.
The hybrid 7-bit/8-bit structure uses locking shifts to assign multiple planes (such as GR for double-byte kanji), enabling efficient handling of mixed ASCII and CJK content in legacy mainframe applications.[28][29]
In practice, JIS X 0213 and IBM extensions find primary use in specialized Japanese text processing, such as document preparation and printing workflows, where precise control over character planes ensures accurate rendering of extended ideographs. However, their adoption remains limited in internet protocols like email, where UTF-8 predominates for broader interoperability, confining these schemes to niche, high-fidelity scenarios in industrial and legacy systems.[26][27]
Other 7-Bit Versions
ISO/IEC 2022's 7-bit implementations for non-Japanese languages typically adhere to a structure where the G0 set is designated as ASCII (ISO-IR 6), providing the base for invariant characters, while the G1 set is assigned a national variant character set via escape sequences. The Shift In (SI, code 0x0F) and Shift Out (SO, code 0x0E) control functions are mandatory for invoking G0 and G1 into the GL (graphics left) area, respectively, enabling dynamic switching between ASCII and the national set within a 7-bit code stream. This setup ensures compatibility with 7-bit transport protocols while accommodating additional graphic characters.
European implementations, known as ISO-2022-IR variants, utilize escape sequences to designate national 7-bit character sets derived from ISO/IEC 646, such as the International Reference Version (IRV, ISO-IR 6, final byte B, as in ESC ( B) for basic Latin or the French variant (ISO-IR 69, final byte f).[2][30] Other examples include the German variant (ISO-IR 21, final byte K) and the Spanish variant (ISO-IR 17, final byte Z), allowing replacement of ambiguous ASCII positions (e.g., #, [, \, ], ^, `, {, |, }, ~) with language-specific characters like £ or ñ.[30] The intermediate byte selects the target: ESC ( F designates a set to G0 and ESC ) F designates it to G1, in which case SI/SO shifts perform the invocation. These designations were standardized to support early European text interchange over 7-bit networks.[2]
The Korean implementation, ISO-2022-KR, defined in RFC 1557, extends this framework to encode the KS C 5601 standard (a 94×94 double-byte set for Hangul and Hanja). It begins in ASCII (G0), designates KS C 5601 to G1 via the escape sequence ESC $ ) C at the start of a line, and uses SO to shift to G1 for Korean characters (each represented as two 7-bit bytes in the range 0x21–0x7E) before returning to ASCII with SI. This method ensures full 7-bit compliance without requiring additional MIME encoding, supporting Korean text in email headers and bodies.
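The ISO-2022-KR layout described above (a one-time designation, then SO and SI around two-byte Korean characters) can be sketched as a small decoder. This is an illustration rather than a complete implementation: it relies on the observation that setting the high bit of each byte of a 7-bit KS C 5601 pair gives the EUC-KR form, so Python's built-in `euc_kr` codec can do the table lookup. The function name `decode_iso2022_kr` is hypothetical.

```python
HEADER = b"\x1b$)C"   # designates KS C 5601 to G1
SO, SI = 0x0E, 0x0F


def decode_iso2022_kr(data: bytes) -> str:
    """Decode ISO-2022-KR by tracking the SO/SI state (illustrative sketch)."""
    out = []
    in_g1 = False
    i = 0
    while i < len(data):
        b = data[i]
        if data.startswith(HEADER, i):
            i += len(HEADER)
        elif b == SO:
            in_g1 = True
            i += 1
        elif b == SI:
            in_g1 = False
            i += 1
        elif b >= 0x80:
            raise ValueError("8-bit byte in a 7-bit stream")
        elif in_g1 and 0x21 <= b <= 0x7E:
            # Two 7-bit bytes form one character; set the high bits to get
            # the EUC-KR form and let the built-in codec map it to Unicode.
            pair = bytes(x | 0x80 for x in data[i:i + 2])
            out.append(pair.decode("euc_kr"))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)
```

Feeding it the output of Python's own `iso2022_kr` encoder gives a quick consistency check against the built-in decoder.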
For right-to-left scripts like Arabic and Hebrew, 7-bit ISO 2022 variants designate sets such as ISO-IR 127 (Arabic) or ISO-IR 138 (Hebrew) to G1, with SI/SO for invocation. These implementations handle logical ordering for right-to-left scripts, relying on rendering systems for visual reordering in mixed LTR/RTL contexts.
These 7-bit versions were primarily used in legacy email systems for non-Latin scripts, enabling transmission over 7-bit channels before the widespread adoption of UTF-8, though their usage has declined significantly with modern Unicode-based protocols.
ISO/IEC 4873
ISO/IEC 4873 is an international standard published in 1991 that defines an 8-bit coded character set for information interchange, derived from and compatible with the 7-bit ISO/IEC 646 standard.[31] It incorporates a subset of the code extension techniques outlined in ISO/IEC 2022 to enable the invocation of multiple graphic character sets within an 8-bit environment.[32] This standard builds directly on the 8-bit provisions of ISO/IEC 2022 (Version 2), simplifying the framework by fixing certain elements to promote interoperability while allowing flexibility for additional sets.[2]
The code structure of ISO/IEC 4873 organizes characters into distinct elements: G0 is fixed as the 94-character ISO 646 set (equivalent to US-ASCII in the GL area, positions 02/01 through 07/14); G1 supports a 94- or 96-character graphic set in the GR area; G2 provides a 96-character supplementary set (positions 10/00 through 15/15); and G3 is designated for the ISO/IEC 8859-1 Latin-1 set, enabling access to Western European characters such as accented letters.[32] Control elements include C0 (up to 30 control functions in positions 00/00 to 01/15, excluding DEL) and an optional C1 set (08/00 to 09/15).[33] The standard defines three implementation levels: Level 1 for basic C0 and optional G1; Level 2 adding single-shift functions (SS2/SS3) for G2/G3 and requiring C1; and Level 3 incorporating locking-shift right functions (LS1R, LS2R, LS3R) for persistent invocation of G1, G2, or G3 into the GR area.[2]
Character set designations follow ISO/IEC 2022 conventions using escape sequences, such as ESC ( B to designate the ISO 646 set to G0, ESC - A to designate a 96-character G1 set (here ISO-IR 100, the right-hand part of Latin-1), and similar sequences for G2 and G3 from the ISO 2375 registry.[32] Locking shifts, including LS0 (the single control function 00/15, invoking G0 into GL) and LS1R (ESC 07/14, invoking G1 into GR), allow persistent mapping without repeated sequences, enhancing efficiency in data streams.[33] The standard supports combining characters through the GRAPHIC CHARACTER COMBINATION (GCC) control function as defined in ISO/IEC 6429, permitting non-spacing diacritics to overlay base characters for accented forms in European scripts.[32]
Intended primarily for European multilingual terminals and data processing systems, ISO/IEC 4873 facilitates the handling of texts in multiple Western European languages by shifting between G sets for characters like ñ, ü, and œ, while maintaining ASCII compatibility in the GL area.[2] Although the 1986 edition was withdrawn in favor of the 1991 version, the standard remains influential in legacy systems and as a foundation for encodings like ISO/IEC 8859 parts, which conform to its Level 1 structure.[34]
Extended Unix Code
Extended Unix Code (EUC) is a variable-width character encoding system developed for Unix-like operating systems, adapting the ISO/IEC 2022 framework to support multiple scripts, particularly CJK (Chinese, Japanese, and Korean), by utilizing shift mechanisms in an 8-bit environment without embedding escape sequences in the data stream.[35][36] EUC employs a fixed designation of character sets at the implementation or locale level, allowing seamless integration of ASCII with extended sets while maintaining compatibility with ISO/IEC 2022 principles through predefined shifts.[37][38]
In EUC's structure, the G0 set is designated as US-ASCII, using single-byte characters where the most significant bit (MSB) is 0, ensuring 7-bit compatibility for basic Latin text.[35] The G1 set (CS1) handles primary extended characters, typically 2-byte sequences with the MSB set to 1 on both bytes, supporting dense mappings like JIS X 0208 for Japanese.[38] G2 (CS2) and G3 (CS3) provide supplementary support: each G2 character is introduced by the single-byte SS2 (0x8E) prefix, while each G3 character is introduced by the single-byte SS3 (0x8F) prefix, often followed by 2-byte codes such as JIS X 0212 in Japanese encodings.[35] These designations are compiled into the system's locale configuration, eliminating the need for runtime escape sequences and enabling stateless decoding within the defined sets.[37]
EUC found widespread adoption in Unix environments like Solaris and HP-UX for handling CJK text, where variants such as EUC-JP predefined the shifts for Japanese standards (JIS X 0208 in G1, half-width katakana in G2, and JIS X 0212 in G3), simplifying file handling and terminal display without dynamic state changes.[35][38] Similarly, EUC-KR mapped KS C 5601 to G1 for Korean, and EUC-CN or EUC-TW supported Chinese standards like GB 2312 or CNS 11643, respectively, allowing Unix applications to process multilingual data in a single stream.[36][38]
One key advantage of EUC is its transparency to ISO/IEC 2022-compliant parsers; by prepending appropriate escape sequences to designate the fixed sets, EUC data can be interpreted as a static ISO/IEC 2022 subset, facilitating interoperability with escape-based systems.[35] However, EUC deviates from pure ISO/IEC 2022 by relying on fixed, locale-specific designations rather than allowing arbitrary runtime shifts via escapes, which limits flexibility for dynamic multi-script switching within a single document.[37] This fixed nature, while efficient for Unix locales, precludes support for the full range of ISO/IEC 2022's invocation and shifting capabilities.[36] Shift functions from ISO/IEC 2022 are emulated through these static mappings, ensuring consistent byte lengths and column widths for display purposes.[35]
Compound Text
Compound Text is a format developed by the X Consortium in 1989 as an interchange standard for multilingual text within the X Window System (X11). It provides a mechanism to encapsulate multiple streams of text, each adhering to ISO/IEC 2022 encoding rules, allowing graphical applications to handle diverse scripts and languages. Defined as part of X Version 11, the format uses an 8-bit environment in which the G0 set is invoked in the left half (GL) and the G1 set in the right half (GR) of the code space.[39]
The structure of Compound Text consists of a header followed by one or more segments. The header includes control sequences, such as ESC 02/03 V 03/00 to indicate the version (1.1), and a list of supported character sets, including ISO 8859-1 through 9, JIS X 0201, GB 2312, and others, designated via final characters such as 04/02 for US-ASCII. Each subsequent segment represents a portion of text in ISO 2022 format, enabling independent encoding; for instance, one segment might combine ASCII characters in GL with Greek characters from ISO 8859-7 in GR. Segments are separated by null bytes, and the format supports null-terminated lists for properties such as multiple text elements.[39]
In X11 applications, Compound Text facilitates internationalization by permitting a mixture of scripts within a single text stream without global shifts in character set invocation, which is particularly useful for selections, clipboard operations, and window titles under the Inter-Client Communication Conventions Manual (ICCCM). Designations within segments employ standard ISO 2022 escape sequences, such as ESC 02/08 {I} F for GL or ESC 02/09 {I} F for GR, to switch sets dynamically. Orientation flags, using CSI sequences such as CSI 03/01 05/13 for left-to-right or CSI 03/02 05/13 for right-to-left, allow bidirectional text directions to be nested on a stack-based model.
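The column/row notation used above (02/08, 04/02, and so on) maps to a byte value as column × 16 + row. The following Python sketch illustrates that arithmetic and builds the ASCII-plus-Greek combination mentioned earlier. The helper names (`colrow`, `escape`, `encode_ascii_greek`) are hypothetical, and the use of ESC 02/13 04/06 for the right half of ISO 8859-7 (a 96-character set) is an assumption based on the Compound Text specification's list of standard encodings; the sketch omits any header and is not a full encoder.

```python
ESC = 0x1B

def colrow(token: str) -> int:
    """Convert column/row notation such as '02/08' to a byte value."""
    col, row = (int(part) for part in token.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"not a column/row position: {token!r}")
    return col * 16 + row

def escape(*positions: str) -> bytes:
    """Build an escape sequence from column/row positions."""
    return bytes([ESC, *(colrow(p) for p in positions)])

# ESC 02/08 04/02 = ESC ( B : ASCII as the 94-character set in GL.
ASCII_IN_GL = escape("02/08", "04/02")
# ESC 02/13 04/06 = ESC - F : assumed designation of the right half
# of ISO 8859-7 (a 96-character set) in GR.
GREEK_IN_GR = escape("02/13", "04/06")

def encode_ascii_greek(text: str) -> bytes:
    """Encode text that uses only ASCII and ISO 8859-7 Greek characters."""
    out = bytearray(ASCII_IN_GL + GREEK_IN_GR)
    for ch in text:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")        # GL byte, high bit clear
        else:
            # GR byte, high bit set; raises UnicodeEncodeError for
            # characters outside ISO 8859-7.
            out += ch.encode("iso8859_7")
    return bytes(out)

print(encode_ascii_greek("a\u03b1").hex(" "))
```

Because both designations are made once at the start, the ASCII and Greek bytes can then be interleaved freely: the high bit of each byte alone decides whether it is read from GL or GR.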
Segments conform to ISO 2022 code versions up to Version 3 for compatibility.[39][40]
Despite its utility, Compound Text has been largely superseded by UTF-8 for X11 text interchange: modern toolkits such as GTK have deprecated their Compound Text conversion functions since version 4.18, prioritizing stateless Unicode encodings for broader compatibility and simplicity.[41]
Advanced Features and Interactions
Multi-Byte Character Support
ISO/IEC 2022 supports multi-byte characters through multi-byte graphic character sets, defined as 94ⁿ or 96ⁿ sets with n > 1, in which each graphic character is represented by a sequence of n bit combinations drawn from the designated set.[42] The first bit combination serves as the lead byte and the subsequent ones as trail bytes, and the whole sequence corresponds to a single character position within the set.[42]
The mechanism relies on the four graphic set slots (G0 through G3), each of which can hold a multi-byte set. Sets are designated via escape sequences in which the intermediate bytes indicate the structure of the set (e.g., 2/4 marks a multi-byte set) and the final byte identifies the specific registered set, with the range of the final byte indicating the number of bytes per character. Invocation occurs through locking-shift functions (LS0 to LS3), which activate a set in the GL or GR code element areas, or through single-shift functions (SS2 or SS3), which apply to one following character only.[42] Bytes from different characters or sets may not be interleaved within a multi-byte sequence: the lead and trail bytes must be transmitted contiguously, and the receiver identifies the character by referring to the currently active set as determined by the shift state.[42]
In 7-bit environments, the locking shifts SI (shift-in) and SO (shift-out) switch the GL area between the G0 and G1 sets, and each byte of a multi-byte sequence is transmitted as a 7-bit code; if a shift is missing or lost, the receiver pairs the bytes incorrectly or reads them as characters of the wrong set.[42] This support is specified starting from the second edition of the standard and is optional, applying mainly to implementations at higher conformance levels (e.g., Level 3 or 4) that permit redesignation of sets.
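The 7-bit shift mechanism described above can be sketched in Python as follows. This is a minimal illustration, assuming G0 holds a single-byte set and G1 a two-byte 94² set that have already been designated; the function name `split_7bit` and its width parameters are hypothetical, and escape sequences and single shifts are deliberately not handled.

```python
SO, SI = 0x0E, 0x0F  # locking shifts: SO invokes G1 into GL, SI invokes G0

def split_7bit(data: bytes, g0_width: int = 1, g1_width: int = 2):
    """Group GL bytes into characters according to the set invoked by SI/SO."""
    width = {"G0": g0_width, "G1": g1_width}
    active = "G0"                      # assumed initial state
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:
            active, i = "G1", i + 1
            continue
        if b == SI:
            active, i = "G0", i + 1
            continue
        if not 0x21 <= b <= 0x7E:      # controls, space, DEL: pass through
            out.append(("other", bytes([b])))
            i += 1
            continue
        n = width[active]
        chunk = data[i:i + n]
        # Lead and trail bytes must be contiguous graphic bytes.
        if len(chunk) < n or any(not 0x21 <= c <= 0x7E for c in chunk):
            raise ValueError("truncated or interrupted multi-byte character")
        out.append((active, chunk))
        i += n
    return out

print(split_7bit(b"A\x0e\x30\x21\x0fB"))   # shifts present
print(split_7bit(b"A\x30\x21B"))           # same graphic bytes, shifts lost
```

With the shifts in place, the bytes 0x30 0x21 are grouped as one G1 character; without them, the same two bytes are read as two separate G0 characters, which is the ambiguity that makes shift context essential in 7-bit streams.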
For example, a double-byte character is encoded by first designating and invoking the appropriate two-byte set (e.g., via an escape sequence followed by a locking shift), then transmitting the lead byte followed immediately by the trail byte from the invoked set's code table.[42] Triple-byte and longer sequences follow the same pattern but are rare in practice, since two-byte sets predominate for CJK scripts.[42]
A key limitation arises in 7-bit transmissions, where the individual bytes of a multi-byte sequence are indistinguishable from ASCII graphic characters (both occupy the range 0x21–0x7E), creating ambiguity if the receiver loses shift context; this is resolved through explicit resynchronization via control functions or through agreed-upon protocol context that re-establishes the active multi-byte set.[42] Multi-byte graphic sets are thus integral to representing large character repertoires beyond single-byte limits.[42]
Registration of Code Sets
The registration of code sets for ISO/IEC 2022, encompassing both graphic character sets and control function sets, is managed by a designated Registration Authority under the oversight of ISO/IEC JTC 1/SC 2, the subcommittee responsible for coded character sets.[43] This authority operates according to the procedures outlined in ISO/IEC 2375, which specifies how escape sequences and their associated coded character sets are registered and identified, without itself standardizing those sets.[44] ISO/IEC 2375 was originally published in 1985 and revised through 2003, with its legacy maintained in the technical report ISO/IEC TR 2375:2024, which archives all previously registered entries without provisions for new additions.[45]
The registration process begins with a proposal submitted by a Sponsoring Authority, such as an ISO/IEC committee or national standards body, detailing the character repertoire, code structure, and mappings to ISO/IEC 10646 (the Universal Coded Character Set).[46] The proposal must include comprehensive documentation of the escape sequences, ensure conformance to ISO/IEC 2022's technical requirements (such as 7-bit or 8-bit compatibility), and demonstrate alignment with ISO/IEC 10646 for character identities and names.[44] Upon receipt, the Registration Authority reviews the submission in consultation with the Joint Advisory Committee (RA-JAC), circulates it for a three-month public comment period, and, if it is approved, assigns a unique final byte for the escape sequence (from the range 0x40–0x7E, allocated so as to avoid conflicts).[46] The registered set is then published in the ISO International Register (ISO-IR), relating the escape sequence to its coded representation.[13]
Requirements emphasize interoperability and stability: all characters must map unambiguously to ISO/IEC 10646 positions, and escape sequences must follow ISO/IEC 2022's structure without overlapping registered or reserved codes.[44] Documentation must specify invocation methods, character properties, and any subsets, with revisions permitted only for upward-compatible updates or error corrections, subject to RA approval.[46]
The process excludes private-use sequences, which ISO/IEC 2022 reserves for vendor-specific implementations; escape sequences whose final byte lies in the range 0x30–0x3F are set aside for private use, allowing unregistered extensions without entering the official registry.[45] Major updates to the registry occurred primarily in the 1990s, with the last significant registrations documented around that period as ISO shifted focus toward ISO/IEC 10646 and Unicode.[45] ISO/IEC TR 2375:2024 serves as a historical compendium, listing all prior entries (such as ISO-IR 1 through 209) but confirming that no active registration mechanism exists for new code sets, reflecting the standard's legacy status in modern encoding ecosystems.[45]
Character Set Designations
In ISO/IEC 2022, character sets are designated with escape sequences that assign a coded character set to one of the available slots, from which it can later be invoked, allowing different sets to be used within a single data stream. These sequences begin with the ESC (escape) control character (0x1B), followed by one or more intermediate bytes and a final byte that identifies the particular set according to its registration.[3]
For single-byte character sets, which are 94- or 96-character sets, the sequence has the three-byte form ESC I F, where I is an intermediate byte (such as 0x28, "(", which designates a 94-character set into G0) and F is a final byte in the range 0x30 to 0x7E; values from 0x30 to 0x3F are reserved for private use, and values from 0x40 to 0x7E identify registered sets. For example, the sequence ESC ( B (0x1B 0x28 0x42) designates the ISO-IR 6 (ASCII) set into the G0 slot. Similarly, ESC ) B (0x1B 0x29 0x42) designates ISO-IR 6 into G1. Final bytes are assigned through the ISO International Register of Coded Character Sets to ensure uniqueness and interoperability.[3]
Character sets are designated into specific slots: G0 through G3 for graphic character sets (G0 and G1 being the primary slots, and G2 and G3 holding supplementary sets), and C0 and C1 for control function sets. The C0 set is designated with ESC ! F (0x1B 0x21 F) and the C1 set with ESC " F (0x1B 0x22 F). The intermediate byte 0x25 ("%") serves a different purpose: ESC % F designates another coding system rather than a set within the ISO/IEC 2022 structure. An example is ESC % G (0x1B 0x25 0x47), which switches to UTF-8; from that point the designation and shift rules of ISO/IEC 2022 do not apply until the stream returns with ESC % @.[3]
For multi-byte character sets, the designation takes the form ESC $ I F: the intermediate byte 0x24 ("$", 2/4) marks the set as multi-byte, the following intermediate byte selects the slot, and F identifies the set. A short form without the slot intermediate, ESC $ F, survives for the oldest registrations (final bytes @, A, and B), which designate into G0. This accommodates sets too large for a single byte, such as the double-byte sets used for CJK.
These sequences indicate the structure (e.g., 2/4 for 94² two-byte) via the combination of intermediates and final byte. A common example is ESC $ ) B (0x1B 0x24 0x29 0x42) for designating JIS X 0208 into G1. Designations must not occur in the middle of a multi-byte character sequence, to avoid ambiguity, and once made they persist in the slot until overridden by a subsequent designation sequence.[3]
Extensions for standards such as JIS X 0213 use four-byte escape sequences of the same form to designate its planes. For instance, ESC $ ( P (0x1B 0x24 0x28 0x50) designates plane 2 into G0. These follow the same persistence and no-mid-character rules, ensuring compatibility with base ISO/IEC 2022 while supporting the expanded repertoire.[47][48]
| Slot | Designation Syntax Example | Designated Set | Sequence (Hex) |
|---|---|---|---|
| G0 | ESC ( B | ISO-IR 6 (ASCII) | 1B 28 42 |
| G1 | ESC $ ) B | JIS X 0208 | 1B 24 29 42 |
| G2 | ESC $ * O | JIS X 0213 Plane 1 | 1B 24 2A 4F |
| None (other coding system) | ESC % G | UTF-8 | 1B 25 47 |
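The designation syntax can be illustrated with a small Python classifier. This is a sketch under stated assumptions rather than a complete parser: it follows the ECMA-35 ordering in which the multi-byte marker 0x24 ("$") precedes the slot-selecting intermediate, it handles only one slot intermediate, and it rejects longer forms such as those with a second intermediate byte. The function name `parse_designation` and the returned labels are hypothetical.

```python
ESC = 0x1B
SLOTS_94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}  # ( ) * +
SLOTS_96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}              # - . /
OTHER = {
    0x21: ("C0", "control"),          # ESC ! F
    0x22: ("C1", "control"),          # ESC " F
    0x25: ("DOCS", "coding system"),  # ESC % F, e.g. ESC % G for UTF-8
}

def parse_designation(seq: bytes):
    """Return (target, kind, final) for one designation escape sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("expected ESC, intermediate byte(s), and a final byte")
    *inter, final = seq[1:]
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")
    if any(not 0x20 <= b <= 0x2F for b in inter):
        raise ValueError("intermediate bytes must be in 0x20-0x2F")
    multibyte = inter[0] == 0x24
    if multibyte:
        inter = inter[1:]
        if not inter:
            # Legacy short form ESC $ @ / ESC $ A / ESC $ B implies G0.
            if final not in b"@AB":
                raise ValueError("short multi-byte form only for @, A, B")
            inter = [0x28]
    if len(inter) != 1:
        raise ValueError("unsupported designation form")
    selector = inter[0]
    if selector in SLOTS_94:
        return SLOTS_94[selector], "94^n" if multibyte else "94", chr(final)
    if selector in SLOTS_96:
        return SLOTS_96[selector], "96^n" if multibyte else "96", chr(final)
    if not multibyte and selector in OTHER:
        target, kind = OTHER[selector]
        return target, kind, chr(final)
    raise ValueError("unsupported designation form")

for s in (b"\x1b(B", b"\x1b$)B", b"\x1b$*O", b"\x1b%G"):
    print(s.hex(" "), "->", parse_designation(s))
```

Mapping the final byte to an actual repertoire (for example "B" after ESC $ ) to JIS X 0208) would require a lookup table drawn from the ISO-IR registry, which is left out here.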
Interaction with Other Systems
ISO/IEC 8859 encodings are 8-bit single-byte coded graphic character sets that conform to ISO/IEC 2022 at level 1 by fixing a 94-character set (ASCII) as G0, invoked in GL (bytes 0x21–0x7E), and a 96-character set as G1, invoked in GR (bytes 0xA0–0xFF), without dynamic switching.[49] This fixed structure makes ISO/IEC 8859 a subset of ISO/IEC 2022 capabilities, as the latter supports runtime designation of multiple character sets via escape sequences, providing greater flexibility for multilingual text.[1]
The escape sequence ESC % G switches an ISO/IEC 2022 stream to UTF-8, enabling a transition to a Unicode-based encoding from within the stream.[50] However, once the stream has switched to UTF-8, ISO/IEC 2022's state-dependent rules, such as character set invocations and shifts, no longer apply, as UTF-8 operates as a self-contained, stateless transformation format of ISO/IEC 10646.[51] In legacy systems combining ISO/IEC 2022 with Unicode surrogates, CESU-8 serves as a fallback encoding scheme, representing supplementary characters as sequences of UTF-16 surrogates in UTF-8 form to preserve compatibility without full UTF-8 conformance.[52]
For email and MIME transport, ISO/IEC 2022 variants such as ISO-2022-JP (RFC 1468) and ISO-2022-CN (RFC 1922) enable 7-bit safe delivery of multilingual text without requiring 8-bit SMTP extensions, using escape sequences to designate character sets within otherwise ASCII streams.[11] These encodings are identified by the charset parameter (e.g., charset=ISO-2022-JP) in MIME headers, allowing quoted-printable or 7-bit transfer encoding to avoid corruption.[53] Problems arise when 8-bit data passes through 7-bit channels or non-compliant gateways, where the high bit may be stripped or misinterpreted, leading to unreadable text unless fallback to 7-bit encodings is enforced.[11]
Conversion between ISO/IEC 2022 and legacy encodings such as EUC-JP or Shift-JIS involves tracking the current invocation state (e.g., the active G0 or G1 set) to map characters correctly, as EUC-JP represents a stateless extension of ISO/IEC 2022's 94-character sets using the high bit for multi-byte sequences, while Shift-JIS employs a distinct byte-range mapping for double-byte characters.[54] This state tracking ensures accurate round-trip conversion, often via intermediate mappings to shared code points such as those in JIS X 0208.[55]
The stateful design of ISO/IEC 2022, which depends on escape sequences to maintain and switch character sets, poses challenges in stateless systems such as HTTP or concatenated data streams, where parsing a fragment that lacks a reset sequence (e.g., ESC ( B) can result in incorrect character interpretation or synchronization errors. Multi-byte support in these interactions requires preserving the full context to avoid desynchronization during processing.
Code Structure Announcements
Code structure announcements in ISO/IEC 2022 provide an optional mechanism for declaring the coding facilities and initial state at the beginning of a data stream, so that decoders know in advance which facilities to expect rather than discovering them while parsing the content.[1] This feature, known as the Announcement of Code Structure Facilities (ACS), uses escape sequences to declare the facilities of the standard that the stream relies on, ensuring compatibility across systems handling multinational text.[4] The process begins with an announcer sequence of the form ESC SP F (0x1B 0x20 followed by a final byte), in which the final byte identifies a facility in use, such as which graphic sets and shift functions the stream employs.[1] This is followed by designation sequences that assign specific registered character sets to the graphic slots (G0 through G3), for example ESC ( B to designate the ISO-IR 6 (International Reference Version) set to G0 for 7-bit single-byte use.[4] Invocation states then establish which sets are active in GL (graphics left) or GR (graphics right), typically using locking shifts such as SI (Shift In) to invoke G0 or SO (Shift Out) to invoke G1, so that the initial rendering context is predefined.[1] The final bytes of the designation sequences refer to the ISO International Register and specify the exact repertoire (e.g., the final byte A in ESC - A for the right-hand part of ISO 8859-1 Latin-1).[4]
These announcements conform to the syntax of Version 3, which mandates support for 8-bit and 7-bit codes with reversible transformations and allows for the declaration of up to four graphic sets per code element.[1] In practice, they are embedded in protocols such as email bodies under MIME charset parameters like ISO-2022-JP, where the sequences appear at the start of the message content to set up Japanese or multilingual text handling, or in file metadata for document interchange formats requiring explicit encoding declarations. This usage is optional depending on the conformance level, with Level 1 implementations potentially omitting them in favor of fixed defaults, while higher levels (e.g., Level 4) encourage them for robustness in international environments.[4]
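The way escape sequences set up the encoding state at the start of ISO-2022-JP content can be observed with the `iso2022_jp` codec in Python's standard library. This is a small illustration rather than a MIME implementation; the wrapper name `encode_7bit` is hypothetical, and the sequences shown are the designations that ISO-2022-JP uses (ESC $ B for JIS X 0208 and ESC ( B for ASCII), not announcer sequences.

```python
def encode_7bit(text: str) -> bytes:
    """Encode text as ISO-2022-JP and confirm that every byte is 7-bit."""
    data = text.encode("iso2022_jp")
    if any(b >= 0x80 for b in data):
        raise ValueError("unexpected 8-bit byte in ISO-2022-JP output")
    return data

# "\u65e5\u672c\u8a9e" is the word for "Japanese language" in kanji.
sample = encode_7bit("\u65e5\u672c\u8a9e text")
print(sample)
print(sample.decode("iso2022_jp") == "\u65e5\u672c\u8a9e text")
```

Because the text begins with kanji, the output begins with the designation ESC $ B, and the codec emits ESC ( B to return to ASCII before the trailing Latin text.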
The primary benefit of code structure announcements is to allow decoders to synchronize immediately upon encountering the initial sequences, avoiding the computational overhead of scanning for dynamic shifts and designations throughout the stream, which is particularly advantageous in resource-constrained or high-volume data processing scenarios.[1]
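A decoder that wants to read such up-front declarations only needs the general escape-sequence syntax of ISO/IEC 2022: ESC, zero or more intermediate bytes (0x20–0x2F), then one final byte (0x30–0x7E). The Python sketch below collects the escape sequences found at the very start of a stream and stops at the first byte of ordinary content. The function name `leading_escape_sequences` is hypothetical, the sample stream is constructed purely for illustration, and interpreting what each sequence means is left to the caller.

```python
def leading_escape_sequences(data: bytes) -> list:
    """Return the escape sequences that appear before any other content."""
    sequences = []
    i = 0
    while i < len(data) and data[i] == 0x1B:
        j = i + 1
        while j < len(data) and 0x20 <= data[j] <= 0x2F:
            j += 1                      # skip intermediate bytes
        if j >= len(data) or not 0x30 <= data[j] <= 0x7E:
            break                       # truncated or malformed sequence
        sequences.append(data[i:j + 1])
        i = j + 1
    return sequences

# Illustrative stream: an announcer-style sequence (ESC SP A), then
# ESC ( B (ASCII to G0) and ESC - A (ISO 8859-1 right half to G1).
stream = b"\x1b A\x1b(B\x1b-Ahello \xe9"
for seq in leading_escape_sequences(stream):
    print(seq.hex(" "))
```

Reading this prefix once lets a decoder fix its initial state before touching the body, although later designations or shifts in the stream would still have to be tracked as they occur.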
