C0 and C1 control codes
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.
C0 codes are the range 0x00–0x1F and the default C0 set was originally defined in ISO 646 (ASCII). C1 codes are the range 0x80–0x9F and the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). The ISO/IEC 2022 system of specifying control and graphic characters allows other C0 and C1 sets to be available for specialized applications, but they are rarely used.
C0 controls
ASCII defines 32 control characters, plus the DEL character. This large number of codes was desirable at the time, as multi-byte controls would require implementation of a state machine in the terminal, which was very difficult with contemporary electronics and mechanical terminals.
Only a few codes have maintained their use: BEL, ESC, and the format effector[1] (FEn) characters BS, TAB, LF, VT, FF, and CR. Others are unused or have acquired different meanings, such as NUL being the C string terminator. Some data transfer protocols such as ANPA-1312, Kermit, and XMODEM do make extensive use of SOH, STX, ETX, EOT, ACK, NAK and SYN for purposes approximating their original definitions. The "Information Separators" (ISn) also survive in some file formats, such as the Unix info format,[2] and are recognized as line boundaries by Python's splitlines string method.[3]
The names of some codes were changed in ISO 6429:1992 (or ECMA-48:1991) to be neutral with respect to writing direction. The abbreviations used were not changed, as the standard had already specified that those would remain unchanged when the standard is translated to other languages. In this table both new and old names are shown for the renamed controls (the old name is the one matching the abbreviation).
Unicode provides Control Pictures that can replace C0 control characters to make them visible on screen. However, caret notation is used more often.
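Both display conventions can be computed directly from the code value. The following is a minimal sketch, not taken from any standard text: the function names `caret_notation` and `control_picture` are hypothetical, and it assumes the usual conventions that caret notation XORs the code with 0x40 and that the Control Pictures block starts at U+2400.

```python
def caret_notation(code: int) -> str:
    """Return the caret notation for a C0 control code or DEL."""
    if not (0x00 <= code <= 0x1F or code == 0x7F):
        raise ValueError(f"not a C0 control code or DEL: {code:#04x}")
    # XOR with 0x40 maps 0x00-0x1F onto '@'..'_' and 0x7F onto '?'.
    return "^" + chr(code ^ 0x40)


def control_picture(code: int) -> str:
    """Return the Control Pictures character for a C0 code, SP or DEL."""
    if 0x00 <= code <= 0x20:
        # U+2400..U+241F mirror the C0 codes; U+2420 is SYMBOL FOR SPACE.
        return chr(0x2400 + code)
    if code == 0x7F:
        # SYMBOL FOR DELETE sits directly after SYMBOL FOR SPACE.
        return chr(0x2421)
    raise ValueError(f"no control picture for {code:#04x}")


for code in (0x00, 0x07, 0x1B, 0x7F):
    print(f"{code:#04x}", caret_notation(code), control_picture(code))
```

Caret notation stays within printable ASCII, which may be one reason it remains the more common choice in terminals and editors.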
| Caret notation | Decimal | Hexadecimal | Abbreviations | Control picture | Name | C escape | Description |
|---|---|---|---|---|---|---|---|
| ^@ | 0 | 00 | NUL | ␀ | Null | \0 | Does nothing. The code of blank paper tape, and also used for padding to slow transmission. |
| ^A | 1 | 01 | TC1, SOH | ␁ | Start of Heading | | First character of the heading of a message.[5] |
| ^B | 2 | 02 | TC2, STX | ␂ | Start of Text | | Terminates the header and starts the message text. |
| ^C | 3 | 03 | TC3, ETX | ␃ | End of Text | | Ends the message text, starts a footer (up to the next TC character).[5][6] |
| ^D | 4 | 04 | TC4, EOT | ␄ | End of Transmission | | Ends the transmission of one or more messages.[5][6] May place terminals on standby.[6] |
| ^E | 5 | 05 | TC5, ENQ, WRU[a] | ␅ | Enquiry | | Trigger a response at the receiving end, to see if it is still present. |
| ^F | 6 | 06 | TC6, ACK | ␆ | Acknowledge | | Indication of successful receipt of a message. |
| ^G | 7 | 07 | BEL[b] | ␇ | Bell, Alert | \a | Call for attention from an operator. |
| ^H | 8 | 08 | FE0, BS | ␈ | Backspace | \b | Move one position leftwards. Next character may overprint or replace the character that was there. |
| ^I | 9 | 09 | FE1, HT | ␉ | Character Tabulation, Horizontal Tabulation | \t | Move right to the next tab stop. |
| ^J | 10 | 0A | FE2, LF | ␊ | Line Feed | \n | Move down to the same position on the next line (some devices also moved to the left column). |
| ^K | 11 | 0B | FE3, VT | ␋ | Line Tabulation, Vertical Tabulation | \v | Move down to the next vertical tab stop. |
| ^L | 12 | 0C | FE4, FF | ␌ | Form Feed | \f | Move down to the top of the next page. |
| ^M | 13 | 0D | FE5, CR | ␍ | Carriage Return | \r | Move to column zero while staying on the same line. |
| ^N | 14 | 0E | SO, LS1[13][c] | ␎ | Shift Out | | Switch to an alternative character set. |
| ^O | 15 | 0F | SI, LS0[13][c] | ␏ | Shift In | | Return to regular character set after SO. |
| ^P | 16 | 10 | TC7, DC0,[d] DLE | ␐ | Data Link Escape | | Cause a limited number of contiguously following characters to be interpreted in some different way.[15][16] |
| ^Q | 17 | 11 | DC1, XON | ␑ | Device Control One | | Turn on (DC1 and DC2) or off (DC3 and DC4) devices. Teletype[7] used these for the paper tape reader and the paper tape punch. The first use became the de facto standard for software flow control.[17] |
| ^R | 18 | 12 | DC2, TAPE | ␒ | Device Control Two | | |
| ^S | 19 | 13 | DC3, XOFF | ␓ | Device Control Three | | |
| ^T | 20 | 14 | DC4, | ␔ | Device Control Four | | |
| ^U | 21 | 15 | TC8, NAK | ␕ | Negative Acknowledge | | Negative response to a sender, such as a detected error. |
| ^V | 22 | 16 | TC9, SYN | ␖ | Synchronous Idle | | Sent in synchronous transmission systems when no other character is being transmitted. |
| ^W | 23 | 17 | TC10, ETB | ␗ | End of Transmission Block | | End of a transmission block of data when data are divided into such blocks for transmission purposes. |
| ^X | 24 | 18 | CAN | ␘ | Cancel | | Indicates that the data preceding it are in error or are to be disregarded. |
| ^Y | 25 | 19 | EM | ␙ | End of medium | | Indicates on paper or magnetic tapes that the end of the usable portion of the tape had been reached.[4] |
| ^Z | 26 | 1A | SUB | ␚ | Substitute | | Replaces a character that was found to be invalid or in error. Should be ignored. |
| ^[ | 27 | 1B | ESC | ␛ | Escape | \e[e] | Alters the meaning of a limited number of following bytes. Nowadays this is almost always used to introduce an ANSI escape sequence. |
| ^\ | 28 | 1C | IS4, FS | ␜ | File Separator | | Can be used as delimiters to mark fields of data structures. US is the lowest level, while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it. SP (space) could be considered an even lower level. |
| ^] | 29 | 1D | IS3, GS | ␝ | Group Separator | | |
| ^^ | 30 | 1E | IS2, RS | ␞ | Record Separator | | |
| ^_ | 31 | 1F | IS1, US | ␟ | Unit Separator | | |
While not technically part of the C0 control character range, the following two characters can be thought of as having some characteristics of control characters.
| Caret notation | Decimal | Hexadecimal | Abbreviations | Control picture | Name | C escape | Description |
|---|---|---|---|---|---|---|---|
| | 32 | 20 | SP | ␠ | Space | | Move right one character position. |
| ^? | 127 | 7F | DEL | ␡ | Delete | | Should be ignored. Used to delete characters on punched tape by punching out all the holes. |
- [a] Teletype labelled the key WRU for 'who are you?'[7]
- [b] The name BELL is assigned by Unicode to the unrelated emoji character 🔔 (U+1F514). While C0 and C1 control characters were not formally named by the Unicode standard itself at the time, this collided with existing use of BELL as the name of this control character in software following the previous versions of UTS#18 (the Unicode Regular Expressions standard),[8] e.g. in Perl.[9] Unicode now accepts ALERT and BEL (but not BELL) as formal aliases for the control character,[10] although the code chart still lists BELL as the ISO 6429 alias,[11] and the corresponding control picture code point is called SYMBOL FOR BELL. Perl subsequently switched to using BELL for the emoji in version 5.18.[12]
- [c] ISO/IEC 2022 (ECMA-35) refers to these as LS0 and LS1 in 8-bit environments, and as SI and SO in 7-bit environments.[13]
- [d] The first, 1963 edition of ASCII classified DLE as a device control, rather than a transmission control, and gave it the abbreviation DC0 ("device control reserved for data link escape").[14]
- [e] The '\e' escape sequence is not part of ISO C and many other language specifications. However, it is understood by several compilers, including GCC.
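The hierarchy of information separators described in the table above can be illustrated with a small sketch. The helper names `encode_records` and `decode_records` are hypothetical, and the two-level layout (US between fields, RS between records) is just one plausible use of the separators, not a format defined by any standard cited here.

```python
# The four C0 information separators, from highest to lowest level.
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"


def encode_records(records):
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in records)


def decode_records(text):
    """Inverse of encode_records: split on RS, then on US."""
    return [record.split(US) for record in text.split(RS)]


records = [["alice", "42"], ["bob", "7"]]
encoded = encode_records(records)
decoded = decode_records(encoded)

# str.splitlines() treats FS, GS and RS (but not US) as line boundaries,
# so it also breaks the encoded text into one string per record.
per_record = encoded.splitlines()
print(decoded, per_record)
```

Because the separators never occur in ordinary text, this scheme needs no quoting or escaping, unlike comma- or tab-delimited formats.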
C1 controls
In 1973, ECMA-35 and ISO 2022[18] attempted to define a method so an 8-bit "extended ASCII" code could be converted to a corresponding 7-bit code, and vice versa.[19] In a 7-bit environment, the Shift Out (SO) would change the meaning of the 96 bytes 0x20 through 0x7F[a][21] (i.e. all but the C0 control codes), to be the characters that an 8-bit environment would print if it used the same code with the high bit set. This meant that the range 0x80 through 0x9F could not be printed in a 7-bit environment,[19] thus it was decided that no alternative character set could use them, and that these codes should be additional control codes, which became known as the C1 control codes. To allow a 7-bit environment to use these new controls, the sequences ESC @ through ESC _ were to be considered equivalent.[19] The later ISO 8859 standards abandoned support for 7-bit codes, but preserved this range of control characters.
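The equivalence between the 8-bit C1 codes and the two-byte sequences ESC @ through ESC _ amounts to an offset of 0x40. A minimal sketch follows; the function names `c1_to_escape` and `escape_to_c1` are hypothetical and chosen only for illustration.

```python
ESC = 0x1B


def c1_to_escape(code: int) -> bytes:
    """Convert an 8-bit C1 code (0x80-0x9F) to its 7-bit ESC form."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError(f"not a C1 control code: {code:#04x}")
    # 0x80 maps to ESC '@' (0x40), 0x9F maps to ESC '_' (0x5F).
    return bytes([ESC, code - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Convert a two-byte ESC @ .. ESC _ sequence to its 8-bit C1 code."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError(f"not a 7-bit C1 escape sequence: {seq!r}")
    return seq[1] + 0x40


print(c1_to_escape(0x9B))            # CSI in its 7-bit form
print(hex(escape_to_c1(b"\x1b]")))   # OSC as an 8-bit code
```

The 7-bit form is the one normally seen in practice, for example ESC [ at the start of ANSI escape sequences.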
The first C1 control code set to be registered for use with ISO 2022 was DIN 31626,[22] a specialised set for bibliographic use which was registered in 1979.[23]
The more common general-use ISO/IEC 6429 set was registered in 1983,[24] although the ECMA-48 specification upon which it was based had been first published in 1976.[25] The same set is specified by JIS X 0211 (formerly JIS C 6323).[26] Symbolic names defined by RFC 1345 and early drafts of ISO 10646, but not in ISO/IEC 6429 (PAD, HOP and SGC), are also used.[9][27]
Except for SS2 and SS3 in EUC-JP text, and NEL in text transcoded from EBCDIC, the 8-bit forms of these codes were almost never used. CSI, DCS and OSC are used to control text terminals and terminal emulators, but almost always by using their 7-bit escape code representations. Nowadays if these codes are encountered it is far more likely they are intended to be printing characters from that position of Windows-1252 or Mac OS Roman.
Except for NEL, Unicode does not provide a "control picture" for any of these. There is no well-known variation of caret notation for them either.
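The ambiguity between C1 controls and Windows-1252 printing characters can be seen by decoding the same bytes two ways. This is a small sketch using Python's standard `latin-1` and `cp1252` codecs; the sample bytes are made up for illustration.

```python
import unicodedata

# Bytes 0x93 and 0x94 are C1 controls in ISO 8859-1,
# but curly quotation marks in Windows-1252.
data = b"\x93quoted\x94"

as_latin1 = data.decode("latin-1")  # yields U+0093 ... U+0094
as_cp1252 = data.decode("cp1252")   # yields U+201C ... U+201D

print([f"U+{ord(ch):04X}" for ch in (as_latin1[0], as_latin1[-1])])
print(as_cp1252)
print(unicodedata.category(as_latin1[0]))  # general category of U+0093
```

Note that Python's `cp1252` codec leaves a few positions in this range undefined and raises `UnicodeDecodeError` for them, so a robust decoder would need a fallback.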
| ESC+ | Decimal | Hex | Abbr | Name | Description[28] |
|---|---|---|---|---|---|
| @ | 128 | 80 | PAD[10] | Padding Character[b] | Proposed as a "padding" or "high byte" for single-byte characters to make them two bytes long for easier interoperability with multiple byte characters. Extended Unix Code (EUC) occasionally uses this.[32] |
| A | 129 | 81 | HOP[10] | High Octet Preset[b] | Proposed to set the high byte of a sequence of multiple byte characters so they only need one byte each, as a simple form of data compression. |
| B | 130 | 82 | BPH | Break Permitted Here[c] | Follows a graphic character where a line break is permitted. Roughly equivalent to a soft hyphen or zero-width space except it does not define what is printed at the line break. |
| C | 131 | 83 | NBH | No Break Here[c] | Follows the graphic character that is not to be broken. See also word joiner. |
| D | 132 | 84 | IND | Index[d] | Move down one line without moving horizontally, to eliminate ambiguity about the meaning of LF. |
| E | 133 | 85 | NEL | Next Line | Equivalent to CR+LF, to match the EBCDIC control character. |
| F | 134 | 86 | SSA | Start of Selected Area | Used by block-oriented terminals. In xterm ESC F moves to the lower-left corner of the screen, since certain software assumes this behaviour.[35] |
| G | 135 | 87 | ESA | End of Selected Area | |
| H | 136 | 88 | HTS | | Set a tab stop at the current position. |
| I | 137 | 89 | HTJ | | Right-justify the text since the last tab against the next tab stop. |
| J | 138 | 8A | VTS | | Set a vertical tab stop. |
| K | 139 | 8B | PLD | | To produce subscripts and superscripts in ISO/IEC 6429. Subscripts use PLD text PLU while superscripts use PLU text PLD. |
| L | 140 | 8C | PLU | | |
| M | 141 | 8D | RI | | Move up one line. |
| N | 142 | 8E | SS2 | Single-Shift 2 | Next character is from the G2 or G3 sets, respectively. |
| O | 143 | 8F | SS3 | Single-Shift 3 | |
| P | 144 | 90 | DCS | Device Control String | Followed by a string of printable characters (0x20 through 0x7E) and format effectors (0x08 through 0x0D), terminated by ST (0x9C). Xterm defined a number of these.[36] |
| Q | 145 | 91 | PU1 | Private Use 1 | Reserved for private function agreed on between the sender and the recipient of the data. |
| R | 146 | 92 | PU2 | Private Use 2 | |
| S | 147 | 93 | STS | Set Transmit State | |
| T | 148 | 94 | CCH | Cancel character | Destructive backspace, to eliminate ambiguity about meaning of BS. |
| U | 149 | 95 | MW | Message Waiting | |
| V | 150 | 96 | SPA | Start of Protected Area | Used by block-oriented terminals. |
| W | 151 | 97 | EPA | End of Protected Area | |
| X | 152 | 98 | SOS | Start of String[c] | Followed by a control string terminated by ST (0x9C) which (unlike DCS, OSC, PM or APC) may contain any character except SOS or ST. |
| Y | 153 | 99 | SGC,[10] SGCI[37] | Single Graphic Character Introducer[b] | Intended to allow an arbitrary Unicode character to be printed; it would be followed by that character, most likely encoded in UTF-1.[37] |
| Z | 154 | 9A | SCI | Single Character Introducer[c] | To be followed by a single printable character (0x20 through 0x7E) or format effector (0x08 through 0x0D), and to print it as ASCII no matter what graphic or control sets were in use. |
| [ | 155 | 9B | CSI | Control Sequence Introducer | Used to introduce control sequences that take parameters. Used for ANSI escape sequences. |
| \ | 156 | 9C | ST | String Terminator | Terminates a string started by DCS, SOS, OSC, PM or APC. |
| ] | 157 | 9D | OSC | Operating System Command | Followed by a string of printable characters (0x20 through 0x7E) and format effectors (0x08 through 0x0D), terminated by ST (0x9C), intended for use to allow in-band signaling of protocol information, but rarely used for that purpose. Some terminal emulators, including xterm, use OSC sequences for setting the window title and changing the colour palette. They may also support terminating an OSC sequence with BEL instead of ST.[38] Kermit used APC to transmit commands.[39] |
| ^ | 158 | 9E | PM | Privacy Message | |
| _ | 159 | 9F | APC | Application Program Command | |
- [a] In early versions the range excluded SP and DEL[20]
- [b] Not part of ISO/IEC 6429 (ECMA-48)[9][27][29]: 4 [30]: 5 [31]: 8
- [c] Not part of the first edition of ISO/IEC 6429.[24][29]: 4
- [d] Deprecated in 1988 and withdrawn in 1992 from ISO/IEC 6429[31]: 87 (1986[33] and 1991[34] respectively for ECMA-48).
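The OSC row of the table above describes a string that starts with OSC and ends with ST, with BEL accepted as a terminator by some terminal emulators. A minimal sketch of building and parsing such a string in its 7-bit form follows. The helper names `osc` and `parse_osc` are hypothetical, and the payload `2;demo` assumes the xterm convention in which parameter 2 sets the window title; other terminals may differ.

```python
ESC, BEL = "\x1b", "\x07"
OSC_7BIT = ESC + "]"    # 7-bit form of OSC (0x9D)
ST_7BIT = ESC + "\\"    # 7-bit form of ST (0x9C)


def osc(payload: str, use_bel: bool = False) -> str:
    """Wrap a payload in OSC ... ST, or OSC ... BEL if requested."""
    return OSC_7BIT + payload + (BEL if use_bel else ST_7BIT)


def parse_osc(seq: str) -> str:
    """Extract the payload from a complete 7-bit OSC string."""
    if not seq.startswith(OSC_7BIT):
        raise ValueError("missing OSC introducer")
    body = seq[len(OSC_7BIT):]
    # Accept either terminator, checking the standard ST first.
    for terminator in (ST_7BIT, BEL):
        if body.endswith(terminator):
            return body[: -len(terminator)]
    raise ValueError("unterminated OSC string")


title_seq = osc("2;demo")  # assumed xterm-style "set window title"
print(repr(title_seq), parse_osc(title_seq))
```

Writing `title_seq` to a terminal that follows the xterm convention would be expected to change the window title; elsewhere it may simply be ignored.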
Other control code sets
The ISO/IEC 2022 (ECMA-35) extension mechanism allowed escape sequences to change the C0 and C1 sets. The standard C0 control character set shown above is chosen with the sequence ESC ! @ and the above C1 set chosen with the sequence ESC " C.[24]
Several official and unofficial alternatives have been defined, but these are now largely obsolete. Most were forced to retain a good deal of compatibility with the ASCII controls for interoperability. The standard makes ESC,[40][41] SP and DEL[a] "fixed" coded characters, which are available in their ASCII locations in all encodings that conform to the standard.[43] It also specifies that if a C0 set included transmission control (TCn) codes, they must be encoded at their ASCII locations[40] and could not be put in a C1 set,[44] and any new transmission controls must be in a C1 set.[40]
Alternative C0 character sets
- ANPA-1312, a text markup language used for news transmission, replaces several C0 control characters.
- IPTC 7901, the newer international version of the above, has its own variations.
- Videotex has a completely different set.
- Teletext also defines a set similar to Videotex.
- T.61/T.51,[45] and others[46] replaced EM and GS with SS2 and SS3 so these functions could be used in a 7-bit environment without resorting to escape sequences.
- Some sets replaced FS with SS2[47] (as ANPA-1312 does).
- The now-withdrawn JIS C 6225, designated JIS X 0207 in later sources,[48] replaced FS with CEX or "Control Extension"[49] which introduces control sequences for vertical text behaviour, superscripts and subscripts[50] and for transmitting custom character graphics.[48]
Alternative C1 character sets
- A specialized C1 control code set is registered for bibliographic use (including string collation), such as by MARC-8.[23][51][52]
- Various specialised C1 control code sets are registered for use by Videotex formats.[22]
- The Stratus VOS operating system uses a C1 set called the NLS control set.[53] It includes SS1 (Single-Shift 1) through SS15 (Single-Shift 15) controls,[54] used to invoke individual characters from pre-defined supplementary character sets,[55] in a similar manner to the single-shift mechanism of ISO/IEC 2022. The only single-shift controls defined by ISO/IEC 2022 are SS2 and SS3; these are retained in the VOS set at their original code points and function the same way.
- EBCDIC defines up to 29 additional control codes besides those present in ASCII. When translating EBCDIC to Unicode (or to ISO 8859), these codes are mapped to C1 control characters in a manner specified by IBM's Character Data Representation Architecture (CDRA).[56][57] The New Line (NL) control does translate to the ISO/IEC 6429 NEL (though it is often swapped with LF, following the UNIX line-ending convention),[56] but the remainder of the control codes do not correspond. For example, the EBCDIC control SPS and the ECMA-48 control PLU are both used to begin a superscript or end a subscript, but are not mapped to one another. Extended-ASCII-mapped EBCDIC can therefore be regarded as having its own C1 set, although it is not registered with the ISO-IR registry for ISO/IEC 2022.[22]
Unicode
Unicode reserves the 65 code points described above for compatibility with the C0 and C1 control codes, giving them the general category Cc (control). These are:
- U+0000–U+001F (C0 controls) and U+007F (DEL) assigned to the C0 Controls and Basic Latin block, and
- U+0080–U+009F (C1 controls) assigned to the C1 Controls and Latin-1 Supplement block.
Unicode only specifies semantics for the C0 format controls HT, LF, VT, FF, and CR (note BS is missing); the C0 information separators FS, GS, RS, US (and SP); and the C1 control NEL.[58] The rest of the codes are transparent to Unicode and their meanings are left to higher-level protocols, with ISO/IEC 6429 suggested as a default.[58]
Unicode includes many additional format effector characters besides these, such as marks, embeds, isolates and pops for explicit bidirectional formatting, and the zero-width joiner and non-joiner for controlling ligature use. However these are given the general category Cf (format) rather than Cc.
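The split between the Cc and Cf general categories can be checked against Python's bundled Unicode database. This is a small sketch using the standard `unicodedata` module; the variable names are arbitrary.

```python
import sys
import unicodedata

# Collect every code point whose general category is Cc (control).
cc_points = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
]

# The C0 range, DEL, and the C1 range should account for all of them.
expected = list(range(0x00, 0x20)) + [0x7F] + list(range(0x80, 0xA0))

# The zero-width joiner and non-joiner are format characters instead.
zwj_category = unicodedata.category("\u200d")
zwnj_category = unicodedata.category("\u200c")

print(len(cc_points), cc_points == expected, zwj_category, zwnj_category)
```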
See also
- Control Pictures - Unicode graphical representation characters for the C0 control codes
- ANSI escape code
Footnotes
- [a] ISO/IEC 4873 extends this requirement to the C1 SS2 and SS3,[42] although ISO/IEC 2022 itself does not.
References
1. Standard ECMA-6 7-bit Coded Character Set (PDF) (Technical report). 1965. p. 4.
2. Fox, Brian. "Adding a new node to Info". Info: The online, menu-driven GNU documentation system. GNU Project.
3. "Built-in Types § str.splitlines". The Python Standard Library. Python Software Foundation.
4. ISO/TC 97/SC 2 (1975). The set of control characters of the ISO 646 (PDF). ITSCJ/IPSJ. ISO-IR-1.
5. IPTC (1995). The IPTC Recommended Message Format (PDF) (5th ed.). IPTC TEC 7901.
6. "end-of-transmission character (EOT)". Federal Standard 1037C. 1996. Archived from the original on 2016-03-09.
7. Robert McConnell; James Haynes; Richard Warren (December 2002). "Understanding ASCII Codes". NADCOMM.
8. Williamson, Karl. "Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0".
9. Ken Whistler (July 20, 2011). "Formal Name Aliases for Control Characters, L2/11-281". Unicode Consortium.
10. "Name Aliases". Unicode Character Database. Unicode Consortium.
11. "C0 Controls and Basic Latin" (PDF). Unicode Consortium.
12. "charnames". Perl Programming Documentation.
13. ECMA (1994). "7.3: Invocation of character-set code elements". Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.). p. 14. ECMA-35.
14. American Standards Association (1963). American Standard Code for Information Interchange: 4. Legend. p. 6. ASA X3.4-1963.
15. "data link escape character (DLE)". Federal Standard 1037C. 1996. Archived from the original on 2016-08-01.
16. "Supplementary transmission control functions (an extension of the basic mode control procedures for data communication systems)". European Computer Manufacturers Association. 1972. ECMA-37.
17. "What is the point of Ctrl-S?". Unix and Linux Stack exchange. Retrieved 14 February 2019.
18. ECMA/TC 1 (1973). "Brief History". 7-bit Input/Output Coded Character Set (PDF) (4th ed.). ECMA. ECMA-6:1973.
19. ECMA/TC 1 (1971). "8.2: Correspondence between the 7-bit Code and an 8-bit Code". Extension of the 7-bit Coded Character Set (PDF) (1st ed.). ECMA. pp. 21–24. ECMA-35:1971.
20. ECMA/TC 1 (1973). "4.2: Specific Control Characters". 7-bit Input/Output Coded Character Set (PDF) (4th ed.). ECMA. p. 16. ECMA-6:1973.
21. ECMA/TC 1 (1985). "5.3.8: Sets of 96 graphic characters". Code Extension Techniques (PDF) (4th ed.). ECMA. pp. 17–18. ECMA-35:1985.
22. ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences (PDF), ITSCJ/IPSJ, ISO-IR, archived from the original (PDF) on 2023-05-12, retrieved 2023-05-13
23. DIN (1979-07-15). Additional Control Codes for Bibliographic Use according to German Standard DIN 31626 (PDF). ITSCJ/IPSJ. ISO-IR-40.
24. ISO/TC97/SC2 (1983-10-01). C1 Control Set of ISO 6429:1983 (PDF). ITSCJ/IPSJ. ISO-IR-77.
25. ECMA/TC 1 (1979). "Brief History". Additional Control Functions for Character-Imaging I/O Devices (PDF) (2nd ed.). ECMA. ECMA-48:1979.
26. "JIS X 02xx 符号" (in Japanese).
27. Ken Whistler (2015-10-05). "Why Nothing Ever Goes Away". Unicode Mailing List.
28. ECMA/TC 1 (June 1991). Control Functions for Coded Character Sets (PDF) (5th ed.). ECMA. ECMA-48:1991.
29. ISO 6429:1983 Information processing — ISO 7-bit and 8-bit coded character sets — Additional control functions for character-imaging devices. ISO. 1983-05-01.
30. ISO 6429:1988 Information processing — Control functions for 7-bit and 8-bit coded character sets. ISO. 1988-11-15.
31. ISO/IEC 6429:1992 Information technology — Control functions for coded character sets. ISO. 1992-12-15. Retrieved 2024-05-29.
32. Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. p. 244. ISBN 9780596800925.
33. ECMA/TC 1 (December 1986). "Appendix E: Changes Made in this Edition". Control Functions for Coded Character Sets (PDF) (4th ed.). ECMA. ECMA-48:1986.
34. ECMA/TC 1 (June 1991). "F.8 Eliminated control functions". Control Functions for Coded Character Sets (PDF) (5th ed.). ECMA. ECMA-48:1991.
35. "VT100 Widget Resources (§ hpLowerleftBugCompat)". xterm - terminal emulator for X.
36. Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Device-Control functions". XTerm Control Sequences.
37. Brender, Ronald F. (1989). "Ada 9x Project Report: Character Set Issues for Ada 9x". Carnegie Mellon University.
38. Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Operating System Commands". XTerm Control Sequences.
39. Frank da Cruz; Christine Gianone (1997). Using C-Kermit. Digital Press. p. 278. ISBN 978-1-55558-164-0.
40. ECMA (1994). "6.4.2: Primary sets of coded control functions". Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.). p. 11. ECMA-35.
41. ISO/TC97/SC2/WG-7; ECMA (1985-08-01). Minimum C0 set for ISO 4873 (PDF). ITSCJ/IPSJ. ISO-IR-104.
42. ISO/TC97/SC2/WG-7; ECMA (1985-08-01). Minimum C1 Set for ISO 4873 (PDF). ITSCJ/IPSJ. ISO-IR-105.
43. ECMA (1994). "6.2: Fixed coded characters". Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.). p. 7. ECMA-35.
44. ECMA (1994). "6.4.3: Supplementary sets of coded control functions". Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.). p. 11. ECMA-35.
45. ITU (1985). Teletex Primary Set of Control Functions (PDF). ITSCJ/IPSJ. ISO-IR-106.
46. Úřad pro normalizaci a měřeni (1987). The set of control characters of ISO 646, with EM replaced by SS2 (PDF). ITSCJ/IPSJ. ISO-IR-140.
47. ISO/TC 97/SC 2 (1977). The set of control characters of ISO 646, with IS4 replaced by Single Shift for G2 (SS2) (PDF). ITSCJ/IPSJ. ISO-IR-36.
48. ISO/TC97/SC2/WG6. "Liaison statement to ISO/TC97/SC2/WG8 and ISO/TC97/SC18/WG8" (PDF). ISO/TC97/SC2/WG6 N317.rev. Archived from the original (PDF) on 2020-10-26.
49. ISO/TC 97/SC 2 (1982). The C0 set of Control Characters of Japanese Standard JIS C 6225-1979 (PDF). ITSCJ/IPSJ. ISO-IR-74.
50. Printronix (2012). OKI® Programmer's Reference Manual (PDF). p. 26.
51. ISO/TC 46 (1983-06-01). Additional Control Codes for Bibliographic Use according to International Standard ISO 6630 (PDF). ITSCJ/IPSJ. ISO-IR-67.
52. ISO/TC 46 (1986-02-01). Additional Control Codes for Bibliographic Use according to International Standard ISO 6630 (PDF). ITSCJ/IPSJ. ISO-IR-124.
53. Stratus Technologies Ireland, Ltd. "Overview of NLS Strings". National Language Support User's Guide (R212).
54. Stratus Technologies Ireland, Ltd. "The OpenVOS Internal Character Code Set". OpenVOS System Administration: Administering and Customizing a System (R281).
55. Stratus Technologies Ireland, Ltd. "The Supplementary Graphic Character Sets". National Language Support User's Guide (R212).
56. Umamaheswaran, V.S. (1999-11-08). "3.3 Step 2: Byte Conversion". UTF-EBCDIC. Unicode Consortium. Unicode Technical Report #16. Quote: "The 64 control characters […], the ASCII DELETE character (U+007F)[…] are mapped respecting EBCDIC conventions, as defined in IBM Character Data Representation Architecture, CDRA, with one exception -- the pairing of EBCDIC Line Feed and New Line control characters are swapped from their CDRA default pairings to ISO/IEC 6429 Line Feed (U+000A) and Next Line (U+0085) control characters"
57. Steele, Shawn (1996-04-24). cp037_IBMUSCanada to Unicode table. Microsoft/Unicode Consortium.
58. "23.1: Control Codes" (PDF). The Unicode Standard (15.0.0 ed.). Unicode Consortium. 2022. pp. 914–916. ISBN 978-1-936213-32-0.
External links
- The Unicode Standard
- C0 Controls and Basic Latin
- C1 Controls and Latin-1 Supplement
- Control Pictures
- The Unicode Standard, Version 6.1.0, Chapter 16: Special Areas and Format Characters
- ATIS Telecom Glossary 2007
- De litteris regentibus C1 quaestiones septem or Are C1 characters legal in XHTML 1.0?
- W3C I18N FAQ: HTML, XHTML, XML and Control Codes
- International register of coded character sets to be used with escape sequences Archived 2023-05-12 at the Wayback Machine
C0 and C1 control codes
Selected C0 Control Codes
| Code (Decimal) | Name | Function | Citation |
|---|---|---|---|
| 0 | NUL | Null character; padding or terminator | [2] |
| 7 | BEL | Audible bell or visual alert | [2] |
| 9 | HT | Horizontal tabulation | [3] |
| 10 | LF | Line feed | [3] |
| 13 | CR | Carriage return | [3] |
| 27 | ESC | Escape; introduces control sequences | [1] |
Selected C1 Control Codes
| Code (Decimal) | Name | Function | Citation |
|---|---|---|---|
| 128 | PAD | Padding character | [1] |
| 133 | NEL | Next line (CR + LF equivalent) | [3] |
| 155 | CSI | Control Sequence Introducer; starts parameterized sequences (e.g., for cursor movement) | [1] |
| 156 | ST | String Terminator | [1] |
Overview and History
Definition and Purpose
C0 and C1 control codes refer to specific ranges of non-printable characters within 7-bit and 8-bit coded character sets, designed for signaling and control rather than representing visible graphic symbols. The C0 set occupies the first 32 positions (codes 0x00 to 0x1F) in both 7-bit and 8-bit encodings and is conventionally discussed together with code 0x7F (DELETE), while the C1 set comprises the 32 positions 0x80 to 0x9F, available only in 8-bit extensions.[5][6]
These control codes serve to manage the operation of peripherals and text processing systems, such as printers, terminals, and displays, by issuing commands for actions like formatting text layout (for instance, initiating line breaks) and regulating data transmission. Their primary purpose is to enable efficient interchange and processing of information without altering the visual content, with semantics typically defined by higher-level protocols for device-specific behaviors.[5][6]
Unlike graphic characters, which form the printable repertoire of a character set, C0 and C1 codes do not produce visible output but instead trigger functional responses in receiving systems; they can also form the basis for escape sequences that invoke additional control or character set changes. In standards like ISO 646, which defines a 7-bit code structure with 128 total positions, the C0 set is mandatory as the 32 control positions, leaving 96 for graphic characters, while C1 extends this framework in 8-bit codes to support more advanced control functions.[5][7][6]
Historical Development
The origins of C0 and C1 control codes trace back to 19th-century telegraphy, where early binary encoding systems laid the groundwork for non-printing characters used in device control and formatting. Émile Baudot's 5-bit telegraph code, patented in 1874, introduced uniform-length binary sequences for letters, numbers, and symbols, marking the first widely adopted digital communication protocol that influenced subsequent character encodings.[8] By the early 20th century, these concepts evolved into teletype systems, such as those based on the International Telegraph Alphabet No. 2 (ITA2), a refined Baudot variant standardized in 1930 for mechanical teleprinters, which incorporated basic control functions for shifting between character sets and managing transmission.[8]
The standardization of the C0 control set began in the mid-20th century with the development of ASCII in 1963, formally known as ASA X3.4-1963, which defined 32 control characters in positions 00–1F hexadecimal for functions like carriage return and line feed, alongside the delete (DEL) character at position 7F to aid tape erasure.[9] This 7-bit code was quickly adopted internationally through the International Reference Version (IRV) of ISO 646, published in 1973, which harmonized national variants while preserving the core C0 controls to ensure compatibility across telegraph and computing systems. In Europe, ECMA-6, adopted in 1965, mirrored ASCII's structure by specifying a 7-bit coded character set with up to 32 C0 controls, facilitating input/output operations in early computers and peripherals.[10]
The C1 control set emerged in the 1970s to extend capabilities for 8-bit environments, with ECMA-48's first edition in March 1976 introducing codes in the 80–9F hexadecimal range, including the Control Sequence Introducer (CSI) for parameterized device commands.[11] This was harmonized internationally via ISO 6429, first published in 1983, which defined C1 functions for advanced formatting and is used within the ISO/IEC 2022 code extension framework, which enables dynamic switching between C0/C1 sets and graphic character sets.[12][13] Key milestones included minor refinements in ECMA-48's subsequent editions, culminating in the 5th edition of June 1991, which added controls for coded character imaging while maintaining backward compatibility.[1]
International harmonization efforts addressed national variants in ISO 646, where differing graphic symbols were allowed but C0 controls remained invariant to promote interoperability, as seen in standards like the UK's BS 4731 and France's AFNOR variants from the 1970s. By the 1990s, with Unicode's adoption in 1991 incorporating C0 and C1 as fixed ranges (U+0000–U+001F and U+0080–U+009F), the sets achieved global stability under a policy prohibiting removals or reassignments, ensuring no major changes through 2025 despite ongoing digital evolution.[4]
Standard Control Codes
C0 Control Codes
The C0 set comprises 32 non-printing control characters assigned to bit combinations 00/00 through 01/15 (hexadecimal 00 to 1F). Together with DELETE (DEL) at 07/15 (hexadecimal 7F), which is conventionally listed alongside them, this gives the 33 control positions shared by 7-bit character encoding systems.[14] These codes are designed primarily for managing data interchange, transmission control, text formatting, and device operations in early computing and telecommunications environments.[1] Unlike graphic characters, C0 codes do not represent visible symbols but instead trigger specific actions in receiving devices, such as terminals or printers.[15] The following table enumerates the standard C0 control codes and DEL, including their names, hexadecimal values, and primary functions as specified in the relevant international standards.[14][1]
| Hex | Name | Primary Function |
|---|---|---|
| 00 | NULL (NUL) | Acts as a filler or padding character with no effect on data content, often used for media-fill or time-fill. |
| 01 | START OF HEADING (SOH) | Marks the beginning of a message heading or control block in data streams. |
| 02 | START OF TEXT (STX) | Indicates the start of the textual content, terminating any preceding heading. |
| 03 | END OF TEXT (ETX) | Signals the conclusion of a block of text. |
| 04 | END OF TRANSMISSION (EOT) | Denotes the end of a complete transmission, potentially including multiple texts. |
| 05 | ENQUIRY (ENQ) | Requests a response from the receiving station, such as status information. |
| 06 | ACKNOWLEDGE (ACK) | Provides affirmative confirmation of successful data receipt. |
| 07 | BELL (BEL) | Triggers an audible or visible alert to notify the operator. |
| 08 | BACKSPACE (BS) | Moves the active position one character backward, sometimes interpreted as non-destructive in display systems. |
| 09 | HORIZONTAL TABULATION (HT) | Advances the position to the next predefined tab stop on the current line. |
| 0A | LINE FEED (LF) | Advances the active position to the next line, maintaining the horizontal position. |
| 0B | VERTICAL TABULATION (VT) | Moves to the next predefined vertical tab stop. |
| 0C | FORM FEED (FF) | Advances to the starting position on the next page or form. |
| 0D | CARRIAGE RETURN (CR) | Returns the active position to the beginning of the current line. |
| 0E | SHIFT OUT (SO) | Invokes an alternate graphic character set, as per code extension techniques. |
| 0F | SHIFT IN (SI) | Reverts to the standard (primary) character set. |
| 10 | DATA LINK ESCAPE (DLE) | Modifies the interpretation of subsequent characters for transmission control. |
| 11 | DEVICE CONTROL ONE (DC1) | Activates or initializes a device, often used as X-ON for flow control. |
| 12 | DEVICE CONTROL TWO (DC2) | Triggers device-specific operations or modes. |
| 13 | DEVICE CONTROL THREE (DC3) | Deactivates or halts a device, often used as X-OFF for flow control. |
| 14 | DEVICE CONTROL FOUR (DC4) | Interrupts or stops device operation. |
| 15 | NEGATIVE ACKNOWLEDGE (NAK) | Indicates refusal or error in response to a transmission. |
| 16 | SYNCHRONOUS IDLE (SYN) | Maintains timing synchronization during idle periods in synchronous transmission. |
| 17 | END OF TRANSMISSION BLOCK (ETB) | Marks the end of a logical block within a larger transmission. |
| 18 | CANCEL (CAN) | Aborts the current procedure and ignores preceding data as erroneous. |
| 19 | END OF MEDIUM (EM) | Identifies the physical end of a recording medium or data segment. |
| 1A | SUBSTITUTE (SUB) | Replaces invalid or erroneous characters to prevent processing errors. |
| 1B | ESCAPE (ESC) | Serves as a prefix to introduce extended control sequences for additional functions. |
| 1C | FILE SEPARATOR (FS) | Logically separates higher-level data structures, such as files. |
| 1D | GROUP SEPARATOR (GS) | Delimits subgroups within a larger data hierarchy. |
| 1E | RECORD SEPARATOR (RS) | Separates individual records in a structured dataset. |
| 1F | UNIT SEPARATOR (US) | Divides the smallest units, such as fields, within a record. |
| 7F | DELETE (DEL) | Obliterates or erases data; on paper tape, punching all holes over an erroneous character marked it to be ignored. |
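The column/row notation used above (00/00 through 01/15, and 07/15 for DEL) is simply the high and low four bits of the code written in decimal. The following is a minimal sketch of that conversion; the function names `to_column_row` and `is_c0_or_del` are hypothetical helpers introduced here for illustration, not part of any standard or library.

```python
def to_column_row(code: int) -> str:
    """Return the column/row form (e.g. '01/11') of a single-byte code."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a value in the range 0x00-0xFF")
    column, row = divmod(code, 16)  # high nibble = column, low nibble = row
    return f"{column:02d}/{row:02d}"


def is_c0_or_del(code: int) -> bool:
    """True for the 32 C0 positions (0x00-0x1F) and for DEL (0x7F)."""
    return 0x00 <= code <= 0x1F or code == 0x7F


for code in (0x00, 0x1B, 0x1F, 0x7F):
    print(f"0x{code:02X} -> {to_column_row(code)}  control={is_c0_or_del(code)}")
```

Counting the byte values for which `is_c0_or_del` is true gives the 33 positions listed in the table.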
C1 Control Codes
The C1 control codes constitute the secondary control set defined in ISO/IEC 6429 (harmonized from ECMA-48), occupying hexadecimal positions 80–9F in 8-bit coded character sets.[1] This set extends the basic transmission and formatting capabilities of the C0 controls with functions for structured text manipulation, device coordination, and the introduction of control sequences and control strings, which are essential for advanced applications like video displays and printers.[1] Unlike the C0 set, C1 codes require an 8-bit environment for direct transmission; in 7-bit systems, each is represented by a two-byte escape sequence consisting of the ESC (1B hex) character followed by a final byte in the range 40–5F hex.[1]
Key functions in the C1 set support enhanced formatting, such as Next Line (NEL, 85 hex), which repositions the active cursor to the first character position on the subsequent line, functioning equivalently to a combined carriage return and line feed in many systems.[16] The Control Sequence Introducer (CSI, 9B hex) enables parameterized commands for precise control, such as cursor movement or attribute setting in terminal emulators; each such sequence ends with its own final byte. The String Terminator (ST, 9C hex) instead closes control strings opened by DCS, OSC, PM, APC, or SOS, so that their payload cannot be confused with ordinary data.[16] Additional codes facilitate device-specific operations, including Start of Protected Area (SPA, 96 hex) and End of Protected Area (EPA, 97 hex), which designate text regions immune to erasure or overwrite in interactive displays.[16] Codes like Device Control String (DCS, 90 hex) allow for vendor-defined commands, supporting customization in hardware like impact printers.[1]
The C1 set reserves several positions for private or national use, such as Private Use One (PU1, 91 hex) and Private Use Two (PU2, 92 hex), permitting implementers to assign non-standard functions through bilateral agreement without conflicting with the core standard.[16] Overall, these codes emphasize sequential and contextual control, distinguishing them from the immediate-action focus of C0, and their adoption reflects the evolution toward interoperable, feature-rich text processing in international standards.[1] The 32 C1 positions, as specified in ISO/IEC 6429 and ECMA-48, are detailed in the following table, including hexadecimal values, abbreviated names, and concise functional descriptions:[1][16]
| Hex | Name | Description |
|---|---|---|
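| 80 | PAD | Padding character for time-fill or media synchronization; a proposed function that is not formally assigned in ECMA-48. |
| 81 | HOP | High octet preset for subsequent code extension techniques; a proposed function that is not formally assigned in ECMA-48. |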
| 82 | BPH | Signals a permissible point for line breaking during text formatting. |
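| 83 | NBH | Prohibits line breaking at the current position in formatted text. |
| 84 | IND | Moves the active position down one line without changing the column; removed in the 1991 edition of ECMA-48. |
| 85 | NEL | Moves the active position to the initial position on the following line. |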
| 86 | SSA | Designates the start of a selectable or transmittable text area. |
| 87 | ESA | Designates the end of a selectable or transmittable text area. |
| 88 | HTS | Establishes a horizontal tab stop at the current active position. |
| 89 | HTJ | Advances to the next tab stop and performs character justification. |
| 8A | VTS | Establishes a vertical tab stop at the current active line. |
| 8B | PLD | Shifts the active position forward by a partial line increment for imaging. |
| 8C | PLU | Shifts the active position backward by a partial line increment for imaging. |
| 8D | RI | Moves the active position to the corresponding character position on the preceding line. |
| 8E | SS2 | Temporarily invokes the G2 character set for the immediately following graphic character. |
| 8F | SS3 | Temporarily invokes the G3 character set for the immediately following graphic character. |
| 90 | DCS | Introduces a device-specific control string, terminated by ST. |
| 91 | PU1 | Reserved for private, user-defined control functions. |
| 92 | PU2 | Reserved for private, user-defined control functions. |
| 93 | STS | Configures the transmit state for data flow from the device. |
| 94 | CCH | Invalidates the effect of the preceding character in the stream. |
| 95 | MW | Activates a message-waiting indicator on the receiving device. |
| 96 | SPA | Designates the start of a protected or guarded text area. |
| 97 | EPA | Designates the end of a protected or guarded text area. |
| 98 | SOS | Introduces a delimited string for special processing, terminated by ST. |
| 99 | SGCI | Introduces a single graphic character for intermediate processing; a proposed function that is not formally assigned in ECMA-48. |
| 9A | SCI | Introduces a single control function or character. |
| 9B | CSI | Introduces a control sequence, optionally with parameters and intermediates. |
| 9C | ST | Terminates control strings initiated by DCS, OSC, PM, APC, or SOS. |
| 9D | OSC | Introduces an operating system command string, terminated by ST. |
| 9E | PM | Introduces a privacy or user message string, terminated by ST. |
| 9F | APC | Introduces an application program command string, terminated by ST. |
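The 7-bit representation of a C1 code (ESC followed by a byte in the range 40–5F hex) can be computed by subtracting 40 hex from the 8-bit value. The sketch below illustrates that arithmetic; `c1_to_esc_fe` and `esc_fe_to_c1` are hypothetical helper names chosen for this example.

```python
ESC = 0x1B


def c1_to_esc_fe(c1: int) -> bytes:
    """Return the two-byte 7-bit form (ESC Fe) of an 8-bit C1 code."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("C1 codes lie in the range 0x80-0x9F")
    return bytes([ESC, c1 - 0x40])  # 0x80-0x9F maps onto 0x40-0x5F


def esc_fe_to_c1(sequence: bytes) -> int:
    """Return the 8-bit C1 code for a two-byte ESC Fe sequence."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("expected ESC followed by a byte in 0x40-0x5F")
    return sequence[1] + 0x40


print(c1_to_esc_fe(0x9B))     # CSI -> b'\x1b['
print(c1_to_esc_fe(0x85))     # NEL -> b'\x1bE'
print(hex(esc_fe_to_c1(b"\x1b]")))  # OSC -> 0x9d
```

The familiar `ESC [` prefix of terminal control sequences is therefore the 7-bit spelling of CSI.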
Alternative Control Code Sets
Alternative C0 Sets
In certain specialized systems, alternative C0 control sets deviate from the ISO 646 baseline to accommodate domain-specific requirements, such as enhanced formatting for transmission protocols or display capabilities. For instance, the ANPA-1312 standard, developed for news wire services by the American Newspaper Publishers Association, makes extensive use of C0 characters such as SOH (start of heading), STX (start of text), ETX (end of text), and EOT (end of transmission) for markup and segmentation in text transmission, repurposing them beyond general-purpose control while keeping their positions in the 7-bit code. This approach arose from legacy hardware constraints in 1970s teletype systems, prioritizing reliable news dissemination over universal compatibility.[17]
Videotex systems, including the British Prestel service, introduce a variant C0 set tailored for interactive display and mosaic graphics, as defined in the North American Presentation Level Protocol Syntax (PLPS). Here, standard format effectors like BS (backspace) and LF (line feed) are supplemented or replaced by adjacent positioning controls such as APB (0/8, adjacent back), APF (0/9, adjacent forward), APD (0/10, adjacent down), and APU (0/11, adjacent up), which enable precise cursor movement for rendering alphanumeric and block mosaic characters without disrupting screen layout. Additional codes like CS (0/12, clear screen), APR (0/13, adjacent return), and NSR (1/15, new screen request) support page-based navigation and reset functions essential for low-bandwidth terminal interactions. These adaptations stem from the need to optimize 7-bit or 8-bit channels for consumer-grade modems and televisions, with mosaic controls invoking G3 sets via SS3 (1/13) for semigraphic elements like diagonals and lines.[18]
In EBCDIC, IBM's 8-bit encoding for mainframe systems, C0-equivalent controls occupy positions 0x00 to 0x3F but with significant shifts from ASCII/ISO alignments, reflecting punch-card heritage and hardware-specific signaling. For example, NUL remains at X'00', SOH at X'01', and CR at X'0D', but LF is relocated to X'25' (outside the traditional C0 range), NL (new line, X'15') serves a combined CR+LF function, and IBM-specific controls such as RES (restore, X'14') have no ASCII counterpart. This mapping supports legacy peripherals like tape drives and printers, where bit patterns prioritize BCD compatibility over international standardization.[19]
The CCITT T.61 recommendation for Teletex and telematic services adheres closely to ISO 646 C0 without positional deviations, defining standard functions like HT (horizontal tabulation), VT (vertical tabulation), and FF (form feed) for document interchange, though invocation via ESC sequences allows context-specific extensions. Such alternatives often emerged from industry needs, including banking protocols requiring custom delimiters or hardware limitations in 1980s systems, but convergence toward ISO standards after the 1980s limited their proliferation. ISO 2022 permits an alternative C0 set to be designated with the sequence ESC 02/01 F, where F is the final byte identifying the set, but in practice the primary C0 set is left in place to ensure interoperability, with G0/G1 shifts handling graphic variants instead.
Alternative C1 Sets
Alternative C1 sets emerged in proprietary, sector-specific, and international standards to address specialized requirements beyond the baseline ISO/IEC 6429 C1 controls, often redefining the 0x80–0x9F range for enhanced formatting, graphics, or data processing functions. In Videotex systems, particularly those standardized by ETSI for services like the French Minitel, the C1 set incorporates extensions for visual presentation, including color selection (e.g., 0x90 for foreground color), mosaic graphics rendering (e.g., 0x97 for mosaic block selection), and display adjustments such as size control (e.g., 0x8B followed by parameters for double height or width). These deviate from the standard C1 by prioritizing telematic and interactive display features over general text processing.[20][18]
IBM's EBCDIC encoding distributes control functions differently across its 8-bit structure: C1-equivalent operations share the 0x00–0x3F control region with the C0 equivalents (e.g., New Line at 0x15, which corresponds to the C1 control NEL), and several of these positions support device-specific operations like printer formatting in legacy systems.[19][21]
Notable deviations appear in code pages like Windows-1252, where most of the 0x80–0x9F range is assigned to printable graphic characters (e.g., 0x80 as the euro sign, 0x92 as a right single quotation mark) instead of control functions, effectively repurposing the C1 block for Western European text display in Microsoft environments.[22] Private C1 sets, registered under ISO 2022 for OSI application layers, allow invocation via escape sequences for domain-specific uses, such as locking shifts in telematic protocols.[23]
The CCITT (now ITU-T) Recommendation T.61 for Teletex defines a specialized C1 set focused on document interchange, including codes for page ejection (0x0C shifted), superscript/subscript toggles, and fixed-spacing modes to support international telex-like formatting in early electronic mail and facsimile systems.[24]
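The Windows-1252 deviation can be observed by decoding the same bytes under two codecs. The sketch below uses Python's built-in `latin-1` and `cp1252` codecs; the choice of sample bytes (0x80 and 0x92) is only for illustration.

```python
import unicodedata

raw = bytes([0x80, 0x92])

as_latin1 = raw.decode("latin-1")  # ISO 8859-1: bytes map straight to C1 controls
as_cp1252 = raw.decode("cp1252")   # Windows-1252: same bytes are graphic characters

for label, text in (("ISO-8859-1", as_latin1), ("Windows-1252", as_cp1252)):
    for ch in text:
        # Control characters have no Unicode name, so supply a placeholder.
        name = unicodedata.name(ch, "<control>")
        print(f"{label:13} U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

Under ISO 8859-1 both bytes come out with general category Cc, while under Windows-1252 they decode to U+20AC (EURO SIGN) and U+2019 (RIGHT SINGLE QUOTATION MARK). Mislabelled Windows-1252 text is a common reason for C1 code points turning up in decoded data.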
Early proposals during Unicode's development in the late 1980s and early 1990s explored custom C1 sets, such as the bibliographic-oriented DIN 31626 registered in 1979, but the standard ultimately adopted the ISO C1 set while reserving flexibility for private-use controls in terminal and legacy integrations.[25] By the 1990s, adoption shifted toward escape sequences (e.g., ECMA-48 CSI sequences) for invoking advanced functions, reducing reliance on raw alternative C1 codes and confining their use to legacy hardware and protocols.[1]
Encoding and Representation
In ASCII and ISO Standards
In the American Standard Code for Information Interchange (ASCII), defined as a 7-bit code in ANSI X3.4-1986 and harmonized with ISO/IEC 646, the C0 control codes occupy bit positions 0/0 through 1/15, corresponding to hexadecimal values 00 through 1F, with 7-bit patterns ranging from 0000000 to 0011111.[26] The delete character (DEL) is positioned at 7/15 (hexadecimal 7F, bit pattern 1111111).[26] ASCII provides no native positions for C1 control codes, as it is limited to 7 bits; instead, C1 emulation in 7-bit environments uses the escape character (ESC, hexadecimal 1B, bit pattern 0011011) followed by a byte in the range 40 through 5F hexadecimal (bit patterns 1000000 to 1011111).[11] A C1 code (hexadecimal 80–9F) is represented as ESC followed by the C1 value minus 80 hexadecimal plus 40 hexadecimal, that is, the C1 value minus 40 hexadecimal, which preserves the low five bits of the C1 code.[1] For example, the control sequence introducer (CSI, hexadecimal 9B) is emulated as ESC followed by 5B hexadecimal (ESC [).[11]
In 8-bit extensions like the ISO/IEC 8859 series (e.g., ISO/IEC 8859-1), the C0 set remains fixed at positions 00–1F hexadecimal, matching ASCII for compatibility.[27] The C1 set is assigned to positions 80–9F hexadecimal (bit patterns 10000000 through 10011111), enabling direct single-byte transmission in 8-bit environments.[11] These standards reserve the first two columns of each half of the 16×16 code table (columns 0–1 and 8–9) for control functions, so that C0 occupies the CL (control left) area and C1 the CR (control right) area.[27][1]
Invocation and shifting mechanisms follow ISO/IEC 2022 (ECMA-35), which supports both 7-bit and 8-bit codes.[28] In a 7-bit code, the C0 controls shift-in (SI, hexadecimal 0F) and shift-out (SO, hexadecimal 0E) invoke the G0 and G1 graphic sets, respectively, into the GL area, facilitating 7-bit safe transmission where the eighth bit may serve as parity.[29][11] For C1 designation, the sequence ESC followed by 02/02 and a final byte F (e.g., ESC 02/02 04/02 for the standard C1 set) identifies and enables the C1 control set.[29] In 7-bit systems lacking direct C1 support, individual C1 functions are invoked via the ESC Fe sequence, where Fe is a byte from 40–5F hexadecimal, ensuring compatibility by transmitting C1 equivalents as two 7-bit bytes over parity-aware channels.[29][11] This approach allows 7-bit devices to process 8-bit C1 controls without loss, though it doubles the byte count for those codes.[30]
In Unicode
In Unicode, the C0 control codes are assigned to the code points U+0000 through U+001F and U+007F (DELETE), while the C1 control codes occupy U+0080 through U+009F, all located within the Basic Multilingual Plane (BMP) for compatibility with legacy 7-bit and 8-bit encodings.[5] These 65 code points (33 for C0 including DEL and 32 for C1) are classified under the Unicode category "Cc" (Other, Control), distinguishing them from graphic characters and treating them as non-renderable controls whose semantics are defined by external protocols rather than Unicode itself.[5] Applications must preserve these codes during interchange to maintain integrity, and Unicode normalization forms apply identity mappings to them without alteration or removal.[5] Unicode supports the interpretation of these controls in accordance with ISO/IEC 6429:1992, where the ESCAPE character (U+001B) initiates control sequences such as Control Sequence Introducer (CSI) for formatting and device control.[5] This ensures compatibility with ISO/IEC 2022 escape sequences, allowing higher-level protocols to define behaviors like cursor movement or screen clearing. 
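The classification and normalization behaviour described above can be checked with Python's standard `unicodedata` module. This is a small illustrative sketch; the variable names are arbitrary, and the results depend on the Unicode data bundled with the Python build, which is assumed here to agree with the standard on these long-stable code points.

```python
import unicodedata

c0_and_del = [chr(cp) for cp in range(0x00, 0x20)] + [chr(0x7F)]
c1 = [chr(cp) for cp in range(0x80, 0xA0)]
controls = c0_and_del + c1

# Collect the distinct general categories assigned to the 65 code points.
categories = {unicodedata.category(ch) for ch in controls}

# Every normalization form should map each control to itself.
unchanged = all(
    unicodedata.normalize(form, ch) == ch
    for form in ("NFC", "NFD", "NFKC", "NFKD")
    for ch in controls
)

print(len(c0_and_del), len(c1), len(controls))  # 33 32 65
print(categories)                               # {'Cc'}
print(unchanged)                                # True
```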
Additionally, Unicode's format characters in the General Punctuation block (U+2000 through U+206F), such as the ZERO WIDTH NON-JOINER (U+200C) and INVISIBLE SEPARATOR (U+2063), extend concepts from C1 controls by providing invisible operators for text shaping and layout without affecting visible rendering.[5] Certain C1 code points carry aliases reflecting historical proposals, such as U+0080 (PADDING CHARACTER) and U+0081 (HIGH OCTET PRESET), but Unicode strongly discourages remapping these positions to graphic characters or alternative semantics to preserve interoperability and stability.[31][4] The C0 controls were included from Unicode 1.0 (1991), with the full C1 set integrated in the same initial release as part of the Latin-1 compatibility block, and no substantive changes have occurred since.[32] As of Unicode 17.0 (2025), these assignments remain stable, with ongoing policies prohibiting their repurposing as private-use areas.[4]
In UTF-8 encoding, C0 controls are represented as single-byte sequences from 00 to 1F and 7F, directly matching their code point values, while C1 controls use two-byte sequences from C2 80 to C2 9F to avoid conflicts with ASCII and ensure lossless round-trip conversion from legacy 8-bit standards.[5] This encoding preserves the controls' positions and behaviors across modern text processing, although formats such as HTML and XML place their own restrictions on which controls may appear and how they must be escaped.[5]
In EBCDIC and Legacy Systems
In EBCDIC, the Extended Binary Coded Decimal Interchange Code primarily used on IBM mainframe systems, the C0 control codes are not assigned to a contiguous block of code points from 0x00 to 0x1F as in ASCII and ISO 646. Instead, they are scattered across the lower range of code points, reflecting EBCDIC's origins in punched card encodings and early IBM hardware designs. For instance, in code page 1047 (CCSID 1047), a common EBCDIC variant for open systems, the null character (NUL) is at 0x00, carriage return (CR) at 0x0D, and line feed (LF) at 0x25, while the delete character (DEL) sits at 0x07 rather than at the ASCII position 0x7F.[33][34]
Extended EBCDIC code pages provide partial support for C1-like control functions, which are interleaved with the C0 equivalents in the 0x00–0x3F control region rather than forming a contiguous set compatible with ISO/IEC 6429. In IBM 1047, examples include the control sequence introducer (CSI) at 0x3B (mapping to Unicode U+009B) and single character introducer (SCI) at 0x3A (U+009A), which enable escape sequences for terminal control, while other C1 codes like privacy message (PM) appear at 0x3E (U+009E). These mappings allow limited emulation of C1 behaviors through software interpretation, but direct hardware support is absent, leading to incompatibilities in data interchange. Conversion tables between EBCDIC and ASCII, such as those used in mainframe utilities, adjust for these differences; for example, EBCDIC LF (0x25) maps to ASCII LF (0x0A), and CR remains at 0x0D in both, though new line sequences may require additional handling, such as converting EBCDIC NL (0x15) to an ASCII CR-LF pair.[33]
Legacy systems beyond standard EBCDIC implementations further diverge in control code usage. Punched card systems, foundational to EBCDIC's development, employed Hollerith codes, a 12-row, 80-column format influenced by early Baudot-like 5- and 6-bit encodings, in which control functions like end-of-record were indicated by specific hole patterns rather than byte values; EBCDIC later adopted compatible hole assignments for shared controls such as CR. IBM 3270 terminals, used in mainframe environments, relied on custom orders embedded in the data stream rather than standard C0 or C1 codes; these include start field (SF) at 0x1D for attribute setting and set buffer address (SBA) at 0x11 for cursor positioning, processed by the terminal's firmware to manage screen displays. Mainframe utilities like those in z/OS handle these through proprietary protocols, often requiring explicit translation.[35][36][37]
EBCDIC lacks native, direct support for the full C1 set, necessitating software-based translations in tools such as iconv for interoperability with ASCII or Unicode systems, where escape sequences approximate missing functions. By 2025, EBCDIC control handling in z/OS primarily occurs through emulation layers, supporting legacy applications while facilitating conversions to modern encodings. This has driven a shift toward Unicode for enhanced interoperability since the 1990s, reducing reliance on EBCDIC's scattered controls in new developments, though it remains entrenched in existing mainframe data processing.[38][19]
| Control Function | EBCDIC 1047 (Hex) | ASCII / ISO 8859-1 Equivalent (Hex) | Notes |
|---|---|---|---|
| NUL | 0x00 | 0x00 | Null terminator |
| CR | 0x0D | 0x0D | Carriage return |
| LF | 0x25 | 0x0A | Line feed; requires mapping |
| ESC | 0x27 | 0x1B | Escape for sequences |
| CSI | 0x3B | 0x9B | C1-like; partial support |
| DEL | 0x07 | 0x7F | Delete; relocated from the ASCII position 0x7F |
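The positions in the table can be explored with an EBCDIC codec. Python's standard library does not ship a code page 1047 codec, so the sketch below uses the built-in `cp037` codec as a stand-in, on the assumption that it agrees with code page 1047 for the control positions shown; the `EBCDIC_CONTROLS` dictionary is a hypothetical name introduced for this example.

```python
# EBCDIC byte values taken from the table above.
EBCDIC_CONTROLS = {
    "NUL": 0x00,
    "CR": 0x0D,
    "LF": 0x25,
    "ESC": 0x27,
    "CSI": 0x3B,
    "DEL": 0x07,
}

# Decode each EBCDIC byte and record the resulting Unicode code point.
decoded = {
    name: ord(bytes([value]).decode("cp037"))
    for name, value in EBCDIC_CONTROLS.items()
}

for name, value in EBCDIC_CONTROLS.items():
    print(f"{name}: EBCDIC 0x{value:02X} -> U+{decoded[name]:04X}")

# NL (0x15) has no C0 counterpart; this codec maps it to the C1 control NEL.
print(f"NL: EBCDIC 0x15 -> U+{ord(bytes([0x15]).decode('cp037')):04X}")
```

Real conversions between z/OS and ASCII systems may swap the LF and NL mappings depending on the converter, so a table like this should be checked against the specific tool in use.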
Modern Usage and Implications
In Terminals and Display Systems
In terminal emulators such as xterm, GNOME Terminal, and iTerm2, C0 and C1 control codes remain integral for implementing ANSI and VT100-compatible sequences that manage cursor movement, colors, and text attributes. For instance, the C1 Control Sequence Introducer (CSI, ESC [ or single-byte 0x9B) enables commands like ESC [A to move the cursor up one line or ESC [31m to set red foreground color, ensuring compatibility with legacy applications while supporting modern rendering.[39][40][41] Modern extensions build on these foundations; the xterm-256color terminfo entry, widely adopted in emulators like GNOME Terminal (default since version 3.16), leverages C1-derived sequences such as Operating System Command (OSC, ESC ] or 0x9D) to set window titles or manipulate 256-color palettes.[39][41] Similarly, web-based terminals like xterm.js preserve core C0 functions, including Line Feed (LF, 0x0A) and Carriage Return (CR, 0x0D), to accurately render line breaks and cursor positioning in browser environments.[42] Network protocols like SSH and Telnet facilitate the transmission of raw C0 and C1 codes over connections, allowing remote terminals to interpret them directly for interactive sessions. 
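The sequences mentioned above can be assembled as ordinary strings using the 7-bit forms of CSI and OSC. The sketch below is illustrative only: `cursor_up`, `sgr`, and `set_title` are hypothetical helper names, and the `OSC 0 ; text ST` form for window titles is the xterm convention, which other emulators are assumed rather than guaranteed to honour.

```python
import sys

ESC = "\x1b"
CSI = ESC + "["   # 7-bit form of the C1 code 0x9B
OSC = ESC + "]"   # 7-bit form of the C1 code 0x9D
ST = ESC + "\\"   # 7-bit form of the C1 code 0x9C


def cursor_up(lines: int = 1) -> str:
    """CSI n A: move the cursor up by the given number of lines."""
    return f"{CSI}{lines}A"


def sgr(*params: int) -> str:
    """CSI ... m: select graphic rendition (31 = red foreground, 0 = reset)."""
    return f"{CSI}{';'.join(str(p) for p in params)}m"


def set_title(title: str) -> str:
    """OSC 0 ; title ST: set the window title (xterm convention)."""
    return f"{OSC}0;{title}{ST}"


# Only emit raw escape sequences when writing to an actual terminal.
if sys.stdout.isatty():
    sys.stdout.write(set_title("demo") + sgr(31) + "red text" + sgr(0) + "\n")
else:
    print(repr(sgr(31)), repr(cursor_up()), repr(set_title("demo")))
```

In practice, applications usually obtain these strings from the terminfo database rather than hard-coding them, so that terminals with different capabilities are handled correctly.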
Emulators claiming VT220 or higher compatibility, such as those using the VTE library in GNOME Terminal, support C1 functions like Next Line (NEL, 0x85), which combines a line feed with a carriage return so that output continues at the start of the following line.[43][44] As of 2025, Wayland-native terminals such as Foot continue to implement these codes, though raw 8-bit C1 usage has declined in UTF-8-dominant environments, where the bytes 0x80–0x9F serve as continuation bytes of multi-byte sequences and the two-byte ESC forms are used instead.[45] The ncurses library exemplifies this evolution by generating C0/C1-equivalent escape sequences via terminfo databases, optimizing output for diverse terminals without direct byte manipulation.[46]
Physical hardware terminals are rare today, with C0 and C1 primarily emulated in software-based virtual consoles, such as Linux's /dev/tty devices, which interpret a subset of these codes for text-mode operation on framebuffers.[47]
In Programming and Data Processing
In contemporary programming languages, C0 and C1 control codes are integral to text manipulation and validation routines. For instance, Python's str.splitlines() method treats the C0 controls LF (U+000A), CR (U+000D), VT (U+000B), FF (U+000C), FS (U+001C), GS (U+001D), and RS (U+001E), as well as the C1 control NEL (U+0085), as line boundaries; it treats CR LF as a single line break and drops the break characters from the returned lines unless keepends=True is passed. Similarly, Java's Character.isISOControl() method reports characters in the ranges U+0000–U+001F and U+007F–U+009F as ISO control characters, which lets developers filter or validate input strings. In Rust, char::from_u32() constructs a char from a Unicode code point and accepts every value in the C0 and C1 blocks (U+0000–U+009F), which is useful when parsing legacy or binary data streams.
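The line-splitting behaviour of Python's `str.splitlines()` can be illustrated with a string that mixes several control codes. The sample text below is made up for the example.

```python
text = "one\r\ntwo\nthree\rfour\x1cfive\x85six"

# Default: line-break characters are dropped; CR LF counts as one break.
lines = text.splitlines()
print(lines)

# keepends=True retains the original break characters on each line.
kept = text.splitlines(keepends=True)
print(kept)

# US (U+001F) is not treated as a line boundary by str.splitlines().
print("a\x1fb".splitlines())

# bytes.splitlines() only recognises LF, CR, and CR LF.
print(b"a\x1cb\nc".splitlines())
```

Because FS, GS, RS, and NEL count as line boundaries for `str` but not for `bytes`, the same data can split differently depending on whether it has been decoded.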
Specialized libraries extend this handling for terminal and internationalization contexts. The ncurses library, widely used for text-based user interfaces in Unix-like environments, does not hard-code control sequences such as CSI cursor commands; it looks up the sequences for the current terminal in the terminfo database and emits them with functions such as tputs(), which gives device-independent rendering. PDCurses, a portable implementation of the curses API, achieves cross-platform behavior by translating curses calls into native console API calls on platforms such as Windows. For Unicode-aware applications, the International Components for Unicode (ICU) library provides normalization functions that leave C0 and C1 codes unchanged, in line with Unicode Standard Annex #15, while its collation service treats most of them as ignorable by default.
Data formats impose specific rules for C0 and C1 inclusion to maintain portability. In JSON, the control characters U+0000 through U+001F must be escaped inside strings (e.g., as \n or \u000A for LF), and only HT, LF, and CR are allowed as whitespace between tokens, per RFC 8259; DEL and the C1 codes may appear unescaped. XML 1.0 permits only HT, LF, and CR from the C0 set; the remaining C0 codes, including NUL (U+0000), cannot appear in a well-formed document even as numeric character references. CSV as described in RFC 4180 uses the comma as its field delimiter and CR LF as its record terminator; tab-separated variants instead use the C0 control HT (U+0009) as the delimiter, and fields containing line breaks or delimiters must be quoted, which libraries such as Python's csv module handle automatically.
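The JSON and XML rules can be observed with Python's standard `json` and `xml.etree.ElementTree` modules. This is a minimal sketch; `json_accepts` and `xml_accepts` are hypothetical helpers written for this example, and the behaviour shown is that of these particular parsers.

```python
import json
import xml.etree.ElementTree as ET

# Serialising: C0 controls inside a string are escaped automatically.
escaped = json.dumps("a\nb\x01")
print(escaped)

# With ensure_ascii=False, DEL and C1 codes are written unescaped.
unescaped_high = json.dumps("\x7f\x85", ensure_ascii=False)


def json_accepts(document: str) -> bool:
    """Return True if the strict JSON parser accepts the document."""
    try:
        json.loads(document)
    except json.JSONDecodeError:
        return False
    return True


def xml_accepts(document: str) -> bool:
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(json_accepts('"a\\nb"'))       # escaped LF inside a string
print(json_accepts('"a\nb"'))        # raw LF inside a string
print(xml_accepts("<a>&#9;</a>"))    # HT as a character reference
print(xml_accepts("<a>&#1;</a>"))    # SOH as a character reference
print(xml_accepts("<a>\x01</a>"))    # raw SOH in character data
```

The first JSON document and the HT reference are accepted, while the raw line feed inside a JSON string and both SOH forms in XML are rejected.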
Common processing techniques involve regex patterns and I/O modes to manage these codes. Python's re module supports substitutions like re.sub(r'[\x00-\x1F\x7F-\x9F]', '', text) to strip C0 and C1 characters, a common approach for sanitizing user input in web backends; note that this pattern also removes tabs and line breaks, so HT, LF, and CR are often excluded from it. File I/O in binary mode, such as Python's 'rb' or 'wb', preserves all byte values including controls without alteration, in contrast to text mode, which by default translates line endings built from LF and CR.
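A small sketch of the stripping approach follows, with an optional variant that keeps tab and line-break characters. The function name `strip_controls` and its `keep_whitespace` parameter are hypothetical, introduced only for this example.

```python
import re

# All C0 codes, DEL, and all C1 codes.
ALL_CONTROLS = re.compile(r"[\x00-\x1F\x7F-\x9F]")

# The same set, minus HT (09), LF (0A), and CR (0D).
CONTROLS_EXCEPT_WHITESPACE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")


def strip_controls(text: str, keep_whitespace: bool = False) -> str:
    """Remove C0, DEL, and C1 characters, optionally keeping HT, LF, and CR."""
    pattern = CONTROLS_EXCEPT_WHITESPACE if keep_whitespace else ALL_CONTROLS
    return pattern.sub("", text)


sample = "a\tb\x1b[31mc\x9bd\n"
print(repr(strip_controls(sample)))
print(repr(strip_controls(sample, keep_whitespace=True)))
```

Removing ESC on its own leaves the rest of an escape sequence (here `[31m`) behind as visible text, so input that may contain terminal sequences usually needs a pattern that matches whole sequences rather than single characters.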
As of 2025, trends in cloud-native development emphasize robust UTF-8 handling of C0 controls; for example, the AWS CLI incorporates LF and CR for interactive prompts in its output streams, leveraging UTF-8 encoding for global compatibility. Meanwhile, web applications increasingly implement proactive sanitization of C1 codes at the framework level, such as in Node.js or Django, to align with browser security models while supporting legacy data migration.
