Control character
In computing and telecommunications, a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. Control characters are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are graphic characters, also known as printing characters (or printable characters), except perhaps for the "space" character. The ASCII standard defines 33 control characters, such as code 7 (BEL), which might ring a bell.
History
Procedural signs in Morse code are a form of control character.
Control characters first appeared in the 1870 Baudot code: NUL and DEL. The 1901 Murray code added the carriage return (CR) and line feed (LF), and other versions of the Baudot code included other control characters.
The bell character (BEL), which rang a bell to alert operators, was also an early teletype control character.
Some control characters have also been called "format effectors".
In ASCII
There were quite a few control characters defined (33 in ASCII, and ECMA-35 adds 32 more). This was because early terminals had very primitive mechanical or electrical controls that made any kind of state-remembering API quite expensive to implement, thus a different code for each and every function was a requirement. All entries in the ASCII table below code 32 decimal (technically the C0 control code set) are control characters, including CR and LF used to separate lines of text. Code 127 decimal (DEL) is also a control character.[1][2]
Extended ASCII sets defined by ECMA-35 and ISO 8859 added the codes 128 through 159 (decimal) as control characters. This was primarily done so that if the high bit was stripped, it would not change a printing character to a C0 control code. This second set is called the C1 set.
IBM's EBCDIC character set contains 65 control codes, including all of the ASCII C0 control codes plus additional codes which were not added to Unicode. There were also a number of attempts to define alternative sets of 32 control codes, but none of these were transferred to Unicode either.
Only a small subset of the control characters are still in use for anything resembling their original purpose (a brief demonstration follows the list):
- 0x00 (null, NUL, \0, ^@), originally intended to be an ignored character, but now used by many programming languages including C to mark the end of a string.
- 0x04 (EOT, ^D) End Of File character on Unix terminals.[3]
- 0x07 (bell, BEL, \a, ^G), which may cause the device to emit a warning such as a bell or beep sound or the screen flashing.
- 0x08 (backspace, BS, \b, ^H), may overprint the previous character.
- 0x09 (horizontal tab, HT, \t, ^I), moves the printing position right to the next tab stop.
- 0x0A (line feed, LF, \n, ^J), moves the print head down one line (and maybe to the left edge). Used as the end of line marker in Unix-like systems.
- 0x0B (vertical tab, VT, \v, ^K), vertical tabulation.
- 0x0C (form feed, FF, \f, ^L), to cause a printer to eject paper to the top of the next page, or a video terminal to clear the screen.
- 0x0D (carriage return, CR, \r, ^M), moves the printing position to the start of the line, allowing overprinting. Used as the end of line marker in Classic Mac OS, OS-9, FLEX (and variants). A CR+LF pair is used by CP/M-80 and its derivatives including DOS and Windows.
- 0x1B (escape, ESC, \e (GCC only), ^[). Introduces an escape sequence.
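These escapes are available in most programming languages. As a brief demonstration, the following Python sketch (Python mirrors the C escape notation above, except that \e is not provided) prints each of these control characters with its code point and its escaped representation:

```python
# Minimal sketch: the C0 control characters listed above, written with
# their conventional string escapes, together with their code points.
controls = {
    "NUL": "\0",    # 0x00, used by C to mark the end of a string
    "BEL": "\a",    # 0x07, may beep or flash the screen
    "BS":  "\b",    # 0x08, backspace
    "HT":  "\t",    # 0x09, horizontal tab
    "LF":  "\n",    # 0x0A, line feed / Unix end of line
    "VT":  "\v",    # 0x0B, vertical tab
    "FF":  "\f",    # 0x0C, form feed
    "CR":  "\r",    # 0x0D, carriage return
    "ESC": "\x1b",  # 0x1B, escape (no \e escape in standard Python)
}

for name, ch in controls.items():
    print(f"{name:<3} code 0x{ord(ch):02X} escape {ch!r}")
```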
Control characters may do something when the user inputs them, such as Ctrl+C (End-of-Text character, ETX) to interrupt the running process, and Ctrl+Z (Substitute character, SUB) to mark the end of a file typed at the console on Windows. These uses usually have little to do with their ASCII definitions. Modern systems often describe shortcuts as though they were control characters ("type a Ctrl+V to paste"), but the character's code number is often not even used to implement the shortcut.
In Unicode
These 65 control codes were carried over to Unicode. Control characters occupy U+0000–U+001F (C0 controls), U+007F (delete), and U+0080–U+009F (C1 controls). Their General Category is "Cc". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead.[4]
Unicode added more characters (such as the zero-width non-joiner) that could be considered controls, but it makes a distinction between these "Formatting characters" and the 65 control characters. These are General Category "Cf" instead of "Cc".
Display
There are a number of techniques to display non-printing characters, which may be illustrated with the bell character in ASCII encoding (a short sketch follows the list):
- Caret notation in ASCII, using for code n the nth letter of the alphabet (G is the 7th letter): ^G
- An escape sequence, as in C/C++ character string codes: \a, \007, \x07, etc.
- An abbreviation, often three capital letters: BEL
- A Unicode character from the Control Pictures Unicode block that condenses the abbreviation: U+2407 ␇ SYMBOL FOR BELL
- An ISO 2047 graphical representation: U+237E ⍾ BELL SYMBOL
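The notations in this list can be derived mechanically from a character's code point. The sketch below is a minimal illustration assuming the conventional C0 mappings (the helper names are not from any standard library); it prints the caret form, an escape-style form, and the corresponding Control Pictures symbol for the bell character:

```python
# Sketch: render a C0 control character in several of the notations above.
def caret(code: int) -> str:
    """Caret notation: codes 0x00-0x1F map to '^' plus chr(code + 0x40); DEL maps to '^?'."""
    return "^?" if code == 0x7F else "^" + chr(code + 0x40)

def control_picture(code: int) -> str:
    """Control Pictures block: U+2400 + code for C0 controls, U+2421 for DEL."""
    return chr(0x2421) if code == 0x7F else chr(0x2400 + code)

bel = 0x07
print(caret(bel))             # ^G
print(f"\\x{bel:02x}")        # \x07, the escape-sequence style
print(control_picture(bel))   # ␇ (U+2407 SYMBOL FOR BELL)
```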
How control characters map to keyboards
ASCII-based keyboards have a key labelled "Control", "Ctrl", or (rarely) "Cntl" which is used much like a shift key, being pressed in combination with another letter or symbol key. In one implementation, the control key generates the code 64 places below the code for the (generally) uppercase letter it is pressed in combination with (i.e., subtract 0x40 from the ASCII code value of the (generally) uppercase letter). The other implementation is to take the ASCII code produced by the key and bitwise AND it with 0x1F, forcing bits 5 to 7 to zero. For example, pressing "control" and the letter "g" (which is 0110 0111 in binary) produces code 7 (BEL, or 0000 0111 in binary). The NUL character (code 0) is represented by Ctrl-@, "@" being the code immediately before "A" in the ASCII character set. For convenience, some terminals accept Ctrl-Space as an alias for Ctrl-@. In either case, this produces one of the 32 ASCII control codes between 0 and 31. Neither approach works to produce the DEL character because of its special location in the table and its value (code 127 decimal); Ctrl-? is sometimes used for this character.[5]
When the control key is held down, letter keys produce the same control characters regardless of the state of the shift or caps lock keys. In other words, it does not matter whether the key would have produced an upper-case or a lower-case letter. The interpretation of the control key with the space, graphics character, and digit keys (ASCII codes 32 to 63) varies between systems. Some will produce the same character code as if the control key were not held down. Other systems translate these keys into control characters when the control key is held down. The interpretation of the control key with non-ASCII ("foreign") keys also varies between systems.
Control characters are often rendered into a printable form known as caret notation by printing a caret (^) and then the ASCII character that has a value of the control character plus 64. Control characters generated using letter keys are thus displayed with the upper-case form of the letter. For example, ^G represents code 7, which is generated by pressing the G key when the control key is held down.
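Both conventions can be expressed as simple bit arithmetic. The following Python sketch (the helper names are illustrative, not part of any terminal API) derives the control code a terminal might send for a Ctrl+key combination and the caret form used to display it:

```python
# Sketch: Ctrl+key masks the key's code with 0x1F; caret display adds 0x40 back.
def ctrl_key(ch: str) -> int:
    """Bitwise AND with 0x1F clears bits 5-7, yielding a C0 control code."""
    return ord(ch) & 0x1F

def caret_display(code: int) -> str:
    """Caret notation: the character 64 places above the control code."""
    return "^" + chr(code + 0x40)

assert ctrl_key("g") == ctrl_key("G") == 7   # Ctrl+G gives BEL regardless of case
assert ctrl_key("@") == ctrl_key(" ") == 0   # Ctrl+@ (or Ctrl+Space) gives NUL
print(caret_display(ctrl_key("g")))          # ^G
```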
Keyboards also typically have a few single keys which produce control character codes. For example, the key labelled "Backspace" typically produces code 8, "Tab" code 9, "Enter" or "Return" code 13 (though some keyboards might produce code 10 for "Enter").
Many keyboards include keys that do not correspond to any ASCII printable or control character, for example cursor control arrows and word processing functions. The associated keypresses are communicated to computer programs by one of four methods: appropriating otherwise unused control characters; using some encoding other than ASCII; using multi-character control sequences; or using an additional mechanism outside of generating characters. "Dumb" computer terminals typically use control sequences. Keyboards attached to stand-alone personal computers made in the 1980s typically use one (or both) of the first two methods. Modern computer keyboards generate scancodes that identify the specific physical keys that are pressed; computer software then determines how to handle the keys that are pressed, including any of the four methods described above.
The design purpose
The control characters were designed to fall into a few groups: printing and display control, data structuring, transmission control, and miscellaneous.
Printing and display control
Printing control characters were first used to control the physical mechanism of printers, the earliest output device. An early example of this idea was the use of Figures (FIGS) and Letters (LTRS) in Baudot code to shift between two code pages. A later, but still early, example was the out-of-band ASA carriage control characters. Later, control characters were integrated into the stream of data to be printed. The carriage return character (CR), when sent to such a device, causes it to return the printing position to the edge of the paper at which writing begins (it may, or may not, also move the printing position to the next line). The line feed character (LF/NL) causes the device to put the printing position on the next line. It may (or may not), depending on the device and its configuration, also move the printing position to the start of the next line (which would be the leftmost position for left-to-right scripts, such as the alphabets used for Western languages, and the rightmost position for right-to-left scripts such as the Hebrew and Arabic alphabets). The vertical and horizontal tab characters (VT and HT/TAB) cause the output device to move the printing position to the next tab stop in the direction of reading. The form feed character (FF/NP) starts a new sheet of paper, and may or may not move to the start of the first line. The backspace character (BS) moves the printing position one character space backwards. On printers, including hard-copy terminals, this is most often used so the printer can overprint characters to make other, not normally available, characters. On video terminals and other electronic output devices, there are often software (or hardware) configuration choices that allow a destructive backspace (e.g., a BS, SP, BS sequence), which erases, or a non-destructive one, which does not. The shift in and shift out characters (SI and SO) selected alternate character sets, fonts, underlining, or other printing modes. Escape sequences were often used to do the same thing.
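On a video terminal the effect of some of these format effectors can be observed directly. The short Python sketch below assumes a VT100-style terminal emulator (behaviour varies by device): it uses CR to overprint a status line and the BS, SP, BS sequence as a destructive backspace.

```python
import sys
import time

# CR returns to the start of the line, so the next write overprints it.
for pct in (25, 50, 75, 100):
    sys.stdout.write(f"\rprogress: {pct:3d}%")
    sys.stdout.flush()
    time.sleep(0.2)
sys.stdout.write("\n")

# BS SP BS: move back, erase with a space, move back again (destructive backspace).
sys.stdout.write("helloX")
sys.stdout.write("\b \b")   # removes the trailing 'X' on most terminals
sys.stdout.write("\n")
```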
With the advent of computer terminals that did not physically print on paper and so offered more flexibility regarding screen placement, erasure, and so forth, printing control codes were adapted. Form feeds, for example, usually cleared the screen, there being no new paper page to move to. More complex escape sequences were developed to take advantage of the flexibility of the new terminals, and indeed of newer printers. The concept of a control character had always been somewhat limiting, and was extremely so when used with new, much more flexible, hardware. Control sequences (sometimes implemented as escape sequences) could match the new flexibility and power and became the standard method. However, there were, and remain, a large variety of standard sequences to choose from.
Data structuring
The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to structure data, usually on a tape, in order to simulate punched cards. End of medium (EM) warns that the tape (or other recording medium) is ending. While many systems use CR/LF and TAB for structuring data, it is possible to encounter the separator control characters in data that needs to be structured. The separator control characters are not overloaded; there is no general use of them except to separate data into structured groupings. Their numeric values are contiguous with the space character, which can be considered a member of the group, as a word separator.
For example, the RS separator is used by RFC 7464 (JSON Text Sequences) to encode a sequence of JSON elements. Each sequence item starts with an RS character and ends with a line feed, which makes it possible to serialize open-ended JSON sequences. It is one of the JSON streaming protocols.
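A minimal sketch of this framing, with RS (0x1E) before each element and LF after it as RFC 7464 describes (the encode/decode helper names are illustrative):

```python
import json

RS, LF = "\x1e", "\n"   # record separator and line feed, per RFC 7464

def encode_json_seq(items) -> str:
    """Frame each JSON text as RS <json> LF."""
    return "".join(RS + json.dumps(item) + LF for item in items)

def decode_json_seq(data: str):
    """Split on RS, drop empty chunks, and parse each element."""
    return [json.loads(chunk) for chunk in data.split(RS) if chunk.strip()]

stream = encode_json_seq([{"id": 1}, {"id": 2}, [3, 4]])
print(decode_json_seq(stream))   # [{'id': 1}, {'id': 2}, [3, 4]]
```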
Transmission control
The transmission control characters were intended to structure a data stream, and to manage re-transmission or graceful failure, as needed, in the face of transmission errors.
The start of heading (SOH) character was to mark a non-data section of a data stream—the part of a stream containing addresses and other housekeeping data. The start of text character (STX) marked the end of the header, and the start of the textual part of a stream. The end of text character (ETX) marked the end of the data of a message. A widely used convention is to make the two characters preceding ETX a checksum or CRC for error-detection purposes. The end of transmission block character (ETB) was used to indicate the end of a block of data, where data was divided into such blocks for transmission purposes.
The escape character (ESC) was intended to "quote" the next character: if it was another control character, it would be printed instead of performing its control function. It is almost never used for this purpose today. Various printable characters are used as visible "escape characters", depending on context.
The substitute character (SUB) was intended to request a translation of the next character from a printable character to another value, usually by setting bit 5 to zero. This is handy because some media (such as sheets of paper produced by typewriters) can transmit only printable characters. However, on MS-DOS systems with files opened in text mode, "end of text" or "end of file" is marked by this Ctrl-Z character, instead of the Ctrl-C or Ctrl-D, which are common on other operating systems.
The cancel character (CAN) signaled that the previous element should be discarded. The negative acknowledge character (NAK) is a flag, usually indicating that reception of the current element was a problem and, often, that it should be sent again. The acknowledge character (ACK) is normally used as a flag to indicate that no problem was detected with the current element.
When a transmission medium is half duplex (that is, it can transmit in only one direction at a time), there is usually a master station that can transmit at any time, and one or more slave stations that transmit when they have permission. The enquiry character (ENQ) is generally used by a master station to ask a slave station to send its next message. A slave station indicates that it has completed its transmission by sending the end of transmission character (EOT).
The device control codes (DC1 to DC4) were originally generic, to be implemented as necessary by each device. However, a universal need in data transmission is to request the sender to stop transmitting when a receiver is temporarily unable to accept any more data. Digital Equipment Corporation invented a convention which used 19 (the device control 3 character (DC3), also known as control-S, or XOFF) to "S"top transmission, and 17 (the device control 1 character (DC1), a.k.a. control-Q, or XON) to start transmission. It has become so widely used that most don't realize it is not part of official ASCII. This technique, however implemented, avoids additional wires in the data cable devoted only to transmission management, which saves money. A sensible protocol for the use of such transmission flow control signals must be used, to avoid potential deadlock conditions, however.
The data link escape character (DLE) was intended to be a signal to the other end of a data link that the following character is a control character such as STX or ETX. For example, a packet may be structured as <DLE> <STX> <PAYLOAD> <DLE> <ETX>.
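A sketch of this style of framing is shown below. It doubles any DLE byte that occurs inside the payload ("byte stuffing") so that it cannot be mistaken for the start of a control pair; this mirrors conventions used by protocols such as IBM's Binary Synchronous Communications, but it is an illustration rather than an implementation of any specific protocol.

```python
DLE, STX, ETX = b"\x10", b"\x02", b"\x03"

def frame(payload: bytes) -> bytes:
    """Wrap the payload as DLE STX ... DLE ETX, doubling any DLE in the payload."""
    stuffed = payload.replace(DLE, DLE + DLE)
    return DLE + STX + stuffed + DLE + ETX

def unframe(packet: bytes) -> bytes:
    """Strip the DLE STX / DLE ETX framing and undo the DLE doubling."""
    assert packet.startswith(DLE + STX) and packet.endswith(DLE + ETX)
    return packet[2:-2].replace(DLE + DLE, DLE)

message = b"data with an embedded \x10 byte"
assert unframe(frame(message)) == message
```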
Miscellaneous codes
Code 7 (BEL) is intended to cause an audible signal in the receiving terminal.[6]
Many of the ASCII control characters were designed for devices of the time that are not often seen today. For example, code 22, "synchronous idle" (SYN), was originally sent by synchronous modems (which have to send data constantly) when there was no actual data to send. (Modern systems typically use a start bit to announce the beginning of a transmitted word—this is a feature of asynchronous communication. Synchronous communication links were more often seen with mainframes, where they were typically run over corporate leased lines to connect a mainframe to another mainframe or perhaps a minicomputer.)
Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character with no meaning otherwise. Since the position of a NUL character has no holes punched, it can be replaced with any other character at a later time, so it was typically used to reserve space, either for correcting errors or for inserting information that would be available at a later time or in another place. In computing, it is often used for padding in fixed length records; to mark the end of a string; and formerly to give printing devices enough time to execute a control function.
Code 127 (DEL, a.k.a. "rubout") is likewise a special case. Its 7-bit code is all-bits-on in binary, which essentially erased a character cell on a paper tape when overpunched. Paper tape was a common storage medium when ASCII was developed, with a computing history dating back to WWII code breaking equipment at Biuro Szyfrów. Paper tape became obsolete in the 1970s, so this aspect of ASCII rarely saw any use after that. Some systems (such as the original Apple computers) converted it to a backspace. But because its code is in the range occupied by other printable characters, and because it had no official assigned glyph, many computer equipment vendors used it as an additional printable character (often an all-black box character useful for erasing text by overprinting with ink).
Non-erasable programmable ROMs are typically implemented as arrays of fusible elements, each representing a bit, which can only be switched one way, usually from one to zero. In such PROMs, the DEL and NUL characters can be used in the same way that they were used on punched tape: one to reserve meaningless fill bytes that can be written later, and the other to convert written bytes to meaningless fill bytes. For PROMs that switch from zero to one, the roles of NUL and DEL are reversed; also, DEL will only work with 7-bit characters, which are rarely used today; for 8-bit content, the character code 255, commonly defined as a nonbreaking space character, can be used instead of DEL.
Many file systems do not allow control characters in filenames, as they may have reserved functions.
See also
- Arrow keys § HJKL keys, HJKL as arrow keys, used on ADM-3A terminal
- C0 and C1 control codes
- Escape sequence
- In-band signaling
- Whitespace character
Notes and references
- ^ ASCII format for network interchange. 1969-10-01. doi:10.17487/RFC0020. RFC 20. Retrieved 2023-04-05.
- ^ "5.2 Control Characters". American National Standard Code for Information Interchange | ANSI X3.4-1977 (PDF). National Institute for Standards. 1977. Archived (PDF) from the original on 2022-10-09.
- ^ "EOT (End of transmission)" (PDF). Component Description: IBM 2780 Data Transmission Terminal (PDF). Systems Reference Library. p. 31. GA27-3005-3. Retrieved May 21, 2025.
The EOT character terminates the current transmission and returns all terminals in the data-link to control mode. When sent by the transmitting terminal, it indicates that the terminal has nothing more to transmit and is relinquishing the communications line. The receiving terminal can send an EOT character instead of a normal DLE 0, DLE 1, or NAK response. The EOT character in this case is an abort signal that terminates the transmission. When sent in response to a polling operation, the EOT character indicates that the polled terminal has no data to transmit or is unable to continue transmission. An EOT character is recognized (except in Six-Bit Transcode) only when immediately preceded by a SYN pattern (SYN SYN EOT PAD), or when immediately preceded by a DLE and followed by a character of which the first four bits must be all "1" bits (PAD character) DLE EOT PAD.
- ^ "4.8 Name". The Unicode Standard Version 13.0 – Core Specification (PDF). Unicode, Inc. Archived (PDF) from the original on 2022-10-09.
- ^ "ASCII Characters". Archived from the original on October 28, 2009. Retrieved 2010-10-08.
- ^ ASCII format for Network Interchange. October 1969. doi:10.17487/RFC0020. RFC 20. Retrieved 2013-11-03. An old RFC, which explains the structure and meaning of the control characters in chapters 4.1 and 5.2
External links
- ISO IR 1 C0 Set of ISO 646 (PDF)
Control character
Overview
Definition and Characteristics
A control character, also known as a non-printing character, is a code point within a character encoding system that does not correspond to a visible graphic symbol but instead invokes specific functions to influence the processing, display, or transmission of data by hardware or software. These characters are integral to information processing systems, where they direct actions such as formatting text or managing device operations without generating any visible output on a display or print medium. Defined in standards like ISO/IEC 6429 and ECMA-48, control characters are embedded in data streams to ensure proper interpretation and execution by compatible equipment.[5][6]
Key characteristics of control characters include their assignment to designated code points, such as the range of decimal values 0 through 31 and 127 in the ASCII encoding scheme, which reserves these positions exclusively for non-graphic purposes. Unlike standard fonts that provide glyphs for printable elements, control characters lack any visual representation, relying instead on their encoded value to trigger predefined behaviors in receiving systems. They play a crucial role in controlling peripheral devices, such as terminals for screen navigation or printers for paper advancement, thereby facilitating efficient data handling in computing environments. This non-printable nature ensures they remain invisible during normal rendering, preserving the integrity of the textual content.[7][6]
In distinction from printable characters, which encode letters, numerals, punctuation, or other symbols intended for direct visual depiction, control characters solely initiate operational commands without contributing to the semantic or aesthetic content of the output. For instance, a control character might reposition a cursor on a display, insert spacing between elements, or emit an alert signal, thereby shaping how subsequent printable characters are interpreted or rendered. This functional dichotomy underscores their utility in layered text processing, where control sequences orchestrate the environment for graphic rendering. Such properties trace back to early telegraphy systems, where analogous signals managed message flow and device synchronization.[7][6][8]
Classification and Categories
Control characters are classified into categories such as format effectors for layout control, transmission controls for data flow management, device controls for ancillary devices, and information separators for data organization. These categories ensure interoperability in basic 7-bit environments. Format effectors modify the layout or presentation of text, such as advancing positions or initiating new lines, without altering the content itself.[9]
Standard categories for control characters are delineated in ISO/IEC 2022, which structures them into the C0 set (bit combinations 00/00 to 01/15, corresponding to codes 0-31 in decimal) for basic operations and the C1 set (bit combinations 08/00 to 09/15, codes 128-159, or equivalent escape sequences) for extended capabilities. The separation of C0 and C1 facilitates compatibility: C0 supports essential 7-bit environments with minimal functions like null termination and basic formatting, while C1 extends to 8-bit codes for advanced features such as device selection and synchronization, preventing overload in simpler systems.[6]
Functionally, control characters are grouped into transmission controls for managing data flow and error handling over networks (e.g., acknowledgment and end-of-transmission signals), device controls for operating physical devices like printers or displays (e.g., DC1-DC4), format effectors (e.g., form feed), and information separators for organizing data records at varying hierarchical levels (e.g., US, RS, GS, FS).[9] These groupings originated in early 7-bit standards like ISO 646, emphasizing telecommunication and printing needs, and evolved with 8-bit extensions in ISO/IEC 6429 (equivalent to ECMA-48) to address growing demands for multimedia and bidirectional text processing while maintaining backward compatibility.[6]
Historical Development
Origins in Early Communication Systems
Control characters originated in the mid-19th century amid the rapid expansion of electrical telegraphy, where non-printing signals were essential for managing transmission and mechanical operations. Émile Baudot, a French telegraph engineer, invented the Baudot code in 1874 as part of his printing telegraph system, which used a six-unit synchronous code to enable multiple operators to transmit simultaneously over a single wire. By 1876, Baudot refined it to a five-unit asynchronous code, introducing the first dedicated control signals, such as "letter space" and "figure space," to switch the receiving printer between alphabetic and numeric/punctuation modes without printing a character; these shifts advanced the paper feed while altering interpretation of subsequent codes.[8] This innovation addressed the limitations of earlier systems like Morse code, which lacked uniform-length encodings and required manual decoding, thereby improving efficiency in 19th-century telegraph networks for direct printing of messages.[8][10]
In the early 20th century, control characters evolved with the advent of teletype and punch tape systems, which mechanized input and output for more reliable long-distance communication. Donald Murray, an inventor working on typewriter-like keyboards for telegraphy, modified the Baudot code starting in 1901 and introduced a dedicated "line" control character by 1905 to trigger both carriage return and paper advance on mechanical printers, using punched paper tape to store and feed sequences of five-bit codes. By the 1910s and into the 1920s–1940s, systems like those from Morkrum and Western Union separated these into distinct carriage return (CR) and line feed (LF) controls, represented by specific hole patterns on tape—such as all holes punched for CR in some variants—to independently manage horizontal reset and vertical advancement on printing mechanisms. These punch tape teletypewriters, widely adopted for news services and business telegrams, relied on such controls to format output on mechanical devices, preventing garbled text from continuous printing.[8][11]
The International Telegraph Union (ITU), founded in 1865, played a pivotal role in standardizing control characters during the early 1900s to ensure interoperability across global telegraph networks. Through international conferences, the Union's Bureau standardized Baudot-derived codes by the early 1900s, defining basic controls like mode shifts and spacing for uniform equipment operation. The Comité Consultatif International Télégraphique (CCIT), established in 1926 under the ITU, further refined these into the International Telegraph Alphabet No. 1 (ITA1) and No. 2 (ITA2) by 1931, incorporating controls such as CR, LF, and "who are you?" (WRU) signals to query remote devices and manage formatting in international transmissions.[8][12][13]
A significant advancement in the 1930s came with the integration of control signals into radio teletype systems for error correction. As radio transmission introduced noise and interference absent in wired telegraphy, U.S. military radioteletype applications from the 1930s employed control characters like parity checks and repeat signals to detect errors, laying groundwork for reliable over-air messaging. By 1939, error-detecting codes using dedicated control sequences were standardized for radioteletype.
Automatic repeat request (ARQ) protocols, which enable a receiver to request retransmission automatically, were developed after World War II.[14][8]
Evolution Through Computing Standards
In the 1950s and 1960s, control characters were integrated into early digital computing media such as punch cards and magnetic tapes, primarily through IBM's development of EBCDIC (Extended Binary Coded Decimal Interchange Code). EBCDIC evolved from punch card encodings used since the late 19th century but was formalized for computers in the early 1960s, with its initial specification appearing in 1963 alongside IBM's System/360 mainframe released in 1964.[15] This 8-bit code included control characters like ACK (acknowledge), NAK (negative acknowledge), and BEL (bell) to manage data processing, error handling, and device control on tapes and cards, enabling efficient batch processing in business and scientific applications.[15] EBCDIC's adoption reflected IBM's dominance in mainframe computing, though its proprietary nature limited interoperability.[16]
The standardization of ASCII (American Standard Code for Information Interchange) from 1963 to 1967 marked a pivotal shift toward universal compatibility. The initial ASCII-1963 (ANSI X3.4-1963) defined a 7-bit code with control characters for teletypewriters and early computers, but it was revised in 1967 (USAS X3.4-1967) and 1968 (ANSI X3.4-1968) to include 33 control characters—positions 0–31 (C0 set) and 127 (DEL)—covering functions like line feeds and carriage returns.[16][7] USASCII, as the 1968 version was termed, was adopted internationally through ECMA-6 (1965) and ISO 646 (1972), which harmonized the 33 controls to facilitate data exchange across diverse systems, reducing reliance on vendor-specific codes like EBCDIC.[7][17] This effort by ANSI, ECMA, and ISO emphasized backward compatibility while promoting a minimal set of controls essential for telecommunications and computing.[18]
During the 1970s and 1980s, computing standards transitioned from 7-bit to 8-bit encodings to support international characters, extending control sets via ISO 646 variants and the addition of the C1 set. ISO 646, building on ASCII, allowed national variants but retained the core 33 C0 controls; by the late 1970s, 8-bit extensions like ISO 8859 (introduced 1987) incorporated the C1 controls (positions 128–159) for advanced device management, such as cursor positioning and screen erasing, standardized in ISO 6429 (1988).[7][19] These developments, driven by ISO and ECMA, addressed global needs by enabling 8-bit bytes for accented letters in Western Europe while preserving legacy controls, thus bridging telegraph-era practices with modern terminals and printers.[7][20]
In the late 20th century, Unicode's emergence in the 1990s preserved and unified these legacy control characters for global text processing. Unicode 1.0 (1991), developed by the Unicode Consortium and aligned with ISO/IEC 10646 (1993), directly incorporated ASCII's 33 C0 controls and the C1 set into its Basic Multilingual Plane, ensuring compatibility with EBCDIC and ISO systems without alteration.[21] This preservation allowed seamless migration of existing data while expanding to over a million code points, with controls like NUL and ESC maintaining their roles in formatting and protocols.[7] By the mid-1990s, Unicode's adoption in software and the web solidified control characters as a stable foundation for interoperable computing.[22]
Representation in Character Encodings
Control Characters in ASCII
The American Standard Code for Information Interchange (ASCII), formalized as ANSI X3.4-1968 and later aligned with the international ISO/IEC 646 standard, employs a 7-bit encoding scheme that defines 128 character positions, ranging from 0 to 127.[23] Within this structure, 33 positions are reserved for control characters: the first 32 (codes 0 through 31, known as the C0 set) and code 127 (DEL).[23] These non-printable characters were designed primarily for controlling data transmission, formatting, and device operations in early computing and telecommunications systems, rather than representing visible symbols.[23]
The following table enumerates all 33 ASCII control characters, including their decimal code points, standard names, acronyms, and brief descriptions of their intended functions as specified in ISO/IEC 646:1991.[23]
| Decimal Code | Name | Acronym | Description/Original Intent |
|---|---|---|---|
| 0 | NULL | NUL | No action or used to allow time for feeding paper. |
| 1 | START OF HEADING | SOH | Indicates the start of a heading. |
| 2 | START OF TEXT | STX | Indicates the start of text. |
| 3 | END OF TEXT | ETX | Indicates the end of text. |
| 4 | END OF TRANSMISSION | EOT | Indicates the end of transmission. |
| 5 | ENQUIRY | ENQ | Requests a response. |
| 6 | ACKNOWLEDGE | ACK | Acknowledges receipt. |
| 7 | BELL | BEL | Produces an audible or visible signal. |
| 8 | BACKSPACE | BS | Moves the active position one position backward. |
| 9 | HORIZONTAL TABULATION | HT | Moves the active position to the next predetermined position. |
| 10 | LINE FEED | LF | Moves the active position to the same position on a new line. |
| 11 | VERTICAL TABULATION | VT | Moves the active position to the next predetermined line. |
| 12 | FORM FEED | FF | Moves the active position to the starting position on a new page. |
| 13 | CARRIAGE RETURN | CR | Moves the active position to the beginning of the line. |
| 14 | SHIFT OUT | SO | Indicates that following characters are to be interpreted according to an alternative set. |
| 15 | SHIFT IN | SI | Indicates that following characters are to be interpreted according to the standard set. |
| 16 | DATA LINK ESCAPE | DLE | Provides supplementary data link control. |
| 17 | DEVICE CONTROL ONE | DC1 | Used for device control. |
| 18 | DEVICE CONTROL TWO | DC2 | Used for device control. |
| 19 | DEVICE CONTROL THREE | DC3 | Used for device control. |
| 20 | DEVICE CONTROL FOUR | DC4 | Used for device control. |
| 21 | NEGATIVE ACKNOWLEDGE | NAK | Indicates a negative acknowledgment. |
| 22 | SYNCHRONOUS IDLE | SYN | Provides a signal for synchronizing purposes. |
| 23 | END OF TRANSMISSION BLOCK | ETB | Indicates the end of a transmission block. |
| 24 | CANCEL | CAN | Indicates that preceding data is in error. |
| 25 | END OF MEDIUM | EM | Indicates the physical end of a medium. |
| 26 | SUBSTITUTE | SUB | Replaces a character considered invalid. |
| 27 | ESCAPE | ESC | Provides a means of extending the character set. |
| 28 | FILE SEPARATOR | FS | Separates portions of a file. |
| 29 | GROUP SEPARATOR | GS | Separates groups of data. |
| 30 | RECORD SEPARATOR | RS | Separates records. |
| 31 | UNIT SEPARATOR | US | Separates units within a record. |
| 127 | DELETE | DEL | Used to obliterate unwanted characters. |
Control Characters in Unicode and ISO Standards
Unicode incorporates control characters from established standards to ensure compatibility with legacy systems, preserving the 32 C0 controls at code points U+0000 through U+001F and the DELETE character at U+007F in its Basic Latin block, which directly map to their ASCII positions.[1] The standard further includes the 32 C1 controls at U+0080 through U+009F, extending the 7-bit framework to 8-bit environments while maintaining semantic consistency for interchange.[1] These assignments align with ISO/IEC 2022 for code extension techniques, allowing seamless integration in multi-byte encodings.[1]
ISO/IEC 6429:1992 defines standardized control functions and their coded representations for 7-bit and 8-bit character sets, specifying the C0 set for basic operations and the C1 set for advanced device control in 8-bit contexts.[5] In this framework, C1 controls enable more sophisticated text processing, such as the Control Sequence Introducer (CSI) at decimal 155 (U+009B), which prefixes parameter-driven sequences for functions like cursor positioning and attribute setting in terminal environments.[6] Another example is the Index (IND) control at decimal 132 (U+0084), which advances the active cursor position to the next line while maintaining the column, supporting screen management in character-imaging devices.[25] These C1 additions differ from the ASCII C0 set by providing 8-bit-specific capabilities for interactive systems, building on the foundational 7-bit controls.[6]
Unicode normalization forms, including NFC (Normalization Form Canonical Composition) and NFD (Normalization Form Canonical Decomposition), handle control characters as indivisible units, leaving them unchanged during decomposition or composition to preserve their functional integrity in text streams.[26] This stability ensures that controls do not introduce unintended variations in normalized text, though their interaction with bidirectional algorithms requires adherence to Unicode Standard Annex #9 to avoid rendering issues in mixed-directionality content.[27]
In modern implementations, control characters are fully supported in the UTF-8 and UTF-16 encodings; in UTF-8, the C0 codes occupy single bytes while the C1 codes require two-byte sequences.[1] However, many C1 controls are now deprecated for general text interchange, with recommendations to use higher-level protocols or Unicode format characters instead to mitigate legacy interpretation risks.[1]
Visual Display and Rendering
Methods of Displaying Control Characters
Control characters are often rendered invisibly in terminal emulators, where they trigger specific actions without producing visible glyphs. For instance, the line feed (LF, ASCII 0x0A) character advances the cursor to the next line, while the carriage return (CR, ASCII 0x0D) moves the cursor to the beginning of the current line, enabling text formatting such as line breaks in command-line interfaces.[28] These behaviors follow standards like ECMA-48 for control sequence processing in terminals, ensuring seamless output without displaying the characters themselves.[28]
In debugging and data inspection tools, control characters are typically displayed as their hexadecimal or decimal equivalents to reveal their presence without ambiguity. The hexdump utility in Unix-like systems, for example, formats file contents in a tabular view showing byte offsets, hexadecimal values, and ASCII representations, where non-printable controls like LF appear as "0a" alongside a dot (.) for the unprintable byte.[29] This approach allows developers to analyze binary data or text streams containing controls, such as identifying embedded line terminators in files, while preserving the exact byte values for troubleshooting.[30]
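The tabular presentation used by such tools can be sketched in a few lines of Python; the following function is an illustration of the format (offsets, hexadecimal bytes, and a dot for each non-printable byte), not a reimplementation of the hexdump utility:

```python
def dump(data: bytes, width: int = 16) -> None:
    """Print offset, hex bytes, and ASCII text, with '.' for non-printable bytes."""
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexed = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        print(f"{offset:08x}  {hexed:<{width * 3}} {text}")

dump(b"line one\nline two\r\n\x07bell")
```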
Within network protocols, control characters are processed invisibly during transmission and reception, often being stripped, normalized, or interpreted as structural elements rather than rendered. In HTTP, messages may include controls in bodies or headers, but parsers handle them according to RFC 7230, treating characters like CR and LF as delimiters for lines without visual output in client displays.[31] Similarly, in email via MIME (RFC 2045), text parts mandate CRLF sequences for line breaks, with other controls like TAB permitted for spacing but processed silently by clients to maintain readability, excluding disallowed controls that could disrupt transport.[32] Symbolic notations, such as ^M for CR, may occasionally reference these in logs but are not part of primary rendering.
Accessibility tools like screen readers interpret control characters as navigational or structural cues to enhance user experience for visually impaired individuals. For example, CR and LF are typically announced silently but trigger actions like advancing to the next line or paragraph. This ensures that documents using controls for formatting, such as in PDFs or web content, maintain logical reading order without verbose announcements of the characters themselves.
Symbolic Representations and Glyphs
Caret notation provides a textual method for representing non-printable ASCII control characters by prefixing a caret symbol (^) to the printable character whose code point is 64 greater than that of the control character, which for most C0 controls is an uppercase letter. For example, the Start of Heading (SOH, code 1) is shown as ^A, the Bell (BEL, code 7) as ^G, and the Substitute (SUB, code 26) as ^Z. This convention originated in the 1967 version of the ASCII standard to enable clear documentation and visualization of controls in teletype and early computing environments.[33] The notation remains prevalent in modern text editors and tools, such as Vim, where it visually distinguishes control characters during editing and debugging of files containing binary data or legacy formats.[34]
In Unicode, the Control Pictures block (U+2400–U+243F) defines dedicated graphic symbols to depict C0 control characters (codes 0–31 and 127) and select others, facilitating their inclusion in printable contexts like diagrams or educational materials. Representative glyphs include U+2400 (␀) for Null (NUL), U+2401 (␁) for Start of Heading (SOH), U+2407 (␇) for Bell (BEL), U+2409 (␉) for Horizontal Tabulation (HT), U+240A (␊) for Line Feed (LF), and U+241B (␛) for Escape (ESC). These symbols are designed as simple line drawings or boxes enclosing abbreviations, with actual rendering varying by font but standardized in shape for consistency.[35]
Control characters are further symbolized through their official abbreviated names, as defined in the Unicode Standard for the C0 set, such as SOH, STX (Start of Text), ETX (End of Text), and BEL. The BEL character, in particular, is often visualized in graphical user interfaces (GUIs) as a bell icon or through an audible alert to represent its alerting function without altering text layout.[36] For the C1 control set (codes 128–159, as in ISO/IEC 2022), no equivalent glyphs exist in the Control Pictures block, leading to their display in Unicode-compliant fonts as fallback representations like open boxes or warning symbols to denote uninterpreted controls.[7] In terminal behaviors, these may align with caret notation for consistency across C0 and C1 ranges.[37]
Input and Device Mapping
Keyboard and Hardware Input Mechanisms
Control characters are primarily generated through hardware input devices such as keyboards, where specific key combinations or dedicated hardware mechanisms map to their binary codes. On standard QWERTY keyboards, the Control (Ctrl) key serves as a modifier to produce many C0 control characters from the ASCII set (codes 0–31), by combining it with alphabetic keys to clear the high bits of the letter's code. For instance, Ctrl+C generates End of Text (ETX, ASCII 3), while Ctrl+D produces End of Transmission (EOT, ASCII 4), a convention originating from early teletypewriter systems and standardized in ASCII to facilitate efficient data interruption and termination.[7][38]
In Windows environments, dead keys and numeric keypad modifiers enable input of control characters via Alt codes, where holding the Alt key while typing a numeric sequence on the keypad inserts the corresponding ASCII value. A representative example is Alt+7 (or Alt+007 for padded entry), which inputs the Bell (BEL, ASCII 7) character to trigger an audible alert. This method supports both C0 and some extended controls but relies on the system's code page interpretation, making it hardware-agnostic yet tied to the keyboard's numeric input capabilities.[39][40]
Historically, early teletype keyboards, such as the Teletype Model 33 and Model 35 used in mid-20th-century computing, featured dedicated keys or labeled positions for control characters, including special function keys like BREAK (for interrupt signals) and ESC (for escape sequences), integrated directly into the mechanical keyboard layout to transmit codes over serial lines without additional modifiers. These devices punched paper tape or sent electrical signals corresponding to control codes, influencing modern keyboard designs. In contemporary hardware, USB keyboards adhere to the Human Interface Device (HID) protocol, transmitting key events as scan codes—low-level identifiers for each key press or release—to the host system, which then maps them to control characters like Ctrl combinations or function keys (e.g., F1–F12 often aliased to higher controls). This scan code transmission ensures compatibility across devices, with make/break codes distinguishing press and release actions for precise control input.[41][42][43]
A key limitation in hardware input arises from bit-width constraints: 7-bit systems, common in original ASCII implementations, restrict direct input to C0 controls (0–31) via Ctrl+key or special keys like Backspace (BS, 8) and Enter (CR, 13), while C1 controls (128–159) require 8-bit capable hardware or multi-byte escape sequences (e.g., ESC followed by a letter) initiated by the Esc key, often necessitating function keys or composed inputs on modern layouts. Software remapping can extend these capabilities but remains secondary to hardware generation.[7][44]
Software and Programming Interfaces for Input
In programming languages, control characters are often generated or embedded using escape sequences within string literals. In C, the escape character ESC (ASCII 27) is represented as \x1B in hexadecimal notation or \033 in octal, allowing developers to insert it directly into strings for initiating control sequences, such as those used in terminal output.[45] Similarly, other control characters like newline (\n, ASCII 10) and carriage return (\r, ASCII 13) are predefined escapes that facilitate input handling in code.[45] These mechanisms abstract the binary representation of control codes, enabling portable code across compilers while adhering to standards like ISO C.[45]
High-level languages provide built-in functions and methods for creating and detecting control characters in input processing. Python's chr() function converts an integer Unicode code point to its corresponding character; for instance, chr(10) yields the line feed (LF) control character, equivalent to \n, which is commonly used in text streams for line breaks.[46] In Java, the Character.isISOControl(char ch) method identifies ISO control characters by checking if the input falls within the ranges U+0000 to U+001F (C0 controls) or U+007F to U+009F (DEL and C1 controls), aiding in validation and sanitization of input data from user interfaces or files.[47] These APIs promote safe handling by distinguishing control characters from printable ones, reducing errors in parsing network or file inputs.[47]
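A short sketch of these checks in Python follows; the is_iso_control helper is written here for illustration and mirrors the ranges that Java's Character.isISOControl tests.

```python
import unicodedata

def is_iso_control(ch: str) -> bool:
    """True for C0 (U+0000-U+001F) and DEL/C1 (U+007F-U+009F) controls."""
    cp = ord(ch)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F

lf = chr(10)                                      # line feed, same character as "\n"
print(lf == "\n")                                 # True
print(unicodedata.category(lf))                   # 'Cc', the Unicode control category
print(is_iso_control(lf), is_iso_control("A"))    # True False
```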
Terminal emulators integrate control character input through standardized sequences, particularly for navigation keys. In xterm, arrow keys generate Control Sequence Introducer (CSI) sequences prefixed by ESC [ (0x1B 0x5B); for example, the left arrow sends CSI D in normal mode, while application cursor keys mode (enabled via CSI ? 1 h) may alter the interpretation for enhanced input control in applications like vi.[44] This allows software to receive structured input events as byte streams containing control codes, supporting interactive command-line interfaces.[44]
Cross-platform development introduces challenges in processing control characters within input streams, primarily due to varying conventions for line endings. On Unix-like systems, LF (U+000A) denotes a newline, whereas Windows uses CR LF (U+000D U+000A); Python addresses this via universal newlines mode in TextIOWrapper (when newline=None), which transparently translates all variants—'\n', '\r', or '\r\n'—to '\n' on input, ensuring consistent handling across operating systems without altering other control characters.[48] Unicode normalization forms (NFC, NFD, NFKD, NFKC) do not impact control characters, as ASCII-range codes like U+0000 to U+007F remain unchanged, preserving their integrity in internationalized input pipelines.[49] Developers must configure stream readers accordingly to avoid mismatches, such as binary mode preserving raw CR LF sequences for protocol data.[48]
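A small sketch of this behaviour, using a throwaway file name chosen for illustration: text mode with newline=None translates both CR LF and bare CR to LF on input, while binary mode preserves the raw bytes.

```python
# Write raw bytes containing Windows (CR LF) and old-Mac (CR) line endings.
raw = b"one\r\ntwo\rthree\n"
with open("endings.txt", "wb") as f:      # illustrative file name
    f.write(raw)

# Text mode with newline=None (the default): universal newline translation.
with open("endings.txt", "r", newline=None) as f:
    print(f.read().splitlines())          # ['one', 'two', 'three']

# Binary mode: the control characters pass through untouched.
with open("endings.txt", "rb") as f:
    print(f.read())                       # b'one\r\ntwo\rthree\n'
```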
Primary Applications
Formatting and Output Control
Control characters play a crucial role in managing text layout and output on devices such as printers and screens by serving as format effectors that adjust positioning without producing visible glyphs. In the ASCII standard, the Horizontal Tabulation (HT, ASCII 09) advances the active position to the next horizontal tab stop, typically every eight columns, facilitating aligned spacing in tabular data or code. Similarly, the Line Feed (LF, ASCII 0A) moves the position to the next line, while the Carriage Return (CR, ASCII 0D) returns it to the beginning of the current line; these are often combined as CR LF to ensure both horizontal reset and vertical advance in legacy systems. The Form Feed (FF, ASCII 0C) ejects the current page or advances to the top of the next form, commonly used in printing to initiate new pages.[50]
Historically, control characters extended to more complex formatting in dot-matrix printers through escape sequences prefixed by the Escape (ESC, ASCII 1B) character, enabling attributes like bold and italic printing. For instance, in Epson's ESC/P command set, ESC E selects bold mode by increasing character density, while ESC 4 enables italic slant, allowing printers like the FX-80 to produce varied typographic effects on impact mechanisms. These sequences were essential for generating professional-looking documents on early office equipment, where direct hardware control was necessary due to limited software rendering capabilities.[51]
In modern terminal emulators, such as those implementing the VT100 standard, the Vertical Tabulation (VT, ASCII 0B) supports vertical positioning by advancing the cursor to the next predefined line tab stop, aiding in the layout of multi-line forms or aligned text blocks. Defined in ECMA-48, VT typically behaves like multiple LF characters if tab stops are unset, but enables precise vertical alignment when configured, enhancing output control in command-line interfaces and legacy applications.[6][52]
Interoperability challenges arise from differing conventions for line endings, particularly the use of CR LF in Windows environments versus LF alone in Unix-like systems, leading to issues like extra blank lines or truncated displays when files are exchanged across platforms. This discrepancy stems from historical typewriter mechanics but persists in text processing, requiring normalization tools to maintain consistent formatting during output.[53]
Data Structuring and Delimitation
Control characters play a crucial role in organizing and delineating data within streams or files, particularly in legacy computing environments where they establish hierarchical boundaries for parsing and processing information. In traditional systems, the information separators—File Separator (FS, ASCII 28), Group Separator (GS, ASCII 29), Record Separator (RS, ASCII 30), and Unit Separator (US, ASCII 31)—form a structured hierarchy to divide data logically.[54] The FS serves as the highest-level delimiter, separating entire files or major divisions; GS divides groups within files; RS marks boundaries between records inside groups; and US delimits the smallest units, such as fields within records.[54] This hierarchy was designed to mimic punched card or tape structures and remains relevant in legacy applications, including COBOL-based file processing on mainframes, where it enables efficient sequential reading and hierarchical data management.[55][56]
Specific control characters also function as terminators in various data formats to signal the end of content units. The Null (NUL, ASCII 0) character acts as a string terminator in the C programming language, appended to character arrays to indicate where the valid string ends, allowing functions like strlen to determine length without length prefixes. Similarly, the End of Text (ETX, ASCII 3) character denotes the conclusion of a text sequence, often following a Start of Text (STX) in communication protocols to bound message payloads.[57] These delimiters facilitate reliable parsing by providing unambiguous endpoints in binary or text streams.
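The separator hierarchy can be used directly wherever the payload is known not to contain the separator bytes; the minimal Python sketch below (helper names chosen for illustration) joins fields with US and records with RS:

```python
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"   # file/group/record/unit separators

def encode_records(records) -> str:
    """Join the fields of each record with US, and the records with RS."""
    return RS.join(US.join(fields) for fields in records)

def decode_records(data: str):
    """Split records on RS, then split each record's fields on US."""
    return [record.split(US) for record in data.split(RS)]

payload = encode_records([["Ada", "Lovelace"], ["Alan", "Turing"]])
print(decode_records(payload))   # [['Ada', 'Lovelace'], ['Alan', 'Turing']]
```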
In contemporary data formats, direct use of control characters as delimiters has largely given way to printable text-based alternatives, though legacy practices persist in certain domains. Formats like JSON and XML escape control characters (e.g., via Unicode escapes such as \u0003 for ETX) to prevent interference with parsing, relying instead on structural elements like brackets and quotes for delimitation. However, in Electronic Data Interchange (EDI) standards, such as EDIFACT and VDA, control characters including FS, GS, RS, and US continue to serve as separators for hierarchical data organization, ensuring compatibility with older transmission systems.[58][59] This retention supports interoperability in B2B exchanges where legacy infrastructure predominates.
For error handling in data structuring, the Substitute (SUB, ASCII 26) character provides a mechanism to flag and replace corrupted or invalid data segments. When transmission errors or encoding issues are detected, SUB can be inserted as a placeholder to maintain stream integrity, allowing downstream processes to identify and skip problematic bytes without halting parsing.[7][60] This approach, rooted in early ASCII design, underscores control characters' role in robust data delimitation by accommodating imperfections in storage or transfer.
