Substitute character
In computer data, a substitute character (␚) is a control character that is used to pad transmitted data in order to send it in blocks of fixed size, or to stand in place of a character that is recognized to be invalid, erroneous or unrepresentable on a given device. It is also used as an escape sequence in some programming languages.
In the ASCII character set, this character is encoded by the number 26 (1A hex). Standard keyboards transmit this code when the Ctrl and Z keys are pressed simultaneously (Ctrl+Z, often documented by convention as ^Z).[1] Unicode inherits this character from ASCII, but recommends that the replacement character (�, U+FFFD) be used instead to represent un-decodable inputs, when the output encoding is compatible with it.
Uses
End of file
Historically, under PDP-6 monitor,[2] RT-11, VMS, and TOPS-10,[3] and in early PC CP/M 1 and 2 operating systems (and derivatives like MP/M) it was necessary to explicitly mark the end of a file (EOF) because the native filesystem could not record the exact file size by itself; files were allocated in extents (records) of a fixed size, typically leaving some allocated but unused space at the end of each file.[4][5][6][7] This extra space was filled with 1A (hex) characters under CP/M. The extended CP/M filesystems used by CP/M 3 and higher (and derivatives like Concurrent CP/M, Concurrent DOS, and DOS Plus) did support byte-granular files,[8][9] so this was no longer a requirement, but it remained as a convention (especially for text files) in order to ensure backward compatibility.
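The record padding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 128-byte record size comes from the CP/M sources quoted in the references, while the function name `pad_to_records` and the constant names are hypothetical and do not correspond to any CP/M API.

```python
RECORD_SIZE = 128   # CP/M record length in bytes (per the quoted CP/M guides)
SUB = b"\x1a"       # substitute character, Ctrl+Z


def pad_to_records(text: bytes, record_size: int = RECORD_SIZE) -> bytes:
    """Fill the unused tail of the last record with SUB bytes.

    If the text already ends exactly on a record boundary, nothing is
    appended, matching the behaviour described in the Osborne guide quote.
    """
    remainder = len(text) % record_size
    if remainder == 0:
        return text
    return text + SUB * (record_size - remainder)


padded = pad_to_records(b"HELLO\r\n")
print(len(padded), padded[:10])
```

A reader of such a file would treat the first SUB byte as the logical end of the text and ignore the rest of the record.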
In CP/M, 86-DOS, MS-DOS, PC DOS, DR-DOS, and their various derivatives, the SUB character was also used to indicate the end of a character stream,[citation needed] and thereby used to terminate user input in an interactive command line window (and as such, often used to finish console input redirection, e.g. as instigated by the command COPY CON: TYPEDTXT.TXT).
While no longer technically required to indicate the end of a file, as of 2017, many text editors[which?] and programming languages still support this convention, or can be configured to insert this character at the end of a file when editing, or at least properly cope with it in text files.[citation needed] In such cases, it is often termed a "soft" EOF, as it does not necessarily represent the physical end of the file, but is more a marker indicating that "there is no useful data beyond this point". In reality, more data may exist beyond this character up to the actual end of the data in the file system, thus it can be used to hide file content when the file is entered at the console or opened in editors. Many file format standards (e.g. PNG or GIF) include the SUB character in their headers to perform precisely this function. Some modern text file formats (e.g. CSV-1203[10]) still recommend a trailing EOF character to be appended as the last character in the file. However, typing Control+Z does not embed an EOF character into a file in either MS-DOS or Windows, nor do the APIs of those systems use the character to denote the actual end of a file.
Some programming languages (e.g. Visual Basic) will not read past a "soft" EOF when using the built-in text file reading primitives (INPUT, LINE INPUT etc.),[citation needed] and alternate methods must be adopted, e.g. opening the file in binary mode or using the File System Object to progress beyond it.
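As a rough illustration of the "soft" EOF idea, the following Python sketch reads a file in binary mode and splits it at the first SUB byte. The helper names `split_at_soft_eof` and `read_whole_file` are hypothetical; the sketch only shows that data can sit after the marker and that a binary read still returns it.

```python
import os
import tempfile

SUB = b"\x1a"  # soft EOF marker (Ctrl+Z)


def split_at_soft_eof(raw):
    """Split raw file bytes into (visible, hidden) around the first SUB byte."""
    visible, _sep, hidden = raw.partition(SUB)
    return visible, hidden


def read_whole_file(path):
    """Read every byte of the file; a binary read does not stop at SUB."""
    with open(path, "rb") as handle:
        return handle.read()


# Build a small demonstration file with data after the soft EOF.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"visible text\r\n" + SUB + b"data after the marker")
    demo_path = tmp.name

raw = read_whole_file(demo_path)
visible, hidden = split_at_soft_eof(raw)
os.remove(demo_path)

print(visible)  # what a reader honouring the soft EOF would see
print(hidden)   # bytes that such a reader would never reach
```

A tool that honours the convention would present only `visible`; one that ignores it would see the whole of `raw`.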
Character 26 was used to mark "End of file" even though ASCII calls this character Substitute, and has other characters to indicate "End of file". Character 28, which is called "File Separator", has also been used for similar purposes.
Other uses
In Unix-like operating systems, this character is typically used in shells as a way for the user to suspend the currently executing interactive process.[11] The suspended process can then be resumed in foreground (interactive) mode, or be made to resume execution in background mode, or be terminated. When entered by a user at their computer terminal, the currently running foreground process is sent a "terminal stop" (SIGTSTP) signal, which generally causes the process to suspend its execution. The user can later continue the process execution by using the "foreground" command (fg) or the "background" command (bg).
The Unicode Security Considerations report[12] recommends this character as a safe replacement for unmappable characters during character set conversion.
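One way to picture this kind of replacement during character set conversion is a custom error handler for Python's `codecs` module. The handler name `"sub_replace"` and the helper `to_legacy` are hypothetical names chosen for this sketch; only `codecs.register_error` and the `errors=` parameter of `str.encode` are standard Python.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE


def _substitute_with_sub(exc):
    """Replace each unencodable character with one SUB and resume after it."""
    if isinstance(exc, UnicodeEncodeError):
        count = exc.end - exc.start
        return (SUB * count, exc.end)
    raise exc  # decoding and translation errors are not handled here


# "sub_replace" is an illustrative name, not a built-in error handler.
codecs.register_error("sub_replace", _substitute_with_sub)


def to_legacy(text, encoding="ascii"):
    """Encode text, emitting 0x1A for characters the target set lacks."""
    return text.encode(encoding, errors="sub_replace")


print(to_legacy("na\u00efve caf\u00e9"))
```

Because one SUB is emitted per unmappable character, the character count of the output matches the input, which keeps positions aligned between the two.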
In many GUIs and applications, Control+Z (⌘ Command+Z on macOS) can be used to undo the last action. In many applications, earlier actions than the last one can also be undone by pressing Control+Z multiple times. Control+Z was one of a handful of keyboard sequences chosen by the program designers at Xerox PARC to control text editing.
Representation
ASCII and Unicode representation of "substitute":
- Octal code: 32
- Decimal code: 26
- Hexadecimal code: 1A, U+001A
- Mnemonic symbol: SUB
- Binary value: 11010
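The values in this list can be cross-checked with a short Python sketch. It also shows the conventional relationship between Ctrl+Z and code 26 (masking the letter's code to its low five bits), which is a general property of ASCII control codes rather than something specific to this character.

```python
import unicodedata

SUB = 0x1A  # decimal 26

# The same value in the notations listed above.
print(oct(SUB), SUB, hex(SUB), bin(SUB))

# Ctrl+<letter> conventionally yields the letter's code with the top bits cleared.
ctrl_z = ord("Z") & 0x1F

# Caret notation flips bit 6 to get back to the printable letter.
caret = "^" + chr(SUB ^ 0x40)

# The Control Pictures block offers a visible stand-in at U+2400 + code.
picture = chr(0x2400 + SUB)

print(ctrl_z, caret, picture, unicodedata.name(picture))
```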
See also
- C0 and C1 control codes (ISO 646)
- U+FFFD (Unicode replacement character �)
- Access key
- Control-C
- Control-G
- Control-V
- Control-X
- Control-\
- Keyboard shortcut
- List of file signatures
- .notdef, a symbol (sometimes called by the slang term tofu) used to represent a missing character
- Noto fonts, a Google project to eliminate missing characters
References
- "Keyboard shortcuts for Windows". Microsoft Support. Microsoft. Retrieved 2012-06-02.
- "Table of IO Device Characteristics - Console or Teletypewriters". PDP-6 Multiprogramming System Manual (PDF). Maynard, Massachusetts, USA: Digital Equipment Corporation (DEC). 1965. p. 43. DEC-6-0-EX-SYS-UM-IP-PRE00. Archived (PDF) from the original on 2014-07-14. Retrieved 2014-07-10. (1+84+10 pages)
- "5.1.1.1. Device Dependent Functions - Data Modes - Full-Duplex Software A(ASCII) and AL(ASCII Line)". PDP-10 Reference Handbook: Communicating with the Monitor - Time-Sharing Monitors (PDF). Vol. 3. Digital Equipment Corporation (DEC). 1969. pp. 5-3 – 5-6 [5-5 (431)]. Archived (PDF) from the original on 2011-11-15. Retrieved 2014-07-10. (207 pages)
- Elliott, John C. (1998). "CP/M 1.4 disc formats". Archived from the original on 2020-11-14. Retrieved 2021-11-18.
- Elliott, John C. (1998). "CP/M 2.2 disc formats". Archived from the original on 2020-11-05. Retrieved 2021-11-18.
- "2. Operating System Call Conventions". CP/M 2.0 Interface Guide (PDF) (1 ed.). Pacific Grove, California, USA: Digital Research. 1979. p. 5. Archived (PDF) from the original on 2020-02-28. Retrieved 2020-02-28. (56 pages)
[...] The end of an ASCII file is denoted by a control-Z character (1AH) or a real end of file, returned by the CP/M read operation. Control-Z characters embedded within machine code files (e.g., COM files) are ignored, however, and the end of file condition returned by CP/M is used to terminate read operations. [...]
- Hogan, Thom (1982). "3. CP/M Transient Commands". Osborne CP/M User Guide - For All CP/M Users (2 ed.). Berkeley, California, USA: A. Osborne/McGraw-Hill. p. 74. ISBN 0-931988-82-9. Retrieved 2020-02-28.
[...] CP/M marks the end of an ASCII file by placing a CONTROL-z character in the file after the last data character. If the file contains an exact multiple of 128 characters, in which case adding the CONTROL-Z would waste 127 characters, CP/M does not do so. Use of the CONTROL-Z character as the end-of-file marker is possible because CONTROL-z is seldom used as data in ASCII files. In a non-ASCII file, however, CONTROL-Z is just as likely to occur as any other character. Therefore, it cannot be used as the end-of-file marker. CP/M uses a different method to mark the end of a non-ASCII file. CP/M assumes it has reached the end of the file when it has read the last record (basic unit of disk space) allocated to the file. The disk directory entry for each file contains a list of the disk records allocated to that file. This method relies on the size of the file, rather than its content, to locate the end of the file. [...]
- Elliott, John C. (1998). "CP/M 3.1 disc formats". Archived from the original on 2021-10-26. Retrieved 2021-11-18.
- Elliott, John C. (1998). "CP/M 4.1 disc formats". Archived from the original on 2020-11-05. Retrieved 2021-11-18.
- CSV-1203 format specification. Archived 2016-05-16 at the Portuguese Web Archive.
- "Quick Reference: Unix Commands". IT Connect. University of Washington. Retrieved 2012-06-02.
- Unicode Security Considerations report.
Substitute character
[...] copy command appends Ctrl+Z (SUB) to indicate the file's end in plain text contexts.[2] This dual role, error substitution and file termination, has persisted in legacy systems, though modern Unicode environments treat it primarily as a legacy control code within the C0 Controls and Basic Latin block, with no visual glyph but potential rendering as a symbol like ␚ for debugging purposes. Its use highlights the evolution of character encoding standards from 7-bit ASCII to broader multilingual support, where SUB remains reserved for compatibility but is rarely invoked in contemporary data handling due to more robust error-detection methods like checksums and parity bits.
Definition and History
Core Definition
The substitute character, known as SUB, is a control character defined in the American Standard Code for Information Interchange (ASCII) with a decimal value of 26.[3] It functions primarily as a mechanism for replacing characters that are determined to be invalid or erroneous during data processing or transmission.[3] In data streams, SUB maintains structural integrity by standing in for unknown, invalid, or unrepresentable characters, preventing disruptions in parsing or rendering without altering the intended flow of information. This substitution role ensures that systems can continue operations even when encountering anomalies, such as corrupted bytes or encoding mismatches.[4]
SUB is inherently non-printable, meaning it does not correspond to a visible symbol or glyph in output devices, but rather operates as a signal within control protocols.[5] As part of the broader category of control characters in encoding systems, it contributes to non-textual instructions that manage device behavior, formatting, and error recovery, distinct from printable alphanumeric content.[1] These control characters, including SUB, form the foundational layer for reliable data interchange in computing environments.[6]
Historical Origins
The substitute character, designated as code 1A in hexadecimal, was first assigned as a control character in the initial American Standard Code for Information Interchange (ASCII), published by the American Standards Association (ASA), predecessor to the American National Standards Institute (ANSI), as X3.4-1963. This early version of ASCII reserved the first 32 positions (codes 00 to 1F) for control characters, including the position for what would later be explicitly named the substitute (SUB), initially termed a "generic separator" to handle data formatting in transmission. The inclusion reflected the need for mechanisms to manage erroneous or unrepresentable data in nascent digital systems, drawing from precedents in telegraphy and punched card technologies where substitution codes were employed to flag and mitigate corruption during mechanical data transfer.[7][4]
Influenced by International Telegraph Alphabet No. 2 (ITA2) and related teletype standards from the early 20th century, the SUB concept addressed transmission errors common in wire-based communication, where invalid signals could propagate corruption; punched card systems, prevalent in data processing since Herman Hollerith's 1890 tabulating machines, similarly used filler or error-indicating punches to preserve data integrity during batch handling. In the 1963 standard, SUB's role emphasized reliability in environments like perforated tape readers and early terminals, preventing cascade failures from garbled inputs. This foundational design prioritized interoperability across heterogeneous hardware, a critical concern for the emerging computer industry.[7][4][8]
The 1967 revision (USAS X3.4-1967) explicitly redefined the 1A position as SUBSTITUTE, clarifying its function to replace invalid or erroneous characters, often inserted automatically during transmission errors, while adding lowercase letters and refining other controls for broader adoption in teletypewriters and computer terminals. This update, adopted as ANSI X3.4-1968, solidified SUB's integration into practical systems, enhancing error recovery in real-time data exchange. Concurrently, the European Computer Manufacturers Association (ECMA) incorporated SUB into its ECMA-6 standard in 1965, aligning it closely with the evolving ASCII to promote transatlantic compatibility in computing hardware.[4][7][9]
International standardization advanced with the International Organization for Standardization (ISO)'s adoption of SUB in ISO Recommendation R 646 (1967), later formalized in ISO 646 (1973), which harmonized 7-bit codes for global use, designating it as a transmission control character to indicate substituted invalid data while allowing national variants in graphic positions. This inclusion ensured SUB's role in preventing data loss across diverse international networks, building on ASCII's framework to support the growing demand for standardized information interchange in the 1970s.[10][11]
Encoding and Representation
In ASCII Standard
In the American Standard Code for Information Interchange (ASCII), the substitute character is designated as the control character SUB, with a decimal value of 26, hexadecimal value of 1A, and binary representation of 0011010 in seven bits (00011010 when stored in an eight-bit byte).[1] The value is fixed by the 7-bit encoding scheme defined in the standard.[1] SUB sits within the control character block spanning decimal 0–31 (hexadecimal 00–1F), which is reserved for non-printable codes that manage device operations rather than rendering text.[1] These control codes, including SUB, originated from adaptations of earlier telegraphy signaling systems to support reliable data transmission in computing environments.[7] As a non-printable control character, SUB is used as a replacement for any invalid or erroneous character detected in a data stream during processing or transmission, so that the error is handled without altering the structure of the transmitted information.[1]
In Unicode and Extended Encodings
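As a quick sanity check of the byte-level encodings covered in this section, the following Python sketch encodes U+001A in several encodings and queries its Unicode properties through the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

SUB = "\u001a"  # U+001A, the substitute character

# Encode the same code point in ASCII-compatible and UTF-16 forms.
encodings = ["ascii", "utf-8", "latin-1", "cp1252", "utf-16-be", "utf-16-le"]
table = {name: SUB.encode(name) for name in encodings}

for name, encoded in table.items():
    print(f"{name:10} {encoded.hex(' ')}")

# General category, bidirectional class, and behaviour under normalization.
category = unicodedata.category(SUB)
bidi_class = unicodedata.bidirectional(SUB)
unchanged = all(
    unicodedata.normalize(form, SUB) == SUB
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(category, bidi_class, unchanged)
```

Note that the two UTF-16 byte orders differ: the single 16-bit code unit 0x001A is stored as 00 1A in big-endian order and as 1A 00 in little-endian order.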
In Unicode, the substitute character is assigned the code point U+001A, named SUBSTITUTE (also abbreviated as SUB or CONTROL 26), and is classified as a control character within the Basic Latin block (U+0000–U+007F).[12] This placement ensures backward compatibility with earlier standards like ASCII, where it occupies the same position, while integrating it into the broader Unicode repertoire for multilingual text processing.[12]
In UTF-8, the substitute character keeps its single-byte representation from legacy systems: it is encoded as the single byte 0x1A, allowing seamless handling in byte-oriented environments without multi-byte overhead for this low code point. In UTF-16, it is represented as the single 16-bit code unit 0x001A (stored as the bytes 00 1A in big-endian order and 1A 00 in little-endian order), as it falls within the Basic Multilingual Plane and requires no surrogate pairs.
Extended encodings preserve the substitute character at its original position for compatibility. In the ISO-8859 series, including ISO-8859-1 (Latin-1), it maps directly to byte 0x1A, corresponding to U+001A in Unicode.[13] Similarly, Windows-1252, a superset of ISO-8859-1 for Western European languages, retains it at position 26 (0x1A), ensuring consistent interpretation across Windows applications and text files.[14]
Regarding text processing behaviors, the substitute character has a bidirectional class of Boundary Neutral (BN), meaning it acts as a neutral boundary in bidirectional algorithms without influencing the directionality of adjacent characters in mixed left-to-right and right-to-left scripts. In Unicode normalization forms, such as NFC (Normalization Form Canonical Composition) and NFD (Normalization Form Canonical Decomposition), control characters like U+001A remain unchanged, as normalization processes do not decompose or compose non-graphic controls with combining class 0.[15]
Primary Uses
End-of-File Marker
In CP/M operating systems, the substitute character (SUB, Ctrl+Z, ASCII 0x1A) functioned as an end-of-file (EOF) marker for text files, a convention adopted due to the system's storage of files in fixed 128-byte blocks without explicit length tracking.[16] Programs would append SUB at the logical end of content, signaling readers to halt and ignore any trailing bytes in the final block, which might otherwise contain unrelated or uninitialized data.[16]
Early MS-DOS versions, including MS-DOS 1.0, preserved this CP/M compatibility by treating SUB similarly as an EOF indicator in text file operations, such as during input from the console or file reading commands like COPY.[16] Users could enter Ctrl+Z at the command line to terminate input streams, ensuring clean file endings without extraneous content.[17]
The mechanism operated by having file I/O routines scan for SUB during sequential reads; upon detection, reading ceased, preventing the ingestion of potential garbage data beyond the intended content.[16] For instance, in GW-BASIC programming under MS-DOS, the EOF function returned true (-1) when encountering SUB in sequential files or redirected console input, allowing programs to test for file completion reliably.[18]
This approach contrasted with Unix-like systems, where EOF is not marked by an embedded character but signaled interactively via Ctrl+D (EOT, ASCII 0x04), which flushes the input buffer and indicates no further data without altering the file contents.[19]
Character Substitution Mechanism
The substitute character (SUB, ASCII code 26 or 0x1A) serves as a control character specifically intended to replace any character identified as invalid or erroneous during data processing or transmission. According to the American National Standard Code for Information Interchange (ANSI X3.4-1977), SUB is defined as "a control character that may be substituted for a character that is determined to be invalid or in error."[1] This mechanism ensures that the structural integrity and layout of the text or data stream are maintained, preventing cascading errors or complete processing failure when encountering problematic bytes. By inserting SUB in place of the offending character, systems can continue handling the input without loss of positional information, which is particularly useful in environments constrained to 7-bit ASCII encoding.
In text editors and parsers operating within ASCII-compatible frameworks, the substitution process typically involves scanning input bytes and replacing those that fall outside the valid range (e.g., values greater than 127 for non-ASCII content) or fail encoding validation with the SUB character. This preserves the document's visual and logical flow, avoiding gaps or misalignments that could arise from deletion or ignoring invalid elements. A basic algorithmic outline for this substitution in pseudocode is:
function processCharacter(char):
    if not isValidASCII(char) or isUndefinedInEncoding(char):
        return SUB
    else:
        return char
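A runnable counterpart to the outline above might look like the following Python sketch. It assumes the simplest validity rule mentioned in the text (any byte above 127 is invalid for 7-bit ASCII); the function name `substitute_invalid` is hypothetical, and a real parser could apply stricter or encoding-specific checks.

```python
SUB = 0x1A  # substitute character


def substitute_invalid(data: bytes) -> bytes:
    """Replace every byte outside the 7-bit ASCII range with SUB.

    The output has the same length as the input, so byte positions
    are preserved.
    """
    return bytes(byte if byte < 0x80 else SUB for byte in data)


sample = "caf\u00e9".encode("utf-8")   # the final character takes two bytes
cleaned = substitute_invalid(sample)
print(sample, cleaned)
```

Because the replacement is one-for-one at the byte level, a two-byte UTF-8 sequence becomes two SUB bytes; substituting per character instead would require decoding first.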
