Escape character
In computing and telecommunications, an escape character is a character (more specifically a metacharacter) that, based on a contextual convention, specifies an alternative interpretation of the sequence of characters that follow it. The escape character together with the characters that follow it forms a syntactic unit called an escape sequence. A convention can define any particular character code as a sequence prefix. Some conventions use a normal, printable character such as backslash (\) or ampersand (&). Others use a non-printable (a.k.a. control) character such as ASCII escape.
In telecommunications, an escape character is used to indicate that the following characters are encoded differently. This is used to alter control characters that would otherwise be noticed and acted on by the underlying telecommunications hardware, such as illegal characters. In this context, the use of an escape character is sometimes referred to as quoting.
Definition
An escape character may not have its own meaning, so all escape sequences are of two or more characters.
Escape characters are part of the syntax for many programming languages, data formats, and communication protocols. For a given alphabet an escape character's purpose is to start character sequences (so named escape sequences), which have to be interpreted differently from the same characters occurring without the prefixed escape character.
The functions of escape sequences include:
- To encode a syntactic entity, such as device commands or special data, which cannot be directly represented by the alphabet.
- To represent characters, referred to as character quoting, which cannot be typed in the current context, or would have an undesired interpretation. In this case, an escape sequence is a digraph consisting of an escape character itself and a "quoted" character.
Control character
In contrast to an escape character, a control character (e.g. carriage return) has meaning on its own, without a special prefix or following characters. An escape character has no meaning on its own; it only has meaning in the context of a sequence.
Generally, an escape character is not a particular case of (device) control characters, nor vice versa. If we define control characters as non-graphic, or as having a special meaning for an output device (e.g. printer or text terminal) then any escape character for this device is a control one. But escape characters used in programming (such as the backslash, \) are graphic, hence are not control characters. Conversely most (but not all) of the ASCII "control characters" have some control function in isolation, therefore they are not escape characters.
In many programming languages, an escape character also forms some escape sequences which are referred to as control characters. For example, line break has an escape sequence of \n.
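As a small illustration of that last point, the sketch below uses Python (chosen here only for illustration; the same \n and \t forms appear in many languages) to show that a two-character escape sequence in source code denotes a single control character in the resulting string. The variable names are arbitrary.

```python
# Each two-character escape sequence in the source denotes ONE character.
newline = "\n"
tab = "\t"

print(len(newline), ord(newline))  # 1 10  (line feed)
print(len(tab), ord(tab))          # 1 9   (horizontal tab)
```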
Examples
JavaScript
JavaScript uses the \ (backslash) as an escape character for:[1][2]
- \' single quote
- \" double quote
- \\ backslash
- \n new line
- \r carriage return
- \t tab
- \b backspace
- \f form feed
- \v vertical tab (Internet Explorer 9 and older treats \v as v instead of a vertical tab (\x0B). If cross-browser compatibility is a concern, use \x0B instead of \v.)
- \0 null character (U+0000 <control-0000>) (only if the next character is not a decimal digit; else it is an octal escape sequence)
- \xFF character represented by the hexadecimal byte FF
The \v and \0 escapes are not allowed in JSON strings.
Example code:
console.log("Using \\n \nWill shift the characters after \\n one row down")
console.log("Using \\t \twill shift the characters after \\t one tab length to the right")
console.log("Using \\r \rWill imitate a carriage return, which means shifting to the start of the row") // can be used to clear the screen on some terminals. Windows uses \r\n instead of \n alone
ASCII escape character
The ASCII "escape" character (octal: \033, hexadecimal: \x1B, or, in decimal, 27, also represented by the sequences ^[ or \e) is used in many output devices to start a series of characters called a control sequence or escape sequence. Typically, the escape character was sent first in such a sequence to alert the device that the following characters were to be interpreted as a control sequence rather than as plain characters, then one or more characters would follow to specify some detailed action, after which the device would go back to interpreting characters normally. For example, the sequence of ^[, followed by the printable characters [2;10H, would cause a Digital Equipment Corporation (DEC) VT102 terminal to move its cursor to the 10th cell of the 2nd line of the screen. This was later developed into ANSI escape codes covered by the ANSI X3.64 standard. The escape character also starts each command sequence in the Hewlett-Packard Printer Command Language.
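The cursor-positioning example can be sketched in Python as follows. The helper name cursor_to is hypothetical and exists only for this illustration; whether a given terminal or emulator acts on the sequence depends on its support for this style of control sequence.

```python
ESC = "\x1b"  # ASCII escape: decimal 27, octal 033

def cursor_to(row: int, col: int) -> str:
    """Build the sequence ESC [ row ; col H (illustrative helper, not a library API)."""
    return f"{ESC}[{row};{col}H"

seq = cursor_to(2, 10)
print(repr(seq))  # '\x1b[2;10H'
```

Writing seq to a terminal that understands the sequence would move the cursor to line 2, column 10; printing its repr, as above, only shows the characters involved.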
An early reference to the term "escape character" is found in the IBM technical publications of Bob Bemer, who is credited with inventing this mechanism during his work on the ASCII character set.[3]
The Escape key is usually found on standard PC keyboards. However, it is commonly absent from keyboards for PDAs and other devices not designed primarily for ASCII communications. The DEC VT220 series was one of the few popular keyboards that did not have a dedicated Esc key, instead using one of the keys above the main keypad. In user interfaces of the 1970s–1980s it was not uncommon to use this key as an escape character, but on modern desktop computers such use has been dropped. Sometimes the key was identified with AltMode (for alternative mode). Even with no dedicated key, the escape character code could be generated by typing [ while simultaneously holding down Ctrl.
Programming and data formats
Many modern programming languages specify the double-quote character (") as a delimiter for a string literal. The backslash (\) escape character typically provides two ways to include double-quotes inside a string literal, either by modifying the meaning of the double-quote character embedded in the string (\" becomes "), or by modifying the meaning of a sequence of characters including the hexadecimal value of a double-quote character (\x22 becomes ").
C, C++, and Ruby allow both of these backslash escape styles. Java accepts \" but has no \x form; it offers octal escapes (\42 for the double-quote character) instead. The PostScript language and Microsoft Rich Text Format also use backslash escapes. The quoted-printable encoding uses the equals sign as an escape character.
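The two ways of writing a double quote can be checked directly. The sketch below uses Python, which is not one of the languages named above but accepts both the \" and \x22 forms in its string literals.

```python
escaped_quote = "\""   # backslash changes the meaning of the quote character
hex_quote = "\x22"     # backslash introduces the hexadecimal value 0x22

print(escaped_quote == hex_quote)  # True
print(len("say \"hi\""))           # 8
```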
URLs and URIs use %-escapes to quote characters with a special meaning, as well as non-ASCII characters. The ampersand (&) character may be considered an escape character in SGML and derived formats such as HTML and XML.
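A minimal sketch of %-escaping, using the quote and unquote functions from Python's standard urllib.parse module. The sample string is invented for illustration, and safe="" is passed so that the slash is escaped as well.

```python
from urllib.parse import quote, unquote

raw = "a b&c/é"
encoded = quote(raw, safe="")   # safe="" also escapes "/"

print(encoded)                  # a%20b%26c%2F%C3%A9
print(unquote(encoded) == raw)  # True
```

The non-ASCII character é is first encoded as UTF-8 (two bytes), and each byte is then written as its own %-escape.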
Some programming languages also provide other ways to represent special characters in literals, without requiring an escape character (see e.g. delimiter collision).
Communication protocols
The Point-to-Point Protocol (PPP) uses the 0x7D octet (\175, or ASCII: }) as an escape character. The octet immediately following should be XORed with 0x20 before being passed to a higher level protocol. This is applied to both 0x7D itself and the control character 0x7E (which is used in PPP to mark the beginning and end of a frame) when those octets need to be transmitted by a higher level protocol encapsulated by PPP, as well as other octets negotiated when the link is established. That is, when a higher level protocol wishes to transmit 0x7D, it is transmitted as the sequence 0x7D 0x5D, and 0x7E is transmitted as 0x7D 0x5E.
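The escaping rule described above can be sketched as a pair of Python functions. The names ppp_escape and ppp_unescape, and the extra parameter standing in for the negotiated octets, are assumptions made for this illustration; the sketch covers only the octet substitution, not framing or checksums.

```python
ESCAPE = 0x7D  # PPP escape octet
FLAG = 0x7E    # PPP frame delimiter

def ppp_escape(payload: bytes, extra: frozenset = frozenset()) -> bytes:
    """Escape FLAG, ESCAPE and any negotiated `extra` octets (hypothetical helper)."""
    out = bytearray()
    for octet in payload:
        if octet in (ESCAPE, FLAG) or octet in extra:
            out += bytes([ESCAPE, octet ^ 0x20])
        else:
            out.append(octet)
    return bytes(out)

def ppp_unescape(data: bytes) -> bytes:
    """Reverse ppp_escape: drop each ESCAPE and XOR the next octet with 0x20."""
    out = bytearray()
    octets = iter(data)
    for octet in octets:
        if octet == ESCAPE:
            following = next(octets, None)
            if following is None:
                raise ValueError("truncated escape sequence")
            octet = following ^ 0x20
        out.append(octet)
    return bytes(out)

print(ppp_escape(b"\x7d\x7e").hex())  # 7d5d7d5e
```

Because XOR with 0x20 is its own inverse, the same operation is used in both directions.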
Bourne shell
In Bourne shell (sh), the asterisk (*) and question mark (?) characters are wildcard characters expanded via globbing. Without a preceding escape character, an * will expand to the names of all files in the working directory that do not start with a period if and only if there are such files, otherwise * remains unexpanded. So to refer to a file literally called "*", the shell must be told not to interpret it in this way, by preceding it with a backslash (\). This modifies the interpretation of the asterisk (*).
Compare:
rm * # delete all files in the current directory
rm \* # delete the file named *
Similarly, characters like the ampersand, pipe and semicolon (used for command chaining), angle brackets (used for redirection), and parentheses have special syntactic meaning to the Bourne shell. These must also be escaped—referred to as "quoting" in the sh(1) manual page[4]—in order to be used literally as arguments to another program:
$ echo (`-´)> # not escaped or quoted
bash: syntax error near unexpected token ``-´'
$ echo \(`-´\)\> # escaped with backslashes
(`-´)>
$ echo '(`-´)>' # protected by single quotes; same effect as above
(`-´)>
$ echo ;) # syntax error
$ echo ';)' \;\) # both OK
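When a command line is assembled by a program rather than typed by hand, the quoting can be generated automatically. The sketch below uses shlex.quote and shlex.split from Python's standard library, which follow POSIX shell quoting rules; it only builds and re-parses the command string and does not run a shell.

```python
import shlex

arg = "(`-´)>"
quoted = shlex.quote(arg)   # wraps the argument in single quotes
print(quoted)               # '(`-´)>'

# Re-parsing the command line recovers the original argument unchanged.
print(shlex.split("echo " + quoted))  # ['echo', '(`-´)>']
```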
Windows Command Prompt
The Windows command-line interpreter uses a caret character (^) to escape reserved characters that have special meanings (in particular: &, |, (, ), <, >, ^).[5] The DOS command-line interpreter, though it has similar syntax, does not support this.
For example, on the Windows Command Prompt, this will result in a syntax error.
C:\>echo <hello world>
The syntax of the command is incorrect.
whereas this will output the string: <hello world>
C:\>echo ^<hello world^>
<hello world>
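The caret rule can be sketched as a simple text transformation. The Python helper below, caret_escape, is hypothetical and handles only the reserved characters listed above; it makes no attempt to cover quoted strings or other Command Prompt parsing rules, so it should be read as an illustration rather than a safe general-purpose escaper.

```python
CMD_RESERVED = set("&|()<>^")  # the reserved characters listed above

def caret_escape(text: str) -> str:
    """Prefix each reserved character with a caret (illustrative helper only)."""
    return "".join("^" + ch if ch in CMD_RESERVED else ch for ch in text)

print(caret_escape("<hello world>"))  # ^<hello world^>
```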
Windows PowerShell
In Windows, the backslash is used as a path separator; therefore, it generally cannot be used as an escape character. PowerShell uses backtick[6] ( ` ) instead.
For example, the following command prints a first line indented by a tab (`t), followed by a line break (`n) and a second line:
PS C:\> echo "`tFirst line`nNew line"
First line
New line
Others
- Quoted-printable, which encodes 8-bit data into 7-bit data of limited line lengths, uses the equals sign (=) as an escape character.
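A short sketch of quoted-printable escaping, using the quopri module from Python's standard library. The sample text is invented for illustration; the equals sign and the non-ASCII bytes are each written as = followed by two hexadecimal digits.

```python
import quopri

original = "naïve = simple".encode("utf-8")
encoded = quopri.encodestring(original)

print(encoded)                                   # the "=" and the two bytes of "ï" appear as =XX
print(quopri.decodestring(encoded) == original)  # True
```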
See also
- AltGr key – Modifier key on some computer keyboards
- Escape sequences in C – Special character sequences in the C programming language
- Leaning toothpick syndrome – Escape characters making an expression unreadable
- Nested quotation – Quotations within quotations
- Stropping (syntax) – Method in computer language design
References
- "JavaScript character escape sequences". Mathias Bynens. 21 December 2011. Retrieved 2014-06-30.
- "Special Characters (JavaScript)". Microsoft Developer Network. Archived from the original on Dec 14, 2014. Retrieved 2014-06-30.
- Bemer, Bob (Oct 25, 2003). "How Bob Bemer Invented the ESCAPE Sequence and Key". Bob Bemer. Archived from the original on 4 January 2018. Retrieved 22 March 2018.
- "Manual Page - sh(1)".
- Tim Hill (1998). "The Windows NT Command Shell". Microsoft Learn. MacMillan Technical Publishing. Retrieved 2010-01-13.
- "about_Escape_Characters". Microsoft Developer Network. 2014-05-08. Archived from the original on 2016-11-25. Retrieved 2016-11-24.
External links
- That Powerful ESCAPE Character -- Key and Sequences Archived 2016-03-25 at the Wayback Machine – Bob Bemer
This article incorporates public domain material from Federal Standard 1037C. General Services Administration. Archived from the original on 2022-01-22.
Escape character
Common backslash escape sequences include \n for newline, \t for horizontal tab, and \" to embed a double quote inside a string, as standardized in the C language specification and adopted across implementations.[5] This mechanism ensures portability and readability, with octal (\ooo) or hexadecimal (\xhh) forms allowing characters to be specified by their numeric value.[5] Beyond strings, escape characters appear in data formats like JSON (using \ for control characters) and regular expressions, where they neutralize metacharacters (e.g., \. to match a literal period).[6]
Escape characters also play a critical role in protocol design for telecommunications and networking, where they delimit frames or escape control signals in serial communications, such as in HDLC-like protocols (e.g., PPP), to avoid misinterpretation of data as commands.[7] Their use extends to markup languages like HTML and XML, though there the term often refers to entities (e.g., &amp;) rather than a single character, highlighting the evolution from low-level control to higher-level text processing. Overall, escape characters facilitate robust handling of diverse character sets in digital systems, balancing expressiveness with unambiguous parsing.[1]
Fundamentals
Definition
An escape character is a metacharacter that causes the system or parser to interpret one or more subsequent characters differently from their default meaning, typically to include reserved symbols literally within text or data structures.[1] This functionality enables the representation of special characters (such as delimiters, quotes, or control signals) that would otherwise trigger predefined behaviors, by temporarily suspending their usual interpretation.[8] For instance, in textual contexts, an escape character allows a quotation mark to appear inside a quoted string without ending the quotation.[9]
The primary mechanism involves forming escape sequences, where the escape character precedes one or more additional characters to specify the intended literal or functional output.[10] These sequences can be as simple as the escape character followed by a single symbol to neutralize its special role, or more complex multi-character combinations that denote non-printable elements, such as a newline represented abstractly as an escape followed by 'n'.[8] The distinction lies in their length and purpose: single-character escapes directly modify the immediate follower, while multi-character sequences encode broader instructions or representations, often standardized within specific formats or protocols.[11]
Escape characters overlap with control characters but are not a subset of them: ASCII ESC is itself a non-printable control character, whereas printable escape characters such as the backslash are not. The term "escape character" specifically emphasizes the alteration of interpretive context in sequential data processing.[12]
Historical Development
The concept of escape characters traces its roots to 19th-century telegraphic systems, where non-printing control signals were essential for managing device operations without producing visible output. Émile Baudot's printing telegraph, introduced in 1874 and refined to a five-unit code by 1876, employed shift mechanisms such as "letter space" and "figure space" to toggle between alphabetic and numeric modes, allowing efficient transmission of mixed content over limited bandwidth.[13] These non-printing controls served as precursors to modern escape functions by altering the interpretation of subsequent codes without printing characters. In the early 20th century, Donald Murray's 1898 telegraph system further advanced this by incorporating shift characters for "figures," "capitals," and "release," along with line controls for carriage return and paper feed by 1905, standardizing operational signaling in mechanical printing telegraphs.[13] By the mid-20th century, international standards began formalizing these ideas. The International Telegraph Alphabet No. 2 (ITA2), adopted by the CCITT in 1930, retained Baudot-inspired shift characters ("letter shift" and "figure shift") for case switching and introduced proposals for an explicit "escape" control in 1963 extensions to support expanded character sets in telegraphy.[13] This evolution culminated in the American Standard Code for Information Interchange (ASCII), published in 1963 as ANSI X3.4, which designated the ESC control character (decimal 27, hexadecimal 1B) specifically to initiate escape sequences for supplementary controls or additional character sets.[14][15] In ASCII, ESC altered the meaning of following characters, enabling flexible device control in early computing environments.[14] The 1970s marked a significant expansion with the rise of video terminals, driving the development of standardized escape code systems. 
The ECMA-48 standard, released in 1976, built on ASCII's C0 controls to define structured escape sequences for cursor movement, screen attributes, and mode changes in terminal devices.[16] This was followed by ANSI X3.64 in 1979, which formalized additional sequences for video text terminals, addressing the limitations of proprietary codes in emerging systems like DEC's VT52 (1975).[17] A key milestone was the VT100 terminal, introduced by Digital Equipment Corporation in August 1978, which was the first widely adopted device to implement these ANSI-compatible escape sequences, using ESC (octal 033) to introduce control functions such as cursor addressing (e.g., ESC [Pn;Pn H) and erasing (e.g., ESC [Ps J).[18] The VT100's design influenced subsequent terminal standards and persists in modern emulators.[19]
From the 1990s onward, escape characters integrated into global encoding frameworks. The Unicode Standard, first published in 1991 (version 1.0), incorporated the ASCII-derived ESC as a C0 control character (U+001B), preserving its role in initiating sequences while unifying character representation across scripts.[14] Subsequent encodings like UTF-8, defined in 1993 and integrated into Unicode by version 2.0 (1996), supported transmission of ESC as a single byte (0x1B), enabling backward compatibility with legacy control sequences in multilingual environments without introducing shift states in the core model. This adoption ensured escape mechanisms remained viable for terminal control and protocol signaling in diverse, internationalized systems.[14]
Core Concepts
Control Sequences
An escape sequence is fundamentally structured as an escape character immediately followed by one or more modifier characters or bytes that alter the interpretation of the subsequent elements to represent special or control functions.[5] This anatomy allows the escape character to signal the parser that the following content should be treated differently, such as substituting a non-printable control code or embedding a reserved symbol without triggering its default behavior.[1] For instance, in various computing environments, the backslash (\) serves as the escape character, paired with a modifier like 't' to denote the horizontal tab character (\t).[5]
Escape sequences can be categorized into several types based on their purpose and complexity. Printable escapes enable the inclusion of characters that would otherwise have syntactic significance, such as the double quote within a quoted string (\"), preventing premature termination of the string literal.[5] Control escapes represent non-printable actions, like the backspace (\b) for cursor movement or the newline (\n) for line breaks, which are essential for formatting output without direct input of control codes. Parameterized sequences extend this further by incorporating numeric or symbolic parameters after the escape character and initial modifiers, as seen in terminal control where the escape initiates a sequence like ESC [ n m to set a display attribute, such as the foreground color, selected by the parameter n.[2]
These sequences play a critical role in avoiding ambiguity during the parsing of strings or commands, ensuring that special characters are interpreted literally or as intended controls rather than as delimiters or operators. By embedding the escape mechanism, parsers can distinguish between structural elements (e.g., quotes bounding a string) and content (e.g., a literal quote inside it), maintaining the integrity of data representation across different systems.
In software contexts, the backslash (\) is a prevalent notation for escape sequences due to its availability as a printable character in character sets like ASCII, facilitating use in text-based programming and data files. Conversely, in hardware-oriented protocols such as those for terminals, the non-printable ESC (ASCII 27) is commonly employed to initiate sequences, leveraging its dedicated control status for efficient device communication.[2] The ASCII ESC character, standardized as code point 27, historically facilitated the introduction of such control sequences in early data interchange standards.
Escaping Mechanisms
Escape characters primarily function through prefix mechanisms, where a designated character—most commonly the backslash (\)—precedes another character or sequence to alter its interpretation, preventing it from being treated as a delimiter or control signal.[5] This approach allows the escape character to signal the parser or compiler to interpret the following content literally or as a special value, such as representing non-printable characters or quoting delimiters within strings.[20] Control sequences represent one common implementation of this, where the escape initiates a multi-character directive resolved to a single entity.
A frequent strategy to represent the escape character itself literally is doubling, where two instances of the escape character are used to denote one unescaped version; for example, in C and related languages, \\ produces a single backslash in the output string.[5] Similarly, in data interchange formats like JSON, the backslash is escaped as \\ to include it without triggering further interpretation.[21] This doubling avoids the need for additional special characters and maintains consistency in parsing rules across contexts. Alternative delimiters offer another variation, particularly in systems supporting raw or multiline strings, where custom opening and closing delimiters (such as # or other non-conflicting characters) enclose content without requiring escapes for internal special characters, simplifying handling of complex literals.[22]
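Both strategies can be seen side by side in Python, used here only as a convenient illustration. Its r prefix is one form of raw string; it switches off backslash processing rather than introducing custom delimiters, so it is related to, but not the same as, the delimiter-based variants mentioned above.

```python
doubled = "C:\\temp\\new"   # each \\ denotes one backslash
raw = r"C:\temp\new"        # raw literal: backslashes are taken as written

print(doubled == raw)         # True
print(len("\\"), len(r"\n"))  # 1 2
```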
Interpretation of escape characters is often context-dependent, occurring at compile-time in statically processed source code or runtime in dynamically parsed data. In languages like Java, escape sequences in string literals are resolved during compilation, replacing them with corresponding Unicode characters before the program executes, ensuring early validation of the source.[20] Conversely, in runtime environments such as JSON parsers, escaping is applied when deserializing strings from input data, allowing flexible handling of user-supplied content but introducing potential for deferred errors.[21]
Error handling for invalid escape sequences varies by system but typically results in rejection to maintain data integrity. Compilers like Java's issue a compile-time error for unrecognized sequences following the escape character, such as \q, preventing malformed code from proceeding.[20] In parsers for formats like JSON, invalid escapes—such as unpaired Unicode surrogates in \u sequences—lead to parsing failures or undefined behavior, as they violate the specification's requirements for well-formed input.[23] Prefix-based escaping predominates in most systems due to its simplicity and left-to-right parsing efficiency, though postfix variants appear in rare, specialized contexts where the modifier follows the target character for syntactic reasons.[5]
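Rejection of an unrecognized escape can be observed with the json module in Python's standard library. The helper is_valid_json is a hypothetical convenience for this sketch; other parsers may report the problem differently.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON (hypothetical helper for this sketch)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(r'"\n"'))  # True:  \n is a defined JSON escape
print(is_valid_json(r'"\q"'))  # False: \q is not
```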
In Programming Languages
JavaScript
In JavaScript, the backslash (\) serves as the primary escape character within string literals delimited by single or double quotes, allowing the inclusion of otherwise reserved characters such as quotes themselves. For instance, to embed a double quote inside a double-quoted string, it is written as "He said \"hello\"."[24]. This mechanism ensures that the string parser interprets the escaped character literally rather than as a delimiter.[24]
JavaScript supports a range of standard escape sequences in string literals for representing control characters and non-ASCII content. Common examples include \n for a newline, \t for a horizontal tab, \uXXXX for a Unicode code point specified by four hexadecimal digits (e.g., \u00A9 for the copyright symbol ©), and \xHH for a Latin-1 character via two hexadecimal digits (e.g., \xA9).[24] Other escapes encompass \b for backspace, \r for carriage return, \f for form feed, \v for vertical tab, \\ for a literal backslash, and \' or \" for the respective quote marks.[24] These sequences are processed during string construction to produce the intended character values.[24]
Template literals, introduced in ECMAScript 2015 and delimited by backticks (`), process escape sequences in the same way as regular strings but also support interpolation via ${expression}. To include a literal backtick or dollar sign within a template literal, precede it with a backslash, as in `He said \`hello\`` or `\${value}`. (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals) For raw strings that ignore escape processing, treating backslashes literally, the built-in String.raw tag can be applied: String.raw`He said \nhello` yields the characters He said \nhello, with a literal backslash followed by n rather than a newline.[25]
In regular expression literals and RegExp objects, the backslash retains its escaping role but includes additional conventions for pattern syntax. For example, \/ matches a literal forward slash, bypassing its potential role as a delimiter, while \\ matches a backslash itself.[26] Standard control escapes like \n are also supported, as are Unicode forms such as \u{1F600} (for emoji) when the u flag is set, keeping regular expressions consistent with string handling.[26]
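The effect of escaping a metacharacter in a pattern can be sketched as follows. The example uses Python's re module rather than JavaScript, on the assumption that the \. convention behaves the same way in both; re.escape is Python's own helper for escaping a literal string.

```python
import re

literal = "a.b"
pattern = re.escape(literal)   # the pattern a\.b
print(pattern)

print(bool(re.fullmatch(pattern, "a.b")))  # True
print(bool(re.fullmatch(pattern, "axb")))  # False
print(bool(re.fullmatch(literal, "axb")))  # True: an unescaped "." matches any character
```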
The basic string and regular-expression escapes have been consistently implemented across browser environments and Node.js since ECMAScript 5, released in 2009; template literals and the \u{...} form were added in ECMAScript 2015.
Shell Environments
In Bourne and POSIX-compliant shells, such as sh and bash, the backslash (\) serves as the primary escape character, preserving the literal value of the following character and preventing its special interpretation by the shell.[27] For instance, to output a literal dollar sign without triggering variable expansion, one uses echo \$HOME, which displays $HOME instead of the user's home directory path.[27] This mechanism, rooted in early Unix shells from the 1970s, allows precise control over command interpretation.[28]
Quoting provides alternative ways to handle escaping in these shells. Single quotes (' ') treat all enclosed content literally, disabling escapes and expansions entirely, so echo '$HOME' outputs the string $HOME verbatim.[27] In contrast, double quotes (" ") permit certain expansions like variables (e.g., echo "$HOME" expands to /home/user) while still protecting against word splitting and globbing, but backslashes within double quotes can escape specific characters such as the dollar sign.[27]
In the Windows Command Prompt (cmd.exe), the caret (^) functions as the escape character to neutralize special operators like ampersand (&), pipe (|), and redirection (>).[29] For example, echo New^&Name treats & as literal text rather than a command separator.[29] Quotation marks can also enclose special characters for similar protection.
PowerShell employs the backtick (`) as its escape character, enabling line continuations and special sequences primarily within double-quoted strings. (https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-7.5) It supports escapes like `n for a newline, as in Write-Output "Line1`nLine2", which produces a line break.[30] Single-quoted strings ignore backticks, treating them literally, while double quotes interpret them for escapes and expansions.
Unix-like shells and Windows shells differ notably in path and variable handling, affecting escaping needs. Unix paths use forward slashes (/) and are case-sensitive, with variables prefixed by $ (e.g., $PATH), often requiring backslash escapes only for shell metacharacters within paths; spaces in paths typically need quoting rather than escaping.[27] Windows paths employ backslashes (\) and are case-insensitive, using % for variables (e.g., %PATH%), where escaping with ^ is crucial for operators in paths, and spaces or special characters demand quotes or caret escapes to avoid misinterpretation.[29] These distinctions arise from POSIX standards for Unix shells versus Windows' command-line conventions, impacting cross-platform scripting.[31]
In Data Formats and Protocols
Markup and Text Formats
In markup languages such as HTML and XML, escape characters are essential for representing reserved symbols that could otherwise be interpreted as structural delimiters during parsing. The ampersand (&) serves as the primary escape trigger in both formats, initiating a character entity reference when followed by a name or numeric code and terminated by a semicolon (;). For instance, in HTML, the less-than sign (<) is escaped as &lt;, the greater-than sign (>) as &gt;, and the ampersand itself as &amp;, ensuring these characters appear as literal content without disrupting tag recognition.[32] Similarly, XML mandates five predefined entities: &lt; for <, &gt; for >, &amp; for &, &apos; for ', and &quot; for ", which processors must recognize to handle content safely, with < and & requiring mandatory escaping in element text to avoid markup confusion.[33]
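Entity escaping can be sketched with the escape and unescape functions from Python's standard html module. The sample markup is invented for illustration.

```python
from html import escape, unescape

text = '<a href="x">Tom & Jerry</a>'
encoded = escape(text)  # quote=True by default, so " and ' are escaped too

print(encoded)          # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
print(unescape(encoded) == text)  # True
```

The ampersand is replaced first, so the ampersands introduced by the other replacements are not escaped a second time.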
JSON, a lightweight data interchange format, employs the backslash (\) as its escape character within double-quoted strings to handle quotation marks and control characters that might otherwise terminate the string or cause parsing errors. Specifically, a double quote (") inside a string is escaped as \", and the backslash itself as \\, while control characters like newline are represented via sequences such as \n or Unicode escapes like \u000A.[21] This mechanism allows JSON strings to embed arbitrary text safely, including structural elements from containing documents.
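These rules can be observed with the json module in Python's standard library, which applies the escapes when serializing and removes them when parsing. The sample text is invented for illustration.

```python
import json

text = 'He said "hi"\n'      # contains two double quotes and a real newline
encoded = json.dumps(text)

print(encoded)                      # "He said \"hi\"\n"
print(json.loads(encoded) == text)  # True
print(json.dumps("\\"))             # "\\"
```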
In comma-separated values (CSV) files, a plain-text format for tabular data, escaping primarily involves handling the field delimiter (comma) and the enclosure character (double quote) through a doubling convention rather than a dedicated escape symbol. Fields containing commas, line breaks, or quotes are enclosed in double quotes, and any internal double quote is escaped by repeating it (e.g., "He said, ""Hello""").[34] Some CSV dialects extend this with backslash escaping for delimiters, but the standard RFC 4180 prioritizes quote doubling for interoperability across tools.[34]
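The quote-doubling convention can be sketched with the csv module in Python's standard library, whose default dialect encloses such fields in double quotes and doubles any quote inside them. The field values are invented for illustration.

```python
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(['He said, "Hello"', "plain"])
line = buffer.getvalue()
print(line.rstrip("\r\n"))  # "He said, ""Hello""",plain

# Reading the line back recovers the original field values.
print(next(csv.reader(io.StringIO(line))))  # ['He said, "Hello"', 'plain']
```

Only the field that contains a comma or a quote is enclosed; the second field is written as-is.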
Configuration file formats like YAML and INI use backslashes or quotes to escape special characters in string values, preventing misinterpretation in key-value pairs or hierarchical structures. In YAML, double-quoted strings support backslash escapes for control characters (e.g., \n for newline, \" for quote) and Unicode representations (e.g., \u0026 for &), while single-quoted strings escape apostrophes by doubling (''), allowing flexible embedding of markup-like content without full parsing overhead.[35] INI files, a simpler predecessor, enclose values in double quotes to protect spaces or semicolons, with backslashes escaping quotes (\"), newlines (\n), or the backslash itself (\\) in quoted strings, though path values often forgo escaping to preserve literal paths like C:\dir.[36]
The use of escape mechanisms in these formats traces its roots to the Standard Generalized Markup Language (SGML), formalized in ISO 8879 in 1986, which introduced entity references starting with & for substituting characters in document content, including escapes for delimiters like < and >. This foundation influenced XML, published as a W3C Recommendation in 1998, which streamlined SGML's entity model into a web-optimized subset while retaining core escaping principles for broader adoption in structured text processing.[37]