Numeric character reference
A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.
Examples
In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:
| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+03A3 | Decimal | &#931; | Σ |
| U+03A3 | Decimal | &#0931; | Σ |
| U+03A3 | Hexadecimal | &#x3A3; | Σ |
| U+03A3 | Hexadecimal | &#x03A3; | Σ |
| U+03A3 | Hexadecimal | &#x3a3; | Σ |
In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE:
| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+00C6 | Decimal | &#198; | Æ |
| U+00C6 | Hexadecimal | &#xC6; | Æ |
In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s (ß):
| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+00DF | Decimal | &#223; | ß |
| U+00DF | Hexadecimal | &#xDF; | ß |
List of numeric character references for the printable ASCII characters:
| Unicode character | Character reference (decimal) | Character reference (hexadecimal) | Effect |
|---|---|---|---|
| U+0020 | &#32; | &#x20; | (space) |
| U+0021 | &#33; | &#x21; | ! |
| U+0022 | &#34; | &#x22; | " |
| U+0023 | &#35; | &#x23; | # |
| U+0024 | &#36; | &#x24; | $ |
| U+0025 | &#37; | &#x25; | % |
| U+0026 | &#38; | &#x26; | & |
| U+0027 | &#39; | &#x27; | ' |
| U+0028 | &#40; | &#x28; | ( |
| U+0029 | &#41; | &#x29; | ) |
| U+002A | &#42; | &#x2A; | * |
| U+002B | &#43; | &#x2B; | + |
| U+002C | &#44; | &#x2C; | , |
| U+002D | &#45; | &#x2D; | - |
| U+002E | &#46; | &#x2E; | . |
| U+002F | &#47; | &#x2F; | / |
| U+0030 | &#48; | &#x30; | 0 |
| U+0031 | &#49; | &#x31; | 1 |
| U+0032 | &#50; | &#x32; | 2 |
| U+0033 | &#51; | &#x33; | 3 |
| U+0034 | &#52; | &#x34; | 4 |
| U+0035 | &#53; | &#x35; | 5 |
| U+0036 | &#54; | &#x36; | 6 |
| U+0037 | &#55; | &#x37; | 7 |
| U+0038 | &#56; | &#x38; | 8 |
| U+0039 | &#57; | &#x39; | 9 |
| U+003A | &#58; | &#x3A; | : |
| U+003B | &#59; | &#x3B; | ; |
| U+003C | &#60; | &#x3C; | < |
| U+003D | &#61; | &#x3D; | = |
| U+003E | &#62; | &#x3E; | > |
| U+003F | &#63; | &#x3F; | ? |
| U+0040 | &#64; | &#x40; | @ |
| U+0041 | &#65; | &#x41; | A |
| U+0042 | &#66; | &#x42; | B |
| U+0043 | &#67; | &#x43; | C |
| U+0044 | &#68; | &#x44; | D |
| U+0045 | &#69; | &#x45; | E |
| U+0046 | &#70; | &#x46; | F |
| U+0047 | &#71; | &#x47; | G |
| U+0048 | &#72; | &#x48; | H |
| U+0049 | &#73; | &#x49; | I |
| U+004A | &#74; | &#x4A; | J |
| U+004B | &#75; | &#x4B; | K |
| U+004C | &#76; | &#x4C; | L |
| U+004D | &#77; | &#x4D; | M |
| U+004E | &#78; | &#x4E; | N |
| U+004F | &#79; | &#x4F; | O |
| U+0050 | &#80; | &#x50; | P |
| U+0051 | &#81; | &#x51; | Q |
| U+0052 | &#82; | &#x52; | R |
| U+0053 | &#83; | &#x53; | S |
| U+0054 | &#84; | &#x54; | T |
| U+0055 | &#85; | &#x55; | U |
| U+0056 | &#86; | &#x56; | V |
| U+0057 | &#87; | &#x57; | W |
| U+0058 | &#88; | &#x58; | X |
| U+0059 | &#89; | &#x59; | Y |
| U+005A | &#90; | &#x5A; | Z |
| U+005B | &#91; | &#x5B; | [ |
| U+005C | &#92; | &#x5C; | \ |
| U+005D | &#93; | &#x5D; | ] |
| U+005E | &#94; | &#x5E; | ^ |
| U+005F | &#95; | &#x5F; | _ |
| U+0060 | &#96; | &#x60; | ` |
| U+0061 | &#97; | &#x61; | a |
| U+0062 | &#98; | &#x62; | b |
| U+0063 | &#99; | &#x63; | c |
| U+0064 | &#100; | &#x64; | d |
| U+0065 | &#101; | &#x65; | e |
| U+0066 | &#102; | &#x66; | f |
| U+0067 | &#103; | &#x67; | g |
| U+0068 | &#104; | &#x68; | h |
| U+0069 | &#105; | &#x69; | i |
| U+006A | &#106; | &#x6A; | j |
| U+006B | &#107; | &#x6B; | k |
| U+006C | &#108; | &#x6C; | l |
| U+006D | &#109; | &#x6D; | m |
| U+006E | &#110; | &#x6E; | n |
| U+006F | &#111; | &#x6F; | o |
| U+0070 | &#112; | &#x70; | p |
| U+0071 | &#113; | &#x71; | q |
| U+0072 | &#114; | &#x72; | r |
| U+0073 | &#115; | &#x73; | s |
| U+0074 | &#116; | &#x74; | t |
| U+0075 | &#117; | &#x75; | u |
| U+0076 | &#118; | &#x76; | v |
| U+0077 | &#119; | &#x77; | w |
| U+0078 | &#120; | &#x78; | x |
| U+0079 | &#121; | &#x79; | y |
| U+007A | &#122; | &#x7A; | z |
| U+007B | &#123; | &#x7B; | { |
| U+007C | &#124; | &#x7C; | \| |
| U+007D | &#125; | &#x7D; | } |
| U+007E | &#126; | &#x7E; | ~ |
Discussion
Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.
Ideally, when the characters of a document utilizing a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding that is used will be one that supports representing each and every character in the document, if not in the whole of Unicode, directly as a particular bit sequence.
Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.
Documents are rarely, in practice, ever allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.
The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
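As a rough illustration of this escaping mechanism, the Python sketch below encodes text into a limited encoding and lets the standard `xmlcharrefreplace` error handler substitute a decimal numeric character reference for each character the encoding cannot represent. The helper name `to_markup_safe` is hypothetical, and the handler does not escape markup-significant ASCII characters such as & or <.

```python
def to_markup_safe(text: str, encoding: str = "ascii") -> str:
    """Encode text, replacing unencodable characters with decimal NCRs."""
    # 'xmlcharrefreplace' is a built-in codec error handler for encoding.
    raw = text.encode(encoding, errors="xmlcharrefreplace")
    # Decode again only so the result can be handled as a string.
    return raw.decode(encoding)


print(to_markup_safe("Σ = sum, Æ = ash"))     # &#931; = sum, &#198; = ash
print(to_markup_safe("Æ and Σ", "latin-1"))   # Æ and &#931;
```

In the second call Æ survives unchanged because ISO 8859-1 can encode it directly, while Σ still has to be referenced.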
Character references that are based on the referenced character's UCS or Unicode code point are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:
Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:
- one or more decimal digits zero (U+0030) through nine (U+0039); or
- character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);
all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
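The syntax described above can be sketched as a regular expression. The following Python example is a minimal illustration, not a conforming HTML or XML parser: the names `NCR_PATTERN` and `decode_ncrs` are hypothetical, and the sketch applies none of the language-specific restrictions on which code points may be referenced.

```python
import re

# '&#', then decimal digits or 'x' plus hexadecimal digits, then ';'.
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text: str) -> str:
    """Replace each syntactically valid NCR with the character it names."""
    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # Values beyond the Unicode range are left untouched in this sketch.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


print(decode_ncrs("&#931; &#0931; &#x3A3; &#x03a3;"))  # Σ Σ Σ Σ
```

Leading zeros and the case of the hexadecimal digits make no difference to the result, because only the numeric value is used.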
The characters that comprise a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.
There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
Restrictions
The Universal Character Set defined by ISO 10646 is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.
While the syntax of SGML does not prohibit references to invalid or unassigned code points, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters.
Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference.[1] As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers, some of which interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding for compatibility reasons. This character, "€", has to be represented as &#x20AC; in standard-compliant HTML code. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#x10000; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
Markup languages also place restrictions on where character references can occur.
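The form feed restriction can be observed with an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module, which is backed by an XML 1.0 processor; the helper name `parses_as_xml` is hypothetical, and other processors may report the error differently.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml("<p>&#931;</p>"))  # True: Greek capital sigma is allowed
print(parses_as_xml("<p>&#12;</p>"))   # False: form feed may not be referenced
```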
Compatibility issues
In the initial versions of SGML and HTML, numeric character references were interpreted in relationship to the document character encoding, rather than Unicode. For Latin-script documents, numeric character references to characters between x80 and x9F in those documents will not be correct against Unicode, and must be recoded. HTML standards prior to HTML 4 supported only Western Latin script documents: the treatment of character references above #7F may vary between applications and national conventions.
For example, as mentioned above, the correct numeric character reference for the Euro sign "€" U+20AC when using Unicode is decimal &#8364; and hexadecimal &#x20AC;. However, if using tools supporting obsolete implementations of HTML, the reference &#128; (Euro sign in the CP-1252 code page) or &#164; (Euro sign in ISO/IEC 8859-15) may work.
As another example, if some text was created originally using the MacRoman character set, the left double quotation mark “ will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP-1252, where this code point is occupied by the letter Ò. The correct numeric character reference for “ in HTML 4 and newer is &#x201C;, because U+201C is its UCS code. In some systems, the named character reference &ldquo; may also be available.
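The MacRoman case can be reproduced with Python's built-in codecs. This is a small sketch in which the variable names are illustrative: it decodes the single byte xD2 under two encodings and derives the Unicode-based reference from the MacRoman interpretation.

```python
# The byte a MacRoman document stores for the left double quotation mark.
raw = bytes([0xD2])

as_mac_roman = raw.decode("mac_roman")  # interpretation the author intended
as_cp1252 = raw.decode("cp1252")        # interpretation on a CP-1252 system

print(f"U+{ord(as_mac_roman):04X}")     # U+201C
print(f"U+{ord(as_cp1252):04X}")        # U+00D2
print(f"&#x{ord(as_mac_roman):X};")     # &#x201C;
```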
References
1. "HTML 5.2: 8. The HTML syntax". www.w3.org.
Numeric character reference
In HTML, a numeric character reference consists of &# followed by a decimal integer (or by x and a hexadecimal integer) and ends with a semicolon (;), such as &#65; for the Latin capital letter A (U+0041) or &#x41; for the same character in hexadecimal form.[1] The parser converts the numeric value to the corresponding Unicode code point, emitting it as a character token, while handling invalid references, such as those exceeding U+10FFFF or representing surrogate code points, by substituting the replacement character U+FFFD.[1] Semicolons are required for conformance; a missing semicolon is a parse error but does not prevent processing in tolerant parsers.[1]
Similarly, in XML 1.0 as specified by the W3C, NCRs follow the production CharRef ::= '&#' [0-9]+ ';' for decimal or CharRef ::= '&#x' [0-9a-fA-F]+ ';' for hexadecimal, ensuring representation of legal XML characters within the defined ranges (e.g., U+0020 to U+10FFFF, excluding certain controls).[2] These references must denote valid characters per the XML Char production and are expanded immediately by processors into the referenced character data, facilitating interoperability across diverse encoding environments.[2] Both standards emphasize NCRs as a fundamental mechanism for escaping special characters like < (&#60; or &#x3C;) and & (&#38; or &#x26;) in markup, distinct from named character references that use predefined entity names.[1][2]
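The Char constraint mentioned above can be expressed as a simple range check. The Python sketch below assumes the ranges of the Char production as published in XML 1.0 (Fifth Edition); the names `XML10_CHAR_RANGES` and `is_xml10_char` are hypothetical, and XML 1.1 uses a different production.

```python
# Inclusive code point ranges, assumed from the XML 1.0 (Fifth Edition)
# Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
# | [#x10000-#x10FFFF].
XML10_CHAR_RANGES = (
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml10_char(code_point: int) -> bool:
    """Return True if a CharRef to this code point names a legal character."""
    return any(low <= code_point <= high for low, high in XML10_CHAR_RANGES)


print(is_xml10_char(0x3A3))   # True: Greek capital sigma
print(is_xml10_char(0xC))     # False: form feed
print(is_xml10_char(0xD800))  # False: surrogate code point
```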
Syntax and Formats
Decimal Form
The decimal form of a numeric character reference begins with an ampersand (&) immediately followed by a number sign (#), then one or more decimal digits (0 through 9) that represent the Unicode code point value in base-10, and ends with a semicolon (;).[2][3]
This form is used to reference Unicode code points in the range from U+0000 to U+10FFFF, subject to the validity rules defined by the markup language specification.[4]
Leading zeros are permitted in the decimal sequence and do not change the interpreted value, allowing flexibility in formatting while maintaining equivalence (e.g., &#0065; equals &#65;).[2][5]
For example, &#931; denotes the Greek capital letter sigma (Σ) at code point U+03A3.[3]
The decimal form serves as the base-10 alternative to the hexadecimal notation for referencing the same Unicode code points.[2][3]
Hexadecimal Form
The hexadecimal form of a numeric character reference provides an alternative to the decimal variant by expressing the Unicode code point in base-16 notation.[6][7] It begins with the sequence &#x (the uppercase variant &#X is accepted in HTML only), followed by one or more hexadecimal digits representing the code point value, and terminates with a semicolon (;).[6][7] This format allows for the inclusion of characters whose code points are more succinctly represented in hexadecimal, particularly for higher values beyond the basic ASCII range.[6]
Hexadecimal digits in this reference are case-insensitive, accepting both uppercase (A-F) and lowercase (a-f) letters alongside digits 0-9; for instance, both &#x3A3; and &#x3a3; resolve to the Greek capital letter sigma (Σ, U+03A3).[6][7] The valid numeric range mirrors that of the decimal form, from U+0000 to U+10FFFF, subject to language-specific restrictions on certain code points such as controls and noncharacters.[6][8] Leading zeros are permitted but not required, as in &#x000A; for the line feed character (U+000A), and they do not alter the interpreted value.[6][7]
A practical example is &#x2665;, which renders as ♥ (black heart suit, U+2665). The hexadecimal digits match the U+ notation used for Unicode code points, so the reference can be read off directly, whereas the decimal counterpart &#9829; requires a base conversion; both forms here are four digits long.[6][7]
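The relationship between the two forms can be checked with a few lines of Python. This is a minimal sketch; the function name `ncr_forms` is hypothetical, and it expects a single-character string.

```python
def ncr_forms(character: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal NCRs for one character."""
    code_point = ord(character)  # raises TypeError unless len(character) == 1
    return f"&#{code_point};", f"&#x{code_point:X};"


print(ncr_forms("♥"))  # ('&#9829;', '&#x2665;')
print(ncr_forms("Σ"))  # ('&#931;', '&#x3A3;')
```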
Applications in Markup Languages
In HTML
Numeric character references (NCRs) in HTML serve to represent characters that cannot be directly entered from a keyboard or that might conflict with markup syntax, such as the less-than sign (<) or ampersand (&). For instance, the reserved ampersand can be encoded as &#38; to prevent it from being interpreted as the start of another entity.[9]
In HTML5, both decimal and hexadecimal NCR forms are supported, allowing references to Unicode code points up to U+10FFFF. The decimal form uses &# followed by decimal digits, while the hexadecimal form uses &#x or &#X followed by hexadecimal digits; both must end with a semicolon (;) to conform. When the semicolon is missing, for example where other text follows the digits directly, the parser reports a parse error but still resolves the reference for compatibility.[9][10]
NCRs are resolved to their corresponding Unicode characters during parsing, independent of the document's declared encoding, with UTF-8 assumed as the default if none is specified. Invalid NCRs, such as those referencing code points outside the Unicode range or malformed sequences, are typically treated as literal text by parsers or replaced with the Unicode replacement character U+FFFD.[3][11]
HTML5 treats NCRs in the C1 control range (U+0080 to U+009F) as parse errors. For compatibility with legacy content, the tokenizer maps most of them to the characters that Windows-1252 assigns to the same byte values (for example, &#128; yields the euro sign, U+20AC), so authors should reference the intended Unicode code point directly.[12][13]
The terminating semicolon is part of the reference syntax rather than of the character data: &#59; encodes a single semicolon. A sequence that does not form a reference at all, such as &# with no digits after it, is left in the output as literal text. In contrast to XML's stricter requirements, HTML's more lenient approach ensures broader compatibility with legacy content.[14]
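Python's standard `html.unescape` function follows the HTML5 rules for character references, so it can serve as a quick way to observe the lenient behaviour described in this section. The sketch below only exercises that function; it does not run a full HTML parser.

```python
from html import unescape

# Decimal, hexadecimal, and uppercase-X forms all resolve to the same character.
print(unescape("&#931; &#x3A3; &#X3a3;"))   # Σ Σ Σ

# A missing semicolon is tolerated.
print(unescape("&#65"))                     # A

# A reference into the C1 range is remapped through Windows-1252.
print(unescape("&#128;") == "\u20ac")       # True

# Out-of-range and surrogate references become U+FFFD.
print(unescape("&#x110000;") == "\ufffd")   # True
print(unescape("&#xD800;") == "\ufffd")     # True
```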
In XML and SGML
Numeric character references were introduced in the Standard Generalized Markup Language (SGML), defined by ISO 8879:1986, to reference characters by their numeric position within the document character set, a coded character set specified in the SGML declaration that determines the repertoire of allowable characters.[15] This mechanism allowed authors to include characters not directly available in a limited encoding by substituting them with references like &#N;, where N denotes the character's code position.[15] In SGML, these references are resolved relative to the declared document character set, providing flexibility for various international character sets, though SGML itself does not mandate Unicode.[16]
XML, as a subset of SGML, adapted numeric character references to align with Unicode (ISO/IEC 10646), requiring them to denote valid Unicode code points while mandating a terminating semicolon in all cases for unambiguous parsing.[17] Both decimal (&#d;) and hexadecimal (&#xh;) forms have been supported since XML 1.0's initial 1998 recommendation, enabling references to characters across the Unicode range.[18] In XML 1.0 editions prior to 2000, valid code points were restricted to exclude most control characters (e.g., #x1 through #x1F were forbidden both directly and via reference, except for #x9, #xA, and #xD), surrogates (#xD800–#xDFFF), and noncharacters like #xFFFE and #xFFFF, limiting the range effectively to U+0009, U+000A, U+000D, U+0020–U+D7FF, and U+E000–U+FFFD.[18]
XML 1.1, introduced in 2004, extended support to the full Unicode repertoire (up to U+10FFFF) by permitting numeric references to additional control characters (#x1–#x1F and #x7F–#x9F, excluding #x0), while maintaining prohibitions on surrogates and requiring special handling in UTF-16 encodings where surrogates might appear in streams.[19]
XML parsers enforce numeric character references through strict well-formedness validation, expanding valid ones immediately into character data and treating invalid references, such as those to disallowed code points, as fatal errors that halt processing.[17] This rigorous enforcement contrasts with HTML's more permissive approach, which tolerates certain malformed references for practical web authoring.[17] In both XML 1.0 and 1.1, processors must reject documents containing invalid numeric references to ensure conformance to Unicode semantics and document integrity.[19]
Illustrative Examples
Common ASCII and Latin Characters
Numeric character references (NCRs) provide a standardized way to represent common characters from the ASCII range (Unicode U+0020 to U+007E) and the Latin-1 Supplement (U+0080 to U+00FF), particularly those that are reserved in markup languages or difficult to input on standard keyboards.[20][2] These references are especially useful for symbols like the ampersand (&) and less-than sign (<), which must be escaped in HTML and XML to avoid interpretation as markup delimiters.[20] In practice, NCRs are frequently employed for the characters & (ampersand), < (less-than), > (greater-than), " (quotation mark), and ' (apostrophe) to prevent entity conflicts and ensure document validity.[2]
For the basic ASCII printable characters, NCRs allow direct reference to their Unicode code points. For instance, the ampersand (&, U+0026) is represented as &#38; in decimal form, while the less-than sign (<, U+003C) uses &#60;.[20] Similarly, the greater-than sign (>, U+003E) is &#62;, the quotation mark (", U+0022) is &#34;, and the apostrophe (', U+0027) is &#39;.[2] Letters like uppercase A (U+0041) can be denoted as &#65;, though such usage is rare for alphanumeric characters that are easily typed.[20]
Extending to the Latin-1 Supplement, NCRs facilitate inclusion of accented and symbolic characters common in Western European languages. The copyright symbol (©, U+00A9) is encoded as &#169;, and the cent sign (¢, U+00A2) as &#162;.[20] For ligatures in legacy contexts, such as the small oe (œ, U+0153 in Unicode, often mapped from Windows-1252 byte 0x9C), the correct NCR is &#339; in decimal, though older systems might reference it via encoding-specific decimals like 156 for compatibility.[2] These examples highlight how NCRs bridge basic input limitations while adhering to Unicode standards.[20]
| Character | Description | Unicode | Decimal NCR | Hexadecimal NCR |
|---|---|---|---|---|
| & | Ampersand | U+0026 | &#38; | &#x26; |
| < | Less-than | U+003C | &#60; | &#x3C; |
| > | Greater-than | U+003E | &#62; | &#x3E; |
| " | Quotation mark | U+0022 | &#34; | &#x22; |
| ' | Apostrophe | U+0027 | &#39; | &#x27; |
| © | Copyright | U+00A9 | &#169; | &#xA9; |
| ¢ | Cent sign | U+00A2 | &#162; | &#xA2; |
| œ | Small oe ligature | U+0153 | &#339; | &#x153; |
