List of XML and HTML character entity references
from Wikipedia

In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

Character reference overview


In HTML and XML, a numeric character reference refers to a character by its Universal Coded Character Set/Unicode code point, and uses the format:

&#xhhhh;

or

&#nnnn;

where the x must be lowercase in XML documents, hhhh is the code point in hexadecimal form, and nnnn is the code point in decimal form. The hhhh (or nnnn) may be any number of hexadecimal (or decimal) digits and may include leading zeros. The hhhh for hexadecimal digits may mix uppercase and lowercase letters, though uppercase is the usual style. However, the XML and HTML standards restrict the usable code points to a set of valid values, which is a subset of UCS/Unicode code point values, that excludes all code points assigned to non-characters or to surrogates, and most code points assigned to C0 and C1 controls (with the exception of line separators and tabulations treated as white spaces).
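
The two formats above can be sketched in Python. This is a minimal illustration, not a reference implementation: the helper names (`is_xml10_char`, `numeric_reference`, `decode_numeric_reference`) are hypothetical, and the validity check follows the `Char` production of XML 1.0 as an assumption about which code points a document may reference.

```python
import re

# Matches "&#xhhhh;" (lowercase x, as XML requires) or "&#nnnn;".
_NUMERIC = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def numeric_reference(ch: str, hexadecimal: bool = True) -> str:
    """Build a numeric character reference for a single character."""
    cp = ord(ch)
    if not is_xml10_char(cp):
        raise ValueError(f"U+{cp:04X} cannot be referenced in XML 1.0")
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


def decode_numeric_reference(ref: str) -> str:
    """Decode a reference such as '&#xE9;' or '&#233;' into its character."""
    match = _NUMERIC.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a numeric character reference: {ref!r}")
    hex_digits, dec_digits = match.groups()
    cp = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
    if not is_xml10_char(cp):
        raise ValueError(f"code point {cp:#x} is not allowed in XML 1.0")
    return chr(cp)
```

Leading zeros and mixed-case hexadecimal digits are accepted, so `&#x00e9;` and `&#0233;` both decode to the same character, while an uppercase `X` prefix is rejected to mirror the XML rule.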

By contrast, a character entity reference refers to a sequence of one or more characters by the name of an entity which has the desired characters as its replacement text. The format is: &name; where name is the case-sensitive name of the entity. The semicolon is usually required in the character entity reference, unless marked otherwise in the table below (see [a]). The entity must either be predefined (built into the markup language), or declared in a Document Type Definition (DTD) by using <!ENTITY name "value">.[b]

Standard public entity sets for characters

XML
XML specifies five predefined entities needed to support every printable ASCII character: &amp;, &lt;, &gt;, &apos;, and &quot;. The trailing semicolon is mandatory in XML (and XHTML) for these five entities (even if HTML or SGML allows omitting it for some of them, according to their DTD).
ISO Entity Sets
SGML supplied a comprehensive set of entity declarations for characters widely used in Western technical and reference publishing, for Latin, Greek and Cyrillic scripts. The American Mathematical Society also contributed entities for mathematical characters (see [c]).
HTML Entity Sets
Early versions of HTML built in small subsets of these, relating to characters found in three Western 8-bit fonts.
MathML Entity Sets
The W3C developed a set of entity declarations for MathML characters.
XML Entity Sets
The W3C MathML Working Group took over maintenance of the ISO public entity sets, combined them with the MathML set, and documents them in XML Entity Definitions for Characters. This set can support the requirements of XHTML and MathML, and can serve as an input to future versions of HTML.
HTML5
HTML5 adopts the XML entities as named character references, and does not group them into sets. The character reference names originate from XML Entity Definitions for Characters. The HTML5 specification additionally provides mappings from the names to Unicode character sequences using JSON.
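
As an illustration of such a name-to-character mapping, Python's standard library exposes the HTML5 names in the `html.entities.html5` dictionary, whose keys omit the leading ampersand and usually include the trailing semicolon. The sketch below only inspects that dictionary; the `code_points` helper is a hypothetical name introduced here.

```python
from html.entities import html5


def code_points(name: str) -> str:
    """Return the expansion of an HTML5 name (e.g. 'nbsp;') as U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in html5[name])


# A name may expand to a sequence of more than one code point.
print(code_points("nbsp;"))
print(code_points("NotEqualTilde;"))
```

Because some names map to two code points, a lookup table keyed on a single code point cannot represent the full HTML5 set.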

Numerous other entity sets have been developed for special requirements, and for major and minority scripts. However, the advent of Unicode has largely superseded them.

Formal public identifiers for HTML DTD entities subsets


The full formal public identifier and system identifier for the DTD entities subset (where the character entity name is defined) is actually mapped from one of the following three defined named entities:

HTML DTD entities subsets
Name Version Formal public identifier System identifier
HTMLlat1 HTML 4 "-//W3C//ENTITIES Latin 1//EN//HTML" "http://www.w3.org/TR/html4/HTMLlat1.ent" (optional)
XHTML 1 "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent"
HTMLsymbol HTML 4 "-//W3C//ENTITIES Symbols//EN//HTML" "http://www.w3.org/TR/html4/HTMLsymbol.ent" (optional)
XHTML 1 "-//W3C//ENTITIES Symbols for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent"
HTMLspecial HTML 4 "-//W3C//ENTITIES Special//EN//HTML" "http://www.w3.org/TR/html4/HTMLspecial.ent" (optional)
XHTML 1 "-//W3C//ENTITIES Special for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent"
html.dtd[i] N/A "http://info.cern.ch/MarkUp/html-spec/html.dtd"
HTML 5[ii] "-//W3C//ENTITIES HTML MathML Set//EN//XML" "http://www.w3.org/2003/entities/2007/htmlmathml-f.ent"
  1. ^ The original HTML 1.0 DTD, which would have been available at http://info.cern.ch/MarkUp/html-spec/html.dtd
  2. ^ There is no DTD for HTML 5, where all entities are predefined; it is impossible to strictly validate in XML the schema needed for (X)HTML 5, without also defining custom XSDs (at least for the custom "data-*" attributes). Rather than requiring support for a DTD (with the associated security concerns such as billion laughs), the best way to securely interchange HTML5 with XHTML is to convert all entity references to plain text, numerical character references, or (where applicable) the five standard entities of XML 1.0. That being said:
    • The HTML 5 entity set is also used in MathML 3 and, for that purpose, its DTD entities subset is assigned the identifier set PUBLIC "-//W3C//ENTITIES HTML MathML Set//EN//XML" "http://www.w3.org/2003/entities/2007/htmlmathml-f.ent".[1]
    • The WHATWG specification encourages browsers to map the formal public identifiers for MathML 2 or XHTML 1.x (when used in XML) to a data URI containing the HTML5 entity set, and give this precedence over the provided system identifier, so as to "handle entities in an interoperable fashion without requiring any network access".[2]
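
The conversion suggested in note 2 (rewriting named references as numeric ones while keeping the five XML entities) could look roughly like the following. This is a sketch under stated assumptions: `named_to_numeric` and `XML_PREDEFINED` are hypothetical names, the name table comes from Python's `html.entities.html5`, and the regular expression does not skip CDATA sections, comments, or script content, which a real converter would need to handle.

```python
import re
from html.entities import html5

# The five entities every XML processor already knows.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A named reference with its mandatory (in XML) trailing semicolon.
_NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Rewrite HTML named references as hexadecimal numeric references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)          # safe in XML as-is
        expansion = html5.get(name + ";")
        if expansion is None:
            return match.group(0)          # unknown name: leave untouched
        return "".join(f"&#x{ord(ch):X};" for ch in expansion)

    return _NAMED.sub(replace, markup)


print(named_to_numeric("<p>caf&eacute;&nbsp;&amp;&nbsp;bar</p>"))
```

The output no longer depends on any DTD, so an XML parser that ignores external entities can still read it.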

Formal public identifiers for old ISO entities subsets


The ISO entities subsets are old (documented) character subsets, which are given SGML character entity names in ISO 8879 and ISO 9573, and which were used in legacy encodings before the unification within ISO 10646. Their full formal public identifiers are as follows:

ISO entities subsets
Name Formal public identifier(s)
ISOamsa
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Arrow Relations//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Arrow Relations//EN"[4]
ISOamsb
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Binary Operators//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Binary Operators//EN"[4]
ISOamsc
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Delimiters//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Delimiters//EN"[4]
ISOamsn
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Negated Relations//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Negated Relations//EN"[4]
ISOamso
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Ordinary//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Ordinary//EN"[4]
ISOamsr
  • "ISO 8879:1986//ENTITIES Added Math Symbols: Relations//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Added Math Symbols: Relations//EN"[4]
ISObox "ISO 8879:1986//ENTITIES Box and Line Drawing//EN"[i][3]
ISOchem "ISO 9573-13:1991//ENTITIES Chemistry//EN"[4]
ISOcyr1 "ISO 8879:1986//ENTITIES Russian Cyrillic//EN"[i][3]
ISOcyr2 "ISO 8879:1986//ENTITIES Non-Russian Cyrillic//EN"[i][3]
ISOdia "ISO 8879:1986//ENTITIES Diacritical Marks//EN"[i][3]
ISOgrk1 "ISO 8879:1986//ENTITIES Greek Letters//EN"[i][3]
ISOgrk2 "ISO 8879:1986//ENTITIES Monotoniko Greek//EN"[i][3]
ISOgrk3
  • "ISO 8879:1986//ENTITIES Greek Symbols//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Greek Symbols//EN"[4]
ISOgrk4
  • "ISO 8879:1986//ENTITIES Alternative Greek Symbols//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES Alternative Greek Symbols//EN"[4]
ISOlat1 "ISO 8879:1986//ENTITIES Added Latin 1//EN"[i][ii][3]
ISOlat2 "ISO 8879:1986//ENTITIES Added Latin 2//EN"[i][3]
ISOmfrk "ISO 9573-13:1991//ENTITIES Math Alphabets: Fraktur//EN"[4]
ISOmopf "ISO 9573-13:1991//ENTITIES Math Alphabets: Open Face//EN"[4]
ISOmscr "ISO 9573-13:1991//ENTITIES Math Alphabets: Script//EN"[4]
ISOnum "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN"[i][3]
ISOpub "ISO 8879:1986//ENTITIES Publishing//EN"[i][3]
ISOtech
  • "ISO 8879:1986//ENTITIES General Technical//EN"[i][3]
  • "ISO 9573-13:1991//ENTITIES General Technical//EN"[4]
  1. ^ A version beginning with ISO 8879-1986// instead of ISO 8879:1986// is considered deprecated.[3]
  2. ^ A version with appended //HTML is sometimes erroneously used for the larger HTMLlat1 entity set, i.e. instead of "-//W3C//ENTITIES Latin 1//EN//HTML"[3] (see above).

List of character entity references in HTML


Entities representing special characters in XHTML


The XHTML DTDs explicitly declare 253 entities (including the 5 predefined entities of XML 1.0) whose expansion is a single character, which can therefore be informally referred to as "character entities". These (with the exception of the &apos; entity) have the same names and represent the same characters as the 252 character entities in HTML 4.0. Also, by virtue of being XML, XHTML documents may reference the predefined &apos; entity, which is not one of the 252 character entities in HTML 4.0. Additional entities of any size may be defined on a per-document basis. However, the usability of entity references in XHTML is affected by how the document is being processed:[citation needed]

  • Legacy abbreviated character entities (without the final semicolon) inherited from HTML 2.0 (and still supported in HTML 5.0) are not supported in XML 1.0 and XHTML; the trailing semicolon must be present in all entity references used in XML and XHTML documents.
  • If the XHTML document is read by a conforming HTML 4.0 processor, then only the 252 HTML 4.0 character entities may safely be used. The use of &apos; or custom entity references may not be supported and may produce unpredictable results (it is recommended to use the numerical character reference &#39; instead).
  • If the document is read by an XML parser that does not or cannot read external entities, then only the five built-in XML character entities can safely be used, although other entities may be used if they are declared in the internal DTD subset. However, modern XML parsers recognize and implement a built-in cache for SGML references to DTDs used by all standard versions of HTML, XHTML, SVG and MathML, without needing to parse and process the external DTD via their URL and without needing to process entities defined in an internal DTD subset of the document.
  • If the document is read by an XML parser that does read external entities and does not implement a built-in cache for well-known DTDs, then the five built-in XML character entities (and numeric character references) can safely be used. The other 248 HTML character entities can be used as long as the XHTML DTD is accessible to the parser at the time the document is read. Other entities may also be used if they are declared in the internal DTD subset and the XML processor can parse internal DTD subsets.[citation needed]
  • HTML 5.0 parsers cannot process XHTML documents, and it is impossible to define a fully validating DTD for HTML5 documents encoded with the XHTML syntax (notably, not all attribute names can be validated, such as the "data-*" attributes). It is likewise still impossible to fully validate HTML5 documents represented in the XHTML syntax with standard schema languages for XML, such as XSD or RELAX NG; for now, a custom validator specific to HTML 5.0 is required.

Because of the special &apos; case mentioned above, only &quot;, &amp;, &lt;, and &gt; will work in all XHTML processing situations.
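
The difference in semicolon handling described in the list above can be observed with Python's standard library, used here as a stand-in for real processors: `html.unescape` follows the HTML5 decoding rules, and `xml.etree.ElementTree` wraps a non-validating XML parser. The helper name `xml_accepts` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(document: str) -> bool:
    """Return True if the XML parser reads `document` without error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# HTML5 rules: a legacy name such as "copy" is decoded without its semicolon.
print(html.unescape("&copy 2024"))

# XML rules: the same omission makes the document not well-formed.
print(xml_accepts("<r>&amp;</r>"), xml_accepts("<r>&amp</r>"))
```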

from Grokipedia
Character entity references in XML and HTML are predefined named sequences, beginning with an ampersand (&) and typically ending with a semicolon (;), that represent specific Unicode characters within markup documents. These references allow authors to include reserved characters (such as angle brackets and ampersands, which have structural roles in markup) or other special symbols without causing parsing errors or ambiguity.

In XML, the specification defines exactly five predefined character entity references to escape the characters with special syntactic meaning: &amp; for U+0026 ampersand, &lt; for U+003C less-than sign, &gt; for U+003E greater-than sign, &quot; for U+0022 quotation mark, and &apos; for U+0027 apostrophe. These are the only built-in named entities in XML, though additional entities can be declared in document type definitions (DTDs) for custom use. XML requires all entity references to end with a semicolon and enforces strict parsing, disallowing ambiguous or incomplete forms.

HTML, governed by the WHATWG standard, extends this concept significantly with over 2,000 named character references to support a broader array of glyphs, including diacritics, mathematical operators, geometric shapes, and legacy symbols from earlier versions. In this larger set, references like &nbsp; (no-break space) or &euro; (euro sign) provide convenient mnemonics for characters that may not be easily typed or displayed directly. Unlike XML parsing, HTML parsing is more forgiving, permitting some references without a trailing semicolon in certain contexts, though best practice is to include it to avoid ambiguity.

The five core XML references are fully supported in HTML, but HTML's larger repertoire includes entities not recognized in plain XML, such as those for currency symbols or arrows. For cross-compatibility, the W3C provides supplementary entity sets that can be incorporated into XML via DTDs, mirroring many HTML names while adhering to XML's stricter rules. These lists are essential for developers working with XHTML (HTML serialized as XML) or other hybrid formats, ensuring consistent rendering across parsers.

Fundamentals of Character References

Definition and Role in Markup Languages

Character entity references are named sequences in markup languages that begin with an ampersand (&) and end with a semicolon (;), used to represent specific characters within content. They allow the inclusion of characters that have special syntactic meaning in the markup, such as the less-than sign (<) via &lt;, without triggering the parser's interpretation of them as structural elements. In both XML and HTML, they function as an escaping mechanism to ensure documents remain well-formed and parsable, preventing errors that could arise from directly embedding reserved delimiters like < or &.

Originating from the Standard Generalized Markup Language (SGML), standardized as ISO 8879 in 1986, character entity references were designed to enhance document modularity and portability across systems. HTML, proposed by Tim Berners-Lee in 1990 and formalized through early specifications in the 1990s, adopted SGML's entity reference system to support structured web documents. XML, released as a W3C Recommendation in February 1998, simplified SGML's entity model while retaining named references for compatibility, establishing a streamlined subset focused on web interoperability.

Beyond escaping reserved characters, entity references enable the representation of non-ASCII characters in documents stored in legacy character encodings, facilitating internationalization through mappings to Unicode code points. This capability supports the inclusion of symbols, accented letters, and other extended characters essential for multilingual content, ensuring portability without reliance on specific local encodings. By standardizing these references, XML and HTML promote consistent rendering across diverse platforms and parsers.

Syntax Variations Between XML and HTML

Character references in both XML and HTML serve to represent reserved or special characters within markup, using a common general syntax that begins with an ampersand (&) and ends with a semicolon (;). For named character references, this takes the form &name;, where "name" is a predefined identifier for a specific character. Numeric character references provide alternatives, using decimal form as &#decimal; or hexadecimal as &#xhex;, allowing direct reference to Unicode code points without relying on named identifiers.

In XML, the syntax adheres to strict rules defined in the XML 1.0 specification, limiting built-in support to only five predefined named entities: &amp; for ampersand, &lt; for less-than, &gt; for greater-than, &apos; for apostrophe, and &quot; for quotation mark. Any other named entity must be explicitly declared in a Document Type Definition (DTD) or replaced by a numeric reference; in a document without a DTD, a reference to an undeclared name is a fatal well-formedness error. XML parsing is case-sensitive for entity names, so variations like &Amp; are invalid and rejected. These checks apply whether or not the processor also validates the document.

HTML, governed by the HTML Living Standard, employs the same syntactic forms but offers broader support through a large set of predefined named character references (over 2,000 in total) covering Latin, Greek, mathematical symbols, and more, without needing external declarations. Browsers tolerate these references forgivingly; if a named reference is undefined or malformed (such as an ambiguous ampersand followed by alphanumerics and a semicolon), it is typically rendered as literal text rather than causing a parse failure. Like XML, HTML named character references are case-sensitive, requiring exact matching to the defined names (e.g., &Aacute; differs from &aacute;). HTML parsers prioritize robustness, allowing recovery from errors to display content, in contrast to XML's strict rejection.
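
The contrast between strict XML handling and forgiving HTML handling can be sketched with Python's standard library. This is an illustration under assumptions: `html.unescape` stands in for an HTML parser's reference decoding, `xml.etree.ElementTree` stands in for a non-validating XML processor, and `parse_xml_text` is a hypothetical helper.

```python
import html
import xml.etree.ElementTree as ET


def parse_xml_text(document: str) -> str:
    """Return the root element's text, or a string describing the parse error."""
    try:
        return ET.fromstring(document).text or ""
    except ET.ParseError as exc:
        return f"error: {exc}"


# XML: predefined names resolve; undeclared or mis-cased names are fatal.
for doc in ("<r>&lt;</r>", "<r>&nbsp;</r>", "<r>&Amp;</r>"):
    print(doc, "->", parse_xml_text(doc))

# HTML5 rules: known names are decoded, unknown names stay as literal text,
# and names are case-sensitive.
for ref in ("&nbsp;", "&bogus;", "&Aacute;", "&aacute;"):
    print(ref, "->", repr(html.unescape(ref)))
```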

Predefined Entities in XML

Core Five Predefined Entities

The core five predefined entities in XML are built-in references that every XML processor must recognize without any external or document-specific entity declaration. They map to characters with special syntactic roles in XML markup, ensuring that literal occurrences of those characters within element content or attribute values can be unambiguously represented. Defined in the Extensible Markup Language (XML) 1.0 specification, they are: &lt; for the less-than sign (<, U+003C), &gt; for the greater-than sign (>, U+003E), &amp; for the ampersand (&, U+0026), &apos; for the apostrophe (', U+0027), and &quot; for the quotation mark (", U+0022).

These entities serve as escapes for reserved characters to prevent parsing ambiguities, such as mistaking literal angle brackets for tag delimiters or ampersands for entity starters. XML processors are required to substitute these references with their corresponding characters during parsing, whether or not the document is validated. No additional declarations are needed for their use, distinguishing them from other entities that rely on DTDs. This built-in support promotes interoperability across XML implementations.

The rationale for these predefined entities stems from the need to delimit markup structures reliably in XML documents, avoiding conflicts between content and syntax elements like tags (< and >), attribute values (quotes), and references (&). Introduced in the initial XML 1.0 Recommendation published in February 1998, they form the minimal set required for basic XML compliance, with subsequent editions maintaining their definitions for backward compatibility. For example, to include a literal less-than sign in element content without triggering tag interpretation, it must be written as &lt;; using a raw < would cause a parsing error because the processor would expect a tag. Consider this invalid XML snippet:

<root> < if x > 0 </root>

This fails because the raw < is read as the start of a tag. A raw > is permitted in element content, but it is conventionally escaped as well. The valid form is:

<root> &lt; if x &gt; 0 </root>

Similarly, an ampersand in content requires &amp; to avoid confusion with entity references, as in <root>&amp; is logical AND</root>. Failure to escape these can lead to well-formedness errors, halting processing.
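
A minimal sketch of this escaping in Python, using `xml.sax.saxutils.escape` from the standard library. The wrapper names `escape_text` and `escape_attribute` are hypothetical; by default `escape` handles only &, < and >, so the quote characters are supplied explicitly for attribute values.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Escape &, < and > for use in element content."""
    return escape(text)


def escape_attribute(value: str) -> str:
    """Escape all five characters that have predefined XML entities."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})


original = "if x < 0 & y > 0"
document = f"<root>{escape_text(original)}</root>"

# Parsing the escaped document yields the original text again.
assert ET.fromstring(document).text == original
print(document)
```

The ampersand is replaced first, so the entity references produced for the other characters are not escaped a second time.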

Extended Character Entities from Standards

In XML, extended character entities beyond the core five predefined ones are derived from standardized sets originally defined in ISO 8879:1986 (SGML) and its associated technical report ISO/IEC TR 9573-13:1991, which specify entity sets for accented Latin characters, symbols, and other non-ASCII glyphs across various languages and domains. These sets include ISOlat1 (Added Latin 1), ISOlat2 (Added Latin 2), ISOnum (numeric and special graphic characters), and ISOtech (general technical symbols), among others, providing named references for characters like accented vowels and mathematical operators without relying solely on numeric codes. The World Wide Web Consortium (W3C) has maintained and updated these sets to align with Unicode, ensuring compatibility in XML documents while preserving the original SGML heritage.

These entities are incorporated into XML documents through external declarations in the document type definition (DTD), using formal public identifiers (FPIs) to load the sets from standardized locations. For instance, a DTD might include <!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN"> %ISOlat1; (in XML, a system identifier must also follow the public identifier), which loads the definitions from an external file, allowing parsers to resolve names like &Aacute; to the character Á (U+00C1). Similarly, &ograve; maps to ò (U+00F2) from ISOlat1, while symbols like &divide; resolve to ÷ (U+00F7) from ISOnum. Later additions, such as &euro; for € (U+20AC), were incorporated into extended W3C-maintained sets like the "combined" or "HTMLspecial" subsets, reflecting post-ISO updates for modern currencies and symbols not present in the original 1986 standards. Common ISO-derived sets collectively provide over 250 entities when combined (e.g., ISOlat1 contributes approximately 82, ISOlat2 adds 118 for Eastern European scripts), covering diacritics, punctuation, and technical symbols essential for multilingual and technical XML content.

A key limitation of these extended entities is their dependence on DTD processing; XML parsers are not required to load external DTDs, and many do not by default for security and performance reasons, potentially leaving unresolved references in standalone documents. This contrasts with the always-available core entities like &amp;, emphasizing the need for explicit DTD inclusion or fallback mechanisms. For enhanced portability across diverse parsers and environments, numeric character references (e.g., &#x00C1; for Á) are recommended over named entities, as they do not rely on external declarations and directly leverage XML's foundation in Unicode (ISO/IEC 10646). The fifth edition of XML 1.0 (2008) reinforced this shift by aligning character handling with Unicode 5.0, while maintaining compatibility with SGML-derived entities. This evolution underscores XML's transition from SGML subsets toward a self-contained, Unicode-centric model, where extended named entities serve primarily as a bridge for legacy or specialized applications.
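
The dependence on declarations can be illustrated with Python's `xml.etree.ElementTree`, which does not fetch external DTDs but does honor entities declared in the internal DTD subset. The sketch below is an assumption-laden example: the three sample documents and the `root_text` helper are invented for illustration, and the entity is declared inline rather than loaded from an ISO entity file.

```python
import xml.etree.ElementTree as ET

# The name is declared in the internal subset, so the parser can resolve it.
DECLARED = """<!DOCTYPE root [
  <!ENTITY Aacute "&#xC1;">
]>
<root>&Aacute;vila</root>"""

# The same name without any declaration available to the parser.
UNDECLARED = "<root>&Aacute;vila</root>"

# A numeric character reference needs no declaration at all.
NUMERIC = "<root>&#xC1;vila</root>"


def root_text(document: str) -> str:
    """Return the root element's text, or 'error' if parsing fails."""
    try:
        return ET.fromstring(document).text or ""
    except ET.ParseError:
        return "error"


for label, doc in (("declared", DECLARED), ("undeclared", UNDECLARED), ("numeric", NUMERIC)):
    print(label, "->", root_text(doc))
```

Only the numeric form is independent of what the parser is able or willing to load, which is the portability argument made above.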

Character Entities in HTML Standards

Entities in HTML 4.01 and XHTML 1.0

HTML 4.01 and XHTML 1.0 define a collection of named character entities that enable the inclusion of reserved characters, symbols, and international glyphs in document content without disrupting markup parsing. These entities inherit from SGML traditions, drawing primarily from ISO 8879 entity sets such as ISOlat1 for Latin-1 characters, ISOdia for diacritics, ISOgrk for Greek letters, and others, augmented by HTML-specific entities like &nbsp; for the no-break space. The HTML 4.01 specification formalizes 252 such entities in its document type definition (DTD), allowing browsers to render them as corresponding characters even if not explicitly declared in all contexts. In XHTML 1.0, as a reformulation of HTML 4.01 in XML, these entities require explicit declaration within the DTD for validation; without the DTD, only the five core XML predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;) are guaranteed, in keeping with strict XML compliance. Browsers have historically tolerated undefined entities rather than rejecting the document.

The entities are organized by Unicode blocks, facilitating their use in categories such as structural markup, symbols, and multilingual text. Structural entities, essential for escaping characters, include &lt; (U+003C, <), &gt; (U+003E, >), &amp; (U+0026, &), and &quot; (U+0022, "), which prevent errors in attribute values and content. Symbols and international characters expand this base, such as &copy; (U+00A9, ©) for the copyright sign and &AElig; (U+00C6, Æ) for the Latin capital ligature AE, supporting text from Western European languages and beyond. The following tables provide representative examples from key categories, illustrating names, code points, and rendered characters. Full definitions reside in the HTML 4.01 DTD subsets like %HTMLlat1; and %HTMLsymbol;.

Latin Extended-A and Basic Latin-1

These entities cover accented letters and common Western symbols from ISO 8859-1.
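| Entity | Unicode Code Point | Rendered Character | Description |
| --- | --- | --- | --- |
| &Aacute; | U+00C1 | Á | Latin capital A with acute |
| &aacute; | U+00E1 | á | Latin small a with acute |
| &Eacute; | U+00C9 | É | Latin capital E with acute |
| &eacute; | U+00E9 | é | Latin small e with acute |
| &Oacute; | U+00D3 | Ó | Latin capital O with acute |
| &oacute; | U+00F3 | ó | Latin small o with acute |
| &Uacute; | U+00DA | Ú | Latin capital U with acute |
| &uacute; | U+00FA | ú | Latin small u with acute |
| &nbsp; | U+00A0 | (no visible glyph) | No-break space |
| &iexcl; | U+00A1 | ¡ | Inverted exclamation mark |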

Greek and Coptic

Derived from the ISOgrk entity set, these support classical and modern Greek script.
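| Entity | Unicode Code Point | Rendered Character | Description |
| --- | --- | --- | --- |
| &Alpha; | U+0391 | Α | Greek capital letter Alpha |
| &alpha; | U+03B1 | α | Greek small letter alpha |
| &Beta; | U+0392 | Β | Greek capital letter Beta |
| &beta; | U+03B2 | β | Greek small letter beta |
| &Gamma; | U+0393 | Γ | Greek capital letter Gamma |
| &gamma; | U+03B3 | γ | Greek small letter gamma |
| &Delta; | U+0394 | Δ | Greek capital letter Delta |
| &delta; | U+03B4 | δ | Greek small letter delta |
| &Omega; | U+03A9 | Ω | Greek capital letter Omega |
| &omega; | U+03C9 | ω | Greek small letter omega |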

Mathematical Operators and Symbols

From ISOamsa, ISOamso, and HTMLsymbol sets, these enable technical and logical notation.
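| Entity | Unicode Code Point | Rendered Character | Description |
| --- | --- | --- | --- |
| &sum; | U+2211 | ∑ | N-ary summation |
| &prod; | U+220F | ∏ | N-ary product |
| &infin; | U+221E | ∞ | Infinity |
| &radic; | U+221A | √ | Square root |
| &plusmn; | U+00B1 | ± | Plus-minus sign |
| &ne; | U+2260 | ≠ | Not equal to |
| &le; | U+2264 | ≤ | Less-than or equal to |
| &ge; | U+2265 | ≥ | Greater-than or equal to |
| &int; | U+222B | ∫ | Integral |
| &there4; | U+2234 | ∴ | Therefore |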

General Punctuation and Letterlike Symbols

Including quotes, dashes, and special marks from ISOlat2 and ISOtech.
| Entity | Unicode Code Point | Rendered Character | Description |
| --- | --- | --- | --- |
| &lsquo; | U+2018 | ‘ | Left single quotation mark |
| &rsquo; | U+2019 | ’ | Right single quotation mark |
| &ldquo; | U+201C | “ | Left double quotation mark |
| &rdquo; | U+201D | ” | Right double quotation mark |
| &ndash; | U+2013 | – | En dash |
| &mdash; | U+2014 | — | Em dash |
| &copy; | U+00A9 | © | Copyright sign |
| &reg; | U+00AE | ® | Registered sign |
| &trade; | U+2122 | ™ | Trade mark sign |
| &deg; | U+00B0 | ° | Degree sign |

Validation of these entities occurs through the HTML 4.01 DTD, which references public identifiers for its entity subsets, ensuring consistent rendering across compliant user agents. In practice, web browsers supporting HTML 4 map these entities to Unicode code points for display, and tolerate non-standard names for robustness.
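
The HTML 4 names in the tables above can be looked up programmatically. Python's `html.entities` module documents `name2codepoint` and `codepoint2name` as mappings for the HTML 4 entity definitions; the sketch below uses them to pick a named reference when one exists and a decimal numeric reference otherwise. The function name `html4_reference` is hypothetical.

```python
from html.entities import codepoint2name, name2codepoint


def html4_reference(ch: str) -> str:
    """Return an HTML 4 named reference for `ch`, or a numeric one as fallback."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name is not None else f"&#{ord(ch)};"


for ch in ("é", "∑", "—", "'"):
    print(repr(ch), "->", html4_reference(ch))

# Names are case-sensitive keys: "Alpha" and "alpha" are different entities.
print(hex(name2codepoint["Alpha"]), hex(name2codepoint["alpha"]))
```

The apostrophe falls back to a numeric reference because `apos` is not part of the HTML 4 set, consistent with the XHTML caveat discussed earlier in this document.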

Updates and Additions in HTML5

HTML5 introduced a substantial expansion of named character references, increasing the total from the approximately 252 entities in HTML 4.01 to 2,231, by incorporating references for a broader range of Unicode characters, including symbols and arrows that were previously unsupported or inconsistently handled. This growth reflects the integration of entity sets from various ISO standards and direct mappings to Unicode code points, enabling richer content representation in web documents without relying solely on numeric references.

Key additions include the formal standardization of &apos; for the apostrophe (U+0027), which was not part of the HTML 4 entity set and therefore lacked guaranteed support across parsers. &zwnj; for the zero-width non-joiner (U+200C) and &zwj; for the zero-width joiner (U+200D), which HTML 4 already defined in its special-characters set, remain available for precise control over text rendering in complex scripts, such as Devanagari, or in emoji sequences. Other notable inclusions encompass new references for arrows (e.g., &Larr; for U+219E, the leftwards two headed arrow) and symbols that support modern typography.

These updates were driven by the need to align with evolving standards, particularly Unicode 8.0 and later (released from 2015 onward), to better accommodate complex script rendering and accessibility features like screen reader compatibility for diverse languages. The Living Standard emphasizes this alignment to ensure consistent results across platforms, reducing discrepancies in global web content display. For backward compatibility with HTML 4.01, all legacy entities remain valid, and HTML5 parsers treat undefined references gracefully by leaving the ampersand and the following text as literal characters, preventing document breakage while encouraging adoption of the expanded set.

The following table lists the entities discussed above, focusing on those relevant to script support and symbols:
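| Entity | Description | Unicode Code Point |
| --- | --- | --- |
| &apos; | Apostrophe | U+0027 |
| &zwnj; | Zero-width non-joiner | U+200C |
| &zwj; | Zero-width joiner | U+200D |
| &Larr; | Leftwards two headed arrow | U+219E |
| &starf; | Black star (emoji-like) | U+2605 |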

Entity Sets and Formal Identifiers

Public Identifiers for HTML DTD Subsets

Formal Public Identifiers (FPIs) serve as standardized, location-independent references to external entity sets within SGML-based Document Type Definitions (DTDs), enabling parsers to locate and load predefined character entities for markup languages like HTML. In the context of HTML, the World Wide Web Consortium (W3C) specified FPIs for distinct subsets of character entities to support internationalization and special symbols without embedding full definitions in every document. For HTML 4 these identifiers follow the format "-//W3C//ENTITIES [Set Name]//EN//HTML", and for XHTML 1.0 the format "-//W3C//ENTITIES [Set Name] for XHTML//EN", where the set name denotes the category of characters covered.

The entity subsets defined via FPIs in HTML 4.01 and XHTML 1.0 are Latin-1 (for Western European languages), Special (markup-significant characters, quotes, and additional punctuation), and Symbols (mathematical, Greek, and typographic symbols); the basic structural entities are part of the Special set. These subsets partition the 252 named character entities supported in HTML 4.01, allowing modular inclusion in DTDs. For instance, the Latin-1 subset covers accented characters and common symbols from ISO/IEC 8859-1, while the Symbols subset addresses Greek letters and operators not available in basic ASCII.

These FPIs are used in parameter entity declarations inside a DTD (or in the internal subset of a DOCTYPE declaration), which reference the external sets for inclusion. An example from the HTML 4.01 Strict DTD is <!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin1//EN//HTML" "http://www.w3.org/TR/html4/HTMLlat1.ent"> %HTMLlat1;, which loads the Latin-1 entities into the document's entity pool for resolution during parsing. This mechanism ensures compatibility across validating tools and browsers that support SGML/XML processing, preventing errors from undefined references. System identifiers, often HTTP URLs to W3C-hosted .ent files, accompany FPIs to provide fallback resolution if catalogs are unavailable.

Key FPIs for HTML DTD subsets, along with descriptions and entity counts, are outlined below:
| Subset Name | FPI | Description | Entity Count | System Identifier Example |
| --- | --- | --- | --- | --- |
| Special | "-//W3C//ENTITIES Special//EN//HTML" | Markup-significant characters, spacing and bidirectional controls, dashes, and typographic quotes | 32 | http://www.w3.org/TR/html4/HTMLspecial.ent |
| Latin1 | "-//W3C//ENTITIES Latin1//EN//HTML" | Entities for ISO 8859-1 characters 160–255, supporting Western European diacritics and fractions | 96 | http://www.w3.org/TR/html4/HTMLlat1.ent |
| Symbols | "-//W3C//ENTITIES Symbols//EN//HTML" | Mathematical, Greek, and typographic symbols for technical content | 124 | http://www.w3.org/TR/html4/HTMLsymbol.ent |
| Basic (subset of Special) | Included in Special FPI | Markup-significant entities shared with XML: &amp;, &lt;, &gt;, &quot; (plus &apos; in the XHTML set) | 4 (5 in XHTML) | N/A (internal to Special) |
| Latin1 for XHTML | "-//W3C//ENTITIES Latin 1 for XHTML//EN" | XHTML variant of Latin-1, aligned with XML strictness | 96 | http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent |
| Symbols for XHTML | "-//W3C//ENTITIES Symbols for XHTML//EN" | XHTML variant for symbols, ensuring XML 1.0 compliance | 124 | http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent |
| Special for XHTML | "-//W3C//ENTITIES Special for XHTML//EN" | XHTML special entities, including &apos; | 33 | http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent |
These FPIs originate from the HTML 4.01 specification and were adapted for XHTML modularization. In contemporary practice, these FPIs retain relevance primarily for XHTML authoring tools, validators, and legacy SGML systems, as they facilitate precise entity resolution in standards-compliant environments. Modern HTML, however, de-emphasizes DTD-based validation in favor of schema-agnostic parsing, reducing reliance on external entity loading while still recognizing the named entities for backward compatibility.
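As a minimal sketch of how the fields of such an identifier relate, the following Python function splits an FPI on its "//" delimiters. The `FPI` tuple and the `parse_fpi` function are hypothetical names introduced only for illustration, the field names follow common SGML terminology, and a real catalog resolver would perform normalization and validation that is not shown here.

```python
from typing import NamedTuple, Optional


class FPI(NamedTuple):
    """Parts of a Formal Public Identifier (hypothetical helper type)."""
    registration: str               # "-" unregistered, "+" registered, "" if no prefix
    owner: str                      # e.g. "W3C"
    text_class: str                 # e.g. "ENTITIES"
    description: str                # e.g. "Latin1"
    language: str                   # e.g. "EN"
    display_version: Optional[str]  # e.g. "HTML", or None when absent


def parse_fpi(fpi: str) -> FPI:
    """Split an FPI on "//" into its fields (illustrative, not a validator)."""
    parts = fpi.split("//")
    if parts[0] in ("-", "+"):
        registration, owner, rest = parts[0], parts[1], parts[2:]
    else:
        # No registration prefix: the first field is the owner itself.
        registration, owner, rest = "", parts[0], parts[1:]
    # The text class is the first word; the remainder is the description.
    text_class, _, description = rest[0].partition(" ")
    language = rest[1] if len(rest) > 1 else ""
    display_version = rest[2] if len(rest) > 2 and rest[2] else None
    return FPI(registration, owner, text_class, description, language, display_version)


html_lat1 = parse_fpi("-//W3C//ENTITIES Latin1//EN//HTML")
xhtml_lat1 = parse_fpi("-//W3C//ENTITIES Latin 1 for XHTML//EN")
print(html_lat1.description, html_lat1.display_version)    # Latin1 HTML
print(xhtml_lat1.description, xhtml_lat1.display_version)  # Latin 1 for XHTML None
```

The trailing "HTML" field of the HTML 4.01 identifiers is what the sketch reports as the display version; the XHTML identifiers omit it and carry the distinction in the description instead.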

Public Identifiers for ISO Entity Subsets

The public identifiers for ISO entity subsets originate from the Standard Generalized Markup Language (SGML) defined in ISO 8879:1986, specifically its non-normative Annex D, which standardizes 19 character entity sets to facilitate the representation of graphic characters beyond basic character sets in markup documents. These sets address needs in technical and reference publishing, including extensions for Latin scripts, Greek letters, Cyrillic alphabets, mathematical symbols, and diacritical marks, with examples such as ISOamso for ordinary mathematical symbols and ISOgrk1 for Greek letters. Developed in the 1980s and refined through ISO/IEC TR 9573-13:1991, these identifiers follow the Formal Public Identifier (FPI) syntax outlined in ISO 8879, structured as "ISO 8879:1986//ENTITIES [Set Description]//EN". Because ISO itself is the owner, they carry no "-//" or "+//" registration prefix. Key FPIs include "ISO 8879:1986//ENTITIES Added Latin 1//EN" for additional Latin characters used in Western European languages, "ISO 8879:1986//ENTITIES Greek Letters//EN" for Greek alphabetic characters, and "ISO 8879:1986//ENTITIES Russian Cyrillic//EN" for Russian Cyrillic. These 19 sets collectively define over 1,000 entities, enabling the inclusion of specialized characters such as technical symbols (ISOtech) and box-drawing elements (ISObox) without relying on local encodings. In XML and HTML contexts, these FPIs are declared in Document Type Definitions (DTDs) to load external entity files, allowing parsers to resolve named references to Unicode characters. In SGML a DTD might include <!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN"> %ISOlat1;, whereas XML additionally requires a system identifier (a file name or URL) after the public identifier. XML DTDs commonly reuse these sets to extend the five predefined entities, supporting backward compatibility with SGML-based systems.
Although the FPIs have remained unchanged since their establishment in the 1980s, the entity sets have been mapped to Unicode characters, with adaptations like the addition of "//XML" suffixes in some implementations to denote XML-compatible declarations. Post-2000, as Unicode became the dominant encoding standard for XML and HTML, these ISO subsets were increasingly set aside in favor of direct Unicode characters or numeric character references (e.g., &#225; for á), reducing reliance on external DTDs and named entities for broader interoperability. The W3C continues limited maintenance of these legacy sets through documents like the XML Entity Definitions for Characters, primarily for mathematical and historical applications, but recommends numeric references for new content.
| Entity Set | Description | Example FPI |
| --- | --- | --- |
| ISOlat1 | Added Latin 1 characters | "ISO 8879:1986//ENTITIES Added Latin 1//EN" |
| ISOlat2 | Added Latin 2 characters | "ISO 8879:1986//ENTITIES Added Latin 2//EN" |
| ISOgrk1 | Greek letters | "ISO 8879:1986//ENTITIES Greek Letters//EN" |
| ISOcyr1 | Russian Cyrillic | "ISO 8879:1986//ENTITIES Russian Cyrillic//EN" |
| ISOamso | Mathematical symbols (ordinary) | "ISO 8879:1986//ENTITIES Added Math Symbols: Ordinary//EN" |
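A DTD that pulls in one of these sets ultimately hands the parser a list of entity declarations. The sketch below imitates that with an internal subset instead of an external file. It assumes Python's `html.entities.name2codepoint` table as a stand-in for the contents of an entity file (names such as eacute and agrave appear in both ISOlat1 and the HTML Latin-1 set), and the two helper functions are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def entity_declarations(names):
    """Return one <!ENTITY ...> declaration per name, using numeric references.

    Not suitable for markup-significant names such as amp or lt, which
    XML already predefines and which need extra escaping when redeclared.
    """
    return "\n".join(
        f'<!ENTITY {name} "&#{name2codepoint[name]};">' for name in names
    )


def parse_with_entities(root_tag, body, names):
    """Parse an XML fragment after declaring the given entities in an internal subset."""
    subset = entity_declarations(names)
    doc = f"<!DOCTYPE {root_tag} [\n{subset}\n]>\n<{root_tag}>{body}</{root_tag}>"
    return ET.fromstring(doc)


print(entity_declarations(["aacute"]))  # <!ENTITY aacute "&#225;">
elem = parse_with_entities("p", "caf&eacute; &agrave; la carte", ["eacute", "agrave"])
print(elem.text)                        # café à la carte
```

Declaring only the entities a document actually uses keeps the example self-contained; loading a full external set through its public identifier would additionally require a resolver or catalog, which this sketch does not attempt.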

Key Differences and Compatibility Considerations

Structural and Semantic Differences

In XML, character entity resolution is strictly governed by the XML 1.0 specification, where only the five predefined entities (&lt;, &gt;, &amp;, &apos;, and &quot;) are universally recognized without prior declaration, and any additional named entities must be explicitly declared in the Document Type Definition (DTD) or internal subset before use. This ensures precise control over entity expansion during parsing, treating undeclared references as fatal errors that halt processing to maintain document well-formedness. In contrast, HTML employs a built-in table of more than 2,000 named character references defined in the HTML Living Standard, resolved via a state machine in the parser without requiring declarations. Numeric references are accepted for any Unicode code point, with error-recovery rules for invalid values (for example, null and surrogate values are replaced by U+FFFD). For undefined named entities in HTML, the parser is forgiving and emits the reference as literal text (e.g., treating &foo; as the string "&foo;" in the output stream) rather than failing, which promotes robustness in legacy or malformed content.
Validation processes further highlight these structural variances. XML parsers enforce rigorous checking, issuing errors for undefined entities to uphold semantic integrity and prevent ambiguous interpretations, so that entity references consistently map to intended characters without approximation. HTML parsing is more lenient: parsers continue despite undefined entities and render them as plain text to avoid breaking page display, although conformance checkers such as the W3C validator still report them.
Semantically, both formats use entities to preserve literal meanings, such as employing &lt; to represent the less-than sign without triggering tag interpretation, but HTML extends this flexibility to contexts like URLs and attributes, where entities can be embedded more permissively (e.g., in unquoted attribute values), potentially leading to broader interoperability in web environments while risking minor distortions in edge cases. Encoding interplay also underscores differences in structural handling: XML mandates support for UTF-8 and UTF-16 and requires an explicit encoding declaration in the prolog for documents in any other encoding (unless the encoding is supplied externally), with processors rejecting unsupported encodings as fatal errors. HTML, while fully supporting UTF-8 as the preferred encoding, relies on an inference mechanism that checks for byte-order marks (BOMs), HTTP headers, or meta elements within the first 1024 bytes to detect the charset, allowing fallback to defaults like windows-1252 without mandatory declarations, which facilitates easier authoring but can introduce variability in non-UTF-8 scenarios.
XHTML addresses these gaps by reformulating HTML as an XML application, enforcing XML's strict entity resolution and validation rules, including declaration requirements, on HTML-like documents, thus bridging interoperability by enabling seamless parsing in both XML and HTML user agents when served with appropriate media types like application/xhtml+xml.
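The contrast in entity handling can be observed with Python's standard library, offered here as a rough illustration rather than a conformance test. `xml.etree.ElementTree` behaves as a non-validating XML parser, and `html.unescape` applies the HTML5 rules for character references without being a full HTML parser. The `xml_accepts` helper is a hypothetical name introduced for this sketch.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment parses as XML content with no DTD present."""
    try:
        ET.fromstring(f"<p>{fragment}</p>")
        return True
    except ET.ParseError:
        return False


# XML: predefined entities and numeric references work without declarations.
print(xml_accepts("&lt;"))      # True
print(xml_accepts("&#160;"))    # True
# XML: an HTML-only name is undeclared, so parsing fails.
print(xml_accepts("&nbsp;"))    # False

# HTML rules: known names resolve from the built-in table.
print(html.unescape("&nbsp;") == "\u00a0")  # True
# HTML rules: an unknown name is passed through as literal text.
print(html.unescape("&foo;"))   # &foo;
```

The numeric form &#160; is accepted on both sides, which is why it is the usual choice for content that must survive both kinds of parser.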

Handling of Deprecated or Non-Standard Entities

In the transition from HTML 4.01 to HTML5, the handling of named character entities underwent significant standardization, with HTML5 defining a fixed set of 2,231 named character references derived from XML entity sets and Unicode mappings, excluding those for deprecated or obsolete Unicode codepoints such as certain control characters. For example, while the soft hyphen (U+00AD) retains its standard named reference &shy;, non-standard or legacy variants like &softHyphen;, occasionally used in older authoring tools or extensions, are not recognized in HTML5 and are treated as unknown. This consolidation prioritizes compatibility with modern Unicode while deprecating reliance on fragmented ISO entity subsets from prior standards, where entities like those in ISOnum could map to now-discouraged characters. XML takes a more rigid approach to deprecated or non-standard entities, recognizing only the five core predefined entities (&lt;, &gt;, &amp;, &apos;, &quot;) without requiring declaration; any other named reference, including those from HTML-specific sets or browser extensions, triggers a fatal error if undeclared in the document's DTD or internal subset. This strict validation ensures document well-formedness but rejects legacy or non-standard names outright, preventing ambiguous parsing. For instance, an undeclared reference like &lrm; (left-to-right mark, U+200E), although a standard name in HTML since version 4.01, would cause parsing to fail in XML unless explicitly defined. Non-standard entities, often introduced by browser vendors for enhancements, further complicate compatibility; early implementations, for example, extended support to additional symbols like certain mathematical operators beyond ISO definitions, but these were never formalized and vary across user agents.
In HTML5, such entities trigger an "unknown-named-character-reference" parse error during tokenization, with the parser reverting to the ambiguous ampersand state and rendering the reference literally (e.g., &nonstandard; appears as text "&nonstandard;") rather than substituting a character. Conformance checkers flag this as non-conforming, but browsers tolerate it for backward compatibility, unlike XML's rejection. To mitigate risks, numeric character references (e.g., &#8206; for left-to-right mark) are recommended, as they bypass named entity resolution entirely and work consistently in both HTML and XML parsers. Migrating content with deprecated or non-standard entities involves replacing them with numeric equivalents or direct Unicode characters to ensure portability; for the soft hyphen, substituting &#173; avoids reliance on potentially varying named forms. Tools such as HTML Tidy facilitate this by scanning documents, warning on unrecognized entities, and converting them to numeric references or validating against the HTML5 set. The current HTML5 specification aligns with Unicode 16.0 (released in 2024), incorporating updates for newly stable characters while excluding mappings to deprecated codepoints, thereby promoting long-term robustness over legacy quirks.
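The migration step described above can be sketched in a few lines of Python. This is an illustrative helper, not the behavior of HTML Tidy or any other tool: it assumes Python's `html.entities.html5` table as the list of recognized names, leaves the five XML predefined entities alone, and leaves unknown names untouched so that they can be reviewed by hand. The function name `to_numeric_refs` is hypothetical.

```python
import re
from html.entities import html5

# Names that every XML parser understands without a declaration.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Matches semicolon-terminated named references such as &shy; or &lrm;.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(text: str) -> str:
    """Rewrite known HTML5 named references as hexadecimal numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)       # safe in both HTML and XML
        expansion = html5.get(name + ";")
        if expansion is None:
            return match.group(0)       # unknown name: leave for manual review
        # Some names expand to two code points, so emit one reference per character.
        return "".join(f"&#x{ord(ch):X};" for ch in expansion)

    return _NAMED_REF.sub(replace, text)


print(to_numeric_refs("co&shy;operate &lrm; &amp; &nonstandard;"))
# co&#xAD;operate &#x200E; &amp; &nonstandard;
```

Leaving unknown names in place is a deliberate choice in this sketch, since silently dropping or guessing at them would hide exactly the references that need attention.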
