Document type declaration
A document type declaration, or DOCTYPE, is an instruction that associates a particular XML or SGML document (for example, a web page) with a document type definition (DTD) (for example, the formal definition of a particular version of HTML 2.0 - 4.0).[1] In the serialized form of the document, it manifests as a short string of markup that conforms to a particular syntax.
The HTML layout engines in modern web browsers perform DOCTYPE "sniffing" or "switching", wherein the DOCTYPE in a document served as text/html determines a layout mode, such as "quirks mode" or "standards mode". The text/html serialization of HTML5, which is not SGML-based, uses the DOCTYPE only for mode selection. Since web browsers are implemented with special-purpose HTML parsers, rather than general-purpose DTD-based parsers, they do not use DTDs and never access them even if a URL is provided. The DOCTYPE is retained in HTML5 as a "mostly useless, but required" header only to trigger "standards mode" in common browsers.[2]
Syntax
The general syntax for a document type declaration is:
<!DOCTYPE root-element PUBLIC "/quotedFPI/" "/quotedURI/" [
<!-- internal subset declarations -->
]>
or
<!DOCTYPE root-element SYSTEM "/quotedURI/" [
<!-- internal subset declarations -->
]>
Document type name
The opening <!DOCTYPE syntax is followed by separating syntax[3]: 403–404 (such as spaces,[3]: 297–298, 372 or (except in XML) comments opened and closed by a doubled ASCII hyphen),[3]: 372, 391 followed by a document type name[3]: 403–404 (i.e. the name of the root element that the DTD applies to trees descending from). In XML, the root element that represents the document is the first element in the document. For example, in XHTML, the root element is <html>, being the first element opened (after the doctype declaration) and last closed.
Since the syntax for the external identifier and internal subset are both optional,[3]: 403–404 the document type name is the only information which it is mandatory to give in a DOCTYPE declaration.
External identifier
The DOCTYPE declaration can optionally contain an external identifier, following the root element name (and separating syntax such as spaces), but before any internal subset.[3]: 403–404 This begins with either the keyword SYSTEM or the keyword PUBLIC,[3]: 379 specifying whether the DTD is specified using a public identifier identifying it as a public text, i.e. one shared between multiple computer systems (regardless of whether it is an available public text available to the general public, or an unavailable public text shared only within an organisation).[3]: 180–182 If the PUBLIC keyword is used, it is followed by the public identifier enclosed in double or single ASCII quotation marks. The public identifier does not point to a storage location, but is rather a unique fixed string intended to be looked up in a table (such as an SGML catalog);[3]: 180 however, in some (but not all) SGML profiles, the public identifier must be constructed using a particular syntax called Formal Public Identifier (FPI), which specifies the owner as well as whether it is available to the general public.[3]: 182–183
The public identifier (if present) or SYSTEM keyword (otherwise) may (and, in XML, must)[4] be followed by a "system identifier" that is likewise enclosed in quotation marks. Although the interpretation of system identifiers in general SGML is entirely system-dependent (and might be a filename, database key, offset, or something else),[3]: 378 XML requires that they be URIs.[5] For example, the FPI for XHTML 1.1 is "-//W3C//DTD XHTML 1.1//EN", and there are three possible system identifiers available for XHTML 1.1, depending on the needs. One of them is the URL reference "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd". This means that the XML parser must locate the DTD in a system-specific fashion, in this case by means of a URL reference to the DTD, enclosed in double quote marks.
In XHTML documents, the doctype declaration must always explicitly specify a system identifier. In SGML-based documents like HTML, on the other hand, the appropriate system identifier may automatically be inferred from the given public identifier. This association might e.g. be performed by means of a catalog file resolving the FPI to a system identifier.[6] The SYSTEM keyword can (except in XML) also be used without a system identifier following, indicating that a DTD exists but should be inferred from the document type name.[3]: 378
Internal subset
The last, optional, part of a DOCTYPE declaration is surrounded by literal square brackets ([]), and called an internal subset. It can be used to add or override entities and other markup declarations.[7] It is possible, but uncommon, to include the entire DTD in-line in the document, within the internal subset, rather than referencing it from an external file.[3]: 402 Conversely, the internal subset is sometimes forbidden within simple SGML profiles, notably those for basic HTML parsers that don't implement a full SGML parser.
If both an internal DTD subset and an external identifier are included in a DOCTYPE declaration, the internal subset is processed first, and the external DTD subset is treated as if it were transcluded at the end of the internal subset. Since earlier definitions take precedence over later definitions in a DTD, this allows the internal subset to override definitions in the external subset.[3]: 402–403
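The rule that earlier definitions take precedence can be observed with a short Python sketch. It assumes the standard-library xml.etree.ElementTree module (backed by the Expat parser), which reads the internal subset but does not fetch external subsets, so only precedence within an internal subset is shown here; the document and the entity name "greeting" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "greeting" is declared twice in the
# internal subset. The first declaration is the binding one.
DOC = """<!DOCTYPE note [
  <!ENTITY greeting "first definition">
  <!ENTITY greeting "second definition">
]>
<note>&greeting;</note>"""


def expand(doc: str) -> str:
    """Parse the document and return the root element's expanded text."""
    return ET.fromstring(doc).text


if __name__ == "__main__":
    print(expand(DOC))
```

Because an external subset is treated as if it followed the internal subset, the same rule is what lets an internal declaration win over an external one.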
Example
The first line of a World Wide Web page may read as follows:
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="ar" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
This document type declaration for XHTML includes by reference a DTD, whose public identifier is -//W3C//DTD XHTML 1.0 Transitional//EN and whose system identifier is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd. An entity resolver may use either identifier for locating the referenced external entity. No internal subset has been indicated in this example or the next ones. The root element is declared to be html and, therefore, it is the first tag to be opened after the end of the doctype declaration in this example and the next ones, too. The HTML tag is not part of the doctype declaration but has been included in the examples for orientation purposes.
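As an illustration, the three parts of such a declaration can be read back programmatically. The sketch below assumes Python's standard-library xml.dom.minidom, which exposes them on the document's doctype node without retrieving the DTD; the sample markup is a shortened, hypothetical XHTML page.

```python
from xml.dom import minidom

# Hypothetical, minimal XHTML page carrying the declaration discussed above.
XHTML = """<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>"""


def doctype_parts(markup: str) -> tuple:
    """Return (document type name, public identifier, system identifier)."""
    doctype = minidom.parseString(markup).doctype
    return (doctype.name, doctype.publicId, doctype.systemId)


if __name__ == "__main__":
    for part in doctype_parts(XHTML):
        print(part)
```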
Common DTDs
Some common DTDs have been put into lists. W3C has produced a list of DTDs commonly used in the web, which contains the "bare" HTML5 DTD, older XHTML/HTML DTDs, DTDs of common embedded XML-based formats like MathML and SVG as well as "compound" documents that combine those formats.[8] Both W3C HTML5 and its corresponding WHATWG version recommend browsers to only accept XHTML DTDs of certain FPIs and to prefer using internal logic over fetching external DTD files. It further specifies an "internal DTD" for XHTML which is merely a list of HTML entity names.[9]: §13.2
HTML 4.01 DTDs
Strict DTD does not allow presentational markup, on the argument that Cascading Style Sheets should be used for that instead. The declaration referencing the Strict DTD looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
Transitional DTD allows some older elements and attributes that have been deprecated:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
If frames are used, the Frameset DTD must be used instead, like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<html>
XHTML 1.0 DTDs
XHTML's DTDs are also Strict, Transitional and Frameset.
XHTML Strict DTD. No deprecated tags are supported and the code must be written correctly according to XML Specification.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
XHTML Transitional DTD is like the XHTML Strict DTD, but deprecated tags are allowed.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
XHTML Frameset DTD is the only XHTML DTD that supports Frameset. Its document type declaration is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
XHTML 1.1 DTD
XHTML 1.1 is the most current finalized revision of XHTML, introducing support for XHTML Modularization. XHTML 1.1 has the stringency of XHTML 1.0 Strict.
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
XHTML Basic DTDs
XHTML Basic 1.0
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML Basic 1.0//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
XHTML Basic 1.1
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML Basic 1.1//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">
HTML5 DTD-less DOCTYPE
HTML5 uses a DOCTYPE declaration which is very short, due to its lack of references to a DTD in the form of a URL or FPI. All it contains is the tag name of the root element of the document, HTML.[10] In the words of the specification draft itself:
<!DOCTYPE html>, case-insensitively.
With the exception of the lack of a URI or the FPI string (the FPI string is treated case sensitively by validators), this format (a case-insensitive match of the string !DOCTYPE HTML) is the same as found in the syntax of the SGML based HTML 4.01 DOCTYPE. Both in HTML4 and in HTML5, the formal syntax is defined in upper case letters, even though lower case and mixtures of lower and upper case are also treated as valid.
In XHTML5 the DOCTYPE must be a case-sensitive match of the string "<!DOCTYPE html>". This is because in XHTML syntax all HTML element names are required to be in lower case, including the root element referenced inside the HTML5 DOCTYPE.
The DOCTYPE is optional in XHTML5 and may simply be omitted.[11] However, if the markup is to be processed as both XML and HTML, a DOCTYPE should be used.[12]
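The case-insensitive matching used in the text/html syntax can be sketched in Python. This is a minimal illustration, assuming the standard-library html.parser module, whose handle_decl callback receives the text between "<!" and ">"; the helper name is_short_html5_doctype is hypothetical, and the check deliberately does not model the case-sensitive XHTML5 rule.

```python
from html.parser import HTMLParser


class _DeclarationCollector(HTMLParser):
    """Collects the contents of <!...> declarations seen in the markup."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_decl(self, decl):
        self.declarations.append(decl)


def is_short_html5_doctype(markup: str) -> bool:
    """True if the first declaration is a case-insensitive 'DOCTYPE html'."""
    collector = _DeclarationCollector()
    collector.feed(markup)
    collector.close()
    if not collector.declarations:
        return False
    # Collapse runs of whitespace, then compare without regard to case.
    first = " ".join(collector.declarations[0].split())
    return first.lower() == "doctype html"


if __name__ == "__main__":
    print(is_short_html5_doctype("<!doctype HTML><p>hello</p>"))
```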
See also
- Document type definition contains an example
- RDFa
- XML schema
References
- HTML2 HTML3 HTML4
- "The HTML syntax ― HTML5". Retrieved 2011-06-05.
- Goldfarb, Charles F. (1990). The SGML Handbook. Oxford: Clarendon Press. ISBN 0-19-853737-9.
- Walsh, Norman (2001-08-06). "XML Catalogs". The Organization for the Advancement of Structured Information Standards (OASIS).
- Clark, James (1997-12-15). "Comparison of SGML and XML". W3C. NOTE-sgml-xml-971215.
- "The DOCTYPE Declaration". Archived from the original on 2011-08-14. Retrieved 2011-09-09.
- "DOCTYPE Declaration". msdn.microsoft.com.
- "W3C QA - Recommended list of Doctype declarations you can use in your Web document". www.w3.org. Retrieved 22 March 2019.
- "HTML Standard". html.spec.whatwg.org. Retrieved 22 March 2019.
- "The HTML syntax ― HTML5". Web Hypertext Application Technology Working Group. Retrieved 2011-06-05. Quoted: "3. A string that is an ASCII case-insensitive match for the string "DOCTYPE". 5. A string that is an ASCII case-insensitive match for the string "HTML"."
- "The XHTML syntax ― HTML5". Web Hypertext Application Technology Working Group. Archived from the original on 2012-06-18. Retrieved 2009-09-01.
- "Polyglot Markup: HTML-Compatible XHTML Documents". World Wide Web Consortium. Retrieved 2012-01-17.
External links
- HTML Doctype overview
- Recommended DTDs to use in your Web document - an informative (not normative) W3C Quality Assurance publication
- DOCTYPE grid - another overview table [Last modified 27 November 2006]
- Quirks mode and transitional mode
- Box model tweaking
Document type declaration
The declaration takes the general form <!DOCTYPE root-element [public-identifier | system-identifier] [internal-subset]>, where the root-element names the document's top-level element (e.g., html for HTML or the root tag in XML), the public or system identifier references an external DTD (the system identifier typically being a URI), and an optional internal subset provides inline declarations.[2] For HTML 4.01, examples include <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> for the Strict variant, which enforces a rigorous structure without deprecated elements.[3] In XML 1.0, the declaration may contain or point to markup declarations forming the DTD's internal and external subsets, ensuring the document complies with specified constraints for validity.[2]
While DOCTYPE declarations remain essential for legacy HTML (up to HTML 4.01 and XHTML 1.0) and XML validation, modern HTML5 simplifies this by using a short <!DOCTYPE html> form without a formal DTD, relying instead on the living standard for conformance checking—though the declaration still triggers standards mode in browsers.[3] This evolution reflects a shift from rigid SGML-derived validation toward more flexible, error-tolerant parsing in web technologies.[1]
Fundamentals
Definition and Purpose
A document type declaration, often abbreviated as DOCTYPE, is a special instruction in SGML-based markup languages that associates a document instance with a document type definition (DTD), which serves as a formal schema specifying the permissible structure of the document. In SGML (ISO 8879:1986), the declaration identifies the document type and may include or reference a DTD that defines the legal building blocks, such as elements, their hierarchical relationships, attributes, and entities, thereby establishing rules for syntactic and structural validity.[4][5] This mechanism originated in SGML to enable generalized markup independent of specific formatting or presentation concerns.[2]

The primary purpose of the document type declaration is to enable validation of the document against the associated DTD, ensuring conformance to predefined constraints on element types, attribute values, and entity usage, which in turn supports accurate parsing by processors.[6] By declaring the document's type, it informs processing tools, such as validators or parsers, of the expected grammar, allowing them to interpret and handle the markup correctly without ambiguity.[7] In XML, a subset of SGML, the declaration explicitly provides this grammar either internally within the document or via an external reference, facilitating type-validity checks.[8]

Key benefits include promoting interoperability among diverse systems by enforcing a standardized structure, enabling early error detection during document creation or processing to prevent malformed outputs, and ensuring consistent rendering or interpretation in applications like browsers and data exchangers.[9] Unlike the document's content, which conveys semantic information, the document type declaration and its DTD focus exclusively on structural rules, separating form from meaning to enhance reusability and maintainability.[5]
Historical Development
The document type declaration originated within the Standard Generalized Markup Language (SGML), formalized as International Standard ISO 8879 in 1986 by the International Organization for Standardization (ISO). SGML introduced the document type declaration as a key component of its formal public identifier system, enabling the specification of document type definitions (DTDs) to define the structure, elements, and rules for markup in machine-readable documents.[5] This mechanism separated content from presentation, promoting portability and longevity of textual data across systems, and laid the groundwork for subsequent markup languages by allowing users to declare the governing DTD at the start of a document.

The adoption of the document type declaration in HTML began in the early 1990s, driven by Tim Berners-Lee's development of HTML as an SGML application at CERN to facilitate hypertext document sharing on the emerging World Wide Web.[10] By 1995, the Internet Engineering Task Force (IETF) published HTML 2.0 as the first official specification (RFC 1866), which incorporated SGML-style DOCTYPE declarations referencing specific DTDs to ensure consistent parsing and rendering of web documents. This evolved through subsequent versions, culminating in the World Wide Web Consortium's (W3C) HTML 4.01 recommendation in 1999, which retained and refined DOCTYPE declarations tied to modular DTDs for strict, transitional, and frameset variants, emphasizing SGML's influence on HTML's structural validation.[11]

In 1998, the W3C integrated the document type declaration into the Extensible Markup Language (XML) 1.0 specification, positioning it as an optional mechanism for associating DTDs with documents to support validation while allowing non-validating processors for simpler conformance.[12] Early XML recommendations highlighted DTDs for enforcing schema rules, inheriting SGML's legacy to promote interoperability in data exchange. The 2008 fifth edition of XML 1.0 preserved full DTD support despite the emergence of alternatives like XML Schema, ensuring backward compatibility for legacy systems.

From the early 2000s onward, usage of document type declarations for validation declined in XML contexts with the standardization of more expressive alternatives, including the W3C's XML Schema in 2001, which offered richer data typing and namespace support beyond DTD limitations, and the OASIS-approved RELAX NG in 2001, which provided a modular, grammar-based approach to schema definition. However, DOCTYPE declarations persisted in HTML for legacy purposes, primarily to trigger standards-compliant rendering modes in browsers and avoid quirks mode, a backward-compatibility mechanism emulating pre-standard behaviors.
Syntax
Core Components
The document type declaration in SGML begins with the keyword <!DOCTYPE, followed by the document type name, an optional external identifier, an optional internal subset enclosed in square brackets [ ], and terminates with a greater-than sign >.[13] The declaration belongs to the prolog that precedes the document instance, so that parsers can validate the markup against the declared rules before processing content.[13]
The document type name serves as a generic identifier that specifies the root element or base structure for the document, uniquely defining the applicable set of markup declarations within the prolog.[13] An external identifier, if present, references an external subset of the document type definition, typically using a SYSTEM or PUBLIC keyword followed by a URI or formal public identifier to link to a separate DTD file for shared rules.[13] The internal subset, contained within square brackets, allows for local declarations such as entity definitions or element types that override or supplement the external subset, providing flexibility for document-specific customizations.[13]
Parameter entities enhance modularity within the subsets by enabling reusable definitions. A parameter entity is declared with a percent sign, as in <!ENTITY % hlto4 "h1 | h2 | h3 | h4"> for heading elements, and is referenced with the %entityname; syntax (here %hlto4;); it must be declared within the same document type declaration.[13] In XML, an adaptation of SGML, the declaration follows a similar form: <!DOCTYPE root-element (external-ID)? ('[' markup-declarations ']')? >, where the root-element name must match the document's actual root element for validity.[6]
The declaration must appear at the very beginning of the document, immediately after any SGML declaration and before the root element or any other content, to establish the parsing context.[13] In SGML, keywords like DOCTYPE are case-insensitive, while the case handling of names is set by the NAMECASE parameter in the SGML declaration; in the reference concrete syntax, general names such as element names are case-insensitive, whereas entity names are case-sensitive. In contrast, XML enforces case-sensitivity for all names, and the DOCTYPE keyword itself must be written in upper case.[13][14]
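Invalid declarations are handled differently by SGML and XML processors. SGML parsers report violations according to their error-handling rules; a duplicate entity declaration is not an error, since the first declaration is binding and later ones are ignored. In XML, a syntactically malformed declaration or subset is a well-formedness error that stops normal processing, whereas a mismatch between the declared name and the actual root element is a validity error that only validating processors are required to report; non-validating processors check well-formedness only.[13][15]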
Document Type Name
In a document type declaration, the document type name is the identifier that immediately follows the <!DOCTYPE keyword and specifies the expected name of the document's root element, such as html in standard HTML documents. This name serves to declare the root element type, enabling parsers to validate the document structure against the corresponding DTD rules.[16][17]
The naming rules for the document type name derive from SGML conventions, where it must form a valid name consisting of an initial letter followed by zero or more letters, digits, periods, hyphens, or underscores, with a maximum length of 72 characters in many applications. In XML, which subsets SGML, the name adheres to the stricter production [5] Name ::= NameStartChar (NameChar)*, where NameStartChar includes letters (A-Z, a-z), underscore (_), or colon (:), and NameChar extends this to include digits (0-9), hyphen (-), period (.), and certain combining characters, though colons are reserved primarily for namespace prefixes and not typically used in the root element name.[18]
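A rough check of a candidate document type name against the XML Name production can be sketched in Python. The pattern below is an assumption-laden simplification: it covers only the ASCII portion of NameStartChar and NameChar described above and ignores the wider Unicode ranges that the XML specification also admits; the helper name is_ascii_xml_name is hypothetical.

```python
import re

# ASCII-only approximation of: Name ::= NameStartChar (NameChar)*
# NameStartChar: letter, underscore, or colon
# NameChar:      NameStartChar plus digits, hyphen, and period
_ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_ascii_xml_name(candidate: str) -> bool:
    """True if the whole string matches the ASCII subset of an XML Name."""
    return _ASCII_NAME.fullmatch(candidate) is not None


if __name__ == "__main__":
    for name in ("html", "_config", "1report", "my report"):
        print(name, is_ascii_xml_name(name))
```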
By matching the root element, the document type name implies the overall document type, allowing validators to enforce element hierarchies, attributes, and content models specific to that type during parsing. A mismatch between the declared name and the actual root element triggers a validation error, as the parser expects the document to conform to the named DTD's grammar.[17]
Historically, early HTML specifications exhibited case variations, such as uppercase HTML in some DTD references versus lowercase html in document instances, which could cause inconsistencies in case-insensitive SGML parsers but became standardized to lowercase in modern HTML5 DOCTYPEs like <!DOCTYPE html>. Common pitfalls include such case mismatches in legacy systems or failing to align custom root elements with the declared name, leading to failed validation in tools expecting strict conformance.[19]
The flexibility of the document type name supports custom identifiers for proprietary or domain-specific markup languages, permitting organizations to define bespoke root elements like report or config tied to tailored DTDs for specialized validation needs.[16]
External Identifier
The external identifier in a document type declaration specifies the location or unique identification of an external subset containing the full DTD definition, allowing processors to retrieve and apply predefined constraints from remote or local sources.[2] It appears immediately after the document type name and is optional, consisting of either a PUBLIC or SYSTEM keyword followed by one or two quoted literals.[2]
The PUBLIC keyword is used for widely available DTDs, pairing a formal public identifier (FPI) with a system literal (required in XML, optional in SGML) to reference standardized resources, such as PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd".[2] In contrast, the SYSTEM keyword references local or URL-based DTDs via a system literal alone, for example, SYSTEM "http://example.com/custom.dtd", which directly points to a specific resource without a public designation.[2]
Formal public identifiers follow a structured format defined in the SGML standard (ISO 8879), delimited by double slashes (//) and consisting of several components: an owner identifier (e.g., "-//W3C" to denote the World Wide Web Consortium), a public text class (e.g., "DTD" for document type definitions), a public text description (e.g., "HTML 4.01"), a public text language (e.g., "EN" for English), and optionally a display version.[20] This breakdown ensures unique, machine-readable identification. The owner identifier starts with "+//" for registered owners or "-//" for unregistered ones, while ISO owner identifiers such as "ISO 8879:1986" carry no prefix; this structure facilitates catalog-based resolution in tools like XML entity managers.[20]
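The component structure of an FPI can be illustrated with a small Python sketch that splits an identifier on the double-slash delimiter. It is a simplified illustration rather than a conforming ISO 8879 parser: the class and function names are hypothetical, the optional display version is not handled, and the owner prefix is returned as found without interpreting it.

```python
from dataclasses import dataclass


@dataclass
class FormalPublicIdentifier:
    owner_prefix: str   # "+" or "-", or "" when the owner carries no prefix
    owner: str          # e.g. "W3C"
    text_class: str     # e.g. "DTD"
    description: str    # e.g. "HTML 4.01"
    language: str       # e.g. "EN"


def parse_fpi(fpi: str) -> FormalPublicIdentifier:
    """Split an FPI such as '-//W3C//DTD HTML 4.01//EN' into its fields."""
    fields = fpi.split("//")
    prefix = ""
    if fields and fields[0] in ("+", "-"):
        prefix, fields = fields[0], fields[1:]
    if len(fields) < 3:
        raise ValueError(f"not enough '//'-separated fields in {fpi!r}")
    owner, text, language = fields[0], fields[1], fields[2]
    # The public text class is the first word; the rest is the description.
    text_class, _, description = text.partition(" ")
    return FormalPublicIdentifier(prefix, owner, text_class, description, language)


if __name__ == "__main__":
    print(parse_fpi("-//W3C//DTD HTML 4.01//EN"))
```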
XML processors resolve external identifiers by attempting to fetch the DTD using the system literal as a URI reference; if a public identifier is present and the system literal fails or is absent, the processor may map the FPI to an alternative URI via catalogs or predefined mappings.[2] Fetched external subsets are typically cached by the processor to avoid repeated retrievals during validation, though caching behavior is implementation-dependent.[2] If the external subset cannot be retrieved, processors fall back to any internal subset provided in the declaration, ensuring partial validation proceeds without halting.[2]
External identifier resolution introduces security risks, as fetching remote DTDs can enable XML external entity (XXE) attacks if the parser processes untrusted input and resolves external entities without restrictions, potentially allowing attackers to access internal files, perform denial-of-service, or execute remote requests. To mitigate this, modern parsers often disable external entity resolution by default or require explicit configuration.[21]
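One conservative mitigation, sketched below in Python, is to refuse any input that carries a document type declaration at all. This is only one possible policy, not a description of how any particular parser behaves by default; it assumes the standard-library xml.parsers.expat module, and the names DoctypeForbidden and accepts_without_doctype are hypothetical.

```python
import xml.parsers.expat as expat


class DoctypeForbidden(ValueError):
    """Raised when untrusted input carries a document type declaration."""


def parse_rejecting_doctype(data: bytes) -> None:
    """Parse the input, aborting as soon as a DOCTYPE declaration is seen."""
    parser = expat.ParserCreate()

    def reject(name, system_id, public_id, has_internal_subset):
        # Called by Expat when the DOCTYPE declaration starts.
        raise DoctypeForbidden(f"DOCTYPE for {name!r} is not allowed")

    parser.StartDoctypeDeclHandler = reject
    parser.Parse(data, True)


def accepts_without_doctype(data: bytes) -> bool:
    """True if the input parses and contains no DOCTYPE declaration."""
    try:
        parse_rejecting_doctype(data)
    except DoctypeForbidden:
        return False
    return True


if __name__ == "__main__":
    payload = b'<!DOCTYPE r [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><r>&xxe;</r>'
    print(accepts_without_doctype(b"<r>ok</r>"), accepts_without_doctype(payload))
```

Rejecting the declaration outright also rules out internal-subset entity expansion tricks, at the cost of refusing legitimate documents that rely on a DTD.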
Internal Subset
The internal subset of a document type declaration in XML is an optional component that provides inline definitions directly within the DOCTYPE declaration, enclosed in square brackets immediately following the optional external identifier.[14] It contains markup declarations such as element types, attribute lists, entities, and notations, enabling authors to specify document-specific rules without relying on or altering external DTD files.[22] This feature supports customization by allowing entities to be redefined locally; conditional sections, which switch blocks of declarations on or off with the keywords INCLUDE and IGNORE, are by contrast available only in the external subset.[23]

Syntactically, the internal subset consists of zero or more markup declarations separated by whitespace or parameter entity references. The latter are written with a percent sign and a semicolon (e.g., %entityName;) and promote modularity; in the internal subset they may appear only between declarations, not within them.[22] Whitespace outside of these declarations is ignored, ensuring that only the structured markup content is parsed, while parameter entities facilitate reuse of common declaration blocks for efficiency in larger documents.[24] For instance, a simple entity declaration within the subset might redefine a general entity for document-specific text substitution, following the standard entity declaration syntax.[25]
Although the internal subset enhances flexibility, it has notable limitations: it cannot redeclare element types or notations already defined in the external subset, as such duplicates result in validity errors. It can, however, override entity and attribute-list declarations from the external subset, because the internal subset is processed before the external one and the first declaration encountered is binding.[6] This processing order ensures that document-local adjustments take effect without conflicting with core structural definitions from shared external DTDs.[24]
In practice, the internal subset is commonly used for temporary entity definitions to improve document portability across systems, such as embedding short-lived parameter entities for testing environments or overriding default behaviors in isolated XML instances without distributing modified external files.[26] This approach is particularly valuable in scenarios requiring ad-hoc customizations, like prototyping markup structures or ensuring self-contained documents for legacy system integration.[6]
Syntax Examples
The simplest form of a document type declaration appears in HTML5 documents as a minimal legacy reference to ensure standards-compliant rendering by user agents. This declaration is <!DOCTYPE html>, which is case-insensitive and omits any external identifier or internal subset.[27]
A more complete external declaration, as used in HTML 4.01 Strict, references a public identifier and an external DTD resource for validation against the full markup rules. The syntax is <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, where the public identifier specifies the W3C's HTML 4.01 Strict DTD, and the URL points to the external subset containing element and attribute definitions.
In XML documents, an internal subset allows declarations directly within the document type declaration, such as defining entities without referencing an external file. For example, <!DOCTYPE root [ <!ENTITY foo "bar"> ]> declares an internal general entity named foo with replacement text "bar", enabling its use throughout the document for substitution during parsing.[6]
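A parser that exposes the doctype node makes the internal subset visible as text. The sketch below assumes Python's standard-library xml.dom.minidom, whose doctype node offers name and internalSubset attributes; the exact whitespace of the returned subset string is implementation-dependent, so the example only looks for the entity declaration inside it.

```python
from xml.dom import minidom

# The declaration from the example above, followed by a root element
# that uses the declared entity.
DOC = '<!DOCTYPE root [ <!ENTITY foo "bar"> ]><root>&foo;</root>'


def internal_subset(markup: str) -> tuple:
    """Return (document type name, raw text of the internal subset)."""
    doctype = minidom.parseString(markup).doctype
    return (doctype.name, doctype.internalSubset)


if __name__ == "__main__":
    name, subset = internal_subset(DOC)
    print(name)
    print(subset)
```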
A mixed declaration combines an external identifier with an internal subset, common in XML for overriding or extending external rules. Consider <!DOCTYPE book SYSTEM "book.dtd" [ <!ELEMENT chapter EMPTY> ]>, where SYSTEM "book.dtd" references an external subset for the primary structure, and the internal subset adds a declaration for an EMPTY chapter element. Processors read the internal subset first and then the external subset, so internal entity and attribute-list declarations take precedence; an element type such as chapter must not also be declared in book.dtd, since duplicate element declarations are a validity error.[14]
Empty declarations provide a placeholder without subsets, such as <!DOCTYPE root>, which signals the root element type but relies on no additional constraints unless an external identifier is present.[8]
For SGML compatibility, conditional sections allow markup declarations to be included or excluded, as in <![INCLUDE[ <!ENTITY example 'included'> ]]> within an external DTD file referenced by the document type declaration; the keyword is often supplied through a parameter entity so that a section can be switched between INCLUDE and IGNORE. XML permits such sections in the external subset only; they are not allowed in the internal subset.[28]
In strict HTML contexts, processors disregard any internal subset if present, treating the declaration as a simple trigger for quirks mode avoidance rather than a full DTD enforcer, unlike XML where both subsets are actively parsed for validation.[29]
Applications
In XML
In XML, the document type declaration serves as a mechanism to specify a grammar for the document's structure, enabling validation beyond basic well-formedness. Per the XML 1.0 specification, a DTD is optional for well-formed XML, which requires only proper syntax, nesting, and entity handling, but it is essential for validity, where the document must conform to the constraints defined in the DTD.[2] Namespaces, if used in the document, rely on prefix bindings in element attributes, though the DTD itself defines the qualified names for elements and attributes without native namespace support.[30]

The declaration must appear in the prolog, immediately following the optional XML declaration (e.g., <?xml version="1.0"?>) and any preceding comments or processing instructions, but before the document's root element. This positioning ensures the DTD is processed early, allowing subsequent validation of the instance. For instance, a basic declaration might read <!DOCTYPE root-element SYSTEM "example.dtd">, referencing an external DTD file, or include an internal subset for inline definitions.[2]
XML-specific features in DTDs include support for notations, which declare the format of non-XML data, such as images, via syntax like <!NOTATION GIF SYSTEM "image/gif">, and unparsed entities, which reference such external resources without parsing their contents as XML. The internal subset facilitates entity declarations tailored to XML, such as redefining or supplementing predefined entities like &lt; (for <), &gt; (for >), &amp; (for &), &apos; (for '), and &quot; (for "), which all processors must recognize regardless of explicit declaration. These features enhance modularity for handling binary or formatted data within XML contexts.[2]
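That the five predefined entities are recognized without any DOCTYPE can be checked with a few lines of Python. The sketch assumes the standard-library xml.etree.ElementTree module; the sample document is hypothetical and contains no document type declaration.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE is present, yet all five predefined entity references resolve.
DOC = "<r>&lt;tag&gt; &amp; &apos;single&apos; &quot;double&quot;</r>"


def expand_predefined(doc: str) -> str:
    """Return the root element's text with predefined entities expanded."""
    return ET.fromstring(doc).text


if __name__ == "__main__":
    print(expand_predefined(DOC))
```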
XML distinguishes well-formed documents, which pass syntactic checks without a DTD, from valid documents, which additionally satisfy the DTD's element, attribute, and entity rules. Tools like xmllint, part of the libxml2 library, enable practical validation; for example, running xmllint --noout --dtdvalid example.dtd document.xml reports errors such as undeclared elements if the document fails to conform, and produces no output if it is valid.[2]
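The distinction can be seen with a non-validating parser. The sketch below assumes Python's standard xml.parsers.expat module, which checks well-formedness only; the sample documents and the function is_well_formed are hypothetical. It accepts a document that violates its own DTD, which is why a validating tool such as xmllint is needed to check validity.

```python
import xml.parsers.expat


def is_well_formed(xml_text):
    """Return True if expat, a non-validating parser, accepts the document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Well-formed, but invalid: the DTD only allows <to> inside <note>,
# and <from> is never declared.
INVALID_BUT_WELL_FORMED = """<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>undeclared element</from></note>"""

# Not well-formed: the end tags are mis-nested.
NOT_WELL_FORMED = "<note><to>mis-nested</note></to>"

if __name__ == "__main__":
    print(is_well_formed(INVALID_BUT_WELL_FORMED))
    print(is_well_formed(NOT_WELL_FORMED))
```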
Despite these capabilities, DTDs have notable limitations in XML: they lack the features that XML Schema provides, such as rich datatypes (e.g., xs:date or xs:decimal) and namespace-qualified global type definitions, which restricts their use for intricate constraints. For simple structural validation without advanced typing or modularity needs, DTDs are straightforward and sufficient; however, for complex scenarios involving inheritance, patterns, or cross-namespace reuse, XML Schema is the recommended alternative due to its greater expressiveness and alignment with modern XML practices.[31]
In HTML and XHTML
In HTML, the document type declaration (DOCTYPE) plays a crucial role in determining how browsers render pages, primarily by triggering either standards mode or quirks mode. Standards mode, also known as strict mode, ensures compliance with web standards for consistent rendering across browsers, while quirks mode emulates the non-standard behavior of older browsers like Netscape 4 to maintain backward compatibility for legacy content. This distinction emerged around 2000, with Internet Explorer 5 for Macintosh introducing standards mode in response to the growing adoption of CSS, and subsequent browsers like Firefox and later IE versions following suit to support the Acid1 test for layout fidelity.[32][33]
For HTML 4.01 Strict, the DOCTYPE declaration is <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which excludes deprecated presentational elements and attributes to promote semantic markup and stylesheet use. In XHTML 1.0 Strict, the equivalent is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">, reformulating HTML 4.01 as an XML application with stricter syntax rules, such as case-sensitive tags and mandatory closing of elements. XHTML requires adherence to XML well-formedness, including proper nesting and quoted attributes, to enable parsing as XML rather than tag soup.[34][35][36]
To support backward compatibility, XHTML 1.0 also offers Transitional and Frameset DTDs, which permit some deprecated HTML features like presentational attributes while transitioning toward stricter XML compliance; for example, the Transitional DTD allows elements like <font> and <center> alongside modern semantic ones. In practice, browsers handle XHTML served as text/html with the HTML parser and may render it in quirks or almost-standards mode if the DOCTYPE is incomplete; full XML parsing requires the application/xhtml+xml MIME type, which was rarely adopted due to compatibility issues.[37][38]
With the advent of HTML5, the simplified DOCTYPE <!DOCTYPE html> invokes the HTML5 parser in no-quirks mode without fetching or processing a full DTD, focusing instead on robust error recovery for forgiving parsing of real-world content. The W3C Markup Validation Service checks DOCTYPE correctness against standards, flagging missing or malformed declarations that can trigger quirks mode and lead to inconsistent rendering, such as box model discrepancies in CSS. The abandonment of XHTML 2.0 in 2009, when the W3C XHTML 2 Working Group ceased operations without renewing its charter, further diminished reliance on complex DTDs in favor of HTML5's streamlined approach.[39][40][41][42]
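The mode selection described in this section can be sketched as a small classifier. This is a deliberately partial illustration, not the algorithm any browser ships: it assumes Python's standard html.parser module, models only a handful of the public identifier prefixes listed in the HTML Standard (the real list of quirks-triggering legacy identifiers is much longer), and accepts a DOCTYPE anywhere in the input. The names rendering_mode and classify_doctype are hypothetical.

```python
import re
from html.parser import HTMLParser

# Partial sample of prefixes, compared case-insensitively (an assumption
# of this sketch; the HTML Standard lists many more legacy identifiers).
QUIRKS_PREFIXES = (
    "-//ietf//dtd html//",
    "-//w3c//dtd html 3.2//",
    "-//w3c//dtd html 4.0 transitional//",
)
LIMITED_QUIRKS_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 frameset//",
    "-//w3c//dtd xhtml 1.0 transitional//",
)
HTML401_LOOSE_PREFIXES = (
    "-//w3c//dtd html 4.01 frameset//",
    "-//w3c//dtd html 4.01 transitional//",
)

_DOCTYPE_RE = re.compile(
    r'''^DOCTYPE\s+(?P<name>[^\s>]+)
        (?:\s+PUBLIC\s+(?P<q1>["'])(?P<public>.*?)(?P=q1)
            (?:\s*(?P<q2>["'])(?P<system>.*?)(?P=q2))?
        )?''',
    re.IGNORECASE | re.VERBOSE,
)


def classify_doctype(decl):
    """Map the text inside <!...> to 'quirks', 'limited-quirks' or 'no-quirks'."""
    if decl is None:
        return "quirks"  # a missing DOCTYPE triggers quirks mode
    match = _DOCTYPE_RE.match(decl.strip())
    if not match or match.group("name").lower() != "html":
        return "quirks"
    public = (match.group("public") or "").lower()
    has_system = match.group("system") is not None
    if public.startswith(QUIRKS_PREFIXES):
        return "quirks"
    if public.startswith(HTML401_LOOSE_PREFIXES):
        # These identifiers are only "almost standards" when a system
        # identifier accompanies them.
        return "limited-quirks" if has_system else "quirks"
    if public.startswith(LIMITED_QUIRKS_PREFIXES):
        return "limited-quirks"
    return "no-quirks"


class _DoctypeFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # Keep the first declaration that looks like a DOCTYPE.
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def rendering_mode(html_text):
    finder = _DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return classify_doctype(finder.doctype)


if __name__ == "__main__":
    print(rendering_mode("<!DOCTYPE html><html></html>"))
    print(rendering_mode("<html></html>"))
```

Note that the classifier never dereferences the system identifier, mirroring the fact that browsers use the DOCTYPE only as a switch and do not fetch the DTD.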
In Other Markup Languages
Document type definitions (DTDs) originated in SGML and found direct application in publishing workflows, particularly through the DocBook DTD, which was developed starting in 1991 by HaL Computer Systems and O'Reilly & Associates for structuring technical documentation in hardware and software industries.[43] This SGML-based DTD enabled consistent markup for books, articles, and reference materials, facilitating interchange and processing in professional publishing environments since its initial release around 1992.[43]
Beyond core publishing, DTDs appeared in specialized markup languages derived from or compatible with SGML traditions. For instance, MathML 1.0, released in 1998 as an XML application, included a comprehensive DTD in its appendix to define elements for mathematical notation, such as <apply> for operations and <cn> for numbers, ensuring structured representation of equations.[44] Similarly, SVG 1.0, specified in 2001, provided an SGML-compatible DTD to validate vector graphics documents, outlining core elements like <svg>, <path>, and <circle> for scalable two-dimensional imagery.[45] In syndication formats, early RSS versions, such as RSS 0.90 from 1999, relied on RDF-based structures with associated DTDs to define feeds containing channels, items, and metadata like titles and links.
DTDs in full SGML contexts offered greater flexibility than in XML, permitting features like short reference maps (mappings of character strings to entity replacements without angle brackets) for concise markup in document instances, a capability excluded in XML to simplify parsing.[46] Entity handling also differed, with SGML allowing unclosed entity references and more lenient parameter entity inclusions in DTDs, whereas XML mandates explicit closure and restricts inclusions to simplify validation and reduce ambiguity.[46]
Today, DTD usage in non-XML markup remains limited to legacy SGML systems, primarily during migrations to XML where original DTDs must be converted to handle incompatibilities like omitted tags or subdocuments, often using tools to generate XML-compliant equivalents for archival or modernization efforts.[47]
