CDATA
from Wikipedia

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

CDATA sections in XML

In an XML document or external entity, a CDATA section is a piece of element content that is marked up to be interpreted literally, as textual data, not as marked-up content.[1] A CDATA section is merely an alternative syntax for expressing character data; there is no semantic difference between character data in a CDATA section and character data in standard syntax where, for example, "<" and "&" are represented by "&lt;" and "&amp;", respectively.

Syntax and interpretation

A CDATA section starts with the following sequence:

<![CDATA[

and ends with the next occurrence of the sequence:

]]>

All characters enclosed between these two sequences are interpreted as characters, not markup or entity references. Every character is taken literally, the only exception being the ]]> sequence of characters. In:

<sender>John Smith</sender>

the start and end "sender" tags are interpreted as markup. However, the code:

<![CDATA[<sender>John Smith</sender>]]>

is equivalent to:

&lt;sender&gt;John Smith&lt;/sender&gt;

Thus, the "tags" will have exactly the same status as the "John Smith"; they will be treated as text.

Similarly, if the numeric character reference &#240; appears in element content, it will be interpreted as the single Unicode character 00F0 (small letter eth). But if the same appears in a CDATA section, it will be parsed as six characters: ampersand, hash mark, digit 2, digit 4, digit 0, semicolon.
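
The equivalence and the character-reference behaviour described above can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree parser; the element names msg and a and the variable names are illustrative and not part of the article.

```python
import xml.etree.ElementTree as ET

# The same text written once as a CDATA section and once with escapes.
as_cdata = ET.fromstring("<msg><![CDATA[<sender>John Smith</sender>]]></msg>")
as_escaped = ET.fromstring("<msg>&lt;sender&gt;John Smith&lt;/sender&gt;</msg>")

# Both forms yield identical character data and no child elements.
same_text = as_cdata.text == as_escaped.text
child_count = len(as_cdata)

# A numeric character reference is expanded in ordinary element content ...
expanded = ET.fromstring("<a>&#240;</a>").text
# ... but stays as six literal characters inside a CDATA section.
literal = ET.fromstring("<a><![CDATA[&#240;]]></a>").text

print(same_text, child_count, len(expanded), len(literal))  # True 0 1 6
```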

Uses of CDATA sections

New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer options for independent access to CDATA sections, but such options exist above and beyond the normal requirements of XML processing systems, and still do not change the implicit meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup. CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section.
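
As a rough illustration of an API that exposes CDATA sections separately, the sketch below compares two parsers from the Python standard library. The sample document and variable names are assumptions made for the example. xml.dom.minidom keeps a distinct CDATASection node, while xml.etree.ElementTree reports only the resulting character data.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

SOURCE = "<p><![CDATA[a < b]]> and c &lt; d</p>"

# DOM view: the CDATA section is kept as its own node next to a text node.
dom_root = minidom.parseString(SOURCE).documentElement
node_types = [child.nodeType for child in dom_root.childNodes]
dom_text = "".join(child.data for child in dom_root.childNodes)

# ElementTree view: only the character data survives, with no trace of
# which part was written as a CDATA section.
et_text = ET.fromstring(SOURCE).text

print(node_types == [Node.CDATA_SECTION_NODE, Node.TEXT_NODE])
print(dom_text == et_text)
```

Either way the character data is the same; the DOM merely records how it was written.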

Nesting

A CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:

<![CDATA[]]]]><![CDATA[>]]>

This means that to encode "]]>" in the middle of a CDATA section, replace all occurrences of "]]>" with the following:

]]]]><![CDATA[>

This effectively stops and restarts the CDATA section.
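
The replacement rule can be sketched as a small helper. The function name to_cdata and the sample string are hypothetical, and the sketch does not deal with characters that XML forbids outright (such as most control characters).

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting each "]]>" just before the ">"."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


sample = "if (a[b[0]]>c) { x = ']]>'; }"
wrapped = to_cdata(sample)

# Parsing the wrapped form gives back the original text unchanged.
round_tripped = ET.fromstring(f"<code>{wrapped}</code>").text
print(round_tripped == sample)  # True
```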

Issues with encoding

In text data, any Unicode character not available in the encoding declared in the <?xml ...?> header can be represented using a &#nnn; numeric character reference. But the text within a CDATA section is strictly limited to the characters available in the encoding.

Because of this, using a CDATA section programmatically to quote data that could potentially contain '&' or '<' characters can cause problems when the data happens to contain characters that cannot be represented in the encoding. Depending on the implementation of the encoder, these characters can get lost, can get converted to the characters of the &#nnn; character reference, or can cause the encoding to fail. But they will not be maintained.

Another issue is that an XML document can be transcoded from one encoding to another during transport. When the XML document is converted to a more limited character set, such as ASCII, characters that can no longer be represented are converted to &#nnn; character references for a lossless conversion. But within a CDATA section, these characters cannot be represented at all, and have to be removed or converted to some equivalent, altering the content of the CDATA section.
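
A minimal sketch of the problem, assuming a target encoding of ASCII and using Python's built-in "xmlcharrefreplace" error handler as a stand-in for whatever encoder or transcoder is in use (the sample text is illustrative):

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9 & cr\u00e8me"

# Ordinary content: escape markup characters, then let the ASCII encoder
# turn unrepresentable characters into numeric character references.
escaped = text.replace("&", "&amp;").replace("<", "&lt;")
plain_doc = f"<a>{escaped}</a>".encode("ascii", "xmlcharrefreplace")

# CDATA: the same encoder strategy changes the content, because references
# are not expanded inside a CDATA section.
cdata_doc = f"<a><![CDATA[{text}]]></a>".encode("ascii", "xmlcharrefreplace")

plain_text = ET.fromstring(plain_doc).text
cdata_text = ET.fromstring(cdata_doc).text

print(plain_text == text)  # True: the references are expanded again
print(cdata_text)          # caf&#233; & cr&#232;me
```

With a strict encoder (no error handler) the CDATA variant would instead fail with a UnicodeEncodeError, which corresponds to the "encoding fails" outcome mentioned above.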

Use of CDATA in program output

CDATA sections in XHTML documents are liable to be parsed differently by web browsers if they render the document as HTML, since HTML parsers do not recognise the CDATA start and end markers, nor do they recognise HTML entity references such as &lt; within <script> tags. This can cause rendering problems in web browsers and can lead to cross-site scripting vulnerabilities if used to display data from untrusted sources, since the two kinds of parser will disagree on where the CDATA section ends.

Since it is useful to be able to use less-than signs (<) and ampersands (&) in web page scripts, and to a lesser extent styles, without having to remember to escape them, it is common to use CDATA markers around the text of inline <script> and <style> elements in XHTML documents. But so that the document can also be parsed by HTML parsers, which do not recognise the CDATA markers, the CDATA markers are usually commented-out, as in this JavaScript example:

<script type="text/javascript">
//<![CDATA[
document.write("<");
//]]>
</script>

or this CSS example:

<style type="text/css">
/*<![CDATA[*/
body { background-image: url("marble.png?width=300&height=300") }     
/*]]>*/
</style>

This technique is only necessary when using inline scripts and stylesheets, and is language-specific. CSS stylesheets, for example, only support the second style of commenting-out (/* … */), but CSS also has less need for the < and & characters than JavaScript and so less need for explicit CDATA markers.
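
To see what an XML parser makes of the commented-out markers, the JavaScript example can be fed to Python's xml.etree.ElementTree. This is only a sketch of the XML side; it does not model how an HTML parser treats the same text.

```python
import xml.etree.ElementTree as ET

XHTML_SCRIPT = '''<script type="text/javascript">
//<![CDATA[
document.write("<");
//]]>
</script>'''

# The XML parser consumes the CDATA markers; the script text that remains
# starts and ends with an empty "//" comment line.
script_text = ET.fromstring(XHTML_SCRIPT).text
print(repr(script_text))  # '\n//\ndocument.write("<");\n//\n'
```

An HTML parser, by contrast, passes the markers through as part of the script, where the leading // turns each of them into a harmless JavaScript comment.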

CDATA in DTDs

CDATA-type attribute value

In document type definition (DTD) files for SGML and XML, an attribute value may be designated as being of type CDATA: arbitrary character data. Within a CDATA-type attribute, character and entity reference markup is allowed and will be processed when the document is read.

For example, if an XML DTD contains:

<!ATTLIST foo a CDATA #IMPLIED>

it means that elements named foo may optionally have an attribute named a, which is of type CDATA. In an XML document that is valid according to this DTD, an element like this might appear:

<foo a="1 &amp; 2 are &lt; &#51; &#x0A;" />

and an XML parser would interpret the a attribute's value as being the character data "1 & 2 are < 3" followed by a space and a line feed character (the character referenced by &#x0A;).
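
The attribute example can be tried with Python's xml.etree.ElementTree, a non-validating parser that still reads the internal DTD subset. Wrapping the declaration in a DOCTYPE is an assumption made here so that the example is a complete document.

```python
import xml.etree.ElementTree as ET

DOC = '''<!DOCTYPE foo [
  <!ATTLIST foo a CDATA #IMPLIED>
]>
<foo a="1 &amp; 2 are &lt; &#51; &#x0A;" />'''

# Entity and character references in the attribute value are expanded;
# &#51; becomes "3" and &#x0A; becomes a literal line feed.
value = ET.fromstring(DOC).get("a")
print(repr(value))  # '1 & 2 are < 3 \n'
```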

CDATA-type entity

An SGML or XML DTD may also include entity declarations in which the token CDATA is used to indicate that the entity consists of character data. The character data may appear within the declaration itself or may be available externally, referenced by a URI. In either case, character reference and parameter entity reference markup is allowed in the entity, and will be processed as such when it is read.

from Grokipedia
In XML documents, CDATA sections provide a mechanism to include blocks of literal character data that would otherwise be interpreted as markup, allowing characters such as less-than signs (<) and ampersands (&) to appear unescaped.[1] These sections are delimited by the opening sequence <![CDATA[ and the closing sequence ]]>, and they may occur anywhere that character data is permitted, such as within element content but not in attribute values or document type declarations.[1] The content within a CDATA section is treated entirely as character data by XML processors, meaning it is not parsed for markup, entity references, or other structural elements, except for the closing delimiter itself, which cannot appear inside the section.[1] CDATA sections cannot nest, and the only recognized delimiter within them is the end sequence ]]>, ensuring that the enclosed text remains opaque to the parser.[1] This feature is particularly useful for embedding raw text, such as XML examples, scripts, or data containing reserved characters, without the need for individual entity escaping.[1] CDATA sections were first defined in the Extensible Markup Language (XML) 1.0 Recommendation by the World Wide Web Consortium (W3C), published on February 10, 1998, and remain a core part of XML syntax in both XML 1.0 and XML 1.1, first published on February 4, 2004 (second edition August 16, 2006), with no substantive changes to their behavior.[2][3] They are represented in the Document Object Model (DOM) as CDATASection nodes, which can be created, manipulated, or normalized during XML processing.[1]

Fundamentals of CDATA

Definition and Purpose

CDATA, an abbreviation for Character Data, refers to a designated section in XML documents that enables the direct inclusion of text blocks containing reserved characters such as <, >, and & without requiring entity escaping.[4] These sections are defined in the XML 1.0 specification as a mechanism to escape content that might otherwise be misinterpreted as markup by the parser.[4] The primary purpose of CDATA sections is to ensure that the enclosed text is treated as literal character data, bypassing XML parsing rules for markup and entity references, which simplifies the integration of unstructured or raw content like JavaScript code, CSS styles, or external data snippets into XML structures.[4] This approach avoids the complexity of repeatedly escaping special characters, making document authoring more efficient for certain use cases.[4]

Unlike PCDATA (Parsed Character Data), which is the default for text content in XML elements and is fully parsed, including entity expansion and markup recognition, CDATA sections remain unparsed except for the closing delimiter, preserving the original text intact.[4][5] For instance, in <![CDATA[<warning>Caution: &amp; proceed!</warning>]]> the angle brackets and the ampersand are taken literally: the parser does not treat <warning> as an element tag, and &amp; stays as the five characters "&amp;" rather than being expanded to "&".[4]

Historical Development

The concept of CDATA originated in the Standard Generalized Markup Language (SGML), defined in ISO 8879:1986, where it served as a declaration for unparsed character data blocks, allowing inclusion of text without markup interpretation, alongside related types like PCDATA and RCDATA.[6] This feature enabled SGML documents to handle raw content, such as scripts or literal text, while maintaining structural integrity through declared content models.[6]

CDATA was formalized and adapted in the Extensible Markup Language (XML) 1.0 specification, published as a W3C Recommendation on February 10, 1998, and edited by Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen.[7] As XML was designed as a simplified subset of SGML to facilitate web use and compatibility with existing data, CDATA sections were retained to support migration of legacy SGML content into the new format, ensuring that unparsed data could be preserved without requiring extensive reformatting.[7] The specification explicitly defines CDATA sections to begin with "<![CDATA[" and end with "]]>", treating enclosed content as literal characters exempt from entity expansion or tag recognition.[7]

Subsequent updates in XML 1.1, released as a W3C Recommendation on February 4, 2004 (with a second edition in 2006), retained CDATA with only minor adjustments to accommodate enhanced Unicode support and line-ending normalization that also covers NEL (Next Line) characters, without altering its core functionality or syntax.[3] These changes focused on internationalization rather than restructuring CDATA, maintaining backward compatibility for SGML-derived applications.[3]

CDATA's structure influenced related standards, notably serving as the basis for handling unparsed content in XHTML 1.0, a W3C Recommendation from January 26, 2000, which reformulated HTML as an XML application and explicitly recognized CDATA sections in its processing model.[8] This adoption extended to early web-based XML applications, where CDATA enabled embedding of complex data like JavaScript or stylesheets within documents.[8]

CDATA Sections in XML

Syntax and Delimiters

In XML documents, CDATA sections are delimited by specific markup sequences that instruct parsers to treat the enclosed content as literal character data rather than markup. The opening delimiter is the exact string <![CDATA[, where "CDATA" must appear in uppercase letters, as XML markup is case-sensitive.[4] This sequence must be written verbatim, with no whitespace or other characters inside it; otherwise it is not recognized as the start of a CDATA section.[9] The closing delimiter is the string ]]>, which signals the end of the CDATA section and resumes normal XML parsing. Within the CDATA section, only this closing sequence is recognized as markup; all other characters, including angle brackets (< and >) and ampersands (&), are preserved literally without requiring entity escaping.[4] CDATA sections can only appear in locations where character data is permitted, such as the content of elements, but not within attribute values, processing instructions, comments, or the document prolog.[4] Whitespace outside the delimiters is ordinary character data of the enclosing element and is reported like any other element content, while whitespace inside the section is preserved as written, apart from line-ending normalization.[4] For instance, the following XML snippet demonstrates a valid CDATA section containing HTML-like markup that would otherwise be parsed as a child element:
<data><![CDATA[<script>alert('Hello, World!');</script>]]></data>
Here, the content <script>alert('Hello, World!');</script> is treated as plain text, avoiding the need to escape the angle brackets or other reserved characters.[4]
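
The delimiter rules can be probed with a small well-formedness check. This is a sketch that assumes Python's xml.etree.ElementTree; the helper name is_well_formed and the test documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


results = {
    "uppercase delimiter": is_well_formed("<data><![CDATA[<b>]]></data>"),
    "lowercase delimiter": is_well_formed("<data><![cdata[<b>]]></data>"),
    "inside attribute value": is_well_formed('<data v="<![CDATA[x]]>"/>'),
}
for case, ok in results.items():
    print(f"{case}: {ok}")
```

Only the first document is accepted: the lowercase spelling is not a CDATA delimiter, and a literal < is not allowed in an attribute value.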

Parsing and Interpretation

XML parsers treat the content within a CDATA section as unmarked character data, distinct from parsed markup. Specifically, upon encountering the opening delimiter <![CDATA[, the parser switches to a mode where it does not recognize element tags, entity references, or attribute structures; instead, all characters up to the closing delimiter ]]> are output literally to the application without further interpretation or processing.[4] This behavior ensures that potentially problematic characters, such as < or &, are preserved exactly as they appear in the source document.[4] Entity references within CDATA sections are not resolved or expanded; for example, an unescaped & remains as the literal ampersand character rather than being interpreted as the start of an entity.[4] Line endings in the content undergo normalization according to XML's end-of-line handling rules, where a CR-LF pair or a lone carriage return (CR) is converted to a single line feed (LF, #xA) before the data is passed to the application.[10]

The only sequence recognized as markup inside a CDATA section is the closing delimiter ]]>, which must not appear within the content itself.[11] If the sequence ]]> occurs within the intended CDATA content, the parser ends the section at that point and parses the remainder as ordinary content; the leftover text, which normally still contains the intended closing ]]>, then violates the rules for character data, and parsers must report this as a fatal error.[11] The content therefore has to be restructured (for example, by splitting it across two CDATA sections), or moved out of the CDATA section and written as ordinary character data with the sequence escaped as ]]&gt;; no escaping is possible inside the section itself.[11] CDATA sections must be properly placed and terminated for a document to be well-formed, but they do not inherently impose or alter schema-level validity constraints unless explicitly defined in a schema or DTD.[12]
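
A short sketch of these parsing rules, assuming Python's xml.etree.ElementTree (which uses the Expat parser); the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

# CR LF and a lone CR inside a CDATA section both reach the application as LF.
normalized = ET.fromstring("<a><![CDATA[x\r\ny\rz]]></a>").text

# A "]]>" inside the intended content ends the section early; the leftover
# "]]>" then sits in ordinary content, which is not well-formed.
try:
    ET.fromstring("<a><![CDATA[one ]]> two]]></a>")
    early_close_accepted = True
except ET.ParseError:
    early_close_accepted = False

# In ordinary character data the sequence can be written as "]]&gt;".
escaped = ET.fromstring("<a>one ]]&gt; two</a>").text

print(repr(normalized), early_close_accepted, repr(escaped))
```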

Common Uses and Benefits

CDATA sections are primarily employed to embed blocks of unparsed text, such as JavaScript code, CSS stylesheets, or HTML fragments, within XML-based documents like XHTML or SVG, where these elements would otherwise require extensive escaping of special characters including <, >, and &. For instance, in XHTML, wrapping the content of or