Microsoft Office XML formats
| WordProcessingML | |
|---|---|
| Filename extension | .XML (XML document) |
| Developed by | Microsoft |
| Type of format | Document file format |
| Extended from | XML, DOC |
| DataDiagramingML | |
|---|---|
| Filename extension | .VDX (XML Drawing), .VSX (XML Stencil), .VTX (XML Template) |
| Developed by | Microsoft |
| Type of format | Diagramming vector graphics |
| Extended from | XML, VSD, VSS, VST |
| SpreadsheetML | |
|---|---|
| Filename extension | .XML (XML Spreadsheet) |
| Developed by | Microsoft |
| Type of format | Spreadsheet |
| Extended from | XML, XLS |
The Microsoft Office XML formats are XML-based document formats (or XML schemas) introduced in versions of Microsoft Office prior to Office 2007. Microsoft Office XP introduced a new XML format for storing Excel spreadsheets and Office 2003 added an XML-based format for Word documents.
These formats were succeeded by Office Open XML (ECMA-376) in Microsoft Office 2007.
File formats
- Microsoft Office Word 2003 XML Format — WordProcessingML or WordML (.XML)
- Microsoft Office Excel 2002 and Excel 2003 XML Format — SpreadsheetML (.XML)
- Microsoft Office Visio 2003 XML Format — DataDiagramingML (.VDX, .VSX, .VTX)
- Microsoft Office InfoPath 2003 XML Format — XML FormTemplate (.XSN) (Compressed XML templates in a Cabinet file)
Limitations and differences with Office Open XML
Besides differences in the schema, there are several other differences between the earlier Office XML schema formats and Office Open XML.
- Whereas the data in Office Open XML documents is split into multiple parts and compressed in a ZIP file conforming to the Open Packaging Conventions, Microsoft Office XML formats are stored as single, monolithic plain-text XML files, which makes them quite large compared with both OOXML and the legacy Microsoft Office binary formats. Embedded items such as pictures are stored as binary-encoded blocks within the XML. In Office Open XML, by contrast, the headers, footers, comments, and similar parts of a document are all stored separately.
- XML Spreadsheet documents cannot store Visual Basic for Applications macros, auditing tracer arrows, charts and other graphic objects, custom views, drawing object layers, outlining, scenarios, shared workbook information and user-defined function categories.[1] In contrast, the newer Office Open XML formats support full document fidelity.
- The formats have poor backward compatibility with versions of Word and Excel older than the one in which they were introduced. For example, Word 2002 cannot open Word 2003 XML files unless a third-party converter add-in is installed.[2] Microsoft has released a Word 2003 XML Viewer which allows WordProcessingML files saved by Word 2003 to be viewed as HTML from within Internet Explorer.[3] For Office Open XML, Microsoft provides converters for Office 2003, Office XP and Office 2000.
- Office Open XML formats are also defined for PowerPoint 2007, equation editing (Office MathML), vector drawing, charts and text art (DrawingML).
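The packaging difference described in the list above is visible in the first bytes of a file. The following is a minimal sketch, not an official detection routine: the function name `sniff_office_format` and its return labels are hypothetical, and it only assumes that ZIP packages begin with the `PK\x03\x04` signature and that flat Office XML files carry an `mso-application` processing instruction near the top, as in the examples in this article.

```python
import re


def sniff_office_format(data: bytes) -> str:
    """Classify leading bytes as an OOXML ZIP package, flat Office XML, or unknown."""
    # Office Open XML files are ZIP packages, which start with "PK\x03\x04".
    if data.startswith(b"PK\x03\x04"):
        return "ooxml-zip"
    # Flat Office XML files name their application in a processing instruction.
    head = data[:1024].decode("utf-8", errors="replace")
    match = re.search(r'<\?mso-application\s+progid="([^"]+)"\?>', head)
    if match:
        return "flat-xml:" + match.group(1)
    return "unknown"


flat = b'<?xml version="1.0"?>\n<?mso-application progid="Word.Document"?>\n<w:wordDocument/>'
print(sniff_office_format(flat))
```

A real tool would also need to handle UTF-16 encoded files, which this sketch does not attempt.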
Word XML format example
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office"
w:macrosPresent="no"
w:embeddedObjPresent="no"
w:ocxPresent="no"
xml:space="preserve">
<o:DocumentProperties>
<o:Title>This is the title</o:Title>
<o:Author>Darl McBride</o:Author>
<o:LastAuthor>Bill Gates</o:LastAuthor>
<o:Revision>1</o:Revision>
<o:TotalTime>0</o:TotalTime>
<o:Created>2007-03-15T23:05:00Z</o:Created>
<o:LastSaved>2007-03-15T23:05:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Words>6</o:Words>
<o:Characters>40</o:Characters>
<o:Company>SCO Group, Inc.</o:Company>
<o:Lines>1</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:CharactersWithSpaces>45</o:CharactersWithSpaces>
<o:Version>11.6359</o:Version>
</o:DocumentProperties>
<w:fonts>
<w:defaultFonts
w:ascii="Times New Roman"
w:fareast="Times New Roman"
w:h-ansi="Times New Roman"
w:cs="Times New Roman" />
</w:fonts>
<w:styles>
<w:versionOfBuiltInStylenames w:val="4" />
<w:latentStyles w:defLockedState="off" w:latentStyleCount="156" />
<w:style w:type="paragraph" w:default="on" w:styleId="Normal">
<w:name w:val="Normal" />
<w:rPr>
<wx:font wx:val="Times New Roman" />
<w:sz w:val="24" />
<w:sz-cs w:val="24" />
<w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA" />
</w:rPr>
</w:style>
<w:style w:type="paragraph" w:styleId="Heading1">
<w:name w:val="heading 1" />
<wx:uiName wx:val="Heading 1" />
<w:basedOn w:val="Normal" />
<w:next w:val="Normal" />
<w:rsid w:val="00D93B94" />
<w:pPr>
<w:pStyle w:val="Heading1" />
<w:keepNext />
<w:spacing w:before="240" w:after="60" />
<w:outlineLvl w:val="0" />
</w:pPr>
<w:rPr>
<w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial" />
<wx:font wx:val="Arial" />
<w:b />
<w:b-cs />
<w:kern w:val="32" />
<w:sz w:val="32" />
<w:sz-cs w:val="32" />
</w:rPr>
</w:style>
<w:style w:type="character" w:default="on" w:styleId="DefaultParagraphFont">
<w:name w:val="Default Paragraph Font" />
<w:semiHidden />
</w:style>
<w:style w:type="table" w:default="on" w:styleId="TableNormal">
<w:name w:val="Normal Table" />
<wx:uiName wx:val="Table Normal" />
<w:semiHidden />
<w:rPr>
<wx:font wx:val="Times New Roman" />
</w:rPr>
<w:tblPr>
<w:tblInd w:w="0" w:type="dxa" />
<w:tblCellMar>
<w:top w:w="0" w:type="dxa" />
<w:left w:w="108" w:type="dxa" />
<w:bottom w:w="0" w:type="dxa" />
<w:right w:w="108" w:type="dxa" />
</w:tblCellMar>
</w:tblPr>
</w:style>
<w:style w:type="list" w:default="on" w:styleId="NoList">
<w:name w:val="No List" />
<w:semiHidden />
</w:style>
</w:styles>
<w:docPr>
<w:view w:val="print" />
<w:zoom w:percent="100" />
<w:doNotEmbedSystemFonts />
<w:proofState w:spelling="clean" w:grammar="clean" />
<w:attachedTemplate w:val="" />
<w:defaultTabStop w:val="720" />
<w:punctuationKerning />
<w:characterSpacingControl w:val="DontCompress" />
<w:optimizeForBrowser />
<w:validateAgainstSchema />
<w:saveInvalidXML w:val="off" />
<w:ignoreMixedContent w:val="off" />
<w:alwaysShowPlaceholderText w:val="off" />
<w:compat>
<w:breakWrappedTables />
<w:snapToGridInCell />
<w:wrapTextWithPunct />
<w:useAsianBreakRules />
<w:dontGrowAutofit />
</w:compat>
</w:docPr>
<w:body>
<wx:sect>
<w:p>
<w:r>
<w:t>This is the first paragraph</w:t>
</w:r>
</w:p>
<wx:sub-section>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1" />
</w:pPr>
<w:r>
<w:t>This is a heading</w:t>
</w:r>
</w:p>
<w:sectPr>
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1440"
w:right="1800"
w:bottom="1440"
w:left="1800"
w:header="720"
w:footer="720"
w:gutter="0" />
<w:cols w:space="720" />
<w:docGrid w:line-pitch="360" />
</w:sectPr>
</wx:sub-section>
</wx:sect>
</w:body>
</w:wordDocument>
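Because the whole document is one XML file, its metadata can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `document_properties` is hypothetical, and `SAMPLE` is an abbreviated copy of the document above that keeps only part of the properties block.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "o": "urn:schemas-microsoft-com:office:office",
}

# Abbreviated copy of the example document: only some properties are kept.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties>
    <o:Title>This is the title</o:Title>
    <o:Revision>1</o:Revision>
    <o:Words>6</o:Words>
  </o:DocumentProperties>
  <w:body/>
</w:wordDocument>"""


def document_properties(xml_bytes):
    """Return the children of o:DocumentProperties as a {name: text} dict."""
    root = ET.fromstring(xml_bytes)
    props = root.find("o:DocumentProperties", NS)
    if props is None:
        return {}
    # ElementTree tags look like "{namespace}Title"; keep only the local name.
    return {child.tag.split("}", 1)[1]: (child.text or "") for child in props}


print(document_properties(SAMPLE))
```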
Excel XML spreadsheet example
<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="https://www.w3.org/TR/html401/">
<Worksheet ss:Name="CognaLearn+Intedashboard">
<Table>
<Column ss:Index="1" ss:AutoFitWidth="0" ss:Width="110"/>
<Row>
<Cell><Data ss:Type="String">ID</Data></Cell>
<Cell><Data ss:Type="String">Project</Data></Cell>
<Cell><Data ss:Type="String">Reporter</Data></Cell>
<Cell><Data ss:Type="String">Assigned To</Data></Cell>
<Cell><Data ss:Type="String">Priority</Data></Cell>
<Cell><Data ss:Type="String">Severity</Data></Cell>
<Cell><Data ss:Type="String">Reproducibility</Data></Cell>
<Cell><Data ss:Type="String">Product Version</Data></Cell>
<Cell><Data ss:Type="String">Category</Data></Cell>
<Cell><Data ss:Type="String">Date Submitted</Data></Cell>
<Cell><Data ss:Type="String">OS</Data></Cell>
<Cell><Data ss:Type="String">OS Version</Data></Cell>
<Cell><Data ss:Type="String">Platform</Data></Cell>
<Cell><Data ss:Type="String">View Status</Data></Cell>
<Cell><Data ss:Type="String">Updated</Data></Cell>
<Cell><Data ss:Type="String">Summary</Data></Cell>
<Cell><Data ss:Type="String">Status</Data></Cell>
<Cell><Data ss:Type="String">Resolution</Data></Cell>
<Cell><Data ss:Type="String">Fixed in Version</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="Number">0000033</Data></Cell>
<Cell><Data ss:Type="String">CognaLearn Intedashboard</Data></Cell>
<Cell><Data ss:Type="String">janardhana.l</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">normal</Data></Cell>
<Cell><Data ss:Type="String">text</Data></Cell>
<Cell><Data ss:Type="String">always</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">GUI</Data></Cell>
<Cell><Data ss:Type="String">2016-10-14</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">public</Data></Cell>
<Cell><Data ss:Type="String">2016-10-14</Data></Cell>
<Cell><Data ss:Type="String">IE8 browser_Modules screen tool tip text is shown twice</Data></Cell>
<Cell><Data ss:Type="String">new</Data></Cell>
<Cell><Data ss:Type="String">open</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
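A spreadsheet in this format can be turned into rows of values by walking `Row`, `Cell` and `Data` elements and honouring the `ss:Type` attribute. The following is a minimal sketch with Python's standard library; `read_rows` is a hypothetical helper, `SAMPLE` is a two-column abbreviation of the workbook above, and cells that lack a `Data` child are simply skipped, which a real reader would need to treat more carefully.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}
TYPE = "{%s}Type" % SS  # namespaced attribute name for ss:Type

# Two-column abbreviation of the workbook shown above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Issues">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">ID</Data></Cell>
        <Cell><Data ss:Type="String">Project</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="Number">0000033</Data></Cell>
        <Cell><Data ss:Type="String"></Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def read_rows(xml_bytes):
    """Return each Row as a list, converting ss:Type="Number" cells to float."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for row in root.iterfind(".//ss:Table/ss:Row", NS):
        values = []
        for data in row.iterfind("ss:Cell/ss:Data", NS):
            text = data.text or ""
            values.append(float(text) if data.get(TYPE) == "Number" else text)
        rows.append(values)
    return rows


print(read_rows(SAMPLE))
```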
References
- ^ "Features and limitations of XML Spreadsheet format (broken)". Archived from the original on 2007-10-09. Retrieved 2007-11-01.
- ^ "Polar WordML add-in (broken)". Archived from the original on 2009-04-11. Retrieved 2007-11-01.
- ^ Word 2003 XML Viewer
- Overview of Office 2003 Developer Technologies
- Office 2003 XML. ISBN 0-596-00538-5
Microsoft Office XML formats
<w:wordDocument>, supporting round-trip editing and features like real-time schema validation.[2] Similarly, Excel's XML Spreadsheet 2003 format (XMLSS) organizes worksheet data into hierarchical XML elements, facilitating mapping of custom schemas to cells for importing and exporting structured information such as financial reports or invoices.[3] These capabilities were made available under a royalty-free license starting December 2003, promoting broader adoption for automated processing and cross-application data reuse.[1]
Notable aspects include enhanced data extraction—separating content from presentation for easier assembly and reporting—and compatibility extensions for InfoPath forms, though PowerPoint lacked a dedicated XML schema until later iterations.[1] Overall, these formats marked a shift toward open, extensible standards in Office, laying groundwork for the zipped Open XML in Office 2007 while addressing limitations of proprietary binary files like .doc and .xls.[4]
Introduction
Definition and Purpose
Microsoft Office XML formats are open, XML-based schemas designed for representing documents created in Microsoft applications such as Word, Excel, Visio, and InfoPath. These schemas, including WordprocessingML for word processing documents, SpreadsheetML for spreadsheets, DataDiagramingML for diagrams, and XML structures for forms, allow documents to be stored as plain-text, human-readable files with a .xml extension, in contrast to the proprietary binary formats like .doc or .xls prevalent in prior versions. This structure enables direct editing with standard text editors and XML tools, promoting cross-platform compatibility without dependence on Microsoft Office software.[2][5][6][7]
The primary purpose of these formats is to facilitate developer access and data exchange by decoupling document content from application-specific binaries, thereby enhancing interoperability across diverse systems and programming environments. By leveraging XML's extensible nature, the formats support integration with web services, databases, and other data sources, allowing for seamless manipulation, validation, and transformation of content using standards like XSLT. Additionally, the human-readable design reduces risks associated with file corruption, as issues can be identified and resolved more easily than in opaque binary files.[2][5][6]
These single-file XML documents were developed to align with emerging XML standards of the early 2000s, enabling better web integration, data portability, and long-term archival of office productivity content in a non-proprietary manner. For instance, WordprocessingML and SpreadsheetML provide structured vocabularies that preserve rich formatting and data while supporting custom extensions for business-specific needs.[2][5]
Historical Development
Microsoft's strategic embrace of XML for Office applications began in 2000 as part of its broader .NET initiative and XML Web Services Scenarios, announced at events like the Fusion 2000 conference, aiming to enable better data interoperability and web integration for productivity tools.[8][9] This shift aligned with the emerging industry adoption of XML 1.0, the W3C recommendation from 1998, which facilitated structured data exchange beyond proprietary binary formats like OLE (Object Linking and Embedding) used in earlier Office versions.
The first implementation arrived with Microsoft Office XP in May 2001, introducing SpreadsheetML as an XML-based format for Excel spreadsheets, allowing users to export and import data in a structured, editable XML schema to support custom integrations and web publishing.[10] Building on this, Office 2003 expanded XML support significantly, adding WordProcessingML for Word documents, DataDiagramingML for Visio diagrams, and XML form templates for the newly launched InfoPath application, all released as optional save formats to enable XML export alongside traditional binary files.[2][6][7] These formats represented a move toward open, standards-based document representation, with Microsoft publishing the Office 2003 XML reference schemas under an open, royalty-free license in November 2003 to encourage developer adoption and third-party tooling.[1] Further schema documentation and resources for these formats were made publicly available in 2005, enhancing accessibility for developers working with Office 2003 XML.[11]
However, with the release of Office 2007, these pre-OOXML formats were deprecated in favor of the new Office Open XML (OOXML) standard, which built upon but superseded the flat XML structures by introducing zipped packages for better efficiency and compatibility.[12] Despite this transition, Microsoft retained support for the legacy Office XML formats in subsequent versions, including compatibility modes for opening and converting files, extending through Microsoft 365 as of 2025 to accommodate existing documents without disrupting workflows.[13]
Core Formats
WordProcessingML
WordProcessingML, also known as WordML, is an XML-based format introduced with Microsoft Office Word 2003 that enables the saving of documents as hierarchical XML files, representing the structure, content, styles, and metadata of text-based documents. This format allows for the separation of content from presentation, facilitating easier integration with other systems and programmatic manipulation while maintaining compatibility with Word's native features for basic documents. Files saved in this format use the .xml extension and adhere to a predefined schema that organizes document elements in a tree-like structure.[1]
The core structure begins with the root element <w:wordDocument>, which encapsulates the entire document and declares necessary namespaces. Within this, the <w:body> element serves as the primary container for the document's content, including block-level elements such as paragraphs marked by <w:p>. Each paragraph can contain one or more runs of text via <w:r> elements, where individual text portions are held in <w:t> child elements, allowing for granular control over formatting and styling. This hierarchical approach mirrors the logical flow of a word processing document, prioritizing linear text organization over visual layout complexities.[14]
WordProcessingML supports essential rich text features, such as applying bold formatting with the <w:b> element and italics with the <w:i> element, both of which are nested within the run properties (<w:rPr>) to affect subsequent text without altering the entire paragraph. These elements enable straightforward markup for character-level styling, promoting accessibility for developers and tools that generate or parse documents. However, the format is limited to basic text and simple formatting, deliberately excluding advanced features like embedded objects, VBA macros, or intricate page layouts such as frames and floating elements to maintain a lightweight, XML-compliant representation suitable for data interchange.[15]
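The run-level markup described above can be generated programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `make_document` is a hypothetical helper that emits one paragraph and omits the `mso-application` processing instruction, document properties and styles that a file saved by Word would normally contain.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)  # serialize with the conventional w: prefix


def make_document(runs):
    """Build a one-paragraph document from (text, bold) tuples and return XML text."""
    doc = ET.Element(f"{{{W}}}wordDocument")
    body = ET.SubElement(doc, f"{{{W}}}body")
    para = ET.SubElement(body, f"{{{W}}}p")
    for text, bold in runs:
        run = ET.SubElement(para, f"{{{W}}}r")
        if bold:
            # Character formatting lives in the run properties element.
            props = ET.SubElement(run, f"{{{W}}}rPr")
            ET.SubElement(props, f"{{{W}}}b")
        ET.SubElement(run, f"{{{W}}}t").text = text
    return ET.tostring(doc, encoding="unicode")


xml_text = make_document([("The ", False), ("quick brown fox", True)])
print(xml_text)
```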
To ensure interoperability and prevent namespace collisions, WordProcessingML employs the prefix w: bound to the URI http://schemas.microsoft.com/office/word/2003/wordml, with additional namespaces for related vocabularies like VML for basic graphics. The full XML schema, including detailed element definitions and validation rules, has been publicly available for download from Microsoft's site since its release in 2003 under an open, royalty-free license, supporting widespread adoption in enterprise and development environments.[1]
SpreadsheetML
SpreadsheetML is an XML-based format introduced in Microsoft Excel XP and fully supported in Excel 2003, allowing spreadsheets to be saved as standalone .xml files that represent the workbook structure, data, and basic calculations in a human-readable manner.[16] This format enables data interchange with other applications by structuring worksheets hierarchically, with elements defining the overall workbook, individual sheets, rows, and cells, while supporting simple formulas for computations.[10] Unlike binary .xls files, SpreadsheetML uses XML schemas to ensure interoperability and extensibility, though it is limited to core spreadsheet features without advanced elements like pivot tables or charting.[17]
The root element of a SpreadsheetML document is <ss:Workbook>, which encapsulates the entire spreadsheet and contains one or more <ss:Worksheet> elements, each representing a sheet with a name attribute like ss:Name="Sheet1".[10] Within each worksheet, a <ss:Table> element serves as the container for data, holding <ss:Row> elements that define rows, optionally using the ss:Index attribute to specify non-sequential positioning (e.g., ss:Index="3" skips to the third row).[18] Each row then includes <ss:Cell> elements for individual cells, which can use ss:Index for column positioning (e.g., ss:Index="2" for column B) and contain child <ss:Data> elements for values, typed via attributes like ss:Type="String" or ss:Type="Number".[10] This sparse structure allows omission of empty cells or rows, optimizing the XML for efficiency.[17]
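Because rows and cells may be omitted, a reader has to track positions itself: an element without ss:Index follows its predecessor, and one with ss:Index jumps to that 1-based position. The sketch below illustrates this bookkeeping with Python's standard library; `sparse_cells` and the tiny `SAMPLE` workbook are hypothetical and written for illustration only.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}
INDEX = "{%s}Index" % SS  # namespaced attribute name for ss:Index

# Hypothetical workbook: row 2 and cell A3 are omitted.
SAMPLE = b"""<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">a</Data></Cell>
      </Row>
      <Row ss:Index="3">
        <Cell ss:Index="2"><Data ss:Type="String">x</Data></Cell>
        <Cell><Data ss:Type="String">y</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def sparse_cells(xml_bytes):
    """Map (row, column) positions, both 1-based, to the cell's text."""
    root = ET.fromstring(xml_bytes)
    cells = {}
    row_no = 0
    for row in root.iterfind(".//ss:Table/ss:Row", NS):
        # Without ss:Index the row follows the previous one.
        row_no = int(row.get(INDEX, row_no + 1))
        col_no = 0
        for cell in row.findall("ss:Cell", NS):
            col_no = int(cell.get(INDEX, col_no + 1))
            data = cell.find("ss:Data", NS)
            cells[(row_no, col_no)] = data.text if data is not None else None
    return cells


print(sparse_cells(SAMPLE))
```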
SpreadsheetML employs the primary namespace ss: (urn:schemas-microsoft-com:office:spreadsheet) for core elements, with additional namespaces like x: (urn:schemas-microsoft-com:office:excel) for Excel-specific extensions, ensuring consistent parsing across tools.[10] For formulas, a <ss:Cell> element carries an ss:Formula attribute written in R1C1 notation; for instance, ss:Formula="=SUM(R[-3]C:R[-1]C)" sums the three cells directly above the cell, with the last calculated result stored in the child <ss:Data> element.[18] However, this format does not include support for pivot tables, advanced charting, scenarios, objects, edit ranges, or password protection, focusing instead on tabular data and basic arithmetic.[10]
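R1C1 references, one of the notations mentioned above, are relative to the cell that holds the formula: `R[-2]C` means "two rows up, same column". The following sketch resolves a single R1C1 reference to an A1-style address; the helper names are hypothetical, and it handles only plain cell references, not ranges, sheet names or the `$` markers that distinguish absolute addresses.

```python
import re

# R or C may be bare (same row/column), absolute (R3), or relative (R[-2]).
_REF = re.compile(r"R(?:\[(-?\d+)\]|(\d+))?C(?:\[(-?\d+)\]|(\d+))?")


def _column_letters(col: int) -> str:
    """Convert a 1-based column number to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def _resolve(relative, absolute, current: int) -> int:
    if relative is not None:
        return current + int(relative)
    if absolute is not None:
        return int(absolute)
    return current


def r1c1_to_a1(ref: str, row: int, col: int) -> str:
    """Resolve one R1C1 reference as seen from the cell at (row, col)."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not an R1C1 reference: {ref!r}")
    rel_r, abs_r, rel_c, abs_c = match.groups()
    target_row = _resolve(rel_r, abs_r, row)
    target_col = _resolve(rel_c, abs_c, col)
    return f"{_column_letters(target_col)}{target_row}"


# Seen from cell B4, R[-2]C is B2 and R[-1]C is B3.
print(r1c1_to_a1("R[-2]C", 4, 2), r1c1_to_a1("R[-1]C", 4, 2))
```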
Extended Formats
DataDiagramingML for Visio
DataDiagramingML is the XML-based format used by Microsoft Visio to represent vector-based diagrams, introduced with Visio 2003 to enable programmatic manipulation and interoperability of diagram files. This format supports the creation, storage, and exchange of drawings, stencils, and templates through plain-text XML files, with extensions including .vdx for drawings, .vsx for stencils, and .vtx for templates. Unlike binary formats, DataDiagramingML files are uncompressed, allowing direct editing with standard XML tools for easy parsing and modification.[19]
At the core of a DataDiagramingML document is the root element <VisioDocument>, which encapsulates the entire diagram structure, including pages, masters, and global properties. Within this root, the <Shapes> section defines individual diagram objects via <Shape> tags, each requiring a unique ID attribute to identify geometric and textual elements. Relationships between shapes are represented in the <Connects> section, where <Connect> elements specify connections, such as lines linking boxes in an organization chart, using attributes like FromSheet and ToSheet to reference connected shapes.[20][21]
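The <Connects> section can be read as a graph. The sketch below is illustrative only: the `SAMPLE` fragment is a hypothetical, heavily simplified drawing in which shape 3 is a connector glued to shapes 1 and 2, the `FromCell` values are assumed to begin with "Begin" or "End" to mark the connector's two ends, and `connector_edges` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

NS = {"v": "http://schemas.microsoft.com/visio/2003/core"}

# Hypothetical, simplified drawing: connector 3 joins shape 1 to shape 2.
SAMPLE = b"""<VisioDocument xmlns="http://schemas.microsoft.com/visio/2003/core">
  <Pages>
    <Page ID="0">
      <Shapes>
        <Shape ID="1"/>
        <Shape ID="2"/>
        <Shape ID="3"/>
      </Shapes>
      <Connects>
        <Connect FromSheet="3" FromCell="BeginX" ToSheet="1" ToCell="PinX"/>
        <Connect FromSheet="3" FromCell="EndX" ToSheet="2" ToCell="PinX"/>
      </Connects>
    </Page>
  </Pages>
</VisioDocument>"""


def connector_edges(xml_bytes):
    """Return (from_shape, to_shape) pairs, one per fully glued connector."""
    root = ET.fromstring(xml_bytes)
    ends = {}
    for connect in root.iterfind(".//v:Connects/v:Connect", NS):
        connector = int(connect.get("FromSheet"))
        # Assumption: FromCell starts with "Begin" or "End" to mark the side.
        side = "begin" if connect.get("FromCell", "").startswith("Begin") else "end"
        ends.setdefault(connector, {})[side] = int(connect.get("ToSheet"))
    return [
        (sides["begin"], sides["end"])
        for _, sides in sorted(ends.items())
        if "begin" in sides and "end" in sides
    ]


print(connector_edges(SAMPLE))
```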
The format supports advanced diagramming features through attributes and sections within <Shape> elements, including layers for organizing shapes, colors via cells like FillForegnd in the Fill section, and text formatting with properties such as Text. These are defined in ShapeSheet sections represented as nested XML elements, enabling precise control over visual properties like fill patterns and font styles. For example, a shape's fill can be set using <Fill><FillForegnd>#FF0000</FillForegnd></Fill> to apply a red foreground color.[22]
DataDiagramingML facilitates integration with external data sources through XML data islands, allowing shapes to link dynamically to embedded or referenced XML datasets for data-driven diagrams. This uses the schema namespace http://schemas.microsoft.com/visio/2003/core, with diagram-specific elements such as <Shapes> and <Connects>, enabling automated updates when underlying data changes.[23]
InfoPath XML Forms
InfoPath XML forms, introduced with Microsoft Office InfoPath in 2003, utilize .xsn files as form templates for creating electronic forms that capture structured data. These .xsn files are cabinet (CAB)-compressed packages containing a collection of XML files, including a manifest (.xsf), schema definitions (.xsd), and view transformations (.xsl), which together define the data structure, user interface, and behavior of the forms.[24][25]
The manifest file serves as the central component, with a root element of <xsf:xDocumentClass> that ties together all form elements, specifying properties such as views, data sources, rules, and extensions to enable cohesive form functionality.[25]
Key to data integrity in these forms are the .xsd schema files, which provide structural and validation rules for the underlying XML data entered by users. Each form template includes at least one primary .xsd file designated as the root schema, ensuring that input conforms to predefined constraints during editing and submission.[26] Views are rendered using .xsl files based on XSLT, allowing for customizable presentations of the form data while maintaining separation from the core XML structure.[24] Additionally, InfoPath supports XPath expressions throughout the form definition for dynamic behaviors, such as binding controls to data nodes, conditional formatting, and rule-based validations that respond to user interactions.[26][25]
The CAB-compressed nature of .xsn packages facilitates easy distribution and deployment, with the manifest orchestrating components like initial data templates (.xml) and resource files for a complete form experience. Forms support role-based editing through defined roles in the manifest, allowing permissions to restrict or enable modifications based on user context, and digital signatures via signed data blocks to ensure data authenticity and integrity.[24][25] For broader use, these XML forms are deployable to SharePoint for workflow integration, where they can connect to lists, web services, and other data sources to automate business processes.[27][25]
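Since an .xsn template is a Cabinet file, a quick sanity check is to look for the Cabinet signature, the four ASCII bytes `MSCF`, at the start of the file. The sketch below shows only that check; `looks_like_xsn` is a hypothetical helper, and unpacking the archive is left out because Python's standard library does not include a Cabinet reader.

```python
def looks_like_xsn(data: bytes) -> bool:
    """Return True if the bytes begin with the Cabinet signature 'MSCF'."""
    return data[:4] == b"MSCF"


def file_looks_like_xsn(path: str) -> bool:
    """Read the first four bytes of a file and apply the same check."""
    with open(path, "rb") as handle:
        return looks_like_xsn(handle.read(4))


print(looks_like_xsn(b"MSCF\x00\x00\x00\x00"))
```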
Technical Specifications
XML Schema Design
The Microsoft Office XML formats rely on the W3C XML Schema Definition (XSD) language to define the permissible structure, data types, and constraints for each document format, enabling rigorous validation and consistent parsing across applications. This design choice, based on the XML Schema 1.0 recommendation, allows for precise specification of document components, such as text runs in WordProcessingML or cell values in SpreadsheetML, ensuring that XML instances conform to expected patterns for interoperability.[2]
Central to the schema design are element hierarchies that reflect the nested organization of office documents, where child elements like <p> (paragraph) are contained within parent elements such as <body> to represent logical flow. Attribute restrictions enforce data integrity, for instance through enumerated values for properties like text alignment (e.g., "left", "right", "center", or "justify" in WordProcessingML style definitions) or required string patterns for identifiers. Namespaces modularize the schemas by partitioning definitions: core namespaces like http://schemas.microsoft.com/office/word/2003/wordml handle primary elements, while separate ones address extensions, preventing conflicts and promoting reusability across formats.[2]
Microsoft published these schemas in 2003 under a royalty-free license, making them available as downloadable .xsd files via the Office 2003 XML Reference Schemas package, which developers could integrate for custom validation and transformation. For WordProcessingML, the package comprises multiple interconnected .xsd files to comprehensively cover document facets, from basic markup to advanced features like revision tracking.[16]
The schemas incorporate extensibility through xsd:anyType elements, permitting the inclusion of custom XML content within defined boundaries to support specialized applications without altering core definitions. Versioning is managed via attributes like rsid (revision save ID), which uniquely identify modifications to elements or styles, thereby preventing breakage during updates and enabling tools to detect and merge changes across document versions.[2]
Document Structure and Elements
Microsoft Office XML formats, such as WordprocessingML and SpreadsheetML, employ a monolithic structure where the entire document is represented within a single XML file, lacking the multi-part packaging found in later standards.[2][10] This linear hierarchy begins with an XML declaration specifying version 1.0 and UTF-8 encoding, followed by a processing instruction identifying the application (e.g., <?mso-application progid="Word.Document"?> for WordprocessingML), and declarations for namespaces and referenced schemas in the header section.[2] The body contains the core content, while metadata such as document properties (e.g., author, title) appears in dedicated elements like <o:DocumentProperties>, often placed near the root or end of the file.[2] This design ensures self-containment but can introduce redundancy, as shared elements like styles must be fully defined or repeated without external linking.[16]
At the core of this organization are common hierarchical patterns across the formats: a root element encapsulates the document, child nodes represent structural divisions, and leaf nodes hold atomic data. In WordprocessingML, the root <w:wordDocument> (from namespace w at http://schemas.microsoft.com/office/word/2003/wordml) contains child elements like <w:body> for the main content, which in turn holds sections, paragraphs (<w:p>), and runs (<w:r>); leaf nodes such as <w:t> store text values.[2] Similarly, SpreadsheetML uses <ss:Workbook> (namespace ss at urn:schemas-microsoft-com:office:spreadsheet) as the root, with child nodes like <ss:Worksheet>, <ss:Table>, and <ss:Row> organizing sheets and rows, culminating in <ss:Cell> elements whose <ss:Data> children hold the values.[10] These patterns enforce a tree-like descent from high-level containers to granular details, promoting readability while adhering to XML schema rules for validation.
A key concept in these formats is the distinction between inline and referenced elements to manage redundancy in the single-file model. For instance, styles are typically defined once in a central <w:styles> or <ss:Styles> section at the document level, then applied by reference, for example through a <w:pStyle> element or an ss:StyleID attribute that names the style's unique ID, rather than repeating full definitions inline for each occurrence.[2][10] Inline formatting, such as bold via <w:b/>, can override references for specific instances. This approach, combined with namespace prefixes (e.g., w: for WordprocessingML, ss: for SpreadsheetML), enables efficient reuse without internal hyperlinks or parts, though it relies on schema enforcement to maintain integrity. The absence of relational mechanisms means all content and metadata coexist linearly, facilitating simple parsing but potentially increasing file size for complex documents.[16]
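Resolving such a reference means looking up the style ID in the central section. The sketch below does this for SpreadsheetML with Python's standard library; `bold_cell_values` is a hypothetical helper, and the `SAMPLE` workbook, including its style `s1` with a bold font, is invented for illustration.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Hypothetical workbook: style s1 is defined once and referenced by one cell.
SAMPLE = b"""<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Styles>
    <Style ss:ID="s1"><Font ss:Bold="1"/></Style>
  </Styles>
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell ss:StyleID="s1"><Data ss:Type="String">Total</Data></Cell>
        <Cell><Data ss:Type="Number">30</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def bold_cell_values(xml_bytes):
    """Return the text of every cell whose referenced style has a bold font."""
    root = ET.fromstring(xml_bytes)
    # First pass: collect the IDs of styles that switch bold on.
    bold_ids = set()
    for style in root.iterfind("ss:Styles/ss:Style", NS):
        font = style.find("ss:Font", NS)
        if font is not None and font.get(f"{{{SS}}}Bold") == "1":
            bold_ids.add(style.get(f"{{{SS}}}ID"))
    # Second pass: follow each cell's ss:StyleID reference.
    values = []
    for cell in root.iterfind(".//ss:Cell", NS):
        if cell.get(f"{{{SS}}}StyleID") in bold_ids:
            data = cell.find("ss:Data", NS)
            values.append(data.text if data is not None else None)
    return values


print(bold_cell_values(SAMPLE))
```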
Practical Examples
WordProcessingML Sample
WordProcessingML documents are structured using a root <w:wordDocument> element that encapsulates the entire content, including metadata and the document body. A basic example illustrates this by including document properties such as title and author, followed by a body containing paragraphs with runs of text and inline formatting. WordProcessingML elements use the prefix w:, bound to the namespace http://schemas.microsoft.com/office/word/2003/wordml, ensuring compatibility with the schema developed by Microsoft for Office 2003.[2][28]
The following XML snippet represents a simple WordProcessingML document titled "Sample Document," authored by an example user, with a heading paragraph and a body paragraph that applies bold formatting to one phrase:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<o:DocumentProperties>
<o:Title>Sample Document</o:Title>
<o:Author>Example Author</o:Author>
<o:LastAuthor>Example Author</o:LastAuthor>
<o:Created>2025-11-11T00:00:00Z</o:Created>
</o:DocumentProperties>
<w:body>
<w:p>
<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
<w:r><w:t>Sample Document</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>The </w:t></w:r>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t>quick brown fox</w:t>
</w:r>
<w:r><w:t> jumps over the lazy dog.</w:t></w:r>
</w:p>
<w:sectPr/>
</w:body>
</w:wordDocument>
The example uses the <w:body> element to contain paragraphs (<w:p>), where each paragraph consists of runs (<w:r>) that hold text (<w:t>) and properties like bold (<w:b> within <w:rPr>).[2] The structure highlights the format's simplicity, allowing the file to be saved with a .xml extension and opened directly in Microsoft Word 2003 or later versions, as well as in any standard text editor for manual inspection or programmatic parsing.[2]
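The paragraph, run and text hierarchy of the sample can be read back with a few lines of code. The following sketch uses Python's standard library; `paragraph_runs` is a hypothetical helper, `SAMPLE` is a condensed copy of the sample's body, and the bold test treats any <w:b> element as "on", ignoring a possible w:val attribute that switches it off.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.microsoft.com/office/word/2003/wordml"}

# Condensed copy of the sample document's body.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Sample Document</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>The </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>quick brown fox</w:t></w:r>
      <w:r><w:t> jumps over the lazy dog.</w:t></w:r>
    </w:p>
    <w:sectPr/>
  </w:body>
</w:wordDocument>"""


def paragraph_runs(xml_bytes):
    """Return one list per paragraph, each holding (text, is_bold) per run."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iterfind(".//w:p", NS):
        runs = []
        for run in para.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            # Simplification: any w:b element counts as bold.
            bold = run.find("w:rPr/w:b", NS) is not None
            runs.append((text, bold))
        paragraphs.append(runs)
    return paragraphs


for runs in paragraph_runs(SAMPLE):
    print(runs)
```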
To ensure correctness, this XML can be validated against the official WordProcessingML schema using tools like XMLSpy, which supports schema association and error reporting for XML documents. The schema files are available from Microsoft's official repository, enabling verification of element compliance and namespace usage.[28]
SpreadsheetML Sample
SpreadsheetML provides a structured way to represent Excel worksheets in XML, enabling programmatic creation and manipulation of spreadsheet data. The following example demonstrates a basic workbook with a single worksheet containing a table of named items and their values, culminating in a summation formula. This illustrates core elements such as rows, cells with data types, and formula computation, using the standard namespace for SpreadsheetML.[10]
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Sheet1">
<ss:Table>
<ss:Row>
<ss:Cell><ss:Data ss:Type="String">Name</ss:Data></ss:Cell>
<ss:Cell><ss:Data ss:Type="String">Value</ss:Data></ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell><ss:Data ss:Type="String">Item1</ss:Data></ss:Cell>
<ss:Cell ss:StyleID="s1"><ss:Data ss:Type="Number">10</ss:Data></ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell><ss:Data ss:Type="String">Item2</ss:Data></ss:Cell>
<ss:Cell ss:StyleID="s1"><ss:Data ss:Type="Number">20</ss:Data></ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell><ss:Data ss:Type="String">Total</ss:Data></ss:Cell>
<ss:Cell ss:Formula="=SUM(R[-2]C:R[-1]C)" ss:StyleID="s1">
<ss:Data ss:Type="Number">30</ss:Data>
</ss:Cell>
</ss:Row>
</ss:Table>
<ss:Names>
<ss:NamedRange ss:Name="SumRange" ss:RefersTo="=Sheet1!R2C2:R3C2"/>
</ss:Names>
</ss:Worksheet>
</ss:Workbook>
The <ss:Workbook> element binds the ss prefix to the primary namespace urn:schemas-microsoft-com:office:spreadsheet, which defines elements like <ss:Worksheet>, <ss:Table>, <ss:Row>, and <ss:Cell>.[10] Each <ss:Cell> holds its value in <ss:Data>, whose ss:Type attribute declares the data type, such as "String" for text headers like "Name" and "Value", or "Number" for numeric values like 10 and 20.[10] The ss:StyleID attribute references a style by ID (here "s1", for number formatting such as decimals or currency); the style itself must be defined in the <ss:Styles> section of a full document, which this excerpt omits.[10]
The total cell carries an ss:Formula attribute (an attribute of <ss:Cell>, not a separate element) that sums the two value cells, B2 and B3 in A1 terms, and caches the result in <ss:Data ss:Type="Number">30</ss:Data>.[10] SpreadsheetML persists formulas and references in R1C1 notation, so the sum is written =SUM(R[-2]C:R[-1]C), relative to the formula cell, rather than =SUM(B2:B3). A named range "SumRange" is defined in <ss:Names>, referring to the same cells as =Sheet1!R2C2:R3C2, for reuse in formulas such as =SUM(SumRange).[10] This example represents a fundamental table structure suitable for simple data tracking; as rows are added, the file grows roughly linearly because the text is stored uncompressed, unlike the zipped packages of Office Open XML.[10][29]
To extract data from this XML, an XPath or XQuery expression such as //ss:Data[@ss:Type="Number"]/text() retrieves all numeric cell values, provided the ss prefix is bound to the SpreadsheetML namespace in the query context.[10]
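The same selection can be sketched in Python with the standard library. This is an illustrative example rather than a reference implementation: `numeric_values` is a hypothetical helper, the embedded workbook is a reduced copy of the sample, and the type test is done in Python code instead of an XPath predicate to keep the namespace handling explicit.

```python
import xml.etree.ElementTree as ET

# Namespace URN taken from the sample workbook.
SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"

# Reduced copy of the sample: two value rows and a total row.
SHEET = """<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Sheet1">
    <ss:Table>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Item1</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">10</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Item2</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">20</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Total</ss:Data></ss:Cell>
        <ss:Cell ss:Formula="=SUM(R[-2]C:R[-1]C)">
          <ss:Data ss:Type="Number">30</ss:Data>
        </ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>"""


def numeric_values(xml_text):
    """Return the content of every <ss:Data ss:Type="Number"> as a float."""
    root = ET.fromstring(xml_text)
    data_tag = f"{{{SS_NS}}}Data"   # ElementTree spells names as {uri}local
    type_attr = f"{{{SS_NS}}}Type"
    return [
        float(data.text)
        for data in root.iter(data_tag)
        if data.get(type_attr) == "Number"
    ]


values = numeric_values(SHEET)
print(values)
```

Note that the total cell contributes its cached value: the parser reads the stored <ss:Data> content and does not evaluate the formula.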
Limitations and Compatibility
Inherent Constraints
The Microsoft Office XML formats, introduced with Office XP (SpreadsheetML) and Office 2003 (WordProcessingML), store each document as one monolithic XML file. Unlike later packaged formats, they have no built-in compression such as ZIP archiving, so files are significantly larger.[30] This design simplifies basic data interchange but makes storage and transmission less efficient, particularly for documents with repetitive or verbose XML markup.[31]
Fidelity to the binary Office formats is also incomplete, and the gaps differ by format. SpreadsheetML does not retain Visual Basic for Applications (VBA) projects, charts, or drawing layers; these are discarded when a workbook is saved as XML.[32] WordProcessingML, by contrast, can carry embedded items such as pictures, but only as binary-encoded blocks inside the XML text. These omissions stem from the formats' focus on structured data representation rather than full fidelity to binary Office features.
The absence of package relationships compounds these issues: a single-file document has no mechanism for declaring dependencies between parts or for updating one part in isolation, so manual edits are error-prone and risk structural corruption.[30] Because the container is an XML text document, non-textual content can appear only as encoded text inline, which precludes seamless integration of multimedia or proprietary Office objects.[33] Practical file size is bounded by the XML parser and by the 32-bit architecture of Office 2003, which caps documents at approximately 2 GB; beyond that, parsing errors or memory exhaustion occur during load or save.[34] Combined with the lack of compression, this often makes large datasets less efficient to handle than in the binary formats.[31]
Backward and Forward Compatibility
Backward compatibility with versions that predate each format is limited. Word 2002 (Office XP) and earlier have no native support for WordProcessingML, and releases before Office XP cannot read SpreadsheetML, so users of those versions need third-party viewers or converters, often with partial loss of advanced formatting and features.[35][2]
Forward compatibility is generally stronger: Office 2007 and later versions open legacy Office XML files natively, in compatibility mode, to preserve core functionality. On saving, however, files are typically converted to the newer Office Open XML (OOXML) format, which may introduce minor adjustments to align with modern standards. InfoPath XML forms remained supported through InfoPath 2013, the final version of the product; Microsoft announced its deprecation in 2014, and extended support ends on July 14, 2026.[13][36][37]
As of 2025, current Microsoft Office applications retain legacy support for the Office XML formats through compatibility packs and modes, enabling basic opening and editing. Visio 2003 XML drawings (VDX), however, are not fully renderable in modern Visio without conversion to an updated format such as VSDX, because schema differences mean that some legacy elements may not display or function correctly.[38][39]
For compatibility with even older systems, Office 2003 offers save-as options that export XML documents to the binary formats (for example .doc or .xls). Round-tripping between the XML and binary representations carries a risk of data loss or formatting inconsistencies, particularly in complex documents where not every XML-specific feature maps cleanly back to the binary structure.[2]
Comparison with Office Open XML
Architectural Differences
The legacy Microsoft Office XML formats of Office XP and Office 2003 use a flat, uncompressed structure: each document is a single, monolithic XML file, such as WordML for Word documents or SpreadsheetML for Excel spreadsheets.[2][13] The file holds all document content, properties, and formatting in one self-contained XML instance, with no internal directory organization, folder hierarchy, or explicit relationship mechanism between components.[40] This design simplifies basic parsing but limits scalability for complex documents, as there are no provisions for modular parts or external references beyond inline embedding.[41]
In contrast, Office Open XML (OOXML), the default format from Office 2007 onward, uses a zipped package model based on the Open Packaging Conventions (OPC): a document is a compressed ZIP archive containing multiple discrete XML parts organized in virtual folders.[42][43] For instance, a Word document's core content resides in /word/document.xml, while supporting elements like styles, headers, and images are stored in separate parts such as /word/styles.xml or under /word/media/.[44] Relationships between parts are recorded in .rels files inside _rels folders, which define hyperlinks, embeddings, and dependencies using a standardized XML syntax.[45] Additionally, the root [Content_Types].xml file specifies the content (MIME) type of every part, enabling precise identification and handling of diverse content within the package.[46]
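The package layout described above can be sketched with Python's standard zipfile module. This is a minimal illustration under stated assumptions, not a complete document writer: `build_package` is a hypothetical helper, the package holds only the three parts named in the text, and no claim is made here about how any particular Office version handles such a bare package.

```python
import io
import zipfile

XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'

# Declares the content type of each part in the package.
CONTENT_TYPES = XML_DECL + """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Package-level relationship pointing at the main document part.
ROOT_RELS = XML_DECL + """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# The main document part, with a single paragraph.
DOCUMENT = XML_DECL + """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
</w:document>"""


def build_package():
    """Assemble the three parts into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", ROOT_RELS)
        archive.writestr("word/document.xml", DOCUMENT)
    return buffer.getvalue()


package = build_package()
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    parts = archive.namelist()
print(parts)
```

Each part can be read or replaced on its own, which is the modularity that the single-file legacy formats lack.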
This architectural evolution from a flat, single-file XML approach to a modular ZIP-based package in 2007 addressed key limitations of the legacy formats, particularly in file size and maintainability.[47] For the same content, legacy XML files are typically much larger than equivalent OOXML files, often several times the size, because they are uncompressed and embed every element in one file.[48][42] The package model enhances modularity by allowing independent editing of parts, facilitates partial repairs without affecting the entire document, and supports advanced tools for validation and interoperability.[49][50]
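The effect of compression on verbose, repetitive markup can be illustrated with a short sketch. The data here is synthetic (generated SpreadsheetML-style rows, with the hypothetical helper `spreadsheet_rows`), and zlib's DEFLATE stands in for the compression a ZIP package would apply, so the result indicates the general effect only and is not a measurement of real Office files.

```python
import zlib


def spreadsheet_rows(count):
    """Generate `count` SpreadsheetML-style rows as UTF-8 bytes (synthetic data)."""
    row = (
        '<ss:Row><ss:Cell><ss:Data ss:Type="String">Item{0}</ss:Data></ss:Cell>'
        '<ss:Cell><ss:Data ss:Type="Number">{0}</ss:Data></ss:Cell></ss:Row>\n'
    )
    return "".join(row.format(i) for i in range(count)).encode("utf-8")


raw = spreadsheet_rows(1000)
packed = zlib.compress(raw)  # DEFLATE, the method ZIP archives commonly use
ratio = len(raw) / len(packed)

print(f"{len(raw)} bytes uncompressed, {len(packed)} bytes deflated")
print(f"uncompressed is about {ratio:.1f}x larger")
```

The repeated element and attribute names are what compress well, which is why the gap tends to widen as markup becomes more repetitive.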
