Compound File Binary Format
Compound File Binary Format (CFBF), also called Compound File, Compound Document format,[1] or Composite Document File V2[2] (CDF), is a compound document file format for storing numerous files and streams within a single file on a disk. CFBF was developed by Microsoft and is an implementation of Microsoft COM Structured Storage.[3][4][5] The file format is used for storing storage objects and stream objects in a hierarchical structure within a single file.[6]
Microsoft has opened the format for use by others and it is now used in a variety of programs from Microsoft Word and Microsoft Access to Business Objects.[citation needed] It also forms the basis of the Advanced Authoring Format.[7]
Overview
At its simplest, the Compound File Binary Format is a container, with little restriction on what can be stored within it.
A CFBF file structure loosely resembles a FAT file system. The file is partitioned into Sectors, which are chained together by a File Allocation Table (not to be confused with the file system of the same name) that records the chain of sectors belonging to each file. A Directory holds information about the contained files, including a Sector ID (SID) for the starting sector of each chain.
Structure
The CFBF file consists of a 512-byte header record followed by a number of Sectors whose size is defined in the header. The literature defines Sectors to be either 512 or 4096 bytes in length, although the format is potentially capable of supporting sectors ranging in size from 128 bytes upwards, in powers of two (128, 256, 512, 1024, etc.). The lower limit of 128 is the minimum required to fit a single directory entry in a Directory Sector.
There are several types of sector that may be present in a CFBF file:
- File Allocation Table (FAT) Sector – contains chains of sector indices much as a FAT does in the FAT/FAT32 filesystems
- MiniFAT Sectors – similar to the FAT but storing chains of mini-sectors within the Mini-Stream
- Double-Indirect FAT (DIFAT) Sector – contains chains of FAT sector indices
- Directory Sector – contains directory entries
- Stream Sector – contains arbitrary file data
- Range Lock Sector – contains the byte-range locking area of a large file
More detail is given below for the header and each sector type.
CFBF header format
The CFBF header occupies the first 512 bytes of the file and contains the information required to interpret the rest of the file. The C-style structure declaration below (extracted from the AAFA's Low-Level Container Specification) shows the members of the CFBF header and their purpose:
#include <stdint.h> // fixed-width types: plain `unsigned long` is 8 bytes on LP64 platforms
typedef uint32_t ULONG; // 4 bytes
typedef uint16_t USHORT; // 2 bytes
typedef int16_t OFFSET; // 2 bytes
typedef ULONG SECT; // 4 bytes
typedef ULONG FSINDEX; // 4 bytes
typedef USHORT FSOFFSET; // 2 bytes
typedef USHORT WCHAR; // 2 bytes
typedef ULONG DFSIGNATURE; // 4 bytes
typedef uint8_t BYTE; // 1 byte
typedef uint16_t WORD; // 2 bytes
typedef uint32_t DWORD; // 4 bytes
typedef ULONG SID; // 4 bytes
typedef struct { DWORD Data1; WORD Data2; WORD Data3; BYTE Data4[8]; } GUID; // 16 bytes
typedef GUID CLSID; // 16 bytes
struct StructuredStorageHeader { // [offset from start (bytes), length (bytes)]
BYTE _abSig[8]; // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
// 0x1a, 0xe1} for current version
CLSID _clsid; // [08H,16] reserved must be zero (WriteClassStg/
// GetClassFile uses root directory class id)
USHORT _uMinorVersion; // [18H,02] minor version of the format: 33 is
// written by reference implementation
USHORT _uDllVersion; // [1AH,02] major version of the dll/format: 3 for
// 512-byte sectors, 4 for 4 KB sectors
USHORT _uByteOrder; // [1CH,02] 0xFFFE: indicates Intel byte-ordering
USHORT _uSectorShift; // [1EH,02] size of sectors in power-of-two;
// typically 9 indicating 512-byte sectors
USHORT _uMiniSectorShift; // [20H,02] size of mini-sectors in power-of-two;
// typically 6 indicating 64-byte mini-sectors
USHORT _usReserved; // [22H,02] reserved, must be zero
ULONG _ulReserved1; // [24H,04] reserved, must be zero
FSINDEX _csectDir; // [28H,04] must be zero for 512-byte sectors,
// number of SECTs in directory chain for 4 KB
// sectors
FSINDEX _csectFat; // [2CH,04] number of SECTs in the FAT chain
SECT _sectDirStart; // [30H,04] first SECT in the directory chain
DFSIGNATURE _signature; // [34H,04] signature used for transactions; must
// be zero. The reference implementation
// does not support transactions
ULONG _ulMiniSectorCutoff; // [38H,04] maximum size for a mini stream;
// typically 4096 bytes
SECT _sectMiniFatStart; // [3CH,04] first SECT in the MiniFAT chain
FSINDEX _csectMiniFat; // [40H,04] number of SECTs in the MiniFAT chain
SECT _sectDifStart; // [44H,04] first SECT in the DIFAT chain
FSINDEX _csectDif; // [48H,04] number of SECTs in the DIFAT chain
SECT _sectFat[109]; // [4CH,436] the SECTs of first 109 FAT sectors
};
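As a rough illustration of how a reader might use these fields, the sketch below checks the signature, the byte-order mark, and the pairing of major version and sector shift noted in the comments above. The names `cfb_le16` and `cfb_header_looks_valid` are hypothetical, and rejecting every version other than 3 and 4 is an assumption of this sketch rather than something the structure itself requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Signature bytes taken from the _abSig comment in the structure above.
static const uint8_t CFB_SIGNATURE[8] = {
    0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1
};

// Hypothetical helper: read a little-endian 16-bit value.
static uint16_t cfb_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

// Returns 1 when buf looks like a CFBF header, 0 otherwise.
// Only the signature, byte order, and version/sector-shift pairing are checked.
int cfb_header_looks_valid(const uint8_t *buf, size_t len) {
    assert(buf != NULL);
    if (len < 512) return 0;                            // header is 512 bytes
    if (memcmp(buf, CFB_SIGNATURE, sizeof CFB_SIGNATURE) != 0) return 0;
    if (cfb_le16(buf + 0x1C) != 0xFFFE) return 0;       // _uByteOrder
    uint16_t major = cfb_le16(buf + 0x1A);              // _uDllVersion
    uint16_t shift = cfb_le16(buf + 0x1E);              // _uSectorShift
    if (major == 3) return shift == 9;                  // 512-byte sectors
    if (major == 4) return shift == 12;                 // 4 KB sectors
    return 0;                                           // assumption: reject others
}
```

Reading the fields byte by byte, rather than casting the buffer to the structure, keeps the check independent of host byte order and structure padding.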
File Allocation Table (FAT) sectors
When taken together as a single stream, the collection of FAT sectors defines the status and linkage of every sector in the file. Each entry in the FAT is 4 bytes in length and contains the sector number of the next sector in a FAT chain or one of the following special values:
- FREESECT (0xFFFFFFFF) – denotes an unused sector
- ENDOFCHAIN (0xFFFFFFFE) – marks the last sector in a FAT chain
- FATSECT (0xFFFFFFFD) – marks a sector used to store part of the FAT
- DIFSECT (0xFFFFFFFC) – marks a sector used to store part of the DIFAT
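The special values can be told apart from ordinary "next sector" links with a simple switch. The following is a minimal sketch; the enum and function names are invented for illustration, and any value not in the list above is treated as a link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Special FAT values from the list above.
#define CFB_FREESECT   0xFFFFFFFFu
#define CFB_ENDOFCHAIN 0xFFFFFFFEu
#define CFB_FATSECT    0xFFFFFFFDu
#define CFB_DIFSECT    0xFFFFFFFCu

typedef enum {
    CFB_KIND_NEXT,   // entry holds the number of the next sector in a chain
    CFB_KIND_FREE,   // unused sector
    CFB_KIND_END,    // last sector of a chain
    CFB_KIND_FAT,    // sector stores part of the FAT
    CFB_KIND_DIFAT   // sector stores part of the DIFAT
} cfb_kind;

// Classifies the FAT entry that describes `sector`.
cfb_kind cfb_sector_kind(const uint32_t *fat, size_t count, uint32_t sector) {
    assert(fat != NULL && sector < count);
    switch (fat[sector]) {
    case CFB_FREESECT:   return CFB_KIND_FREE;
    case CFB_ENDOFCHAIN: return CFB_KIND_END;
    case CFB_FATSECT:    return CFB_KIND_FAT;
    case CFB_DIFSECT:    return CFB_KIND_DIFAT;
    default:             return CFB_KIND_NEXT;
    }
}
```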
Range Lock Sector
The Range Lock Sector must exist in files greater than 2 GB in size, and must not exist in files smaller than 2 GB. The Range Lock Sector must contain the byte range 0x7FFFFF00 to 0x7FFFFFFF in the file. This area is reserved by Microsoft's COM implementation for storing byte-range locking information for concurrent access.
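To make the byte range concrete, the sketch below computes which sector number would cover it. It rests on two assumptions not stated in this section: that "2 GB" means 2^31 bytes, and that sector n starts at file offset (n + 1) × sector size because the header occupies the first sector-sized slot. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CFB_RANGE_LOCK_FIRST 0x7FFFFF00u  // first byte of the locking area
#define CFB_RANGE_LOCK_LAST  0x7FFFFFFFu  // last byte of the locking area

// 1 when a file of this size must contain the Range Lock Sector
// (assumption: the 2 GB threshold is 2^31 bytes).
int cfb_requires_range_lock(uint64_t file_size) {
    return file_size > (uint64_t)CFB_RANGE_LOCK_LAST + 1u;
}

// Number of the sector whose bytes include the locking area
// (assumption: sector n begins at offset (n + 1) << sector_shift).
uint32_t cfb_range_lock_sector(unsigned sector_shift) {
    assert(sector_shift >= 9 && sector_shift <= 12);
    return (CFB_RANGE_LOCK_FIRST >> sector_shift) - 1u;
}
```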
Glossary
- FAT – File Allocation Table; also known as SAT – Sector Allocation Table
- DIFAT – Double-Indirect File Allocation Table
- FAT Chain – a group of FAT entries which indicate the Sectors allocated to a Stream in the file
- Stream – a virtual file which occupies a number of Sectors within the CFBF
- Sector – the unit of allocation within the CFBF, usually 512 or 4096 Bytes in length
References
- [1] "Apache POI – POIFS". POI Project. Archived from the original on 26 April 2011. Retrieved 10 May 2011.
- [2] "How to convert documents between LibreOffice and Microsoft Office file formats on Linux". Archived from the original on 21 September 2019. Retrieved 25 November 2016.
- [3] "Compound Files (Windows)". Microsoft Developers Network (MSDN) library – COM SDK. Microsoft Corporation. 20 November 2008. Retrieved 23 September 2009.
- [4] "Containers: Compound Files". Microsoft Developers Network (MSDN) library – Visual Studio 2008 documentation. Microsoft Corporation. Retrieved 23 September 2009.
- [5] "Understand Compound Files". Microsoft Developers Network (MSDN) library – ActiveDirectory Rights Management. 25 June 2009. Retrieved 23 September 2009.
- [6] "Microsoft Compound File Binary File Format, Version 4". www.loc.gov. 28 January 2020. Retrieved 13 June 2024.
- [7] AMW Association (formerly AAF Association). Archived 15 August 2000 at the Wayback Machine.
External links
- "[MS-CFB]: Compound File Binary File Format". Microsoft. Retrieved 6 July 2019.
- "Microsoft Compound Document File Format" (PDF). OpenOffice.org CFBF description. Retrieved 22 May 2006.
- "Advanced Authoring Format Low-Level Container Specification" (PDF). Microsoft Structured Storage version 3 specification (PDF). Archived from the original (PDF) on 9 August 2011. Retrieved 22 May 2006.
- "Microsoft Compound File Binary File Format, Version 3". Library of Congress, Digital Formats web site. Retrieved 6 July 2019.
Compound File Binary Format
Introduction
Overview
The Compound File Binary Format (CFBF) is a general-purpose file format that provides a file-system-like structure within a single file for the storage of arbitrary, application-specific streams of data.[4] It supports two primary object types: storages, which function like directories for hierarchical organization, and streams, which act as file-like containers for data.[5] CFBF emulates a simplified FAT file system by dividing the file into fixed-size sectors, typically 512 bytes or 4096 bytes, to enable efficient data management and access.[5] This structure allows multiple data types to be embedded within one file, facilitating modifications to individual components without requiring a full file rewrite, which is particularly useful for compound documents in applications like Microsoft Office.[1]
The minimum file size is three sectors: one for the header, one for the File Allocation Table (FAT), and one for the directory.[1] Originally developed as part of the OLE 2.0 structured storage system, CFBF has evolved into a standardized format documented in Microsoft's [MS-CFB] specification, with version 12.0 published in April 2024 (last revised October 2024).[4] It organizes data hierarchically via directory entries and chains sectors using allocation tables.[5]
History and Development
The Compound File Binary Format (CFBF), originally known as the Compound Document File format, was developed by Microsoft in the early 1990s as a core component of Object Linking and Embedding (OLE) 2.0, introduced to enable structured storage within a single file for compound documents in Windows applications.[6][3] This format provided a file-system-like hierarchy of storages and streams, drawing conceptual inspiration from the FAT12 and FAT16 file allocation mechanisms of earlier DOS and Windows file systems, but adapted to embed multiple data objects efficiently within one binary file.[3] Early beta implementations appeared in late 1992, supporting OLE 2.0's object model under the Component Object Model (COM), with the format's signature evolving to its current form by the mid-1990s.[3] CFBF became integral to structured storage in Microsoft Office applications starting in 1993, such as with Excel 5.0 and Word 6.0, and gained prominence with the release of Office 95, facilitating the embedding and linking of diverse data types in documents while maintaining compatibility with Windows 95 and NT operating systems.[1] The format's integration with COM in these platforms allowed for seamless interoperability across applications, marking a key milestone in Microsoft's push toward component-based software architecture. 
Major version 4 of CFBF supports 4096-byte sectors for handling larger files and improved performance.[4] Microsoft formalized and publicly documented CFBF through the Open Specifications program, beginning with the initial [MS-CFB] specification release on July 16, 2010 (version 1.0), which detailed the format for third-party interoperability.[4] Subsequent revisions addressed security enhancements, compatibility issues, and sector allocation refinements, with major updates including version 2.0 in October 2010, version 4.0 in November 2013, and the latest version 12.0 in April 2024.[4] Although not standardized by an international body like ISO, the format is maintained via Microsoft's Open Specifications, ensuring ongoing documentation and support for cross-platform use in various applications beyond Office.[4]
Core File Structure
File Header
The Compound File Header is a fixed 512-byte structure located at the beginning of every Compound File Binary Format (CFBF) file. It is the entry point for parsing: it identifies the file format, specifies sector sizes, and provides starting locations for key components like the directory chain, File Allocation Table (FAT), Mini FAT, and Double-Indirect FAT (DIFAT). For CFBF version 4 files, the header extends to 4,096 bytes with padding bytes beyond the first 512, but all functional fields reside in the initial portion.[7]
The header opens with an 8-byte signature at bytes 0–7, fixed as the hexadecimal sequence D0 CF 11 E0 A1 B1 1A E1, which identifies the file as adhering to the CFBF specification and lets applications detect the format before reading any other field.[7]
Subsequent fields define the file's structural parameters. At bytes 30–31, the Sector Shift field is a 16-bit unsigned integer indicating the base-2 logarithm of the sector size: a value of 9 corresponds to 512-byte sectors (common in version 3 files), while 12 denotes 4,096-byte sectors in version 4 files. Bytes 32–33 hold the Mini Sector Shift, fixed at 6 to specify 64-byte mini sectors used for small streams below a certain size threshold. The Number of Directory Sectors field (bytes 40–43) is a 32-bit unsigned integer counting the sectors allocated to the directory entry chain, which organizes the file's object hierarchy; in version 3 files, this field must be 0, and the length of the directory is found by following its sector chain. Similarly, bytes 44–47 contain the Number of FAT Sectors, a 32-bit value tallying the total sectors in the FAT, which records how sectors are linked into chains.[7]
Navigation to core components is facilitated by starting sector indicators. Bytes 48–51 specify the First Directory Sector Location as a 32-bit unsigned integer, pointing to the initial sector of the directory chain. Bytes 52–55 house the Transaction Signature, a 32-bit unsigned integer used for detecting concurrent modifications or transaction states in multi-user environments, though it is typically zero if transactions are not supported. For handling small streams, bytes 60–63 indicate the First Mini FAT Sector Location, and bytes 64–67 provide the Number of Mini FAT Sectors, both as 32-bit unsigned integers; these allocate a secondary FAT for streams under the mini stream cutoff size of 4,096 bytes. DIFAT management fields follow: bytes 68–71 denote the First DIFAT Sector Location, and bytes 72–75 the Number of DIFAT Sectors, each 32-bit unsigned integers that extend the FAT sector index beyond the header's capacity.[7]
The header concludes with an embedded DIFAT array at bytes 76–511 (436 bytes total), comprising the first 109 entries as 32-bit unsigned integers, each pointing to a FAT sector's location; this array bootstraps the double-indirect allocation mechanism for larger files, with additional DIFAT sectors referenced if needed. Reserved fields, such as bytes 34–39 and the Class ID at bytes 8–23 (all zeros), ensure alignment and future extensibility without altering the core structure.[7]
| Field Name | Byte Offset | Size (bytes) | Value/Description |
|---|---|---|---|
| Header Signature | 0–7 | 8 | Fixed: D0 CF 11 E0 A1 B1 1A E1 (format identifier) |
| Sector Shift | 30–31 | 2 | 9 (512-byte sectors) or 12 (4,096-byte sectors) |
| Mini Sector Shift | 32–33 | 2 | 6 (64-byte mini sectors) |
| Number of Directory Sectors | 40–43 | 4 | Count of sectors for directory chain |
| Number of FAT Sectors | 44–47 | 4 | Total FAT sectors in file |
| First Directory Sector Location | 48–51 | 4 | Starting sector for directory |
| Transaction Signature | 52–55 | 4 | Transaction sequence number (often 0) |
| First Mini FAT Sector Location | 60–63 | 4 | Starting sector for Mini FAT |
| Number of Mini FAT Sectors | 64–67 | 4 | Count of Mini FAT sectors |
| First DIFAT Sector Location | 68–71 | 4 | Starting sector for additional DIFAT |
| Number of DIFAT Sectors | 72–75 | 4 | Count of DIFAT sectors |
| DIFAT Array | 76–511 | 436 | First 109 FAT sector pointers |
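The offsets in the table translate directly into little-endian reads. The sketch below decodes the listed fields from a raw 512-byte buffer; the struct, its field names, and `cfb_decode_header` are hypothetical, and no validation beyond a length check is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFB_HEADER_SIZE        512u
#define CFB_HEADER_DIFAT_COUNT 109u

// Hypothetical container for the fields listed in the table.
typedef struct {
    uint16_t sector_shift;           // bytes 30-31
    uint16_t mini_sector_shift;      // bytes 32-33
    uint32_t dir_sector_count;       // bytes 40-43
    uint32_t fat_sector_count;       // bytes 44-47
    uint32_t first_dir_sector;       // bytes 48-51
    uint32_t transaction_signature;  // bytes 52-55
    uint32_t first_mini_fat_sector;  // bytes 60-63
    uint32_t mini_fat_sector_count;  // bytes 64-67
    uint32_t first_difat_sector;     // bytes 68-71
    uint32_t difat_sector_count;     // bytes 72-75
    uint32_t difat[CFB_HEADER_DIFAT_COUNT]; // bytes 76-511
} cfb_header_fields;

static uint16_t hdr_u16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t hdr_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Returns 0 on success, -1 when the buffer is too short to hold a header.
int cfb_decode_header(const uint8_t *buf, size_t len, cfb_header_fields *out) {
    assert(buf != NULL && out != NULL);
    if (len < CFB_HEADER_SIZE) return -1;
    out->sector_shift          = hdr_u16(buf + 30);
    out->mini_sector_shift     = hdr_u16(buf + 32);
    out->dir_sector_count      = hdr_u32(buf + 40);
    out->fat_sector_count      = hdr_u32(buf + 44);
    out->first_dir_sector      = hdr_u32(buf + 48);
    out->transaction_signature = hdr_u32(buf + 52);
    out->first_mini_fat_sector = hdr_u32(buf + 60);
    out->mini_fat_sector_count = hdr_u32(buf + 64);
    out->first_difat_sector    = hdr_u32(buf + 68);
    out->difat_sector_count    = hdr_u32(buf + 72);
    for (size_t i = 0; i < CFB_HEADER_DIFAT_COUNT; i++)
        out->difat[i] = hdr_u32(buf + 76 + 4 * i);
    return 0;
}
```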
Sectors and Sector Types
The Compound File Binary Format (CFBF) divides the file into fixed-size sectors, which serve as the fundamental units for organizing and storing all data, metadata, and allocation information. Sector sizes are determined by the Sector Shift field in the file header: for major version 3, the size is 512 bytes (Sector Shift = 0x0009), while for major version 4, it is 4096 bytes (Sector Shift = 0x000C).[7] These sizes apply uniformly to all sectors except mini sectors, which are fixed at 64 bytes regardless of version to handle small streams efficiently. Sectors are identified by nonnegative 32-bit integers starting from 0. The header sits at file offset 0 and is not itself numbered, so sector 0 is the first sector after the header and sector n begins at file offset (n + 1) × sector size. Valid sector numbers range from 0x00000000 to 0xFFFFFFFA (MAXREGSECT), while unallocated free sectors are marked with 0xFFFFFFFF (FREESECT). Reserved values include 0xFFFFFFFE for end-of-chain markers (ENDOFCHAIN) and specific codes for allocation structures like FAT sectors (FATSECT = 0xFFFFFFFD). Beyond the header, each sector consists of 512 or 4096 bytes of raw data, indices, or metadata, depending on its type, and sectors are linked into chains for larger structures.[8]
CFBF defines several sector types to support its file-system-like structure:
- Header Sector: A single fixed sector at the start of the file containing essential metadata, such as version information, sector size, and pointers to key structures like the directory and FAT. It is the only sector not numbered in the general allocation scheme.[7]
- FAT Sectors: Contain the File Allocation Table entries that map sector chains for streams and storages, with each entry being a 4-byte sector number.[8]
- Directory Sectors: Hold the directory entries (128 bytes each) that describe the hierarchy of storage and stream objects, including names, sizes, and starting sector numbers.[8]
- Mini FAT Sectors: Similar to FAT sectors but for allocating mini sectors in the mini stream, with 128 entries per 512-byte sector in version 3 or 1024 entries per 4096-byte sector in version 4.[9]
- Mini Stream Sectors: 64-byte units within the dedicated mini stream, used for storing data of small streams (typically under 4096 bytes) to optimize space.
- Normal Sectors: General-purpose sectors holding user data for large streams, chained together via FAT entries.[8]
- Free Sectors: Unallocated space available for future use, identified by the FREESECT value and potentially scattered throughout the file or at the end.[8]
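Because the header sector is not numbered, a sector number has to be shifted by one sector to reach its file offset. The sketch below assumes, following [MS-CFB], that the header occupies exactly one sector-sized slot in both version 3 and version 4 files; `cfb_sector_offset` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define CFB_MAXREGSECT 0xFFFFFFFAu  // largest regular sector number

// File offset of the first byte of `sector`.
// Assumption: the unnumbered header fills one sector-sized slot at offset 0.
uint64_t cfb_sector_offset(uint32_t sector, unsigned sector_shift) {
    assert(sector <= CFB_MAXREGSECT);
    assert(sector_shift == 9 || sector_shift == 12);
    return ((uint64_t)sector + 1u) << sector_shift;
}
```

The widening to 64 bits before the shift matters for version 4 files, where offsets can exceed 32 bits.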
Allocation Mechanisms
Double-Indirect File Allocation Table (DIFAT)
The Double-Indirect File Allocation Table (DIFAT) is an array of 32-bit unsigned integers that store the sector numbers of the File Allocation Table (FAT) sectors within the file. Each entry in the DIFAT is a sector identifier (SECT), where valid values represent the physical sector numbers of FAT sectors, 0xFFFFFFFE indicates the end of the DIFAT chain (ENDOFCHAIN), and 0xFFFFFFFF denotes a free sector or an unused DIFAT entry (FREESECT). Additionally, DIFAT sectors themselves are marked in the FAT with the special value DIFSECT (0xFFFFFFFC) to reserve space for them. This structure enables the CFB to manage large numbers of FAT sectors indirectly, supporting files that exceed the space limitations of the file header alone.[11][5]
The DIFAT is primarily located in the file header and extended into dedicated DIFAT sectors as needed. The header reserves the first 109 entries (DIFAT[0] through DIFAT[108]) at byte offsets 76 through 511 (436 bytes total), sufficient for files smaller than approximately 7 MB with 512-byte sectors. For larger files, additional DIFAT entries are stored in DIFAT sectors, whose count is specified in the header's "Number of DIFAT Sectors" field (byte offset 72, a 32-bit unsigned integer). The chain of these DIFAT sectors begins at the sector number given in the header's "DIFAT Start Sector Location" field (byte offset 68), allowing the DIFAT to scale dynamically.[12][11]
Each DIFAT sector has a capacity determined by the sector size minus space for chaining. In a 512-byte sector (version 3 files), it holds 127 entries (512 / 4 - 1), with the final 4 bytes as the "Next DIFAT Sector Location" field pointing to the subsequent DIFAT sector or ENDOFCHAIN to terminate the chain. For 4,096-byte sectors (version 4 files), this expands to 1,023 entries (4,096 / 4 - 1).
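The capacity arithmetic can be checked with a few lines of code. This sketch computes the entries per DIFAT sector and how many DIFAT sectors a given number of FAT sectors would need beyond the 109 header slots; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CFB_HEADER_DIFAT_ENTRIES 109u

// FAT-sector pointers per DIFAT sector: one 4-byte slot is the "next" link.
uint32_t cfb_difat_entries_per_sector(unsigned sector_shift) {
    assert(sector_shift == 9 || sector_shift == 12);
    return (1u << sector_shift) / 4u - 1u;
}

// DIFAT sectors needed to list `fat_sectors` FAT sectors.
uint32_t cfb_difat_sectors_needed(uint32_t fat_sectors, unsigned sector_shift) {
    if (fat_sectors <= CFB_HEADER_DIFAT_ENTRIES) return 0;  // header suffices
    uint32_t per   = cfb_difat_entries_per_sector(sector_shift);
    uint32_t extra = fat_sectors - CFB_HEADER_DIFAT_ENTRIES;
    return extra / per + (extra % per != 0u);               // round up
}
```

For example, a version 3 file with 110 FAT sectors needs one DIFAT sector, and that sector has room for 126 more pointers before a second one is required.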
This design theoretically supports up to around 4 billion FAT sectors, limited by the 32-bit addressing in the FAT itself, enabling CFB files to handle vast amounts of data through indirect mapping. The DIFAT's primary purpose is to provide a complete, ordered list of all FAT sector locations, which is essential for reconstructing the full FAT array before accessing stream or storage data.[11][4]
DIFAT sectors form a singly linked chain starting from the header's start sector, where each sector's last field links to the next, ensuring sequential access to all entries. The header's initial 109 entries are concatenated with those from the chained sectors to form the complete DIFAT array, with index n pointing to the (n+1)th FAT sector. This chaining mechanism reserves space in the FAT for DIFAT sectors using DIFSECT markers, preventing their reuse for data.[11][13]
Validation of the DIFAT ensures file integrity by confirming it forms a complete, non-duplicative list of unique FAT sector locations without cycles or invalid references. Implementers must verify that all sector numbers are valid (less than or equal to the maximum regular sector number, 0xFFFFFFFA), that the chain terminates properly with ENDOFCHAIN, and that no sector is referenced multiple times across the DIFAT or FAT. Invalid DIFAT entries, such as those pointing beyond the file end or creating loops, can lead to parsing failures or security vulnerabilities like denial-of-service from excessive reads. Full validation requires loading the entire DIFAT and checking against the FAT, which is computationally intensive for large files but necessary for robust parsers.[14][8]
File Allocation Table (FAT)
The File Allocation Table (FAT) serves as the primary mechanism for managing the allocation and chaining of sectors belonging to large streams in the Compound File Binary Format, enabling efficient navigation through non-contiguous data blocks within the file. Each FAT sector consists of an array of 32-bit unsigned integers (DWORDs), with the number of entries determined by the sector size divided by 4 bytes; for example, a standard 512-byte sector accommodates 128 entries, while a 4,096-byte sector holds 1,024 entries. These entries map a given sector index to the location of the subsequent sector in a stream's chain, facilitating the reconstruction of stream data by linking sectors logically rather than requiring physical contiguity on disk.[15]
FAT entry values encode the status and linkage of sectors using specific constants defined in the format specification. A value of 0x00000000 through the maximum valid sector number (typically up to 0xFFFFFFFA for normal sectors) represents the index of the next sector in the chain, allowing streams to span arbitrary locations in the file. The constant 0xFFFFFFFE denotes ENDOFCHAIN, signaling the termination of a sector chain. Entries set to 0xFFFFFFFF indicate FREESECT, marking unallocated or available sectors that can be reused. Reserved sectors, such as those used for FAT or DIFAT, are marked with special values (FATSECT, 0xFFFFFFFD, for FAT sectors and DIFSECT, 0xFFFFFFFC, for DIFAT sectors) to reserve them and prevent reuse. These values ensure that the FAT operates like a simplified file system bitmap extended with chaining capabilities.[15][3]
The FAT sectors themselves are not stored contiguously but are referenced by the Double-Indirect File Allocation Table (DIFAT), which provides their sector indices, with the total number of FAT sectors specified in the file header's _csectFat field (a 32-bit unsigned integer at offset 0x2C, that is, bytes 44–47).
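Given the entry values described above, following a chain is a loop over the FAT array. The sketch below is one way to do it, assuming the whole FAT has already been loaded into memory as a flat array; `cfb_walk_chain` is a hypothetical name. It also caps the number of steps at the table length so that a corrupted table containing a loop cannot hang the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFB_ENDOFCHAIN 0xFFFFFFFEu

// Follows the chain that begins at `start`. Up to out_cap visited sector
// numbers are stored in `out` (which may be NULL). Returns the chain length,
// or -1 when the chain leaves the table or must contain a cycle.
int64_t cfb_walk_chain(const uint32_t *fat, size_t fat_len, uint32_t start,
                       uint32_t *out, size_t out_cap) {
    assert(fat != NULL);
    size_t steps = 0;
    uint32_t sector = start;
    while (sector != CFB_ENDOFCHAIN) {
        // Out-of-range links; for any realistically sized table this also
        // rejects FREESECT, FATSECT and DIFSECT appearing inside a chain.
        if (sector >= fat_len) return -1;
        // More steps than table entries means some sector repeated.
        if (steps >= fat_len) return -1;
        if (out != NULL && steps < out_cap) out[steps] = sector;
        steps++;
        sector = fat[sector];
    }
    return (int64_t)steps;
}
```

Counting steps is cheaper than keeping a visited set; it detects a loop later than a set would, but always within `fat_len` iterations.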
This design allows the FAT to scale with file size, supporting up to approximately 4 billion sectors in theory due to the 32-bit addressing, though practical limits are imposed by the overall file size and sector shift values in the header. To resolve a chain for a stream starting at sector S, the process begins by reading sector S, then retrieves the next sector from the FAT entry at index S, repeating until an ENDOFCHAIN value is encountered; this traversal reconstructs the full stream without loading the entire file into memory.[15][3]
Allocation in the FAT follows strict rules to maintain integrity: sectors assigned to a stream form a unidirectional chain where each points only to the next, ensuring no overlaps or branches, as each sector can belong to at most one chain. When extending a stream, a free sector (marked FREESECT) is selected and its entry is set to ENDOFCHAIN, and the entry of the previous last sector is revised to point to the new sector; the chain remains logically contiguous but may be physically scattered across the file. This approach supports dynamic growth of normal streams at or above the mini stream cutoff, distinct from smaller streams handled elsewhere.[15][3]
Detection of corruption in FAT chains is essential for robust file handling, with invalid configurations indicating structural damage. Common errors include cycles, where a chain loops back on itself (e.g., sector A points to B, B to A), out-of-bounds pointers exceeding the valid sector count, or data chains that run into sectors reserved for metadata such as FAT or DIFAT sectors; such anomalies trigger repair attempts or file rejection in compliant readers. The specification recommends verifying chain integrity during parsing to prevent infinite loops or data loss.[15][3]
Mini File Allocation Table (Mini FAT)
The Mini File Allocation Table (Mini FAT) serves as an allocation mechanism within the Compound File Binary Format (CFBF) specifically for managing small streams that are below a defined size threshold, enabling efficient use of space without the overhead of full-sized sectors.[9] Streams smaller than the Mini Stream Cutoff value, specified in bytes 56 through 59 of the file header as 0x00001000 (4096 bytes), are allocated using the Mini FAT and stored in the Mini Stream, while larger streams utilize the standard File Allocation Table (FAT).[7] This threshold ensures that small data objects, such as metadata or short content streams, avoid wasting space in larger 512-byte or 4096-byte sectors.[9]
Structurally, the Mini FAT mirrors the FAT but is adapted for mini sectors, consisting of a chain of 32-bit entries that represent sector numbers within the Mini Stream rather than the main file.[15] Each entry points to the next mini sector index, with the number of entries per Mini FAT sector varying by the overall file's sector size: 128 entries for 512-byte sectors (Major Version 3) or 1024 entries for 4096-byte sectors (Major Version 4).[9] The Mini FAT sectors themselves are stored as a chain in normal file sectors, beginning at the location indicated by the Mini FAT Start Sector field in the header (bytes 60 through 63, a 4-byte unsigned integer), with the total count provided in the subsequent Number of Mini FAT Sectors field (bytes 64 through 67).[7] If no small streams exist, the Mini FAT Start Sector is set to the end-of-chain marker 0xFFFFFFFE, indicating that the Mini FAT and Mini Stream are unnecessary.[9]
Mini sectors are fixed at 64 bytes each, providing finer granularity for small data allocation compared to standard sectors.[15] In chain mechanics, each Mini FAT entry holds a 32-bit value representing the index of the next mini sector in the Mini Stream; to access the data, this index is multiplied by 64 to obtain the byte offset within the Mini Stream.[9]
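The multiply-by-64 step gives an offset inside the Mini Stream, not inside the file. Since the Mini Stream is itself stored in ordinary sectors, a reader still has to work out which sector of that stream holds the mini sector. The sketch below shows the arithmetic only; the struct and function names are hypothetical, and resolving the chain position to an actual sector number is left to a FAT lookup.

```c
#include <assert.h>
#include <stdint.h>

#define CFB_MINI_SECTOR_SIZE   64u
#define CFB_MINI_STREAM_CUTOFF 4096u

// 1 when a stream of this size is stored in the Mini Stream.
int cfb_uses_mini_stream(uint64_t stream_size) {
    return stream_size < CFB_MINI_STREAM_CUTOFF;
}

typedef struct {
    uint32_t chain_index;     // position within the Mini Stream's sector chain
    uint32_t byte_in_sector;  // offset inside that sector
} cfb_mini_location;

// Maps a mini sector index to a position inside the Mini Stream's chain.
cfb_mini_location cfb_locate_mini_sector(uint32_t mini_sector,
                                         unsigned sector_shift) {
    assert(sector_shift == 9 || sector_shift == 12);
    uint64_t offset = (uint64_t)mini_sector * CFB_MINI_SECTOR_SIZE;
    cfb_mini_location loc;
    loc.chain_index    = (uint32_t)(offset >> sector_shift);
    loc.byte_in_sector = (uint32_t)(offset & ((1u << sector_shift) - 1u));
    return loc;
}
```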
The chain terminates with the value 0xFFFFFFFE (ENDOFCHAIN), signaling the end of the allocated sectors for a given stream, which prevents unnecessary space allocation and supports efficient storage of data under 4096 bytes without fragmentation issues associated with larger sectors.[9]
The Mini FAT integrates with the Mini Stream, a dedicated stream object in the root storage (directory entry index 0) whose starting sector is referenced in the root entry's Starting Sector Location field.[15] All mini sectors for small streams are contained within this Mini Stream, which itself is chained via the standard FAT like any other stream, allowing seamless access to small data through the Mini FAT's indexing.[9] This setup ensures that the Mini FAT operates as a lightweight allocator tailored for the Mini Stream's 64-byte granularity, optimizing the CFBF for compound files with numerous small components.[15]
Object Hierarchy
Directory Entries
The directory entries in the Compound File Binary Format (CFB) constitute an array of fixed-size records that define the hierarchical structure of storage and stream objects within the file, serving as the metadata backbone for object navigation and properties.[3] These entries are organized as a virtual stream composed of one or more directory sectors, which are chained together using the File Allocation Table (FAT).[3] The chain begins at the sector index specified in the file header's _sectDirStart field, typically starting from sector 1 in simple files, and continues until an end-of-chain marker (0xFFFFFFFE) is encountered.[3] Each 512-byte directory sector accommodates four 128-byte entries; larger files may span multiple sectors.[3] The array terminates when an entry with an empty name (all zeros in the name field) is reached or upon encountering special reserved entries.[3]
Each directory entry is precisely 128 bytes long and encodes essential metadata for an object, including its name, type, relationships in the hierarchy, timestamps, and location or size information.[3] The structure follows a rigid byte layout, as outlined in the following table:
| Byte Offset | Size (bytes) | Field Name | Description |
|---|---|---|---|
| 0x00 | 64 | _ab (Name) | Unicode (UTF-16LE) name as 32 wide characters, null-terminated and zero-padded to 64 bytes; supports up to 31 characters plus null terminator. |
| 0x40 | 2 | _cb (Name Length) | Length of the name in bytes (0 to 64, multiple of 2), including the null terminator. |
| 0x42 | 1 | _mse (Type) | Object type: 0 (invalid/empty), 1 (storage object), 2 (stream object), 5 (root entry). |
| 0x43 | 1 | _bflags (Color) | Node color for red-black tree balancing: 0 (red), 1 (black). |
| 0x44 | 4 | _sidLeftSib | Index (SID) of the left sibling entry in the red-black tree. |
| 0x48 | 4 | _sidRightSib | Index (SID) of the right sibling entry in the red-black tree. |
| 0x4C | 4 | _sidChild | Index (SID) of the first child entry (for storage objects only). |
| 0x50 | 16 | _clsId (CLSID) | Class identifier (GUID) for the storage object; unused for streams. |
| 0x60 | 4 | _dwUserFlags | State bits for storage objects (low 4 bits: version number 0-15; higher bits reserved and zero). Ignored for streams. |
| 0x64 | 8 | _time[0] | Creation timestamp in FILETIME format (100-nanosecond intervals since January 1, 1601 UTC); for storage objects. |
| 0x6C | 8 | _time[1] | Modification timestamp in FILETIME format; for storage objects. |
| 0x74 | 4 | _sectStart | Starting sector index of the object's data chain (for streams); for the root entry, the first sector of the Mini Stream. |
| 0x78 | 8 | _ulSize | Size of the stream in bytes (for streams and root); 0 for empty objects. This field ends at offset 0x80, completing the 128-byte entry. |
For stream objects, the _sectStart and _ulSize fields directly indicate the data location and length, while storage objects leave these fields zero, except for the root entry, which uses them to describe the Mini Stream it owns.[3]
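The layout in the table can be decoded with plain byte reads. The sketch below pulls out the fields most readers need; the struct and function names are hypothetical, and the name conversion is deliberately simplified to ASCII (any other UTF-16 code unit becomes '?'), which a real reader would replace with proper Unicode handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFB_DIR_ENTRY_SIZE 128u

// Hypothetical decoded view of one directory entry.
typedef struct {
    char     name[32];      // ASCII-only copy of the UTF-16LE name
    uint8_t  type;          // 0 empty, 1 storage, 2 stream, 5 root
    uint8_t  color;         // 0 red, 1 black
    uint32_t left, right, child;
    uint32_t start_sector;  // _sectStart
    uint64_t size;          // _ulSize
} cfb_dir_entry;

static uint32_t dir_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Returns 0 on success, -1 when the buffer is short or the name length is bad.
int cfb_decode_dir_entry(const uint8_t *raw, size_t len, cfb_dir_entry *out) {
    assert(raw != NULL && out != NULL);
    if (len < CFB_DIR_ENTRY_SIZE) return -1;
    unsigned name_bytes = raw[0x40] | (raw[0x41] << 8);      // _cb
    if (name_bytes > 64 || (name_bytes & 1u)) return -1;
    unsigned chars = name_bytes ? name_bytes / 2 - 1 : 0;    // drop terminator
    for (unsigned i = 0; i < chars; i++) {
        unsigned cu = raw[2 * i] | (raw[2 * i + 1] << 8);
        out->name[i] = cu < 0x80 ? (char)cu : '?';           // ASCII only
    }
    out->name[chars] = '\0';
    out->type         = raw[0x42];
    out->color        = raw[0x43];
    out->left         = dir_u32(raw + 0x44);
    out->right        = dir_u32(raw + 0x48);
    out->child        = dir_u32(raw + 0x4C);
    out->start_sector = dir_u32(raw + 0x74);
    out->size = (uint64_t)dir_u32(raw + 0x78) |
                ((uint64_t)dir_u32(raw + 0x7C) << 32);
    return 0;
}
```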
The root entry, always at index 0 (SID 0), is a special storage object of type 5 (STGTY_ROOT) with a conventional name of "Root Entry" (or shortened to "R" in some legacy files), and it serves as the top-level container for the entire hierarchy.[3] It has no siblings (marked with 0xFFFFFFFF) and owns the Mini Stream, where its _sectStart points to the first sector of small streams (under 4096 bytes) and _ulSize specifies the Mini Stream's total byte length, typically around 4096 bytes or more depending on content.[3] The root entry's child SID links to the first top-level storage or stream, establishing the file's root directory equivalent.[3]
The directory entries collectively form a tree-structured hierarchy through sibling and child pointers, implemented as a balanced red-black tree to ensure efficient searching and insertion by name.[3] Each storage object's _sidChild points to its first child, while left and right sibling SIDs (_sidLeftSib and _sidRightSib) organize children into a balanced binary search tree ordered first by name length and then lexicographically.[3] The color flags enforce red-black invariants: the root is black, no two reds are adjacent, and every path from a node down to its leaves passes through the same number of black nodes.[3] SIDs (Stream IDs) are zero-based indices into the directory array, providing stable references that remain valid even as the file grows. Entries store no parent pointer, so the parent of an entry can only be found by walking down from the root through child and sibling links.[3] This design supports the file-system-like organization of CFB, enabling nested directories (storages) and files (streams) within a single binary file.[3]
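The ordering rule, length first and then code unit by code unit, can be expressed as a small comparison function. In this sketch the names are given as UTF-16 code units without the terminator. Folding case before comparing is an assumption drawn from how [MS-CFB] describes the comparison, and the ASCII-only folding here is a simplification; `cfb_compare_names` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Simplified case folding: ASCII letters only (assumption of this sketch).
static uint16_t cfb_upper_ascii(uint16_t cu) {
    return (cu >= 'a' && cu <= 'z') ? (uint16_t)(cu - 'a' + 'A') : cu;
}

// Returns -1, 0 or 1 when name a sorts before, equal to, or after name b.
int cfb_compare_names(const uint16_t *a, size_t a_len,
                      const uint16_t *b, size_t b_len) {
    assert(a != NULL && b != NULL);
    if (a_len != b_len) return a_len < b_len ? -1 : 1;  // shorter name first
    for (size_t i = 0; i < a_len; i++) {
        uint16_t x = cfb_upper_ascii(a[i]);
        uint16_t y = cfb_upper_ascii(b[i]);
        if (x != y) return x < y ? -1 : 1;
    }
    return 0;
}
```

Under this rule "B" sorts before "AA" because it is shorter, which differs from ordinary dictionary order.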
