Compound File Binary Format
from Wikipedia

Compound File Binary Format (CFBF), also called Compound File, Compound Document format,[1] or Composite Document File V2[2] (CDF), is a compound document file format for storing numerous files and streams within a single file on a disk. CFBF is developed by Microsoft and is an implementation of Microsoft COM Structured Storage.[3][4][5] The file format is used for storing storage objects and stream objects in a hierarchical structure within a single file.[6]

Microsoft has opened the format for use by others and it is now used in a variety of programs from Microsoft Word and Microsoft Access to Business Objects.[citation needed] It also forms the basis of the Advanced Authoring Format.[7]

Overview

At its simplest, the Compound File Binary Format is a container, with little restriction on what can be stored within it.

A CFBF file structure loosely resembles a FAT file system. The file is partitioned into Sectors, which are chained together by a File Allocation Table (not to be confused with the file system of the same name) that records the chain of sectors belonging to each file. A Directory holds information about the contained files, including the Sector ID (SID) of the starting sector of each chain.

Structure

The CFBF file consists of a 512-byte header record followed by a number of Sectors whose size is defined in the header. The literature defines Sectors to be either 512 or 4096 bytes in length, although the format is potentially capable of supporting sectors ranging in size from 128 bytes upwards, in powers of two (128, 256, 512, 1024, etc.). The lower limit of 128 is the minimum required to fit a single directory entry in a Directory Sector.[relevant?]
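
Because the header sits in front of the first sector, sector numbers translate to file offsets with a one-sector bias. The following is a minimal sketch, assuming the header occupies exactly one sector-sized slot (512 bytes for 512-byte sectors, padded to 4096 bytes for 4096-byte sectors); the function name `sector_offset` is hypothetical and not taken from any specification.

```c
#include <stdint.h>

/* Byte offset of regular sector `sect` for a given sector shift
   (9 for 512-byte sectors, 12 for 4096-byte sectors).
   Assumption: the header fills one sector-sized slot at offset 0,
   so sector 0 starts one sector size into the file. */
uint64_t sector_offset(uint32_t sect, unsigned sector_shift)
{
    return ((uint64_t)sect + 1) << sector_shift;
}
```

The arithmetic is done in 64 bits so that large sector numbers do not overflow when shifted.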

There are several types of sector that may be present in a CFBF file:

  • File Allocation Table (FAT) Sector – contains chains of sector indices much as a FAT does in the FAT/FAT32 filesystems
  • MiniFAT Sectors – similar to the FAT but storing chains of mini-sectors within the Mini-Stream
  • Double-Indirect FAT (DIFAT) Sector – contains chains of FAT sector indices
  • Directory Sector – contains directory entries
  • Stream Sector – contains arbitrary file data
  • Range Lock Sector – contains the byte-range locking area of a large file

More detail is given below for the header and each sector type.

CFBF header format

The CFBF header occupies the first 512 bytes of the file and contains the information required to interpret the rest of the file. The C-style structure declaration below (extracted from the AAFA's Low-Level Container Specification) shows the members of the CFBF header and their purpose:

#include <stdint.h>             // fixed-width types; plain long is 8 bytes on many 64-bit systems

typedef uint32_t ULONG;         // 4 bytes
typedef uint16_t USHORT;        // 2 bytes
typedef int16_t OFFSET;         // 2 bytes
typedef ULONG SECT;             // 4 bytes
typedef ULONG FSINDEX;          // 4 bytes
typedef USHORT FSOFFSET;        // 2 bytes
typedef USHORT WCHAR;           // 2 bytes
typedef ULONG DFSIGNATURE;      // 4 bytes
typedef uint8_t BYTE;           // 1 byte
typedef uint16_t WORD;          // 2 bytes
typedef uint32_t DWORD;         // 4 bytes
typedef ULONG SID;              // 4 bytes
typedef struct { BYTE b[16]; } GUID; // 16 bytes; stand-in for the Windows GUID type
typedef GUID CLSID;             // 16 bytes

struct StructuredStorageHeader { // [offset from start (bytes), length (bytes)]
    BYTE _abSig[8];             // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
                                // 0x1a, 0xe1} for current version
    CLSID _clsid;               // [08H,16] reserved must be zero (WriteClassStg/
                                // GetClassFile uses root directory class id)
    USHORT _uMinorVersion;      // [18H,02] minor version of the format: 33 is
                                // written by reference implementation
    USHORT _uDllVersion;        // [1AH,02] major version of the dll/format: 3 for
                                // 512-byte sectors, 4 for 4 KB sectors
    USHORT _uByteOrder;         // [1CH,02] 0xFFFE: indicates Intel byte-ordering
    USHORT _uSectorShift;       // [1EH,02] size of sectors in power-of-two;
                                // typically 9 indicating 512-byte sectors
    USHORT _uMiniSectorShift;   // [20H,02] size of mini-sectors in power-of-two;
                                // typically 6 indicating 64-byte mini-sectors
    USHORT _usReserved;         // [22H,02] reserved, must be zero
    ULONG _ulReserved1;         // [24H,04] reserved, must be zero
    FSINDEX _csectDir;          // [28H,04] must be zero for 512-byte sectors,
                                // number of SECTs in directory chain for 4 KB
                                // sectors
    FSINDEX _csectFat;          // [2CH,04] number of SECTs in the FAT chain
    SECT _sectDirStart;         // [30H,04] first SECT in the directory chain
    DFSIGNATURE _signature;     // [34H,04] signature used for transactions; must
                                // be zero. The reference implementation
                                // does not support transactions
    ULONG _ulMiniSectorCutoff;  // [38H,04] maximum size for a mini stream;
                                // typically 4096 bytes
    SECT _sectMiniFatStart;     // [3CH,04] first SECT in the MiniFAT chain
    FSINDEX _csectMiniFat;      // [40H,04] number of SECTs in the MiniFAT chain
    SECT _sectDifStart;         // [44H,04] first SECT in the DIFAT chain
    FSINDEX _csectDif;          // [48H,04] number of SECTs in the DIFAT chain
    SECT _sectFat[109];         // [4CH,436] the SECTs of first 109 FAT sectors
 };
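
As an illustration of how the offsets in the declaration above might be used, the sketch below checks the signature and byte-order mark and pulls a few fields out of a raw 512-byte buffer. It reads fields byte by byte rather than overlaying the struct, which sidesteps padding and host-endianness concerns. The type `cfb_header_info` and the function `cfb_read_header` are hypothetical names chosen for this example.

```c
#include <stdint.h>
#include <string.h>

/* Little-endian readers that work on any host byte order. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

typedef struct {             /* hypothetical subset of the header */
    uint16_t major_version;  /* _uDllVersion  at 0x1A */
    uint16_t sector_shift;   /* _uSectorShift at 0x1E */
    uint32_t fat_sectors;    /* _csectFat     at 0x2C */
    uint32_t dir_start;      /* _sectDirStart at 0x30 */
} cfb_header_info;

/* Returns 0 on success, -1 if the signature or byte-order mark is wrong. */
int cfb_read_header(const uint8_t hdr[512], cfb_header_info *out)
{
    static const uint8_t sig[8] =
        {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1};

    if (memcmp(hdr, sig, sizeof sig) != 0)
        return -1;
    if (rd16(hdr + 0x1C) != 0xFFFE)
        return -1;

    out->major_version = rd16(hdr + 0x1A);
    out->sector_shift  = rd16(hdr + 0x1E);
    out->fat_sectors   = rd32(hdr + 0x2C);
    out->dir_start     = rd32(hdr + 0x30);
    return 0;
}
```

A caller would typically read the first 512 bytes of the file into a buffer, pass it to `cfb_read_header`, and then derive the sector size as `1 << sector_shift`.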

File Allocation Table (FAT) sectors

When taken together as a single stream, the collection of FAT sectors defines the status and linkage of every sector in the file. Each entry in the FAT is 4 bytes in length and contains the sector number of the next sector in a FAT chain or one of the following special values:

  • FREESECT (0xFFFFFFFF) – denotes an unused sector
  • ENDOFCHAIN (0xFFFFFFFE) – marks the last sector in a FAT chain
  • FATSECT (0xFFFFFFFD) – marks a sector used to store part of the FAT
  • DIFSECT (0xFFFFFFFC) – marks a sector used to store part of the DIFAT
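
Following a FAT chain amounts to repeatedly looking up the current sector number in the FAT until ENDOFCHAIN is reached. The sketch below assumes the whole FAT has already been loaded into one in-memory array; `fat_walk` is a hypothetical helper name, and the step limit used to catch loops is a design choice of this example rather than something the format prescribes.

```c
#include <stdint.h>

#define FREESECT   0xFFFFFFFFu
#define ENDOFCHAIN 0xFFFFFFFEu
#define FATSECT    0xFFFFFFFDu
#define DIFSECT    0xFFFFFFFCu

/* Follows the chain that starts at `start`, storing up to `cap` sector
   numbers in `out`. `fat` holds `nfat` entries, one per sector.
   Returns the chain length, or -1 if a link is out of range (which
   includes the special values above) or the chain loops. */
long fat_walk(const uint32_t *fat, uint32_t nfat, uint32_t start,
              uint32_t *out, uint32_t cap)
{
    uint32_t s = start;
    uint32_t n = 0;

    while (s != ENDOFCHAIN) {
        /* A valid chain can never visit more sectors than exist. */
        if (s >= nfat || n >= nfat)
            return -1;
        if (n < cap)
            out[n] = s;
        n++;
        s = fat[s];
    }
    return (long)n;
}
```

For example, with the FAT `{2, FREESECT, 3, ENDOFCHAIN}` a walk from sector 0 visits sectors 0, 2 and 3.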

Range Lock Sector

The Range Lock Sector must exist in files greater than 2 GB in size, and must not exist in files smaller than 2 GB. The Range Lock Sector must contain the byte range 0x7FFFFF00 to 0x7FFFFFFF in the file. This area is reserved by Microsoft's COM implementation for storing byte-range locking information for concurrent access.
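
To see which sector number covers the reserved byte range, a file offset can be converted back to a sector index. This is a minimal sketch under the assumption that the header occupies one sector-sized slot, so sector n begins at (n + 1) × sector size; `sector_containing` is a hypothetical name.

```c
#include <stdint.h>

/* Sector number whose bytes include file offset `offset`.
   Precondition: offset is at least one sector size, because the first
   sector-sized slot of the file is assumed to hold the header. */
uint32_t sector_containing(uint64_t offset, unsigned sector_shift)
{
    return (uint32_t)(offset >> sector_shift) - 1;
}
```

Under that assumption, offset 0x7FFFFF00 falls in sector 0x7FFFE when sectors are 4096 bytes long.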

Glossary

  • FAT – File Allocation Table; also known as SAT – Sector Allocation Table
  • DIFAT – Double-Indirect File Allocation Table
  • FAT Chain – a group of FAT entries which indicate the Sectors allocated to a Stream in the file
  • Stream – a virtual file which occupies a number of Sectors within the CFBF
  • Sector – the unit of allocation within the CFBF, usually 512 or 4096 Bytes in length

from Grokipedia
The Compound File Binary Format (CFB) is a file format developed by Microsoft that enables the storage of hierarchical data within a single file, mimicking a file system structure through storage objects (functioning as directories) and stream objects (functioning as files). It provides a general-purpose mechanism for organizing arbitrary, application-specific data streams in a structured manner, addressing the need to embed multiple object types efficiently within compound documents. Introduced as part of the Object Linking and Embedding (OLE) 2.0 technology in the early 1990s and integral to the Component Object Model (COM), the CFB format, also known as structured storage, lets a complex document be managed as a single self-contained file that can be copied, backed up, or emailed.

Its core structure begins with a 512-byte header containing a magic number (D0 CF 11 E0 A1 B1 1A E1 in hexadecimal) for identification, followed by version information, a byte-order mark (little-endian), and pointers to key components: the File Allocation Table (FAT) for tracking sector chains, the Directory for managing object metadata, and the Mini FAT for streams smaller than 4096 bytes. Data is organized into fixed-size sectors: 512 bytes in version 3, the most common iteration, and 4096 bytes in version 4, which supports larger files of up to approximately 2^44 bytes. Chain markers such as ENDOFCHAIN (0xFFFFFFFE) and FREESECT (0xFFFFFFFF) delineate allocated and unused space.

The format underpins numerous applications, serving as the basis for the binary file types of Microsoft Office suites from 1997 to 2003, including Word (.doc), Excel (.xls), and PowerPoint (.ppt) documents, as well as email messages (.msg) and thumbnail caches. It supports a maximum of about 16 million directory entries and is fully documented under Microsoft's Open Specifications Promise, ensuring interoperability while prioritizing performance for non-streaming scenarios due to its fixed stream sizes and sector-based allocation. Although largely superseded by XML-based formats such as Office Open XML in later products, the CFB remains relevant for legacy file handling and certain system files in Windows environments.

Introduction

Overview

The Compound File Binary Format (CFBF) is a general-purpose file format that provides a file-system-like structure within a single file for the storage of arbitrary, application-specific streams of data. It supports two primary object types: storages, which function like directories for hierarchical organization, and streams, which act as file-like containers for data. CFBF emulates a simplified FAT file system by dividing the file into fixed-size sectors, typically 512 bytes or 4096 bytes, to enable efficient data management and access. This structure allows multiple data types to be embedded within one file and lets individual components be modified without a full file rewrite, which is particularly useful for compound documents in applications such as Microsoft Office. The minimum file size is three sectors: one for the header, one for the File Allocation Table (FAT), and one for the directory. Originally developed as part of the OLE 2.0 structured storage system, CFBF has evolved into a standardized format documented in Microsoft's [MS-CFB] specification, with version 12.0 published in April 2024 (last revised October 2024). It organizes data hierarchically via directory entries and chains sectors using allocation tables.

History and Development

The Compound File Binary Format (CFBF), originally known as the Compound Document File format, was developed by Microsoft in the early 1990s as a core component of Object Linking and Embedding (OLE) 2.0, introduced to enable structured storage within a single file for compound documents in Windows applications. The format provided a file-system-like hierarchy of storages and streams, drawing conceptual inspiration from the FAT12 and FAT16 file allocation mechanisms of earlier DOS and Windows file systems, but adapted to embed multiple data objects efficiently within one file. Early beta implementations appeared in late 1992, supporting OLE 2.0's object model under the Component Object Model (COM), with the format's signature evolving to its current form by the mid-1990s.

CFBF became integral to structured storage in Microsoft applications starting in 1993, such as Excel 5.0 and Word 6.0, and gained prominence with the release of Office 95, facilitating the embedding and linking of diverse data types in documents while maintaining compatibility with the Windows and Windows NT operating systems of the period. The format's integration with COM on these platforms allowed data to be shared across applications, marking a key milestone in Microsoft's push toward component-based software. Major version 4 of CFBF supports 4096-byte sectors for handling larger files and improved performance.

Microsoft formalized and publicly documented CFBF through the Open Specifications program, beginning with the initial [MS-CFB] specification release on July 16, 2010 (version 1.0), which detailed the format for third-party interoperability. Subsequent revisions addressed security enhancements, compatibility issues, and sector allocation refinements, with major updates including version 2.0 in October 2010, version 4.0 in November 2013, and the latest version 12.0 in April 2024. Although not standardized by an international body such as ISO, the format is maintained via Microsoft's Open Specifications, ensuring ongoing documentation and support for cross-platform use in applications beyond Office.

Core File Structure

File Header

The Compound File Header is a fixed 512-byte structure located at the beginning of every Compound File Binary Format (CFBF) file, serving as the entry point that contains critical metadata for parsing the file's sector-based organization, allocation tables, and directory. It identifies the file format, specifies sector sizes, and provides starting locations for key components such as the directory chain, File Allocation Table (FAT), Mini FAT, and Double-Indirect FAT (DIFAT). For CFBF version 4 files, the header extends to 4,096 bytes with padding bytes beyond the first 512, but all functional fields reside in the initial portion.

The header opens with an 8-byte signature at bytes 0–7, fixed as the hexadecimal sequence D0 CF 11 E0 A1 B1 1A E1, which identifies the file as adhering to the CFBF specification. Applications use this signature to detect the format; its first four bytes are often read as a hexadecimal rendering of "DOCFILE0".

Subsequent fields define the file's structural parameters. At bytes 30–31, the Sector Shift field is a 16-bit unsigned integer giving the base-2 logarithm of the sector size: a value of 9 corresponds to 512-byte sectors (version 3 files), while 12 denotes 4,096-byte sectors (version 4 files). Bytes 32–33 hold the Mini Sector Shift, fixed at 6 to specify the 64-byte mini sectors used for small streams. The Number of Directory Sectors field (bytes 40–43) is a 32-bit unsigned integer counting the sectors allocated to the directory chain, which organizes the file's object hierarchy; in version 3 files this field must be 0, and the directory's extent is found by following its chain in the FAT. Similarly, bytes 44–47 contain the Number of FAT Sectors, a 32-bit value giving the total number of sectors that make up the FAT.

Navigation to core components is facilitated by starting sector indicators. Bytes 48–51 specify the First Directory Sector Location as a 32-bit unsigned integer, pointing to the initial sector of the directory chain. Bytes 52–55 house the Transaction Signature, a 32-bit unsigned integer used for detecting concurrent modifications or transaction states in multi-user environments, though it is typically zero if transactions are not supported. For handling small streams, bytes 60–63 indicate the First Mini FAT Sector Location, and bytes 64–67 provide the Number of Mini FAT Sectors, both as 32-bit unsigned integers; these describe a secondary FAT for streams under the mini stream cutoff size of 4,096 bytes. DIFAT management fields follow: bytes 68–71 denote the First DIFAT Sector Location, and bytes 72–75 the Number of DIFAT Sectors, each a 32-bit unsigned integer, which together extend the list of FAT sectors beyond the header's capacity.

The header concludes with an embedded DIFAT array at bytes 76–511 (436 bytes total), comprising the first 109 entries as 32-bit unsigned integers, each pointing to a FAT sector's location; this array bootstraps the double-indirect allocation mechanism for larger files, with additional DIFAT sectors referenced if needed. Reserved fields, such as bytes 34–39 and the Class ID at bytes 8–23 (all zeros), ensure alignment and future extensibility without altering the core structure.
| Field Name | Byte Offset | Size (bytes) | Value/Description |
|---|---|---|---|
| Header Signature | 0–7 | 8 | Fixed: D0 CF 11 E0 A1 B1 1A E1 (format identifier) |
| Sector Shift | 30–31 | 2 | 9 (512-byte sectors) or 12 (4,096-byte sectors) |
| Mini Sector Shift | 32–33 | 2 | 6 (64-byte mini sectors) |
| Number of Directory Sectors | 40–43 | 4 | Count of sectors for directory chain |
| Number of FAT Sectors | 44–47 | 4 | Total FAT sectors in file |
| First Directory Sector Location | 48–51 | 4 | Starting sector for directory |
| Transaction Signature | 52–55 | 4 | Transaction sequence number (often 0) |
| First Mini FAT Sector Location | 60–63 | 4 | Starting sector for Mini FAT |
| Number of Mini FAT Sectors | 64–67 | 4 | Count of Mini FAT sectors |
| First DIFAT Sector Location | 68–71 | 4 | Starting sector for additional DIFAT |
| Number of DIFAT Sectors | 72–75 | 4 | Count of DIFAT sectors |
| DIFAT Array | 76–511 | 436 | First 109 FAT sector pointers |
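
Several of the fields above constrain one another, which gives a parser a cheap sanity check before it trusts the rest of the file. The sketch below is one possible consistency test built only from the rules stated in this section, plus the assumption that the major version is a 16-bit field at bytes 26–27; `cfb_header_consistent` is a hypothetical name.

```c
#include <stdint.h>

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if version, sector shift, mini sector shift and directory
   sector count agree with each other, 0 otherwise. */
int cfb_header_consistent(const uint8_t *h)
{
    uint16_t major = le16(h + 26);   /* assumed location of major version */
    uint16_t shift = le16(h + 30);   /* Sector Shift */
    uint16_t mini  = le16(h + 32);   /* Mini Sector Shift */
    uint32_t ndir  = le32(h + 40);   /* Number of Directory Sectors */

    if (mini != 6)
        return 0;
    if (major == 3)
        return shift == 9 && ndir == 0;
    if (major == 4)
        return shift == 12;
    return 0;                        /* unknown major version */
}
```

A check like this does not prove a file is well formed; it only rejects headers whose fields contradict each other.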

Sectors and Sector Types

The Compound File Binary Format (CFBF) divides the file into fixed-size sectors, which serve as the fundamental units for organizing and storing all data, metadata, and allocation information. Sector sizes are determined by the Sector Shift field in the file header: for major version 3, the size is 512 bytes (Sector Shift = 0x0009), while for major version 4, it is 4096 bytes (Sector Shift = 0x000C). These sizes apply uniformly to all sectors except mini sectors, which are fixed at 64 bytes regardless of version to handle small streams efficiently.

Sectors are identified by nonnegative 32-bit integers starting from 0. The header occupies the start of the file at offset 0 and is not itself numbered, so sector 0 is the first sector after the header. Valid sector numbers range from 0x00000000 to 0xFFFFFFFA (MAXREGSECT), while unallocated free sectors are marked with 0xFFFFFFFF (FREESECT). Reserved values include 0xFFFFFFFE for end-of-chain markers (ENDOFCHAIN) and specific codes for allocation structures such as FAT sectors (FATSECT = 0xFFFFFFFD). Beyond the header, each sector consists of 512 or 4096 bytes of stream data, indices, or metadata, depending on its type, and sectors are linked into chains for larger structures. CFBF defines several sector types to support its file-system-like structure:
  • Header Sector: A single fixed sector at file offset 0 containing essential metadata, such as version information, sector size, and pointers to key structures like the directory and FAT. It is the only sector not numbered in the general allocation scheme.
  • FAT Sectors: Contain the entries that map sector chains for streams, with each entry being a 4-byte sector number.
  • Directory Sectors: Hold the directory entries (128 bytes each) that describe the hierarchy of storage and stream objects, including names, sizes, and starting sector numbers.
  • Mini FAT Sectors: Similar to FAT sectors but for allocating mini sectors in the mini stream, with 128 entries per 512-byte sector in version 3 or 1024 entries per 4096-byte sector in version 4.
  • Mini Stream Sectors: 64-byte units within the dedicated mini stream, used for storing the data of small streams (typically under 4096 bytes) to reduce wasted space.
  • Normal Sectors: General-purpose sectors holding user data for large streams, chained together via FAT entries.
  • Free Sectors: Unallocated sectors available for future use, identified by the FREESECT value and potentially scattered throughout the file or at the end.
The file size may not be an exact multiple of the sector size, resulting in an unused partial sector at the end that is treated as free space. To ensure basic functionality, every CFBF file must be at least three sectors long: one for the header, one for the FAT, and one for the directory. Version 3 files are limited to 2 GB for compatibility, while version 4 supports larger sizes via the 4096-byte sectors.

Allocation Mechanisms

Double-Indirect File Allocation Table (DIFAT)

The Double-Indirect File Allocation Table (DIFAT) is a critical component of the Compound File Binary Format (CFB), serving as an array of 32-bit unsigned integers that store sector numbers pointing to the locations of File Allocation Table (FAT) sectors within the file. Each entry in the DIFAT is a sector identifier (SECT), where valid values represent the sector numbers of FAT sectors, 0xFFFFFFFE indicates the end of the DIFAT chain (ENDOFCHAIN), and 0xFFFFFFFF denotes a free sector or an unused DIFAT entry (FREESECT). Additionally, DIFAT sectors themselves are marked in the FAT with the special value DIFSECT (0xFFFFFFFC) to reserve space for them. This structure enables the CFB to manage large numbers of FAT sectors indirectly, supporting files that exceed the space limitations of the file header alone.

The DIFAT is primarily located in the file header and extended into dedicated DIFAT sectors as needed. The header reserves the first 109 entries (DIFAT[0] through DIFAT[108]) at byte offsets 76 through 511 (436 bytes total). With 512-byte sectors, 109 FAT sectors of 128 entries each address 109 × 128 × 512 bytes, so the header alone suffices for files smaller than approximately 7 MB. For larger files, additional DIFAT entries are stored in DIFAT sectors, whose count is specified in the header's "Number of DIFAT Sectors" field (byte offset 72, a 32-bit unsigned integer). The chain of these DIFAT sectors begins at the sector number given in the header's "DIFAT Start Sector Location" field (byte offset 68), allowing the DIFAT to scale dynamically.

Each DIFAT sector has a capacity determined by the sector size minus space for chaining. In a 512-byte sector (version 3 files), it holds 127 entries (512 / 4 - 1), with the final 4 bytes as the "Next DIFAT Sector Location" field pointing to the subsequent DIFAT sector or ENDOFCHAIN to terminate the chain. For 4,096-byte sectors (version 4 files), this expands to 1,023 entries (4,096 / 4 - 1). This design theoretically supports up to around 4 billion FAT sectors, limited by the 32-bit addressing in the FAT itself, enabling CFB files to handle vast amounts of data through indirect mapping.

The DIFAT's primary purpose is to provide a complete, ordered list of all FAT sector locations, which is essential for reconstructing the full FAT array before accessing stream or storage data. DIFAT sectors form a singly linked chain starting from the header's start sector, where each sector's last field links to the next, ensuring sequential access to all entries. The header's initial 109 entries are concatenated with those from the chained sectors to form the complete DIFAT array, with index n pointing to the (n+1)th FAT sector. The DIFSECT markers in the FAT reserve the DIFAT sectors themselves, preventing their reuse for data.

Validation of the DIFAT ensures file integrity by confirming it forms a complete, non-duplicative list of unique FAT sector locations without cycles or invalid references. Implementers must verify that all sector numbers are valid (no greater than the maximum regular sector number, MAXREGSECT = 0xFFFFFFFA), that the chain terminates properly with ENDOFCHAIN, and that no sector is referenced multiple times across the DIFAT or FAT. Invalid DIFAT entries, such as those pointing beyond the file end or creating loops, can lead to parsing failures or vulnerabilities like denial-of-service from excessive reads. Full validation requires loading the entire DIFAT and checking it against the FAT, which is computationally intensive for large files but necessary for robust parsers.
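
The capacity figures above follow from simple arithmetic, which the sketch below spells out. It is illustrative only: `difat_sectors_needed` and `header_only_capacity` are hypothetical names, and the capacity counts every sector a header-only FAT can address, ignoring the fact that some of those sectors are consumed by the FAT and directory themselves.

```c
#include <stdint.h>

#define HEADER_DIFAT_ENTRIES 109u

/* Number of DIFAT sectors needed to list `fat_sectors` FAT sectors. */
uint32_t difat_sectors_needed(uint32_t fat_sectors, unsigned sector_shift)
{
    /* Each DIFAT sector gives up its last 4-byte slot to the next-link. */
    uint32_t per_sector = (1u << sector_shift) / 4 - 1;

    if (fat_sectors <= HEADER_DIFAT_ENTRIES)
        return 0;

    uint32_t extra = fat_sectors - HEADER_DIFAT_ENTRIES;
    return (extra + per_sector - 1) / per_sector;   /* round up */
}

/* Bytes addressable when only the header's 109 DIFAT entries are used. */
uint64_t header_only_capacity(unsigned sector_shift)
{
    uint64_t entries_per_fat_sector = (1u << sector_shift) / 4;
    return (HEADER_DIFAT_ENTRIES * entries_per_fat_sector) << sector_shift;
}
```

For 512-byte sectors `header_only_capacity(9)` evaluates to 7,143,424 bytes, which is the "approximately 7 MB" threshold mentioned above.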

File Allocation Table (FAT)

The File Allocation Table (FAT) serves as the primary mechanism for managing the allocation and chaining of sectors belonging to large streams in the Compound File Binary Format, enabling efficient navigation through non-contiguous data blocks within the file. Each FAT sector consists of an array of 32-bit unsigned integers (DWORDs), with the number of entries determined by the sector size divided by 4 bytes; for example, a standard 512-byte sector accommodates 128 entries, while a 4,096-byte sector holds 1,024 entries. These entries map a given sector index to the location of the subsequent sector in a stream's chain, so that stream data is reconstructed by linking sectors logically rather than requiring physical contiguity on disk.

FAT entry values encode the status and linkage of sectors using specific constants defined in the format specification. A value from 0x00000000 through the maximum valid sector number (0xFFFFFFFA for normal sectors) represents the index of the next sector in the chain, allowing streams to span arbitrary locations in the file. The constant 0xFFFFFFFE denotes ENDOFCHAIN, signaling the termination of a sector chain. Entries set to 0xFFFFFFFF indicate FREESECT, marking unallocated sectors that can be reused. Sectors used for the FAT or DIFAT are marked with special values, FATSECT (0xFFFFFFFD) for FAT sectors and DIFSECT (0xFFFFFFFC) for DIFAT sectors, to reserve them and prevent reuse. In this way the FAT acts as both an allocation map and a set of linked lists.

The FAT sectors themselves are not stored contiguously but are referenced by the Double-Indirect File Allocation Table (DIFAT), which provides their sector indices, with the total number of FAT sectors specified in the file header's csectFat field (a 32-bit unsigned integer at offset 0x2C). This design allows the FAT to scale with file size, supporting up to approximately 4 billion sectors in theory due to the 32-bit addressing, though practical limits are imposed by the overall file size and sector shift values in the header. To resolve the chain for a stream starting at sector S, the reader processes sector S, then retrieves the next sector number from the FAT entry at index S, repeating until an ENDOFCHAIN value is encountered; this traversal reconstructs the full stream without loading the entire file into memory.

Allocation in the FAT follows strict rules to maintain integrity: sectors assigned to a stream form a unidirectional chain where each points only to the next, ensuring no overlaps or branches, as each sector can belong to at most one chain. When extending a stream, a free sector (marked FREESECT) is selected, its own entry is set to ENDOFCHAIN, and the previous end-of-chain entry is revised to point to it; the chain remains logically contiguous but may be physically scattered across the file. This approach supports dynamic growth of normal streams, those at or above the mini stream cutoff, as distinct from smaller streams handled elsewhere.

Detection of corruption in FAT chains is essential for robust file handling, with invalid configurations indicating structural damage. Common errors include cycles, where a chain loops back on itself (e.g., sector A points to B, B to A), out-of-bounds pointers exceeding the number of sectors in the file, or stream chains that run into sectors reserved for the FAT or DIFAT; such anomalies trigger repair attempts or file rejection in compliant readers. Parsers should verify chain integrity during parsing to prevent infinite loops or data loss.
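
Extending a chain by one sector can be sketched as follows. The example assumes the FAT is held in memory as a single array and uses a plain linear scan for the first free entry; real implementations may pick free sectors differently or grow the file when none is free. `fat_append` is a hypothetical helper name.

```c
#include <stdint.h>

#define FREESECT   0xFFFFFFFFu
#define ENDOFCHAIN 0xFFFFFFFEu

/* Appends one free sector to the chain whose current last sector is
   `tail`; pass ENDOFCHAIN as `tail` to start a brand-new chain.
   Precondition: `tail` is ENDOFCHAIN or a valid index below `nfat`.
   Returns the newly allocated sector number, or FREESECT if the FAT
   has no free entry left. */
uint32_t fat_append(uint32_t *fat, uint32_t nfat, uint32_t tail)
{
    for (uint32_t s = 0; s < nfat; s++) {
        if (fat[s] == FREESECT) {
            fat[s] = ENDOFCHAIN;      /* the new sector ends the chain */
            if (tail != ENDOFCHAIN)
                fat[tail] = s;        /* old tail now links forward */
            return s;
        }
    }
    return FREESECT;
}
```

Writing the new sector's ENDOFCHAIN marker before relinking the old tail means the chain is always terminated, even if the update is interrupted between the two stores.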

Mini File Allocation Table (Mini FAT)

The Mini File Allocation Table (Mini FAT) serves as an allocation mechanism within the Compound File Binary Format (CFBF) specifically for managing small streams that are below a defined size threshold, enabling efficient use of space without the overhead of full-sized sectors. Streams smaller than the Mini Stream Cutoff value (specified in bytes 56 through 59 of the file header as 0x00001000, or 4096 bytes) are allocated using the Mini FAT and stored in the Mini Stream, while larger streams use the standard File Allocation Table (FAT). This threshold ensures that small data objects, such as metadata or short content streams, avoid wasting space in larger 512-byte or 4096-byte sectors.

Structurally, the Mini FAT mirrors the FAT but is adapted for mini sectors, consisting of a chain of 32-bit entries that represent sector numbers within the Mini Stream rather than the main file. Each entry points to the next mini sector index, with the number of entries per Mini FAT sector varying by the overall file's sector size: 128 entries for 512-byte sectors (Major Version 3) or 1024 entries for 4096-byte sectors (Major Version 4). The Mini FAT sectors themselves are stored as a chain in normal file sectors, beginning at the location indicated by the Mini FAT Start Sector field in the header (bytes 60 through 63, a 4-byte unsigned integer), with the total count provided in the subsequent Number of Mini FAT Sectors field (bytes 64 through 67). If no small streams exist, the Mini FAT Start Sector is set to the end-of-chain marker 0xFFFFFFFE, indicating that the Mini FAT and Mini Stream are unnecessary.

Mini sectors are fixed at 64 bytes each, providing finer granularity for small data allocation compared to standard sectors. In chain mechanics, each Mini FAT entry holds a 32-bit value representing the index of the next mini sector in the Mini Stream; to access the data, this index is multiplied by 64 to obtain the byte offset within the Mini Stream. The chain terminates with the value 0xFFFFFFFE (ENDOFCHAIN), signaling the end of the allocated sectors for a given stream. This supports compact storage of streams under 4096 bytes, avoiding the space that would be wasted by giving each one a full sector.

The Mini FAT works together with the Mini Stream, a dedicated stream owned by the root storage (directory entry index 0) whose starting sector is referenced in the root entry's Starting Sector Location field. All mini sectors for small streams are contained within this Mini Stream, which itself is chained via the standard FAT like any other stream, so small streams are reached by combining the Mini FAT's indexing with the Mini Stream's FAT chain. The Mini FAT thus operates as a lightweight allocator tailored to the Mini Stream's 64-byte granularity, optimizing the CFBF for compound files with numerous small components.
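
Locating a mini sector in the file therefore takes two steps: convert the mini sector index to an offset inside the Mini Stream, then map that offset onto the Mini Stream's own chain of regular sectors. The sketch below assumes that chain has already been resolved through the FAT into an array, and that regular sector n starts at (n + 1) × sector size; `mini_sector_offset` is a hypothetical name.

```c
#include <stdint.h>

#define MINI_SECTOR_SIZE 64u

/* File offset of mini sector `mini`. `chain` lists the regular sectors
   of the Mini Stream in order (`chain_len` entries). Returns 0 if the
   mini sector lies beyond the end of the Mini Stream; offset 0 is safe
   as an error value because the header lives there. */
uint64_t mini_sector_offset(uint32_t mini, const uint32_t *chain,
                            uint32_t chain_len, unsigned sector_shift)
{
    uint64_t pos    = (uint64_t)mini * MINI_SECTOR_SIZE; /* in Mini Stream */
    uint64_t idx    = pos >> sector_shift;               /* chain element  */
    uint64_t within = pos & ((1u << sector_shift) - 1);  /* inside sector  */

    if (idx >= chain_len)
        return 0;
    return (((uint64_t)chain[idx] + 1) << sector_shift) + within;
}
```

With 512-byte sectors, each regular sector of the Mini Stream holds eight mini sectors, so mini sector 8 is the first one in the second sector of the chain.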

Object Hierarchy

Directory Entries

The directory entries in the Compound File Binary Format (CFB) constitute an array of fixed-size records that define the hierarchical structure of storage and stream objects within the file, serving as the metadata backbone for object navigation and properties. These entries are organized as a virtual stream composed of one or more directory sectors, which are chained together using the File Allocation Table (FAT). The chain begins at the sector index specified in the file header's _sectDirStart field, typically sector 1 in simple files, and continues until an end-of-chain marker (0xFFFFFFFE) is encountered. Each 512-byte directory sector accommodates four 128-byte entries; directories with more entries span multiple sectors. The array covers the whole chain, and unused slots are marked by an object type of 0 and an empty name (all zeros in the name field).

Each directory entry is precisely 128 bytes long and encodes essential metadata for an object, including its name, type, relationships in the hierarchy, timestamps, and location or size information. The structure follows a rigid byte layout, as outlined in the following table:
| Byte Offset | Size (bytes) | Field Name | Description |
|---|---|---|---|
| 0x00 | 64 | _ab (Name) | Unicode (UTF-16LE) name as 32 wide characters, null-terminated and zero-padded to 64 bytes; supports up to 31 characters plus null terminator. |
| 0x40 | 2 | _cb (Name Length) | Length of the name in bytes (0 to 64, multiple of 2), including the null terminator. |
| 0x42 | 1 | _mse (Type) | Object type: 0 (invalid/empty), 1 (storage object), 2 (stream object), 5 (root entry). |
| 0x43 | 1 | _bflags (Color) | Node color for red-black tree balancing: 0 (red), 1 (black). |
| 0x44 | 4 | _sidLeftSib | Index (SID) of the left sibling entry in the red-black tree. |
| 0x48 | 4 | _sidRightSib | Index (SID) of the right sibling entry in the red-black tree. |
| 0x4C | 4 | _sidChild | Index (SID) of the first child entry (for storage objects only). |
| 0x50 | 16 | _clsId (CLSID) | Class identifier (GUID) for the storage object; unused for streams. |
| 0x60 | 4 | _dwUserFlags | State bits for storage objects (low 4 bits: version number 0-15; higher bits reserved and zero). Ignored for streams. |
| 0x64 | 8 | _time[0] | Creation timestamp in FILETIME format (100-nanosecond intervals since January 1, 1601 UTC); for storage objects. |
| 0x6C | 8 | _time[1] | Modification timestamp in FILETIME format; for storage objects. |
| 0x74 | 4 | _sectStart | Starting sector index for the object's data chain (for streams) or size in sectors if empty; for root, points to Mini Stream. |
| 0x78 | 8 | _ulSize | Size of the stream in bytes (for streams and root); 0 for empty objects. |
This fixed layout allows consistent parsing across implementations, with all multi-byte values stored in little-endian byte order. For stream objects, the _sectStart and _ulSize fields give the starting sector and the size in bytes of the stream's data; for storage objects other than the root they are zero.

The root entry, always at index 0 (SID 0), is a special storage object of type 5 (STGTY_ROOT), conventionally named "Root Entry" (some legacy files use a shortened name), and serves as the top-level container for the entire file. It has no siblings (both sibling fields hold 0xFFFFFFFF) and owns the Mini Stream: its _sectStart points to the first sector of the Mini Stream, which holds the data of small streams (under 4096 bytes), and its _ulSize gives the Mini Stream's total length in bytes. The root entry's child SID leads to the top-level storages and streams, making it the file's equivalent of a root directory.

The directory entries collectively form a tree-structured hierarchy through sibling and child pointers. The children of each storage are organized as a red-black tree so that lookups and insertions by name stay efficient. A storage's _sidChild points to the root node of that tree, and each node's _sidLeftSib and _sidRightSib point to its left and right subtrees. Nodes are ordered first by name length and then by comparing the uppercased names character by character. The color flags maintain the red-black invariants: the root of each tree is black, no red node has a red child, and every path from the root of a tree down to a leaf contains the same number of black nodes. SIDs (stream IDs) are zero-based indices into the directory array, so they remain valid as the directory grows. Entries store no parent pointer; an entry's parent can only be found by walking down from the root or by searching for the storage whose child tree contains it. This design gives CFB its file-system-like organization, with nested directories (storages) and files (streams) inside a single binary file.
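The fixed layout lends itself to decoding with a single format string. The following is a minimal Python sketch, assuming the thirteen fields from _ab through _ulSize fill the whole 128-byte record; the names `ENTRY_STRUCT`, `DirEntry`, and `parse_dir_entry` are illustrative and not part of any existing library.

```python
import struct
from typing import NamedTuple

# One directory entry, little-endian, with no padding:
# 64s name | H name length | B type | B color | III left/right/child SIDs |
# 16s CLSID | I state bits | QQ timestamps | I start sector | Q size
ENTRY_STRUCT = struct.Struct("<64sHBBIII16sIQQIQ")
NOSTREAM = 0xFFFFFFFF  # value used for "no sibling" / "no child"


class DirEntry(NamedTuple):
    name: str
    obj_type: int
    color: int
    left: int
    right: int
    child: int
    clsid: bytes
    state_bits: int
    created: int
    modified: int
    start_sector: int
    size: int


def parse_dir_entry(raw: bytes) -> DirEntry:
    """Decode one raw 128-byte directory entry."""
    (name_raw, name_len, obj_type, color, left, right, child,
     clsid, state_bits, created, modified, start, size) = ENTRY_STRUCT.unpack(raw)
    # name_len counts bytes and includes the 2-byte null terminator.
    usable = max(min(name_len, 64) - 2, 0)
    name = name_raw[:usable].decode("utf-16-le", errors="replace")
    return DirEntry(name, obj_type, color, left, right, child,
                    clsid, state_bits, created, modified, start, size)


# Build a synthetic root entry and decode it again.
_name = "Root Entry".encode("utf-16-le")
raw_root = ENTRY_STRUCT.pack(_name, len(_name) + 2, 5, 1, NOSTREAM, NOSTREAM, 1,
                             bytes(16), 0, 0, 0, 3, 128)
root = parse_dir_entry(raw_root)
```

Decoding with `errors="replace"` is a design choice for tolerant readers; a strict validator might prefer to reject entries whose name bytes are not valid UTF-16.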

Storage Objects

In the Compound File Binary Format (CFB), storage objects are container elements that organize the file's hierarchy, much like directories in a traditional file system. They are defined by directory entries whose type field (_mse) is set to STGTY_STORAGE (value 1); they hold child storages or streams but store no data themselves. The hierarchy begins with the root storage, which corresponds to directory entry SID 0 and acts as the top-level container. Child objects, either additional storages or streams, are reached through the _sidChild field of the parent's directory entry, which allows nested folders that mirror a directory tree. Nesting can be arbitrarily deep, and the children of each storage are connected through their _sidLeftSib and _sidRightSib fields to form a red-black tree that keeps them ordered and balanced.

Storage objects carry the standard metadata fields of a directory entry: a Unicode name stored in the _ab array (padded to 64 bytes) with its length in _cb, creation and modification timestamps in the _time array (FILETIME format), and a 16-byte CLSID in _clsId that identifies the storage's class. The _bflags field holds the node's color (0 for red, 1 for black) for red-black tree maintenance, and _dwUserFlags holds the state bits.

To traverse the hierarchy, a reader parses the directory array, an ordered list of all entries indexed by SID, and reconstructs the tree from the child and sibling links. Entries carry no parent field, so an entry's parent is the storage whose child tree, entered through _sidChild, contains it. The red-black properties (no red node with a red child, and the same number of black nodes on every path from a tree's root to a leaf) keep each child tree balanced, but they do not by themselves rule out cycles in a damaged file, so a robust reader also tracks which SIDs it has already visited. The root storage has a special role: it owns the Mini Stream, recording its starting sector in _sectStart and its size in _ulSize, which gives access to the data of small streams.

Unlike streams, storage objects hold no data of their own, so _sectStart and _ulSize are 0 for every storage except the root, where they describe the Mini Stream. For validation, each storage's child tree should be a proper red-black tree, with a black root and the same number of black nodes on every path from that root to a leaf; an empty storage is indicated by a _sidChild of 0xFFFFFFFF (NOSTREAM). These rules help readers detect malformed files.

Stream Objects

In the Compound File Binary Format (CFB), stream objects are the leaf-level data containers of the hierarchy. They work like individual files in a file system, holding sequences of raw bytes that applications can read or write. Each stream object is defined by a directory entry with an object type value of 0x02, and its parent must be a storage object or the root storage. Unlike storage objects, which organize the hierarchy, stream objects are endpoints for data and cannot contain further objects.

How a stream's data is allocated and accessed depends on its size, given by the 64-bit stream size field of its directory entry. For streams of 4,096 bytes or more, the starting sector location field holds a sector number in the main file, and the data occupies full-sized sectors (512 or 4,096 bytes) chained through the File Allocation Table (FAT). To access the data, a reader follows the FAT chain from the starting sector and reads successive sectors until the stated size is reached. For streams smaller than 4,096 bytes, the starting sector location is instead a mini sector index into the Mini Stream, and the data occupies 64-byte mini sectors chained through the Mini File Allocation Table (Mini FAT). This dual mechanism reduces wasted space for small payloads by using the finer-grained Mini Stream.

Empty streams, indicated by a stream size of zero, need no sector allocation and typically have a starting sector location of NOSTREAM (0xFFFFFFFF); they act as placeholders in the hierarchy without consuming storage space. The directory entry for a stream object provides the metadata needed to locate and retrieve its data: the name (up to 31 characters, null-terminated), the size, and the starting location of the appropriate allocation chain.

Common examples of stream objects include user-facing data, such as the main document content in Word files (the "WordDocument" stream) or embedded images in OLE documents, and internal metadata streams such as property sets (for example, "SummaryInformation"). These streams carry application-specific payloads while following the CFB allocation rules.
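The size-based choice between the FAT and the Mini FAT can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a complete reader: the FAT and Mini FAT are taken to be already loaded as lists of integers, the Mini Stream is already assembled as a byte string, and sector n is assumed to start at byte offset (n + 1) × sector size because the header occupies the first sector-sized slot. The function names are hypothetical.

```python
ENDOFCHAIN = 0xFFFFFFFE
FREESECT = 0xFFFFFFFF
MINI_CUTOFF = 4096       # streams below this size live in the Mini Stream
MINI_SECTOR_SIZE = 64


def follow_chain(table, start):
    """Yield the sector numbers of one chain, guarding against loops."""
    seen, sect = set(), start
    while sect != ENDOFCHAIN:
        if sect in seen or sect >= len(table):
            raise ValueError("corrupt allocation chain")
        seen.add(sect)
        yield sect
        sect = table[sect]


def read_stream(file_bytes, fat, minifat, mini_stream, sector_size, start, size):
    """Return the bytes of a stream given its directory-entry start and size."""
    if size == 0:
        return b""  # empty streams own no sectors
    if size >= MINI_CUTOFF:
        chunks = [file_bytes[(s + 1) * sector_size:(s + 2) * sector_size]
                  for s in follow_chain(fat, start)]
    else:
        chunks = [mini_stream[s * MINI_SECTOR_SIZE:(s + 1) * MINI_SECTOR_SIZE]
                  for s in follow_chain(minifat, start)]
    return b"".join(chunks)[:size]  # drop the slack in the last sector


# Synthetic example: a 4100-byte stream in nine consecutive 512-byte sectors.
payload = bytes(i % 251 for i in range(4100))
file_bytes = bytes(512) + payload + bytes(9 * 512 - len(payload))
fat = [1, 2, 3, 4, 5, 6, 7, 8, ENDOFCHAIN]
big = read_stream(file_bytes, fat, [], b"", 512, start=0, size=4100)

# A 100-byte stream stored in mini sectors 0 and 2 of a 256-byte Mini Stream.
mini_stream = bytes(range(256))
minifat = [2, ENDOFCHAIN, ENDOFCHAIN, FREESECT]
small = read_stream(file_bytes, fat, minifat, mini_stream, 512, start=0, size=100)
```

Truncating the joined chunks to the stated size is what discards the unused tail of the final sector or mini sector.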

Specialized Components

Mini Stream

The Mini Stream is a specialized internal stream of the Compound File Binary Format that stores the data of streams too small to justify allocation in full-sized sectors. Its location is given by the starting sector field of the root directory entry, whose value depends on the file's layout, and its total size is given by the root entry's size field. The Mini Stream itself is an ordinary large stream: it can span multiple sectors, which are allocated and chained through the standard File Allocation Table (FAT). Owned by the root entry, it aggregates the data of all small streams, which avoids spending a full sector and a FAT entry on each tiny allocation.

Internally, the Mini Stream is divided into mini sectors of exactly 64 bytes each, indexed from 0 up to (Mini Stream size / 64) - 1. Mini sectors are the allocation units for small stream contents, so data can be placed without the waste of larger sector sizes. To read a small stream, the starting sector field of its directory entry is used as the first mini sector index within the Mini Stream, and subsequent mini sectors are found by following the chain in the Mini File Allocation Table (Mini FAT). The last mini sector of a stream may be only partly filled, with the remaining bytes up to the 64-byte boundary unused, and entirely unused mini sectors are marked as free in the Mini FAT so that they can be reused.

The Mini Stream is one of the five primary internal structures of the format, alongside the Double-Indirect File Allocation Table (DIFAT) sectors, FAT sectors, Mini FAT sectors, and directory sectors. It is not exposed as a user-defined object but is essential to the file's structural integrity and to efficient handling of small data. Small streams, defined as those smaller than 4096 bytes (the cutoff below which full sector use would be inefficient), are stored here.
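Because the Mini Stream is itself stored in regular sectors, a mini sector index has to be translated twice: first to a byte position inside the Mini Stream, then to the regular sector that holds that position. The Python sketch below illustrates the arithmetic, assuming the Mini Stream's own sector chain has already been read from the FAT and that sector n begins at file offset (n + 1) × sector size; the function names are hypothetical.

```python
def mini_sector_count(mini_stream_size, mini_size=64):
    """Number of whole mini sectors in a Mini Stream of the given byte size."""
    return mini_stream_size // mini_size


def mini_sector_file_offset(mini_index, mini_stream_chain,
                            sector_size=512, mini_size=64):
    """Map a mini sector index to an absolute byte offset in the file."""
    byte_pos = mini_index * mini_size               # position inside the Mini Stream
    chain_pos, within = divmod(byte_pos, sector_size)
    sector = mini_stream_chain[chain_pos]           # regular sector holding that byte
    return (sector + 1) * sector_size + within      # +1 skips the header slot


# Example: a 1024-byte Mini Stream stored in regular sectors 3 and 7.
chain = [3, 7]
count = mini_sector_count(1024)
first_offset = mini_sector_file_offset(0, chain)
ninth_offset = mini_sector_file_offset(9, chain)
```

With 512-byte sectors each regular sector holds eight mini sectors, so mini sector 9 is the second mini sector of the second sector in the chain.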

Range Lock Sector Allocation

The Range Lock Sector in the Compound File Binary Format (CFB) supports byte-range locking, which enables concurrency, transactions, and multi-user access in shared file environments by reserving specific file offsets for locks so that overlapping modifications can be prevented. This matters mainly in collaborative scenarios where several users or processes open the same compound file at once.

The Range Lock Sector is the single sector that covers the fixed file byte range 0x7FFFFF00 to 0x7FFFFFFF, immediately below the 2 GB boundary. It holds no user data and no fields such as lock counts or range boundaries; it is simply reserved space for system-level or application-level locking, so that no data sector overlaps this area. The header, DIFAT, FAT, Mini FAT, and directory chains must not reference this sector. With 512-byte sectors the range would fall in sector number 0x3FFFFE, and with 4096-byte sectors it falls in sector number 0x7FFFE.

When the file grows beyond 2 GB, the sector is allocated in the FAT and marked ENDOFCHAIN (0xFFFFFFFE); if the file later shrinks below that threshold, it is released and marked FREESECT (0xFFFFFFFF). Files with 512-byte sectors are limited to 2 GB for compatibility reasons, so they never need this allocation. The FAT entry works like any other chain entry but applies only to this reserved sector.

In practice, the Range Lock Sector is used in environments that support shared OLE documents or server-based access and remains unused in typical single-user files. Although it is part of the CFB specification for large files, its relevance has diminished since the adoption of XML-based formats such as Office Open XML as the defaults in Office 2007 and later.
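The sector number quoted for the range lock follows from simple arithmetic on the reserved byte range. A short Python sketch, assuming that sector n starts at file offset (n + 1) × sector size (the header takes the first sector-sized slot); the function names are illustrative.

```python
RANGE_LOCK_FIRST = 0x7FFFFF00  # first byte of the reserved range
RANGE_LOCK_LAST = 0x7FFFFFFF   # last byte, just below the 2 GB boundary


def sector_containing(offset, sector_size):
    """Sector number whose data covers the given absolute file offset."""
    return offset // sector_size - 1


def range_lock_sector(sector_size):
    """Sector number that covers the whole range-lock byte range."""
    first = sector_containing(RANGE_LOCK_FIRST, sector_size)
    last = sector_containing(RANGE_LOCK_LAST, sector_size)
    if first != last:
        raise ValueError("range spans more than one sector")
    return first


lock_512 = range_lock_sector(512)
lock_4096 = range_lock_sector(4096)
```

For 512-byte sectors this reproduces the sector number 0x3FFFFE given in the text; for 4096-byte sectors the same calculation gives 0x7FFFE.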

Applications and Considerations

Usage in Microsoft Products

The Compound File Binary Format (CFB) is the primary container for pre-2007 Office binary documents, allowing multiple streams to be stored in a structured way within a single file. For instance, Word (.doc) files use CFB to hold streams containing text, formatting, and embedded objects; Excel (.xls) files organize worksheets, charts, and formulas into hierarchical streams and storages; and PowerPoint (.ppt) files store slides, images, and animations in the same way.

Beyond the Office suite, CFB is used by other Microsoft applications: Outlook for .msg email files, which hold the message body, attachments, and metadata in streams; Visio for .vsd files, which store shapes and connections; and Publisher for .pub layout files, which manage pages and graphics. CFB is also used by Windows itself, for example in thumbs.db thumbnail databases, which store cached image previews in a hierarchical structure. Even in the XML-based Office Open XML formats (.docx, .xlsx, .pptx) introduced in 2007, CFB persists for specific components such as embedded OLE objects and legacy binary parts.

In Object Linking and Embedding (OLE) scenarios, CFB supports the integration of objects across applications, such as an Excel chart embedded in a Word document, by representing them as sub-storages and streams that preserve the source application's data structure. For programmatic access and creation, Microsoft provides Windows APIs including StgCreateDocfile, which initializes a new compound file, and the IStorage and IStream COM interfaces, which manipulate storages and streams respectively.

Although Microsoft moved to the Office Open XML (OOXML) standard for its core document formats starting with Office 2007, CFB remains supported for backward compatibility, for Visual Basic for Applications (VBA) macros stored in binary modules, and for handling attachments or embedded legacy content. Third-party software also provides read/write support for CFB-based files to ensure interoperability with the binary formats, and forensics tools parse CFB to analyze embedded data in investigations.

Limitations and Security Issues

The Compound File Binary Format (CFB) has several inherent limitations that follow from its design as a filesystem-like container. Sector sizes are fixed at 512 bytes in version 3 and 4096 bytes in version 4, so space is allocated in whole sectors, and only streams under the 4096-byte mini stream cutoff benefit from the finer 64-byte granularity of the Mini Stream. For larger files, version 3 is capped at 2 GB for compatibility, while version 4 supports files of up to nearly 16 terabytes, the range that 32-bit sector numbers can address with 4096-byte sectors. The format has no built-in compression or encryption, so applications must implement these features separately, which increases complexity and the potential for vulnerabilities. In addition, because every access depends on the header, the File Allocation Table (FAT), and the directory, the format is poorly suited to web delivery or real-time streaming.

Performance constraints arise from the allocation mechanism. Although the format is designed to avoid complete file rewrites by allowing in-place updates to streams, repeated edits often lead to sector reallocation and fragmentation, which degrades access times over time, particularly in applications that frequently modify or resize content. Tools exist to defragment CFB files, which underlines that this is a practical issue in long-term use. FAT-based chaining can also introduce seek overhead on disk, making some access patterns less efficient than in modern flat-file or ZIP-based formats.

Security issues stem primarily from CFB's role as a flexible container for embedding executable content. Streams can store Visual Basic for Applications (VBA) macros, which facilitates malware delivery when documents are opened, since macros execute arbitrary code with the user's privileges. Embedded objects have been exploited as well: CVE-2017-11882 is a memory corruption flaw in the Equation Editor OLE component, reachable through objects embedded via CFB structures in RTF files, that allows remote code execution. Similarly, CVE-2022-30190 (Follina) abuses OLE object handling in Office documents to invoke the Microsoft Support Diagnostic Tool for command execution with minimal user interaction. The absence of native digital signing or integrity checks exposes files to tampering, allowing attackers to modify streams undetected.

Mitigations include Microsoft Office features such as Protected View, which opens potentially unsafe files, including legacy CFB-based ones from the internet, in a sandboxed read-only mode that blocks macro execution and OLE loading. Default macro disabling and user prompts further reduce risk, and the transition to Office Open XML (OOXML) in Office 2007 and later has reduced CFB usage for new documents in favor of ZIP-based structures with better security controls.

In digital forensics, CFB's opaque binary nature complicates inspection, because data is distributed across fragmented sectors without clear boundaries. Specialized tools such as oletools extract and analyze streams, including macros and embedded objects, which aids malware detection. As an aging format, CFB also faces obsolescence challenges: recent specification revisions, such as the October 2024 update to [MS-CFB], address parsing bugs, and version 4's larger sectors ease some size limits, but ongoing CVEs, such as CVE-2025-21298 (a zero-click RCE vulnerability in OLE object handling), show that risks persist in legacy deployments.
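The size figures discussed in this section can be checked with a little arithmetic. The Python sketch below is illustrative only: it assumes that the largest regular sector number is 0xFFFFFFFA (the MAXREGSECT constant from the [MS-CFB] specification, which is not stated in this article) and that the header occupies one sector-sized slot before sector 0. The function names are hypothetical.

```python
MAXREGSECT = 0xFFFFFFFA  # assumed largest regular sector number


def max_addressable_bytes(sector_size):
    """Upper bound on file size: header slot plus sectors 0..MAXREGSECT."""
    return (MAXREGSECT + 2) * sector_size


def allocated_bytes(stream_size, unit):
    """Space consumed when a stream is stored in whole units of `unit` bytes."""
    return -(-stream_size // unit) * unit  # ceiling division


v4_limit = max_addressable_bytes(4096)   # just under 16 TiB
v3_limit = max_addressable_bytes(512)    # addressable bound; capped at 2 GB in practice

# A 100-byte stream wastes far less space in 64-byte mini sectors
# than it would in a full 512-byte or 4096-byte sector.
in_mini = allocated_bytes(100, 64)
in_512 = allocated_bytes(100, 512)
in_4096 = allocated_bytes(100, 4096)
```

The addressable bound for 512-byte sectors is far above 2 GB, which is consistent with the text's description of the 2 GB cap as a compatibility limit rather than an addressing limit.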
