File format
from Wikipedia
Figure: the same sound stored as a wav file (2.1 megabytes) and as an ogg file (154 kilobytes).

A file format is the way that information is encoded for storage in a computer file. It may describe the encoding at various levels of abstraction, including low-level bit and byte layout as well as high-level organization such as markup and tabular structure. A file format may be standardized (and either proprietary or open) or it may be an ad hoc convention.

Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for different types of multimedia including any combination of audio and video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, Scalable Vector Graphics, and the source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes.

Specification

Some file formats have a published specification describing the format and possibly how to verify correctness of data in that format. Such a document is not available for every format; sometimes because the format is considered a trade secret, and sometimes because a document was not authored. Sometimes, a format is defined de facto by the behavior of the program that accesses the file.

If no specification is available, a developer might reverse engineer the format by inspecting files in that format, or might acquire the specification for a fee, possibly after signing a non-disclosure agreement. Because of the time and cost of these approaches, file formats with publicly available specifications tend to be supported by more programs.

Intellectual property protection

Patent law (rather than copyright) can be used to protect the intellectual property inherent in a file format. Although a patent for a file format is not directly permitted under US law, some formats encode data using a patented algorithm. For example, prior to 2004, using compression with the GIF file format required the use of a patented algorithm, and though the patent owner did not initially enforce their patent, they later began collecting royalty fees. This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the GIF patent expired in the US in mid-2003, and worldwide in mid-2004.

Identification

Both users and applications need to identify a file's format so that the file can be used appropriately. Generally, the methods for identification vary by operating system, with each approach having its advantages and disadvantages.

Filename extension

One popular method used by many operating systems, including Windows, macOS, CP/M, MS-DOS, VMS, and VM/CMS, is to indicate the format of a file with a suffix of the file name, known as the extension. For example, an HTML document is identified by a file name that ends with .html or .htm, and a GIF image by .gif.

In the now-antiquated FAT file system, file names were limited to eight characters for the base name plus a three-character extension, known as an 8.3 filename. Due to the prevalence of this naming scheme, many formats still use three-character extensions even though modern systems support longer extensions. Since there is no standardized list of extensions, more than one format can use the same extension – especially for three-letter extensions since there is a limited number of three-letter combinations. This situation can confuse both users and applications.
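As an illustration of how shallow extension-based detection is, the sketch below maps a name's suffix to a format label without reading any file content. The small EXTENSION_MAP table is an invented stand-in for the much larger set of associations an operating system or MIME database maintains.

```python
import os

# Hypothetical mapping from extensions to format names; real systems
# consult the OS registry or a MIME database instead of a tiny table.
EXTENSION_MAP = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".png": "PNG image",
}

def guess_format(filename: str) -> str:
    """Guess a file's format from its name alone (no content is read)."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_MAP.get(ext.lower(), "unknown")

print(guess_format("report.HTM"))   # HTML document
print(guess_format("photo.gif"))    # GIF image
print(guess_format("archive.xyz"))  # unknown
```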

One implication of indicating the file type with the extension is that users and applications can be tricked into treating a file as a different format simply by renaming it. For example, an HTML file can be treated as plain text by renaming it with a .txt extension. Although this strategy is sometimes useful, it can confuse less technical users, who may accidentally make a file unusable (or "lose" it) by changing its extension. To try to avoid this scenario, Windows and macOS support hiding the extension.

Hiding the extension, however, can create the appearance of multiple files with the same name in the same folder, which is confusing. For example, an image may be needed both in .eps format (for publishing) and in .png format (for web sites), and one might give both the same base name, such as CompanyLogo.eps and CompanyLogo.png. With extensions hidden, both appear to be named CompanyLogo.

Hiding extensions can also pose a security risk.[1] For example, a malicious user could create an executable program with an innocent name such as "Holiday photo.jpg.exe". The ".exe" would be hidden and an unsuspecting user would see "Holiday photo.jpg", which would appear to be a JPEG image, usually unable to harm the machine. However, the operating system would still see the ".exe" extension and run the program, which would then be able to cause harm to the computer. The same is true with files with only one extension: as it is not shown to the user, no information about the file can be deduced without explicitly investigating the file. To further trick users, it is possible to store an icon inside the program, in which case some operating systems' icon assignment for the executable file (.exe) would be overridden with an icon commonly used to represent JPEG images, making the program look like an image. Extensions can also be spoofed: some Microsoft Word macro viruses create a Word file in template format and save it with a .doc extension. Since Word generally ignores extensions and looks at the format of the file, these would open as templates, execute, and spread the virus.[citation needed] This represents a practical problem for Windows systems where extension-hiding is turned on by default.
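A hedged illustration of one mitigation is a check that flags names whose visible extension looks benign while the real (final) extension is executable. The extension sets below are assumptions chosen for the example, not an authoritative list.

```python
# Small, illustrative check (not a complete defence) that flags names like
# "Holiday photo.jpg.exe", where a benign-looking extension hides an
# executable one.  Both extension sets are assumptions for the example.
DANGEROUS = {".exe", ".com", ".scr", ".bat", ".cmd"}
BENIGN_LOOKING = {".jpg", ".jpeg", ".png", ".gif", ".txt", ".pdf", ".doc"}

def looks_deceptive(filename: str) -> bool:
    parts = filename.lower().rsplit(".", 2)
    if len(parts) < 3:
        return False                      # fewer than two extensions
    _, inner, outer = parts
    return ("." + outer) in DANGEROUS and ("." + inner) in BENIGN_LOOKING

print(looks_deceptive("Holiday photo.jpg.exe"))  # True
print(looks_deceptive("Holiday photo.jpg"))      # False
```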

Internal metadata

A file's format may be indicated inside the file itself – either as information intended for this purpose or as identifiable data within the format that can be used for identification even though that is not its intended purpose.

Intentionally placed identification information is often located at the beginning of a file, since that location is relatively easy for both users and applications to read. When the information at the beginning of the file is a structure that contains other metadata, the structure is often called a file header. When the file starts with a relatively small datum that only indicates the format, it is often called a magic number.

File header

The metadata contained in a file header are usually stored at the start of the file, but might be present in other areas too, often including the end, depending on the file format or the type of data contained. Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule. Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as a text editor or a hexadecimal editor.

As well as indicating the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and color space, and optionally authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used (Exif), and so on. Such metadata may be used by software reading or interpreting the file during the loading process and afterwards.

File headers may be used by an operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the directory information. For instance, when a graphic file manager has to display the contents of a folder, it must read the headers of many files before it can display the appropriate icons, but these will be located in different places on the storage medium thus taking longer to access. A folder containing many files with complex metadata such as thumbnail information may require considerable time before it can be displayed.

If a header is hard-coded in binary form in such a way that the header itself needs complex interpretation in order to be recognized (especially when the intent is to protect the metadata content), there is a risk that the file format will be misinterpreted, or that the header was badly written at the source. The result can be corrupt metadata which, in extreme cases, renders the file unreadable.

A more complex example of file headers are those used for wrapper (or container) file formats.

Magic number

One way to incorporate file type metadata is to store a "magic number" inside the file itself. Originally, this term was used for 2-byte identifiers at the start of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE html, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.

The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type.
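The following sketch shows the idea of a magic-number test in Python: read the first few bytes and compare them against a small table of signatures. The table is a tiny illustrative subset; real tools such as file(1) consult a much larger magic database.

```python
# Minimal magic-number sniffing sketch; the signature table is illustrative,
# not an exhaustive database.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a",            "GIF image (87a)"),
    (b"GIF89a",            "GIF image (89a)"),
    (b"%PDF-",             "PDF document"),
    (b"PK\x03\x04",        "ZIP-based container"),
]

def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(16)                 # enough for the signatures above
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown (no matching signature)"
```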

So-called shebang lines in script files are a special case of magic numbers. There, the magic number consists of human-readable text within the file that identifies a specific interpreter and options to be passed to it.
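For example, a script whose first two bytes are "#!" is handed to the named interpreter; here the interpreter happens to be Python:

```python
#!/usr/bin/env python3
# The first line above is the shebang: the kernel treats its leading "#!"
# bytes as a magic number and passes the rest of the file to the named
# interpreter.
print("run by the interpreter named on line 1")
```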

Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives.

External metadata

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself.

This approach keeps the metadata separate from both the main data and the name, but is also less portable than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.

Zip files and other archive files offer one way of bundling metadata with data. A utility program collects multiple files, together with metadata about each file and the folders/directories they came from, into one new file (e.g. a zip file with the extension .zip). The new file is also compressed and possibly encrypted, but is now transmissible as a single file across operating systems, by FTP or as an email attachment. At the destination, the single file has to be unpacked by a compatible utility to be useful.

Mac OS type-codes

The classic Mac OS Hierarchical File System (HFS), its successor HFS+, and the Apple File System store codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes. They could be any 4-byte sequence, but were often selected so that the ASCII representation formed a sequence of meaningful characters, such as an abbreviation of the application's name or the developer's initials. For instance, a HyperCard "stack" file has a creator of WILD (from HyperCard's previous name, "WildCard") and a type of STAK. The BBEdit text editor has a creator code of R*ch, referring to its original programmer, Rich Siegel. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, a user could have several text files, all with the type code TEXT, that each open in a different program due to having differing creator codes. This feature was intended so that, for example, human-readable plain-text files could be opened in a general-purpose text editor, while programming or HTML code files would open in a specialized editor or IDE. However, the feature was often a source of user confusion, as which program would launch when a file was double-clicked was often unpredictable.

RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions—e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.

macOS uniform type identifiers (UTIs)

A Uniform Type Identifier (UTI) is a method used in macOS for uniquely identifying "typed" classes of entities, such as file formats. It was developed by Apple as a replacement for OSType (type & creator codes).

The UTI is a Core Foundation string in reverse-DNS form. Some common and standard types use a domain called public (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.

In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including:

  • Pasteboard data
  • Folders (directories)
  • Translatable types (as handled by the Translation Manager)
  • Bundles
  • Frameworks
  • Streaming data
  • Aliases and symlinks

VSAM Catalog

In IBM OS/VS through z/OS, the VSAM catalog (prior to ICF catalogs) and the VSAM Volume Record in the VSAM Volume Data Set (VVDS) (with ICF catalogs) identifies the type of VSAM dataset.

VTOC

In IBM OS/360 through z/OS, a format 1 or 7 Data Set Control Block (DSCB) in the Volume Table of Contents (VTOC) identifies the Dataset Organization (DSORG) of the dataset described by it.

OS/2 extended attributes

The HPFS, FAT12, and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.

The NTFS filesystem also allows storage of OS/2 extended attributes, as one of the file forks, but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.

POSIX extended attributes

On Unix and Unix-like systems, the ext2, ext3, ext4, ReiserFS version 3, XFS, JFS, FFS, and HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name.
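As a sketch of how such attributes can be used for format tagging on Linux (assuming a filesystem with xattr support; the attribute name user.format is an arbitrary choice for the example), Python's os module exposes the xattr calls directly:

```python
import os

# Linux-only sketch: attach and read back a format tag as a POSIX extended
# attribute.  "user.format" is an arbitrary attribute name chosen here, and
# the target filesystem must support extended attributes.
path = "example.dat"
open(path, "wb").close()                      # create an empty file

os.setxattr(path, "user.format", b"image/png")
print(os.listxattr(path))                     # ['user.format']
print(os.getxattr(path, "user.format"))       # b'image/png'
```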

PRONOM unique identifiers (PUIDs)

The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique, and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of the UK government and some digital preservation programs, the PUID scheme does provide greater granularity than most alternative schemes.

MIME types

MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardised system of identifiers (managed by IANA) comprising a type and a sub-type separated by a slash, such as text/html or image/gif. They were originally intended as a way of identifying what type of file was attached to an e-mail, independently of the source and target operating systems. MIME types identify files on BeOS, AmigaOS 4.0, and MorphOS, as well as storing unique application signatures for application launching. In AmigaOS and MorphOS, the MIME type system works in parallel with the Amiga-specific Datatype system.
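Python's standard mimetypes module gives a feel for the extension-to-MIME-type mapping; the exact results can vary slightly with the platform's installed type maps.

```python
import mimetypes

# The standard-library mimetypes module maps between file names and MIME
# types using tables derived from IANA registrations and common conventions.
print(mimetypes.guess_type("page.html"))       # ('text/html', None)
print(mimetypes.guess_type("photo.gif"))       # ('image/gif', None)
print(mimetypes.guess_extension("image/png"))  # '.png'
```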

There are problems with MIME types, though: several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.

File format identifiers (FFIDs)

File format identifiers (FFIDs) are another way, not widely used, to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An FFID is composed of digits in the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the organization of origin or the maintainer (a value in a company/standards-organization database), and the two following digits categorize the type of file in hexadecimal. The final part is the usual filename extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number, and 000000001 indicates the International Organization for Standardization (ISO).

File content based format identification

Another, less popular, way to identify the file format is to examine the file contents for distinguishable patterns among file types. The contents of a file are a sequence of bytes, and a byte can take 256 distinct values (0–255). Counting the occurrences of byte values (a measure often referred to as the byte frequency distribution) therefore yields distinguishable patterns that can identify file types. Many content-based file type identification schemes use the byte frequency distribution to build representative models for each file type and apply statistical and data mining techniques to identify file types.[2]
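A minimal sketch of computing such a byte frequency distribution, which could then be fed to a statistical or machine-learning classifier:

```python
from collections import Counter

def byte_frequency_distribution(path: str) -> list[float]:
    """Return the relative frequency of each of the 256 byte values."""
    with open(path, "rb") as f:
        data = f.read()
    counts = Counter(data)          # maps byte value (0-255) to its count
    total = len(data) or 1          # avoid division by zero for empty files
    return [counts.get(b, 0) / total for b in range(256)]
```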

File structure

There are several ways to structure data in a file; the most common are described below.

Unstructured formats (raw memory dumps)

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.

This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of file is very difficult. It also creates files that might be specific to one platform or programming language (for example, a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.

The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.

Chunk-based formats

In this kind of file structure, each piece of data is embedded in a container that somehow identifies the data. The container's scope can be identified by start- and end-markers of some kind, by an explicit length field somewhere, or by fixed requirements of the file format's definition.

Throughout the 1970s, many programs used formats of this general kind, including word processors such as troff, Script, and Scribe, and database export formats such as CSV. Electronic Arts and Commodore-Amiga also used this type of format in 1985, with their Interchange File Format (IFF).

A container is sometimes called a "chunk", although "chunk" may also imply that each piece is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements.

The information that identifies a particular "chunk" may be called many different things, such as "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not the same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its associated data as such a key).

With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Depending on the actual meaning of the skipped data, this may or may not be useful (CSS explicitly defines such behavior).

This concept has been used again and again by RIFF (Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF).
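As a concrete sketch of chunk traversal, the generator below walks a PNG file's chunks (length, type, payload, CRC) and verifies each CRC. It is illustrative only and omits the per-chunk semantics a real decoder such as libpng implements.

```python
import struct
import zlib

# Sketch of a chunk walker for PNG, whose layout is: an 8-byte signature,
# then chunks of (4-byte big-endian length, 4-byte type, payload, 4-byte CRC).
def iter_png_chunks(path):
    with open(path, "rb") as f:
        if f.read(8) != b"\x89PNG\r\n\x1a\n":
            raise ValueError("not a PNG file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                break                      # truncated file
            length, ctype = struct.unpack(">I4s", header)
            payload = f.read(length)
            crc = struct.unpack(">I", f.read(4))[0]
            if zlib.crc32(ctype + payload) != crc:
                raise ValueError(f"bad CRC in {ctype!r} chunk")
            yield ctype, payload
            if ctype == b"IEND":
                break

# Unknown chunk types can simply be skipped by the caller, which is what
# gives chunk-based formats their extensibility.
```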

Indeed, any data format must somehow identify the significance of its component parts, and embedded boundary-markers are an obvious way to do so:

  • MIME headers do this with a colon-separated label at the start of each logical line. MIME headers cannot contain other MIME headers, though the data content of some headers has sub-parts that can be extracted by other conventions.
  • CSV and similar files often do this using a header record with field names, and with commas to mark the field boundaries. Like MIME, CSV has no provision for structures with more than one level.
  • XML and its kin can be loosely considered a kind of chunk-based format, since data elements are identified by markup that is akin to chunk identifiers. However, it has formal advantages such as schemas and validation, as well as the ability to represent more complex structures such as trees, DAGs, and charts. If XML is considered a "chunk" format, then SGML and its predecessor IBM GML are among the earliest examples of such formats.
  • JSON is similar to XML without schemas, cross-references, or a definition for the meaning of repeated field-names, and is often convenient for programmers.
  • YAML is similar to JSON, but uses indentation to separate data chunks and aims to be more human-readable than JSON or XML.
  • Protocol Buffers are in turn similar to JSON, notably replacing boundary-markers in the data with field numbers, which are mapped to/from names by some external mechanism.

Directory-based formats

This is another extensible format that closely resembles a file system (OLE documents are actual filesystems), in which the file is composed of "directory entries" that specify the location of the data within the file itself, as well as its signature (and in certain cases its type). Good examples of this type of file structure are disk images, executables, OLE documents, TIFF files, and libraries.

Some file formats like ODT and DOCX, being PKZIP-based, are both chunked and carry a directory.[citation needed]

The structure of a directory-based file format lends itself to modifications more easily than unstructured or chunk-based formats.[citation needed] The nature of this type of format also allows users to carefully construct files that cause reader software to do things the authors of the format never intended. An example of this is the zip bomb. Directory-based file formats also use values that point at other areas in the file; if some later value points back at data that was read earlier, the result can be an infinite loop in any reader software that assumes the input file is valid and blindly follows the pointers.[citation needed]
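For a directory-based format such as ZIP, Python's zipfile module reads the central directory so that individual members can be listed or extracted without scanning the whole archive; "example.zip" below is a placeholder path.

```python
import zipfile

# The zipfile module locates the central directory at the end of the archive,
# so members can be listed or read individually (random access) rather than
# by scanning the whole file.
with zipfile.ZipFile("example.zip") as zf:      # placeholder archive name
    for info in zf.infolist():
        print(info.filename, info.compress_size, info.file_size)
    first = zf.infolist()[0].filename
    data = zf.read(first)                       # extract a single member
```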

from Grokipedia
A file format is the standard structure and encoding method used to organize and store digital information within a computer file, enabling software applications to read, interpret, and manipulate the data accurately. This specification defines how bytes of data—represented as binary sequences of 0s and 1s—are arranged, including headers for metadata and the layout for content, ensuring compatibility across systems. File formats are commonly identified by extensions appended to filenames, such as .txt for plain text files or .pdf for portable documents, which signal the operating system to launch the suitable program for opening and processing the file. These extensions, typically three to four characters long, originated in early operating systems like MS-DOS to categorize files efficiently, though modern systems also rely on internal file headers—unique byte sequences at the beginning of the file—for more reliable identification. While extensions facilitate quick recognition, the actual format is determined by the file's internal structure, which can sometimes lead to mismatches if manually altered. The diversity of file formats reflects the breadth of digital data types, broadly categorized into text-based formats like CSV for tabular data and XML for structured markup, raster image formats such as JPEG for compressed photos and PNG for lossless graphics, audio formats including MP3 for compressed sound, video containers like MP4, and proprietary document formats like the older .doc. Binary formats dominate for efficiency in handling multimedia and executables, while open formats—publicly documented and non-proprietary—promote widespread interoperability and long-term preservation by reducing dependency on specific vendors. File formats play a pivotal role in computing by ensuring data portability, enabling seamless sharing across devices and platforms, and supporting archival integrity against technological obsolescence. Standardization efforts by bodies like the International Organization for Standardization (ISO) and the World Wide Web Consortium (W3C) have driven the adoption of robust, future-proof formats, mitigating risks in fields such as scientific research, cultural heritage, and software development where data longevity is essential.

Fundamentals

Definition and Purpose

A file format is a standardized method for encoding, organizing, and interpreting digital data within a computer file, encompassing both text-based and binary structures to ensure consistent storage and retrieval. This encoding defines the structure, layout, and semantics of the data, allowing software to parse and process it reliably across various platforms. The core purpose of file formats is to facilitate interoperability among diverse software applications, hardware devices, and operating systems, enabling seamless data exchange, transmission, and rendering without loss of integrity. They also ensure data persistence, preserving information for long-term access and reuse, which is essential for archiving, collaboration, and computational workflows. File formats are broadly distinguished as proprietary or open, with the former owned and controlled by specific organizations, often requiring proprietary software for full access and risking obsolescence due to restricted specifications. Open formats, by contrast, feature publicly documented specifications maintained by standards bodies, fostering broad compatibility and sustainability without licensing barriers. For instance, the plain text format (.txt) serves as a simple open standard for storing unformatted character data using encodings like ASCII or UTF-8, prioritizing ease of use and universality. In comparison, the PDF format exemplifies a more intricate open standard, optimized for fixed-layout documents that maintain visual fidelity during interchange. Over time, the role of file formats has expanded from rudimentary data representation on early media such as punched cards and magnetic tapes to accommodating sophisticated multimedia elements and database structures in modern computing environments.

Historical Development

The origins of file formats trace back to the early days of computing in the 1950s and 1960s, when data storage was primarily handled through physical media like punch cards and magnetic tapes. Binary executables and simple data dumps were encoded on punch cards, which served as the first automated information storage devices, allowing programs and data to be fed into mainframe computers like the IBM 701. Magnetic tapes, such as the Uniservo introduced with UNIVAC I in 1951 and the IBM 726 in 1952, enabled mass storage of binary data streams, marking a shift from manual to automated data handling in early mainframe systems. IBM's influence was profound during this era, with the development of EBCDIC (Extended Binary Coded Decimal Interchange Code) in the early 1960s for its System/360 mainframes, providing an eight-bit character encoding standard that became ubiquitous in enterprise computing despite competition from ASCII, which was standardized in 1963 for broader interoperability.

The 1970s and 1980s saw the rise of personal computing, driving the adoption of more accessible text and application-specific formats. ASCII emerged as the dominant plain-text encoding, facilitating simple file exchanges on systems like CP/M and early PCs, while proprietary formats proliferated with software like WordStar, the first widely used word processor released in 1978, which stored documents in a binary format optimized for non-visual editing. The era also birthed early multimedia formats, exemplified by the GIF (Graphics Interchange Format) introduced by CompuServe in June 1987, which used LZW compression to enable efficient color image sharing over dial-up connections. These developments reflected a transition from mainframe-centric, sequential storage to user-friendly, disk-based files on microcomputers.

In the 1990s and 2000s, the explosive growth of the World Wide Web spurred web-driven standardization of file formats for cross-platform compatibility. HTML, proposed by Tim Berners-Lee in 1990 and formalized in specifications through the mid-1990s, became the foundational markup format for web documents, while JPEG emerged in 1992 as an ISO-standardized image compression format ideal for photographs, revolutionizing online visuals. The open-source movement gained traction with XML, recommended by the W3C in 1998 as a flexible data structuring language derived from SGML, and PDF, developed by Adobe in 1993 and adopted as ISO 32000 in 2008, ensuring portable document rendering. Key events like ARPANET's establishment of FTP in 1971 laid groundwork for network file transfers, influencing later internet protocols.

The 2010s to the present have been defined by cloud computing, AI, and mobile ecosystems, emphasizing lightweight, interoperable formats for data exchange and multimedia. JSON, popularized after Douglas Crockford's 2001 specification, became the de facto standard for web APIs and configuration files by the mid-2010s due to its simplicity and native JavaScript integration. Image formats evolved with WebP, released by Google in 2010 as an open, royalty-free alternative to JPEG and PNG, optimizing for web performance. Container formats like MP4, based on MPEG-4 Part 14 standardized in 2003 but widely adopted in the streaming era, support efficient video delivery across devices. A notable controversy arose in the 1990s when Unisys enforced patents on LZW compression used in GIF, prompting alternatives like PNG in 1996 and accelerating the push toward open standards for broader interoperability.

Specification and Standards

Formal Specifications

Formal specifications for file formats are detailed technical documents that precisely define the syntax, semantics, and constraints governing how data is structured and interpreted within the format. These documents ensure interoperability by providing unambiguous rules for encoding, decoding, and validation, often employing formal notations such as Backus-Naur Form (BNF) grammars or Extended BNF (EBNF) to describe the hierarchical structure of the file. For instance, BNF is commonly used to specify the lexical and syntactic rules for binary file formats, allowing parsers to be generated automatically from the grammar. Additionally, specifications may include pseudocode to illustrate parsing algorithms, clarifying the logical steps for processing file contents without tying to a specific programming language. Key components of these specifications include definitions of file headers, which typically contain magic numbers or signatures for identification, followed by layouts for data fields that delineate offsets, lengths, and types (e.g., integers, strings, or arrays). Encoding rules are also specified, covering aspects like byte order (endianness, such as big-endian or little-endian), character sets (e.g., UTF-8), and compression methods, including algorithms like Huffman coding for entropy reduction or Lempel-Ziv-Welch (LZW) for dictionary-based compression. These elements collectively enforce consistency, preventing ambiguities that could lead to data corruption or misinterpretation across systems. Prominent examples of formal specifications include the ISO/IEC 32000 standard for Portable Document Format (PDF), which outlines the syntax for objects, streams, and cross-reference tables using a descriptive notation akin to pseudocode, ensuring device-independent rendering. For internet-related formats, Request for Comments (RFC) documents provide rigorous definitions; RFC 8259 specifies the JavaScript Object Notation (JSON) syntax using ABNF, a BNF variant, for lightweight data interchange in HTTP bodies as referenced in RFC 9110, which defines HTTP message formats. Another example is the Portable Network Graphics (PNG) specification, documented by the World Wide Web Consortium (W3C), which details chunk-based structures with CRC checksums for integrity. The development of these specifications follows iterative processes, beginning with prototypes to test feasibility, followed by public reviews and revisions to incorporate feedback. Versioning is a core aspect, as seen in PNG's progression from version 1.0 (released in 1996) to 1.2 (2003), with each iteration adding features like ancillary chunks while maintaining compatibility through errata publications that address ambiguities without altering the core structure. This evolution ensures the specification remains relevant without disrupting existing implementations. Challenges in crafting formal specifications revolve around balancing exhaustive detail for precision with readability to aid developers, often requiring modular organization to avoid overwhelming complexity. Ensuring backward compatibility is particularly demanding, as new versions must support legacy files—PNG achieves this by mandating that decoders ignore unknown chunks, preserving functionality for older encoders—while avoiding feature bloat that could fragment adoption.
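A small illustration of why specifications must pin down byte order: the same four bytes decode to different integers under big-endian and little-endian rules, as Python's struct module shows.

```python
import struct

# The same four bytes decode to different integers depending on byte order,
# which is why a format specification must state endianness explicitly.
raw = bytes([0x00, 0x00, 0x01, 0x00])
print(struct.unpack(">I", raw)[0])   # 256   (big-endian, as in PNG chunk lengths)
print(struct.unpack("<I", raw)[0])   # 65536 (little-endian, as in ZIP headers)
```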

Standardization Processes

The standardization of file formats involves collaborative efforts by international bodies to establish interoperable, widely adopted specifications that ensure compatibility across systems and applications. These processes typically begin with identifying needs for new or revised formats and culminate in formal ratification, often spanning several years due to the complexity of technical consensus-building.

Key organizations oversee file format standardization, each with domain-specific expertise. The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) Joint Technical Committee 1 (JTC 1) develops international standards for various formats, such as the JPEG image format defined in ISO/IEC 10918, which specifies digital compression for continuous-tone still images. The Internet Engineering Task Force (IETF) standardizes network-related formats through Request for Comments (RFCs), such as the Network File System (NFS) protocol in RFC 1094, enabling transparent remote file access. The World Wide Web Consortium (W3C) focuses on web technologies, including the Scalable Vector Graphics (SVG) format, an XML-based language for two-dimensional vector graphics standardized as a W3C Recommendation.

Standardization procedures generally follow structured stages to achieve consensus and technical rigor. For ISO, the process starts with a New Work Item Proposal (NWIP) submitted for a three-month vote by national bodies, followed by working group development of drafts, circulation of a Draft International Standard (DIS) for 12-week balloting requiring two-thirds approval, public comments integration, and final ratification via a Final Draft International Standard (FDIS) vote; complex formats can take 3-5 years or more. IETF processes emphasize community-driven rough consensus, progressing from Internet-Drafts to Proposed Standards via working group reviews and last-call comments, with advancement to Internet Standard status after demonstrated interoperability, often requiring 1-3 years. W3C employs a similar track, involving working drafts, candidate recommendations for implementation testing, proposed recommendations for public feedback, and final W3C Recommendation status after advisory committee approval.

Standardization can be open or proprietary, influencing accessibility and adoption. Open processes, such as those by the Organization for the Advancement of Structured Information Standards (OASIS), promote collaborative development of XML-based formats through technical committees open to members and public review, as seen in standards like the OpenDocument Format. In contrast, proprietary formats like Microsoft's original DOC transitioned to open standards via ECMA International's adoption of Office Open XML (OOXML) in 2006, followed by ISO/IEC 29500 ratification in 2008, enabling broader interoperability.

Versioning and updates ensure formats evolve with technology while maintaining backward compatibility, including deprecation of obsolete ones. Consortia like the Khronos Group manage graphics formats, developing glTF 2.0 as a royalty-free 3D asset delivery standard, ratified as ISO/IEC 12113 in 2022 through working group extensions and community input. Deprecation examples include Adobe Flash, phased out after 2020 in favor of HTML5 standards supported by W3C, due to security and performance issues, with browsers blocking Flash content from 2021.
These efforts have global impact by harmonizing formats to prevent fragmentation and promote universal access. The Unicode Consortium, for instance, maintains the Unicode Standard as a universal character encoding system, unifying diverse text representations in file formats to support worldwide languages and scripts.

Identification Techniques

Filename-Based Identification

Filename-based identification is a primary method for determining a file's format through human-readable suffixes appended to the filename, typically consisting of three or four letters following a period. For instance, the .jpg extension indicates a JPEG image file, while .docx signifies an Office Open XML document used by Microsoft Word. These extensions enable operating systems to associate files with specific applications, facilitating automatic opening and processing without deeper analysis. Conventions for these extensions are established through industry standards and registries, with the Internet Assigned Numbers Authority (IANA) maintaining an official list of media types (MIME types) that often include corresponding file extensions for common formats. Certain compound formats employ multiple extensions to denote layered structures, such as .tar.gz, where .tar represents a tape archive and .gz indicates GNU Zip compression applied atop it.

This approach originated in the 1970s with the CP/M operating system, which introduced the 8.3 filename convention limiting the base name to eight characters and the extension to three, a structure designed for efficient disk directory management on early microcomputers. Microsoft adopted this format for MS-DOS in the early 1980s to ensure compatibility with CP/M applications, enforcing the same constraints due to underlying FAT file system limitations. Over time, modern operating systems like Windows and Unix variants have evolved to support longer filenames and extensions, though legacy 8.3 compatibility remains in some contexts.

In everyday usage, file extensions drive functionality in graphical user interfaces, where file explorers use them to assign icons and default handlers, such as associating .pdf with a PDF reader. Command-line environments in Unix-like systems leverage extensions for MIME type mapping, enabling tools to route files appropriately based on suffix patterns. Automation scripts frequently parse extensions for batch processing, for example, identifying all .txt files for text indexing or .jpg for image conversion.

However, this method has notable limitations, including non-uniqueness, as a single extension like .dat can denote diverse formats such as generic data files, Amiga disk images, or database exports depending on the application. Security risks arise from spoofing, where attackers append benign extensions (e.g., .txt) to malicious executables to trick users or bypass filters, potentially leading to unintended execution. These issues highlight the superficial nature of extension-based identification compared to more robust techniques.
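A brief sketch of handling such layered extensions with Python's pathlib, which distinguishes the outermost suffix from the full chain:

```python
from pathlib import Path

# pathlib distinguishes the last suffix from the full chain of suffixes,
# which matters for layered names such as "backup.tar.gz".
p = Path("backup.tar.gz")
print(p.suffix)    # '.gz'             (outermost layer: gzip compression)
print(p.suffixes)  # ['.tar', '.gz']   (archive format plus compression)
```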

Metadata-Based Identification

Metadata-based identification of file formats relies on structured data embedded within the file or stored externally in association with it, providing a more robust mechanism than superficial naming conventions. Internal metadata, such as file headers, often includes specific byte sequences known as magic numbers that uniquely signal the format at the beginning of the file. For instance, Portable Network Graphics (PNG) files start with the eight-byte signature 89 50 4E 47 0D 0A 1A 0A in hexadecimal, which serves to verify the file's integrity and format compliance. Similarly, Executable and Linkable Format (ELF) files, commonly used for executables on Unix-like systems, begin with the four-byte magic number 7F 45 4C 46, enabling loaders to confirm the file type before processing.

External metadata complements internal indicators by leveraging operating system or application-level tags to describe file properties. In classic Mac OS, files are tagged with four-character type codes, such as 'PDF ' for Portable Document Format files, which help the system associate documents with appropriate applications. POSIX-compliant systems support extended attributes (xattrs), allowing key-value pairs like format tags to be attached to files for identification purposes, as defined in the POSIX standard for filesystem metadata. MIME types, standardized by the Internet Engineering Task Force (IETF), provide another external layer, with examples like image/png used in web and email contexts to denote content type. In digital preservation efforts, the PRONOM Persistent Unique Identifier (PUID) scheme assigns unique codes, such as fmt/12 for PNG, to catalog formats comprehensively within registries maintained by The National Archives.

Modern and legacy systems extend this approach with specialized metadata frameworks. Apple's macOS employs Uniform Type Identifiers (UTIs), abstract tags like public.jpeg for JPEG images, which unify type recognition across applications and replace older type codes. In OS/2, extended attributes (EAs) store file type information, such as .TYPE entries, enabling the Workplace Shell to categorize and handle files appropriately. Mainframe environments, like IBM z/OS, use VSAM catalogs and the Volume Table of Contents (VTOC) to maintain dataset metadata, including format details for identification and access control.

Tools and libraries automate metadata-based detection for practical use. The libmagic library, underlying the Unix file command, parses magic numbers and other metadata patterns from a compiled database to determine file types reliably across diverse formats. This integration appears in file managers like GNOME Files or macOS Finder, where it supports automated handling without relying on potentially unreliable filename extensions. Overall, metadata-based methods offer advantages in reliability and automation, as they embed or associate verifiable format information directly with the file, reducing errors from user modifications or cross-platform inconsistencies.
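A minimal sketch of driving this kind of detection from a script by shelling out to the Unix file(1) command (assumed to be installed and on PATH), which consults libmagic's signature database rather than the filename:

```python
import subprocess

# Sketch that shells out to the Unix file(1) utility, which uses libmagic's
# signature database; assumes the command is installed and on PATH.
def detect_mime(path: str) -> str:
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(detect_mime("logo.png"))   # e.g. "image/png", regardless of the name
```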

Content-Based Identification

Content-based identification involves analyzing the binary content of a file to determine its format, relying on inherent patterns, statistical properties, or structural signatures rather than external metadata or filenames. This method is particularly useful for files lacking reliable external indicators or those that have been renamed, fragmented, or altered. It employs algorithmic techniques to scan byte sequences, compute statistical measures, or apply machine learning models to classify the format with high accuracy.

One fundamental technique is byte pattern matching, also known as signature-based detection, where specific sequences of bytes, or "magic numbers," at fixed offsets within the file are compared against known format signatures. For instance, JPEG image files typically begin with the byte sequence 0xFF 0xD8 0xFF, which contains the start-of-image (SOI) marker, allowing immediate identification even in partial files. This approach is efficient for well-defined formats and is the basis for many identification tools, though it may fail if signatures are obfuscated or if the file is truncated before the pattern.

Another technique is entropy analysis, which measures the randomness or compressibility of the file's byte distribution to distinguish between file types. Text-based files, such as plain ASCII documents, exhibit low entropy (around 1-4 bits per byte) due to repetitive patterns and limited character sets, while compressed or encrypted files, like ZIP archives or ransomware-encrypted data, show high entropy (close to 8 bits per byte) indicating uniform byte distributions. This method serves as a quick preprocessing step to categorize files broadly before more detailed analysis, though it cannot pinpoint exact formats and requires combination with other techniques for precision.

For ambiguous or variant cases, machine learning classifiers are employed, training on features like byte frequency distributions or n-gram sequences extracted from known file samples. Approaches using classifiers like k-nearest neighbors (KNN) on selected byte frequency features have achieved over 90% accuracy on common formats such as DOC, EXE, GIF, HTML, JPG, and PDF. Naive Bayes classifiers, often combined with n-gram analysis of byte sequences, provide another effective method for file type detection, particularly on file fragments. These classifiers excel in handling noisy or partial data but demand large training datasets and computational resources.

Advanced methods incorporate statistical models to assign likelihood scores to potential formats while accounting for variations like endianness swaps in binary structures. Endianness differences (big-endian versus little-endian byte ordering) can alter multi-byte patterns in formats like executables or images, so tools may test both orientations during matching to resolve ambiguities. Probabilistic frameworks improve robustness against minor corruptions or format variants.

Several tools implement these techniques for practical use. TrID uses a user-contributed database of over 4,000 binary signatures to match byte patterns, providing probabilistic scores for multiple possible formats. Apache Tika integrates content detection with extraction, employing a combination of signature matching and statistical analysis to identify over 1,000 formats via its MIME type repository. FIDO (Format Identification for Digital Objects) supports fuzzy matching through PRONOM signatures, allowing tolerance for offsets and variants in archival workflows.
Additionally, the DROID tool leverages the PRONOM registry's extensive signature database—containing internal byte patterns and positional rules—for batch processing in digital preservation, achieving reliable identification across thousands of formats. These methods find applications in digital forensics, where identifying file types from disk images aids evidence recovery; malware detection, by flagging anomalous entropy in executables; and archival ingestion, ensuring format compliance in repositories. Challenges arise with obfuscated files, such as those packed or encrypted to evade signatures, and damaged files where patterns are incomplete, often requiring hybrid approaches or manual verification to maintain accuracy.
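A short sketch of the entropy measure described above, computed as Shannon entropy in bits per byte:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: low for repetitive text,
    close to 8 when the data is compressed or encrypted."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy(b"aaaaaaaaaaaaaaaa"))     # 0.0 (a single repeated byte)
print(shannon_entropy(bytes(range(256)) * 16))  # 8.0 (uniform byte distribution)
```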

Structural Organization

Unstructured Formats

Unstructured formats represent the simplest category of file structures, where data is stored as a continuous sequence of bytes without any internal headers, indices, delimiters, or metadata to define organization. These files treat the entire content as raw binary data, often saved with extensions like .bin, requiring external specifications or prior knowledge to interpret the byte layout correctly. This approach contrasts with more organized formats by eliminating any built-in structure, making the file a direct dump of memory or sensor output. Another instance is uncompressed bitmap images in raw RGB format, consisting solely of pixel data without headers, as used in certain video frame buffers or low-level graphics processing. Early audio files, such as .raw PCM recordings, store unprocessed pulse-code modulation samples as a flat byte stream, lacking encoding details like sample rate or channels. These formats find primary use in low-level input/output operations, embedded systems, and scenarios demanding minimal overhead, such as firmware loading or real-time data capture where external configuration files or code provide the necessary interpretation context, including byte offsets for specific elements. For instance, raw audio or image data requires accompanying parameters for playback or rendering. The advantages of unstructured formats lie in their simplicity and compactness, avoiding metadata overhead and thus optimizing storage and transfer efficiency in resource-constrained settings. However, they suffer from poor portability, as the absence of self-descriptive elements demands precise external knowledge, increasing the risk of misinterpretation across systems or over time. Parsing unstructured files poses significant challenges, typically involving manual examination of byte offsets to locate and extract data segments, often facilitated by hex editors that display the raw content in both hexadecimal and ASCII views for analysis. Tools like HxD enable users to navigate large binary streams, search for patterns, and perform edits without altering the file's linear nature, though this process remains labor-intensive compared to formats with built-in navigation aids.
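A sketch of reading such a raw RGB dump: the width and height are assumed external knowledge, and "frame.rgb" is a placeholder filename, since nothing inside the file records either.

```python
# Raw RGB frame: nothing in the file says how wide or tall the image is, so
# the dimensions below are external knowledge that must accompany the data.
WIDTH, HEIGHT = 640, 480               # assumed, not stored in the file

with open("frame.rgb", "rb") as f:     # placeholder filename
    data = f.read()

assert len(data) == WIDTH * HEIGHT * 3, "dimensions do not match the data"

def pixel(x: int, y: int) -> tuple[int, int, int]:
    """Return the (R, G, B) triple at column x, row y."""
    offset = (y * WIDTH + x) * 3
    return data[offset], data[offset + 1], data[offset + 2]
```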

Chunk-Based Formats

Chunk-based file formats organize data into a sequence of self-contained blocks, each identified by a unique tag, typically consisting of a chunk identifier, a length field specifying the size of the payload, and the actual data payload itself. This modular approach allows files to be parsed incrementally without requiring knowledge of the entire structure upfront. The format often begins with an overall container chunk that encapsulates subsequent sub-chunks, enabling a linear traversal of the file. For instance, the Resource Interchange File Format (RIFF), developed by Microsoft and IBM in 1991, uses a top-level "RIFF" chunk followed by a file type identifier (such as "WAVE" for audio or "AVI" for video) and then nested chunks like "fmt " for format details and "data" for the primary content.

Prominent examples illustrate this structure's application across media types. In the Portable Network Graphics (PNG) format, standardized by the World Wide Web Consortium in 1996, the file starts with an 8-byte signature, followed by chunks such as IHDR (image header, containing width, height, bit depth, and color type), one or more IDAT chunks (holding compressed image data), and IEND (marking the file's end). The Audio Interchange File Format (AIFF), introduced by Apple in 1988 based on the Interchange File Format (IFF), employs a "FORM" container chunk with sub-chunks like "COMM" for common parameters (sample rate, channels) and "SSND" for sound data. These designs facilitate handling diverse data streams, from raster images to uncompressed audio.

The chunk-based paradigm offers several advantages, particularly in extensibility, where new chunk types can be added without breaking compatibility, since parsers simply skip unrecognized chunks based on their length fields. This supports ongoing evolution, as seen in PNG's ancillary chunks for metadata like text or transparency information, which applications can ignore if unsupported. Partial parsing is another key benefit, allowing efficient access to specific sections (e.g., extracting audio format from a WAV file's "fmt " chunk without loading the entire "data" payload), which is valuable for streaming or resource-constrained environments. Error resilience is enhanced through mechanisms like cyclic redundancy checks (CRC); in PNG, each chunk includes a 32-bit CRC over its type and data fields, enabling detection of corruption during transmission or storage.

Parsing chunk-based files involves sequentially reading the identifier and length to seek to the payload, validating the chunk's integrity (e.g., via CRC where present), and processing or skipping as needed before advancing by the specified size plus any padding. Libraries streamline this process; for example, libpng, the reference implementation for PNG since 1995, provides functions to read chunks incrementally, handling decompression of IDAT payloads via zlib and supporting custom chunk callbacks for extensibility. Similar approaches apply to RIFF-based formats, where tools like those in the Windows Multimedia API parse chunks by FOURCC codes.

The evolution of chunk-based formats traces back to the 1980s amid growing multimedia demands, originating with Electronic Arts' IFF in 1985 for Amiga systems, which influenced Apple's AIFF and Microsoft's RIFF. By the early 1990s, RIFF addressed Windows multimedia needs, underpinning formats like WAV (1991) for audio interchange.
The evolution of chunk-based formats traces back to the 1980s amid growing multimedia demands, originating with Electronic Arts' IFF in 1985 for Amiga systems, which influenced Apple's AIFF and Microsoft's RIFF. By the early 1990s, RIFF addressed Windows multimedia needs, underpinning formats like WAV (1991) for audio interchange. The 1990s saw broader adoption, with PNG emerging in 1996 as a patent-free alternative to GIF, leveraging chunks for robust image handling. Today, these formats persist in containers like WebP (which has used RIFF since 2010), balancing legacy compatibility with modern requirements for metadata and partial decoding.

Directory-based formats

Directory-based file formats organize data through a central directory or index that serves as a table of contents, providing pointers to the various data sections within the file and enabling hierarchical, efficient access. The structure typically includes a dedicated section whose entries carry metadata such as byte offsets, sizes, and types, allowing applications to navigate directly to specific components without sequential scanning. Unlike simpler sequential arrangements, this approach supports non-linear retrieval, making it suitable for complex, multi-component files.

A prominent example is the ZIP archive format, whose central directory at the end of the file lists every entry with the offset of its local header and its compressed and uncompressed sizes, facilitating quick extraction of individual files. In the TAR format, header blocks placed before each file's data act as a distributed directory, recording details like file names, permissions, and lengths in 512-byte records that outline the archive's contents. The PDF format employs a cross-reference table that maps object numbers to byte offsets, enabling random access to document elements such as pages and fonts. Database files like SQLite use page references within B-tree structures to locate data pages, supporting indexed queries across the file. Container formats such as Matroska (used in MKV files) incorporate segment-level indices like the SeekHead and Cues elements, which point to tracks, clusters, and chapters for multimedia synchronization.

The core mechanism is the index entry, which stores essential metadata (typically an offset for positioning, a size for boundary definition, and a type for content interpretation) and enables random access through file seeks to the specified locations. This allows targeted reading or writing without loading the entire file into memory, improving performance in resource-constrained environments.

These formats offer efficient querying and scalability for large files, since the index permits direct access to components without scanning the rest of the file, and they often support per-entry compression to optimize storage without affecting individual retrieval. However, corruption of the index can render large portions of the file inaccessible, so robust tooling matters; the unzip utility, for example, can test a ZIP archive's structures for damage.
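
As an illustration of directory-driven access, the following sketch uses Python's standard zipfile module to list the entries recorded in a ZIP file's central directory, including each local header's offset, and then to read a single member without scanning the whole archive. The archive and member names are hypothetical.

```python
# A minimal sketch of directory-driven access using Python's standard
# zipfile module; the archive and member names are hypothetical.
import zipfile


def list_central_directory(path):
    """Print each entry recorded in the ZIP central directory, including the
    byte offset of its local header, without decompressing anything."""
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            print(f"{info.filename}: local header at byte {info.header_offset}, "
                  f"{info.compress_size} bytes compressed, "
                  f"{info.file_size} bytes uncompressed")


def read_single_member(path, member):
    """Read one member via the offsets recorded in the central directory,
    rather than scanning the whole archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.read(member)


# list_central_directory("archive.zip")
# text = read_single_member("archive.zip", "docs/readme.txt")
```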

Intellectual property protection

File formats are often protected through intellectual property mechanisms that safeguard the underlying technologies, specifications, and implementations, although the abstract concept of a format itself is generally not protectable. Patents commonly cover specific encoding algorithms used within formats, such as the Lempel-Ziv-Welch (LZW) compression algorithm integral to the Graphics Interchange Format (GIF). Unisys Corporation held U.S. Patent No. 4,558,302 for LZW; the patent expired in the United States on June 20, 2003, and its counterparts elsewhere lapsed by mid-2004. Patents may also extend to hardware implementations of format-related processes, giving control over both software and physical embodiments.

Copyright law protects the expressive elements of file formats, including the textual descriptions in specification documents and any example files, but it does not extend to functional aspects or to the format's underlying idea. Adobe Systems, for instance, copyrighted its Portable Document Format (PDF) reference manuals and distributed them under a licensing policy that permitted viewing and printing but restricted editing and redistribution, until the format's adoption as ISO 32000-1 in 2008 made the specification openly accessible. Similarly, sample implementation files accompanying a specification fall under copyright as creative works, while the format's structure remains unprotected as a method of operation.

Trade secrets further shield proprietary formats, such as the binary file formats used in pre-2007 versions of Microsoft Office (e.g., .doc and .xls), whose end-user license agreements (EULAs) explicitly prohibited reverse engineering to prevent unauthorized disclosure or replication.

Licensing arrangements govern access to protected file formats, ranging from open models to royalty-based systems. Open formats such as Portable Network Graphics (PNG) are published royalty-free for unrestricted implementation, and others released under Creative Commons licenses permit sharing with attribution for derivative works. In contrast, royalty-bearing licenses apply to patented elements in standards like the MPEG video formats, administered by Via Licensing Alliance (formerly MPEG LA), which pools essential patents and charges per-unit fees (for example, up to $0.20 per device for AVC/H.264 decoding in certain applications) to ensure collective compensation for licensors.

Intellectual property disputes over file formats have shaped their evolution, often prompting alternatives or regulatory intervention. Unisys's enforcement of its LZW patent in the 1990s led to widespread backlash and to the development of PNG in 1995 as a patent-free successor to GIF, built on the deflate compression algorithm. In the European Union, interoperability mandates under frameworks such as the European Interoperability Framework promote open standards for file formats in public-sector systems, requiring non-discriminatory access to specifications to facilitate cross-border data exchange and prevent vendor lock-in.

Digital preservation challenges

One of the primary challenges in digital preservation is the obsolescence of file formats: once-common standards such as WordPerfect's proprietary document format or Adobe Flash's multimedia files become unreadable on modern systems without specialized intervention, raising the prospect of a "digital dark age" in which vast amounts of cultural and historical data are lost to technological incompatibility. The risk grows as software and hardware evolve, leaving legacy formats unsupported by contemporary tools unless proactive measures are taken.

To mitigate obsolescence, preservation strategies include emulation, which recreates the original software environment on new hardware (for example, using DOSBox to run old executables), and migration, which converts files to more sustainable formats, such as shifting TIFF images to JPEG 2000; a migration sketch follows at the end of this section. Normalization supports these efforts by standardizing files within an archive to open, widely supported formats, ensuring accessibility without repeated conversions. These approaches balance fidelity to the original against practical usability, and each carries trade-offs, such as emulation's resource intensity or migration's potential loss of nuanced features.

Key tools and initiatives address these issues systematically. The Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) provides guidance on recommended formats to prioritize for long-term sustainability, weighing factors such as openness and breadth of support. The PRONOM registry, maintained by The National Archives (UK), catalogs file formats and assesses preservation risks based on factors like vendor support and documentation availability. Complementing these, the JHOVE validator verifies file integrity and conformance with format specifications, helping institutions detect problems early in the preservation workflow.

Technical hurdles compound these challenges, particularly proprietary lock-in, where vendor-controlled formats such as early Microsoft Office files limit access because their specifications are restricted, and undocumented variants that behave unpredictably across implementations. Formats such as PDF also often depend on specific software renderers for features like embedded fonts or transparency, risking rendering inconsistencies over time without the original viewer. These dependencies demand ongoing risk assessment to avoid silent data corruption.

Looking ahead, future trends emphasize self-describing formats that embed structural metadata and specifications directly within the file, as in PDF/A, reducing reliance on external documentation and enhancing portability. In the 2020s, blockchain technology has also been explored for provenance tracking, offering immutable records of a file's history and authenticity to support preservation in decentralized archives.
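
As a concrete illustration of the migration strategy mentioned above, the following is a hedged sketch of a single TIFF-to-JPEG 2000 conversion step using the Pillow imaging library, assuming a Pillow build with OpenJPEG support. The file names are hypothetical, and a real preservation workflow would also validate the output (for example, with a validator such as JHOVE) and review compression settings before retiring the original.

```python
# A hedged sketch of one migration step (TIFF to JPEG 2000) using Pillow,
# assuming it was built with OpenJPEG support; file names are hypothetical.
from PIL import Image


def migrate_tiff_to_jp2(src, dst):
    """Convert a TIFF master to JPEG 2000; compression settings should be
    chosen so that the result meets the archive's fidelity policy."""
    with Image.open(src) as img:
        img.save(dst, format="JPEG2000")


# migrate_tiff_to_jp2("scan_0001.tif", "scan_0001.jp2")
```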
