Hierarchical Data Format
| Hierarchical Data Format | |
|---|---|
| Filename extension | .hdf, .h4, .hdf4, .he2, .h5, .hdf5, .he5 |
| Internet media type | application/x-hdf, application/x-hdf5 |
| Magic number | \211HDF\r\n\032\n |
| Developed by | The HDF Group |
| Type of format | Scientific data format |
| Open format? | Yes |
| Website | www |
Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
In keeping with this goal, the HDF libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).[1]
The current version, HDF5, differs significantly in design and API from the major legacy version HDF4.
Early history
The quest for a portable scientific data format, originally dubbed AEHOO (All Encompassing Hierarchical Object Oriented format), was begun in 1987 by the Graphics Foundations Task Force (GFTF) at the National Center for Supercomputing Applications (NCSA). NSF grants received in 1990 and 1992 were important to the project. Around this time NASA investigated 15 different file formats for use in the Earth Observing System (EOS) project. After a two-year review process, HDF was selected as the standard data and information system.[2]
HDF4
HDF4 is the older version of the format, although still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.
HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."[3]
The HDF4 format has many limitations.[4][5] It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.
HDF5
The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 Award.[6]
HDF5 simplifies the file structure to include only two major types of object:

- Datasets, which are typed multidimensional arrays
- Groups, which are container structures that can hold datasets and other groups
This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
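A minimal C sketch of this object model is shown below, assuming the HDF5 C library is available (compile with h5cc); the file name, group names, dataset name, and values are illustrative. It creates a group hierarchy, writes a small dataset at a POSIX-like path, and attaches a string attribute as metadata.

```c
#include "hdf5.h"

int main(void)
{
    hsize_t dims[2] = {4, 6};
    double  data[4][6] = {{0}};

    /* Create a new file (truncating any existing one). */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Groups behave like directories in the file's hierarchy. */
    hid_t g1 = H5Gcreate2(file, "/simulation", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t g2 = H5Gcreate2(g1, "run01", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* A dataset is a typed multidimensional array described by a dataspace,
       addressed here by its full POSIX-like path. */
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/simulation/run01/temperature",
                             H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Attributes attach named metadata to groups or datasets. */
    hid_t ascalar = H5Screate(H5S_SCALAR);
    hid_t atype   = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, 7);                       /* "kelvin" plus NUL */
    hid_t attr = H5Acreate2(dset, "units", atype, ascalar, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, atype, "kelvin");

    H5Aclose(attr); H5Tclose(atype); H5Sclose(ascalar);
    H5Dclose(dset); H5Sclose(space);
    H5Gclose(g2); H5Gclose(g1); H5Fclose(file);
    return 0;
}
```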
In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
The latest version of NetCDF, version 4, is based on HDF5.
Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.
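As an illustration of this kind of use, the hedged sketch below creates a chunked, one-dimensional dataset with an unlimited dimension and appends a block of samples to it; the file name, dataset name, and chunk size are illustrative and would be tuned to the real access pattern.

```c
#include "hdf5.h"

int main(void)
{
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {1024};

    hid_t file  = H5Fcreate("prices.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, maxdims);

    /* Chunked layout is required for datasets with unlimited dimensions. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(file, "price", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Append a block of new samples by growing the dataset and writing
       into the newly added region through a hyperslab selection. */
    double  samples[100];
    hsize_t nnew = 100, start[1] = {0};
    for (hsize_t i = 0; i < nnew; i++) samples[i] = (double)i;

    hsize_t newsize[1] = {dims[0] + nnew};
    H5Dset_extent(dset, newsize);

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, &nnew, NULL);
    hid_t mspace = H5Screate_simple(1, &nnew, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, samples);

    H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```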
Feedback
Criticism of HDF5 follows from its monolithic design and lengthy specification.
Officially supported APIs
- C
- C++
- CLI - .NET
- Fortran, Fortran 90
- HDF5 Lite (H5LT) – a light-weight interface for C
- HDF5 Image (H5IM) – a C interface for images or rasters
- HDF5 Table (H5TB) – a C interface for tables
- HDF5 Packet Table (H5PT) – interfaces for C and C++ to handle "packet" data, accessed at high speed
- HDF5 Dimension Scale (H5DS) – allows dimension scales to be added to HDF5
- Java
See also
- Common Data Format (CDF)
- FITS, a data format used in astronomy
- GRIB (GRIdded Binary), a data format used in meteorology
- HDF Explorer
- NetCDF; the NetCDF Java library reads HDF5, HDF4, HDF-EOS, and other formats using pure Java
- Protocol Buffers - Google's data interchange format
- Zarr, a data format with similarities to HDF5
References
[edit]- ^ "Java-based HDF Viewer (HDFView)". Archived from the original on 2016-08-11. Retrieved 2014-07-25.
- ^ "History of HDF Group". Archived from the original on 21 August 2016. Retrieved 15 July 2014.
- ^ This article is based on material taken from Hierarchical Data Format at the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
- ^ How is HDF5 different from HDF4? Archived 2009-03-30 at the Wayback Machine
- ^ "Are there limitations to HDF4 files?". Archived from the original on 2016-04-19. Retrieved 2009-03-29.
- ^ R&D 100 Awards Archives Archived 2011-01-04 at the Wayback Machine
- ^ Rossant, Cyrille. "Moving away from HDF5". cyrille.rossant.net. Retrieved 21 April 2016.
Hierarchical Data Format
Overview
Definition and Purpose
The Hierarchical Data Format (HDF) is a set of file formats, primarily HDF4 and HDF5, designed for storing and organizing large, heterogeneous scientific and engineering datasets in a hierarchical structure analogous to a file system.[2] This format supports the efficient management of diverse data types, including multidimensional arrays, images, tables, and associated metadata, facilitating complex data hierarchies within a single file.[1] The primary purpose of HDF is to enable efficient storage, access, and sharing of such datasets across distributed computing environments, with built-in self-description through metadata that ensures portability between different platforms, hardware, and applications without loss of information.[3] By incorporating attributes and extensible schemas, HDF files can describe their own contents, allowing users to discover data structure and semantics independently of the originating software.[8] At its core, HDF adheres to principles of platform independence, extensibility, and optimization for high-performance input/output (I/O) operations in data-intensive scientific computing workflows.[9] These features make it suitable for handling terabyte-scale datasets in fields like astrophysics, climate modeling, and bioinformatics, where rapid data access and interoperability are critical.[10] HDF was originally developed in the late 1980s at the National Center for Supercomputing Applications (NCSA) to address the growing needs of scientific visualization and data sharing.[11] It is now developed and maintained by The HDF Group, a non-profit corporation dedicated to ensuring open access to these technologies under a BSD-like open-source license.[12][13]
Key Versions and Evolution
The Hierarchical Data Format (HDF) primarily consists of two major versions: HDF4 and HDF5. HDF4, developed at the National Center for Supercomputing Applications (NCSA) starting in 1987 and selected by NASA as the standard for Earth Observing System (EOS) data products in 1993, is a 32-bit format supporting Scientific Data Sets (SDS) for multidimensional numerical arrays, General Raster Images (GR) for 2D imagery, and annotations for file and object metadata.[14][15][16] HDF5, first released in November 1998, marked a significant evolution with 64-bit addressing to accommodate larger files, hierarchical groups for organizing datasets like directories, and extensible datasets allowing dynamic growth in size and dimensions.[17][1][18] The transition to HDF5 was motivated by HDF4's scalability limitations for handling petabyte-scale datasets anticipated in NASA's EOS mission, necessitating improvements in file size limits, parallel I/O support, and data organization.[14][19] In recognition of these innovations, HDF5 received the R&D 100 Award in 2002 for advancing data management technologies.[19][20] HDF5 maintains backward compatibility with HDF4 through dedicated libraries and conversion tools that enable reading and migration of HDF4 files, though HDF4 lacks native support for HDF5 structures.[21][18] As of 2025, the HDF Group provides ongoing maintenance for both formats, positioning HDF5 as the actively developed standard while treating HDF4 as legacy with limited updates.[22][3]
History
Early Development
The Hierarchical Data Format (HDF) originated in 1987 at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, where the Graphics Foundations Task Force initiated development to address the need for a portable, self-describing file format capable of handling large, complex scientific datasets in high-performance computing environments.[12] Led by developers including Mike Folk, the project sought to overcome limitations of existing flat file formats by enabling hierarchical organization and metadata integration for multidimensional data storage, particularly in supercomputing applications like plasma physics simulations.[23][24] Early efforts evolved from the AEHOO (All Encompassing Hierarchical Object Oriented format) prototype, a conceptual framework focused on architecture-independent libraries for array-based data in distributed scientific workflows.[25] By 1988, this had matured into a basic HDF prototype supporting raster images and general scientific data, emphasizing extensibility and cross-platform compatibility to facilitate data sharing among researchers.[26] A key milestone came in 1993 when NASA's Earth Observing System (EOS) evaluated HDF among more than a dozen candidate formats, selecting it for its hierarchical structure, self-describing capabilities, and suitability for managing extensible, high-volume scientific datasets in Earth science applications.[27][28] This endorsement highlighted HDF's potential to support evolving data needs in high-performance computing beyond initial prototypes.
HDF4 Introduction and Adoption
The Hierarchical Data Format version 4 (HDF4) emerged as a key standard in the early 1990s, with NASA selecting it in 1993 as the official file format for all data products generated by the Earth Observing System (EOS) project after evaluating over a dozen alternatives. This choice was driven by HDF4's design for efficient storage and access of complex scientific datasets, including its emphasis on portability across diverse computing environments such as UNIX workstations, VMS systems, and other platforms prevalent at the time. Developed at the National Center for Supercomputing Applications (NCSA), HDF4 built on earlier prototypes to provide a robust, self-describing format suitable for large-scale data sharing in scientific computing.[15][28][29] Adoption of HDF4 accelerated through its integration into NASA's EOS infrastructure, particularly for handling satellite-derived data from instruments like the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Terra and Aqua satellites, which produced terabytes of imagery and geophysical measurements stored in HDF4 files. The format's flexibility in supporting multidimensional arrays, raster images, and tabular data made it ideal for applications in climate modeling, where it facilitated the organization of gridded atmospheric and oceanographic datasets, as well as in astronomy for archiving spectral and imaging observations. By enabling seamless data exchange between heterogeneous systems, HDF4 addressed critical needs in interdisciplinary research, fostering its uptake in government and academic projects focused on environmental monitoring and simulation.[30][31][32] In the mid-1990s, HDF4 underwent key enhancements that bolstered its appeal, including built-in support for annotations to add descriptive metadata and color palettes for rendering scientific visualizations, which were essential for tools handling image-based datasets. NCSA's active distribution efforts, including free software libraries and utilities, drove community expansion, resulting in widespread use across over 100 specialized applications by 1998 for data analysis, conversion, and visualization in fields like remote sensing and computational science. This growth was supported by ongoing maintenance at NCSA, which ensured compatibility and reliability until 2006, when responsibilities transitioned to The HDF Group, a non-profit spin-off dedicated to sustaining HDF technologies.[33][34][11]
Transition to HDF5
As scientific datasets grew rapidly in the 1990s, particularly with projections for terabyte-scale data from NASA's Earth Observing System (EOS), the limitations of HDF4 became increasingly apparent, including its 2 GB file size cap and restrictions on the number of objects (around 20,000 per file), which hindered handling of complex, large-scale data.[16][35] Additionally, HDF4's partial support for hierarchical organization and lack of native parallel I/O and advanced compression features for unlimited dimensions proved inadequate for emerging high-performance computing needs in fields like climate modeling and simulations.[35][24] Development of HDF5 began in 1996 at the National Center for Supercomputing Applications (NCSA), under the auspices of what would become The HDF Group, with initial funding from the U.S. Department of Energy's (DOE) Advanced Simulation and Computing (ASC) program to address scalable data management for supercomputing applications at labs like Lawrence Livermore, Los Alamos, and Sandia.[25] The project incorporated input from key stakeholders, including NASA for EOS data requirements and DOE laboratories for parallel processing demands, evolving from an initial "Big HDF" concept to a comprehensive redesign.[25] A beta version was released in 1998, followed by the full 1.0 release later that year, marking a significant shift toward supporting unlimited file sizes, true hierarchical structures, and enhanced portability across platforms.[1][25] The initial rollout emphasized tools for backward compatibility to ease migration from HDF4, such as conversion utilities and APIs designed to read legacy files where possible, particularly through extensions like HDF-EOS for NASA applications.[34] By 2000, HDF5 saw adoption in NASA's Earth Science Data Systems, including EOS projects, where it facilitated handling of growing satellite data volumes while maintaining interoperability with existing HDF4-based workflows.[34] Key milestones included the 2002 R&D 100 Award, recognizing HDF5's innovations in high-performance data storage, shared by NCSA and three DOE labs.[36] Subsequent iterations have continued to incorporate user feedback on usability, with ongoing releases addressing performance and integration challenges.[24]
HDF4
Data Model
The Hierarchical Data Format version 4 (HDF4) employs a data model centered on a collection of core object types that facilitate the storage and organization of scientific and graphical data in a self-describing manner. This model emphasizes modularity, allowing diverse data forms to coexist within a single file while maintaining accessibility through standardized interfaces. Introduced as part of HDF4's development in the mid-1990s, the model builds on earlier HDF concepts to support multidimensional numerical arrays, images, tabular structures, and associated metadata, enabling efficient data sharing in scientific computing environments.[29] At the heart of the model are Scientific Data Sets (SDS), which represent multidimensional arrays optimized for numerical scientific data. Each SDS can encompass up to four dimensions, with extents defining the size along each axis, and supports a range of atomic data types including 32-bit and 64-bit IEEE floating-point numbers, signed and unsigned 8-bit, 16-bit, and 32-bit integers, and 8-bit characters.[37] General Raster Images (GR) provide dedicated storage for two-dimensional pixel data, accommodating multi-component pixels (e.g., RGB values) and associated elements like aspect ratios and color lookups, making them suitable for interleaved or interlace image representations.[37] Complementing these, Vdata objects function as extensible tables, storing sequences of fixed-length records with heterogeneous fields that can include integers, floating-point values, and character strings, akin to simple relational storage without enforced schemas.[18] Finally, annotations attach unstructured text metadata—such as descriptions, labels, or file-level notes—to any object or the entire file, enhancing interpretability without altering the primary data structure.[37] The overall hierarchy in HDF4 is fundamentally flat, with all core objects positioned at the top level and interconnected via tag/reference pairs rather than nested containers. Tags, which are 16-bit unsigned integers serving as type identifiers (e.g., ranging from predefined values like 702 for SDS), uniquely classify objects, while references (also 16-bit unsigned integers) provide instance-specific handles for locating and linking them within the file.[37] This scheme supports limited grouping through optional Vgroup objects (tag 1965), which bundle related items but do not impose deep nesting, preserving a streamlined, non-hierarchical topology. The model accommodates up to four dimensions for SDS and two for GR, with data types restricted to the enumerated atomic varieties to ensure portability across platforms.[18] Inter-object relationships are established through reference linkages, enabling dependencies such as a GR image referencing a separate palette object (tag 204) for color interpretation or an annotation tying descriptive text to an SDS.[37] Self-describing properties are integral, as each object's data descriptor block includes a header detailing its tag (indicating type), rank (dimensionality), extents (array shapes), and native data type, allowing applications to dynamically parse and interpret content without external schemas.[37] A representative example is an SDS object, identified by the SDS creation tag (702), storing a 100×200×50 three-dimensional array of 32-bit floats. 
This SDS can include attributes for units (tag 705), such as "kelvin" for temperature measurements, and calibration parameters (tag 731), including scale factors and offsets to convert raw values to physical units, thereby embedding essential context directly within the data structure.[37]
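A hedged C sketch of such an SDS, using the HDF4 SD interface from the mfhdf library, is given below; the file name, dataset name, and calibration values are illustrative, and the attribute-setting calls shown (SDsetdatastrs, SDsetcal) are one of several ways HDF4 can attach this metadata.

```c
#include "mfhdf.h"
#include <stdlib.h>

int main(void)
{
    int32    dims[3] = {100, 200, 50};
    float32 *data = (float32 *)calloc(100 * 200 * 50, sizeof(float32));

    /* Open the SD interface on a new file and create a 3-D 32-bit float SDS. */
    int32 sd_id  = SDstart("temperature.hdf", DFACC_CREATE);
    int32 sds_id = SDcreate(sd_id, "temperature", DFNT_FLOAT32, 3, dims);

    /* Attach descriptive strings (label, unit, format, coordinate system)
       and a linear calibration (scale, offset) as metadata. */
    SDsetdatastrs(sds_id, "air temperature", "kelvin", "F7.2", "cartesian");
    SDsetcal(sds_id, 0.01, 0.0, 273.15, 0.0, DFNT_FLOAT32);

    /* Write the whole array in one call (start at the origin, no striding). */
    int32 start[3] = {0, 0, 0};
    SDwritedata(sds_id, start, NULL, dims, (void *)data);

    SDendaccess(sds_id);
    SDend(sd_id);
    free(data);
    return 0;
}
```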
File Structure and Features
The HDF4 file format organizes data on disk through a series of tagged blocks, where each block is identified by a unique tag-reference pair and contains either metadata or raw data elements. An optional user block precedes the main file content, allowing for user-defined information such as file descriptors, followed immediately by the file signature consisting of the magic bytes 0x0E 0x03 0x13 0x01, which uniquely identify the file as HDF4.[38] After the signature, the file includes a free list implemented as linked data descriptor (DD) blocks to manage unused or freed space, and the central object directory, also composed of linked DD blocks, which enumerates all tag/reference pairs along with their offsets and lengths to facilitate navigation to data objects throughout the file.[38] Data storage in HDF4 emphasizes flexibility for scientific arrays, such as Scientific Data Sets (SDS), which can be allocated contiguously as a single continuous byte block for straightforward access or in a chunked layout using fixed-size chunks stored non-contiguously via chunk tables, enabling efficient partial reads and dynamic appending.[38] Compression is integrated through extensible tags, supporting methods like Run-Length Encoding (RLE) via DFTAG_RLE, adaptive Skipping-Huffman, NBIT, GNU ZIP (Deflate), JPEG for images (DFTAG_JPEG or DFTAG_GREYJPEG), and IMCOMP (DFTAG_IMC), with compressed data prefixed by special headers to indicate the scheme.[38] Addressing employs 32-bit offsets and lengths for all elements, constraining individual file sizes to a maximum of 2 GB and references to 16-bit unique values per tag.[38] Key features include support for compound datatypes, which permit structured records akin to C structs, implemented in Vdatas (DFTAG_VS) for tabular data with multiple fields or in SDS for multidimensional arrays with heterogeneous elements.[38] The format's extensibility arises from its tag system, with 16-bit tags divided into NCSA-reserved (1–32767), user-definable (32768–64999), and expansion-reserved (65000–65535) ranges, plus extended tags (e.g., offset +0x4000) for advanced capabilities like linking blocks (EXT_LINKED), external file references (EXT_EXTERN), chunking (DFTAG_CHUNK), and compression (SPECIAL_COMP).[38] Operational features center on robust I/O via the low-level H-interface, with functions like Hread and Hwrite supporting seek-based sequential access optimized for high-throughput environments such as supercomputers, including linked blocks for unbounded appending and external storage via HXcreate, though without built-in parallelism.[38] Inspection tools, such as h4dump, allow users to dump and examine file contents, including tag structures and data hierarchies, aiding debugging and verification.[39]
Limitations and Legacy
One of the primary technical constraints of HDF4 is its use of 32-bit offsets and lengths in data descriptors, which restricts individual file sizes to approximately 2 GB.[37] This limitation arises from the format's reliance on signed 32-bit integers for byte positions and data extents, preventing seamless handling of larger datasets without external workarounds like linked blocks.[37] Additionally, HDF4 imposes a cap of around 20,000 objects per file, further constraining its capacity for complex data collections.[18] HDF4 lacks a true hierarchical structure akin to modern formats, instead depending on tag/reference pairs and Vgroups to simulate organization.[18] Tags, which are 16-bit unsigned integers ranging from 1 to 65,535, identify object types but offer limited extensibility due to their fixed namespace, with only a subset available for user-defined purposes.[37] This tagged approach results in a relatively flat data model, providing poor support for very large datasets or parallel access, as the format does not natively accommodate distributed I/O operations.[18] Performance suffers in contemporary scenarios involving multi-terabyte data, exacerbated by rigid object models and inefficient I/O for extensive reads or writes.[40] Furthermore, HDF4's attribute system is object-specific and lacks the uniform, extensible support found in successors, limiting metadata flexibility across all elements.[18] Despite these shortcomings, HDF4 retains a significant legacy role in scientific workflows, particularly within NASA's Earth Observing System (EOS) missions such as Terra and Aqua, where it underpins HDF-EOS tools for storing telemetry and derived products.[41] As of June 2025, the latest release is version 4.3.1, and the format remains maintained by The HDF Group for backward compatibility, ensuring ongoing support for legacy applications and data archives.[42][43] Conversion utilities like h4toh5 enable migration of HDF4 files to HDF5, preserving data integrity while addressing size and extensibility issues.[44] HDF4's design directly influenced HDF5's architecture, with the latter incorporating lessons from HDF4's tagged model to introduce robust hierarchies and scalability.[40] Community tools such as HDFView continue to provide unified browsing and editing capabilities for both HDF4 and HDF5 files, facilitating gradual transitions in legacy environments.[45]
HDF5
Architectural Enhancements
HDF5 introduces significant architectural improvements over its predecessor, HDF4, to enhance scalability and flexibility for handling large, complex datasets in scientific computing. These enhancements address key limitations in file size and access patterns, enabling support for exabyte-scale data management while maintaining a self-describing format.[1] A primary advancement is the adoption of 64-bit addressing, which supports file sizes up to 2^64 bytes, far surpassing HDF4's 2 GB restriction and accommodating modern big data requirements. Additionally, HDF5 integrates parallel I/O capabilities through the Message Passing Interface (MPI), permitting multiple processes to access and modify a shared file concurrently for improved performance in distributed computing environments.[46][47] The hierarchical structure is refined with groups functioning as container objects analogous to directories, organizing datasets (multidimensional arrays of typed elements) and named datatypes. Hard and soft links provide flexible referencing to these objects, enabling efficient navigation and reuse within the file hierarchy without duplicating data.[1][8] Extensibility is bolstered by allowing an unlimited number of attributes (small datasets attached to objects for metadata) directly in object headers, unlike the fixed limits in earlier formats. Pluggable I/O filters support on-the-fly data transformation, including compression algorithms such as GZIP and Szip, as well as encryption options to secure sensitive information during storage and transfer.[48] Further innovations include virtual datasets, which compose multiple source datasets across files into a unified view for seamless querying and analysis without physical reorganization. Extensible arrays, implemented via chunked storage and unlimited dimensions, facilitate dynamic growth along specified axes, making HDF5 suitable for time-series data that accumulates over time. POSIX-like semantics underpin access control, leveraging hierarchical paths for object addressing and inheriting operating system permissions for secure, familiar file handling.[49][50][46]
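The following sketch illustrates the parallel I/O capability under stated assumptions (an MPI-enabled HDF5 build and an MPI launcher such as mpirun); each rank writes one row of a shared dataset through a collective transfer. File and dataset names are illustrative.

```c
#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* File access property list: route I/O through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared 2-D dataset: one row per MPI rank. */
    hsize_t dims[2] = {(hsize_t)nprocs, 8};
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "matrix", H5T_NATIVE_INT, fspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own row as a hyperslab in the file dataspace. */
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 8};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    int row[8];
    for (int i = 0; i < 8; i++) row[i] = rank * 8 + i;

    /* Collective transfer: all ranks participate in one coordinated write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```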
Data Model and Organization
The HDF5 data model organizes data into a hierarchical structure resembling a file system, where the entire file serves as a rooted, directed graph beginning with the root group denoted by "/". This model supports complex relationships through various objects, enabling efficient logical navigation and data management without direct ties to physical storage layouts.[8] Groups act as hierarchical containers within the HDF5 file, analogous to directories, that hold datasets, other groups, and committed datatypes via links. The root group "/" forms the top-level container, allowing unlimited nesting to create a tree-like structure. Groups support hard links, which directly reference objects within the same file, and soft links, which are symbolic pointers that may point to non-existent or external objects. Additionally, external links enable connections to objects in other HDF5 files, effectively mounting external files as subgroups to integrate data across multiple files.[51][8] Datasets represent the core data storage units in HDF5, consisting of n-dimensional arrays that can have fixed or extendable (variable) shapes, supporting up to 32 dimensions. Each dataset is defined by a datatype specifying the element type and a dataspace describing the array's dimensions and layout. Datatypes include atomic types such as integers and floating-point numbers; compound types that aggregate multiple fields like structs; enumeration types for named values; opaque types for raw binary data; reference types for pointers to other objects; string types, either fixed-length or variable-length; and variable-length types for sequences of arbitrary size. Partial access to datasets is facilitated through hyperslabs, which select contiguous or patterned subsets of the array via selections in the dataspace, allowing efficient reading or writing of specific regions without loading the entire dataset. For example, a dataset might use an atomic floating-point datatype such as H5T_NATIVE_FLOAT with a fixed shape of [100, 200] to store a two-dimensional matrix of single-precision floating-point values.[50][52][53] Attributes provide key-value metadata attached to groups or datasets, serving as small, named datasets without size restrictions, unlike earlier formats. Each attribute has its own name, datatype, and dataspace, enabling the storage of descriptive information such as units (e.g., "m/s" for velocity) or provenance details directly on the object. This metadata enhances discoverability and interoperability by allowing users to annotate primary data objects flexibly.[54] Committed datatypes extend the model by allowing datatype definitions to be named and stored as independent objects within groups, promoting reuse across multiple datasets and ensuring consistency in data representation. The overall organization treats the HDF5 file as a directed graph, where groups serve as nodes, links as directed edges, and datasets/attributes as leaf or annotated elements, navigable via path names like "/group1/dataset1". This graph structure supports shared components and complex topologies beyond simple hierarchies.[52][8]
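A short sketch of a compound (struct-like) datatype follows; the struct layout, file name, and member names are illustrative. Member offsets are registered with HOFFSET so the file layout matches the in-memory struct, and the datatype is committed under a name so other datasets could reuse it.

```c
#include "hdf5.h"

typedef struct {
    int    id;
    double value;
} record_t;

int main(void)
{
    record_t records[3] = {{1, 0.5}, {2, 1.5}, {3, 2.5}};
    hsize_t  dims[1] = {3};

    hid_t file = H5Fcreate("records.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Build the compound type by registering each member at its struct offset. */
    hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(ctype, "id",    HOFFSET(record_t, id),    H5T_NATIVE_INT);
    H5Tinsert(ctype, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);

    /* Committing (naming) the datatype stores it as a reusable object in the file. */
    H5Tcommit2(file, "record_type", ctype,
               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "records", ctype, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, ctype, H5S_ALL, H5S_ALL, H5P_DEFAULT, records);

    H5Dclose(dset); H5Sclose(space); H5Tclose(ctype); H5Fclose(file);
    return 0;
}
```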
Performance Features and Extensions
The HDF5 file format employs a superblock as its versioned header, which contains essential metadata including free lists for managing unused space and references to the global heap for storing shared strings and other persistent objects. This structure facilitates efficient navigation and resource allocation within the file. Groups in HDF5 are indexed using B-trees, which enable scalable organization and quick lookups for hierarchical elements, supporting large-scale datasets without performance degradation. Chunked storage divides datasets into fixed-size blocks, allowing partial I/O operations and integration with compression filters to optimize access patterns for multidimensional data.[55] To enhance compression efficiency, particularly for binary data like images or simulations, HDF5 supports external filters such as bitshuffle, which rearranges bits within chunks before applying standard compressors like LZ4 or Zstd, achieving up to 2-4 times faster compression and decompression compared to traditional shuffling methods. Parallel HDF5 (PHDF5) extends these capabilities for distributed environments by leveraging MPI for collective I/O operations, enabling multiple processes to read and write to the same file concurrently while maintaining data consistency through MPI-IO drivers. Caching strategies in HDF5 include a raw data chunk cache, which holds recently accessed dataset portions in memory to reduce disk I/O, and a metadata cache that buffers file headers, B-tree nodes, and other structural elements, with configurable sizes and replacement policies like LRU to adapt to application workloads.[56][57][58] Introduced in HDF5 version 1.10, single-writer multiple-reader (SWMR) mode allows one process to append data to datasets while multiple readers access the file simultaneously without blocking, using relaxed consistency semantics to minimize synchronization overhead and support real-time data streaming applications. This feature includes support for SWMR datasets, where append-only writes ensure readers see monotonically increasing data without corruption. The HDF5 High-Level (HL) library provides simplified interfaces built atop the core library, offering domain-specific functions for images (H5IM), tables (H5TB), and packet tables (H5PT) to streamline common operations and reduce boilerplate code for performance-critical tasks.[59] Utility tools complement these features for file optimization and inspection; h5dump outputs the contents of an HDF5 file in a human-readable format, including datasets and attributes, while h5repack copies files with modified layouts, such as applying new filters or converting storage types, to improve compression ratios or access speeds post-creation. As of November 2025, the current HDF5 version is 2.0.0, which introduces support for dataset chunks larger than 4 GiB using 64-bit addressing and increases the default chunk cache hash table size to 8191 for better performance with large datasets. It also includes optimizations for virtual datasets, such as delayed layout copying that improves opening times by approximately 30% for datasets with many mappings. Pluggable filters enable compression with libraries like Blosc and ZFP for high-throughput handling of various data types, including in AI/ML workflows.[60][61][62]
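A hedged sketch of a SWMR writer (HDF5 1.10 or later) follows; file and dataset names and the loop bounds are illustrative. The SWMR-specific steps are requesting the latest file-format features, calling H5Fstart_swmr_write once the file structure is in place, and flushing after each append so readers that open the file with H5F_ACC_SWMR_READ can see the new data.

```c
#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file-format features. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("stream.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Appendable 1-D dataset: unlimited dimension with chunked storage. */
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {256};
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset = H5Dcreate2(file, "samples", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Switch the open file into single-writer/multiple-reader mode. */
    H5Fstart_swmr_write(file);

    for (int step = 0; step < 10; step++) {
        double  sample = step * 0.1;
        hsize_t newsize[1] = {(hsize_t)step + 1}, start[1] = {(hsize_t)step}, one = 1;

        /* Grow the dataset by one element and write into the new slot. */
        H5Dset_extent(dset, newsize);
        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, &one, NULL);
        hid_t mspace = H5Screate_simple(1, &one, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, &sample);

        /* Flush so concurrent readers see the appended data. */
        H5Dflush(dset);
        H5Sclose(mspace); H5Sclose(fspace);
    }

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    return 0;
}
```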
Programming Interfaces
Core APIs
The core APIs of the Hierarchical Data Format (HDF) provide low-level C interfaces for file manipulation, data creation, and I/O operations, forming the foundation for accessing HDF files in both HDF4 and HDF5 versions. These APIs are designed for portability across platforms and emphasize efficient handling of multidimensional scientific data. In HDF4, the APIs are organized into modular interfaces for core operations, scientific datasets (SDS), general rasters (GR), and vdata structures, while HDF5 introduces a more unified, object-oriented approach with dedicated modules for files, groups, datasets, attributes, and datatypes. While the core APIs remain stable, HDF5 2.0.0 (released November 2025) includes some API signature changes for specific functions such as H5Dread_chunk, H5Tdecode, and H5Iregister_type, along with new features like an AWS S3 backend; see the release notes for migration details.[5][63][64]
HDF4's core APIs are divided into four primary interfaces: the low-level H interface for basic file management, the SD interface for scientific data sets, the GR interface for general raster images, and the VS interface for vdata. The H interface includes functions like Hopen, which opens or creates an HDF file and returns a file identifier, and Hclose to terminate access, with access modes such as DFACC_READ, DFACC_WRITE, or DFACC_CREATE.[5] For SDS operations, SDstart initializes the interface on a file, SDcreate establishes a dataset with specified datatype, rank, and dimensions, and I/O functions such as SDreaddata and SDwritedata handle data transfer using parameters for start position, stride, and edges to enable partial reads or writes. The GR interface supports raster data through GRcreate to define an image with components, interlace mode, and dimensions, paired with GRreadimage and GRwriteimage for I/O, including palette management via GRgetlutid and GRwritelut. The VS interface handles vdata: VSattach (with a reference of -1 and write access) creates a new tabular structure whose fields are declared with VSfdefine and VSsetfields, VSdetach releases access, and VSread or VSwrite perform record-based I/O in interlaced or non-interlaced modes.[5]
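A hedged sketch of the vdata calls mentioned above follows, writing a single-field table of 32-bit integers; the file, vdata, and field names are illustrative.

```c
#include "hdf.h"

int main(void)
{
    int32 records[5] = {10, 20, 30, 40, 50};

    int32 file_id = Hopen("table.hdf", DFACC_CREATE, 0);
    Vstart(file_id);                    /* initialize the V (vgroup/vdata) interface */

    /* Attaching with reference -1 and access "w" creates a new vdata. */
    int32 vdata_id = VSattach(file_id, -1, "w");
    VSsetname(vdata_id, "measurements");
    VSfdefine(vdata_id, "counts", DFNT_INT32, 1);   /* one int32 per record */
    VSsetfields(vdata_id, "counts");

    /* Write five records in fully interleaved order. */
    VSwrite(vdata_id, (uint8 *)records, 5, FULL_INTERLACE);

    VSdetach(vdata_id);
    Vend(file_id);
    Hclose(file_id);
    return 0;
}
```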
HDF5's core APIs are structured around the H5 module and specialized submodules, including H5F for files, H5G for groups, H5D for datasets, H5A for attributes, and H5T for datatypes, offering a more hierarchical and extensible interface. Key functions include H5Fcreate to create a new file (with H5Fopen to open an existing one), returning a file identifier; H5Gcreate to form groups within the file; and H5Dcreate to define datasets with associated datatypes and dataspace dimensions. Data I/O relies on H5Dwrite and H5Dread, which support hyperslab selections (rectangular subsets defined via start, stride, count, and block parameters in the H5S dataspace module) for efficient partial access to large arrays without loading entire datasets into memory. Attribute operations use H5Acreate and H5Awrite to attach metadata to objects, while H5T functions like H5Tcreate allow custom datatype definitions, including compound and variable-length types.[63]
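The hyperslab mechanism can be sketched as below, reading a small block out of a larger two-dimensional dataset; the file and dataset paths are illustrative (they reuse the names from the earlier sketch) and error checking is omitted for brevity.

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/simulation/run01/temperature", H5P_DEFAULT);

    /* Select a 2x3 region of the file dataspace starting at row 1, column 2. */
    hid_t   fspace   = H5Dget_space(dset);
    hsize_t start[2] = {1, 2}, count[2] = {2, 3};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* The memory dataspace describes the shape of the receiving buffer. */
    double block[2][3];
    hid_t  mspace = H5Screate_simple(2, count, NULL);

    /* Only the selected region is read from disk. */
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, block);

    H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file);
    return 0;
}
```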
Common patterns across both versions include identifier-based access, where functions return opaque handles (e.g., int32 for HDF4 file IDs or hid_t in HDF5) to reference objects without exposing internal structures, and error handling through return codes like SUCCEED/FAIL in HDF4 or herr_t in HDF5, which indicates success (non-negative values) or failure (negative values) and integrates with an error stack for diagnostics via functions like H5Eprint. Customization often involves property lists in HDF5, such as H5P_DATASET_CREATE for chunking, compression, or fill values during dataset creation, or H5P_FILE_ACCESS for I/O drivers and caching; HDF4 uses simpler flags or attributes for similar purposes.[65][66][5]
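A brief sketch combining these patterns follows, assuming illustrative names: a dataset-creation property list requests chunking and gzip compression, identifiers are checked for negative values, and the default error stack is printed on failure.

```c
#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    hsize_t dims[2] = {1000, 1000}, chunk[2] = {100, 100};

    hid_t file = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) { fprintf(stderr, "H5Fcreate failed\n"); return 1; }

    /* Dataset-creation property list: chunked layout with gzip (deflate) level 6. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "image", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Most HDF5 calls return herr_t: negative on failure, with details on the
       error stack (printable via H5Eprint2). */
    herr_t status = H5Dclose(dset);
    if (status < 0)
        H5Eprint2(H5E_DEFAULT, stderr);

    H5Sclose(space); H5Pclose(dcpl); H5Fclose(file);
    return 0;
}
```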
Version differences highlight HDF5's evolution toward a more object-oriented design, with explicit support for complex datatypes via H5T and hierarchical organization mirroring the file's data model of groups and datasets, contrasting HDF4's flatter, library-specific structure. Additionally, the Parallel HDF5 (PHDF5) extension builds on the core APIs to enable parallel I/O in MPI environments, allowing operations like H5Dwrite to be performed in collective or independent modes for high-performance computing applications.[10][63]
Language Bindings and Tools
The Hierarchical Data Format (HDF) provides language bindings primarily through the HDF5 library, with official support for C, C++, Fortran 90 (extended to Fortran 2003 via ISO_C_BINDING), and Java using the Java Native Interface (JNI).[3] These bindings enable developers to access HDF5's core functionality, such as dataset creation and file I/O, while abstracting low-level C details where appropriate; for instance, the C++ wrapper offers STL-like containers for datasets and groups. Third-party bindings extend HDF5 to other languages, including Python via h5py, which provides a NumPy-compatible interface for efficient array storage and retrieval.[67] Similarly, .NET support is available through ILNumerics, offering high-level object-oriented access to HDF5 files integrated with .NET's numerical computing features.[68] HDF4 bindings are more limited, focusing on C and Fortran interfaces with basic support for scientific I/O operations, while Java access relies on community wrappers rather than official JNI implementations. In contrast to HDF5's broader ecosystem, HDF4's language support has not evolved significantly, reflecting its legacy status.
HDF5 includes specialized high-level interfaces to simplify common tasks beyond the core C API. The H5LT (Lite) interface provides straightforward functions for basic read/write operations on datasets and attributes, reducing boilerplate code for simple applications.[69] The H5IM interface handles image data, supporting formats like JPEG and PNG within HDF5 datasets for multidimensional raster storage. For time-series data, the H5PT (Packet Table) interface optimizes appendable tables with fixed-size records, suitable for streaming sensor outputs. Additionally, external links, created through the core link (H5L) interface, allow references to objects in separate files for modular data organization.
Supporting tools enhance HDF5 file manipulation and development. HDFView serves as a graphical user interface for browsing, editing, and visualizing HDF5 files, including dataset inspection and metadata editing across platforms. The h5cc utility acts as a compiler wrapper, automating the linking of HDF5 libraries for C programs to streamline builds. For file management, h5repart repartitions HDF5 files or families, enabling conversion between single files and distributed sets for parallel I/O.[70] Tools like h5jam and h5unjam embed or extract user blocks (such as metadata files) from HDF5 headers, facilitating custom annotations. Conversion utilities, including h4toh5, transform HDF4 files to HDF5 format while preserving data model compatibility, often aligning with netCDF-4 conventions.[44]
HDF5 integrates deeply with scientific ecosystems, notably as the underlying format for netCDF-4, which adds climate and geoscience-specific abstractions atop HDF5's structure. As of 2025, adoption in data science grows through bindings like HDF5.jl for Julia, enabling high-performance array operations in scientific workflows, and hdf5r or rhdf5 for R, supporting statistical analysis of large datasets in Bioconductor environments.[71][72] These bindings, together with more than 700 related projects on GitHub, underscore HDF5's role in cross-language data interoperability.[3]
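As a hedged illustration of the high-level Lite calls, the sketch below writes a small dataset and a string attribute in two calls; the file name, dataset path, and attribute values are illustrative. Link against both the core and high-level libraries (for example via h5cc with -lhdf5_hl).

```c
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hsize_t dims[2] = {2, 3};
    double  data[6] = {1, 2, 3, 4, 5, 6};

    hid_t file = H5Fcreate("lite.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* One call creates the dataspace, datatype, and dataset, then writes the buffer. */
    H5LTmake_dataset_double(file, "/velocity", 2, dims, data);

    /* Attach a string attribute to the dataset by path. */
    H5LTset_attribute_string(file, "/velocity", "units", "m/s");

    H5Fclose(file);
    return 0;
}
```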
Applications
Scientific and Engineering Uses
In astronomy and Earth sciences, the Hierarchical Data Format (HDF), particularly through the HDF-EOS extension, plays a crucial role in managing satellite imagery and environmental data from NASA's Earth Observing System (EOS). For instance, data from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on the Terra and Aqua satellites is stored in HDF-EOS format, enabling efficient handling of multispectral imagery with global coverage every 1-2 days across 36 spectral bands.[73][27] Similarly, Landsat missions utilize HDF5 for archiving high-resolution remote sensing data, such as the Level-0 Reduced (L0R) products from Landsat 8, which group observations from the Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) into multiple HDF5 files per acquisition.[74][75] In climate modeling, HDF5 supports the storage of multidimensional grids and time series, facilitating analysis of large-scale environmental datasets like those derived from Aura satellite instruments, where time series extraction from multiple HDF-EOS files (HE5) allows researchers to track variables such as atmospheric composition over extended periods.[76] In physics and simulations, HDF5 is extensively employed at U.S. Department of Energy (DOE) laboratories for handling complex datasets from plasma physics and computational fluid dynamics (CFD). DOE-funded plasma physics simulations, such as those using the Vector Particle-in-Cell (VPIC) code, rely on HDF5 for scalable I/O in exascale environments, writing massive datasets like 291 TB files from trillion-particle runs on over 298,000 cores to achieve sustained throughputs of 52 GB/s.[77][78] In CFD applications, HDF5 was specifically designed in the late 1990s to support DOE labs' largest simulations, enabling the storage of multidimensional arrays for fluid flow analyses and serving as the foundation for standards like the CFD General Notation System (CGNS), which organizes structured and unstructured grids for turbulence modeling.[79] For engineering applications, HDF5 facilitates the management of medical imaging and genomics datasets, accommodating high-dimensional and heterogeneous biological data. In medical imaging, particularly microscopy, HDF5 unifies formats for electron microscopy and functional MRI, storing terabyte-scale images with embedded metadata for efficient querying and analysis, as seen in tools like MINC for modality-neutral storage across CT, PET, and MRI scanners.[80][81] In genomics, HDF5 serves as a container for sequencing data, including SNP matrices and sequence alignments, with projects like Genomedata using it to store and compress large functional genomics datasets for rapid access to nucleotide tallies and variant information.[82][83] A key benefit of HDF in these domains is its ability to handle heterogeneous data, such as combining spectral arrays with descriptive metadata in a single file, which supports seamless integration in workflows from acquisition to analysis.[3][84] Furthermore, HDF5's parallel I/O capabilities make it suitable for petascale simulations at supercomputing centers, where it manages outputs from DOE applications like VPIC, ensuring scalability on systems with hundreds of thousands of cores without performance bottlenecks.[85][86]
Integration in Software Ecosystems
The Hierarchical Data Format (HDF), particularly HDF5, is deeply integrated into various libraries and frameworks that facilitate data handling in scientific computing and geospatial analysis. netCDF-4, a widely used library for multidimensional scientific data, is built directly on HDF5 as its underlying storage layer, enabling enhanced features like internal compression and unlimited dimensions while ensuring full interoperability with standard HDF5 tools.[87] Similarly, the Geospatial Data Abstraction Library (GDAL) provides robust support for reading and writing HDF5 files, including parsing of HDF-EOS5 metadata for grid and swath data, which is essential for geospatial workflows.[88] Commercial analysis environments like MATLAB offer native functions such as h5read and h5write for importing and exporting HDF5 datasets, supporting hierarchical structures and attributes without requiring external libraries.[89] In remote sensing applications, ENVI and IDL from NV5 Geospatial Software include built-in HDF5 readers that handle variable-length arrays and EOS extensions, streamlining data processing in Earth observation pipelines.[90]
HDF5 is embedded in several high-profile projects for data management and visualization in scientific domains. The ROOT framework, developed by CERN for high-energy physics analysis, can export datasets to HDF5 format using third-party tools like root2hdf5, allowing interoperability with broader ecosystems while leveraging ROOT's statistical tools on HDF5-stored particle collision data.[91] Open-source visualization software like ParaView utilizes HDF5 as a backend for rendering complex volumes and meshes, often via the eXtensible Data Model and Format (XDMF) which references HDF5 files for efficient loading of large-scale simulation outputs.[92]
HDF5 complies with key standards that promote its use in geospatial and Earth science contexts. It is recognized as an official Open Geospatial Consortium (OGC) standard, specifically the OGC Hierarchical Data Format Version 5 (HDF5) Core Standard, which defines its data model for encoding multidimensional arrays in spatially and temporally varying geospatial applications.[93] For NASA-specific needs, the HDF-EOS library extends HDF5 with conventions tailored to Earth Observing System (EOS) data, including geolocation metadata and swath/grid structures, ensuring compatibility across NASA's distributed archives.[94]
As of 2025, HDF5 plays a prominent role in modern software ecosystems, particularly in machine learning and cloud computing. In ML workflows, libraries like h5py enable seamless loading of HDF5 datasets into TensorFlow and PyTorch, supporting efficient handling of large training corpora with features like chunked access and compression, as highlighted in HDF Group's guidance for AI/ML applications.[95] For cloud environments, HDF5 includes adapters for object storage such as AWS S3, allowing virtual file access without full downloads through tools like the HDF5 S3 connector, which optimizes performance for distributed data processing.[96]