Hierarchical Data Format
from Wikipedia
Filename extension: .hdf, .h4, .hdf4, .he2, .h5, .hdf5, .he5
Internet media type: application/x-hdf, application/x-hdf5
Magic number: \211HDF\r\n\032\n
Developed by: The HDF Group
Type of format: Scientific data format
Open format?: Yes
Website: www.hdfgroup.org/solutions/hdf5

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

In keeping with this goal, the HDF libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).[1]

The current version, HDF5, differs significantly in design and API from the major legacy version HDF4.

Early history

The quest for a portable scientific data format, originally dubbed AEHOO (All Encompassing Hierarchical Object Oriented format), began in 1987 with the Graphics Foundations Task Force (GFTF) at the National Center for Supercomputing Applications (NCSA). NSF grants received in 1990 and 1992 were important to the project. Around this time NASA investigated 15 different file formats for use in the Earth Observing System (EOS) project. After a two-year review process, HDF was selected as the standard data and information system.[2]

HDF4

HDF4 is the older version of the format, although still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.

HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."[3]

The HDF4 format has many limitations.[4][5] It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.

HDF5

The HDF5 format is designed to address some of the limitations of the HDF4 library as well as current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 Award.[6]

HDF5 simplifies the file structure to include only two major types of object:

  • Datasets, which are typed multidimensional arrays
  • Groups, which are container structures that can hold datasets and other groups

This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
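
The following minimal C sketch illustrates this layout, assuming the HDF5 1.8+ C API: it creates a file, a group, and a dataset reachable under the POSIX-like path /simulation/temperature, and attaches one named attribute. File, group, and attribute names are illustrative, and error checking is omitted for brevity.

    /* Minimal sketch (C, HDF5 1.8+ API): create a small hierarchy and one
     * attribute, so the dataset is reachable as /simulation/temperature.
     * Names are illustrative. Compile with the h5cc wrapper, e.g. h5cc example.c */
    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {100, 200};           /* 100 x 200 array           */
        float   data[100][200] = {{0}};         /* payload (zeros here)      */
        double  dt = 0.5;                       /* value for an attribute    */

        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t group = H5Gcreate2(file, "/simulation", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Dataset: a typed multidimensional array stored under the group. */
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(group, "temperature", H5T_NATIVE_FLOAT, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* Attribute: named metadata attached directly to the dataset. */
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr   = H5Acreate2(dset, "timestep_seconds", H5T_NATIVE_DOUBLE,
                                  aspace, H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &dt);

        /* Release identifiers in reverse order of creation. */
        H5Aclose(attr);  H5Sclose(aspace);
        H5Dclose(dset);  H5Sclose(space);
        H5Gclose(group); H5Fclose(file);
        return 0;
    }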

In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.

The latest version of NetCDF, version 4, is based on HDF5.

Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.
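
A hedged sketch of this time-series pattern, assuming the standard HDF5 C API and illustrative names: a chunked, extendible one-dimensional dataset is grown with H5Dset_extent as new samples arrive, which is the usual way append-style series are stored in HDF5. Error checking is omitted for brevity.

    /* Sketch (C, HDF5 1.8+): an extendible, chunked 1-D dataset grown as new
     * time-series samples arrive. Names and sizes are illustrative. */
    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[1]    = {0};                 /* start empty              */
        hsize_t maxdims[1] = {H5S_UNLIMITED};     /* growable along time axis */
        hsize_t chunk[1]   = {1024};              /* chunking enables growth  */

        hid_t file  = H5Fcreate("prices.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);

        hid_t dset = H5Dcreate2(file, "price", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Append a block of 4 new samples: extend, then write the new slab. */
        double  samples[4] = {101.2, 101.4, 101.1, 101.9};
        hsize_t newsize[1] = {4};
        H5Dset_extent(dset, newsize);

        hid_t fspace = H5Dget_space(dset);        /* refreshed file dataspace */
        hsize_t start[1] = {0}, count[1] = {4};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(1, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, samples);

        H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }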

Feedback

Criticism of HDF5 follows from its monolithic design and lengthy specification.

  • HDF5 does not enforce the use of UTF-8, so client applications may expect ASCII in most places.
  • Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).[7]

Officially supported APIs

  • C
  • C++
  • CLI - .NET
  • Fortran, Fortran 90
  • HDF5 Lite (H5LT) – a light-weight interface for C
  • HDF5 Image (H5IM) – a C interface for images or rasters
  • HDF5 Table (H5TB) – a C interface for tables
  • HDF5 Packet Table (H5PT) – interfaces for C and C++ to handle "packet" data, accessed at high speed
  • HDF5 Dimension Scale (H5DS) – allows dimension scales to be added to HDF5
  • Java

See also

  • Common Data Format (CDF)
  • FITS, a data format used in astronomy
  • GRIB (GRIdded Binary), a data format used in meteorology
  • HDF Explorer
  • NetCDF; the NetCDF-Java library reads HDF5, HDF4, HDF-EOS and other formats using pure Java
  • Protocol Buffers - Google's data interchange format
  • Zarr, a data format with similarities to HDF5

References

from Grokipedia
The Hierarchical Data Format (HDF) is a multi-object file format and associated software designed for the storage, organization, and transfer of large volumes of scientific and engineering data, including numerical arrays, images, and metadata, in a portable and self-describing hierarchical structure. Developed to address the needs of data-intensive computing environments, HDF supports complex relationships through a model that organizes information into groups, datasets, and attributes, enabling efficient access and sharing across heterogeneous platforms without proprietary dependencies. Its two primary versions, HDF4 and HDF5, cater to evolving requirements, with HDF5 serving as the modern standard for handling massive, multidimensional datasets in fields such as astronomy and climate modeling.

Originating in the late 1980s, HDF was created at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign to facilitate the exchange of graphical and numerical data among diverse machines in a distributed scientific environment. The initial HDF (version 4.x) was released in the 1990s, focusing on multi-object files for heterogeneous data elements such as raster images and scientific grids. In response to growing data complexity and performance demands, HDF5 was introduced as a completely new format in 1998, featuring an extensible data model and high-performance I/O capabilities to support contemporary supercomputing applications. The HDF Group, a non-profit organization spun off from NCSA in 2006, now maintains and advances the technology to ensure long-term data accessibility.

At its core, HDF comprises three interconnected elements: a binary file format that encodes data hierarchically, an abstract data model for logical organization, and a software library providing application programming interfaces (APIs) in languages like C, C++, Fortran, and Java. The data model uses groups as containers analogous to directories, datasets as multidimensional arrays of homogeneous or compound elements (supporting atomic types like integers and floats, as well as variable-length strings and arrays), and attributes for attaching metadata to any object. Key features include chunking and compression for efficient storage of extendible datasets, hyperslab selections for partial I/O operations, and parallel I/O support for high-performance computing environments, allowing HDF files to scale from laptops to supercomputers. This structure makes HDF particularly suited for self-describing files in which all necessary metadata travels with the data, reducing errors in long-term archiving and interdisciplinary collaboration.

HDF's versatility has made it a de facto standard in numerous domains, powering tools across astronomy, Earth observation, and experimental physics. More than 700 open-source projects integrate HDF5, underscoring its role in modern data pipelines and workflows involving heterogeneous datasets. Ongoing developments by The HDF Group emphasize interoperability with emerging technologies, such as cloud storage and containerized environments, to sustain its relevance in an era of big data and machine learning.

Overview

Definition and Purpose

The Hierarchical Data Format (HDF) is a set of file formats, primarily HDF4 and HDF5, designed for storing and organizing large, heterogeneous scientific and engineering datasets in a hierarchical structure analogous to a file system. This format supports the efficient management of diverse data types, including multidimensional arrays, images, tables, and associated metadata, facilitating complex data hierarchies within a single file. The primary purpose of HDF is to enable efficient storage, access, and sharing of such datasets across computing environments, with built-in self-description through metadata that ensures portability between different platforms, hardware, and applications without loss of information. By incorporating attributes and extensible schemas, HDF files can describe their own contents, allowing users to discover and interpret their structure and semantics independently of the originating software.

At its core, HDF adheres to principles of platform independence, extensibility, and optimization for high-performance input/output (I/O) operations in data-intensive scientific computing workflows. These features make it suitable for handling terabyte-scale datasets in fields like astronomy, climate modeling, and bioinformatics, where rapid data access and interoperability are critical. HDF was originally developed in the late 1980s at the National Center for Supercomputing Applications (NCSA) to address the growing needs of scientific visualization and data sharing. It is now developed and maintained by The HDF Group, a non-profit corporation dedicated to ensuring continued access to these technologies under a BSD-like license.

Key Versions and Evolution

The Hierarchical Data Format (HDF) primarily consists of two major versions: HDF4 and HDF5. HDF4, developed at the National Center for Supercomputing Applications (NCSA) starting in 1987 and selected by NASA as the standard for Earth Observing System (EOS) data products in 1993, is a 32-bit format supporting Scientific Data Sets (SDS) for multidimensional numerical arrays, General Raster Images (GR) for 2D imagery, and annotations for file and object metadata. HDF5, first released in November 1998, marked a significant evolution with 64-bit addressing to accommodate larger files, hierarchical groups for organizing datasets like directories, and extensible datasets allowing dynamic growth in size and dimensions.

The transition to HDF5 was motivated by HDF4's scalability limitations for handling the petabyte-scale datasets anticipated in NASA's EOS mission, necessitating improvements in file size limits, parallel I/O support, and data organization. In recognition of these innovations, HDF5 received the R&D 100 Award in 2002 for advancing data management technologies. HDF5 maintains compatibility with HDF4 through dedicated libraries and conversion tools that enable reading and migration of HDF4 files, though HDF4 lacks native support for HDF5 structures. As of 2025, The HDF Group provides ongoing maintenance for both formats, positioning HDF5 as the actively developed standard while treating HDF4 as legacy with limited updates.

History

Early Development

The Hierarchical Data Format (HDF) originated in 1987 at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, where the Graphics Foundations Task Force initiated development to address the need for a portable, self-describing file format capable of handling large, complex scientific datasets in heterogeneous computing environments. Led by developers including Mike Folk, the project sought to overcome limitations of existing flat file formats by enabling structured organization and metadata integration for multidimensional data, particularly in supercomputing applications like plasma physics simulations.

Early efforts evolved from the AEHOO (All Encompassing Hierarchical Object Oriented format) prototype, a design focused on architecture-independent libraries for array-based data in distributed scientific workflows. This work soon matured into a basic HDF prototype supporting raster images and general scientific data, emphasizing extensibility and cross-platform compatibility to facilitate data exchange among researchers.

A key milestone came in 1993 when NASA's Earth Observing System (EOS) evaluated HDF among more than a dozen candidate formats, selecting it for its hierarchical structure, self-describing capabilities, and suitability for managing extensible, high-volume scientific datasets in Earth observation applications. This endorsement highlighted HDF's potential to support evolving data needs in scientific computing beyond its initial prototypes.

HDF4 Introduction and Adoption

The Hierarchical Data Format version 4 (HDF4) emerged as a key standard in the early 1990s, with NASA selecting it in 1993 as the official file format for all data products generated by the Earth Observing System (EOS) project after evaluating over a dozen alternatives. This choice was driven by HDF4's design for efficient storage and access of complex scientific datasets, including its emphasis on portability across diverse computing environments such as UNIX workstations, VMS systems, and other platforms prevalent at the time. Developed at the National Center for Supercomputing Applications (NCSA), HDF4 built on earlier prototypes to provide a robust, self-describing format suitable for large-scale data management in scientific computing.

Adoption of HDF4 accelerated through its integration into NASA's EOS infrastructure, particularly for handling satellite-derived data from instruments like the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Terra and Aqua satellites, which produced terabytes of imagery and geophysical measurements stored in HDF4 files. The format's flexibility in supporting multidimensional arrays, raster images, and tabular data made it ideal for applications in climate modeling, where it facilitated the organization of gridded atmospheric and oceanographic datasets, as well as in astronomy for archiving spectral and imaging observations. By enabling seamless data exchange between heterogeneous systems, HDF4 addressed critical needs in interdisciplinary research, fostering its uptake in government and academic projects focused on data analysis and simulation.

In the mid-1990s, HDF4 underwent key enhancements that bolstered its appeal, including built-in support for annotations to add descriptive metadata and color palettes for rendering scientific visualizations, which were essential for tools handling image-based datasets. NCSA's active distribution efforts, including libraries and utilities, drove community expansion, resulting in widespread use across over 100 specialized applications by 1998 for data analysis, conversion, and visualization in a variety of scientific fields. This growth was supported by ongoing maintenance at NCSA, which ensured compatibility and reliability until 2006, when responsibilities transitioned to The HDF Group, a non-profit spin-off dedicated to sustaining HDF technologies.

Transition to HDF5

As scientific datasets grew rapidly in the 1990s, particularly with projections for terabyte-scale data from NASA's Earth Observing System (EOS), the limitations of HDF4 became increasingly apparent, including its 2 GB file size cap and restrictions on the number of objects (around 20,000 per file), which hindered handling of complex, large-scale data. Additionally, HDF4's lack of native parallel I/O and of advanced compression for datasets with unlimited dimensions proved inadequate for emerging needs in fields like climate modeling and large-scale simulations.

Development of HDF5 began in 1996 at the National Center for Supercomputing Applications (NCSA), under the auspices of what would become The HDF Group, with initial funding from the U.S. Department of Energy's (DOE) Advanced Simulation and Computing (ASC) program to address scalable data storage and I/O for supercomputing applications at labs like Lawrence Livermore, Los Alamos, and Sandia. The project incorporated input from key stakeholders, including NASA for EOS data requirements and DOE laboratories for parallel processing demands, evolving from an initial "Big HDF" concept to a comprehensive redesign. A beta version was released in 1998, followed by the full 1.0 release later that year, marking a significant shift toward supporting unlimited file sizes, true hierarchical structures, and enhanced portability across platforms.

The initial rollout emphasized tools for backward compatibility to ease migration from HDF4, such as conversion utilities and APIs designed to read legacy files where possible, particularly through extensions like HDF-EOS for Earth science applications. By 2000, HDF5 saw adoption in NASA's Earth science data systems, including EOS projects, where it facilitated handling of growing data volumes while maintaining compatibility with existing HDF4-based workflows. Key milestones included the 2002 R&D 100 Award, recognizing HDF5's innovations in high-performance data storage, shared by NCSA and three DOE labs. Subsequent iterations have continued to incorporate user feedback, with ongoing releases addressing performance and integration challenges.

HDF4

Data Model

The Hierarchical Data Format version 4 (HDF4) employs a data model centered on a collection of core object types that facilitate the storage and organization of scientific and graphical data in a self-describing manner. This model emphasizes modularity, allowing diverse data forms to coexist within a single file while maintaining accessibility through standardized interfaces. Introduced as part of HDF4's development in the mid-1990s, the model builds on earlier HDF concepts to support multidimensional numerical arrays, images, tabular structures, and associated metadata, enabling efficient data management in scientific computing environments.

At the heart of the model are Scientific Data Sets (SDS), which represent multidimensional arrays optimized for numerical scientific data. Each SDS can encompass up to four dimensions, with extents defining the size along each axis, and supports a range of atomic data types including 32-bit and 64-bit IEEE floating-point numbers; signed and unsigned 8-bit, 16-bit, and 32-bit integers; and 8-bit characters. General Raster Images (GR) provide dedicated storage for two-dimensional pixel data, accommodating multi-component pixels (e.g., RGB values) and associated elements like aspect ratios and color lookup tables, making them suitable for interleaved image representations. Complementing these, Vdata objects function as extensible tables, storing sequences of fixed-length records with heterogeneous fields that can include integers, floating-point values, and character strings, akin to simple relational storage without enforced schemas. Finally, annotations attach unstructured text metadata, such as descriptions, labels, or file-level notes, to any object or to the entire file, enhancing interpretability without altering the primary data structure.

The overall hierarchy in HDF4 is fundamentally flat, with all core objects positioned at the top level and interconnected via tag/reference pairs rather than nested containers. Tags, which are 16-bit unsigned integers serving as type identifiers (e.g., the predefined value 702 for SDS), uniquely classify objects, while references (also 16-bit unsigned integers) provide instance-specific handles for locating and linking them within the file. This scheme supports limited grouping through optional Vgroup objects (tag 1965), which bundle related items but do not impose deep nesting, preserving a streamlined, non-hierarchical topology. The model accommodates up to four dimensions for SDS and two for GR, with data types restricted to the enumerated atomic varieties to ensure portability across platforms.

Inter-object relationships are established through reference linkages, enabling dependencies such as a GR image referencing a separate palette object (tag 204) for color interpretation or an annotation tying descriptive text to an SDS. Self-describing properties are integral, as each object's data descriptor block includes a header detailing its tag (indicating type), rank (dimensionality), extents (array shapes), and native data type, allowing applications to dynamically parse and interpret content without external schemas. A representative example is an SDS object, identified by the SDS creation tag (702), storing a 100×200×50 three-dimensional array of 32-bit floats. This SDS can include attributes for units (tag 705), such as "kelvin" for temperature measurements, and calibration parameters (tag 731), including scale factors and offsets to convert raw values to physical units, thereby embedding essential context directly within the data structure.
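
As a rough illustration, the following C sketch uses the HDF4 SD interface to create an SDS like the one described above and to attach a units attribute. File and object names are illustrative, the program assumes the HDF4 libraries (libmfhdf, libdf) are available, and error checking is omitted for brevity.

    /* Sketch (C, HDF4 SD interface): create one Scientific Data Set and attach
     * a "units" attribute, mirroring the SDS example described above. */
    #include "mfhdf.h"

    int main(void)
    {
        int32 dims[3] = {100, 200, 50};
        static float32 temp[100][200][50];        /* payload (zeros)          */
        int32 start[3] = {0, 0, 0};

        int32 sd_id  = SDstart("example.hdf", DFACC_CREATE);
        int32 sds_id = SDcreate(sd_id, "temperature", DFNT_FLOAT32, 3, dims);

        /* Attribute carrying physical units, stored with the SDS metadata. */
        SDsetattr(sds_id, "units", DFNT_CHAR8, 6, "kelvin");

        /* Write the whole array (NULL stride means contiguous). */
        SDwritedata(sds_id, start, NULL, dims, (VOIDP)temp);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }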

File Structure and Features

The HDF4 file format organizes data on disk as a series of tagged blocks, where each block is identified by a unique tag/reference pair and contains either metadata or data elements. An optional user block precedes the main file content, allowing for user-defined information such as file descriptors, followed immediately by the file signature consisting of the magic bytes 0x0E 0x03 0x13 0x01, which uniquely identifies the file as HDF4. After the signature, the file includes a free list implemented as linked data descriptor (DD) blocks to manage unused or freed space, and the central object directory, also composed of linked DD blocks, which enumerates all tag/reference pairs along with their offsets and lengths to facilitate direct access to objects throughout the file.

Data storage in HDF4 emphasizes flexibility for scientific arrays, such as Scientific Data Sets (SDS), which can be allocated contiguously as a single continuous byte block for straightforward access or in a chunked layout using fixed-size chunks stored non-contiguously via chunk tables, enabling efficient partial reads and dynamic appending. Compression is integrated through extensible tags, supporting methods like Run-Length Encoding (RLE) via DFTAG_RLE, adaptive Skipping-Huffman, NBIT, GNU ZIP (Deflate), JPEG for images (DFTAG_JPEG or DFTAG_GREYJPEG), and IMCOMP (DFTAG_IMC), with compressed data prefixed by special headers to indicate the scheme. Addressing employs 32-bit offsets and lengths for all elements, constraining individual file sizes to a maximum of 2 GB, with 16-bit reference numbers providing unique values per tag.

Key features include support for compound datatypes, which permit structured records akin to C structs, implemented in Vdatas (DFTAG_VS) for tabular data with multiple fields or in SDS for multidimensional arrays with heterogeneous elements. The format's extensibility arises from its tag system, with 16-bit tags divided into NCSA-reserved (1–32767), user-definable (32768–64999), and expansion-reserved (65000–65535) ranges, plus extended tags (e.g., offset +0x4000) for advanced capabilities like linked blocks (EXT_LINKED), external file references (EXT_EXTERN), chunking (DFTAG_CHUNK), and compression (SPECIAL_COMP).

Operational features center on robust I/O via the low-level H-interface, with functions like Hread and Hwrite supporting seek-based access optimized for high-throughput environments such as supercomputers, including linked blocks for unbounded appending and external data elements via HXcreate, though without built-in parallelism. Inspection tools such as h4dump allow users to dump and examine file contents, including tag structures and hierarchies, aiding debugging and verification.

Limitations and Legacy

One of the primary technical constraints of HDF4 is its use of 32-bit offsets and lengths in data descriptors, which restricts individual file sizes to approximately 2 GB. This limitation arises from the format's reliance on signed 32-bit integers for byte positions and data extents, preventing seamless handling of larger datasets without external workarounds like linked blocks. Additionally, HDF4 imposes a cap of around 20,000 objects per file, further constraining its capacity for complex data collections.

HDF4 lacks a true hierarchical structure akin to modern formats, instead depending on tag/reference pairs and Vgroups to simulate organization. Tags, which are 16-bit unsigned integers ranging from 1 to 65,535, identify object types but offer limited extensibility due to their fixed namespace, with only a subset available for user-defined purposes. This tagged approach results in a relatively flat data model, providing poor support for very large datasets or parallel access, as the format does not natively accommodate distributed I/O operations. Performance suffers in contemporary scenarios involving multi-terabyte data, exacerbated by rigid object models and inefficient I/O for extensive reads or writes. Furthermore, HDF4's attribute system is object-specific and lacks the uniform, extensible support found in its successor, limiting metadata flexibility across all elements.

Despite these shortcomings, HDF4 retains a significant legacy role in scientific workflows, particularly within NASA's Earth Observing System (EOS) missions such as Terra and Aqua, where it underpins HDF-EOS tools for storing telemetry and derived products. As of June 2025, the latest release is version 4.3.1, and the format remains maintained by The HDF Group for backward compatibility, ensuring ongoing support for legacy applications and data archives. Conversion utilities like h4toh5 enable migration of HDF4 files to HDF5, preserving data and metadata while addressing size and extensibility issues. HDF4's design directly influenced HDF5's architecture, with the latter incorporating lessons from HDF4's tagged model to introduce robust hierarchies and scalability. Community tools such as HDFView continue to provide unified browsing and editing capabilities for both HDF4 and HDF5 files, facilitating gradual transitions in legacy environments.

HDF5

Architectural Enhancements

HDF5 introduces significant architectural improvements over its predecessor, HDF4, to enhance scalability and flexibility for handling large, complex datasets in scientific computing. These enhancements address key limitations in storage capacity and access patterns, enabling support for exabyte-scale data while maintaining a self-describing format. A primary advancement is the adoption of 64-bit addressing, which allows HDF5 file sizes limited only by the 64-bit address space (up to 2^64 bytes), far surpassing HDF4's 2 GB restriction and accommodating modern storage requirements. Additionally, HDF5 integrates parallel I/O capabilities through the Message Passing Interface (MPI), permitting multiple processes to access and modify a shared file concurrently for improved performance in high-performance computing environments.

The hierarchical structure is refined with groups functioning as container objects analogous to directories, organizing datasets (multidimensional arrays of typed elements) and named datatypes. Hard and soft links provide flexible referencing to these objects, enabling efficient navigation and reuse within the file hierarchy without duplicating data. Extensibility is bolstered by allowing an unlimited number of attributes (small metadata items attached to objects) directly in object headers, unlike the fixed limits in earlier formats. Pluggable I/O filters support on-the-fly data transformation, including compression algorithms such as GZIP and Szip, as well as encryption options to secure sensitive information during storage and transfer.

Further innovations include virtual datasets, which compose multiple source datasets across files into a unified view for seamless querying and analysis without physical reorganization. Extensible arrays, implemented via chunked storage and unlimited dimensions, facilitate dynamic growth along specified axes, making HDF5 suitable for time-series data that accumulates over time. POSIX-like semantics underpin access control, leveraging hierarchical paths for object addressing and inheriting operating system permissions for secure, familiar file handling.
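
The parallel I/O path can be sketched as follows in C, assuming an HDF5 build with parallel support enabled and an MPI environment; file and dataset names are illustrative and error checking is omitted. Each rank selects its own hyperslab of a shared dataset and the write is performed collectively.

    /* Sketch (C, Parallel HDF5): open a shared file with the MPI-IO driver and
     * write one dataset collectively. Requires a parallel HDF5 build. */
    #include "hdf5.h"
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* File access property list selects the MPI-IO virtual file driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One element per MPI rank; every rank writes its own hyperslab. */
        hsize_t dims[1] = {(hsize_t)nprocs};
        hid_t fspace = H5Screate_simple(1, dims, NULL);
        hid_t dset   = H5Dcreate2(file, "rank_ids", H5T_NATIVE_INT, fspace,
                                  H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t start[1] = {(hsize_t)rank}, count[1] = {1};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(1, count, NULL);

        /* Collective transfer mode coordinates the write across all ranks. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, &rank);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }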

Data Model and Organization

The HDF5 data model organizes data into a hierarchical structure resembling a file system, where the entire file forms a rooted, directed graph beginning with the root group denoted by "/". This model supports complex relationships through various objects, enabling efficient logical navigation and data access without direct ties to physical storage layouts.

Groups act as hierarchical containers within the HDF5 file, analogous to directories, that hold datasets, other groups, and committed datatypes via links. The root group "/" forms the top-level container, allowing unlimited nesting to create a tree-like structure. Groups support hard links, which directly reference objects within the same file, and soft links, which are symbolic pointers that may point to non-existent or external objects. Additionally, external links enable connections to objects in other HDF5 files, effectively mounting external files as subgroups to integrate data across multiple files.

Datasets represent the core data storage units in HDF5, consisting of n-dimensional arrays that can have fixed or extendable (variable) shapes, supporting up to 32 dimensions. Each dataset is defined by a datatype specifying the element type and a dataspace describing the array's dimensions and layout. Datatypes include atomic types such as integers and floating-point numbers; compound types that aggregate multiple fields like structs; enumeration types for named values; opaque types for raw binary data; reference types for pointers to other objects; string types, either fixed-length or variable-length; and variable-length types for sequences of arbitrary size. Partial access to datasets is facilitated through hyperslabs, which select contiguous or patterned subsets of the array via selections in the dataspace, allowing efficient reading or writing of specific regions without loading the entire dataset. For example, a dataset might use the atomic datatype class H5T_FLOAT with a fixed shape of [100, 200] to store a two-dimensional matrix of single-precision floating-point values.

Attributes provide key-value metadata attached to groups or datasets, serving as small, named datasets without the rigid restrictions of earlier formats. Each attribute has its own name, datatype, and dataspace, enabling the storage of descriptive information such as units (e.g., "m/s" for velocity) or provenance details directly on the object. This metadata enhances discoverability and interoperability by allowing users to annotate primary objects flexibly. Committed datatypes extend the model by allowing datatype definitions to be named and stored as independent objects within groups, promoting reuse across multiple datasets and ensuring consistency in data representation.

The overall organization treats the HDF5 file as a directed graph, where groups serve as nodes, links as directed edges, and datasets and attributes as leaf or annotated elements, navigable via path names like "/group1/dataset1". This graph structure supports shared components and complex topologies beyond simple hierarchies.
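
A brief C sketch of hyperslab access, using illustrative file and dataset names and the standard HDF5 C API: a 10 × 20 sub-block is read out of a 100 × 200 float dataset like the example above, without loading the full array. Error checking is omitted for brevity.

    /* Sketch (C, HDF5): read a 10 x 20 sub-block of a 100 x 200 float dataset
     * using a hyperslab selection. Names are illustrative. */
    #include "hdf5.h"

    int main(void)
    {
        float block[10][20];                      /* destination buffer       */
        hsize_t start[2] = {40, 60};              /* offset of the sub-block  */
        hsize_t count[2] = {10, 20};              /* extent of the sub-block  */
        hsize_t mdims[2] = {10, 20};

        hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/simulation/temperature", H5P_DEFAULT);

        /* Select the region in the file dataspace ... */
        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        /* ... and describe the matching shape of the in-memory buffer. */
        hid_t mspace = H5Screate_simple(2, mdims, NULL);
        H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, block);

        H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset);   H5Fclose(file);
        return 0;
    }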

Performance Features and Extensions

The HDF5 file format employs a superblock as its versioned header, which contains essential metadata including free lists for managing unused space and references to the global heap for storing shared strings and other persistent objects. This structure facilitates efficient navigation and space management within the file. Groups in HDF5 are indexed using B-trees, which enable scalable organization and quick lookups for hierarchical elements, supporting large-scale datasets without performance degradation. Chunked storage divides datasets into fixed-size blocks, allowing partial I/O operations and integration with compression filters to optimize access patterns for multidimensional data. To enhance compression efficiency, particularly for binary data like images or simulations, HDF5 supports external filters such as bitshuffle, which rearranges bits within chunks before applying standard compressors like LZ4, achieving up to 2-4 times faster compression and decompression compared to traditional shuffling methods.

Parallel HDF5 (PHDF5) extends these capabilities for distributed environments by leveraging MPI for collective I/O operations, enabling multiple processes to read and write to the same file concurrently while maintaining data consistency through MPI-IO drivers. Caching strategies in HDF5 include a chunk cache, which holds recently accessed dataset portions in memory to reduce disk I/O, and a metadata cache that buffers file headers, B-tree nodes, and other structural elements, with configurable sizes and replacement policies like LRU to adapt to application workloads. Introduced in HDF5 version 1.10 in 2016, single-writer multiple-reader (SWMR) mode allows one process to append to datasets while multiple readers access the file simultaneously without blocking, using relaxed consistency semantics to minimize synchronization overhead and support streaming applications. Under SWMR, append-only writes ensure readers see monotonically increasing dataset extents without corruption.

The HDF5 High-Level (HL) library provides simplified interfaces built atop the core library, offering domain-specific functions for images (H5IM), tables (H5TB), and packet tables (H5PT) to streamline common operations and reduce boilerplate code for performance-critical tasks. Utility tools complement these features for file optimization and inspection; h5dump outputs the contents of an HDF5 file in a human-readable format, including datasets and attributes, while h5repack copies files with modified layouts, such as applying new filters or converting storage types, to improve compression ratios or access speeds post-creation.

As of November 2025, the current HDF5 version is 2.0.0, which introduces support for chunks larger than 4 GiB using 64-bit addressing and increases the default chunk cache hash table size to 8191 for better performance with large datasets. It also includes optimizations for virtual datasets, such as delayed layout copying that improves opening times by approximately 30% for files with many mappings. Pluggable filters enable compression with libraries like Blosc and ZFP for high-throughput handling of various data types, including in AI/ML workflows.
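
The chunking and filter pipeline is configured through a dataset-creation property list, as in the following C sketch; the chunk dimensions and gzip level are illustrative choices rather than recommendations, and error checking is omitted for brevity.

    /* Sketch (C, HDF5): create a chunked dataset with the shuffle and gzip
     * (deflate) filters enabled via a dataset-creation property list. */
    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2]  = {4096, 4096};
        hsize_t chunk[2] = {256, 256};            /* unit of partial I/O      */

        hid_t file  = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);             /* chunking enables filters */
        H5Pset_shuffle(dcpl);                     /* byte shuffle before ...  */
        H5Pset_deflate(dcpl, 6);                  /* ... gzip at level 6      */

        hid_t dset = H5Dcreate2(file, "image", H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Data written later is filtered chunk by chunk as it reaches disk. */
        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }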

Programming Interfaces

Core APIs

The core APIs of the Hierarchical Data Format (HDF) provide low-level C interfaces for file manipulation, data creation, and I/O operations, forming the foundation for accessing HDF files in both HDF4 and HDF5 versions. These APIs are designed for portability across platforms and emphasize efficient handling of multidimensional scientific data. In HDF4, the APIs are organized into modular libraries for core operations, scientific datasets (SDS), general rasters (GR), and vdata structures, while HDF5 introduces a more unified, object-oriented approach with dedicated modules for files, groups, datasets, attributes, and datatypes. While the core APIs remain stable, HDF5 2.0.0 (released November 2025) includes some API signature changes for specific functions such as H5Dread_chunk, H5Tdecode, and H5Iregister_type, along with new features like an AWS S3 backend; see the project's migration documentation for details.

HDF4's core APIs are divided into four primary libraries: the H library for basic file management, the DF library for SDS, the DG library for raster images, and the DH library for vdata. The H library includes functions like Hopen, which opens or creates an HDF file and returns a file identifier, and Hclose to terminate access, with access modes such as DFACC_READ, DFACC_WRITE, or DFACC_CREATE. For SDS operations in the DF library, SDstart initializes the interface on a file, SDcreate establishes a dataset with a specified datatype, rank, and dimensions, and I/O functions such as SDreaddata and SDwritedata handle data transfer using parameters for start position, stride, and edges to enable partial reads or writes. The DG library supports raster data through GRcreate to define an image with components, mode, and dimensions, paired with GRreadimage and GRwriteimage for I/O, including palette management via GRgetlutid and GRwritelut. In the DH library, vdata handling uses VScreate to build tabular structures with fields, VSattach and VSdetach to begin and end access, and VSread or VSwrite for record-based I/O in interlaced or non-interlaced modes.

HDF5's core APIs are structured around the H5 module and specialized submodules, including H5F for files, H5G for groups, H5D for datasets, H5A for attributes, and H5T for datatypes, offering a more hierarchical and extensible interface. Key functions include H5Fcreate to generate a new file or open an existing one, returning a file identifier; H5Gcreate to form groups within the file; and H5Dcreate to define datasets with associated datatypes and dataspace dimensions. Data I/O relies on H5Dwrite and H5Dread, which support hyperslab selections (rectangular subsets defined via start, stride, count, and block parameters in the H5S dataspace module) for efficient partial access to large arrays without loading entire datasets into memory. Attribute operations use H5Acreate and H5Awrite to attach metadata to objects, while H5T functions like H5Tcreate allow custom datatype definitions, including compound and variable-length types.

Common patterns across both versions include identifier-based access, where functions return opaque handles (e.g., int32 for HDF4 file IDs or hid_t in HDF5) to reference objects without exposing internal structures, and error handling through return codes such as SUCCEED/FAIL in HDF4 or herr_t in HDF5, which indicates success (non-negative) or failure (negative) and integrates with an error stack for diagnostics via functions like H5Eprint. Customization often involves property lists in HDF5, such as H5P_DATASET_CREATE for chunking, compression, or fill values during creation, or H5P_FILE_ACCESS for I/O drivers and caching; HDF4 uses simpler flags or attributes for similar purposes. Version differences highlight HDF5's evolution toward a more object-oriented design, with explicit support for complex datatypes via H5T and an API mirroring the file's hierarchy of groups and datasets, contrasting with HDF4's flatter, library-specific structure. Additionally, the Parallel HDF5 (PHDF5) extension builds on the core APIs to enable concurrent access in MPI environments, allowing operations like H5Dwrite to run in collective or independent modes for parallel applications.
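
A compound datatype can be sketched in C as follows, with an illustrative struct layout: H5Tinsert registers each field by name and offset so that the stored records mirror the in-memory struct. Error checking is omitted for brevity.

    /* Sketch (C, HDF5): define a compound datatype matching a C struct and use
     * it for a one-dimensional dataset, illustrating the H5T custom-type API. */
    #include "hdf5.h"
    #include <stddef.h>   /* offsetof */

    typedef struct {
        int    id;
        double value;
    } record_t;

    int main(void)
    {
        record_t records[3] = {{1, 0.5}, {2, 1.5}, {3, 2.5}};
        hsize_t  dims[1]    = {3};

        /* Build the compound type field by field from the struct layout. */
        hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
        H5Tinsert(rtype, "id",    offsetof(record_t, id),    H5T_NATIVE_INT);
        H5Tinsert(rtype, "value", offsetof(record_t, value), H5T_NATIVE_DOUBLE);

        hid_t file  = H5Fcreate("records.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "records", rtype, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, rtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, records);

        H5Dclose(dset); H5Sclose(space); H5Tclose(rtype); H5Fclose(file);
        return 0;
    }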

Language Bindings and Tools

The Hierarchical Data Format (HDF) provides language bindings primarily through the HDF5 library, with official support for C, C++, Fortran 90 (extended to Fortran 2003 via ISO_C_BINDING), and Java using the Java Native Interface (JNI). These bindings enable developers to access HDF5's core functionality, such as dataset creation and file I/O, while abstracting low-level details where appropriate; for instance, the C++ wrapper offers STL-like containers for datasets and groups. Third-party bindings extend HDF5 to other languages, including Python via h5py, which provides a NumPy-compatible interface for efficient array storage and retrieval. Similarly, .NET support is available through ILNumerics, offering high-level object-oriented access to HDF5 files integrated with .NET's numerical computing features. HDF4 bindings are more limited, focusing on C and Fortran interfaces with basic support for scientific I/O operations, while Java access relies on community wrappers rather than official JNI implementations. In contrast to HDF5's broader ecosystem, HDF4's language support has not evolved significantly, reflecting its legacy status.

HDF5 includes specialized high-level interfaces to simplify common tasks beyond the core C API. The H5LT (Lite) interface provides straightforward functions for basic read/write operations on datasets and attributes, reducing boilerplate code for simple applications. The H5IM interface handles image data, supporting raster image storage within HDF5 datasets. For time-series data, the H5PT (Packet Table) interface optimizes appendable tables with fixed-size records, suitable for streaming sensor outputs. Additionally, H5LD within the Lite API manages external links, allowing references to datasets in separate files for modular data organization.

Supporting tools enhance HDF5 file manipulation and development. HDFView serves as a Java-based graphical browser and editor for HDF5 files, including dataset inspection and metadata editing across platforms. The h5cc utility acts as a compiler wrapper, automating the linking of HDF5 libraries for C programs to streamline builds. For file management, h5repart repartitions HDF5 files or families, enabling conversion between single files and distributed sets for parallel I/O. Tools like h5jam and h5unjam embed or extract user blocks (such as metadata files) from HDF5 headers, facilitating custom annotations. Conversion utilities, including h4toh5, transform HDF4 files to HDF5 format while preserving data model compatibility, often aligning with netCDF-4 conventions.

HDF5 integrates deeply with scientific ecosystems, notably as the underlying format for netCDF-4, which adds climate and geoscience-specific abstractions atop HDF5's structure. As of 2025, adoption in scientific computing continues to grow through bindings like HDF5.jl for Julia, enabling high-performance array operations in scientific workflows, and hdf5r or rhdf5 for R, supporting statistical analysis of large datasets in Bioconductor environments. These extensions, part of an ecosystem of more than 700 projects, underscore HDF5's role in cross-language data interoperability.
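
The difference in verbosity can be seen in a short C sketch using the H5LT interface, with illustrative names: a dataset is created and written in a single call and a string attribute is attached by path. The program assumes the HDF5 high-level library is linked (typically -lhdf5_hl), and error checking is omitted for brevity.

    /* Sketch (C, HDF5 high-level H5LT interface): one-call dataset creation and
     * a string attribute, condensing the equivalent core-API calls. */
    #include "hdf5.h"
    #include "hdf5_hl.h"

    int main(void)
    {
        double  samples[4] = {1.0, 2.0, 3.0, 4.0};
        hsize_t dims[1]    = {4};

        hid_t file = H5Fcreate("lite.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Create and write in a single call; no explicit dataspace handling. */
        H5LTmake_dataset_double(file, "/samples", 1, dims, samples);

        /* Attach a string attribute by object path. */
        H5LTset_attribute_string(file, "/samples", "units", "volts");

        H5Fclose(file);
        return 0;
    }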

Applications

Scientific and Engineering Uses

In astronomy and Earth sciences, the Hierarchical Data Format (HDF), particularly through the HDF-EOS extension, plays a crucial role in managing satellite and environmental data from NASA's Earth Observing System (EOS). For instance, data from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on the Terra and Aqua satellites is stored in HDF-EOS format, enabling efficient handling of multispectral imagery with global coverage every 1-2 days across 36 spectral bands. Similarly, Landsat missions utilize HDF5 for archiving high-resolution data, such as Level-0 Reduced (L0R) products that group observations from the Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) into multiple HDF5 files per acquisition. In climate modeling, HDF5 supports the storage of multidimensional grids and time series, facilitating analysis of large-scale environmental datasets like those derived from Aura satellite instruments, where extraction from multiple HDF-EOS files (HE5) allows researchers to track variables such as atmospheric composition over extended periods.

In physics and simulations, HDF5 is extensively employed at U.S. Department of Energy (DOE) laboratories for handling complex datasets from plasma physics and computational fluid dynamics (CFD). DOE-funded plasma physics simulations, such as those using the Vector Particle-in-Cell (VPIC) code, rely on HDF5 for scalable I/O in exascale environments, writing massive datasets like 291 TB files from trillion-particle runs on over 298,000 cores to achieve sustained throughputs of 52 GB/s. In CFD applications, HDF5 was specifically designed in the late 1990s to support DOE labs' largest simulations, enabling the storage of multidimensional arrays for fluid flow analyses and serving as the foundation for standards like the CFD General Notation System (CGNS), which organizes structured and unstructured grids for turbulence modeling.

For engineering and life-science applications, HDF5 facilitates the management of imaging and genomics datasets, accommodating high-dimensional and heterogeneous biological data. In biomedical imaging, particularly neuroimaging, HDF5 unifies formats for electron microscopy and functional MRI, storing terabyte-scale images with embedded metadata for efficient querying and analysis, as seen in tools like MINC for modality-neutral storage across CT, PET, and MRI scanners. In genomics, HDF5 serves as a container for sequencing data, including SNP matrices and sequence alignments, with projects like Genomedata using it to store and compress large datasets for rapid access to tallies and variant information.

A key benefit of HDF in these domains is its ability to handle heterogeneous data, such as combining spectral arrays with descriptive metadata in a single file, which supports seamless integration in workflows from acquisition to analysis. Furthermore, HDF5's parallel I/O capabilities make it suitable for petascale simulations at supercomputing centers, where it manages outputs from DOE applications like VPIC, ensuring scalability on systems with hundreds of thousands of cores without performance bottlenecks.

Integration in Software Ecosystems

The Hierarchical Data Format (HDF), particularly HDF5, is deeply integrated into various libraries and frameworks that facilitate data handling in scientific computing and geospatial analysis. netCDF-4, a widely used library for multidimensional scientific data, is built directly on HDF5 as its underlying storage layer, enabling enhanced features like internal compression and unlimited dimensions while ensuring full interoperability with standard HDF5 tools. Similarly, the Geospatial Data Abstraction Library (GDAL) provides robust support for reading and writing HDF5 files, including parsing of HDF-EOS5 metadata for grid and swath data, which is essential for geospatial workflows. Commercial analysis environments such as MATLAB offer native functions like h5read and h5write for importing and exporting HDF5 datasets, supporting hierarchical structures and attributes without requiring external libraries. In remote sensing applications, ENVI and IDL from NV5 Geospatial Software include built-in HDF5 readers that handle variable-length arrays and EOS extensions, streamlining data processing in geospatial pipelines.

HDF5 is embedded in several high-profile projects for data management and visualization in scientific domains. The ROOT framework, developed at CERN for high-energy physics analysis, can export datasets to HDF5 format using third-party tools like root2hdf5, allowing interoperability with broader ecosystems while leveraging ROOT's statistical tools on HDF5-stored particle collision data. Open-source visualization software such as ParaView utilizes HDF5 as a backend for rendering complex volumes and meshes, often via the eXtensible Data Model and Format (XDMF), which references HDF5 files for efficient loading of large-scale outputs.

HDF5 complies with key standards that promote its use in geospatial and Earth science contexts. It is recognized as an official Open Geospatial Consortium (OGC) standard, specifically the OGC Hierarchical Data Format Version 5 (HDF5) Core Standard, which defines its use for encoding multidimensional arrays in spatially and temporally varying geospatial applications. For NASA-specific needs, the HDF-EOS5 profile extends HDF5 with conventions tailored to Earth Observing System (EOS) data, including geolocation metadata and swath/grid structures, ensuring compatibility across NASA's distributed archives.

As of 2025, HDF5 plays a prominent role in modern software ecosystems, particularly in machine learning (ML) and cloud computing. In ML workflows, libraries like h5py enable seamless loading of HDF5 datasets into frameworks such as TensorFlow and PyTorch, supporting efficient handling of large training corpora with features like chunked access and compression, as highlighted in the HDF Group's guidance for AI/ML applications. For cloud environments, HDF5 includes adapters for object storage services such as AWS S3, allowing virtual file access without full downloads through tools like the HDF5 S3 connector, which optimizes performance for distributed data processing.
