Hubbry Logo
DBM (computing)DBM (computing)Main
Open search
DBM (computing)
Community hub
DBM (computing)
logo
7 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
DBM (computing)
DBM (computing)
from Wikipedia

In computing, a DBM is a library and file format providing fast, single-keyed access to data. A key-value database from the original Unix, dbm is an early example of a NoSQL system.[1][2][3]

History

[edit]

The original dbm library and file format was a simple database engine, originally written by Ken Thompson and released by AT&T in 1979. The name is a three-letter acronym for DataBase Manager, and can also refer to the family of database engines with APIs and features derived from the original dbm.

The dbm library stores arbitrary data by use of a single key (a primary key) in fixed-size buckets and uses hashing techniques to enable fast retrieval of the data by key.

The hashing scheme used is a form of extendible hashing, so that the hashing scheme expands as new buckets are added to the database, meaning that, when nearly empty, the database starts with one bucket, which is then split when it becomes full. The two resulting child buckets will themselves split when they become full, so the database grows as keys are added.

The dbm library and its derivatives are pre-relational databases – they manage associative arrays, implemented as on-disk hash tables. In practice, they can offer a more practical solution for high-speed storage accessed by key, as they do not require the overhead of connecting and preparing queries. This is balanced by the fact that they can generally only be opened for writing by a single process at a time. An agent daemon can handle requests from multiple processes, but introduces IPC overhead.

Implementations

[edit]

The original AT&T dbm library has been replaced by its many successor implementations. Notable examples include:[3]

  • ndbm ("new dbm"), based on the original dbm with some new features.
  • GDBM ("GNU dbm"), GNU rewrite of the library implementing ndbm features and its own interface. Also provides new features like crash tolerance for guaranteeing data consistency.[4][5]
  • sdbm ("small dbm"), a public domain rewrite of dbm. It is a part of the standard distribution for Perl and is available as an external library for Ruby.[6][7]
  • qdbm ("Quick Database Manager"), a higher-performance dbm employing many of the same techniques as Tokyo/Kyoto Cabinet. Written by the same author before they moved on to the cabinets.[8]
  • tdb ("Trivial Database"), a simple database used by Samba that supports multiple writers. Has a gdbm-based API.[9]
  • Berkeley DB, 1991 replacement of ndbm by Sleepycat Software (now Oracle) created to get around the AT&T Unix copyright on BSD. It features many extensions like parallelism, transactional control, hashing, and B-tree storage.
  • LMDB: copy-on-write memory-mapped B+ tree implementation in C with a Berkeley-style API.

The following databases are dbm-inspired, but they do not directly provide a dbm interface, even though it would be trivial to wrap one:

  • cdb ("constant database"), database by Daniel J. Bernstein, database files can only be created and read, but never be modified
  • Tkrzw, an Apache 2.0 licensed successor to Kyoto Cabinet and Tokyo Cabinet
  • WiredTiger: database with traditional row-oriented and column-oriented storage.

Availability

[edit]

As of 2001, the ndbm implementation of DBM was standard on Solaris and IRIX, whereas gdbm is ubiquitous on Linux. The Berkeley DB implementations were standard on some free operating systems.[2][10] After a change of licensing of the Berkeley DB to GNU AGPL in 2013, projects like Debian have moved to LMDB.[11]

Reliability

[edit]

A 2018 AFL fuzzing test against many DBM-family databases exposed many problems in implementations when it comes to corrupt or invalid database files. Only freecdb by Daniel J. Bernstein showed no crashes. The authors of gdbm, tdb, and lmdb were prompt to respond. Berkeley DB fell behind due to the sheer amount of other issues;[10] the fixes would be irrelevant to open-source software users due to the licensing change locking them back on an old version.[11]

See also

[edit]

References

[edit]

Bibliography

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
In computing, DBM (Database Manager) is a library of subroutines that provides a simple interface for managing key-value databases stored in disk files, using hashing for efficient access to arbitrary string keys and associated data. The core functions include dbminit for opening a database file, fetch for retrieving data by key, store for inserting or updating key-value pairs, delete for removing entries, firstkey and nextkey for sequential key traversal, and dbmclose for closing the database. Each database consists of two files: a directory file (typically with a .dir suffix) containing hash indices and a page file (.pag suffix) holding the data, with a maximum size limit of 512 bytes per key or content in the original implementation. Functions return 0 or a non-null datum on success and -1 or NULL on errors such as file access failures or key not found. The original DBM library was developed by and first appeared in Version 7 of Unix, released in 1979 as part of the system's programmer's manual. This implementation supported only one open database per process and used static storage for data pointers, which could be overwritten by subsequent calls, requiring programmers to copy results immediately. DBM served as an early example of a key-value store, influencing later database systems by providing lightweight, hashed storage without relational features. Subsequent variants expanded on the original DBM. The New DBM (ndbm) library, introduced in 4.3BSD Unix, modified the to allow multiple databases open simultaneously while maintaining compatibility. GNU DBM (GDBM), released by the , offers extended features like fast mode for read-only access and supports larger databases with better crash recovery. Other implementations, such as those in Python's dbm module and , provide portable interfaces to these formats, often falling back to a dumb (in-memory) implementation if native libraries are unavailable. Despite limitations like single-threaded access and potential file corruption on crashes, DBM and its derivatives remain in use for simple configuration storage and caching in systems.

Overview

Definition and Core Concept

DBM, short for Database Manager, is a foundational and associated in designed for efficient storage and retrieval of using unique keys, effectively implementing an on-disk for associative arrays. It organizes information as simple key-value pairs, where each key serves as a for accessing corresponding values, enabling direct lookups without the need for indexing or querying mechanisms typical of more complex systems. This structure allows for rapid operations on persistent files, making DBM suitable for applications requiring straightforward, high-performance access to discrete records. At its core, DBM operates on the principle of single-keyed access, treating the database as a flat collection of pairs without support for relational links, schemas, or advanced query languages. Keys and values are typically handled as strings or , with the library managing the underlying file organization—often using hashing techniques to map keys to storage locations for O(1) average-time retrieval. Unlike full-fledged relational databases, DBM is a lightweight, non-relational system akin to early paradigms, optimized for single-process, single-user scenarios focused on basic insert, update, delete, and fetch operations rather than transactions, , or multi-table joins. This simplicity positions DBM as an embeddable solution for low-overhead data persistence, avoiding the overhead of server-based database engines. A representative use case involves storing user preferences or configuration settings in a environment, where keys might be string identifiers like "user_id_123" and values binary blobs containing serialized preference , allowing quick reads and writes without relational overhead.

Key Features and Use Cases

DBM provides fast retrieval of through extensible hashing, enabling efficient lookup of values associated with arbitrary string keys without scanning the entire . It supports variable-length keys and values, limited to 512 bytes each in the original (with later variants supporting larger sizes up to bytes or more), stored across two flat files: a directory file for the hash index and a pag file for the actual pages. The library's interface is deliberately simple, offering core operations—open, store (insert or replace), fetch, delete, and close—that facilitate straightforward integration into C programs for managing key-value pairs. These operations ensure atomic updates within a single-process context, promoting data consistency without built-in support for transactions or complex locking. DBM adopts a single-writer model to prevent corruption, allowing multiple processes to read simultaneously only if no writes are active, though original implementations lack automatic locking and require careful application-level coordination for concurrency. This design prioritizes speed and simplicity over full compliance, making it suitable for scenarios where data integrity relies on controlled access patterns. In early Unix systems, DBM found widespread use as an embedded storage solution for applications requiring quick, persistent key-value access. A prominent example is the mail transfer agent, which employs DBM to store and query mail aliases, mapping sender addresses to delivery lists in a compiled database format for rapid resolution during message routing. It also served text processing tools, such as those for indexing small datasets or caching word frequencies in document analysis, where the overhead of full relational databases would be excessive. Additionally, DBM acted as a foundational component in systems like spools, providing lightweight persistence for configuration or state data in resource-constrained environments.

Historical Development

Origins in Unix

DBM originated in 1979 when Ken Thompson developed it as part of the Unix operating system at AT&T Bell Labs. This simple database engine addressed the limitations of Unix's early file-based data handling by introducing persistent key-value storage using an extensible hash table stored in disk files. The creation of DBM was motivated by the increasing demand for efficient, persistent data storage within Unix tools, where traditional flat files proved insufficient for managing structured information quickly and reliably. Tools such as troff for document formatting and make for build automation required faster access to key-based data, beyond what sequential file scans could provide, prompting Thompson to design a system that supported large databases—up to 1 billion blocks—while enabling key retrieval in typically 1-2 file system accesses. DBM was initially released as a set of library functions, including dbminit for initialization, store for inserting or updating key-content pairs, fetch for retrieval, delete for removal, and firstkey/nextkey for iteration, integrated directly into . These functions operated on binary data via a datum structure supporting strings up to 512 bytes, with database files split into .pag for data blocks and .dir for a index. Drawing from earlier Unix utilities' use of hashing for quick lookups in unstructured byte streams, DBM formalized this approach to meet the evolving need for structured in a growing ecosystem of command-line programs.

Key Milestones

In 1986, the ndbm (new DBM) library was introduced in 4.3BSD Unix, enhancing the original dbm by supporting multiple databases within a single process and incorporating file locking mechanisms to enable safe concurrent access by multiple processes. During the , ndbm gained formal adoption in standards such as X/Open Portability Guide Issue 4 (XPG4) in 1992, which standardized its interface and header (<ndbm.h>). In 1990, the released GNU DBM (GDBM), an open-source implementation offering enhanced features such as read-only fast mode and improved crash recovery for larger databases. In 1991, emerged as a significant extension of the DBM model, initially developed at the , as an open-source library providing advanced features like hashing, B-trees, and concurrent access; it was later maintained and commercialized by Sleepycat Software starting in 1996. The acquisition of Sleepycat Software by in February 2006 introduced a dual-licensing model for , combining open-source options with commercial terms to support broader enterprise adoption while funding development. However, by 2013, Oracle shifted to the GNU Affero General Public License (AGPLv3), which imposed stricter requirements for networked applications, prompting many open-source projects to migrate to alternatives due to compliance challenges. DBM's simple key-value storage paradigm has profoundly influenced modern databases, serving as an early precursor to distributed key-value stores like and by demonstrating efficient, non-relational data management for high-performance applications.

Technical Details

File Format and

The DBM file format employs two distinct files for each database instance: a directory file (typically named with a .dir extension) that functions as an index, and a page file (with a .pag extension) that holds the actual key-value data. The .dir file contains pointers to specific pages in the .pag file, enabling efficient mapping from hash values to data locations. These files are binary and sparse, with the .pag file potentially containing unused blocks to support dynamic growth. At its core, the data structure is an extendible hash table designed for disk-based storage. Keys are processed through a hash function to generate a fixed-length hash value (typically 32 bits), from which the least significant bits index into the directory stored in the .dir file. This directory entry points to an offset in the .pag file, locating a bucket—a fixed-size page of 512 bytes in the original implementation, capable of storing multiple key-value pairs. Within a bucket, pairs are stored sequentially as a 2-byte length prefix for the key, followed by the key data, a 2-byte length prefix for the content, followed by the content data, all in binary form without null terminators. Collisions occur when multiple keys hash to the same , which is handled by packing pairs sequentially within the bucket until its capacity is reached. To manage growth, full buckets trigger a split: the directory doubles in size if necessary, redistributing entries based on one additional hash bit without rehashing the entire table. This directory-based mechanism ensures average O(1) lookup, insertion, and deletion times while avoiding less efficient strategies like or separate chaining. The original hashing function combines the key and content bytes into a 31-bit value. The original DBM format lacks forward compatibility with variants like certain NDBM implementations due to variations in file headers, such as differing magic numbers or bitmap structures, which prevent seamless interchange between systems.

Programming Interface

The original programming interface for DBM is provided through the <dbm.h> header in C, offering a simple API for managing key-value databases. It uses the datum type, a structure with a char *dptr pointer to the data and an int dsize for its size in bytes, supporting binary data up to 512 bytes per key or content. Unlike later variants, it allows only one open database per process and uses static internal storage, requiring immediate copying of returned data to avoid overwrites. To access a database, programs call dbminit(const char *file), which opens the specified database files (file.dir and file.pag must already exist) and returns 0 on success or -1 on failure. Data manipulation uses fetch(datum key), returning a datum with the value or a if not found; store(datum key, datum content), which inserts or replaces the pair and returns 0 on success or -1 on ; and delete(datum key), returning 0 on success or -1 on failure. For sequential traversal, firstkey() returns the first key as a datum (null indicates end), and nextkey(datum key) retrieves the next in hash order. The database is closed with dbmclose(), which releases resources. Error handling relies on return values (-1 for failure) and the external errno. Later variants like NDBM, standardized in POSIX via <ndbm.h>, extend this interface to support multiple open databases, flexible open flags (e.g., O_RDWR, O_CREAT), and functions like dbm_open() and dbm_store() with insert/replace modes, while maintaining compatibility with the core operations.

Implementations and Variants

Standard Implementations

The original dbm implementation, developed by Ken Thompson at AT&T Bell Labs, was introduced as part of Unix Version 7 in 1979. It provides a simple key-value store using two files—a directory file (.dir) for indexing and a page file (.pag) for data—supporting fast access to large databases via hashing, but limited to one open database per process, without built-in locking or support for concurrent writes from multiple processes. ndbm, or New DBM, emerged as a (BSD) extension in 4.3BSD released in 1986, building on the original dbm to address its limitations. This variant maintains the core hashing algorithm from Unix Version 7 dbm while extending the to support opening multiple databases per process, enhancing usability in multi-tasking environments. GDBM, the GNU DBM, was first implemented by Philip A. Nelson in 1990 under the GNU project to provide a free alternative to proprietary DBM libraries. As a disk-based key-value store, it incorporates an internal bucket cache for improved read performance over the original dbm and supports operational modes such as fast mode for read-only access and sparse mode (to reduce file size by avoiding allocation of unused blocks). sdbm, developed by Ozan S. Yigit in , serves as a compact, public-domain reimplementation of ndbm designed for portability across Unix variants where licensing restricted access to Berkeley code. Its lightweight design, using a single hashing function and simple file structure, makes it suitable for basic key-value needs, and it is integrated as the default DBM backend in languages like (via the SDBM_File module) and (via the stdlib sdbm library). These standard implementations share a commitment to the traditional DBM API for store, fetch, delete, and key traversal operations, striving for binary compatibility with original dbm files where feasible through emulation layers, and are primarily targeted at systems for embedded, .

Extended and Alternative Implementations

, initially released in 1991 by the , evolved into a commercial product under Sleepycat Software starting in 1996, which was acquired by in 2006. It extends the original DBM model by supporting multiple storage backends, including hash tables and B-trees for ordered access, while adding advanced features such as transactions, concurrent access with locking mechanisms, and replication for distributed environments. Post-2013, its open-source distribution shifted to the Affero General Public License (AGPL) alongside a commercial option under the Sleepycat Public License successor, reflecting Oracle's dual-licensing strategy. The (LMDB), developed in 2011 by Symas Corporation for the Project, is modeled loosely after the Berkeley DB API but implements a simplified structure using memory-mapped files for direct OS-managed access. This design enables high read performance through operations, employs for crash recovery without , and supports a single-writer/multiple-reader concurrency model to minimize locking overhead. As of 2025, LMDB remains actively maintained for embedded applications, with ongoing updates in projects like and bindings for modern languages. Other alternatives include QDBM and its successor Tokyo Cabinet, both developed by Mikio Hirabayashi starting in the mid-2000s, which prioritize speed and compression through and storage options for key-value pairs. Tokyo Cabinet, released around 2007, enhances efficiency with memory-mapped I/O and supports larger datasets via table databases, though both libraries are now largely superseded by more recent tools. Samba's Trivial Database (TDB), introduced in the late , provides a lightweight, embeddable store optimized for locking in networked file systems, allowing multiple writers while maintaining dbm/ndbm compatibility. In contrast, the Constant Database (CDB), designed by in 1997, is a read-only format using a static for ultra-fast lookups without updates or concurrency support, ideal for static mappings like mail aliases. Modern key-value stores such as (developed by in 2011) and its fork (by in 2012) draw from DBM's foundational principles of simple, embedded key-value persistence but diverge with architectures for reduction and scalability on SSDs, using entirely different APIs. These systems prioritize high-throughput workloads in applications like browsers and distributed storage, extending DBM concepts to handle terabyte-scale data without direct compatibility.

Availability in Modern Systems

Operating System Support

DBM and its variants maintain varying levels of support across , primarily through legacy compatibility layers and specialized packages rather than as core components. In distributions, GDBM is widely available as a standard GNU library, integrated into core utilities and development environments, ensuring compatibility for applications relying on key-value storage. The GNU C Library () provides ndbm interfaces for backward compatibility with traditional DBM functions, allowing seamless access in most environments. LMDB, a memory-mapped variant, is commonly packaged in major distributions such as and , particularly for use in implementations. On systems such as BSD variants, ndbm remains a native component, with and including the library in their base systems for traditional database operations. In macOS, support for DBM is available through third-party libraries such as , which can be installed via Homebrew for compatibility, though Apple encourages the use of built-in options like for new developments due to their efficiency and integration with the ecosystem. Windows lacks native DBM support, relying instead on ports through environments like and , which provide GDBM and implementations for POSIX-like development. Native integration is limited, with developers often using (WSL) or third-party alternatives like for similar functionality. Certain older Unix systems have deprecated DBM; reflecting a broader shift away from legacy formats. Despite this, DBM variants continue to be bundled in embedded systems, where they support lightweight storage in constrained environments. As of 2025, DBM remains active in IoT and embedded platforms, such as , where GDBM and ndbm packages are readily installable for resource-limited applications. However, its role has been largely overshadowed by in contemporary systems due to the latter's superior features and portability, with no significant removals reported since Debian's transition to LMDB for certain database needs.

Language Bindings and Libraries

DBM, originally developed as a library, provides native access in and C++ programs through standard headers such as <ndbm.h> for the New DBM (NDBM) interface or <gdbm.h> for the GNU DBM implementation, enabling direct manipulation of key-value databases with functions like dbm_open and dbm_store. In C++, while no official Boost wrapper exists specifically for DBM, modern alternatives like the Tkrzw library offer C++-native DBM implementations with enhanced performance and concurrency support. Python integrates DBM functionality via the dbm package in its , which serves as a generic interface to various backends including dbm.ndbm (NDBM), dbm.gnu (GDBM), dbm.bsd (), dbm.dumb (pure Python fallback), and the newer dbm.sqlite3 (SQLite-based, introduced in Python 3.13 as the default for the shelve module). In Python 3, the functionality of the former anydbm module was incorporated into dbm, allowing direct usage or shelve for object persistence, with documentation advising shelve for applications requiring non-string values due to its pickle integration atop DBM backends. Ruby includes a built-in DBM class in its , acting as a wrapper around Unix-style DBM libraries and defaulting to the SDBM backend for simple key-value storage where both keys and values are strings, though it supports alternatives like GDBM or depending on compilation flags. Similarly, Perl offers native DBM support through core modules such as NDBM_File for NDBM compatibility and DB_File for access, allowing tied hashes for seamless integration in scripts via functions like dbmopen or the DBI driver's DBD::DBM for SQL-like queries on DBM files. For other languages, utilizes DB Edition (JE), a pure-Java embedded key-value store compatible with DBM semantics, providing APIs for direct object persistence without native dependencies. Node.js accesses GDBM via third-party bindings like node-gdbm, which exposes methods such as open, store, and fetch for asynchronous key-value operations in environments. Cross-platform libraries like Cabinet extend DBM capabilities with bindings for , Python, , , , and C/C++, supporting advanced features such as indexing and large-scale databases up to 8 exabytes. As of , DBM bindings remain available and maintained in these languages primarily for legacy scripting and configuration storage, such as in Unix tools or simple data persistence tasks, though official documentation across implementations recommends modern alternatives like or for new projects due to DBM's limitations in concurrency and scalability. No major patches have been required recently, reflecting DBM's niche, low-exposure usage in non-critical applications.

Reliability and Limitations

Security and Stability Issues

In 2018, fuzzing experiments using (AFL) on various DBM implementations revealed significant stability issues, including crashes and potential data corruption triggered by malformed database inputs. Implementations such as GDBM, QDBM, Tokyo Cabinet, Kyoto Cabinet, TDB, and NTDB exhibited vulnerabilities leading to segmentation faults (SIGSEGV), aborts (SIGABRT), and arithmetic errors (SIGFPE), often within seconds of exposure to crafted files; in contrast, the read-only freecdb demonstrated full resilience, with no crashes after over 1.1 billion executions across four days of testing. Common security flaws in these systems stem from inadequate handling of inputs during key hashing and data operations, resulting in buffer overflows that allow memory corruption and potential . Additionally, denial-of-service conditions can arise from hash collisions in the underlying table structures, exacerbating performance degradation under adversarial inputs, while fetch and store functions frequently lack robust input validation, enabling malformed keys or values to trigger out-of-bounds reads or writes. Oracle acquired Sleepycat Software, the creators of , in 2006. While fully open-source releases ceased after version 4.8 in 2009, Oracle has continued providing patches through Critical Patch Updates for supported versions, including fixes in 2019 and 2020. This has led to community alternatives like LMDB, which enhances through memory-mapped files () and copy-on-write mechanisms that avoid manual buffer management and ensure crash-proof . While the original DBM remains unpatched, some variants like have received security updates post-2018, though many implementations from the fuzzing still require careful handling; experts recommend input sanitization prior to database operations and preferring read-only modes to mitigate risks. In contemporary deployments, exposure is often limited by embedding DBM within trusted, controlled environments that restrict unverified file access. As of 2025, several distributions have deprecated or plan to remove support, encouraging migration to more modern key-value stores.

Performance and Scalability Constraints

DBM implementations, including the original ndbm and variants like GDBM, achieve average O(1) lookup times through hashing, where keys are mapped to buckets in a disk file for direct access without scanning the entire dataset. However, performance degrades with hash collisions, as the original Unix DBM's simple hashing algorithm—based on a fixed number of buckets—requires linear probing or sequential searches within buckets, leading to increased I/O operations for larger or poorly distributed key sets. This makes DBM suitable for small to medium-sized datasets where hashing provides efficient access, though disk I/O remains the dominant bottleneck. DBM is suitable for small to medium-sized datasets where hashing provides efficient access, though disk I/O limits performance compared to in-memory stores like Redis. Scalability is limited by DBM's single-file and locking model, which enforces exclusive access during writes via file-level locks, causing contention in multi-process or multi-threaded environments where multiple writers compete for the lock. GDBM, for instance, supports multiple concurrent readers but only a single writer at a time, preventing built-in sharding or replication and restricting throughput in high-concurrency scenarios. File growth exacerbates these issues, as appending new buckets or pages without automatic compaction results in internal fragmentation and scattered disk allocation, increasing seek times and overall I/O latency over time. Compared to in-memory stores like , DBM exhibits significantly lower throughput for high-frequency read-write operations due to its reliance on disk persistence. Similarly, DBM lags behind in , as the latter supports fine-grained multi-threaded concurrency with page-level locking, enabling better parallel access without global contention. As of 2025, DBM remains effective in low-resource embedded systems for simple, small-scale key-value storage but is considered legacy for cloud-scale applications requiring distributed or . Variants like GDBM offer minor optimizations, such as a configurable bucket cache that improves read performance by keeping frequently accessed pages in memory—best when sized to match the number of buckets—and a fast-write mode that defers disk synchronization for quicker inserts at the cost of potential data loss on crashes. Despite these, DBM's overall design positions it as unsuitable for big data workloads, where modern alternatives provide compaction, replication, and horizontal scaling absent in core DBM implementations.

References

Add your contribution
Related Hubs
User Avatar
No comments yet.