Linear hashing
from Wikipedia

Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980.[1][2] It has been analyzed by Baeza-Yates and Soza-Pollman.[3] It is the first in a number of schemes known as dynamic hashing,[3][4] such as Larson's Linear Hashing with Partial Extensions,[5] Linear Hashing with Priority Splitting,[6] Linear Hashing with Partial Expansions and Priority Splitting,[7] or Recursive Linear Hashing.[8]

The file structure of a dynamic hashing data structure adapts itself to changes in the size of the file, so expensive periodic file reorganization is avoided.[4] A Linear Hashing file expands by splitting a predetermined bucket into two and shrinks by merging two predetermined buckets into one. The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or load factor (i.e., the number of records divided by the number of buckets) moving outside of a predetermined range.[1] In Linear Hashing there are two types of buckets, those that are to be split and those already split. While extendible hashing splits only overflowing buckets, spiral hashing (a.k.a. spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split.[5]

Linear Hashing has also been made into a scalable distributed data structure, LH*. In LH*, each bucket resides at a different server.[9] LH* itself has been expanded to provide data availability in the presence of failed buckets.[10] Key-based operations (inserts, deletes, updates, reads) in LH and LH* take at most constant time, independent of the number of buckets and hence of records.[1][10]

Algorithm details

Records in LH or LH* consist of a key and content, the latter comprising all the other attributes of the record.[1][10] Records are stored in buckets. For example, in Ellis' implementation, a bucket is a linked list of records.[2] The file allows the key-based CRUD operations create (insert), read, update, and delete, as well as a scan operation that reads all records, for example to perform a database select on a non-key attribute.[10] Buckets are numbered starting with 0.[10]

The key distinction from schemes such as Fagin's extendible hashing is that as the file expands due to insertions, only one bucket is split at a time, and the order in which buckets are split is already predetermined.[11]

Hash functions

A hash function h_i returns the 0-based index of the bucket that contains the record with key c. When a bucket which uses the hash function h_i is split into two new buckets, the hash function h_i is replaced with h_{i+1} for both of those new buckets. At any time, at most two hash functions h_l and h_{l+1} are used, where l corresponds to the current level. The family of hash functions h_0, h_1, h_2, ... is also referred to as the dynamic hash function.

Typically, the value of i in h_i corresponds to the number of rightmost binary digits of the key c that are used to segregate the buckets. This dynamic hash function can be expressed arithmetically as h_i(c) = c mod 2^i. Note that when the total number of buckets is equal to one, l = 0 and h_0(c) = 0 for every key c.

The calculation below determines the bucket for a given key c.[10]

# l represents the current level
# s represents the split pointer index
a = h_l(c)
if (a < s): a = h_{l+1}(c)
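The two-step lookup above can be written as a short, runnable Python function. As a sketch it assumes the common choice h_l(c) = c mod 2^l, i.e. a file that started with a single bucket:

```python
def bucket_address(c, l, s):
    """Bucket index for key c, given level l and split pointer s.

    Assumes h_l(c) = c mod 2**l, i.e. a file that began with one bucket.
    """
    a = c % (2 ** l)            # a = h_l(c)
    if a < s:                   # bucket a was already split this round,
        a = c % (2 ** (l + 1))  # so use h_{l+1}(c) instead
    return a
```

For example, with l = 2 and s = 2, key 5 first maps to bucket 1; since 1 < 2, that bucket has already been split, so h_3(5) = 5 mod 8 places the record in bucket 5.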

Split control

Linear hashing algorithms may use only controlled splits or both controlled and uncontrolled splits.

Controlled splitting occurs if a split is performed whenever the load factor, which is monitored by the file, exceeds a predetermined threshold.[10] If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the load factor surpasses a set threshold, the split pointer's designated bucket is split. Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets).[12]

An uncontrolled split occurs when a split is performed whenever a bucket overflows, in which case that bucket would be split into two separate buckets.

File contraction occurs in some LH algorithm implementations if a controlled split causes the load factor to sink below a threshold. In this case, a merge operation would be triggered which would undo the last split, and reset the file state.[10]
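A merge that undoes the last split can be sketched in Python as follows; the flat list-of-lists representation and the function name are illustrative assumptions, not part of any particular implementation:

```python
def contract(buckets, l, s):
    """Undo the most recent split: merge the last bucket into its buddy
    and step the split pointer (and possibly the level) back.

    buckets: list of lists of keys; l: current level; s: split pointer.
    """
    if s == 0:
        if l == 0:
            return buckets, l, s  # only one bucket left, nothing to merge
        l -= 1                    # step back into the previous level
        s = 2 ** l
    s -= 1
    last = buckets.pop()          # bucket 2**l + s, created by the last split
    buckets[s].extend(last)       # merge it back into its buddy bucket s
    return buckets, l, s
```

Merging is possible without rehashing because every record in bucket 2^l + s is guaranteed to map to bucket s under the previous hash function h_l.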

Split pointer

The index of the next bucket to be split is part of the file state and is called the split pointer s. The split pointer corresponds to the first bucket that still uses the hash function h_l instead of h_{l+1}.[10]

For example, if numerical records are inserted into the hash index according to their rightmost binary digits, the bucket whose label matches the trailing digits of the newly appended bucket will be split. Thus, if we have the buckets labelled as 000, 001, 10, 11, 100, 101, we would split the bucket 10 because we are appending and creating the next sequential bucket 110. This would give us the buckets 000, 001, 010, 11, 100, 101, 110.[12]

When a bucket is split, the split pointer s and possibly the level l are updated according to the following, such that the level is 0 when the linear hashing index has only one bucket.[10]

# l represents the current level
# s represents the split pointer index
s = s + 1
if (s >= 2^l): 
    l = l + 1
    s = 0
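Putting the pieces together, a minimal single-process sketch in Python (with assumed names, uncontrolled splits on overflow, and h_l(c) = c mod 2^l) shows how the address computation, the split, and the pointer update interact:

```python
class LinearHash:
    """Illustrative in-memory linear hashing table: starts with one bucket
    and performs an uncontrolled split whenever a bucket overflows."""

    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.level = 0        # l
        self.split = 0        # s, index of the next bucket to split
        self.buckets = [[]]   # bucket 0

    def address(self, key):
        a = key % (2 ** self.level)            # h_l(key)
        if a < self.split:
            a = key % (2 ** (self.level + 1))  # h_{l+1}(key)
        return a

    def insert(self, key):
        a = self.address(key)
        self.buckets[a].append(key)
        if len(self.buckets[a]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket at the split pointer (not necessarily the
        # overflowing one) and redistribute its records with h_{l+1}.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])  # new bucket 2**level + split
        for k in old:
            self.buckets[k % (2 ** (self.level + 1))].append(k)
        self.split += 1
        if self.split >= 2 ** self.level:  # round complete: double the range
            self.level += 1
            self.split = 0
```

After inserting 1, 2, and 3, the single bucket overflows and splits, leaving level 1, split pointer 0, and the keys distributed between buckets 0 and 1.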

LH*

The main contribution of LH* is to allow a client of an LH* file to find the bucket where the record resides even if the client does not know the file state. Clients in fact store their own version of the file state, which is initially just the knowledge of the first bucket, namely Bucket 0. Based on their file state, a client calculates the address of a key and sends a request to that bucket. At the bucket, the request is checked and, if the record is not at the bucket, it is forwarded. In a reasonably stable system, that is, if there is only one split or merge going on while the request is processed, it can be shown that there are at most two forwards. After a forward, the final bucket sends an Image Adjustment Message to the client, whose state is now closer to the state of the distributed file.[10] While forwards are reasonably rare for active clients, their number can be reduced even further by additional information exchange between servers and clients.[13]

Other properties

File state calculation

The file state consists of the split pointer s and the level l. If the original file started with n0 buckets, then the number of buckets n and the file state are related via[13]

n = 2^l · n0 + s.
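The relation can also be inverted: given only the bucket count n and the initial size n0, the level and split pointer are recoverable, as this small Python sketch (with illustrative names) shows:

```python
def file_state(n, n0=1):
    """Recover (l, s) from a count of n buckets, using n = 2**l * n0 + s
    with 0 <= s < 2**l * n0."""
    l = 0
    while (2 ** (l + 1)) * n0 <= n:
        l += 1
    return l, n - (2 ** l) * n0
```

For example, a file that grew from one bucket to n = 7 is at level l = 2 with split pointer s = 3, since 7 = 2^2 · 1 + 3.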

Adoption in language systems

Griswold and Townsend [14] discussed the adoption of linear hashing in the Icon language. They discussed the implementation alternatives of the dynamic array algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.

Adoption in database systems

Linear hashing is used in the Berkeley database system (BDB), which in turn is used by many software systems, with a C implementation derived from the CACM article and first published on Usenet in 1988 by Esmond Pitt.

from Grokipedia
Linear hashing is a dynamic hashing technique for file and table addressing that allows the address space to grow or shrink incrementally, enabling support for arbitrary numbers of insertions and deletions without significant degradation in access time or memory utilization. Invented by Witold Litwin in 1980, it addresses the limitations of static hashing methods, which suffer performance drops due to overflows in dynamic environments, by using a predefined sequence of bucket splits rather than reorganizing the entire structure upon resizing. The algorithm operates using a family of hash functions h_i(key) = h(key) mod (2^i · N), where N is the initial number of buckets and i is the current level, starting at i = 0. Buckets are numbered sequentially from 0 to 2^i · N − 1, and a split pointer (often denoted "next") tracks the next bucket to split in a linear order during each round. When a bucket overflows, the algorithm splits the bucket indicated by the split pointer using the next-level hash function h_{i+1}, redistributing records to either the original or a new bucket based on their hash values, then advances the pointer. A round completes when the pointer reaches 2^i · N, at which point the level increments, doubling the address space, and the pointer resets to 0 for the next round. For lookups, insertions, or deletions, the primary hash function h_i is applied first; if the resulting bucket lies before the split pointer (indicating it has already been split this round), the algorithm instead uses the bucket computed with h_{i+1} to ensure the record is found, typically requiring at most two probes. This approach eliminates the need for a separate directory, unlike extendible hashing, simplifying the implementation while maintaining constant-time access. Key advantages include maintaining high load factors (up to 90%) with near-constant access times, averaging 1.03 probes at 60% load and 1.35 at 90%, and supporting graceful degradation without full reorganizations.
It achieves this through controlled overflows and linear splitting, avoiding the clustering issues in chained hashing and the directory overhead in other dynamic schemes, making it suitable for database systems and key-value stores. Linear hashing has influenced subsequent variants, such as distributed extensions like LH*, but retains its core appeal for its simplicity and performance stability under varying loads.

Overview

Definition and Motivation

Hashing is a technique that employs a hash function to compute an index for key-value pairs, enabling efficient storage and retrieval by mapping keys to buckets or slots. This method supports average-case constant-time operations for insertion, deletion, and search, assuming a uniform distribution of keys. The load factor, defined as the ratio of the number of stored elements to the total number of buckets, is a critical metric; it influences collision frequency, with optimal performance typically maintained below 0.7 to 0.8 to avoid excessive probing or long overflow chains. Static hashing, with its fixed number of buckets determined at initialization, encounters significant challenges in dynamic environments where data volumes fluctuate. High load factors lead to bucket overflows, necessitating expensive full-table rehashing or reorganization to restore performance, while preallocating excess buckets results in inefficient space utilization. These issues underscore the need for dynamic hashing schemes that expand or contract incrementally without global disruptions. Linear hashing is a dynamic hashing method that adjusts the address space through sequential bucket splitting guided by a split pointer, eliminating the need for a directory to resolve addresses. It mitigates the rehashing costs of static hashing and the memory overhead and access delays from directory doublings in extendible hashing variants, allowing growth one bucket at a time. This design facilitates precise load factor control, commonly targeting 60% to 90%, to balance space and performance. The technique delivers key benefits including average constant-time insertions and deletions, with mean successful search times ranging from 1.03 to 1.73 accesses for bucket capacities of 1 to 50 at loads up to 90%. Overflow chains remain bounded, preventing disproportionate slowdowns, and the structure's adaptability suits varying data scales, such as file systems or database indexes.

Historical Development

Linear hashing was developed by Witold Litwin in 1980 as a dynamic file organization technique designed to support insertions and deletions without requiring full file reorganization. It was first described in his seminal paper "Linear Hashing: A New Tool for File and Table Addressing," presented at the 6th International Conference on Very Large Data Bases (VLDB) in Montreal, Canada. The method emerged amid the rapid growth of relational database management systems in the late 1970s and early 1980s, when there was increasing demand for scalable storage solutions capable of handling fluctuating data volumes and workloads efficiently. Linear hashing built upon prior dynamic hashing approaches, notably extendible hashing, a directory-based predecessor introduced by Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, and H. Raymond Strong in 1979, which addressed similar challenges but relied on a directory structure that linear hashing sought to eliminate for simplicity. In the early 1980s, linear hashing attracted significant interest in the database research community for its potential in relational systems, leading to early extensions such as partial expansions proposed by Per-Åke Larson in 1980, which generalized Litwin's scheme to allow more controlled growth and better space efficiency. By the 1990s, as distributed computing gained prominence, further refinements adapted linear hashing for parallel and networked environments; a key milestone was the LH* algorithm, developed by Litwin, Marie-Anne Neimat, and Donovan A. Schneider in 1993, which extended the technique to scalable distributed files across multiple nodes.

Core Algorithm

Hash Functions

In linear hashing, the primary hash function h_m(k) = h(k) mod 2^m maps a key k to an initial address, where h(k) is a base hash function that maps arbitrary keys to integers uniformly distributed over a large range, and m denotes the current level, corresponding to the number of primary buckets before the ongoing expansion phase. This function ensures that keys are distributed across the existing bucket space in a manner that supports incremental growth without requiring full reorganization. The secondary hash function h_{m+1}(k) = h(k) mod 2^{m+1} serves to resolve mappings for keys affected by recent splits, distinguishing those that remain in the original bucket from those redirected to the newly created one. By considering one additional bit in the modulus, it facilitates the redistribution of keys during bucket expansion, maintaining balance as the file grows. The bucket address for a key k is determined by first computing the primary hash h_m(k). If this value falls below the current split pointer, the secondary hash h_{m+1}(k) is used instead; otherwise, the primary hash prevails. This selective application minimizes disruption to the majority of keys during each split. The following pseudo-code demonstrates the computation of the bucket address:

function bucket_address(k, m, split_pointer):
    h_val = h(k)                       # base hash
    h = h_val % (2 ** m)               # primary hash h_m(k)
    if h < split_pointer:
        h2 = h_val % (2 ** (m + 1))    # secondary hash h_{m+1}(k)
        return h2
    else:
        return h


These hash functions promote locality-preserving splits, as the secondary function reassigns keys from a split bucket to either the original location or a new bucket offset by 2^m, preserving relative proximity in the address space. Within individual buckets, collisions arising from multiple keys hashing to the same address are managed through chaining, which links overflow records in lists, or open addressing, which probes sequential locations until an empty slot is found. Formally, the bucket address selection is expressed as: bucket address = h_{m+1}(k) if h_m(k) < split pointer, and h_m(k) otherwise, where h_m(k) = h(k) mod 2^m and h_{m+1}(k) = h(k) mod 2^{m+1}.

Bucket Splitting Process

In linear hashing, the bucket splitting process is initiated to maintain balanced distribution as the file grows. In controlled splitting variants it is typically triggered when the overall load factor, the ratio of records to total bucket capacity, exceeds a predetermined threshold (e.g., 0.8 or 0.9, depending on the variant and bucket capacity). This threshold ensures that splits occur proactively rather than reactively on every overflow, allowing for efficient space utilization around 0.7 to 0.9 depending on implementation parameters. In uncontrolled splitting, as originally proposed, a split is triggered immediately upon an insertion causing a bucket overflow (local load factor reaching 1), but the process always targets the bucket indicated by the split pointer rather than the overflowing one, to promote uniform growth. The splitting procedure begins by identifying the bucket at the current position of the split pointer, which designates the next candidate for division in a linear sequence. All records in this selected bucket are then rehashed using a secondary hash function, such as h_{i+1}, to redistribute them approximately evenly: records hashing to the original bucket address remain there, while those mapping to the new address are moved to a freshly allocated bucket appended at the end of the file structure. This redistribution leverages the secondary function, typically an extension of the primary hash by one additional bit, to ensure balanced occupancy without requiring a full file reorganization. The new bucket is created sequentially, expanding the address space incrementally by one unit per split. Unlike directory-based dynamic hashing methods that double the entire structure at once, linear hashing employs partial expansions where only a single bucket is split per operation, enabling gradual file growth that adapts smoothly to insertion rates.
This approach avoids the overhead of global rehashing, with the address space effectively doubling after a full cycle of splits equal to the number of buckets at the start of the round, after which the process repeats with deeper hash levels. The result is a file that can handle varying workloads with minimal disruption, as each split operation is localized and completes in time proportional to the records in the affected bucket. Overflows, which arise from temporary collisions during insertions, are resolved through these splits, as redistributed records reduce chain lengths and restore primary bucket usage over time. For deletions, the scheme supports contraction via coalescing, the reverse of splitting: when the load factor falls below a lower threshold, the split pointer moves backward, merging records from a pair of sibling buckets back into one using the secondary hash function, thereby reclaiming space efficiently. During an insertion, the process integrates seamlessly into the access path: after computing the primary address, the algorithm checks the split pointer to determine whether the higher-level hash h_{i+1} should be used for buckets before the pointer; if the target is at or near capacity, a split is performed on the designated bucket before proceeding with the insertion, ensuring the new record finds space without immediate overflow. This check maintains the integrity of the linear expansion while keeping insertion costs amortized constant.

Split Pointer Management

In linear hashing, the split pointer p serves as a dynamic index that identifies the next bucket eligible for splitting during file expansion. Defined as an integer ranging from 0 to 2^m, where m represents the current level of the hash structure (buckets 0 through p − 1 having already been split at the current level), p ensures that splits occur sequentially, starting from 0 and progressing linearly through the bucket sequence. This pointer maintains the order of expansion, preventing the need for global reorganization by focusing splits on individual buckets as needed. The advancement of p is triggered immediately after a split operation on the bucket it currently points to, incrementing p by 1 to target the subsequent bucket. When p reaches 2^m, indicating that all buckets up to the current level have been split, the level m is incremented to m+1, effectively doubling the address space, and p resets to 0 to begin the splitting for the new level. This mechanism interacts with bucket splitting by designating the exact bucket for division into two, after which the pointer updates to continue the controlled growth. During key-to-bucket mapping for insertions, searches, or deletions, the split pointer integrates directly into the hashing algorithm to determine the appropriate bucket address. Specifically, the primary hash function h_m(k) (modulo 2^m) is used if the primary hash value satisfies h_m(k) ≥ p; otherwise, if h_m(k) < p, the secondary hash function h_{m+1}(k) (modulo 2^{m+1}) is applied to resolve the bucket, accounting for partially split buckets. This comparison ensures that keys are directed to either an existing bucket or one that has already been split, maintaining load balance without requiring full rehashing. Edge cases in split pointer management arise at power-of-two boundaries and during load adjustments via deletions.
At boundaries, when p wraps back to 0 upon a level increment, the structure seamlessly transitions to the expanded space, with all previous mappings remaining valid due to the linear progression. For deletions, if the overall load factor drops below a predefined threshold, the pointer can decrement p to initiate contraction through grouping (merging split pairs), reversing the expansion sequence to reclaim space efficiently while avoiding underutilization. These rules preserve the dynamic adaptability of linear hashing to varying workloads.

Variants and Extensions

LH* Algorithm

The LH* algorithm, proposed by Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider in 1993, is a scalable distributed data structure that generalizes linear hashing to parallel or distributed RAM and disk files. It supports dynamic growth and shrinkage across multiple servers without a central directory, using a split coordinator to manage incremental bucket splits one at a time. Deletions are handled implicitly through the splitting process, allowing the structure to adapt to varying loads in distributed environments. LH* partitions data across servers, with each bucket typically residing on a different node. Insertions and searches involve a small number of messages (e.g., 1-3 for inserts, 2-4 for searches in general cases), enabling efficient parallel operations like hash joins. Contractions occur similarly to splits when buckets empty, merging with buddy buckets to reclaim space. Later extensions, such as LH*RS (2004), add features for high availability in multicomputer settings.

Other Modifications

Parallel linear hashing adaptations have been developed for multi-processor systems to enable concurrent operations without significant performance degradation. In the 1990s, researchers proposed locking protocols that protect the split pointer during concurrent splits, allowing multiple threads to perform inserts and searches while a single thread handles bucket splitting under a directory lock. This approach minimizes lock contention by using selective locking on overflow chains and deallocating old buckets exclusively, supporting multithreaded environments in main memory databases. Distributed variants of linear hashing extend the structure to cluster environments by partitioning the hash space across multiple nodes, often using a global split pointer replicated or coordinated via directory structures. These adaptations, such as Distributed Dynamic Hashing (DDH), distribute data dynamically across networked servers while maintaining load balance through incremental bucket splits coordinated between nodes. Memory-optimized versions of linear hashing target in-memory databases by reducing rehashing overhead through cache-conscious designs and adaptive splitting. Post-2010 research has focused on variants that leverage modern hardware, such as Self-Adaptive Linear Hashing (SAL-Hashing) for solid-state drives integrated with in-memory components, which minimizes random writes by adjusting split thresholds based on access patterns. Recent analyses highlight linear hashing's suitability for main memory key-value stores, with optimizations like fringe behavior improvements via spiral extensions to enhance insertion throughput under varying loads. Hybrid approaches have been explored to combine elements of linear hashing with other techniques for better performance in specific scenarios. 
A specific example is partial linear hashing, introduced by Per-Åke Larson in 1980 and further developed in 1988, which supports dynamic files by allowing selective partial expansions of buckets rather than full splits, thereby reducing reorganization costs and accommodating varying data distributions.

Properties and Analysis

File State and Capacity

In linear hashing, the file's internal state is maintained through a set of variables that monitor its current capacity and dictate when expansions occur. The variable m denotes the current level, establishing the base range of the primary address space as 2^m buckets (assuming an initial N = 1 for simplicity; in general, 2^m · N). The split pointer p tracks the index of the next bucket designated for splitting, with 0 ≤ p ≤ 2^m. The total number of buckets n is computed as n = 2^m + p, reflecting the incremental growth beyond the power-of-two boundary. The capacity of the file is determined by the load factor λ, a configurable threshold typically ranging from 0.6 to 0.9, which represents the desired utilization ratio of space. The maximum number of records before triggering the next split is approximately λ × n × b, where b is the bucket capacity (often normalized to 1 for analysis). A split is initiated when the number of records exceeds λ × (2^m + p) × b, ensuring the file expands proactively to maintain performance. Once p reaches 2^m, the current level is complete: m is incremented to m+1, p resets to 0, and the address space effectively doubles, completing a full expansion round. Overflow handling is integrated into the state via tracking of chain lengths for each primary bucket, where excess records are stored in linked overflow buckets. During an insert operation, if a target bucket overflows, records are appended to its chain, and a split may be triggered based on the overall load; the state updates include rehashing affected records into the new bucket and incrementing p. Deletions reduce record counts and chain lengths, potentially enabling contractions if the load falls below a lower threshold (e.g., 0.5 λ), which merges buckets and decrements m or adjusts p accordingly. These transitions ensure the file state remains balanced without full reorganization.
For illustration, consider an initial empty file with m = 0 and p = 0, yielding n = 1 bucket and capacity for approximately 1.6 records (assuming λ = 0.8 and a bucket size of 2). After inserting the first two records, the load exceeds the threshold, triggering a split: the single bucket divides, p advances to 1, n becomes 2, and records are redistributed using the next-level hash function. Since p = 1 = 2^0, the level increments to m = 1 and p resets to 0.
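The trigger in this example can be checked numerically; the function below is a sketch with the example's assumed parameters (λ = 0.8, bucket size b = 2) as defaults:

```python
def should_split(num_records, m, p, b=2, lam=0.8):
    """Controlled-split trigger: split once the record count exceeds
    lam * n * b, where n = 2**m + p is the current number of buckets."""
    return num_records > lam * (2 ** m + p) * b
```

With m = 0 and p = 0 the threshold is 0.8 · 1 · 2 = 1.6 records, so the first insertion leaves the file unchanged while the second triggers a split.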

Performance Metrics

Linear hashing achieves average-case O(1) time for insert, search, and delete operations, primarily due to bounded overflow chains that limit the number of probes per operation. Splits are local to one bucket, with amortized constant cost across operations. In the original formulation, successful searches require approximately 1 to 1.77 disk accesses on average, while unsuccessful searches average 1.62 to 1.63 accesses, assuming bucket capacities of 1 to 50 records. Space efficiency in linear hashing is notably higher than in directory-based methods like extendible hashing, which incur up to 100% overhead from directory doublings. Linear hashing with partial expansions exhibits an overflow storage ratio of 1/13 to 1/17 (approximately 6-8% extra storage) at 85% utilization, enabling partial file growth without full rehashing. Load factor maintenance is optimized at 70-90%, with controlled splits allowing up to 90% utilization to maximize storage efficiency, an improvement of 31-40% over uncontrolled variants. Variance analysis reveals even distribution after splits, with successful search lengths stabilizing at 1.08-1.39 accesses and insertion costs at 4-5 accesses under 85% load, though temporary unevenness arises during growth phases. Scalability supports handling over 10^6 records with low rehash costs, as the structure grows linearly without directory overhead, extending to billions of buckets using minimal core memory (a few bytes for pointers). Benchmarks from VLDB analyses report linear hashing to be 2-5 times faster than directory-based alternatives for dynamic workloads, owing to reduced clustering and incremental expansions. A key trade-off is higher variance in chain lengths during growth phases compared to perfect hashing schemes, potentially increasing unsuccessful search times to 1.63 accesses at low bucket capacities, though this is mitigated by overflow chaining and partial expansions.

Implementations and Applications

Use in Programming Languages

In the 1980s, linear hashing was adopted in interpreters for efficient symbol-table management, where dynamic growth was essential for handling variable-sized environments without frequent full rehashes. For instance, the TAO Lisp system implemented a simple linear hashing scheme for its symbol tables to support fast lookups and insertions in a firmware-based interpreter. A prominent modern adoption appears in Erlang's Erlang Term Storage (ETS) module, where tables of types set, bag, and duplicate_bag employ linear hashing to enable dynamic resizing one bucket at a time. This implementation uses an array of fine-grained locks (initially 16, expanded to 64 in later versions) mapped to buckets via modulo operations, allowing concurrent reads and writes while maintaining amortized constant-time operations for inserts and lookups in set tables. Linear hashing's incremental bucket splitting offers advantages in garbage-collected languages by minimizing pause times during resizing, as it avoids the complete table rehashing that could interfere with collection cycles. In Erlang, this aligns well with the BEAM virtual machine's lightweight processes and generational garbage collection, enabling scalable in-memory storage for concurrent applications without blocking mutators during growth. Despite these benefits, linear hashing remains less prevalent in strictly typed languages like Java or C++, where simpler open-addressing schemes with periodic doubling predominate due to their predictability and lower implementation overhead in compile-time environments. This contrasts with its broader use in database systems for persistent storage, as explored elsewhere.

Adoption in Database Systems

Linear hashing has seen adoption in several commercial and open-source database systems, particularly for managing dynamic hash indexes and storage in environments requiring efficient equality-based lookups and incremental growth. One of the earliest and most influential implementations is in Berkeley DB, an embeddable key-value store developed by Oracle, where the hash access method employs an extended variant of linear hashing to support dynamic file expansion without full rehashing. This structure, based on Witold Litwin's original 1980 proposal, enables near-optimal performance for secondary storage by splitting buckets incrementally, making it suitable for persistent data in applications like caching and embedded databases. As of 2025, it remains integral to various embedded applications. Oracle's TimesTen in-memory database, designed for high-volume transaction processing, also utilizes extended linear hashing in its hash access method to organize data efficiently in memory while supporting disk overflow, providing predictable access times even under varying loads. In open-source systems, PostgreSQL has incorporated linear hashing into its hash indexes since version 7.1 (released in 2000), where the index method directly implements Litwin's algorithm to store only hash values of keys, optimizing for equality queries on types without a natural ordering. Similarly, MySQL's MEMORY storage engine supports hash indexes, facilitating fast in-memory lookups, while its partitioning features include linear hash partitioning to distribute data across subpartitions using a powers-of-two algorithm for quicker add/drop operations compared to standard hashing. In distributed and big data environments, variants like LH*, a scalable extension of linear hashing for parallel and distributed files, have influenced systems handling large-scale key-value storage, with each bucket potentially residing on separate nodes to enable fault-tolerant growth and shrinkage.
For instance, Amazon DynamoDB employs linear hashing internally for its dynamic hash tables in scalable key-value operations, supporting high-insert workloads such as log stores by minimizing reorganization during expansions. LH* is particularly advantageous in deletion-heavy applications within cloud databases, as it maintains balanced distribution across nodes without requiring global rehashing.