Hash table
Fundamentals
Definition and Operations
A hash table, also known as a hash map, is an associative array abstract data type that implements mappings from keys to values, storing key-value pairs in an array of buckets or slots, where a hash function computes the index for each key to enable efficient access.[4][5] The underlying storage is an array: a fixed-size collection of slots that accommodates the mapped entries.[6] Under the assumption of simple uniform hashing, the basic operations of insertion, search, and deletion achieve an average time complexity of O(1).[4] The insert operation adds a new key-value pair to the hash table. It computes the hash of the key to determine the target bucket index and places the pair into that bucket; if the key already exists, the value may be updated depending on the implementation.[6] Pseudocode for a basic insert, assuming no collisions for illustration, is as follows:
function insert(key, value):
    index = hash(key)            // maps the key to an array index
    table[index] = (key, value)
function search(key):
    index = hash(key)
    if table[index] is not empty and table[index].key == key:
        return table[index].value
    else:
        return not found
function delete(key):
    index = hash(key)
    if table[index] is not empty and table[index].key == key:
        remove table[index]
    else:
        do nothing               // key not present
Load Factor
The load factor of a hash table, denoted α, is defined as the ratio of the number of stored entries n to the number of buckets m, expressed as α = n/m. This metric quantifies the fullness of the table and serves as a key indicator for maintaining performance efficiency. Typical implementations set resizing thresholds around 0.7 to 0.75 to balance space usage and operation speed, as higher values risk excessive collisions.[7] A higher load factor increases the probability of collisions, which degrades search, insertion, and deletion performance by requiring more probes or comparisons.[8] In separate chaining, where collisions are handled by linking entries in buckets, the load factor directly determines the expected chain length, which averages α under the simple uniform hashing assumption.[9] Consequently, the expected cost of a successful search is roughly 1 + α/2 probes, as the initial bucket access is followed by traversing an average of α/2 elements in the chain.[8] In open addressing schemes, the load factor must remain strictly less than 1 to ensure available slots for insertions, since all entries occupy distinct buckets. For quadratic probing, performance notably degrades when α exceeds 0.5, as secondary clustering (where probes tend to revisit similar bucket sequences) intensifies, leading to longer probe sequences and potential insertion failures even before the table fills. To mitigate these effects, hash tables trigger resizing when the load factor surpasses a predefined threshold, typically by doubling the number of buckets, which approximately halves α.[7] This process assumes a uniform distribution of hash values to ensure collisions remain manageable post-resizing.
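To make this dependence concrete, the short Python sketch below (illustrative only, with ad hoc function names) evaluates the chaining estimate above alongside Knuth's linear-probing estimate discussed later in the article.
def chaining_successful_probes(alpha):
    # roughly one bucket access plus half the average chain length
    return 1 + alpha / 2

def linear_probing_unsuccessful_probes(alpha):
    # Knuth's classical estimate for linear probing
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  chaining ~{chaining_successful_probes(alpha):.2f} probes  "
          f"linear probing ~{linear_probing_unsuccessful_probes(alpha):.2f} probes")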
Hash Functions
Properties and Design Principles
A good hash function for use in hash tables must exhibit several key properties to ensure balanced distribution of keys and efficient operations. Primarily, it must be deterministic, meaning that identical input keys always produce the same output hash value, which is essential for consistent retrieval and storage.[4] Additionally, the function should achieve uniform distribution, mapping keys evenly across the range of possible hash values to minimize clustering in buckets and approximate the simple uniform hashing assumption.[4] Efficiency in computation is another critical property, as the hash function must evaluate quickly, often in constant time relative to key size, to avoid bottlenecking insert, search, and delete operations.[4] A desirable diffusion property is the avalanche effect, where even a single-bit change in the input causes approximately half the output bits to change, enhancing resistance to patterns in keys and improving overall mixing.[10] To provide theoretical guarantees on uniformity, especially when keys arrive in adversarial or non-random order, universal hashing employs a family of functions rather than a single one. A family H of hash functions from a key universe U to {0, 1, ..., m−1} (where m is the number of buckets) is universal if, for any distinct keys x ≠ y, the probability over a random choice of h ∈ H that h(x) = h(y) is at most 1/m.[11] This bound ensures that the expected number of collisions for any set of keys is close to that under ideal random hashing, with the probability of exceeding t times the expected number of collisions being less than 1/t for any t > 1.[11] In a simple probabilistic model, selecting h at random from such a family simulates independent uniform mapping of each key to buckets, yielding expected linear time for sequences of operations on chained hash tables when the load factor α is kept constant.[11] An explicit construction of a universal family, introduced by Carter and Wegman, uses affine transformations over a finite field. Let p be a prime larger than the key universe size and m the table size; the family consists of the functions h_{a,b}(k) = ((a·k + b) mod p) mod m, where a ∈ {1, ..., p−1} and b ∈ {0, ..., p−1} are chosen uniformly at random.[11]
function universal_hash(k, a, b, p, m):
    return ((a * k + b) % p) % m
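A minimal Python sketch of drawing one member of this family at random is given below; the choice of p as the Mersenne prime 2^61 − 1 and the helper name random_universal_hash are illustrative assumptions, not part of the construction itself.
import random

P = 2_305_843_009_213_693_951   # the Mersenne prime 2^61 - 1, assumed larger than any key

def random_universal_hash(m, rng=random):
    # Draw h_{a,b}(k) = ((a*k + b) mod p) mod m with a in {1..p-1}, b in {0..p-1}.
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

h = random_universal_hash(m=1024)
print(h(42), h(43))   # any two distinct keys collide with probability about 1/m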
Common Techniques
Common techniques for hash functions focus on transforming keys into table indices efficiently while promoting uniformity. For integer keys, several methods have been developed to achieve this.
Integer Hashing
The division method is a straightforward approach for integer keys, defined as h(k) = k mod m, where k is the key and m is the table size, producing a remainder in the range [0, m−1].[14] This method assumes keys are drawn from a universe of integers, typically from 0 to U−1 where U >> m, and derives uniformity from choosing m carefully with respect to the key distribution (e.g., avoiding m equal to a power of 2, which would use only the low-order bits of the key), so that the remainders approximate a uniform distribution over the slots.[14] For example, with m = 10 and k = 27, h(27) = 27 mod 10 = 7; for k = 123, h(123) = 123 mod 10 = 3.[14] The multiplication method improves on this by leveraging fractional parts for better distribution, given by h(k) = ⌊m · (k·A mod 1)⌋, where A is a constant between 0 and 1 and k·A mod 1 denotes the fractional part of k·A.[14] Knuth recommends A ≈ (√5 − 1)/2 ≈ 0.6180339887, the reciprocal of the golden ratio, as it yields low-discrepancy sequences that enhance uniformity regardless of key distribution patterns.[14] The derivation of uniformity stems from the irrationality of A, ensuring that the sequence of fractional parts k·A mod 1 is dense and equidistributed in [0, 1) for sequential integer k, thus mapping evenly to [0, m−1] after scaling and flooring.[14] For m = 10, A ≈ 0.618, and k = 27, first compute 27 × 0.618 ≈ 16.686, take the fractional part 0.686, then h(27) = ⌊10 × 0.686⌋ = 6; for k = 123, 123 × A ≈ 76.018, fractional part 0.018, h(123) = ⌊10 × 0.018⌋ = 0.[14] As a historical example, the mid-square method squares the key and extracts the middle digits to form the hash value, suitable for fixed-length integer representations.[15] For a key with d digits, square it to get up to 2d digits and take the middle d digits as h(k), reducing mod m if needed.[15] This was an early technique but often produces poor uniformity due to clustering in squared values.[14] For m = 10 and k = 27 (squared = 729, padded to four digits as 0729, middle digits 72), h(27) = 72 mod 10 = 2; for k = 123 (squared = 15129, padded to 015129, middle digits 151), h(123) = 151 mod 10 = 1.[15] These integer methods typically assume keys belong to a universe of non-negative integers from 0 to U−1, with U much larger than m to ensure probabilistic uniformity.[14]
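The following Python sketch reproduces the worked examples above for the division, multiplication, and mid-square methods; the function names are ad hoc, and the mid-square padding convention (pad the square to twice the key's digit count) is one of several found in practice.
import math

def division_hash(k, m):
    return k % m

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0                      # fractional part of k*A
    return int(m * frac)

def mid_square_hash(k, m):
    d = len(str(k))                           # number of digits in the key
    squared = str(k * k).zfill(2 * d)         # pad the square to 2d digits
    start = (len(squared) - d) // 2
    return int(squared[start:start + d]) % m  # middle d digits, then mod m

for k in (27, 123):
    # reproduces the text: 27 -> 7, 6, 2 and 123 -> 3, 0, 1
    print(k, division_hash(k, 10), multiplication_hash(k, 10), mid_square_hash(k, 10))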
String Hashing
For string keys, the polynomial rolling hash treats the string as a base-b number modulo a prime p, computed as h(s) = (s_0·b^(n−1) + s_1·b^(n−2) + ... + s_(n−1)·b^0) mod p, where the s_i are character codes (e.g., ASCII), n is the string length, b is the base, and p is a large prime.[16] The base b is chosen as a small prime like 31 or 131 to balance computation and avalanche effects, while p (e.g., a 64-bit prime) is selected large to keep the collision probability for a pair of random strings near 1/p, with birthday-paradox effects governing collisions among many strings.[16] This formulation, used in the Rabin-Karp algorithm, enables efficient sliding-window updates for substring hashing by subtracting the contribution of the dropped character, multiplying by b, and adding the new character, all mod p.[16] For example, with the string "abc" (a = 97, b = 98, c = 99), base b = 31, p = 101, and n = 3: h("abc") = (97·31² + 98·31 + 99) mod 101 = (93217 + 3038 + 99) mod 101 = 96354 mod 101 = 0.[16]
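A hedged Python sketch of this scheme follows, using the same small parameters as the worked example (b = 31, p = 101); production code would use a much larger prime and typically 64-bit arithmetic.
B, P = 31, 101   # small base and prime from the worked example

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:                  # Horner's rule: s_0*b^(n-1) + ... + s_(n-1)
        h = (h * B + ord(ch)) % P
    return h

def roll(prev_hash: int, dropped: str, added: str, window: int) -> int:
    # Slide a fixed-length window one character to the right, as in Rabin-Karp.
    prev_hash = (prev_hash - ord(dropped) * pow(B, window - 1, P)) % P
    return (prev_hash * B + ord(added)) % P

print(poly_hash("abc"))                              # 0, matching the example above
assert roll(poly_hash("abc"), "a", "d", 3) == poly_hash("bcd")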
Hybrid Approaches
For complex keys like memory addresses, hybrid methods combine techniques such as XOR folding, which divides the key into fixed-width parts and XORs them to produce a compact value before applying another hash like division.[17] This is useful for network addresses, folding octets via successive XOR to mix bits evenly without overflow issues.[17] For a 32-bit address 0x1A2B3C4D with m = 10, fold into 16-bit halves (0x1A2B ^ 0x3C4D = 0x2666), then fold into bytes (0x26 ^ 0x66 = 0x40), and h = 0x40 mod 10 = 4; for 0x11223344, 0x1122 ^ 0x3344 = 0x2266, then 0x22 ^ 0x66 = 0x44, and h = 0x44 mod 10 = 8.[17]
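A small Python sketch of this folding scheme, reproducing the two worked values above, might look as follows (the function name is ad hoc):
def xor_fold_hash(addr32: int, m: int) -> int:
    high16, low16 = (addr32 >> 16) & 0xFFFF, addr32 & 0xFFFF
    folded16 = high16 ^ low16                       # fold 32 bits down to 16
    folded8 = ((folded16 >> 8) ^ folded16) & 0xFF   # fold 16 bits down to 8
    return folded8 % m                              # finish with the division method

print(xor_fold_hash(0x1A2B3C4D, 10))   # 4
print(xor_fold_hash(0x11223344, 10))   # 8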
Collision Resolution
Separate Chaining
Separate chaining resolves collisions in hash tables by associating each bucket with a linked list (or chain) that stores all keys hashing to that index. When inserting a key that collides with existing entries, it is appended to the corresponding chain, allowing multiple elements per bucket without requiring alternative slots. This approach maintains the hash table's array structure while deferring collision handling to the auxiliary lists, which grow dynamically as needed. The primary operations (insertion, search, and deletion) in separate chaining rely on computing the hash index and then traversing the chain linearly. For insertion, the key is typically added at the head of the list for constant-time access, though appending to the end preserves order if desired (requiring a tail pointer for efficiency). Search and deletion involve scanning the chain from the head until the key is found or the end is reached; deletion requires updating the next pointers to remove the node while preserving the list. Pseudocode for these operations, assuming a hash table T stored as an array of linked lists and a hash function h, is as follows:
Insertion:
CHAINED-INSERT(T, k)
    x = create new node with key k and value (if applicable)
    i = h(k) mod m    // m is the number of buckets
    x.next = T[i].head
    T[i].head = x
CHAINED-SEARCH(T, k)
    i = h(k) mod m
    x = T[i].head
    while x ≠ null and x.key ≠ k
        x = x.next
    return x    // null if not found
CHAINED-DELETE(T, k)
    i = h(k) mod m
    x = T[i].head
    if x = null
        return    // empty bucket, key not present
    if x.key = k
        T[i].head = x.next
        free x
        return
    y = x
    x = x.next
    while x ≠ null and x.key ≠ k
        y = x
        x = x.next
    if x ≠ null
        y.next = x.next
        free x
Open Addressing
Open addressing, also known as closed hashing, resolves collisions in hash tables by storing all elements directly within a fixed-size array, without using auxiliary data structures like linked lists. Upon inserting a key k, the initial position is computed as h(k, 0) = h'(k) mod m, where m is the array size and h' is the hash function; if that slot is occupied, a probe sequence h(k, i) for i = 1, 2, ... is followed until an empty slot is found or the key is matched during search or deletion. This approach assumes simple uniform hashing and keeps the load factor α = n/m below 1, where n is the number of elements, to ensure termination with high probability.[20]
Deletion in open addressing cannot simply empty a slot, as this would disrupt probe sequences for other elements that might traverse it; instead, a special marker called a tombstone is placed, indicating the slot is available for future insertions but treated as occupied during searches to preserve chain integrity. Tombstones accumulate over time, effectively increasing the load factor and necessitating periodic rehashing to maintain performance.[20]
Linear probing, the earliest and simplest variant, defines the probe sequence as h(k, i) = (h'(k) + i) mod m. Introduced by Peterson in 1957 for random-access storage systems, it promotes sequential access but suffers from primary clustering: insertions near an occupied run of slots tend to extend the cluster, leading to longer average probe lengths. For instance, if keys hashing to indices 10, 11, and 12 are inserted into a table of size 20, a new key hashing to 10 will probe sequentially through 10, 11, and 12 before finding a free slot, exacerbating delays for nearby hashes. Knuth analyzed this in detail, showing that the average number of probes for an unsuccessful search approximates (1/2)(1 + 1/(1 − α)²).
Quadratic probing mitigates primary clustering by using a quadratic offset: h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m, where the constants c₁ and c₂ are chosen to ensure good distribution, such as c₁ = c₂ = 1/2 (equivalently, probing with offsets (i² + i)/2 in integer arithmetic). This produces non-linear jumps, reducing the tendency for probes to fill contiguous blocks, though it introduces secondary clustering: keys sharing the same initial index h'(k) follow identical probe sequences. If m is prime, the first ⌈m/2⌉ probes are distinct, guaranteeing coverage of half the table before repetition. Knuth discusses its analysis, noting it performs comparably to linear probing at low loads but requires careful parameter selection to avoid permutation issues.
Double hashing further improves distribution by employing two independent hash functions: h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₂(k) provides a variable step size, commonly set as h₂(k) = 1 + (k mod (m − 1)) to ensure it is positive and less than m. This method approximates uniform probing, minimizing both primary and secondary clustering by making probe sequences key-dependent and pseudo-random. Introduced in Knuth's work and analyzed in subsequent papers, double hashing achieves near-ideal performance up to load factors around 0.8, with the average number of probes for an unsuccessful search close to 1/(1 − α).[21]
Open addressing variants excel in cache efficiency due to localized, predictable memory accesses, outperforming chaining on modern hardware for small-to-medium tables. However, they demand load factors below roughly 0.7 to 0.8 to keep probe lengths reasonable; beyond this, clustering causes sharp degradation, with unsuccessful searches under linear probing averaging up to (1/2)(1 + 1/(1 − α)²) probes. Deletion via tombstones adds overhead, as they occupy space without utility, often requiring compaction or resizing to control the effective load. In contrast to separate chaining, open addressing avoids pointer overhead but imposes stricter capacity limits.[20]
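The three probe sequences can be sketched compactly in Python; the helper names and toy parameters below are illustrative, and the inputs h1 and k stand in for the outputs of real hash functions.
def linear_probe(h1, m, probes):
    return [(h1 + i) % m for i in range(probes)]

def quadratic_probe(h1, m, probes):
    # c1 = c2 = 1/2, i.e. offsets (i^2 + i)/2 in integer arithmetic
    return [(h1 + (i * i + i) // 2) % m for i in range(probes)]

def double_hash_probe(k, m, probes):
    h1 = k % m
    h2 = 1 + (k % (m - 1))        # step size, always in [1, m-1]
    return [(h1 + i * h2) % m for i in range(probes)]

m, k = 11, 47
print(linear_probe(k % m, m, 5))       # [3, 4, 5, 6, 7]
print(quadratic_probe(k % m, m, 5))    # [3, 4, 6, 9, 2]
print(double_hash_probe(k, m, 5))      # step 8 -> [3, 0, 8, 5, 2]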
Resizing Strategies
Standard Resizing
In standard hash table resizing, the process is triggered when the load factor α, defined as the ratio of the number of elements n to the table size m (α = n/m), exceeds a maximum threshold α_max, typically set between 0.5 and 1 to maintain performance.[22][23] Upon exceeding this threshold, the table size m is doubled to 2m, and all existing n elements are reinserted into the new table using the hash function adjusted for the updated size, often by recomputing h(k) mod 2m where h is the same underlying hash function.[22][23] This rehashing step ensures the load factor returns to approximately α_max / 2, preserving the uniform distribution of keys under the assumption of a good hash function.[23] The resizing operation incurs a temporary cost of Θ(n + m) time and additional space for the new table, as every element must be rehashed and reinserted, which can lead to pauses in performance during the rebuild.[23] However, by doubling the size each time, the frequency of resizes decreases geometrically (each subsequent resize handles roughly twice as many elements before the next trigger), resulting in an amortized O(1) cost per insertion or deletion operation when analyzed using the potential method.[22][23] For a sequence of n insertions starting from an empty table, the total rehashing cost sums to O(n), as the series of resize costs (n + n/2 + n/4 + ...) converges to less than 2n operations.[22] To prevent excessive space usage when the table becomes underutilized, shrinking is performed by halving the table size m to m/2 when the load factor falls below a minimum threshold, such as α_max / 4 (e.g., 0.25 if α_max = 1), followed by rehashing all elements into the smaller table.[22] This threshold avoids oscillation between frequent grows and shrinks, maintaining efficiency while reclaiming memory.[22] The standard resizing approach applies to both separate chaining and open addressing implementations, as the rehashing process rebuilds the underlying structure regardless of collision resolution method.[23] The following pseudocode illustrates a resize-integrated insert operation for a separate chaining hash table, where resizing is checked on each insertion (adaptable to open addressing by modifying the insert step):
function INSERT(key, value):
    if n / m > α_max:
        RESIZE(2 * m)
    index = HASH(key) mod m
    // For separate chaining: prepend to the chain at T[index]
    new_node = new Node(key, value)
    new_node.next = T[index]
    T[index] = new_node
    n = n + 1

function RESIZE(new_m):
    old_T = T
    old_m = m
    m = new_m
    T = new array of size m    // Initialize empty chains
    for i = 0 to old_m - 1:
        current = old_T[i]
        while current != null:
            index = HASH(current.key) mod m
            temp = current.next
            current.next = T[index]
            T[index] = current
            current = temp
    delete old_T
    // For shrinking, call RESIZE(m / 2) in DELETE when n / m < α_min
Incremental Approaches
Incremental approaches to hash table resizing aim to mitigate the performance pauses associated with standard doubling or halving operations by distributing the rehashing workload over multiple insertions or operations, enabling gradual adaptation to changing data volumes.[25] These methods are particularly valuable in environments requiring low-latency responses, such as databases and distributed systems, where full rehashing could introduce unacceptable delays.[26]
Linear Hashing
Linear hashing, introduced by Witold Litwin in 1980, facilitates dynamic growth by splitting buckets incrementally using a split pointer, avoiding the need for a complete table reorganization.[25] The algorithm employs a family of hash functions defined as h_i(k) = h(k) mod (2^i · N), where N is the initial number of buckets, i is the current level (starting at 0), and h is a base hash function producing values much larger than the table size.[26] A split pointer, denoted Next, indicates the next bucket to split and advances linearly through the table. During insertion, the key is hashed using h_level to locate the primary bucket; if the computed index is less than Next, that bucket has already been split in the current round, and the key is rehashed with h_{level+1}. If the bucket has space, the key is inserted directly; otherwise, an overflow page is chained to it temporarily.[25] To resolve the overflow, the bucket at the Next position is split: its contents are rehashed using h_{level+1}, distributing records between the original bucket and a newly allocated one, after which Next increments by 1.[26] This process continues per insertion as needed, with one split per overflow event, maintaining a near-constant load factor (up to 90% for file-oriented implementations).[25] When Next reaches the end of the current range (2^level · N buckets), the level increments, Next resets to 0, and a new round of splits begins, effectively doubling the address space over time without batch reprocessing.[26] Overflow handling relies on chaining pages, but the round-robin splitting prevents long chains by proactively redistributing load.[25]
Example: Step-by-Step Insertion with Split in Linear Hashing
Consider an initial table with N = 2 buckets (indices 0 and 1), level 0, Next = 0, and h(k) = k for simplicity (actual implementations use a strong hash). Assume each bucket holds a single record, so a second key in a bucket forces an overflow.[26]
- Insert 10: h_0(10) = 10 mod 2 = 0; bucket 0 is empty, so insert into bucket 0 (now: [10]).
- Insert 12: h_0(12) = 12 mod 2 = 0; bucket 0 is full, so chain an overflow page to it (now: [10 | 12]). The overflow triggers a split at Next = 0.
- Split bucket 0 using h_1(k) = k mod 4: rehashing its contents sends 10 (10 mod 4 = 2) to the newly allocated bucket 2, while 12 (12 mod 4 = 0) stays in bucket 0. Post-split: bucket 0: [12], bucket 2: [10], the overflow page is freed, and Next = 1.
- Subsequent operations proceed similarly: a lookup computes h_0(k) and, if the result is less than Next, recomputes with h_1(k), so keys relocated by earlier splits are still found. This illustrates one split per overflow, spreading the resizing cost across insertions.[26] A compact code sketch of this addressing and splitting scheme follows.
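The sketch below is a minimal in-memory Python rendering of this logic, under the simplifying assumptions that buckets are plain lists, overflow records simply stay in the same list, and Python's hash of a small integer equals the integer itself; the class and constant names are ad hoc.
CAPACITY = 2   # stands in for a disk page size

class LinearHashTable:
    def __init__(self, initial_buckets=2):
        self.N = initial_buckets          # initial number of buckets
        self.level = 0                    # current round number
        self.next_split = 0               # split pointer ("Next")
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        # h_level(k) = h(k) mod (2^level * N); rehash with h_{level+1} if the
        # bucket has already been split in this round.
        idx = hash(key) % (self.N * (1 << self.level))
        if idx < self.next_split:
            idx = hash(key) % (self.N * (1 << (self.level + 1)))
        return idx

    def insert(self, key, value):
        bucket = self.buckets[self._address(key)]
        bucket.append((key, value))
        if len(bucket) > CAPACITY:        # overflow: split the bucket at Next
            self._split()

    def lookup(self, key):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None

    def _split(self):
        old = self.buckets[self.next_split]
        self.buckets.append([])           # new image bucket at index N*2^level + Next
        new_modulus = self.N * (1 << (self.level + 1))
        stay, move = [], []
        for k, v in old:
            (stay if hash(k) % new_modulus == self.next_split else move).append((k, v))
        self.buckets[self.next_split] = stay
        self.buckets[-1] = move
        self.next_split += 1
        if self.next_split == self.N * (1 << self.level):   # round complete
            self.level += 1
            self.next_split = 0

lht = LinearHashTable()
for k in (10, 12, 23, 39, 48, 51):
    lht.insert(k, str(k))
print(len(lht.buckets), lht.lookup(39))   # bucket count grows one split at a time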
Extendible Hashing
Extendible hashing, proposed by Ronald Fagin et al. in 1979, uses a directory of pointers to bucket pages, enabling dynamic resizing through selective directory expansion without rehashing the entire dataset. The directory is an array of size 2^d, where d is the global depth, representing the number of high-order bits of the hash value used to index it. Each bucket has a local depth d' ≤ d, indicating how many of those bits uniquely identify the records within it.[26] For insertion, the hash provides the binary address; the first d bits index the directory to find the bucket pointer. If space exists, the record is inserted; if the bucket is full, it is split by incrementing its local depth to d' + 1, creating a new bucket for the records whose additional bit differs, and redistributing records based on that bit. If the new local depth would exceed the global depth, the directory is doubled (the global depth increments), duplicating pointers to maintain the mappings. Multiple directory entries may point to the same bucket as long as its local depth is lower than the global depth, which also allows coalescing during contraction.[26] This approach guarantees at most two disk accesses for lookups and confines rehashing to the split bucket's records, making it suitable for disk-based systems.
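A minimal in-memory Python sketch of the directory-and-split mechanism is shown below; it indexes the directory by the low-order bits of the hash rather than the high-order bits described above (the two conventions are symmetric), and the class names, bucket capacity, and MD5-based mixer are illustrative assumptions.
import hashlib

def _h(key) -> int:
    # A stable illustrative hash; its low-order bits index the directory below.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendibleHash:
    BUCKET_CAPACITY = 4                        # stands in for a disk page size

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # low-order global_depth bits of the hash select the directory entry
        return _h(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return self.directory[self._dir_index(key)].items.get(key)

    def insert(self, key, value):
        bucket = self.directory[self._dir_index(key)]
        if key in bucket.items or len(bucket.items) < self.BUCKET_CAPACITY:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)                # retry; terminates once a distinguishing bit is found

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory   # double the directory, duplicating pointers
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)    # newly significant bit
        for k in list(bucket.items):
            if _h(k) & bit:
                new_bucket.items[k] = bucket.items.pop(k)
        for i, b in enumerate(self.directory): # repoint entries whose index has that bit set
            if b is bucket and (i & bit):
                self.directory[i] = new_bucket

eh = ExtendibleHash()
for k in range(20):
    eh.insert(k, k * k)
print(eh.global_depth, eh.lookup(7))   # the directory has deepened; lookup(7) returns 49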
Consistent Hashing
Consistent hashing, developed by David Karger et al. in 1997, addresses incremental resizing in distributed hash tables by minimizing key migrations when nodes are added or removed.[29] Keys and nodes are mapped via independent hash functions to points on a fixed circle (e.g., the unit interval [0, 1) or the integers modulo 2^32), with each key assigned to the nearest node encountered moving clockwise from the key's point.[29] To improve load balance, each physical node is represented by multiple virtual nodes (typically on the order of log N copies, where N is the number of physical nodes), hashed separately to create evenly spaced points.[29] Upon adding or removing a node, only the keys in the affected arcs (between the joining or departing node's virtual points and their predecessors) are reassigned, an expected fraction of roughly 1/N of all keys.[29] This ensures smooth load distribution, with the maximum load staying within a small constant factor of the ideal share when enough virtual nodes are used, and with spread (the number of nodes holding copies of a key across client views) bounded logarithmically.[29] Deletions follow symmetrically, reassigning keys to their successors. These incremental methods offer lower latency during resizes for large-scale tables compared to batch rehashing, as costs are amortized over operations, but they introduce greater implementation complexity due to pointers, levels, and virtual mappings.[25][29] They are especially suitable for database indexing and distributed caching, where predictable latency is critical.[26]
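A small Python sketch of a consistent-hashing ring with virtual nodes follows; the class name HashRing, the use of MD5 as a point generator, and the node labels are illustrative assumptions rather than part of the original scheme.
import bisect
import hashlib

def _point(label: str) -> int:
    # Map a string to a point on the ring [0, 2^32); MD5 is just a stable mixer here.
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % (1 << 32)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes       # virtual nodes per physical node, smoothing the load
        self._points = []          # sorted ring positions
        self._owners = {}          # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            bisect.insort(self._points, p)
            self._owners[p] = node

    def remove_node(self, node):
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            self._points.remove(p)
            del self._owners[p]

    def node_for(self, key):
        # Walk clockwise: first ring point at or after the key's point, wrapping around.
        p = _point(str(key))
        idx = bisect.bisect_left(self._points, p) % len(self._points)
        return self._owners[self._points[idx]]

# Only keys in the arcs claimed by the new node's virtual points move when it joins.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in range(10000)}
ring.add_node("cache-d")
moved = sum(1 for k in range(10000) if ring.node_for(k) != before[k])
print(f"{moved / 10000:.1%} of keys moved")   # roughly a quarter, as expected with 4 nodes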
Performance Analysis
Time Complexity
The time complexity of hash table operations is analyzed under the simple uniform hashing assumption, where each key is equally likely to hash to any slot independently of other keys. For both separate chaining and open addressing, the average-case time complexity for insertion, search, and deletion is O(1) when the load factor α = n/m (with n keys and table size m) is held constant. On modern CPUs, these operations achieve practical latencies dominated by cache access times, with L1 cache hits around 1 ns and higher cache levels up to about 4 ns, enabling lookups as low as 2 ns and iterations or probes in 2–3 ns under ideal cache-resident conditions and minimal collisions.[30] In separate chaining, the expected length of each chain is α, leading to expected Θ(1 + α) time for both unsuccessful and successful searches; with constant α, both simplify to O(1). For open addressing with linear probing, the expected number of probes for an unsuccessful search is approximately (1/2)(1 + 1/(1 − α)²), while for a successful search it is approximately (1/2)(1 + 1/(1 − α)); again, constant α yields O(1) average time.[20] Amortized analysis accounts for resizing, which doubles the table size when α exceeds a threshold (e.g., 0.5–1.0) to maintain performance. Over a sequence of n insertions starting from an empty table, the total time is O(n) because each key is moved a constant expected number of times during resizes, yielding amortized O(1) time per operation via the aggregate method.[20] In the worst case, all keys may hash to the same slot, resulting in Θ(n) time per operation due to linear scanning of the chain or probe sequence, exacerbated by poor hash functions or adversarial inputs that defeat uniformity. In 2025, research showed that static open-addressing hash tables can achieve sublinear expected search times without element reordering during construction, improving upon classical worst-case bounds under certain conditions.[20][31] A key space-time trade-off arises from the load factor: lower α (achieved by increasing m) reduces the expected number of probes and thus average time, but at the cost of higher space usage, as the table holds more empty slots.[20] Certain variants improve worst-case guarantees. Replacing the linked lists in separate chaining with balanced binary search trees yields O(log n) worst-case time for operations, though with higher constant factors. Robin Hood hashing, an open-addressing technique, balances probe lengths by displacing an already-placed key whose probe distance is shorter than that of the key being inserted, reducing maximum probe sequences and improving practical worst-case performance without changing the asymptotic bounds.[20][32]
Space Efficiency and Trade-offs
Hash tables achieve linear space complexity in the number of elements n, denoted O(n), but incur overhead that elevates the constant factor beyond 1. This overhead arises from the allocation of m buckets, where m is proportional to n (roughly n/α for a target load factor α), plus storage for keys and values; typically, each bucket requires space for a pointer or fixed-size entry, leading to m times the size of a pointer or entry for the table structure alone.[33] In separate chaining, collisions are resolved by linking elements in lists, adding one pointer per element beyond the initial bucket pointers, which can increase space usage by approximately 50% or more depending on the load factor, as each node in the chain requires additional memory for links. Open addressing, by contrast, avoids these per-element pointers by probing within the array but wastes slots due to clustering, necessitating a larger m to maintain performance and thus higher overall space for the same n. This pointer overhead in chaining precludes full space efficiency in stable hash tables, while open addressing trades density for simplicity but still leaves allocated space underutilized at high loads.[34][33] The load factor α critically influences this space usage, with values of 0.5 to 0.7 recommended for open addressing to balance probe lengths and avoid excessive clustering, resulting in a space multiplier of 1/α (about 1.4 to 2 times the minimum number of slots needed). For separate chaining, higher loads up to 0.7 or 0.8 are feasible with minimal space penalty beyond pointers, as chains grow linearly without wasted slots, though performance degrades if α significantly exceeds 1. These ranges ensure space efficiency without disproportionate time costs from collisions.[35] Compression techniques further mitigate overhead in space-constrained settings. Using power-of-two table sizes enables fast modulo operations via bit masking (e.g., index = h(k) & (m − 1)), avoiding costly division while maintaining even distribution and allowing efficient resizing by factors of 2. Bit-packing of keys and fingerprints in compact hashing schemes reduces per-entry size by storing multiple elements per word or using succinct representations, achieving near-optimal space (close to n times the key size) while supporting constant-time access in expectation.[33] In distributed systems, consistent hashing introduces space trade-offs for availability and load balancing; virtual nodes or explicit replication assign multiple copies of data across nodes to handle failures or hotspots, increasing total storage by a factor equal to the replication degree (often 3 or more) at the cost of reduced per-node efficiency. This ensures minimal remapping on node changes but elevates overall space usage compared to non-replicated single-node hash tables. Compared to balanced binary search trees, which also require O(n) space due to per-node pointers and keys, hash tables exhibit a constant factor driven by load-factor underutilization and collision structures, typically 1.5 to 2 times the raw element storage, versus about 3 times for trees (two pointers per node plus key and value). However, hash tables provide average O(1) time without the O(log n) scaling of tree-based structures, making them preferable when average-case performance and density are prioritized over worst-case guarantees.
Applications
Core Data Structures
Hash tables provide an efficient implementation for associative arrays, also known as maps or dictionaries, which store pairs of unique keys and associated values. A hash function maps each key to an array index, allowing average constant-time access, insertion, and deletion operations by storing or retrieving values directly from the computed slot, subject to collision resolution.[36][37] Hash sets, a keys-only variant of maps, leverage hash tables by storing elements as keys with implicit or dummy values, enforcing uniqueness since identical keys result in overwrites rather than duplicates. This structure supports efficient membership testing in average O(1) time, while operations like union and intersection can be computed by hashing elements from one set into a temporary table and checking overlaps with another.[37][38] For ordered variants, hash tables can be combined with balanced search trees, such as treaps, either by using trees within overflow buckets or augmenting the overall structure to maintain sorted order alongside hashed indexing, enabling both fast lookups and ordered traversal.[39][40] Unordered collections like Python's built-in dict and set rely on hash tables for their core implementation, where keys are hashed to enable rapid access; set does not preserve insertion order, while dict has preserved insertion order since Python 3.7.[41][38]
A key limitation of hash table-based structures is the absence of inherent ordering, requiring separate mechanisms for sorted access, and the prohibition of duplicates, as the mapping enforces unique keys by design.[36][37]
Specialized Uses
Hash tables find specialized application in database indexing, where they support rapid equality-based queries. Hash indexes map keys to data locations using a hash function, enabling average O(1) lookup times for exact matches, which is particularly advantageous in scenarios like point queries in relational databases.[42] In contrast to ordered structures like B-trees, which facilitate range scans and inequality operations through their sorted leaf nodes, hash indexes do not preserve order and thus perform poorly on range queries, making B-trees the standard for versatile indexing needs.[42] Extendible hashing addresses the limitations of static hashing by dynamically adjusting bucket sizes via a directory that points to data pages based on hash prefixes, allowing the structure to grow or shrink without full reorganization as the database expands.[43] This technique, introduced by Fagin, Nievergelt, Pippenger, and Strong in 1979, is employed in filesystems and databases to maintain performance under varying loads, with directory splits occurring only when overflows demand deeper prefix resolution.[43] In caching mechanisms, hash tables provide the foundation for constant-time key access, often combined with other structures for eviction policies. Least Recently Used (LRU) caches integrate a hash table for O(1) lookups with a doubly-linked list to track usage order, where accessed items are moved to the front and the tail item is evicted upon capacity limits, achieving amortized O(1) operations overall. Bloom filters extend hash-based sets probabilistically, using multiple independent hash functions to set bits in a bit array for membership representation; they offer space savings over traditional hash sets by tolerating a tunable false-positive rate (typically 1-10%) while guaranteeing no false negatives, making them ideal for applications like duplicate detection in large streams where exact storage is impractical. Transposition tables in game artificial intelligence, such as chess engines, leverage hash tables to cache search results and avoid redundant computations for equivalent board states. These tables store evaluation scores, depths, and best moves keyed by position hashes, with collisions resolved by depth or exact-match checks to prune search trees.
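As an illustration of the Bloom filter idea described above, the following Python sketch sizes the bit array with the standard formulas and derives its k indices from a single SHA-256 digest (a Kirsch-Mitzenmacher-style shortcut); the class and method names are ad hoc.
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m/n) ln 2 hash functions.
        n, p = expected_items, false_positive_rate
        self.m = max(1, int(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, int(round((self.m / n) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k indices from two base hashes taken out of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("bob@example.com"))     # False with high probability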
Zobrist hashing, a seminal method for generating transposition-table keys, assigns random 64-bit values to each piece-type-square combination and XORs them across the board, producing a near-unique signature with low collision probability (about 2^{-64} per pair of positions) that updates incrementally with moves.[44] Developed by Albert Zobrist in 1970, this approach significantly accelerates alpha-beta search by detecting transpositions, i.e., positions reached via different move sequences.[44] Beyond these, hash tables underpin symbol tables in compilers, mapping identifiers to attributes like types, scopes, and memory offsets for efficient semantic analysis and code generation across multiple compilation phases.[45] In storage systems, they facilitate data deduplication by computing cryptographic hashes (e.g., SHA-256) of file chunks as unique fingerprints, storing only one copy per unique hash in a table to eliminate redundancies and achieve up to 95% space reduction in backup scenarios.[46][47] Cryptographically, rainbow tables precompute chains of alternating hash and reduction functions to invert one-way functions like password hashes via a time-memory trade-off, substantially reducing online cracking cost at the price of large precomputed tables; however, their effectiveness is severely limited by salting and peppering in modern systems, leaving them primarily of educational interest for demonstrating why fast, unsalted hashes are unsafe for password storage. In contemporary big data processing, Apache Spark employs hash joins to merge datasets efficiently, building in-memory hash tables on the smaller relation's partitions for probing by the larger one, with broadcast variants distributing compact tables cluster-wide to minimize shuffling.[48] Distributed hash tables (DHTs), exemplified by Chord, scale hash table semantics across peer-to-peer networks by mapping keys to a circular identifier space via consistent hashing, where nodes maintain finger tables for O(log N) lookups and automatic load balancing under churn.[49] Introduced by Stoica et al. in 2001, Chord underpins decentralized storage in systems like file-sharing overlays, ensuring resilience with O(log N) stabilization after node joins or failures.[49]
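A minimal Python sketch of Zobrist hashing follows; the piece encoding, the fixed seed, and the apply_move helper are illustrative, and real engines also fold in side-to-move, castling, and en passant terms.
import random

PIECES = ["wp", "wn", "wb", "wr", "wq", "wk", "bp", "bn", "bb", "br", "bq", "bk"]
SQUARES = range(64)

# One random 64-bit value per (piece, square) pair, fixed for the lifetime of the engine.
random.seed(2024)   # any fixed seed works; real engines often hard-code the table
ZOBRIST = {(piece, sq): random.getrandbits(64) for piece in PIECES for sq in SQUARES}

def zobrist_hash(position):
    # position: iterable of (piece, square) pairs describing the board
    h = 0
    for piece, square in position:
        h ^= ZOBRIST[(piece, square)]
    return h

def apply_move(h, piece, from_sq, to_sq, captured=None):
    # Incremental update: XOR out the old placements, XOR in the new one.
    if captured is not None:
        h ^= ZOBRIST[(captured, to_sq)]
    h ^= ZOBRIST[(piece, from_sq)]
    h ^= ZOBRIST[(piece, to_sq)]
    return h

start = [("wk", 4), ("bk", 60), ("wr", 0)]
h0 = zobrist_hash(start)
h1 = apply_move(h0, "wr", 0, 8)   # rook a1 -> a2
assert h1 == zobrist_hash([("wk", 4), ("bk", 60), ("wr", 8)])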
Implementations
Pseudocode Examples
Hash table implementations typically involve an array of fixed size m, a hash function h mapping keys to indices 0 through m − 1, and maintenance of the load factor α = n/m (where n is the number of elements) below a threshold (e.g., 0.75 for chaining, 0.5 for open addressing) to ensure efficiency. Core operations include:
- Insert(key, value): Compute index = h(key) mod m; if the key exists, update its value; otherwise add a new entry and increment n; if α exceeds the threshold, resize.
- Search(key): Compute index = h(key) mod m; probe the bucket/slot until the key is found or the end of the chain/probe sequence is reached.
- Delete(key): Compute index = h(key) mod m; locate and remove the entry and decrement n; for open addressing, mark the slot as deleted (tombstone) to preserve probe chains.
Chaining Implementation
A basic hash table using chaining resolves collisions by storing multiple elements in linked lists at each bucket. The structure consists of an array of lists, initialized with a number of buckets m, typically a power of 2 for efficient resizing. The hash function is denoted h(key), mapping a key to an index between 0 and m − 1. Operations maintain a load factor α = n/m, where n is the number of elements, and resize when α exceeds a threshold like 0.75 to ensure amortized constant-time performance. The following pseudocode outlines the core class and operations for chaining, including insertion that checks for resizing.
class HashTable
list[] table // array of m lists
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.75
function HashTable(integer initial_m)
m = initial_m
n = 0
table = new list[m] // each entry is an empty list
for i = 0 to m-1
table[i] = empty list
end for
end function
function integer hash(key)
return h(key) mod m // stub for hash function
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
index = hash(key)
// Check if key exists and update value
for each entry in table[index]
if entry.key == key
entry.value = value
return
end if
end for
// Insert new entry at head of list
table[index].insert_front(new entry(key, value))
n = n + 1
end function
function search(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
return entry.value
end if
end for
return null // not found
end function
function delete(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
remove entry from table[index]
n = n - 1
return
end if
end for
// Key not found
end function
function resize(integer new_m)
old_table = table
old_m = m
m = new_m
n = 0
table = new list[m]
for i = 0 to m-1
table[i] = empty list
end for
for i = 0 to old_m - 1
for each entry in old_table[i]
insert(entry.key, entry.value) // reinsert using new hash
end for
end for
end function
Open Addressing Implementation
Open addressing stores elements directly in the array without lists, using probing sequences to resolve collisions. Linear probing, a common method, computes the i-th probe position as (h(key) + i) mod m, where i is the probe number starting at 0. Deletions use tombstones (entries marked as "deleted") to avoid breaking search chains, allowing probes to continue past them during searches while treating them as available slots for insertions. The table resizes similarly when the load exceeds a threshold, typically a lower one (e.g., 0.5) to mitigate clustering.[50] The pseudocode below shows the operations for linear-probing open addressing.
class OpenHashTable
entry[] table // array of m entries, each may be null, key-value pair, or deleted
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.5
function OpenHashTable(integer initial_m)
m = initial_m
n = 0
table = new entry[m] // all null initially
end function
function integer hash(key, integer i)
return (h(key) + i) mod m // linear probing stub
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
i = 0
first_free = -1 // first tombstone or empty slot seen along the probe sequence
repeat
j = hash(key, i)
if table[j] == null
if first_free == -1
first_free = j
end if
break // the key cannot appear beyond an empty slot
end if
if table[j] == deleted
if first_free == -1
first_free = j // reuse the tombstone, but keep probing for the key
end if
else if table[j].key == key
table[j].value = value // key already present: update in place
return
end if
i = i + 1
until i == m
if first_free == -1
error "table overflow"
end if
table[first_free] = new entry(key, value)
n = n + 1
end function
function search(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return null // not found, end of chain
end if
if table[j] != deleted and table[j].key == key
return table[j].value
end if
i = i + 1
until i == m
return null
end function
function delete(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return // not found
end if
if table[j] != deleted and table[j].key == key
table[j] = deleted // tombstone
n = n - 1
return
end if
i = i + 1
until i == m
end function
function resize(integer new_m)
old_table = table
m = new_m
n = 0
table = new entry[m] // all null
for each old_entry in old_table
if old_entry != null and old_entry != deleted
insert(old_entry.key, old_entry.value) // reinsert
end if
end for
end function
Language-Specific Considerations
In Java, the HashMap class implements a hash table using separate chaining for collision resolution, where each bucket contains a linked list of entries. The default load factor is set to 0.75, providing a balance between time and space efficiency by triggering resizing when the table reaches 75% capacity. Additionally, to mitigate performance degradation from long chains, HashMap converts linked lists to balanced binary search trees (red-black trees) when a chain exceeds a threshold of 8 elements, a feature introduced in Java 8 for improved worst-case lookup times. For concurrent environments, ConcurrentHashMap extends this design with fine-grained locking, supporting full concurrency for reads and adjustable concurrency for updates without blocking retrievals.
Python's built-in dict type (in the CPython implementation) employs open addressing with a perturbed probing strategy for collision resolution, implemented in C for efficiency. The probing sequence uses the recurrence index = ((5 * index) + 1 + perturb) % size, where perturb is derived from the key's hash and right-shifted by 5 bits each iteration to add randomness and reduce clustering.[51] Hash randomization, enabled by default since Python 3.3, seeds the hash function for strings and bytes with a random value at process startup to prevent hash-flooding denial-of-service attacks, making those hash values unpredictable across runs. The set type is essentially an optimized dict variant, using the same underlying hash table structure but storing only keys with dummy values, which allows for efficient membership testing and set operations.
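The probe recurrence described above can be sketched as a small Python generator; this is an illustration of the documented strategy rather than CPython's actual C implementation, and the function name is ad hoc.
def dict_style_probes(key_hash: int, table_size: int, perturb_shift: int = 5):
    # Yield candidate slots using the recurrence above; table_size must be a power of two.
    mask = table_size - 1
    perturb = key_hash & (2**64 - 1)   # treat the hash as unsigned, as the C code does
    index = perturb & mask
    while True:
        yield index
        perturb >>= perturb_shift                  # gradually mixes in higher hash bits
        index = (5 * index + 1 + perturb) & mask   # same as % size for power-of-two sizes

gen = dict_style_probes(hash("spam"), table_size=8)
print([next(gen) for _ in range(5)])   # first five slots the probe would visit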
In C++, the standard std::unordered_map utilizes separate chaining, maintaining per-bucket node chains (typically singly linked lists traversed via forward iterators) within buckets determined by a user-supplied or default hash function. It supports customizable hash policies via the Hash template parameter and equality predicates via KeyEqual, enabling adaptations for user-defined types. The reserve method preallocates buckets to accommodate a specified number of elements without exceeding the maximum load factor, avoiding intermediate rehashes during insertion and improving performance for known sizes.
Implementing hash tables in these languages often involves challenges such as providing custom hashers and equality predicates to ensure proper distribution and collision handling for non-standard key types, particularly in C++ where the standard requires hash functions to be consistent with equality. Move semantics, introduced in C++11, further complicate implementations by necessitating efficient transfer of ownership in containers like std::unordered_map, avoiding unnecessary copies during resizing or insertion.
Specialized libraries address these issues with alternative strategies; for instance, Google's dense_hash_map from the Sparsehash library uses open addressing with quadratic probing, optimizing for dense storage and faster lookups at the cost of higher space usage under low loads. Similarly, Abseil's SwissTable implementation in C++ employs open addressing with SIMD-accelerated lookups over a compact array of per-slot control bytes, offering superior performance over standard chaining in high-throughput scenarios.
Best practices for hash table usage include seeding hash functions with random values to defend against collision-based attacks like hash flooding, where adversaries craft inputs to degrade performance to O(n) per operation. Monitoring and maintaining load factors below about 0.75 ensures efficient operation, with proactive resizing or capacity reservation to prevent excessive collisions and rehashing overhead.