Hash table
Fundamentals
Definition and Operations
A hash table, also known as a hash map, is an associative array abstract data type that implements mappings from keys to values, storing key-value pairs in an array of buckets or slots, where a hash function computes the index for each key to enable efficient access.[4][5] The underlying storage is an array: a fixed-size collection of slots that accommodates the mapped entries.[6] Under the assumption of simple uniform hashing, the basic operations of insertion, search, and deletion achieve an average time complexity of O(1).[4] The insert operation adds a new key-value pair to the hash table. It computes the hash of the key to determine the target bucket index and places the pair into that bucket; if the key already exists, the value may be updated depending on the implementation.[6] Pseudocode for a basic insert, assuming no collisions for illustration, is as follows:
function insert(key, value):
    index = hash(key)            // maps the key to an array index
    table[index] = (key, value)
function search(key):
    index = hash(key)
    if table[index] is not empty and table[index].key == key:
        return table[index].value
    else:
        return not found
function delete(key):
    index = hash(key)
    if table[index] is not empty and table[index].key == key:
        remove table[index]
    else:
        do nothing               // key not present
Load Factor
The load factor of a hash table, denoted α, is defined as the ratio of the number of stored entries n to the number of buckets m, expressed as α = n/m. This metric quantifies the fullness of the table and serves as a key indicator for maintaining performance efficiency. Typical implementations set resizing thresholds around 0.7 to 0.75 to balance space usage and operation speed, as higher values risk excessive collisions.[7] A higher load factor increases the probability of collisions, which degrades search, insertion, and deletion performance by requiring more probes or comparisons.[8] In separate chaining, where collisions are handled by linking entries in buckets, the load factor directly determines the expected chain length, which averages α under the simple uniform hashing assumption.[9] Consequently, the expected cost of a successful search is roughly 1 + α/2 probes, as the initial bucket access is followed by traversing an average of α/2 elements in the chain.[8] In open addressing schemes, the load factor must remain strictly less than 1 to ensure available slots for insertions, since all entries occupy distinct buckets. For quadratic probing, performance notably degrades when α exceeds 0.5, as secondary clustering (where probes tend to revisit similar bucket sequences) intensifies, leading to longer probe sequences and potential insertion failures even before the table fills. To mitigate these effects, hash tables trigger resizing when the load factor surpasses a predefined threshold, typically by doubling the number of buckets, which approximately halves α.[7] This process assumes a uniform distribution of hash values to ensure collisions remain manageable post-resizing.
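To make this dependence concrete, the short Python sketch below (illustrative only, with ad hoc function names) evaluates the chaining estimate above alongside Knuth's linear-probing estimate discussed later in the article.
def chaining_successful_probes(alpha):
    # roughly one bucket access plus half the average chain length
    return 1 + alpha / 2

def linear_probing_unsuccessful_probes(alpha):
    # Knuth's classical estimate for linear probing
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  chaining ~{chaining_successful_probes(alpha):.2f} probes  "
          f"linear probing ~{linear_probing_unsuccessful_probes(alpha):.2f} probes")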
Hash Functions
Properties and Design Principles
A good hash function for use in hash tables must exhibit several key properties to ensure balanced distribution of keys and efficient operations. Primarily, it must be deterministic, meaning that identical input keys always produce the same output hash value, which is essential for consistent retrieval and storage.[4] Additionally, the function should achieve uniform distribution, mapping keys evenly across the range of possible hash values to minimize clustering in buckets and approximate the simple uniform hashing assumption.[4] Efficiency in computation is another critical property, as the hash function must evaluate quickly, often in constant time relative to key size, to avoid bottlenecking insert, search, and delete operations.[4] A desirable diffusion property is the avalanche effect, where even a single-bit change in the input causes approximately half the output bits to change, enhancing resistance to patterns in keys and improving overall mixing.[10] To provide theoretical guarantees on uniformity, especially when keys arrive in adversarial or non-random order, universal hashing employs a family of functions rather than a single one. A family H of hash functions from a key universe U to {0, 1, ..., m−1} (where m is the number of buckets) is universal if, for any distinct keys x ≠ y, the probability over a random choice of h ∈ H that h(x) = h(y) is at most 1/m.[11] This bound ensures that the expected number of collisions for any set of keys is close to that under ideal random hashing, with the probability of exceeding t times the expected number of collisions being less than 1/t for any t > 1.[11] In a simple probabilistic model, selecting h at random from such a family simulates independent uniform mapping of each key to buckets, yielding expected linear time for sequences of operations on chained hash tables when the load factor α is kept constant.[11] An explicit construction of a universal family, introduced by Carter and Wegman, uses affine transformations over a finite field. Let p be a prime larger than the key universe size and m the table size; the family consists of the functions h_{a,b}(k) = ((a·k + b) mod p) mod m, where a ∈ {1, ..., p−1} and b ∈ {0, ..., p−1} are chosen uniformly at random.[11]
function universal_hash(k, a, b, p, m):
    return ((a * k + b) % p) % m
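A minimal Python sketch of drawing one member of this family at random is given below; the choice of p as the Mersenne prime 2^61 − 1 and the helper name random_universal_hash are illustrative assumptions, not part of the construction itself.
import random

P = 2_305_843_009_213_693_951   # the Mersenne prime 2^61 - 1, assumed larger than any key

def random_universal_hash(m, rng=random):
    # Draw h_{a,b}(k) = ((a*k + b) mod p) mod m with a in {1..p-1}, b in {0..p-1}.
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

h = random_universal_hash(m=1024)
print(h(42), h(43))   # any two distinct keys collide with probability about 1/m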
Common Techniques
Common techniques for hash functions focus on transforming keys into table indices efficiently while promoting uniformity. For integer keys, several methods have been developed to achieve this.
Integer Hashing
The division method is a straightforward approach for integer keys, defined as h(k) = k mod m, where k is the key and m is the table size, producing a remainder in the range [0, m−1].[14] This method assumes keys are drawn from a universe of integers, typically from 0 to U−1 where U >> m, and derives uniformity from choosing m carefully with respect to the key distribution (e.g., avoiding m equal to a power of 2, which would use only the low-order bits of the key), so that the remainders approximate a uniform distribution over the slots.[14] For example, with m = 10 and k = 27, h(27) = 27 mod 10 = 7; for k = 123, h(123) = 123 mod 10 = 3.[14] The multiplication method improves on this by leveraging fractional parts for better distribution, given by h(k) = ⌊m · (k·A mod 1)⌋, where A is a constant between 0 and 1 and k·A mod 1 denotes the fractional part of k·A.[14] Knuth recommends A ≈ (√5 − 1)/2 ≈ 0.6180339887, the reciprocal of the golden ratio, as it yields low-discrepancy sequences that enhance uniformity regardless of key distribution patterns.[14] The derivation of uniformity stems from the irrationality of A, ensuring that the sequence of fractional parts k·A mod 1 is dense and equidistributed in [0, 1) for sequential integer k, thus mapping evenly to [0, m−1] after scaling and flooring.[14] For m = 10, A ≈ 0.618, and k = 27, first compute 27 × 0.618 ≈ 16.686, take the fractional part 0.686, then h(27) = ⌊10 × 0.686⌋ = 6; for k = 123, 123 × A ≈ 76.018, fractional part 0.018, h(123) = ⌊10 × 0.018⌋ = 0.[14] As a historical example, the mid-square method squares the key and extracts the middle digits to form the hash value, suitable for fixed-length integer representations.[15] For a key with d digits, square it to get up to 2d digits and take the middle d digits as h(k), reducing mod m if needed.[15] This was an early technique but often produces poor uniformity due to clustering in squared values.[14] For m = 10 and k = 27 (squared = 729, padded to four digits as 0729, middle digits 72), h(27) = 72 mod 10 = 2; for k = 123 (squared = 15129, padded to 015129, middle digits 151), h(123) = 151 mod 10 = 1.[15] These integer methods typically assume keys belong to a universe of non-negative integers from 0 to U−1, with U much larger than m to ensure probabilistic uniformity.[14]
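The following Python sketch reproduces the worked examples above for the division, multiplication, and mid-square methods; the function names are ad hoc, and the mid-square padding convention (pad the square to twice the key's digit count) is one of several found in practice.
import math

def division_hash(k, m):
    return k % m

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0                      # fractional part of k*A
    return int(m * frac)

def mid_square_hash(k, m):
    d = len(str(k))                           # number of digits in the key
    squared = str(k * k).zfill(2 * d)         # pad the square to 2d digits
    start = (len(squared) - d) // 2
    return int(squared[start:start + d]) % m  # middle d digits, then mod m

for k in (27, 123):
    # reproduces the text: 27 -> 7, 6, 2 and 123 -> 3, 0, 1
    print(k, division_hash(k, 10), multiplication_hash(k, 10), mid_square_hash(k, 10))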
String Hashing
For string keys, the polynomial rolling hash treats the string as a base-b number modulo a prime p, computed as h(s) = (s_0·b^(n−1) + s_1·b^(n−2) + ... + s_(n−1)·b^0) mod p, where the s_i are character codes (e.g., ASCII), n is the string length, b is the base, and p is a large prime.[16] The base b is chosen as a small prime like 31 or 131 to balance computation and avalanche effects, while p (e.g., a 64-bit prime) is selected large to keep the collision probability for a pair of random strings near 1/p, with birthday-paradox effects governing collisions among many strings.[16] This formulation, used in the Rabin-Karp algorithm, enables efficient sliding-window updates for substring hashing by subtracting the contribution of the dropped character, multiplying by b, and adding the new character, all mod p.[16] For example, with the string "abc" (a = 97, b = 98, c = 99), base b = 31, p = 101, and n = 3: h("abc") = (97·31² + 98·31 + 99) mod 101 = (93217 + 3038 + 99) mod 101 = 96354 mod 101 = 0.[16]
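A hedged Python sketch of this scheme follows, using the same small parameters as the worked example (b = 31, p = 101); production code would use a much larger prime and typically 64-bit arithmetic.
B, P = 31, 101   # small base and prime from the worked example

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:                  # Horner's rule: s_0*b^(n-1) + ... + s_(n-1)
        h = (h * B + ord(ch)) % P
    return h

def roll(prev_hash: int, dropped: str, added: str, window: int) -> int:
    # Slide a fixed-length window one character to the right, as in Rabin-Karp.
    prev_hash = (prev_hash - ord(dropped) * pow(B, window - 1, P)) % P
    return (prev_hash * B + ord(added)) % P

print(poly_hash("abc"))                              # 0, matching the example above
assert roll(poly_hash("abc"), "a", "d", 3) == poly_hash("bcd")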
Hybrid Approaches
For complex keys like memory addresses, hybrid methods combine techniques such as XOR folding, which divides the key into fixed-width parts and XORs them to produce a compact value before applying another hash like division.[17] This is useful for network addresses, folding octets via successive XOR to mix bits evenly without overflow issues.[17] For a 32-bit address 0x1A2B3C4D with m = 10, fold into 16-bit halves (0x1A2B ^ 0x3C4D = 0x2666), then fold into bytes (0x26 ^ 0x66 = 0x40), and h = 0x40 mod 10 = 4; for 0x11223344, 0x1122 ^ 0x3344 = 0x2266, then 0x22 ^ 0x66 = 0x44, and h = 0x44 mod 10 = 8.[17]
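A small Python sketch of this folding scheme, reproducing the two worked values above, might look as follows (the function name is ad hoc):
def xor_fold_hash(addr32: int, m: int) -> int:
    high16, low16 = (addr32 >> 16) & 0xFFFF, addr32 & 0xFFFF
    folded16 = high16 ^ low16                       # fold 32 bits down to 16
    folded8 = ((folded16 >> 8) ^ folded16) & 0xFF   # fold 16 bits down to 8
    return folded8 % m                              # finish with the division method

print(xor_fold_hash(0x1A2B3C4D, 10))   # 4
print(xor_fold_hash(0x11223344, 10))   # 8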
Collision Resolution
Separate Chaining
Separate chaining resolves collisions in hash tables by associating each bucket with a linked list (or chain) that stores all keys hashing to that index. When inserting a key that collides with existing entries, it is appended to the corresponding chain, allowing multiple elements per bucket without requiring alternative slots. This approach maintains the hash table's array structure while deferring collision handling to the auxiliary lists, which grow dynamically as needed. The primary operations (insertion, search, and deletion) in separate chaining rely on computing the hash index and then traversing the chain linearly. For insertion, the key is typically added at the head of the list for constant-time access, though appending to the end preserves order if desired (requiring a tail pointer for efficiency). Search and deletion involve scanning the chain from the head until the key is found or the end is reached; deletion requires updating the next pointers to remove the node while preserving the list. Pseudocode for these operations, assuming a hash table T stored as an array of linked lists and a hash function h, is as follows:
Insertion:
CHAINED-INSERT(T, k)
    x = create new node with key k and value (if applicable)
    i = h(k) mod m    // m is the number of buckets
    x.next = T[i].head
    T[i].head = x
CHAINED-SEARCH(T, k)
    i = h(k) mod m
    x = T[i].head
    while x ≠ null and x.key ≠ k
        x = x.next
    return x    // null if not found
CHAINED-DELETE(T, k)
    i = h(k) mod m
    x = T[i].head
    if x = null
        return    // empty bucket, key not present
    if x.key = k
        T[i].head = x.next
        free x
        return
    y = x
    x = x.next
    while x ≠ null and x.key ≠ k
        y = x
        x = x.next
    if x ≠ null
        y.next = x.next
        free x
Open Addressing
Open addressing, also known as closed hashing, resolves collisions in hash tables by storing all elements directly within a fixed-size array, without using auxiliary data structures like linked lists. Upon inserting a key k, the initial position is computed as h(k, 0) = h'(k) mod m, where m is the array size and h' is the hash function; if that slot is occupied, a probe sequence h(k, i) for i = 1, 2, ... is followed until an empty slot is found or the key is matched during search or deletion. This approach assumes simple uniform hashing and keeps the load factor α = n/m below 1, where n is the number of elements, to ensure termination with high probability.[20]
Deletion in open addressing cannot simply empty a slot, as this would disrupt probe sequences for other elements that might traverse it; instead, a special marker called a tombstone is placed, indicating the slot is available for future insertions but treated as occupied during searches to preserve chain integrity. Tombstones accumulate over time, effectively increasing the load factor and necessitating periodic rehashing to maintain performance.[20]
Linear probing, the earliest and simplest variant, defines the probe sequence as h(k, i) = (h'(k) + i) mod m. Introduced by Peterson in 1957 for random-access storage systems, it promotes sequential access but suffers from primary clustering: insertions near an occupied run of slots tend to extend the cluster, leading to longer average probe lengths. For instance, if keys hashing to indices 10, 11, and 12 are inserted into a table of size 20, a new key hashing to 10 will probe sequentially through 10, 11, and 12 before finding a free slot, exacerbating delays for nearby hashes. Knuth analyzed this in detail, showing that the average number of probes for an unsuccessful search approximates (1/2)(1 + 1/(1 − α)²).
Quadratic probing mitigates primary clustering by using a quadratic offset: h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m, where the constants c₁ and c₂ are chosen to ensure good distribution, such as c₁ = c₂ = 1/2 (equivalently, probing with offsets (i² + i)/2 in integer arithmetic). This produces non-linear jumps, reducing the tendency for probes to fill contiguous blocks, though it introduces secondary clustering: keys sharing the same initial index h'(k) follow identical probe sequences. If m is prime, the first ⌈m/2⌉ probes are distinct, guaranteeing coverage of half the table before repetition. Knuth discusses its analysis, noting it performs comparably to linear probing at low loads but requires careful parameter selection to avoid permutation issues.
Double hashing further improves distribution by employing two independent hash functions: h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₂(k) provides a variable step size, commonly set as h₂(k) = 1 + (k mod (m − 1)) to ensure it is positive and less than m. This method approximates uniform probing, minimizing both primary and secondary clustering by making probe sequences key-dependent and pseudo-random. Introduced in Knuth's work and analyzed in subsequent papers, double hashing achieves near-ideal performance up to load factors around 0.8, with the average number of probes for an unsuccessful search close to 1/(1 − α).[21]
Open addressing variants excel in cache efficiency due to localized, predictable memory accesses, outperforming chaining on modern hardware for small-to-medium tables. However, they demand load factors below roughly 0.7 to 0.8 to keep probe lengths reasonable; beyond this, clustering causes sharp degradation, with unsuccessful searches under linear probing averaging up to (1/2)(1 + 1/(1 − α)²) probes. Deletion via tombstones adds overhead, as they occupy space without utility, often requiring compaction or resizing to control the effective load. In contrast to separate chaining, open addressing avoids pointer overhead but imposes stricter capacity limits.[20]
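The three probe sequences can be sketched compactly in Python; the helper names and toy parameters below are illustrative, and the inputs h1 and k stand in for the outputs of real hash functions.
def linear_probe(h1, m, probes):
    return [(h1 + i) % m for i in range(probes)]

def quadratic_probe(h1, m, probes):
    # c1 = c2 = 1/2, i.e. offsets (i^2 + i)/2 in integer arithmetic
    return [(h1 + (i * i + i) // 2) % m for i in range(probes)]

def double_hash_probe(k, m, probes):
    h1 = k % m
    h2 = 1 + (k % (m - 1))        # step size, always in [1, m-1]
    return [(h1 + i * h2) % m for i in range(probes)]

m, k = 11, 47
print(linear_probe(k % m, m, 5))       # [3, 4, 5, 6, 7]
print(quadratic_probe(k % m, m, 5))    # [3, 4, 6, 9, 2]
print(double_hash_probe(k, m, 5))      # step 8 -> [3, 0, 8, 5, 2]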
Resizing Strategies
Standard Resizing
In standard hash table resizing, the process is triggered when the load factor α, defined as the ratio of the number of elements n to the table size m (α = n/m), exceeds a maximum threshold α_max, typically set between 0.5 and 1 to maintain performance.[22][23] Upon exceeding this threshold, the table size m is doubled to 2m, and all existing n elements are reinserted into the new table using the hash function adjusted for the updated size, often by recomputing h(k) mod 2m where h is the same underlying hash function.[22][23] This rehashing step ensures the load factor returns to approximately α_max / 2, preserving the uniform distribution of keys under the assumption of a good hash function.[23] The resizing operation incurs a temporary cost of Θ(n + m) time and additional space for the new table, as every element must be rehashed and reinserted, which can lead to pauses in performance during the rebuild.[23] However, by doubling the size each time, the frequency of resizes decreases geometrically (each subsequent resize handles roughly twice as many elements before the next trigger), resulting in an amortized O(1) cost per insertion or deletion operation when analyzed using the potential method.[22][23] For a sequence of n insertions starting from an empty table, the total rehashing cost sums to O(n), as the series of resize costs (n + n/2 + n/4 + ...) converges to less than 2n operations.[22] To prevent excessive space usage when the table becomes underutilized, shrinking is performed by halving the table size m to m/2 when the load factor falls below a minimum threshold, such as α_max / 4 (e.g., 0.25 if α_max = 1), followed by rehashing all elements into the smaller table.[22] This threshold avoids oscillation between frequent grows and shrinks, maintaining efficiency while reclaiming memory.[22] The standard resizing approach applies to both separate chaining and open addressing implementations, as the rehashing process rebuilds the underlying structure regardless of collision resolution method.[23] The following pseudocode illustrates a resize-integrated insert operation for a separate chaining hash table, where resizing is checked on each insertion (adaptable to open addressing by modifying the insert step):
function INSERT(key, value):
    if n / m > α_max:
        RESIZE(2 * m)
    index = HASH(key) mod m
    // For separate chaining: prepend to the chain at T[index]
    new_node = new Node(key, value)
    new_node.next = T[index]
    T[index] = new_node
    n = n + 1

function RESIZE(new_m):
    old_T = T
    old_m = m
    m = new_m
    T = new array of size m    // Initialize empty chains
    for i = 0 to old_m - 1:
        current = old_T[i]
        while current != null:
            index = HASH(current.key) mod m
            temp = current.next
            current.next = T[index]
            T[index] = current
            current = temp
    delete old_T
    // For shrinking, call RESIZE(m / 2) in DELETE when n / m < α_min
Incremental Approaches
Incremental approaches to hash table resizing aim to mitigate the performance pauses associated with standard doubling or halving operations by distributing the rehashing workload over multiple insertions or operations, enabling gradual adaptation to changing data volumes.[25] These methods are particularly valuable in environments requiring low-latency responses, such as databases and distributed systems, where full rehashing could introduce unacceptable delays.[26]
Linear Hashing
Linear hashing, introduced by Witold Litwin in 1980, facilitates dynamic growth by splitting buckets incrementally using a split pointer, avoiding the need for a complete table reorganization.[25] The algorithm employs a family of hash functions defined as h_i(k) = h(k) mod (2^i · N), where N is the initial number of buckets, i is the current level (starting at 0), and h is a base hash function producing values much larger than the table size.[26] A split pointer, denoted Next, indicates the next bucket to split and advances linearly through the table. During insertion, the key is hashed using h_level to locate the primary bucket; if the computed index is less than Next, that bucket has already been split in the current round, and the key is rehashed with h_{level+1}. If the bucket has space, the key is inserted directly; otherwise, an overflow page is chained to it temporarily.[25] To resolve the overflow, the bucket at the Next position is split: its contents are rehashed using h_{level+1}, distributing records between the original bucket and a newly allocated one, after which Next increments by 1.[26] This process continues per insertion as needed, with one split per overflow event, maintaining a near-constant load factor (up to 90% for file-oriented implementations).[25] When Next reaches the end of the current range (2^level · N buckets), the level increments, Next resets to 0, and a new round of splits begins, effectively doubling the address space over time without batch reprocessing.[26] Overflow handling relies on chaining pages, but the round-robin splitting prevents long chains by proactively redistributing load.[25]
Example: Step-by-Step Insertion with Split in Linear Hashing
Consider an initial table with N = 2 buckets (indices 0 and 1), level 0, Next = 0, and h(k) = k for simplicity (actual implementations use a strong hash). Assume each bucket holds a single record, so a second key in a bucket forces an overflow.[26]
- Insert 10: h_0(10) = 10 mod 2 = 0; bucket 0 is empty, so insert into bucket 0 (now: [10]).
- Insert 12: h_0(12) = 12 mod 2 = 0; bucket 0 is full, so chain an overflow page to it (now: [10 | 12]). The overflow triggers a split at Next = 0.
- Split bucket 0 using h_1(k) = k mod 4: rehashing its contents sends 10 (10 mod 4 = 2) to the newly allocated bucket 2, while 12 (12 mod 4 = 0) stays in bucket 0. Post-split: bucket 0: [12], bucket 2: [10], the overflow page is freed, and Next = 1.
- Subsequent operations proceed similarly: a lookup computes h_0(k) and, if the result is less than Next, recomputes with h_1(k), so keys relocated by earlier splits are still found. This illustrates one split per overflow, spreading the resizing cost across insertions.[26] A compact code sketch of this addressing and splitting scheme follows.
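The sketch below is a minimal in-memory Python rendering of this logic, under the simplifying assumptions that buckets are plain lists, overflow records simply stay in the same list, and Python's hash of a small integer equals the integer itself; the class and constant names are ad hoc.
CAPACITY = 2   # stands in for a disk page size

class LinearHashTable:
    def __init__(self, initial_buckets=2):
        self.N = initial_buckets          # initial number of buckets
        self.level = 0                    # current round number
        self.next_split = 0               # split pointer ("Next")
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        # h_level(k) = h(k) mod (2^level * N); rehash with h_{level+1} if the
        # bucket has already been split in this round.
        idx = hash(key) % (self.N * (1 << self.level))
        if idx < self.next_split:
            idx = hash(key) % (self.N * (1 << (self.level + 1)))
        return idx

    def insert(self, key, value):
        bucket = self.buckets[self._address(key)]
        bucket.append((key, value))
        if len(bucket) > CAPACITY:        # overflow: split the bucket at Next
            self._split()

    def lookup(self, key):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None

    def _split(self):
        old = self.buckets[self.next_split]
        self.buckets.append([])           # new image bucket at index N*2^level + Next
        new_modulus = self.N * (1 << (self.level + 1))
        stay, move = [], []
        for k, v in old:
            (stay if hash(k) % new_modulus == self.next_split else move).append((k, v))
        self.buckets[self.next_split] = stay
        self.buckets[-1] = move
        self.next_split += 1
        if self.next_split == self.N * (1 << self.level):   # round complete
            self.level += 1
            self.next_split = 0

lht = LinearHashTable()
for k in (10, 12, 23, 39, 48, 51):
    lht.insert(k, str(k))
print(len(lht.buckets), lht.lookup(39))   # bucket count grows one split at a time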
Extendible Hashing
Extendible hashing, proposed by Ronald Fagin et al. in 1979, uses a directory of pointers to bucket pages, enabling dynamic resizing through selective directory expansion without rehashing the entire dataset. The directory is an array of size 2^d, where d is the global depth, representing the number of high-order bits of the hash value used to index it. Each bucket has a local depth d' ≤ d, indicating how many of those bits uniquely identify the records within it.[26] For insertion, the hash provides the binary address; the first d bits index the directory to find the bucket pointer. If space exists, the record is inserted; if the bucket is full, it is split by incrementing its local depth to d' + 1, creating a new bucket for the records whose additional bit differs, and redistributing records based on that bit. If the new local depth would exceed the global depth, the directory is doubled (the global depth increments), duplicating pointers to maintain the mappings. Multiple directory entries may point to the same bucket as long as its local depth is lower than the global depth, which also allows coalescing during contraction.[26] This approach guarantees at most two disk accesses for lookups and confines rehashing to the split bucket's records, making it suitable for disk-based systems.
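A minimal in-memory Python sketch of the directory-and-split mechanism is shown below; it indexes the directory by the low-order bits of the hash rather than the high-order bits described above (the two conventions are symmetric), and the class names, bucket capacity, and MD5-based mixer are illustrative assumptions.
import hashlib

def _h(key) -> int:
    # A stable illustrative hash; its low-order bits index the directory below.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendibleHash:
    BUCKET_CAPACITY = 4                        # stands in for a disk page size

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # low-order global_depth bits of the hash select the directory entry
        return _h(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return self.directory[self._dir_index(key)].items.get(key)

    def insert(self, key, value):
        bucket = self.directory[self._dir_index(key)]
        if key in bucket.items or len(bucket.items) < self.BUCKET_CAPACITY:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)                # retry; terminates once a distinguishing bit is found

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory   # double the directory, duplicating pointers
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)    # newly significant bit
        for k in list(bucket.items):
            if _h(k) & bit:
                new_bucket.items[k] = bucket.items.pop(k)
        for i, b in enumerate(self.directory): # repoint entries whose index has that bit set
            if b is bucket and (i & bit):
                self.directory[i] = new_bucket

eh = ExtendibleHash()
for k in range(20):
    eh.insert(k, k * k)
print(eh.global_depth, eh.lookup(7))   # the directory has deepened; lookup(7) returns 49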
Consistent Hashing
Consistent hashing, developed by David Karger et al. in 1997, addresses incremental resizing in distributed hash tables by minimizing key migrations when nodes are added or removed.[29] Keys and nodes are mapped via independent hash functions to points on a fixed circle (e.g., the unit interval [0, 1) or the integers modulo 2^32), with each key assigned to the nearest node encountered moving clockwise from the key's point.[29] To improve load balance, each physical node is represented by multiple virtual nodes (typically on the order of log N copies, where N is the number of physical nodes), hashed separately to create evenly spaced points.[29] Upon adding or removing a node, only the keys in the affected arcs (between the joining or departing node's virtual points and their predecessors) are reassigned, an expected fraction of roughly 1/N of all keys.[29] This ensures smooth load distribution, with the maximum load staying within a small constant factor of the ideal share when enough virtual nodes are used, and with spread (the number of nodes holding copies of a key across client views) bounded logarithmically.[29] Deletions follow symmetrically, reassigning keys to their successors. These incremental methods offer lower latency during resizes for large-scale tables compared to batch rehashing, as costs are amortized over operations, but they introduce greater implementation complexity due to pointers, levels, and virtual mappings.[25][29] They are especially suitable for database indexing and distributed caching, where predictable latency is critical.[26]
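A small Python sketch of a consistent-hashing ring with virtual nodes follows; the class name HashRing, the use of MD5 as a point generator, and the node labels are illustrative assumptions rather than part of the original scheme.
import bisect
import hashlib

def _point(label: str) -> int:
    # Map a string to a point on the ring [0, 2^32); MD5 is just a stable mixer here.
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % (1 << 32)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes       # virtual nodes per physical node, smoothing the load
        self._points = []          # sorted ring positions
        self._owners = {}          # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            bisect.insort(self._points, p)
            self._owners[p] = node

    def remove_node(self, node):
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            self._points.remove(p)
            del self._owners[p]

    def node_for(self, key):
        # Walk clockwise: first ring point at or after the key's point, wrapping around.
        p = _point(str(key))
        idx = bisect.bisect_left(self._points, p) % len(self._points)
        return self._owners[self._points[idx]]

# Only keys in the arcs claimed by the new node's virtual points move when it joins.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in range(10000)}
ring.add_node("cache-d")
moved = sum(1 for k in range(10000) if ring.node_for(k) != before[k])
print(f"{moved / 10000:.1%} of keys moved")   # roughly a quarter, as expected with 4 nodes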
Performance Analysis
Time Complexity
The time complexity of hash table operations is analyzed under the simple uniform hashing assumption, where each key is equally likely to hash to any slot independently of other keys. For both separate chaining and open addressing, the average-case time complexity for insertion, search, and deletion is O(1) when the load factor α = n/m (with n keys and table size m) is held constant. On modern CPUs, these operations achieve practical latencies dominated by cache access times, with L1 cache hits around 1 ns and higher cache levels up to about 4 ns, enabling lookups as low as 2 ns and iterations or probes in 2–3 ns under ideal cache-resident conditions and minimal collisions.[30] In separate chaining, the expected length of each chain is α, leading to expected Θ(1 + α) time for both unsuccessful and successful searches; with constant α, both simplify to O(1). For open addressing with linear probing, the expected number of probes for an unsuccessful search is approximately (1/2)(1 + 1/(1 − α)²), while for a successful search it is approximately (1/2)(1 + 1/(1 − α)); again, constant α yields O(1) average time.[20] Amortized analysis accounts for resizing, which doubles the table size when α exceeds a threshold (e.g., 0.5–1.0) to maintain performance. Over a sequence of n insertions starting from an empty table, the total time is O(n) because each key is moved a constant expected number of times during resizes, yielding amortized O(1) time per operation via the aggregate method.[20] In the worst case, all keys may hash to the same slot, resulting in Θ(n) time per operation due to linear scanning of the chain or probe sequence, exacerbated by poor hash functions or adversarial inputs that defeat uniformity. In 2025, research showed that static open-addressing hash tables can achieve sublinear expected search times without element reordering during construction, improving upon classical worst-case bounds under certain conditions.[20][31] A key space-time trade-off arises from the load factor: lower α (achieved by increasing m) reduces the expected number of probes and thus average time, but at the cost of higher space usage, as the table holds more empty slots.[20] Certain variants improve worst-case guarantees. Replacing the linked lists in separate chaining with balanced binary search trees yields O(log n) worst-case time for operations, though with higher constant factors. Robin Hood hashing, an open-addressing technique, balances probe lengths by displacing an already-placed key whose probe distance is shorter than that of the key being inserted, reducing maximum probe sequences and improving practical worst-case performance without changing the asymptotic bounds.[20][32]
Space Efficiency and Trade-offs
Hash tables achieve linear space complexity in the number of elements n, denoted O(n), but incur overhead that elevates the constant factor beyond 1. This overhead arises from the allocation of m buckets, where m is proportional to n (roughly n/α for a target load factor α), plus storage for keys and values; typically, each bucket requires space for a pointer or fixed-size entry, leading to m times the size of a pointer or entry for the table structure alone.[33] In separate chaining, collisions are resolved by linking elements in lists, adding one pointer per element beyond the initial bucket pointers, which can increase space usage by approximately 50% or more depending on the load factor, as each node in the chain requires additional memory for links. Open addressing, by contrast, avoids these per-element pointers by probing within the array but wastes slots due to clustering, necessitating a larger m to maintain performance and thus higher overall space for the same n. This pointer overhead in chaining precludes full space efficiency in stable hash tables, while open addressing trades density for simplicity but still leaves allocated space underutilized at high loads.[34][33] The load factor α critically influences this space usage, with values of 0.5 to 0.7 recommended for open addressing to balance probe lengths and avoid excessive clustering, resulting in a space multiplier of 1/α (about 1.4 to 2 times the minimum number of slots needed). For separate chaining, higher loads up to 0.7 or 0.8 are feasible with minimal space penalty beyond pointers, as chains grow linearly without wasted slots, though performance degrades if α significantly exceeds 1. These ranges ensure space efficiency without disproportionate time costs from collisions.[35] Compression techniques further mitigate overhead in space-constrained settings. Using power-of-two table sizes enables fast modulo operations via bit masking (e.g., index = h(k) & (m − 1)), avoiding costly division while maintaining even distribution and allowing efficient resizing by factors of 2. Bit-packing of keys and fingerprints in compact hashing schemes reduces per-entry size by storing multiple elements per word or using succinct representations, achieving near-optimal space (close to n times the key size) while supporting constant-time access in expectation.[33] In distributed systems, consistent hashing introduces space trade-offs for availability and load balancing; virtual nodes or explicit replication assign multiple copies of data across nodes to handle failures or hotspots, increasing total storage by a factor equal to the replication degree (often 3 or more) at the cost of reduced per-node efficiency. This ensures minimal remapping on node changes but elevates overall space usage compared to non-replicated single-node hash tables. Compared to balanced binary search trees, which also require O(n) space due to per-node pointers and keys, hash tables exhibit a constant factor driven by load-factor underutilization and collision structures, typically 1.5 to 2 times the raw element storage, versus about 3 times for trees (two pointers per node plus key and value). However, hash tables provide average O(1) time without the O(log n) scaling of tree-based structures, making them preferable when average-case performance and density are prioritized over worst-case guarantees.
Applications
Core Data Structures
Hash tables provide an efficient implementation for associative arrays, also known as maps or dictionaries, which store pairs of unique keys and associated values. A hash function maps each key to an array index, allowing average constant-time access, insertion, and deletion operations by storing or retrieving values directly from the computed slot, subject to collision resolution.[36][37] Hash sets, a keys-only variant of maps, leverage hash tables by storing elements as keys with implicit or dummy values, enforcing uniqueness since identical keys result in overwrites rather than duplicates. This structure supports efficient membership testing in average O(1) time, while operations like union and intersection can be computed by hashing elements from one set into a temporary table and checking overlaps with another.[37][38] For ordered variants, hash tables can be combined with balanced search trees, such as treaps, either by using trees within overflow buckets or augmenting the overall structure to maintain sorted order alongside hashed indexing, enabling both fast lookups and ordered traversal.[39][40] Unordered collections like Python's built-in dict and set rely on hash tables for their core implementation, where keys are hashed to enable rapid access; set does not preserve insertion order, while dict has preserved insertion order since Python 3.7.[41][38]
A key limitation of hash table-based structures is the absence of inherent ordering, requiring separate mechanisms for sorted access, and the prohibition of duplicates, as the mapping enforces unique keys by design.[36][37]
Specialized Uses
Hash tables find specialized application in database indexing, where they support rapid equality-based queries. Hash indexes map keys to data locations using a hash function, enabling average O(1) lookup times for exact matches, which is particularly advantageous in scenarios like point queries in relational databases.[42] In contrast to ordered structures like B-trees, which facilitate range scans and inequality operations through their sorted leaf nodes, hash indexes do not preserve order and thus perform poorly on range queries, making B-trees the standard for versatile indexing needs.[42] Extendible hashing addresses the limitations of static hashing by dynamically adjusting bucket sizes via a directory that points to data pages based on hash prefixes, allowing the structure to grow or shrink without full reorganization as the database expands.[43] This technique, introduced by Fagin, Nievergelt, Pippenger, and Strong in 1979, is employed in filesystems and databases to maintain performance under varying loads, with directory splits occurring only when overflows demand deeper prefix resolution.[43] In caching mechanisms, hash tables provide the foundation for constant-time key access, often combined with other structures for eviction policies. Least Recently Used (LRU) caches integrate a hash table for O(1) lookups with a doubly-linked list to track usage order, where accessed items are moved to the front and the tail item is evicted upon capacity limits, achieving amortized O(1) operations overall. Bloom filters extend hash-based sets probabilistically, using multiple independent hash functions to set bits in a bit array for membership representation; they offer space savings over traditional hash sets by tolerating a tunable false-positive rate (typically 1-10%) while guaranteeing no false negatives, making them ideal for applications like duplicate detection in large streams where exact storage is impractical. Transposition tables in game artificial intelligence, such as chess engines, leverage hash tables to cache search results and avoid redundant computations for equivalent board states. These tables store evaluation scores, depths, and best moves keyed by position hashes, with collisions resolved by depth or exact-match checks to prune search trees.
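As an illustration of the Bloom filter idea described above, the following Python sketch sizes the bit array with the standard formulas and derives its k indices from a single SHA-256 digest (a Kirsch-Mitzenmacher-style shortcut); the class and method names are ad hoc.
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m/n) ln 2 hash functions.
        n, p = expected_items, false_positive_rate
        self.m = max(1, int(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, int(round((self.m / n) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k indices from two base hashes taken out of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("bob@example.com"))     # False with high probability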
Zobrist hashing, a seminal method for generating transposition-table keys, assigns random 64-bit values to each piece-type-square combination and XORs them across the board, producing a near-unique signature with low collision probability (about 2^{-64} per pair of positions) that updates incrementally with moves.[44] Developed by Albert Zobrist in 1970, this approach significantly accelerates alpha-beta search by detecting transpositions, i.e., positions reached via different move sequences.[44] Beyond these, hash tables underpin symbol tables in compilers, mapping identifiers to attributes like types, scopes, and memory offsets for efficient semantic analysis and code generation across multiple compilation phases.[45] In storage systems, they facilitate data deduplication by computing cryptographic hashes (e.g., SHA-256) of file chunks as unique fingerprints, storing only one copy per unique hash in a table to eliminate redundancies and achieve up to 95% space reduction in backup scenarios.[46][47] Cryptographically, rainbow tables precompute chains of alternating hash and reduction functions to invert one-way functions like password hashes via a time-memory trade-off, substantially reducing online cracking cost at the price of large precomputed tables; however, their effectiveness is severely limited by salting and peppering in modern systems, leaving them primarily of educational interest for demonstrating why fast, unsalted hashes are unsafe for password storage. In contemporary big data processing, Apache Spark employs hash joins to merge datasets efficiently, building in-memory hash tables on the smaller relation's partitions for probing by the larger one, with broadcast variants distributing compact tables cluster-wide to minimize shuffling.[48] Distributed hash tables (DHTs), exemplified by Chord, scale hash table semantics across peer-to-peer networks by mapping keys to a circular identifier space via consistent hashing, where nodes maintain finger tables for O(log N) lookups and automatic load balancing under churn.[49] Introduced by Stoica et al. in 2001, Chord underpins decentralized storage in systems like file-sharing overlays, ensuring resilience with O(log N) stabilization after node joins or failures.[49]
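A minimal Python sketch of Zobrist hashing follows; the piece encoding, the fixed seed, and the apply_move helper are illustrative, and real engines also fold in side-to-move, castling, and en passant terms.
import random

PIECES = ["wp", "wn", "wb", "wr", "wq", "wk", "bp", "bn", "bb", "br", "bq", "bk"]
SQUARES = range(64)

# One random 64-bit value per (piece, square) pair, fixed for the lifetime of the engine.
random.seed(2024)   # any fixed seed works; real engines often hard-code the table
ZOBRIST = {(piece, sq): random.getrandbits(64) for piece in PIECES for sq in SQUARES}

def zobrist_hash(position):
    # position: iterable of (piece, square) pairs describing the board
    h = 0
    for piece, square in position:
        h ^= ZOBRIST[(piece, square)]
    return h

def apply_move(h, piece, from_sq, to_sq, captured=None):
    # Incremental update: XOR out the old placements, XOR in the new one.
    if captured is not None:
        h ^= ZOBRIST[(captured, to_sq)]
    h ^= ZOBRIST[(piece, from_sq)]
    h ^= ZOBRIST[(piece, to_sq)]
    return h

start = [("wk", 4), ("bk", 60), ("wr", 0)]
h0 = zobrist_hash(start)
h1 = apply_move(h0, "wr", 0, 8)   # rook a1 -> a2
assert h1 == zobrist_hash([("wk", 4), ("bk", 60), ("wr", 8)])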
Implementations
Pseudocode Examples
Hash table implementations typically involve an array of fixed size m, a hash function h mapping keys to indices 0 through m − 1, and maintenance of the load factor α = n/m (where n is the number of elements) below a threshold (e.g., 0.75 for chaining, 0.5 for open addressing) to ensure efficiency. Core operations include:
- Insert(key, value): Compute index = h(key) mod m; if the key exists, update its value; otherwise add a new entry and increment n; if α exceeds the threshold, resize.
- Search(key): Compute index = h(key) mod m; probe the bucket/slot until the key is found or the end of the chain/probe sequence is reached.
- Delete(key): Compute index = h(key) mod m; locate and remove the entry and decrement n; for open addressing, mark the slot as deleted (tombstone) to preserve probe chains.
Chaining Implementation
A basic hash table using chaining resolves collisions by storing multiple elements in linked lists at each bucket. The structure consists of an array of lists, initialized with a number of buckets m, typically a power of 2 for efficient resizing. The hash function is denoted h(key), mapping a key to an index between 0 and m − 1. Operations maintain a load factor α = n/m, where n is the number of elements, and resize when α exceeds a threshold like 0.75 to ensure amortized constant-time performance. The following pseudocode outlines the core class and operations for chaining, including insertion that checks for resizing.
class HashTable
list[] table // array of m lists
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.75
function HashTable(integer initial_m)
m = initial_m
n = 0
table = new list[m] // each entry is an empty list
for i = 0 to m-1
table[i] = empty list
end for
end function
function integer hash(key)
return h(key) mod m // stub for hash function
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
index = hash(key)
// Check if key exists and update value
for each entry in table[index]
if entry.key == key
entry.value = value
return
end if
end for
// Insert new entry at head of list
table[index].insert_front(new entry(key, value))
n = n + 1
end function
function search(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
return entry.value
end if
end for
return null // not found
end function
function delete(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
remove entry from table[index]
n = n - 1
return
end if
end for
// Key not found
end function
function resize(integer new_m)
old_table = table
old_m = m
m = new_m
n = 0
table = new list[m]
for i = 0 to m-1
table[i] = empty list
end for
for i = 0 to old_m - 1
for each entry in old_table[i]
insert(entry.key, entry.value) // reinsert using new hash
end for
end for
end function
Open Addressing Implementation
Open addressing stores elements directly in the array without lists, using probing sequences to resolve collisions. Linear probing, a common method, computes the i-th probe position as (h(key) + i) mod m, where i is the probe number starting at 0. Deletions use tombstones (entries marked as "deleted") to avoid breaking search chains, allowing probes to continue past them during searches while treating them as available slots for insertions. The table resizes similarly when the load exceeds a threshold, typically a lower one (e.g., 0.5) to mitigate clustering.[50] The pseudocode below shows the operations for linear-probing open addressing.
class OpenHashTable
entry[] table // array of m entries, each may be null, key-value pair, or deleted
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.5
function OpenHashTable(integer initial_m)
m = initial_m
n = 0
table = new entry[m] // all null initially
end function
function integer hash(key, integer i)
return (h(key) + i) mod m // linear probing stub
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
i = 0
first_free = -1 // first tombstone or empty slot seen along the probe sequence
repeat
j = hash(key, i)
if table[j] == null
if first_free == -1
first_free = j
end if
break // the key cannot appear beyond an empty slot
end if
if table[j] == deleted
if first_free == -1
first_free = j // reuse the tombstone, but keep probing for the key
end if
else if table[j].key == key
table[j].value = value // key already present: update in place
return
end if
i = i + 1
until i == m
if first_free == -1
error "table overflow"
end if
table[first_free] = new entry(key, value)
n = n + 1
end function
function search(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return null // not found, end of chain
end if
if table[j] != deleted and table[j].key == key
return table[j].value
end if
i = i + 1
until i == m
return null
end function
function delete(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return // not found
end if
if table[j] != deleted and table[j].key == key
table[j] = deleted // tombstone
n = n - 1
return
end if
i = i + 1
until i == m
end function
function resize(integer new_m)
old_table = table
m = new_m
n = 0
table = new entry[m] // all null
for each old_entry in old_table
if old_entry != null and old_entry != deleted
insert(old_entry.key, old_entry.value) // reinsert
end if
end for
end function
Language-Specific Considerations
In Java, the HashMap class implements a hash table using separate chaining for collision resolution, where each bucket contains a linked list of entries. The default load factor is set to 0.75, providing a balance between time and space efficiency by triggering resizing when the table reaches 75% capacity. Additionally, to mitigate performance degradation from long chains, HashMap converts linked lists to balanced binary search trees (red-black trees) when a chain exceeds a threshold of 8 elements, a feature introduced in Java 8 for improved worst-case lookup times. For concurrent environments, ConcurrentHashMap extends this design with fine-grained locking, supporting full concurrency for reads and adjustable concurrency for updates without blocking retrievals.
Python's built-in dict type (in the CPython implementation) employs open addressing with a perturbed probing strategy for collision resolution, implemented in C for efficiency. The probing sequence uses the recurrence index = ((5 * index) + 1 + perturb) % size, where perturb is derived from the key's hash and right-shifted by 5 bits each iteration to add randomness and reduce clustering.[51] Hash randomization, enabled by default since Python 3.3, seeds the hash function for strings and bytes with a random value at process startup to prevent hash-flooding denial-of-service attacks, making those hash values unpredictable across runs. The set type is essentially an optimized dict variant, using the same underlying hash table structure but storing only keys with dummy values, which allows for efficient membership testing and set operations.
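The probe recurrence described above can be sketched as a small Python generator; this is an illustration of the documented strategy rather than CPython's actual C implementation, and the function name is ad hoc.
def dict_style_probes(key_hash: int, table_size: int, perturb_shift: int = 5):
    # Yield candidate slots using the recurrence above; table_size must be a power of two.
    mask = table_size - 1
    perturb = key_hash & (2**64 - 1)   # treat the hash as unsigned, as the C code does
    index = perturb & mask
    while True:
        yield index
        perturb >>= perturb_shift                  # gradually mixes in higher hash bits
        index = (5 * index + 1 + perturb) & mask   # same as % size for power-of-two sizes

gen = dict_style_probes(hash("spam"), table_size=8)
print([next(gen) for _ in range(5)])   # first five slots the probe would visit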
In C++, the standard std::unordered_map utilizes separate chaining, maintaining per-bucket node chains (typically singly linked lists traversed via forward iterators) within buckets determined by a user-supplied or default hash function. It supports customizable hash policies via the Hash template parameter and equality predicates via KeyEqual, enabling adaptations for user-defined types. The reserve method preallocates buckets to accommodate a specified number of elements without exceeding the maximum load factor, avoiding intermediate rehashes during insertion and improving performance for known sizes.
Implementing hash tables in these languages often involves challenges such as providing custom hashers and equality predicates to ensure proper distribution and collision handling for non-standard key types, particularly in C++ where the standard requires hash functions to be consistent with equality. Move semantics, introduced in C++11, further complicate implementations by necessitating efficient transfer of ownership in containers like std::unordered_map, avoiding unnecessary copies during resizing or insertion.
Specialized libraries address these issues with alternative strategies; for instance, Google's dense_hash_map from the Sparsehash library uses open addressing with quadratic probing, optimizing for dense storage and faster lookups at the cost of higher space usage under low loads. Similarly, Abseil's SwissTable implementation in C++ employs open addressing with SIMD-accelerated lookups over a compact array of per-slot control bytes, offering superior performance over standard chaining in high-throughput scenarios.
Best practices for hash table usage include seeding hash functions with random values to defend against collision-based attacks like hash flooding, where adversaries craft inputs to degrade performance to O(n) per operation. Monitoring and maintaining load factors below about 0.75 ensures efficient operation, with proactive resizing or capacity reservation to prevent excessive collisions and rehashing overhead.