from Wikipedia

Hash table
Type: Unordered associative array
Invented: 1953

Time complexity in big O notation
Operation   Average   Worst case
Search      Θ(1)      O(n)[a]
Insert      Θ(1)      O(n)
Delete      Θ(1)      O(n)
Space complexity
Space       Θ(n)[2]   O(n)

A small phone book as a hash table

In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values.[3] A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a hash map.

Most hash table designs employ an imperfect hash function. Hash collisions, where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way. Common strategies to handle hash collisions include chaining, which stores multiple elements in the same slot using linked lists, and open addressing, which searches for the next available slot according to a probing sequence.[4]

In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.[5][4]: 513–558 [6]

Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.[7]: 458 

In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. Because of their fast average-case performance, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.[8] Many programming languages provide built-in hash table structures, such as Python's dictionaries, Java's HashMap, and C++'s unordered_map, which abstract the complexity of hashing from the programmer.[9]

History

The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. The first example of open addressing was proposed by A. D. Linh, building on Luhn's memorandum.[4]: 547  Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler.[10]: 124  Open addressing with linear probing is credited to Amdahl, although Andrey Ershov independently had the same idea.[10]: 124–125  The term "open addressing" was coined by W. Wesley Peterson in his article which discusses the problem of search in large files.[11]: 15 

The first published work on hashing with chaining is credited to Arnold Dumey, who discussed the idea of using remainder modulo a prime as a hash function.[11]: 15  The word "hashing" was first published in an article by Robert Morris.[10]: 126  A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.[11]: 15 

Overview

An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array A of length m is partially filled with n elements, where m ≥ n. A key x is hashed using a hash function h to compute an index location h(x) in the hash table, where h(x) < m. The efficiency of a hash table depends on the load factor, defined as the ratio of the number of stored elements to the number of available slots, with lower load factors generally yielding faster operations.[12] At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees.[11]: 1

Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.[11]: 1 

Load factor

The load factor α is a critical statistic of a hash table, and is defined as α = n/m,[2] where

  • n is the number of entries occupied in the hash table.
  • m is the number of buckets.

The performance of the hash table deteriorates in relation to the load factor α.[11]: 2  In the limit of large n and m, the number of entries in each bucket statistically follows a Poisson distribution with expectation α for an ideally random hash function.

The software typically ensures that the load factor α remains below a certain constant, α_max. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor α reaches α_max. Similarly the table may also be resized if the load factor drops below α_max/4.[13]

Load factor for separate chaining

With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.[14]

Separate chaining hash tables suffer gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed.[13]

With separate chaining, the value of α_max that gives best performance is typically between 1 and 3.[13]

Load factor for open addressing

With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.[14]

The performance of open addressing becomes very bad when the load factor approaches 1.[13]

Therefore a hash table that uses open addressing must be resized or rehashed if the load factor approaches 1.[13]

With open addressing, acceptable figures of max load factor should range around 0.6 to 0.75.[15][16]: 110 

Hash function

A hash function h maps the universe U of keys to indices or slots within the table, that is, h : U → {0, ..., m − 1} for a table of size m. The conventional implementations of hash functions are based on the integer universe assumption that all elements of the table stem from the universe U = {0, ..., u − 1}, where the bit length of u is confined within the word size of a computer architecture.[11]: 2

A hash function h is said to be perfect for a given set S if it is injective on S, that is, if each element x ∈ S maps to a different value in {0, ..., m − 1}.[17][18] A perfect hash function can be created if all the keys are known ahead of time.[17]

Integer universe assumption

The schemes of hashing used in integer universe assumption include hashing by division, hashing by multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing.[11]: 2  However, hashing by division is the commonly used scheme.[19]: 264 [16]: 110 

Hashing by division

The scheme in hashing by division is as follows:[11]: 2  h(x) = x mod m, where h(x) is the hash value of x and m is the size of the table.

Hashing by multiplication

The scheme in hashing by multiplication is as follows:[11]: 2–3  h(x) = ⌊m · (xA mod 1)⌋, where A is a non-integer real-valued constant with 0 < A < 1 and m is the size of the table. An advantage of hashing by multiplication is that the choice of m is not critical.[11]: 2–3  Although any value of A produces a hash function, Donald Knuth suggests using the golden ratio, A ≈ (√5 − 1)/2.[11]: 3
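
Both schemes can be expressed as one-line functions. The following Python sketch is illustrative (A is the reciprocal of the golden ratio that Knuth suggests; m is the table size):

import math

def hash_division(x: int, m: int) -> int:
    # h(x) = x mod m
    return x % m

A = (math.sqrt(5) - 1) / 2   # ≈ 0.6180339887, the reciprocal of the golden ratio

def hash_multiplication(x: int, m: int) -> int:
    # h(x) = floor(m * (x*A mod 1)); the exact choice of m is not critical
    return int(m * ((x * A) % 1.0))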

String hashing

Commonly a string is used as a key to the hash function. Stroustrup[20] describes a simple hash function in which an unsigned integer that is initially zero is repeatedly left shifted one bit and then xor'ed with the integer value of the next character. This hash value is then taken modulo the table size. If the left shift is not circular, then the string length should be at least eight bits less than the size of the unsigned integer in bits. Another common way to hash a string to an integer is with a polynomial rolling hash function.
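
A Python sketch of the two schemes described above; the 32-bit mask emulating a fixed-width unsigned integer and the base 31 are illustrative choices, not prescribed by the text:

def shift_xor_hash(s: str, m: int) -> int:
    # Stroustrup-style: left shift one bit, then xor in the next character's code
    h = 0
    for ch in s:
        h = ((h << 1) ^ ord(ch)) & 0xFFFFFFFF   # keep 32 bits, as a fixed-width unsigned integer would
    return h % m

def polynomial_rolling_hash(s: str, m: int, base: int = 31) -> int:
    # treat the string as digits of a base-`base` number, reduced modulo the table size
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m
    return h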

Choosing a hash function

Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[21][22]

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[23]
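
For example, with a power-of-two table size the index can be taken from the low-order bits of the hash value by masking rather than a modulo; a minimal Python sketch (names are illustrative):

def index_for(hash_value: int, m: int) -> int:
    # when m is a power of two, hash_value & (m - 1) equals hash_value % m
    assert m > 0 and m & (m - 1) == 0, "m must be a power of two"
    return hash_value & (m - 1)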

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.[23][4]

K-independent hashing offers a way to prove a certain hash function does not have bad keysets for a given type of hashtable. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.[24]

Collision resolution

A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for arbitrary unseen data.[4]: 515  Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.[7]: 458

Separate chaining

Hash collision resolved by separate chaining
Hash collision by separate chaining with head records in the bucket array.

In separate chaining, the process involves building a linked list of key–value pairs for each array index. The collided items are chained together through a single linked list, which can be traversed to access the item with a unique search key.[7]: 464  Collision resolution through chaining with a linked list is a common method of implementation of hash tables. Let T and x be the hash table and the node with key k respectively; the operations are as follows:[19]: 258

Chained-Hash-Insert(T, x)
  insert x at the head of linked list T[h(x.key)]

Chained-Hash-Search(T, k)
  search for an element with key k in linked list T[h(k)]

Chained-Hash-Delete(T, x)
  delete x from the linked list T[h(x.key)]

If the element is comparable either numerically or lexically, and inserted into the list by maintaining the total order, it results in faster termination of the unsuccessful searches.[4]: 520–521 
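
The pseudocode above corresponds to a small table of chains. The Python sketch below uses ordinary lists in place of linked lists and is illustrative rather than an optimized implementation:

class ChainedHashTable:
    def __init__(self, m: int = 8):
        self.m = m
        self.buckets = [[] for _ in range(m)]   # one chain per slot

    def _index(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                        # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                             # unsuccessful search

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return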

Other data structures for separate chaining

If the keys are ordered, it could be efficient to use "self-organizing" concepts such as using a self-balancing binary search tree, through which the theoretical worst case could be brought down to O(log n), although it introduces additional complexities.[4]: 521

In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to a guaranteed O(1) in the worst case. In this technique, the buckets of k entries are organized as perfect hash tables with k² slots providing constant worst-case lookup time, and low amortized time for insertion.[25] A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.[26]: 99

Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.[27]

Caching and locality of reference

The linked list of a separate chaining implementation may not be cache-conscious due to poor spatial locality (locality of reference) when the nodes of the linked list are scattered across memory; thus the list traversal during insert and search may entail CPU cache inefficiencies.[26]: 91

In cache-conscious variants of collision resolution through separate chaining, a dynamic array found to be more cache-friendly is used in the place where a linked list or self-balancing binary search trees is usually deployed, since the contiguous allocation pattern of the array could be exploited by hardware-cache prefetchers—such as translation lookaside buffer—resulting in reduced access time and memory consumption.[28][29][30]

Open addressing

Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".
This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically.

Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.[31]

Well-known probe sequences include:

  • Linear probing, in which the interval between probes is fixed (usually 1).[32]
  • Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.[33]: 272 
  • Double hashing, in which the interval between probes is computed by a secondary hash function.[33]: 272–273 

The performance of open addressing may be slower compared to separate chaining since the probe sequence increases when the load factor approaches 1.[13][26]: 93  The probing results in an infinite loop if the load factor reaches 1, in the case of a completely filled table.[7]: 471  The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since formation of clusters would result in increased search time.[7]: 472 
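
A Python sketch of open addressing with linear probing (interval = 1); the table keeps at least one slot free so probing terminates, and deletion is omitted because it requires tombstone markers:

class LinearProbingTable:
    _EMPTY = object()

    def __init__(self, m: int = 16):
        self.m = m
        self.n = 0
        self.slots = [LinearProbingTable._EMPTY] * m

    def insert(self, key, value):
        assert self.n < self.m - 1, "table too full; resize before inserting"
        i = hash(key) % self.m
        while self.slots[i] is not LinearProbingTable._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % self.m                # probe the next slot
        if self.slots[i] is LinearProbingTable._EMPTY:
            self.n += 1
        self.slots[i] = (key, value)

    def search(self, key):
        i = hash(key) % self.m
        while self.slots[i] is not LinearProbingTable._EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.m
        return None                             # reached an empty slot: unsuccessful search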

Caching and locality of reference

Since the slots are located in successive locations, linear probing could lead to better utilization of CPU cache due to locality of references resulting in reduced memory latency.[32]

Other collision resolution techniques based on open addressing

Coalesced hashing

Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table.[34]: 6–8  The algorithm is ideally suited for fixed memory allocation.[34]: 4  The collision in coalesced hashing is resolved by identifying the largest-indexed empty slot on the hash table, then the colliding value is inserted into that slot. The bucket is also linked to the inserted node's slot which contains its colliding hash address.[34]: 8 

Cuckoo hashing

Cuckoo hashing is a form of open addressing collision resolution technique which guarantees O(1) worst-case lookup complexity and constant amortized time for insertions. The collision is resolved through maintaining two hash tables, each having its own hashing function; a collided slot gets replaced with the given item, and the pre-occupied element of the slot gets displaced into the other hash table. The process continues until every key has its own spot in the buckets of the tables; if the procedure enters an infinite loop—identified through maintaining a threshold loop counter—both hash tables get rehashed with newer hash functions and the procedure continues.[35]: 124–125
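
A compact Python sketch of the displacement loop; the seeded hash functions, the loop threshold, and the rehash step are simplified illustrations (duplicate-key updates are not handled):

import random

class CuckooTable:
    def __init__(self, m: int = 16):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self.seeds = [random.random(), random.random()]

    def _h(self, which: int, key) -> int:
        return hash((self.seeds[which], key)) % self.m

    def search(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._h(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value, max_loop: int = 32):
        entry = (key, value)
        which = 0
        for _ in range(max_loop):
            i = self._h(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry   # kick out the occupant
            if entry is None:
                return                           # nothing displaced: done
            which = 1 - which                    # push the displaced entry into the other table
        self._rehash()                           # loop threshold reached: pick new hash functions
        self.insert(entry[0], entry[1])

    def _rehash(self):
        old = [e for t in self.tables for e in t if e is not None]
        self.seeds = [random.random(), random.random()]
        self.tables = [[None] * self.m, [None] * self.m]
        for k, v in old:
            self.insert(k, v)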

Hopscotch hashing

Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a neighbourhood of buckets—the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.[36]: 351–352  The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, and thus is well suited for implementing a resizable concurrent hash table.[36]: 350  The neighbourhood characteristic of hopscotch hashing guarantees the property that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to move an item into its neighbourhood—with a possible cost involved in displacing other items.[36]: 352

Each bucket within the hash table includes an additional "hop-information" field—an H-bit bit array indicating the relative distance of the item which was originally hashed into the current virtual bucket within H − 1 entries.[36]: 352  Let k and Bk be the key to be inserted and the bucket to which the key is hashed, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is upheld:[36]: 352–353  if Bk is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1; if not empty, linear probing is used for finding an empty slot in the table, and the bitmap of the bucket gets updated followed by the insertion; if the empty slot is not within the range of the neighbourhood, i.e. H − 1, subsequent swaps and hop-information bit array manipulations of each bucket are performed in accordance with its neighbourhood invariant properties.[36]: 353

Robin Hood hashing

Robin Hood hashing is an open addressing based collision resolution algorithm; the collisions are resolved through favouring the displacement of the element that is farthest—or has the longest probe sequence length (PSL)—from its "home location", i.e. the bucket to which the item was hashed.[37]: 12  Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets,[38]: 2  i.e. dealing with cluster formation in the hash table.[39] Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.[40] Let x be the key to be inserted, x.psl be the (incremental) PSL length of x, T be the hash table and j be the index; the insertion procedure is as follows (a sketch appears after the list):[37]: 12–13 [41]: 5

  • If x.psl ≤ T[j].psl: the iteration goes into the next bucket without attempting an external probe.
  • If x.psl > T[j].psl: insert the item x into the bucket j; swap x with T[j]—let it be x′; continue the probe from the (j+1)th bucket to insert x′; repeat the procedure until every element is inserted.
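
A minimal Python sketch of the insertion step, where each slot stores [key, value, psl] and the table is assumed to have at least one empty slot:

def robin_hood_insert(table, key, value):
    # table: list of slots, each either None or a [key, value, psl] entry
    m = len(table)
    i = hash(key) % m
    entry = [key, value, 0]                      # PSL starts at 0 at the home location
    while table[i] is not None:
        if table[i][2] < entry[2]:
            # the resident is closer to its home than the incoming entry: displace it
            table[i], entry = entry, table[i]
        i = (i + 1) % m
        entry[2] += 1                            # the carried entry moves one slot farther from home
    table[i] = entry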

Dynamic resizing

Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are rehashed into the buckets of the new hash table,[13] since the items cannot simply be copied over, as varying table sizes result in different hash values due to the modulo operation.[42] If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.[43]

Resizing by moving all entries

Generally, a new hash table with a size double that of the original hash table gets allocated privately and every item in the original hash table gets moved to the newly allocated one by computing the hash values of the items followed by the insertion operation. Rehashing is simple, but computationally expensive.[44]: 478–479 

Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually to avoid a storage blip—typically at 50% of the new table's size—during rehashing and to avoid memory fragmentation that triggers heap compaction due to deallocation of the large memory blocks occupied by the old hash table.[45]: 2–3  In such cases, the rehashing operation is done incrementally through extending the prior memory block allocated for the old hash table such that the buckets of the hash table remain unaltered. A common approach for amortized rehashing involves maintaining two hash functions, h_old and h_new. The process of rehashing a bucket's items in accordance with the new hash function is termed cleaning, which is implemented through the command pattern by encapsulating operations such as Add(key), Get(key) and Delete(key) through a Lookup(key, command) wrapper, such that each element in the bucket gets rehashed; the procedure involves the following:[45]: 3

  • Clean the Table[h_old(key)] bucket.
  • Clean the Table[h_new(key)] bucket.
  • The command gets executed.

Linear hashing

Linear hashing is an implementation of the hash table which enables dynamic growths or shrinks of the table one bucket at a time.[46]

Performance

The performance of a hash table is dependent on the hash function's ability to generate quasi-random numbers (σ) for entries in the hash table, where K, n and h(x) denote the key, the number of buckets and the hash function, such that σ = h(K) % n. If the hash function generates the same σ for distinct keys (K1 ≠ K2, h(K1) = h(K2)), this results in a collision, which is dealt with in a variety of ways. The constant time complexity (O(1)) of the operation in a hash table is presupposed on the condition that the hash function doesn't generate colliding indices; thus, the performance of the hash table is directly proportional to the chosen hash function's ability to disperse the indices.[47]: 1  However, construction of such a hash function is practically infeasible; that being so, implementations depend on case-specific collision resolution techniques in achieving higher performance.[47]: 2

The best performance is obtained when the hash function distributes the elements of the universe uniformly, and the elements stored in the table are drawn at random from the universe. In this case, in hashing with chaining, the expected cost of a successful search is about 1 + α/2 comparisons, and the expected cost of an unsuccessful search is about 1 + α, where α is the load factor.[48]

Applications

Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays.[33]

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.[49]

Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.[50][51]

Sets

Hash tables can be used in the implementation of the set data structure, which stores unique values without any particular order; a set is typically used for testing the membership of a value in the collection, rather than for element retrieval.[52]

Transposition table

A transposition table is a complex hash table which stores information about each section that has been searched.[53]

Implementations

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.

  • In JavaScript, an "object" is a mutable collection of key-value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first coerced to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.[54] ECMAScript 2015 also added the Map data structure, which accepts arbitrary values as keys.[55]
  • C++11 includes unordered_map in its standard library for storing keys and values of arbitrary types.[56]
  • Go's built-in map implements an associative array in the form of a type, which is typically (but not guaranteed to be) implemented as a hash table.[57]
  • Java programming language includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[58]
  • Python's built-in dict implements a hash table in the form of a type.[59]
  • Ruby's built-in Hash uses the open addressing model from Ruby 2.4 onwards.[60]
  • Rust programming language includes HashMap, HashSet as part of the Rust Standard Library.[61]
  • The .NET standard library includes HashSet and Dictionary,[62][63] so it can be used from languages such as C# and VB.NET.[64]

from Grokipedia
A hash table, also known as a hash map or dictionary, is a data structure that implements an associative array, storing key-value pairs to enable efficient average-case O(1) performance for operations such as insertion, deletion, and lookup by key. It achieves this performance by using a hash function to compute an index into an array of buckets or slots, from which the desired value can be retrieved or updated directly, making it a fundamental tool in computer science for applications like databases, caches, and symbol tables.

The core mechanism of a hash table relies on the hash function, which maps keys—typically strings, integers, or other comparable types—to numeric indices within a fixed-size array, ideally distributing entries uniformly to minimize clustering. Common hash functions include division-based methods (e.g., key modulo a prime table size), multiplicative hashing (e.g., using the golden ratio for better uniformity), and bit-shift techniques, with the choice depending on the key type and desired load factor (the ratio of stored elements to table size, often kept below 0.75 to maintain performance). Collisions, which occur when multiple keys hash to the same index, are resolved through techniques such as separate chaining (linking collided entries in lists at each slot, yielding expected O(1 + α) search time where α is the load factor) or open addressing (probing for the next available slot via linear, quadratic, or double-hashing methods, with expected O(1/(1 − α)) time under uniform hashing assumptions).

Hash tables offer significant advantages over alternatives like binary search trees or sorted arrays, providing amortized constant-time operations for dynamic datasets without the O(log n) scaling of tree-based structures, though worst-case performance can degrade to O(n) under poor hashing or adversarial inputs. They are widely used in programming languages (e.g., Python's dict, Java's HashMap) and systems requiring fast lookups, such as compilers and networking protocols.

The concept of hash tables traces back to 1953, when IBM researcher Hans Peter Luhn proposed using a hashing scheme for direct key-to-address transformation, evolving from earlier punch-card tabulation methods for efficient categorization. Donald Knuth later formalized and popularized the structure in his seminal work The Art of Computer Programming, Volume 3: Sorting and Searching (1973), establishing hashing as a cornerstone of algorithm design, with subsequent advancements like Carter and Wegman's universal hashing (1979) improving provable performance guarantees.

Fundamentals

Definition and Operations

A hash table, also known as a hash map, is a data structure that implements mappings from keys to values, storing key-value pairs in an array of buckets or slots where a hash function computes the index for each key to enable efficient access. This structure leverages the array as its underlying storage mechanism, where the array serves as a fixed-size collection of slots that can accommodate the mapped entries. Under the assumption of simple uniform hashing, the basic operations of insertion, search, and deletion achieve an average time complexity of O(1). The insert operation adds a new key-value pair to the hash table. It computes the hash of the key to determine the target index and places the pair into that bucket; if the key already exists, the value may be updated depending on the implementation. Pseudocode for a basic insert, assuming no collisions for illustration, is as follows:

function insert(key, value):
    index = hash(key)        // maps the key to an array index
    table[index] = (key, value)

The search (or get) operation retrieves the value associated with a given key. It applies the hash function to the key to find the bucket and checks for the key within that bucket to return the corresponding value if found. Pseudocode for search:

function search(key):
    index = hash(key)
    if table[index].key == key:
        return table[index].value
    else:
        return not found

Deletion removes a key-value pair from the hash table. It hashes the key to locate the bucket and removes the pair matching the key from that bucket. Pseudocode for delete:

function delete(key):
    index = hash(key)
    if table[index].key == key:
        remove table[index]
    else:
        do nothing            // key not present

To illustrate, consider a simple hash table with an array of 5 buckets using integer keys and a basic hash function that maps a key k to index k mod 5. Inserting the pair (3, "apple") computes index 3 and stores it in bucket 3. Searching for key 3 hashes to index 3 and retrieves "apple". Deleting key 3 removes the pair from bucket 3. This example assumes that no two keys map to the same index; in practice the hash function serves as the mapping tool from keys to indices, and collisions—where multiple keys share a bucket—are handled separately.

Load Factor

The load factor of a hash table, denoted α, is defined as the ratio of the number of entries n stored in the table to the number of buckets m, expressed as α = n/m. This metric quantifies the fullness of the table and serves as a key indicator for maintaining performance efficiency. Typical implementations set resizing thresholds around 0.7 to 0.75 to balance space usage and operation speed, as higher values risk excessive collisions.

A higher load factor increases the probability of collisions, which degrades search, insertion, and deletion performance by requiring more probes or comparisons. In separate chaining, where collisions are handled by linking entries in buckets, the load factor directly influences the expected chain length, which averages α under the simple uniform hashing assumption. Consequently, the expected time for an unsuccessful search is 1 + α probes, as the initial bucket access is followed by traversing an average of α elements in the chain.

In open addressing schemes, the load factor must remain strictly less than 1 to ensure available slots for insertions, since all entries occupy distinct buckets. For quadratic probing, performance notably degrades when α exceeds 0.5, as secondary clustering—where probes tend to revisit similar bucket sequences—intensifies, leading to longer probe sequences and potential insertion failures even before the table fills. To mitigate these effects, hash tables trigger resizing when the load factor surpasses a predefined threshold, typically by doubling the number of buckets to approximately halve the new α. This process assumes a uniform distribution of hash values to ensure collisions remain manageable post-resizing.

Hash Functions

Properties and Design Principles

A good hash function for use in hash tables must exhibit several key properties to ensure balanced distribution of keys and efficient operations. Primarily, it must be deterministic, meaning that identical input keys always produce the same output hash value, which is essential for consistent retrieval and storage. Additionally, the function should achieve uniform distribution, mapping keys evenly across the range of possible hash values to minimize clustering in buckets and approximate the simple uniform hashing assumption. Efficiency in computation is another critical property, as the hash function must evaluate quickly, often in constant time relative to key size, to avoid bottlenecking insert, search, and delete operations. A desirable property is the avalanche effect, where even a single-bit change in the input causes approximately half the output bits to change, enhancing resistance to patterns in keys and improving overall mixing.

To provide theoretical guarantees on uniformity, especially when keys arrive in adversarial or non-random order, universal hashing employs a family of functions rather than a single one. A family H of hash functions from a key universe U to {0, 1, ..., m−1} (where m is the number of buckets) is universal if, for any distinct keys x, y ∈ U, the probability Pr[h(x) = h(y)] ≤ 1/m over a random choice of h ∈ H. This bound ensures that the expected number of collisions for any set of keys is close to that under ideal random hashing, with the probability of exceeding t times the expected collisions being less than 1/t for t ≥ 1. In a simple probabilistic model, selecting h randomly from such a family simulates independent uniform mapping of each key to buckets, yielding average linear-time performance for chained hash tables even with up to αm insertions, where α is the load factor.

An explicit construction of a universal family, introduced by Carter and Wegman, uses affine transformations over a prime field. Let p be a prime larger than the key universe size, and m the table size; the family consists of functions h_{a,b}(k) = ((a·k + b) mod p) mod m, where a ∈ {1, 2, ..., p−1} and b ∈ {0, 1, ..., p−1} are chosen uniformly at random.

function universal_hash(k, a, b, p, m):
    return ((a * k + b) % p) % m

This family satisfies universality because, for distinct k1 and k2, the values (a·k1 + b) mod p and (a·k2 + b) mod p coincide for at most one a per b, yielding the required collision probability. Computation involves basic arithmetic operations, making it efficient for integer keys.

The choice of hash function also depends on key types, introducing trade-offs between computational speed and distribution quality. For integers, direct modular reduction suffices and is fast, but for strings or composite objects (e.g., tuples of fields), methods like polynomial rolling hashes—treating the key as base-R digits and accumulating Σ c_i·R^i mod m—provide better uniformity by incorporating all bits, though at higher cost for long keys. Composite keys require recursive hashing of components, such as h((x, y)) = (h(x)·R + h(y)) mod m, balancing depth against overflow risks in intermediate values. Simpler functions prioritize speed for performance-critical applications but risk poor distribution on structured data, while more sophisticated ones enhance quality for diverse or patterned inputs, often using 64-bit intermediates to avoid precision loss.

In security-sensitive environments, such as servers handling untrusted inputs, hash functions must resist adversarial manipulation like hash-flooding attacks, where attackers generate collision-heavy inputs to force degradation via long chains or probes. Here, non-cryptographic hashes vulnerable to multicollision prediction are inadequate; instead, keyed pseudorandom functions like SipHash, which uses a 128-bit secret key to produce unpredictable outputs, mitigate this by requiring quadratic attacker effort to induce collisions. SipHash, optimized for short inputs common in hash tables, processes 8-byte keys in under 200 cycles while providing cryptographic strength against denial of service via flooding.

Evaluating hash functions focuses on uniformity and collision probability to verify these properties. Uniformity is assessed via statistical tests like the chi-squared (χ²) test, which computes χ² = Σ (O_i − E_i)² / E_i over bins of hash values, where O_i and E_i are observed and expected counts under uniformity; low χ² values (near the number of degrees of freedom) confirm even distribution. Collision probability, for non-cryptographic use, is measured empirically or bounded via universality proofs, ensuring it remains near 1/m without exploitable biases.

Common Techniques

Common techniques for hash functions focus on transforming keys into table indices efficiently while promoting uniformity. For different types of keys, several methods have been developed to achieve this.

Integer Hashing

The division method is a straightforward approach for integer keys, defined as h(k) = k mod m, where k is the key and m is the table size, producing a value in the range [0, m−1]. This method assumes keys are drawn from a universe of integers, typically from 0 to U−1 where U >> m, and derives uniformity from the property that if m is chosen appropriately for the key distribution's characteristics (e.g., avoiding powers of 2 for patterned keys), the remainders approximate a uniform distribution over the slots. For example, with m = 10 and k = 27, h(27) = 27 mod 10 = 7; for k = 123, h(123) = 123 mod 10 = 3.

The multiplication method improves on this by leveraging fractional parts for better distribution, given by h(k) = ⌊m · {kA}⌋, where A is a constant between 0 and 1 and {x} denotes the fractional part of x (i.e., kA mod 1). Knuth recommends A ≈ (√5 − 1)/2 ≈ 0.6180339887, the reciprocal of the golden ratio, as it yields low-discrepancy sequences that enhance uniformity regardless of key distribution patterns. The derivation of uniformity stems from the irrationality of A, ensuring that the sequence {kA} is dense and equidistributed in [0, 1) for sequential integer k, thus mapping evenly to [0, m−1] after scaling and flooring. For m = 10, A ≈ 0.618034, and k = 27, first compute 27 × 0.618034 ≈ 16.687, fractional part 0.687, then h(27) = ⌊10 × 0.687⌋ = 6; for k = 123, 123 × 0.618034 ≈ 76.018, fractional part 0.018, h(123) = ⌊10 × 0.018⌋ = 0.

As a historical example, the mid-square method squares the key and extracts the middle digits to form the hash value, suitable for fixed-length representations. For a key with d digits, square it to get up to 2d digits and take the middle d digits as h(k), then reduce mod m if needed. This was an early technique but often produces poor uniformity due to clustering in squared values. For m = 10 and k = 27 (squared = 0729, middle digits 72), h(27) = 72 mod 10 = 2; for k = 123 (squared = 15129, middle digits 51), h(123) = 51 mod 10 = 1.

These methods typically assume keys belong to a universe of non-negative integers from 0 to U−1, with U much larger than m to ensure probabilistic uniformity.

String Hashing

For string keys, the polynomial rolling hash treats the string as a base-b number modulo a prime p, computed as h(s) = (s₀·b^(n−1) + s₁·b^(n−2) + ... + s_{n−1}·b⁰) mod p, where the s_i are character codes (e.g., ASCII), n is the string length, b is the base, and p is a large prime. The base b is chosen as a small prime like 31 or 131 to balance computation and mixing, while p (e.g., a 64-bit prime) is selected large to minimize collision probability, approximating 1/p for random strings under the birthday paradox. This formulation, used in the Rabin–Karp algorithm, enables efficient sliding-window updates for substring hashing by multiplying the prior hash by b, subtracting the dropped term, and adding the new one, all mod p. For example, with the string "abc" (a = 97, b = 98, c = 99), b = 31, p = 101, n = 3: h("abc") = (97·31² + 98·31 + 99) mod 101 = (93217 + 3038 + 99) mod 101 = 96354 mod 101 = 0.
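
A Python sketch of the sliding-window update described above; the base 31 and the large prime modulus are illustrative parameters:

def rolling_hashes(s: str, window: int, b: int = 31, p: int = 10**9 + 7):
    # hashes of every length-`window` substring of s, using one O(1) update per step
    assert 1 <= window <= len(s)
    h = 0
    for c in s[:window]:
        h = (h * b + ord(c)) % p
    hashes = [h]
    bw = pow(b, window - 1, p)                   # b^(window-1) mod p, used to drop the oldest character
    for i in range(window, len(s)):
        h = ((h - ord(s[i - window]) * bw) * b + ord(s[i])) % p
        hashes.append(h)
    return hashes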

Hybrid Approaches

For complex keys like memory addresses, hybrid methods combine techniques such as XOR folding, which divides the key into fixed-width parts and XORs them to produce a compact value before applying another hash like division. This is useful for network addresses, folding octets via successive XOR to mix bits evenly without overflow issues. For a 32-bit address 0x1A2B3C4D with m = 10, fold into 16-bit halves (0x1A2B ^ 0x3C4D = 0x2666), then 0x26 ^ 0x66 = 0x40, and h = 0x40 mod 10 = 4; for 0x11223344, 0x1122 ^ 0x3344 = 0x2266, then 0x22 ^ 0x66 = 0x44, and 0x44 mod 10 = 8.
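
The folding steps in the example can be expressed as a small Python helper (the widths and table size are those of the example):

def xor_fold_hash(addr: int, m: int) -> int:
    # fold a 32-bit value into 16 bits, then into 8 bits, by successive XOR
    h16 = (addr >> 16) ^ (addr & 0xFFFF)
    h8 = (h16 >> 8) ^ (h16 & 0xFF)
    return h8 % m

# xor_fold_hash(0x1A2B3C4D, 10) == 4 and xor_fold_hash(0x11223344, 10) == 8, matching the example.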

Collision Resolution

Separate Chaining

Separate chaining resolves collisions in hash tables by associating each bucket with a linked list (or chain) that stores all keys hashing to that index. When inserting a key that collides with existing entries, it is appended to the corresponding chain, allowing multiple elements per bucket without requiring alternative slots. This approach maintains the hash table's structure while deferring collision handling to the auxiliary lists, which grow dynamically as needed.

The primary operations—insertion, search, and deletion—in separate chaining rely on computing the hash index and then traversing the chain linearly. For insertion, the key is typically added at the head of the list for constant-time access, though appending to the end preserves order if desired (requiring a tail pointer for efficiency). Search and deletion involve scanning the chain from the head until the key is found or the end is reached; deletion requires updating the next pointers to remove the node while preserving the list. Pseudocode for these operations, assuming a hash table T as an array of linked lists and a hash function h(k), is as follows.

Insertion:

CHAINED-INSERT(T, k)
    x = create new node with key k and value (if applicable)
    i = h(k) mod m          // m is the number of buckets
    x.next = T[i].head
    T[i].head = x

This prepends the node in amortized O(1) time, excluding the hash computation. Search:

CHAINED-SEARCH(T, k)
    i = h(k) mod m
    x = T[i].head
    while x ≠ null and x.key ≠ k
        x = x.next
    return x                // null if not found

This performs a linear scan of the chain, with time proportional to the chain length. Deletion:

CHAINED-DELETE(T, k)
    i = h(k) mod m
    x = T[i].head
    if x = null
        return              // empty chain: key not present
    if x.key = k
        T[i].head = x.next
        free x
        return
    y = x
    x = x.next
    while x ≠ null and x.key ≠ k
        y = x
        x = x.next
    if x ≠ null
        y.next = x.next
        free x

This scans the chain and splices out the matching node, again linear in chain length. To support efficient deletion without full scans, doubly linked lists can be used, but singly linked lists suffice for basic implementations.

Separate chaining offers simplicity in implementation and robustness against poor hash distributions, as it tolerates load factors α > 1 (where α = n/m, n is the number of elements, and m is the number of buckets) without primary clustering issues. Performance degrades gracefully with increasing α, unlike probing methods that cluster. However, linked lists introduce overhead from pointer storage and poor cache locality due to non-contiguous access during traversal (pointer chasing), which can slow operations on modern hardware compared to contiguous structures. To mitigate long chains at high α, alternatives replace plain linked lists with balanced binary search trees (e.g., red-black trees) once chains exceed a threshold like 8 elements, reducing traversal to O(log k) per chain, where k is the chain length. This hybrid approach bounds worst-case time while preserving average-case efficiency. For better locality in frequently accessed chains, self-adjusting lists—where searched elements move to the head—can adapt to access patterns, reducing average traversal costs in non-uniform workloads.

Under the uniform hashing assumption, where each key independently hashes to any bucket with equal probability 1/m, the expected chain length is exactly α. For analysis, consider an unsuccessful search for a key k: the hash computation takes O(1) time, followed by a traversal of the entire chain in k's bucket. Let X_i be the indicator random variable with X_i = 1 if the i-th stored key hashes to the same bucket as k (probability 1/m); the expected number of comparisons is E[Σ X_i] = n/m = α. More precisely, for n keys already inserted, the probability that a particular bucket holds exactly j keys is binomial: Pr(L = j) = C(n, j) (1/m)^j (1 − 1/m)^(n−j), approximating a Poisson distribution for large m. The expected number of probes is E[P] = Σ_{j=0}^{n} (j + 1) Pr(L = j) = 1 + E[L] = 1 + α, since E[L] = n/m = α. For a successful search, the expected position of the target is uniform in the chain, yielding about 1 + α/2 probes, but both cases are Θ(1 + α) overall. Insertion and deletion follow similarly, as they reduce to a search plus O(1) adjustments. Thus, the average-case time for all operations is O(1 + α) under uniform hashing.

Open Addressing

Open addressing, also known as closed hashing, resolves collisions in hash tables by storing all elements directly within a fixed-size array, without using auxiliary data structures like linked lists. Upon inserting a key k, the initial position is computed as h(k) mod m, where m is the array size and h is the hash function; if that slot is occupied, a probe sequence h(k, i) for i = 0, 1, 2, ... is followed until an empty slot is found or the key is matched during search or deletion. This approach assumes simple uniform hashing and keeps the table load factor α = n/m < 1, where n is the number of elements, to ensure termination with high probability.

Deletion in open addressing cannot simply empty a slot, as this would disrupt probe sequences for other elements that might traverse it; instead, a special marker called a tombstone is placed, indicating the slot is available for future insertions but treated as occupied during searches to preserve chain integrity. Tombstones accumulate over time, effectively increasing the load factor and necessitating periodic rehashing to maintain performance.

Linear probing, the earliest and simplest variant, defines the probe sequence as h(k, i) = (h(k) + i) mod m. Introduced by Peterson in 1957 for random-access storage systems, it promotes sequential access but suffers from primary clustering: insertions near an occupied run of slots tend to extend the cluster, leading to longer average probe lengths. For instance, if keys hashing to indices 10, 11, and 12 are inserted into a table of size 20, a new key hashing to 10 will probe sequentially from 10 onward until it finds a free slot, exacerbating delays for nearby hashes. Knuth analyzed this in detail, showing that the average number of probes for an unsuccessful search grows on the order of 1/(1 − α)².

Quadratic probing mitigates primary clustering by using a quadratic offset: h(k, i) = (h(k) + c₁·i + c₂·i²) mod m, where the constants c₁ and c₂ are chosen to ensure good distribution, such as c₁ = c₂ = 0.5 (or equivalently, using (i + i²)/2 in integer arithmetic) for balanced steps. This produces non-linear jumps, reducing the tendency for probes to fill contiguous blocks, though it introduces secondary clustering—keys sharing the same h(k) follow identical sequences. If m is prime, the first ⌊m/2⌋ probes are distinct, guaranteeing coverage of half the table before repetition. Knuth discusses its analysis, noting it performs comparably to double hashing at low loads but requires careful parameter selection to avoid permutation issues.

Double hashing further improves distribution by employing two independent hash functions: h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₂(k) provides a variable step size, commonly set as h₂(k) = 1 + (h₁(k) mod (m − 1)) to ensure it is positive and less than m. This method approximates uniform probing, minimizing both primary and secondary clustering by making probe sequences key-dependent and pseudo-random. Introduced in Knuth's work and analyzed in subsequent papers, double hashing achieves near-ideal performance up to load factors around 0.8, with the average number of probes for an unsuccessful search close to 1/(1 − α).

Open addressing variants excel in cache efficiency due to localized, predictable memory accesses, outperforming chaining on modern hardware for small-to-medium tables.
However, they demand load factors below 0.7–0.8 to keep probe lengths reasonable—beyond this, clustering causes sharp degradation, with unsuccessful searches averaging up to O(1/(1 − α)²) probes. Deletion via tombstones adds overhead, as they occupy space without utility, often requiring compaction or resizing to control the effective load. In contrast to separate chaining, open addressing avoids pointer overhead but imposes stricter capacity limits.
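
The three probe sequences can be compared side by side in a short Python sketch; the default quadratic constants and the specific form of h2 are illustrative:

def linear_probe(h, i, m):
    return (h + i) % m

def quadratic_probe(h, i, m, c1=1, c2=1):
    return (h + c1 * i + c2 * i * i) % m

def double_hash_probe(h1, h2, i, m):
    # h2 must be nonzero and ideally coprime with m, e.g. h2 = 1 + (h1 % (m - 1))
    return (h1 + i * h2) % m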

Resizing Strategies

Standard Resizing

In standard hash table resizing, the process is triggered when the load factor α, defined as the ratio of the number of elements n to the table size m (α = n/m), exceeds a maximum threshold α_max, typically set between 0.5 and 1 to maintain performance. Upon exceeding this threshold, the table size m is doubled to 2m, and all existing n elements are reinserted into the new table using the hash function adjusted for the updated size, often by recomputing h(k) mod 2m where h is the same underlying hash function. This rehashing step ensures the load factor returns to approximately α_max / 2, preserving the uniform distribution of keys under the assumption of a good hash function.

The resizing operation incurs a temporary cost of Θ(n + m) time and additional space for the new table, as every element must be rehashed and reinserted, which can lead to pauses in performance during the rebuild. However, by doubling the size each time, the frequency of resizes decreases geometrically—each subsequent resize handles roughly twice as many elements before the next trigger—resulting in an amortized O(1) cost per insertion or deletion operation when analyzed using the potential method. For a sequence of n insertions starting from an empty table, the total rehashing cost sums to O(n), as the series of resize costs (n + n/2 + n/4 + ...) converges to less than 2n operations.

To prevent excessive space usage when the table becomes underutilized, shrinking is performed by halving the table size m to m/2 when the load factor falls below a minimum threshold, such as α_max / 4 (e.g., 0.25 if α_max = 1), followed by rehashing all elements into the smaller table. This threshold avoids oscillation between frequent grows and shrinks, maintaining efficiency while reclaiming memory.

The standard resizing approach applies to both separate chaining and open addressing implementations, as the rehashing process rebuilds the underlying structure regardless of collision resolution method. The following pseudocode illustrates a resize-integrated insert operation for a separate chaining hash table, where resizing is checked on each insertion (adaptable to open addressing by modifying the insert step):

function INSERT(key, value):
    if n / m > α_max:
        RESIZE(2 * m)
    index = HASH(key) mod m
    // For separate chaining: prepend to the chain at T[index]
    new_node = new Node(key, value)
    new_node.next = T[index]
    T[index] = new_node
    n = n + 1

function RESIZE(new_m):
    old_T = T
    old_m = m
    m = new_m
    T = new array of size m        // initialize empty chains
    for i = 0 to old_m - 1:
        current = old_T[i]
        while current != null:
            index = HASH(current.key) mod m
            temp = current.next
            current.next = T[index]
            T[index] = current
            current = temp
    delete old_T
    // For shrinking, call RESIZE(m / 2) in DELETE if n / m < α_min

This implementation ensures all elements are fully reinserted during resize, with the hash function providing uniform distribution for the new size.

Incremental Approaches

Incremental approaches to hash table resizing aim to mitigate the performance pauses associated with standard doubling or halving operations by distributing the rehashing workload over multiple insertions or operations, enabling gradual adaptation to changing data volumes. These methods are particularly valuable in environments requiring low-latency responses, such as databases and distributed systems, where full rehashing could introduce unacceptable delays.

Linear Hashing

Linear hashing, introduced by Witold Litwin in 1980, facilitates dynamic growth by splitting buckets incrementally using a split pointer, avoiding the need for a complete table reorganization. The algorithm employs a family of hash functions defined as hi(k)=h(k)mod(2iN)h_i(k) = h(k) \mod (2^i \cdot N), where NN is the initial number of buckets, ii is the current level (starting at 0), and h(k)h(k) is a base hash function producing values much larger than the table size. A split pointer, denoted as Next, indicates the next bucket to split and advances linearly through the table. During insertion, the key kk is hashed using hi(k)h_i(k) to locate the primary bucket at the current level ii. If the bucket has space, the key is inserted directly; otherwise, an overflow page is chained to it temporarily. To resolve the overflow, the bucket at the Next position is split: its contents are rehashed using hi+1(k)h_{i+1}(k), distributing records into the original bucket and a newly allocated one, after which Next increments by 1. This process continues per insertion as needed, with one split per overflow event, maintaining a near-constant load factor (up to 90% for file-oriented implementations). When Next reaches the end of the current range (N×2iN \times 2^i), the level ii increments, Next resets to 0, and a new round of splits begins, effectively doubling the address space over time without batch reprocessing. Overflow handling relies on chaining pages, but the round-robin splitting prevents long chains by proactively redistributing load. Example: Step-by-Step Insertion with Split in Linear Hashing
Consider an initial table with N = 2 buckets (indices 0 and 1), level i = 0, Next = 0, and h(k) = k for simplicity (actual implementations use a strong hash function). Assume each bucket holds one record before overflowing.
  1. Insert 10: h_0(10) = 10 mod 2 = 0; bucket 0 is empty, so insert into bucket 0 (now [10]).
  2. Insert 12: h_0(12) = 12 mod 2 = 0; bucket 0 is full, so chain an overflow page to bucket 0 (now [10, 12] via overflow). This triggers a split at Next = 0.
  3. Split bucket 0 using h_1(k) = k mod 4 and rehash its contents: 10 mod 4 = 2, so 10 moves to the newly allocated bucket 2; 12 mod 4 = 0, so 12 stays in bucket 0. Post-split: bucket 0 = [12], bucket 1 unchanged, new bucket 2 = [10], the overflow page is freed, and Next = 1.
  4. Subsequent insertions proceed similarly; a search for key k computes h_0(k) and, if the resulting index lies below Next (already split), rehashes with h_1(k) to reach the correct image. This illustrates one split per overflow, spreading the reorganization cost over insertions; a minimal code sketch follows.
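The following Python sketch mirrors the example above under its simplifying assumptions (integer keys, h(k) = k, one record per bucket before overflow); the class and method names are illustrative rather than drawn from a reference implementation.

class LinearHashTable:
    def __init__(self, initial_buckets=2, bucket_capacity=1):
        self.N = initial_buckets          # initial number of buckets
        self.level = 0                    # current round i
        self.next = 0                     # split pointer
        self.capacity = bucket_capacity   # records per bucket before overflow
        self.buckets = [[] for _ in range(self.N)]

    def _address(self, key):
        # h_i(k) = k mod (2^i * N); buckets before the split pointer use h_{i+1}
        idx = key % (self.N * 2 ** self.level)
        if idx < self.next:
            idx = key % (self.N * 2 ** (self.level + 1))
        return idx

    def insert(self, key):
        idx = self._address(key)
        self.buckets[idx].append(key)     # overflow modeled by letting the list grow
        if len(self.buckets[idx]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket at the split pointer (not necessarily the full one).
        self.buckets.append([])
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        new_mod = self.N * 2 ** (self.level + 1)
        for k in old:
            self.buckets[k % new_mod].append(k)
        self.next += 1
        if self.next == self.N * 2 ** self.level:   # round finished
            self.level += 1
            self.next = 0

    def search(self, key):
        return key in self.buckets[self._address(key)]


table = LinearHashTable()
table.insert(10)                          # bucket 0: [10]
table.insert(12)                          # overflow -> split: bucket 0 = [12], bucket 2 = [10]
print(table.buckets)                      # [[12], [], [10]]
print(table.search(10), table.search(12)) # True True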

Extendible Hashing

Extendible hashing, proposed by Ronald Fagin et al. in 1979, uses a directory of pointers to bucket pages, enabling dynamic resizing through selective directory expansion without rehashing the entire dataset. The directory is an array of size 2^d, where d is the global depth, the number of high-order bits of the hash value used to index it. Each bucket has a local depth l ≤ d, indicating how many of those bits uniquely identify the records within it. For insertion, the hash h(k) provides a binary address, and its first d bits index the directory to find the bucket pointer. If space exists, the record is inserted; if the bucket is full, it is split by incrementing its local depth to l + 1, allocating a sibling bucket for the newly significant bit, and redistributing records based on that bit. If l + 1 > d, the directory is doubled (the global depth increments), duplicating pointers so that existing mappings are preserved. Multiple directory entries may point to the same bucket when local depths are lower than the global depth, which also allows coalescing during contraction. This approach guarantees at most two disk accesses per lookup and confines rehashing to the split bucket's records, making it suitable for disk-based systems.
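A compact Python sketch of the directory mechanics may help. It is a minimal illustration under assumed parameters: a bucket capacity of two, Python's built-in hash(), and the low-order bits of the hash as the directory index (the description above uses the high-order-bit convention; either works if applied consistently).

class Bucket:
    def __init__(self, local_depth, capacity=2):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}

class ExtendibleHashTable:
    def __init__(self, capacity=2):
        self.global_depth = 1
        self.capacity = capacity
        # Directory indexed by the low global_depth bits of the hash.
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, value):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)            # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory, duplicating every pointer.
            self.directory += self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, bucket.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redistribute records on the newly significant bit.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v
        # Repoint the directory slots whose new bit is set to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling

    def search(self, key):
        return self.directory[self._index(key)].items.get(key)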

Consistent Hashing

Consistent hashing, developed by David Karger et al. in 1997, addresses incremental resizing in distributed hash tables by minimizing key migrations when nodes are added or removed. Keys and nodes are mapped via independent hash functions to points on a fixed circle (e.g., the unit interval [0, 1) or the integers modulo 2^32), and each key is assigned to the nearest succeeding node in clockwise order. To improve load balance, each physical node is represented by multiple virtual nodes (typically O(log n) copies, where n is the number of nodes), hashed separately to create more evenly spaced points. Upon adding or removing a node, only the keys in the affected arcs (between the new or departed virtual nodes and their predecessors) are reassigned, so an expected fraction of roughly 1/n of the keys moves. This yields smooth load distribution, with the maximum load deviating from the ideal by at most a logarithmic factor with high probability, and with the spread (the number of nodes that hold copies of a given key across different client views) remaining low. Deletions follow symmetrically, with keys reassigned to their successors. These incremental methods offer lower latency during resizes for large-scale tables than batch rehashing, as costs are amortized over operations, but they introduce additional structural overhead from directories, split pointers, levels, and virtual-node mappings. They are especially suitable for database indexing and distributed caching, where predictable latency is critical.
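The following is a minimal Python sketch of a consistent-hash ring with virtual nodes, assuming MD5-derived positions on a 2^32 ring and illustrative node names; real systems add replication and membership management on top of this core.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []        # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(label):
        digest = hashlib.md5(label.encode()).digest()
        return int.from_bytes(digest[:4], "big")    # point on a 2^32 ring

    def add_node(self, node):
        # Each physical node contributes vnodes evenly spread virtual points.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("empty ring")
        pos = self._position(key)
        i = bisect.bisect_left(self._ring, (pos,))  # nearest succeeding point
        return self._ring[i % len(self._ring)][1]   # wrap around the ring


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))
ring.add_node("node-d")          # only keys on the affected arcs move
print(ring.get_node("user:42"))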

Performance Analysis

Time Complexity

The time complexity of hash table operations is analyzed under the simple uniform hashing assumption, where each key is equally likely to hash to any slot independently of other keys. For both separate chaining and open addressing, the average-case time complexity of insertion, search, and deletion is O(1) when the load factor α = n/m (with n keys and table size m) is held constant. On modern CPUs these O(1) operations have practical latencies dominated by cache access times, with L1 hits around a nanosecond and higher cache levels a few nanoseconds, so cache-resident lookups with few collisions complete in a handful of nanoseconds.

In separate chaining, the expected length of each chain is α, giving expected Θ(1 + α) time for an unsuccessful search and roughly Θ(1 + α/2) for a successful one; with constant α, both are O(1). For open addressing under uniform hashing, the expected number of probes is at most 1/(1 − α) for an unsuccessful search and (1/α) ln(1/(1 − α)) for a successful one; with linear probing, clustering raises these figures to approximately ½(1 + 1/(1 − α)²) and ½(1 + 1/(1 − α)) respectively. In every case a constant load factor yields O(1) average time.

Amortized analysis accounts for resizing, which doubles the table size when α exceeds a threshold (commonly between 0.5 and 1.0) to keep the load factor bounded. Over a sequence of n insertions starting from an empty table, the total time is O(n) because each key is moved a constant expected number of times during resizes, yielding amortized O(1) time per operation by the aggregate method. In the worst case, all keys may hash to the same slot, resulting in O(n) time per operation due to linear scanning of the chain or probe sequence; this is exacerbated by poor hash functions or by adversarial inputs that defeat uniformity. In 2025, researchers showed that open-addressing tables filled without reordering elements during insertion can achieve substantially better expected probe bounds than classical analyses suggested, under certain conditions.

A key space-time trade-off arises from the load factor: a lower α (achieved by increasing m) reduces the expected number of probes and thus average time, but at the cost of higher space usage, since the table holds more empty slots. Certain variants improve worst-case guarantees. Replacing the linked lists of separate chaining with balanced binary search trees yields O(log n) worst-case time for operations, though at higher constant factors. Robin Hood hashing, an open-addressing technique, balances probe lengths by displacing keys that sit closer to their home slot during insertion, reducing the variance and maximum of probe-sequence lengths and improving practical worst-case behavior without changing the asymptotic bounds.
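As a quick illustration of how sharply probe counts grow with the load factor, the following Python snippet tabulates the expected-probe formulas quoted above for uniform hashing and for linear probing.

import math

def uniform_unsuccessful(a):
    return 1 / (1 - a)                      # uniform hashing, miss

def uniform_successful(a):
    return (1 / a) * math.log(1 / (1 - a))  # uniform hashing, hit

def linear_unsuccessful(a):
    return 0.5 * (1 + 1 / (1 - a) ** 2)     # linear probing, miss

def linear_successful(a):
    return 0.5 * (1 + 1 / (1 - a))          # linear probing, hit

print(f"{'alpha':>6} {'unif-miss':>10} {'unif-hit':>9} {'lin-miss':>9} {'lin-hit':>8}")
for a in (0.25, 0.5, 0.75, 0.9):
    print(f"{a:>6.2f} {uniform_unsuccessful(a):>10.2f} {uniform_successful(a):>9.2f} "
          f"{linear_unsuccessful(a):>9.2f} {linear_successful(a):>8.2f}")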

Space Efficiency and Trade-offs

Hash tables achieve linear space in the number of elements n, denoted Θ(n), but incur overhead that raises the constant factor above 1. The overhead arises from the allocation of m buckets or slots, where typically m ≥ n, plus storage for keys and values; each bucket requires space for a pointer or a fixed-size entry, so the table structure alone costs m times the size of a pointer or entry. In separate chaining, collisions are resolved by linking elements into lists, adding one pointer per element beyond the initial bucket pointers, which can increase space usage by roughly 50% or more depending on the load factor, since each chain node needs extra memory for its link. Open addressing, by contrast, avoids these per-element pointers by probing within the array, but it wastes slots due to clustering and needs a larger m to maintain performance, hence more overall space for the same n. The pointer overhead in chaining precludes full space efficiency, while open addressing trades density for simplicity but still leaves allocated space underutilized at high loads.

The load factor α = n/m critically influences space usage. For open addressing, values of roughly 0.5 to 0.7 are commonly recommended to balance probe lengths against clustering, giving a space multiplier of 1/α (about 1.4 to 2 times the minimum of n slots). For separate chaining, higher loads of 0.7 to 0.8 or beyond are feasible with little space penalty beyond the chain pointers, since chains grow without wasting slots, though performance degrades if α significantly exceeds 1. These ranges keep space efficiency reasonable without disproportionate time costs from collisions.

Compression techniques further mitigate overhead in space-constrained settings. Power-of-two table sizes enable fast index computation via bit masking (for m = 2^r, h(k) mod m = h(k) & (m − 1)), avoiding costly division while still allowing efficient resizing by factors of 2. Bit-packing of keys and fingerprints in compact hashing schemes reduces per-entry size by storing multiple elements per machine word or by using succinct representations, approaching near-optimal space (close to n times the key size) while supporting constant expected-time access. In distributed systems, replication introduces a further space trade-off for availability and load balancing: virtual nodes or explicit replicas place multiple copies of data across nodes to handle failures or hotspots, increasing total storage by a factor equal to the replication degree (often 3 or more) at the cost of per-node efficiency. This minimizes remapping on node changes but raises overall space usage compared to non-replicated single-node hash tables.

Compared to balanced binary search trees, which also require Θ(n) space for per-node pointers and keys, hash tables carry a constant factor driven by load-factor underutilization and collision structures, typically 1.5 to 2 times the raw element storage versus roughly 3 times for trees (two pointers per node plus key and value). However, hash tables provide average O(1) time without the O(log n) scaling of tree-based structures, making them preferable when average-case performance and density are prioritized over worst-case guarantees.
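The power-of-two masking trick mentioned above can be shown in a few lines of Python; the keys and table size here are arbitrary illustrative values.

m = 1 << 4                     # m = 16, a power of two
mask = m - 1                   # 0b1111

for key in (17, 42, 1337):
    h = hash(key)
    assert h % m == h & mask   # identical index, no division needed
    print(key, "->", h & mask)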

Applications

Core Data Structures

Hash tables provide an efficient implementation for associative arrays, also known as maps or dictionaries, which store pairs of unique keys and associated values. A hash function maps each key to an array index, allowing average constant-time access, insertion, and deletion by storing or retrieving values directly from the computed slot, subject to collision resolution. Hash sets, a keys-only variant of maps, leverage hash tables by storing elements as keys with implicit or dummy values, enforcing uniqueness since inserting an identical key overwrites rather than duplicates. This structure supports efficient membership testing in average O(1) time, while operations such as union and intersection can be computed by hashing elements from one set into a temporary table and checking overlaps with the other. For ordered variants, hash tables can be combined with balanced search trees, such as treaps, either by using trees within overflow buckets or by augmenting the overall structure to maintain sorted order alongside hashed indexing, enabling both fast lookups and ordered traversal. Unordered collections like Python's built-in dict and set rely on hash tables for their core implementation, where keys are hashed to enable rapid access; set does not preserve insertion order, while dict has preserved insertion order since Python 3.7. A key limitation of hash-table-based structures is the absence of inherent ordering, which requires separate mechanisms for sorted access, and the prohibition of duplicate keys, which the mapping enforces by design.
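A brief Python example of the hash-backed built-ins discussed above, using dict as an associative array and set for membership testing; the data is illustrative.

word_counts = {}                       # dict: hash table mapping keys to values
for word in "the quick brown fox the lazy dog the".split():
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts["the"])              # 3, found via one hashed lookup on average

seen = set()                           # set: keys only, duplicates collapse
for word in ("fox", "dog", "fox"):
    seen.add(word)
print("fox" in seen, len(seen))        # True 2
print(list(word_counts))               # dict preserves insertion order (3.7+)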

Specialized Uses

Hash tables find specialized application in database indexing, where they support rapid equality-based queries. Hash indexes map keys to data locations using a hash function, enabling average O(1) lookup times for exact matches, which is particularly advantageous for point queries in relational databases. In contrast to ordered structures like B-trees, which support range scans and inequality operations through their sorted leaf nodes, hash indexes do not preserve order and perform poorly on range queries, making B-trees the standard for versatile indexing needs. Extendible hashing addresses the limitations of static hashing by dynamically adjusting bucket allocation via a directory that points to data pages based on hash prefixes, allowing the structure to grow or shrink without full reorganization as the database expands. This technique, introduced by Fagin, Nievergelt, Pippenger, and Strong in 1979, is employed in filesystems and databases to maintain performance under varying loads, with directory growth occurring only when overflows demand deeper prefix resolution.

In caching mechanisms, hash tables provide the foundation for constant-time key access, often combined with other structures for eviction policies. Least Recently Used (LRU) caches pair a hash table for O(1) lookups with a doubly linked list that tracks usage order: accessed items move to the front and the tail item is evicted when capacity is exceeded, achieving amortized O(1) operations overall (illustrated in the sketch below). Bloom filters extend hash-based sets probabilistically, using multiple independent hash functions to set bits in a bit array; they offer space savings over exact hash sets by tolerating a tunable false-positive rate (typically 1–10%) while guaranteeing no false negatives, making them well suited to tasks such as duplicate detection in large streams where exact storage is impractical.

Transposition tables in game artificial intelligence, such as chess engines, use hash tables to cache search results and avoid recomputing equivalent board states. These tables store evaluation scores, depths, and best moves keyed by position hashes, with collisions resolved by depth or exact-match checks to prune search trees. Zobrist hashing, a seminal method for generating these keys, assigns random 64-bit values to each piece-type-square combination and XORs them across the board, producing a near-unique signature with very low collision probability (on the order of 2^-64 per pair of positions) that updates incrementally with moves. Developed by Albert Zobrist in 1970, this approach significantly accelerates alpha-beta search by detecting transpositions—positions reached via different move sequences.

Beyond these, hash tables underpin symbol tables in compilers, mapping identifiers to attributes such as types, scopes, and memory offsets for efficient semantic analysis and code generation across compilation phases. In storage systems, they facilitate data deduplication by computing cryptographic hashes (e.g., SHA-256) of file chunks as fingerprints and storing only one copy per unique hash, eliminating redundancy and achieving space reductions reported as high as roughly 95% in highly redundant scenarios such as backups.
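As an illustration of the LRU design described above, the following Python sketch pairs a hash table with a doubly linked list by way of collections.OrderedDict, which bundles exactly those two structures; the capacity and keys are illustrative.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()      # hash table + doubly linked list

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)     # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
print(cache.get("a"))   # 1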
Cryptographically, rainbow tables precompute chains of hash-and-reduction steps to invert one-way functions such as password hashes, trading large precomputed storage for much faster online attacks than per-guess brute force; their effectiveness is severely limited by salting and peppering in modern systems, so today they are primarily of historical and educational interest for demonstrating why unsalted hashes are vulnerable. In contemporary large-scale data processing, distributed query engines employ hash joins to merge datasets efficiently, building in-memory hash tables on the smaller relation's partitions for probing by the larger one, with broadcast variants distributing a compact table cluster-wide to minimize shuffling. Distributed hash tables (DHTs), exemplified by Chord, scale hash table semantics across networks by mapping keys to a circular identifier space via consistent hashing, where nodes maintain finger tables for O(log N) lookups and automatic load balancing under churn. Introduced by Stoica et al. in 2001, Chord underpins decentralized storage in systems such as file-sharing overlays, maintaining resilience through periodic stabilization after node joins or failures.

Implementations

Pseudocode Examples

Hash table implementations typically involve an array of fixed size m, a hash function h(k) mapping keys to indices 0 to m − 1, and maintenance of the load factor α = n/m (where n is the number of elements) below a threshold (e.g., 0.75 for separate chaining, 0.5 for open addressing) to ensure efficiency. Core operations include:
  • Insert(key, value): Compute index = h(key) mod m; if the key exists, update its value; otherwise add a new entry and increment n; if α exceeds the threshold, resize.
  • Search(key): Compute index = h(key) mod m; probe the bucket or slot sequence until the key is found or the end of the chain or probe sequence is reached.
  • Delete(key): Compute index = h(key) mod m; locate and remove the entry and decrement n; for open addressing, mark the slot as deleted (a tombstone) to preserve probe chains.
Detailed pseudocode for the specific methods follows in the subsections below.

Chaining Implementation

A basic hash table using separate chaining resolves collisions by storing multiple elements in a linked list at each bucket. The structure consists of an array of lists, initialized with a number of buckets m, typically a power of 2 for efficient resizing. The hash function is denoted h(key) and maps a key to an index between 0 and m − 1. Operations maintain the load factor α = n/m, where n is the number of elements, and resize when α exceeds a threshold such as 0.75 to ensure amortized constant-time performance. The following pseudocode outlines the core class and operations for chaining, including an insertion routine that checks whether a resize is needed.

class HashTable
    list[] table              // array of m lists
    integer m                 // number of buckets
    integer n                 // number of elements
    real load_threshold = 0.75

    function HashTable(integer initial_m)
        m = initial_m
        n = 0
        table = new list[m]   // each entry is an empty list
        for i = 0 to m - 1
            table[i] = empty list
        end for
    end function

    function integer hash(key)
        return h(key) mod m   // stub for the hash function
    end function

    function insert(key, value)
        if n / m > load_threshold
            resize(2 * m)
        end if
        index = hash(key)
        // Check if the key exists and update its value
        for each entry in table[index]
            if entry.key == key
                entry.value = value
                return
            end if
        end for
        // Insert a new entry at the head of the list
        table[index].insert_front(new entry(key, value))
        n = n + 1
    end function

    function search(key)
        index = hash(key)
        for each entry in table[index]
            if entry.key == key
                return entry.value
            end if
        end for
        return null           // not found
    end function

    function delete(key)
        index = hash(key)
        for each entry in table[index]
            if entry.key == key
                remove entry from table[index]
                n = n - 1
                return
            end if
        end for
        // Key not found
    end function

    function resize(integer new_m)
        old_table = table
        old_m = m
        m = new_m
        n = 0
        table = new list[m]
        for i = 0 to m - 1
            table[i] = empty list
        end for
        for i = 0 to old_m - 1
            for each entry in old_table[i]
                insert(entry.key, entry.value)   // reinsert using the new hash
            end for
        end for
    end function


For an empty table, search returns null immediately since all lists are empty. When the load factor exceeds the threshold (e.g., α > 0.75), insertion triggers a resize that doubles m, allocates a new array of lists, and reinserts all elements, which amortizes to O(1) per operation over a sequence of insertions.

Open Addressing Implementation

Open addressing stores elements directly in the array without auxiliary lists, using probing sequences to resolve collisions. Linear probing, a common method, computes the next position as (h(key) + i) mod m, where i is the probe number. Deletions use tombstones (slots marked as "deleted") to avoid breaking search chains: searches probe past them, while insertions may reuse them as available slots. The table resizes similarly when the load factor exceeds a threshold, typically a lower one (e.g., 0.5) to mitigate clustering. The pseudocode below shows the operations for linear probing.

class OpenHashTable
    entry[] table             // array of m entries: null, a key-value pair, or deleted
    integer m                 // number of buckets
    integer n                 // number of elements
    real load_threshold = 0.5

    function OpenHashTable(integer initial_m)
        m = initial_m
        n = 0
        table = new entry[m]  // all null initially
    end function

    function integer hash(key, integer i)
        return (h(key) + i) mod m   // linear probing stub
    end function

    function insert(key, value)
        if n / m > load_threshold
            resize(2 * m)
        end if
        i = 0
        first_deleted = -1    // remember the first tombstone for reuse
        repeat
            j = hash(key, i)
            if table[j] == null
                if first_deleted != -1
                    j = first_deleted        // reuse the earliest tombstone
                end if
                table[j] = new entry(key, value)
                n = n + 1
                return
            end if
            if table[j] == deleted
                if first_deleted == -1
                    first_deleted = j
                end if
            else if table[j].key == key
                table[j].value = value       // update existing key
                return
            end if
            i = i + 1
        until i == m
        if first_deleted != -1
            table[first_deleted] = new entry(key, value)
            n = n + 1
            return
        end if
        error "table overflow"
    end function

    function search(key)
        i = 0
        repeat
            j = hash(key, i)
            if table[j] == null
                return null   // not found, end of probe chain
            end if
            if table[j] != deleted and table[j].key == key
                return table[j].value
            end if
            i = i + 1
        until i == m
        return null
    end function

    function delete(key)
        i = 0
        repeat
            j = hash(key, i)
            if table[j] == null
                return        // not found
            end if
            if table[j] != deleted and table[j].key == key
                table[j] = deleted   // tombstone
                n = n - 1
                return
            end if
            i = i + 1
        until i == m
    end function

    function resize(integer new_m)
        old_table = table
        m = new_m
        n = 0
        table = new entry[m]  // all null
        for each old_entry in old_table
            if old_entry != null and old_entry != deleted
                insert(old_entry.key, old_entry.value)   // reinsert
            end if
        end for
    end function


In an empty table, search hits null on the first probe and returns immediately. When the load threshold is exceeded, insertion resizes by doubling the table and reinserting the non-deleted entries, which also clears accumulated tombstones. Deletion with tombstones ensures that searches probe past the marked slots while insertions can reuse them, preventing false negatives in searches.

Language-Specific Considerations

In Java, the HashMap class implements a hash table using separate chaining for collision resolution, where each bucket holds a linked list of entries. The default load factor is 0.75, balancing time and space efficiency by triggering a resize when the table reaches 75% of capacity. To mitigate performance degradation from long chains, HashMap converts a bucket's list to a balanced binary search tree (a red-black tree) when the chain exceeds a threshold of 8 elements, a feature introduced in Java 8 to improve worst-case lookup times. For concurrent environments, ConcurrentHashMap extends this design with fine-grained locking, supporting full concurrency for reads and a configurable level of concurrency for updates without blocking retrievals.

Python's built-in dict type (in the CPython implementation) employs open addressing with a perturbed probing strategy, implemented in C for efficiency. The probing sequence uses the recurrence index = ((5 * index) + 1 + perturb) % size, where perturb is derived from the key's hash and right-shifted by 5 bits on each iteration to add randomness and reduce clustering. Hash randomization, enabled by default since Python 3.3, seeds the hashing of strings and bytes with a random value chosen at process startup, making hash values unpredictable across runs and thwarting hash-flooding denial-of-service attacks. The set type is implemented as a similar open-addressed hash table that stores only keys, which allows efficient membership testing and set operations.

In C++, the standard std::unordered_map uses separate chaining, maintaining per-bucket lists indexed by a user-supplied or default hash function. It supports customizable hash policies via the Hash template parameter and equality predicates via KeyEqual, enabling adaptation to user-defined key types. The reserve method preallocates buckets to accommodate a specified number of elements without exceeding the maximum load factor, avoiding intermediate rehashes during insertion and improving performance when the final size is known.

Implementing hash tables in these languages often involves challenges such as providing custom hashers and equality predicates for non-standard key types, particularly in C++, where the standard requires the hash function to be consistent with equality. Move semantics, introduced in C++11, add further considerations, since containers like std::unordered_map should transfer ownership efficiently rather than copy elements during resizing and insertion. Specialized libraries offer alternative strategies: Google's dense_hash_map from the Sparsehash library uses open addressing with quadratic probing, optimizing for dense storage and fast lookups at the cost of higher space usage under low loads, while Abseil's "Swiss table" containers employ open addressing with SIMD-accelerated metadata lookups, offering substantially better throughput than std::unordered_map in many workloads.

Best practices include seeding hash functions with random values to defend against collision-based attacks such as hash flooding, where adversaries craft inputs that degrade performance to O(n), and monitoring load factors (keeping them below roughly 0.75, or reserving capacity in advance) to prevent excessive collisions and rehashing overhead.
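To make the custom-hasher concern concrete in Python terms, the sketch below defines a hypothetical user-defined key type whose __hash__ and __eq__ are generated consistently (equal objects must hash equally); the same consistency rule applies to C++'s Hash/KeyEqual parameters and Java's hashCode()/equals().

from dataclasses import dataclass

@dataclass(frozen=True)          # frozen dataclasses derive __eq__ and __hash__ from the fields
class GridPoint:
    x: int
    y: int

index = {GridPoint(1, 2): "treasure", GridPoint(3, 4): "trap"}
print(index[GridPoint(1, 2)])    # "treasure": equal fields -> equal hash -> same bucket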
