Hash table
| Hash table | |
|---|---|
| Type | Unordered associative array |
| Invented | 1953 |

In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values.[3] A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a hash map.
Most hash table designs employ an imperfect hash function. Hash collisions, where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way. Common strategies to handle hash collisions include chaining, which stores multiple elements in the same slot using linked lists, and open addressing, which searches for the next available slot according to a probing sequence.[4]
In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.[5][4]: 513–558 [6]
Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.[7]: 458
In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. Owing to their fast average-case performance, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.[8] Many programming languages provide built-in hash table structures, such as Python's dictionaries, Java's HashMap, and C++'s unordered_map, which abstract the complexity of hashing from the programmer.[9]
History
The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. The first example of open addressing was proposed by A. D. Linh, building on Luhn's memorandum.[4]: 547 Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler.[10]: 124 Open addressing with linear probing is credited to Amdahl, although Andrey Ershov independently had the same idea.[10]: 124–125 The term "open addressing" was coined by W. Wesley Peterson in his article which discusses the problem of search in large files.[11]: 15
The first published work on hashing with chaining is credited to Arnold Dumey, who discussed the idea of using remainder modulo a prime as a hash function.[11]: 15 The word "hashing" was first published in an article by Robert Morris.[10]: 126 A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.[11]: 15
Overview
An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array of length m is partially filled with n elements, where n ≤ m. A key x is hashed using a hash function h to compute an index location h(x) in the hash table, where h(x) < m. The efficiency of a hash table depends on the load factor, defined as the ratio of the number of stored elements to the number of available slots, with lower load factors generally yielding faster operations.[12] At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees.[11]: 1
Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.[11]: 1
Load factor
The load factor α is a critical statistic of a hash table, and is defined as follows:[2]
load factor (α) = n / m, where
- n is the number of entries occupied in the hash table.
- m is the number of buckets.
The performance of the hash table deteriorates in relation to the load factor α.[11]: 2 In the limit of large n and m, the number of entries in each bucket statistically follows a Poisson distribution with expectation α for an ideally random hash function.
The software typically ensures that the load factor α remains below a certain constant, α_max. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor α reaches α_max. Similarly the table may also be resized if the load factor drops below α_max/4.[13]
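To make the thresholds concrete, the following minimal Python sketch shows how an implementation might decide when to grow or shrink; the constants 0.75 and 0.75/4 and the minimum bucket count are illustrative assumptions, not values prescribed by any particular library.

MAX_LOAD = 0.75            # assumed alpha_max for this sketch
MIN_LOAD = MAX_LOAD / 4    # assumed shrink threshold

def needs_grow(num_entries, num_buckets):
    # Grow (typically by doubling) once the load factor reaches the ceiling.
    return num_entries / num_buckets >= MAX_LOAD

def needs_shrink(num_entries, num_buckets):
    # Shrink only tables that are both reasonably large and sparsely used.
    return num_buckets > 8 and num_entries / num_buckets <= MIN_LOAD

print(needs_grow(6, 8))     # True: load factor 0.75 hits the ceiling
print(needs_shrink(3, 32))  # True: load factor ~0.09 is below the floor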
Load factor for separate chaining
With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.[14]
Separate chaining hash tables suffer gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed.[13]
With separate chaining, the value of α_max that gives best performance is typically between 1 and 3.[13]
Load factor for open addressing
With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.[14]
The performance of open addressing becomes very bad when the load factor approaches 1.[13]
Therefore a hash table that uses open addressing must be resized or rehashed if the load factor approaches 1.[13]
With open addressing, acceptable figures of max load factor should range around 0.6 to 0.75.[15][16]: 110
Hash function
A hash function h maps the universe U of keys to indices or slots within the table, that is, h(x) ∈ {0, ..., m − 1} for x ∈ U. The conventional implementations of hash functions are based on the integer universe assumption that all elements of the table stem from the universe U = {0, ..., u − 1}, where the bit length of u is confined within the word size of a computer architecture.[11]: 2
A hash function h is said to be perfect for a given set S if it is injective on S, that is, if each element x ∈ S maps to a different value in {0, ..., m − 1}.[17][18] A perfect hash function can be created if all the keys are known ahead of time.[17]
Integer universe assumption
The schemes of hashing used under the integer universe assumption include hashing by division, hashing by multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing.[11]: 2 However, hashing by division is the most commonly used scheme.[19]: 264 [16]: 110
Hashing by division
The scheme in hashing by division is as follows:[11]: 2
h(x) = x mod m,
where h(x) is the hash value of x and m is the size of the table.
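A minimal Python sketch of hashing by division; the table size m = 11 is an arbitrary illustrative choice (a prime not close to a power of two is a common recommendation).

def hash_by_division(x, m=11):
    # h(x) = x mod m
    return x % m

print(hash_by_division(123456))  # 3, an index in the range 0..10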
Hashing by multiplication
The scheme in hashing by multiplication is as follows:[11]: 2–3
h(x) = ⌊m ((x A) mod 1)⌋,
where A is a non-integer real-valued constant and m is the size of the table. An advantage of hashing by multiplication is that the value of m is not critical.[11]: 2–3 Although any value of A produces a hash function, Donald Knuth suggests using the golden ratio.[11]: 3
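A corresponding sketch of hashing by multiplication in Python, using the golden-ratio-based constant that Knuth suggests; the table size m = 1024 is an illustrative assumption.

def hash_by_multiplication(x, m=1024):
    # h(x) = floor(m * ((x * A) mod 1)) with A = (sqrt(5) - 1) / 2
    A = (5 ** 0.5 - 1) / 2
    return int(m * ((x * A) % 1))

print(hash_by_multiplication(123456))  # an index in the range 0..1023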
String hashing
Commonly a string is used as a key to the hash function. Stroustrup[20] describes a simple hash function in which an unsigned integer that is initially zero is repeatedly left shifted one bit and then xor'ed with the integer value of the next character. This hash value is then taken modulo the table size. If the left shift is not circular, then the string length should be at least eight bits less than the size of the unsigned integer in bits. Another common way to hash a string to an integer is with a polynomial rolling hash function.
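The two string-hashing schemes just described can be sketched in Python as follows; the 32-bit mask, the base 31 and the Mersenne-prime modulus are illustrative assumptions rather than fixed choices.

def shift_xor_hash(s, table_size):
    # Shift-and-XOR scheme: left-shift the accumulator by one bit and
    # XOR in the next character code, then reduce modulo the table size.
    h = 0
    for ch in s:
        h = ((h << 1) ^ ord(ch)) & 0xFFFFFFFF   # keep the value within 32 bits
    return h % table_size

def polynomial_rolling_hash(s, base=31, mod=2**61 - 1):
    # Polynomial rolling hash: treat the string as digits of a base-`base`
    # number reduced modulo a large prime.
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

print(shift_xor_hash("hash table", 101))
print(polynomial_rolling_hash("hash table") % 101)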
Choosing a hash function
Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[21][22]
The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[23]
For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.[23][4]
K-independent hashing offers a way to prove a certain hash function does not have bad keysets for a given type of hash table. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.[24]
Collision resolution
A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for arbitrary, unseen data.[4]: 515 Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.[7]: 458
Separate chaining

In separate chaining, a linked list of key–value pairs is built for each array index. The collided items are chained together through a single linked list, which can be traversed to access the item with a unique search key.[7]: 464 Collision resolution through chaining with a linked list is a common method of implementation of hash tables. Let T and x be the hash table and the node to be operated on respectively; the operations are as follows:[19]: 258
Chained-Hash-Insert(T, x)
    insert x at the head of linked list T[h(x.key)]
Chained-Hash-Search(T, k)
    search for an element with key k in linked list T[h(k)]
Chained-Hash-Delete(T, x)
    delete x from the linked list T[h(x.key)]
If the elements are comparable either numerically or lexically and are inserted into the list while maintaining the total order, unsuccessful searches terminate faster.[4]: 520–521
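A minimal separate-chaining table in Python corresponding to the pseudocode above; Python lists stand in for the linked lists, the fixed bucket count of 8 is an illustrative assumption, and a practical implementation would also resize as the load factor grows.

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                      # unsuccessful search

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

t = ChainedHashTable()
t.insert("apple", 1)
t.insert("orange", 2)
print(t.search("apple"))   # 1
print(t.delete("banana"))  # False: the key was never inserted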
Other data structures for separate chaining
If the keys are ordered, it could be efficient to use "self-organizing" concepts such as using a self-balancing binary search tree, through which the theoretical worst case could be brought down to O(log n), although it introduces additional complexities.[4]: 521
In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to guaranteed O(1) in the worst case. In this technique, the buckets of k entries are organized as perfect hash tables with k² slots, providing constant worst-case lookup time and low amortized time for insertion.[25] A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.[26]: 99
Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.[27]
Caching and locality of reference
The linked lists of a separate chaining implementation may not be cache-conscious due to poor spatial locality (locality of reference) when the nodes of the linked list are scattered across memory; thus the list traversal during insert and search may entail CPU cache inefficiencies.[26]: 91
In cache-conscious variants of collision resolution through separate chaining, a dynamic array found to be more cache-friendly is used in the place where a linked list or self-balancing binary search tree is usually deployed, since the contiguous allocation pattern of the array could be exploited by hardware-cache prefetchers—such as the translation lookaside buffer—resulting in reduced access time and memory consumption.[28][29][30]
Open addressing

Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.[31]
Well-known probe sequences include:
- Linear probing, in which the interval between probes is fixed (usually 1).[32]
- Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.[33]: 272
- Double hashing, in which the interval between probes is computed by a secondary hash function.[33]: 272–273
The performance of open addressing may be slower compared to separate chaining since the probe sequences lengthen as the load factor approaches 1.[13][26]: 93 Probing results in an infinite loop if the load factor reaches 1, in the case of a completely filled table.[7]: 471 The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since formation of clusters would result in increased search time.[7]: 472
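A sketch of open addressing with linear probing in Python; the table size of 16 is an illustrative assumption, deletion is omitted, and a practical table would resize well before the load factor approaches 1.

class LinearProbingTable:
    _EMPTY = object()   # sentinel distinct from any stored entry

    def __init__(self, size=16):
        self.slots = [self._EMPTY] * size

    def insert(self, key, value):
        size = len(self.slots)
        start = hash(key) % size
        for i in range(size):               # probe sequence: start, start+1, ...
            j = (start + i) % size
            slot = self.slots[j]
            if slot is self._EMPTY or slot[0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("table full; a real implementation resizes earlier")

    def search(self, key):
        size = len(self.slots)
        start = hash(key) % size
        for i in range(size):
            j = (start + i) % size
            slot = self.slots[j]
            if slot is self._EMPTY:         # empty slot ends the probe: not found
                return None
            if slot[0] == key:
                return slot[1]
        return None

t = LinearProbingTable()
t.insert("a", 1)
t.insert("b", 2)
print(t.search("a"))  # 1
print(t.search("z"))  # None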
Caching and locality of reference
Since the slots are located in successive locations, linear probing could lead to better utilization of the CPU cache due to locality of reference, resulting in reduced memory latency.[32]
Other collision resolution techniques based on open addressing
Coalesced hashing
Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table.[34]: 6–8 The algorithm is ideally suited for fixed memory allocation.[34]: 4 The collision in coalesced hashing is resolved by identifying the largest-indexed empty slot on the hash table, then the colliding value is inserted into that slot. The bucket is also linked to the inserted node's slot which contains its colliding hash address.[34]: 8
Cuckoo hashing
Cuckoo hashing is a form of open addressing collision resolution technique which guarantees O(1) worst-case lookup complexity and constant amortized time for insertions. The collision is resolved through maintaining two hash tables, each having its own hashing function; the collided slot is replaced with the given item, and the preoccupied element of the slot gets displaced into the other hash table. The process continues until every key has its own spot in the empty buckets of the tables; if the procedure enters into an infinite loop—which is identified through maintaining a threshold loop counter—both hash tables get rehashed with newer hash functions and the procedure continues.[35]: 124–125
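A simplified cuckoo-hashing sketch in Python with two tables; the table size, the two derived hash positions and the displacement threshold are illustrative assumptions, and a complete implementation would rehash with new functions once the threshold is reached.

class CuckooHashTable:
    def __init__(self, size=16, max_displacements=32):
        self.size = size
        self.max_displacements = max_displacements
        self.tables = [[None] * size, [None] * size]

    def _positions(self, key):
        # Two positions derived from one hash value; an illustrative choice only.
        h = hash(key)
        return h % self.size, (h // self.size) % self.size

    def search(self, key):
        for table, idx in zip(self.tables, self._positions(key)):
            entry = table[idx]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.max_displacements):
            for which in (0, 1):
                idx = self._positions(entry[0])[which]
                table = self.tables[which]
                occupant = table[idx]
                if occupant is None or occupant[0] == entry[0]:
                    table[idx] = entry
                    return
                table[idx], entry = entry, occupant   # displace the occupant
        raise RuntimeError("displacement threshold reached; rehash with new functions")

t = CuckooHashTable()
t.insert("x", 10)
t.insert("y", 20)
print(t.search("x"))  # 10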
Hopscotch hashing
Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a neighbourhood of buckets—the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.[36]: 351–352 The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, thus well suited for implementing a resizable concurrent hash table.[36]: 350 The neighbourhood characteristic of hopscotch hashing guarantees the property that the cost of finding the desired item from any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to place an item into its neighbourhood—with a possible cost involved in displacing other items.[36]: 352
Each bucket within the hash table includes an additional "hop-information"—an H-bit bit array indicating the relative distance of the item which was originally hashed into the current virtual bucket within H − 1 entries.[36]: 352 Let k be the key to be inserted and Bk the bucket to which the key is hashed; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is preserved:[36]: 352–353 if Bk is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1; if not empty, linear probing is used to find an empty slot in the table, and the bitmap of the bucket gets updated followed by the insertion; if the empty slot is not within the range of the neighbourhood, i.e. H − 1, subsequent swap and hop-info bit array manipulation of each bucket is performed in accordance with its neighbourhood invariant properties.[36]: 353
Robin Hood hashing
Robin Hood hashing is an open addressing based collision resolution algorithm; the collisions are resolved through favouring the displacement of the element that is farthest—or has the longest probe sequence length (PSL)—from its "home location", i.e. the bucket to which the item was hashed.[37]: 12 Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets,[38]: 2 i.e. dealing with cluster formation in the hash table.[39] Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.[40] Let x be the key to be inserted, x.psl be the (incremental) PSL length of x, T be the hash table and j be the index; the insertion procedure is as follows:[37]: 12–13 [41]: 5
- If x.psl ≤ T[j].psl: the iteration goes into the next bucket without attempting an external probe.
- If x.psl > T[j].psl: insert the item x into the bucket j; swap x with T[j] and call the displaced element x'; continue the probe from the (j + 1)th bucket to insert x'; repeat the procedure until every element is inserted (see the sketch below).
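A sketch of the Robin Hood insertion rule in Python, storing each entry's probe sequence length (PSL) alongside it; the table size is an illustrative assumption, and deletion and resizing are omitted.

class RobinHoodTable:
    def __init__(self, size=16):
        self.slots = [None] * size          # occupied slots hold (key, value, psl)

    def insert(self, key, value):
        size = len(self.slots)
        j = hash(key) % size
        entry = (key, value, 0)
        for _ in range(size):
            occupant = self.slots[j]
            if occupant is None or occupant[0] == entry[0]:
                self.slots[j] = entry
                return
            if entry[2] > occupant[2]:
                # The newcomer is farther from home than the occupant:
                # take the slot and continue probing with the displaced entry.
                self.slots[j], entry = entry, occupant
            j = (j + 1) % size
            entry = (entry[0], entry[1], entry[2] + 1)
        raise RuntimeError("table full; a real implementation would resize")

t = RobinHoodTable(size=4)
t.insert(1, "a")
t.insert(5, "b")        # 5 collides with 1 under modulo 4 and probes onward
print(t.slots)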
Dynamic resizing
Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are rehashed into the buckets of the new hash table,[13] since the items cannot simply be copied over, as varying table sizes result in different hash values due to the modulo operation.[42] If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.[43]
Resizing by moving all entries
Generally, a new hash table with a size double that of the original hash table gets allocated privately and every item in the original hash table gets moved to the newly allocated one by computing the hash values of the items followed by the insertion operation. Rehashing is simple, but computationally expensive.[44]: 478–479
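A sketch of all-at-once rehashing for a separate-chaining table in Python: a bucket array of twice the size is allocated and every entry is reinserted under the new modulus; the doubling factor is the conventional choice assumed here.

def rehash_all_at_once(buckets):
    # Allocate a table twice as large and reinsert every (key, value) pair.
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for chain in buckets:
        for key, value in chain:
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets

old = [[("a", 1)], [("b", 2), ("c", 3)]]   # a tiny two-bucket table
print(len(rehash_all_at_once(old)))         # 4 buckets after doubling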
Alternatives to all-at-once rehashing
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually to avoid a storage blip—typically at 50% of the new table's size—during rehashing and to avoid memory fragmentation that triggers heap compaction due to deallocation of large memory blocks caused by the old hash table.[45]: 2–3 In such a case, the rehashing operation is done incrementally through extending the prior memory block allocated for the old hash table such that the buckets of the hash table remain unaltered. A common approach for amortized rehashing involves maintaining two hash functions, h_old and h_new. The process of rehashing a bucket's items in accordance with the new hash function is termed cleaning, which is implemented through the command pattern by encapsulating operations such as lookup, insertion, and deletion through a wrapper such that each element in the bucket gets rehashed; the procedure involves the following:[45]: 3
- Clean the bucket at index h_old(key).
- Clean the bucket at index h_new(key).
- The command gets executed.
Linear hashing
Linear hashing is an implementation of the hash table which enables dynamic growth or shrinkage of the table one bucket at a time.[46]
Performance
The performance of a hash table depends on the hash function's ability to generate quasi-random numbers (σ) for the entries in the hash table, where K, n and h(x) denote the key, the number of buckets and the hash function, such that σ = h(K) % n. If the hash function generates the same σ for distinct keys (K1 ≠ K2 with h(K1) = h(K2)), this results in a collision, which is dealt with in a variety of ways. The constant time complexity (O(1)) of the operations in a hash table is presupposed on the condition that the hash function doesn't generate colliding indices; thus, the performance of the hash table is directly proportional to the chosen hash function's ability to disperse the indices.[47]: 1 However, construction of such a hash function is practically infeasible; that being so, implementations depend on case-specific collision resolution techniques in achieving higher performance.[47]: 2
The best performance is obtained in the case that the hash function distributes the elements of the universe uniformly, and the elements stored in the table are drawn at random from the universe. In this case, in hashing with chaining, the expected time for a successful search is approximately 1 + α/2, and the expected time for an unsuccessful search is approximately 1 + α, where α is the load factor.[48]
Applications
Associative arrays
Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays.[33]
Database indexing
Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.[49]
Caches
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.[50][51]
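A sketch of this cache pattern in Python: on a collision the old entry is simply overwritten, so each slot holds at most one item and the table never grows; the table size of 8 is an illustrative assumption.

class HashCache:
    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        # On a collision, discard whatever was stored in the slot before.
        self.slots[hash(key) % len(self.slots)] = (key, value)

    def get(self, key):
        entry = self.slots[hash(key) % len(self.slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None     # miss: never stored, or evicted by a colliding key

cache = HashCache()
cache.put("user:42", {"name": "Ada"})
print(cache.get("user:42"))
print(cache.get("user:7"))   # very likely a miss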
Sets
Hash tables can be used in the implementation of a set data structure, which can store unique values without any particular order; a set is typically used for testing the membership of a value in the collection, rather than element retrieval.[52]
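A hash set can be sketched as a chained hash table that stores only keys and answers membership queries; the bucket count is an illustrative assumption, and in practice Python's built-in set serves this purpose.

class ChainedHashSet:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if key not in bucket:
            bucket.append(key)

    def contains(self, key):
        return key in self.buckets[hash(key) % len(self.buckets)]

s = ChainedHashSet()
s.add("red")
s.add("green")
print(s.contains("red"))    # True
print(s.contains("blue"))   # False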
Transposition table
A transposition table is a complex hash table which stores information about each section that has been searched.[53]
Implementations
Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.
- In JavaScript, an "object" is a mutable collection of key-value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first coerced to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.[54] ECMAScript 2015 also added the Map data structure, which accepts arbitrary values as keys.[55]
- C++11 includes unordered_map in its standard library for storing keys and values of arbitrary types.[56]
- Go's built-in map implements a map type in the form of a type, which is often (but not guaranteed to be) a hash table.[57]
- The Java programming language includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[58]
- Python's built-in dict implements a hash table in the form of a type.[59]
- Ruby's built-in Hash uses the open addressing model from Ruby 2.4 onwards.[60]
- The Rust programming language includes HashMap and HashSet as part of the Rust Standard Library.[61]
- The .NET standard library includes HashSet and Dictionary,[62][63] so they can be used from languages such as C# and VB.NET.[64]
References
- ^ Martin Farach-Colton; Andrew Krapivin; William Kuszmaul. Optimal Bounds for Open Addressing Without Reordering. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS). arXiv:2501.02305. doi:10.1109/FOCS61266.2024.00045.
- ^ a b Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8.
- ^ Mehlhorn, Kurt; Sanders, Peter (2008). "Hash Tables and Associative Arrays" (PDF). Algorithms and Data Structures. Springer. pp. 81–98. doi:10.1007/978-3-540-77978-0_4. ISBN 978-3-540-77977-3.
- ^ a b c d e f g Knuth, Donald E. (April 24, 1998). The Art of Computer Programming: Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley Professional. ISBN 978-0-201-89685-5.
- ^ Leiserson, Charles E. (Fall 2005). "Lecture 13: Amortized Algorithms, Table Doubling, Potential Method". course MIT 6.046J/18.410J Introduction to Algorithms. Archived from the original on August 7, 2009.
- ^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Chapter 11: Hash Tables". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 221–252. ISBN 978-0-262-53196-2.
- ^ a b c d e Sedgewick, Robert; Wayne, Kevin (2011). Algorithms. Vol. 1 (4 ed.). Addison-Wesley Professional – via Princeton University, Department of Computer Science.
- ^ Silberschatz, A.; Korth, H. F.; Sudarshan, S. (2020). Database System Concepts (7th ed.). McGraw-Hill.
- ^ Goodrich, M. T.; Tamassia, R.; Goldwasser, M. H. (2014). Data Structures and Algorithms in Java (6th ed.). Wiley.
- ^ a b c Konheim, Alan G. (2010). Hashing in Computer Science. doi:10.1002/9780470630617. ISBN 978-0-470-34473-6.
- ^ a b c d e f g h i j k l Mehta, Dinesh P.; Mehta, Dinesh P.; Sahni, Sartaj, eds. (2004). Handbook of Data Structures and Applications. doi:10.1201/9781420035179. ISBN 978-0-429-14701-2.
- ^ Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press.
- ^ a b c d e f g Mayers, Andrew (2008). "CS 312: Hash tables and amortized analysis". Cornell University, Department of Computer Science. Archived from the original on April 26, 2021. Retrieved October 26, 2021 – via cs.cornell.edu.
- ^ a b James S. Plank and Brad Vander Zanden. "CS140 Lecture notes -- Hashing".
- ^ Maurer, W. D.; Lewis, T. G. (March 1975). "Hash Table Methods". ACM Computing Surveys. 7 (1): 5–19. doi:10.1145/356643.356645. S2CID 17874775.
- ^ a b Owolabi, Olumide (February 2003). "Empirical studies of some hashing functions". Information and Software Technology. 45 (2): 109–112. doi:10.1016/S0950-5849(02)00174-X.
- ^ a b Lu, Yi; Prabhakar, Balaji; Bonomi, Flavio (2006). Perfect Hashing for Network Applications. 2006 IEEE International Symposium on Information Theory. pp. 2774–2778. doi:10.1109/ISIT.2006.261567. ISBN 1-4244-0505-X. S2CID 1494710.
- ^ Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009). "Hash, displace, and compress" (PDF). Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings. Lecture Notes in Computer Science. Vol. 5757. Berlin: Springer. pp. 682–693. CiteSeerX 10.1.1.568.130. doi:10.1007/978-3-642-04128-0_61. MR 2557794.
- ^ a b Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Chapter 11: Hash Tables". Introduction to Algorithms (2nd ed.). Massachusetts Institute of Technology. ISBN 978-0-262-53196-2.
- ^ Stroustrup, Bjarne (1997). The C++ Programming Language Third Edition. Reading Massachusetts: Addison-Wesley. p. 503. ISBN 0-201-88954-4.
- ^ Pearson, Karl (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine. Series 5. 50 (302): 157–175. doi:10.1080/14786440009463897.
- ^ Plackett, Robin (1983). "Karl Pearson and the Chi-Squared Test". International Statistical Review. 51 (1): 59–72. doi:10.2307/1402731. JSTOR 1402731.
- ^ a b Wang, Thomas (March 1997). "Prime Double Hash Table". Archived from the original on September 3, 1999. Retrieved May 10, 2015.
- ^ Wegman, Mark N.; Carter, J.Lawrence (June 1981). "New hash functions and their use in authentication and set equality". Journal of Computer and System Sciences. 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7.
- ^ Demaine, Erik; Lind, Jeff (Spring 2003). "Lecture 2" (PDF). 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Archived (PDF) from the original on June 15, 2010. Retrieved June 30, 2008.
- ^ a b c Culpepper, J. Shane; Moffat, Alistair (2005). "Enhanced Byte Codes with Restricted Prefix Properties". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 3772. pp. 1–12. doi:10.1007/11575832_1. ISBN 978-3-540-29740-6.
- ^ Willard, Dan E. (2000). "Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree". SIAM Journal on Computing. 29 (3): 1030–1049. doi:10.1137/S0097539797322425. MR 1740562..
- ^ Askitis, Nikolas; Sinha, Ranjan (October 2010). "Engineering scalable, cache and space efficient tries for strings". The VLDB Journal. 19 (5): 633–660. doi:10.1007/s00778-010-0183-9.
- ^ Askitis, Nikolas; Zobel, Justin (October 2005). "Cache-conscious Collision Resolution in String Hash Tables". Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005). Vol. 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6.
- ^ Askitis, Nikolas (2009). "Fast and Compact Hash Tables for Integer Keys" (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). Vol. 91. pp. 113–122. ISBN 978-1-920682-72-9. Archived from the original (PDF) on February 16, 2011. Retrieved June 13, 2010.
- ^ Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, p. 472. ISBN 978-0-13-199746-2.
- ^ a b Pagh, Rasmus; Rodler, Flemming Friche (2001). "Cuckoo Hashing". Algorithms — ESA 2001. Lecture Notes in Computer Science. Vol. 2161. pp. 121–133. CiteSeerX 10.1.1.25.4189. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.
- ^ a b c Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "11 Hash Tables", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN 0-262-03293-7.
- ^ a b c Vitter, Jeffery S.; Chen, Wen-Chin (1987). The design and analysis of coalesced hashing. New York, United States: Oxford University Press. ISBN 978-0-19-504182-8 – via Archive.org.
- ^ Pagh, Rasmus; Rodler, Flemming Friche (2001). "Cuckoo Hashing". Algorithms — ESA 2001. Lecture Notes in Computer Science. Vol. 2161. pp. 121–133. CiteSeerX 10.1.1.25.4189. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.
- ^ a b c d e f Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". Distributed Computing. Lecture Notes in Computer Science. Vol. 5218. pp. 350–364. doi:10.1007/978-3-540-87779-0_24. ISBN 978-3-540-87778-3.
- ^ a b Celis, Pedro (1986). Robin Hood Hashing (PDF). Ontario, Canada: University of Waterloo, Dept. of Computer Science. ISBN 978-0-315-29700-5. OCLC 14083698. Archived (PDF) from the original on November 1, 2021. Retrieved November 2, 2021.
- ^ Poblete, P. V.; Viola, A. (July 2019). "Analysis of Robin Hood and Other Hashing Algorithms Under the Random Probing Model, With and Without Deletions". Combinatorics, Probability and Computing. 28 (4): 600–617. doi:10.1017/S0963548318000408. S2CID 125374363.
- ^ Clarkson, Michael (2014). "Lecture 13: Hash tables". Cornell University, Department of Computer Science. Archived from the original on October 7, 2021. Retrieved November 1, 2021 – via cs.cornell.edu.
- ^ Gries, David (2017). "JavaHyperText and Data Structure: Robin Hood Hashing" (PDF). Cornell University, Department of Computer Science. Archived (PDF) from the original on April 26, 2021. Retrieved November 2, 2021 – via cs.cornell.edu.
- ^ Celis, Pedro (March 28, 1988). External Robin Hood Hashing (PDF) (Technical report). Bloomington, Indiana: Indiana University, Department of Computer Science. 246. Archived (PDF) from the original on November 3, 2021. Retrieved November 2, 2021.
- ^ Goddard, Wayne (2021). "Chapter C5: Hash Tables" (PDF). Clemson University. pp. 15–16. Retrieved December 4, 2023.
- ^ Devadas, Srini; Demaine, Erik (February 25, 2011). "Intro to Algorithms: Resizing Hash Tables" (PDF). Massachusetts Institute of Technology, Department of Computer Science. Archived (PDF) from the original on May 7, 2021. Retrieved November 9, 2021 – via MIT OpenCourseWare.
- ^ Thareja, Reema (2014). "Hashing and Collision". Data Structures Using C. Oxford University Press. pp. 464–488. ISBN 978-0-19-809930-7.
- ^ a b Friedman, Scott; Krishnan, Anand; Leidefrost, Nicholas (March 18, 2003). "Hash Tables for Embedded and Real-time systems" (PDF). All Computer Science and Engineering Research. Washington University in St. Louis. doi:10.7936/K7WD3XXV. Archived (PDF) from the original on June 9, 2021. Retrieved November 9, 2021 – via Northwestern University, Department of Computer Science.
- ^ Litwin, Witold (1980). "Linear hashing: A new tool for file and table addressing" (PDF). Proc. 6th Conference on Very Large Databases. Carnegie Mellon University. pp. 212–223. Archived (PDF) from the original on May 6, 2021. Retrieved November 10, 2021 – via cs.cmu.edu.
- ^ a b Dijk, Tom Van (2010). "Analysing and Improving Hash Table Performance" (PDF). Netherlands: University of Twente. Archived (PDF) from the original on November 6, 2021. Retrieved December 31, 2021.
- ^ Baeza-Yates, Ricardo; Poblete, Patricio V. (1999). "Chapter 2: Searching". In Atallah (ed.). Algorithms and Theory of Computation Handbook. CRC Press. pp. 2–6. ISBN 0849326494.
- ^ Lech Banachowski. "Indexes and external sorting". Polsko-Japońska Akademia Technik Komputerowych. Archived from the original on March 26, 2022. Retrieved March 26, 2022.
- ^ Zhong, Liang; Zheng, Xueqian; Liu, Yong; Wang, Mengting; Cao, Yang (February 2020). "Cache hit ratio maximization in device-to-device communications overlaying cellular networks". China Communications. 17 (2): 232–238. Bibcode:2020CComm..17b.232Z. doi:10.23919/jcc.2020.02.018. S2CID 212649328.
- ^ Bottommley, James (January 1, 2004). "Understanding Caching". Linux Journal. Archived from the original on December 4, 2020. Retrieved April 16, 2022.
- ^ Jill Seaman (2014). "Set & Hash Tables" (PDF). Texas State University. Archived from the original on April 1, 2022. Retrieved March 26, 2022.
- ^ "Transposition Table - Chessprogramming wiki". chessprogramming.org. Archived from the original on February 14, 2021. Retrieved May 1, 2020.
- ^ "JavaScript data types and data structures - JavaScript | MDN". developer.mozilla.org. Retrieved July 24, 2022.
- ^ "Map - JavaScript | MDN". developer.mozilla.org. June 20, 2023. Retrieved July 15, 2023.
- ^ "Programming language C++ - Technical Specification" (PDF). International Organization for Standardization. pp. 812–813. Archived from the original (PDF) on January 21, 2022. Retrieved February 8, 2022.
- ^ "The Go Programming Language Specification". go.dev. Retrieved January 1, 2023.
- ^ "Lesson: Implementations (The Java™ Tutorials > Collections)". docs.oracle.com. Archived from the original on January 18, 2017. Retrieved April 27, 2018.
- ^ Zhang, Juan; Jia, Yunwei (2020). "Redis rehash optimization based on machine learning". Journal of Physics: Conference Series. 1453 (1): 3. Bibcode:2020JPhCS1453a2048Z. doi:10.1088/1742-6596/1453/1/012048. S2CID 215943738.
- ^ Jonan Scheffler (December 25, 2016). "Ruby 2.4 Released: Faster Hashes, Unified Integers and Better Rounding". heroku.com. Archived from the original on July 3, 2019. Retrieved July 3, 2019.
- ^ "doc.rust-lang.org". Archived from the original on December 8, 2022. Retrieved December 14, 2022.
- ^ "HashSet Class (System.Collections.Generic)". learn.microsoft.com. Retrieved July 1, 2023.
- ^ dotnet-bot. "Dictionary Class (System.Collections.Generic)". learn.microsoft.com. Retrieved January 16, 2024.
- ^ "VB.NET HashSet Example". Dot Net Perls.
Further reading
- Tamassia, Roberto; Goodrich, Michael T. (2006). "Chapter Nine: Maps and Dictionaries". Data structures and algorithms in Java : [updated for Java 5.0] (4th ed.). Hoboken, NJ: Wiley. pp. 369–418. ISBN 978-0-471-73884-8.
- McKenzie, B. J.; Harries, R.; Bell, T. (February 1990). "Selecting a hashing algorithm". Software: Practice and Experience. 20 (2): 209–224. doi:10.1002/spe.4380200207. hdl:10092/9691. S2CID 12854386.
External links
- NIST entry on hash tables
- Open Data Structures – Chapter 5 – Hash Tables, Pat Morin
- MIT's Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
- MIT's Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
Hash table

Fundamentals
Definition and Operations
A hash table, also known as a hash map, is an associative array abstract data type that implements mappings from keys to values, storing key-value pairs in an array of buckets or slots where a hash function computes the index for each key to enable efficient access.[4][5] This structure leverages the array as its underlying storage mechanism, where the array serves as a fixed-size collection of slots that can accommodate the mapped entries.[6] Under the assumption of simple uniform hashing, the basic operations of insertion, search, and deletion achieve an average time complexity of O(1).[4] The insert operation adds a new key-value pair to the hash table. It computes the hash of the key to determine the target bucket index and places the pair into that bucket; if the key already exists, the value may be updated depending on the implementation.[6] Pseudocode for a basic insert, assuming no collisions for illustration, is as follows:

function insert(key, value):
    index = hash(key) // maps key to an array index
table[index] = (key, value)
function search(key):
index = hash(key)
if table[index].key == key:
return table[index].value
else:
return not found
function delete(key):
index = hash(key)
if table[index].key == key:
remove table[index]
else:
do nothing // key not present
Load Factor
The load factor of a hash table, denoted α, is defined as the ratio of the number of entries n stored in the table to the number of buckets m, expressed as α = n/m. This metric quantifies the fullness of the table and serves as a key indicator for maintaining performance efficiency. Typical implementations set resizing thresholds around 0.7 to 0.75 to balance space usage and operation speed, as higher values risk excessive collisions.[7] A higher load factor increases the probability of collisions, which degrades search, insertion, and deletion performance by requiring more probes or comparisons.[8] In separate chaining, where collisions are handled by linking entries in buckets, the load factor directly influences the expected chain length, which averages α under the simple uniform hashing assumption.[9] Consequently, the expected time for a successful search is about 1 + α/2 probes, as the initial bucket access is followed by traversing, on average, about half of a chain of expected length α.[8] In open addressing schemes, the load factor must remain strictly less than 1 to ensure available slots for insertions, since all entries occupy distinct buckets. For quadratic probing, performance notably degrades when α exceeds 0.5, as secondary clustering—where probes tend to revisit similar bucket sequences—intensifies, leading to longer probe sequences and potential insertion failures even before the table fills. To mitigate these effects, hash tables trigger resizing when the load factor surpasses a predefined threshold, typically by doubling the number of buckets to approximately halve the new load factor.[7] This process assumes a uniform distribution of hash values to ensure collisions remain manageable post-resizing.

Hash Functions
Properties and Design Principles
A good hash function for use in hash tables must exhibit several key properties to ensure balanced distribution of keys and efficient operations. Primarily, it must be deterministic, meaning that identical input keys always produce the same output hash value, which is essential for consistent retrieval and storage.[4] Additionally, the function should achieve uniform distribution, mapping keys evenly across the range of possible hash values to minimize clustering in buckets and approximate the simple uniform hashing assumption.[4] Efficiency in computation is another critical property, as the hash function must evaluate quickly, often in constant time relative to key size, to avoid bottlenecking insert, search, and delete operations.[4] A desirable diffusion property is the avalanche effect, where even a single-bit change in the input causes approximately half the output bits to change, enhancing resistance to patterns in keys and improving overall mixing.[10] To provide theoretical guarantees on uniformity, especially when keys arrive in adversarial or non-random order, universal hashing employs a family of functions rather than a single one. A family H of hash functions from a key universe U to {0, ..., m − 1} (where m is the number of buckets) is universal if, for any distinct keys x ≠ y, the probability over a random choice of h ∈ H that h(x) = h(y) is at most 1/m.[11] This bound ensures that the expected number of collisions for any set of keys is close to that under ideal random hashing, with the probability of exceeding t times the expected collisions being less than 1/t for any t > 1.[11] In a simple probabilistic model, selecting h randomly from such a family simulates an independent uniform mapping of each key to the m buckets, yielding average linear-time performance for chained hash tables even with up to n = αm insertions, where α is the load factor.[11] An explicit construction of a universal family, introduced by Carter and Wegman, uses affine transformations over a finite field. Let p be a prime larger than the key universe size, and m the table size; the family consists of functions h_{a,b}(k) = ((a·k + b) mod p) mod m, where a ∈ {1, ..., p − 1} and b ∈ {0, ..., p − 1} are chosen uniformly at random.[11]

function universal_hash(k, a, b, p, m):
return ((a * k + b) % p) % m
Common Techniques
Common techniques for hash functions focus on transforming keys into table indices efficiently while promoting uniformity. For integer keys, several methods have been developed to achieve this.

Integer Hashing
The division method is a straightforward approach for integer keys, defined as h(k) = k mod m, where k is the key and m is the table size, producing a remainder in the range [0, m − 1].[14] This method assumes keys are drawn from a universe of integers, typically from 0 to U − 1 where U ≫ m, and derives uniformity from the property that if m is chosen coprime to the key distribution's characteristics (e.g., avoiding powers of 2 for decimal keys), the remainders approximate a uniform distribution over the slots.[14] For example, with m = 10 and k = 27, h(27) = 27 mod 10 = 7; for k = 123, h(123) = 123 mod 10 = 3.[14]

The multiplication method improves on this by leveraging fractional parts for better distribution, given by h(k) = ⌊m · ((k·A) mod 1)⌋, where A is a constant between 0 and 1 and (k·A) mod 1 denotes the fractional part of k·A.[14] Knuth recommends A ≈ (√5 − 1)/2 ≈ 0.6180339887, the reciprocal of the golden ratio, as it yields low-discrepancy sequences that enhance uniformity regardless of key distribution patterns.[14] The uniformity derives from the irrationality of A, which ensures that the sequence of fractional parts of k·A is dense and equidistributed in [0, 1) for sequential integer k, thus mapping evenly to [0, m − 1] after scaling and flooring.[14] For m = 10, A ≈ 0.618, and k = 27, first compute 27 × A ≈ 16.686, fractional part 0.686, then h(27) = ⌊10 × 0.686⌋ = 6; for k = 123, 123 × A ≈ 76.018, fractional part 0.018, h(123) = ⌊10 × 0.018⌋ = 0.[14]

As a historical example, the mid-square method squares the key and extracts the middle digits to form the hash value, suitable for fixed-length integer representations.[15] For a key with d digits, square it to get up to 2d digits and take the middle d digits as h(k), then reduce modulo m if needed.[15] This was an early technique but often produces poor uniformity due to clustering in squared values.[14] For m = 10 and k = 27 (squared = 729, middle digit 2), h(27) = 2; for k = 123 (squared = 15129, middle digits 51), h(123) = 51 mod 10 = 1.[15]

These integer methods typically assume keys belong to a universe of non-negative integers from 0 to U − 1, with U much larger than m to ensure probabilistic uniformity.[14]

String Hashing
For string keys, the polynomial rolling hash treats the string as a base-b number modulo a prime p, computed as h(s) = (s_0·b^(n−1) + s_1·b^(n−2) + ... + s_(n−1)) mod p, where the s_i are character codes (e.g., ASCII), n is the string length, b is the base, and p is a large prime.[16] The base b is chosen as a small prime such as 31 or 131 to balance computation and avalanche effects, while p (e.g., a 64-bit prime) is selected large to minimize collision probability, approximating 1/p for random strings under the birthday paradox.[16] This formulation, used in the Rabin–Karp algorithm, enables efficient sliding-window updates for substring hashing by multiplying the prior hash by b, subtracting the dropped term, and adding the new one, all mod p.[16] For example, with the string "abc" (a = 97, b = 98, c = 99), base b = 31, p = 101 and n = 3: h("abc") = (97 × 31² + 98 × 31 + 99) mod 101 = (93217 + 3038 + 99) mod 101 = 96354 mod 101 = 0.[16]

Hybrid Approaches
For complex keys like memory addresses, hybrid methods combine techniques such as XOR folding, which divides the key into fixed-width parts and XORs them to produce a compact value before applying another hash like division.[17] This is useful for network addresses, folding octets via successive XOR to mix bits evenly without overflow issues.[17] For a 32-bit address 0x1A2B3C4D with m = 10, fold into 16-bit halves (0x1A2B ^ 0x3C4D = 0x2666), then fold the bytes (0x26 ^ 0x66 = 0x40), and h = 0x40 mod 10 = 4; for 0x11223344, 0x1122 ^ 0x3344 = 0x2266, then 0x22 ^ 0x66 = 0x44, and h = 0x44 mod 10 = 8.[17]
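The folding arithmetic above can be reproduced with a short Python sketch; the 32-bit width and the table size of 10 mirror the worked figures and are assumptions of this illustration.

def fold_xor_hash(address, m=10):
    # Fold a 32-bit value into 16 bits, then into 8 bits, via successive XOR.
    folded16 = (address >> 16) ^ (address & 0xFFFF)
    folded8 = (folded16 >> 8) ^ (folded16 & 0xFF)
    return folded8 % m

print(fold_xor_hash(0x1A2B3C4D))  # 4, matching the worked example
print(fold_xor_hash(0x11223344))  # 8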
Collision Resolution

Separate Chaining
Separate chaining resolves collisions in hash tables by associating each bucket with a linked list (or chain) that stores all keys hashing to that index. When inserting a key that collides with existing entries, it is appended to the corresponding chain, allowing multiple elements per bucket without requiring alternative slots. This approach maintains the hash table's array structure while deferring collision handling to the auxiliary lists, which grow dynamically as needed. The primary operations—insertion, search, and deletion—in separate chaining rely on computing the hash index and then traversing the chain linearly. For insertion, the key is typically added at the head of the list for constant-time access, though appending to the end preserves order if desired (requiring a tail pointer for efficiency). Search and deletion involve scanning the chain from the head until the key is found or the end is reached; deletion requires updating the next pointers to remove the node while preserving the list. Pseudocode for these operations, assuming a hash table T implemented as an array of m linked lists and a hash function h, is as follows:

Insertion:

CHAINED-INSERT(T, k)
x = create new node with key k and value (if applicable)
i = h(k) mod m // m is number of buckets
x.next = T[i].head
T[i].head = x
CHAINED-SEARCH(T, k)
i = h(k) mod m
x = T[i].head
while x ≠ null and x.key ≠ k
x = x.next
return x // null if not found
CHAINED-DELETE(T, k)
i = h(k) mod m
x = T[i].head
if x.key = k
T[i].head = x.next
free x
return
y = x
x = x.next
while x ≠ null and x.key ≠ k
y = x
x = x.next
if x ≠ null
y.next = x.next
free x
Open Addressing
Open addressing, also known as closed hashing, resolves collisions in hash tables by storing all elements directly within a fixed-size array, without using auxiliary data structures like linked lists. Upon inserting a key k, the initial position is computed as h(k) mod m, where m is the array size and h is the hash function; if that slot is occupied, a probe sequence h_i(k) for i = 1, 2, ... is followed until an empty slot is found or the key is matched during search or deletion. This approach assumes simple uniform hashing and keeps the table load factor α = n/m below 1, where n is the number of elements, to ensure termination with high probability.[20] Deletion in open addressing cannot simply empty a slot, as this would disrupt probe sequences for other elements that might traverse it; instead, a special marker called a tombstone is placed, indicating the slot is available for future insertions but treated as occupied during searches to preserve chain integrity. Tombstones accumulate over time, effectively increasing the load factor and necessitating periodic rehashing to maintain performance.[20] Linear probing, the earliest and simplest variant, defines the probe sequence as h_i(k) = (h(k) + i) mod m. Introduced by Peterson in 1957 for random-access storage systems, it promotes sequential access but suffers from primary clustering: insertions near an occupied run of slots tend to extend the cluster, leading to longer average probe lengths. For instance, if keys hashing to indices 10, 11, and 12 are inserted into a table of size 20, a new key hashing to 10 will probe sequentially through slots 10, 11, and 12 before landing in slot 13, extending the cluster and exacerbating delays for nearby hashes. Knuth analyzed this in detail, showing that the average number of probes for an unsuccessful search approximates (1/2)(1 + 1/(1 − α)²). Quadratic probing mitigates primary clustering by using a quadratic offset: h_i(k) = (h(k) + c_1·i + c_2·i²) mod m, where constants c_1 and c_2 are chosen to ensure good distribution, such as c_1 = c_2 = 1/2 (or equivalently, using the triangular numbers (i² + i)/2 in integer arithmetic) for balanced steps when m is a power of two. This produces non-linear jumps, reducing the tendency for probes to fill contiguous blocks, though it introduces secondary clustering—keys sharing the same h(k) follow identical sequences. If m is prime, the first ⌈m/2⌉ probes are distinct, guaranteeing coverage of half the table before repetition. Knuth discusses its analysis, noting it performs comparably to linear probing at low loads but requires careful parameter selection to avoid permutation issues. Double hashing further improves distribution by employing two independent hash functions: h_i(k) = (h_1(k) + i·h_2(k)) mod m, where h_2(k) provides a variable step size, commonly set as h_2(k) = 1 + (k mod (m − 1)) to ensure it is positive and less than m. This method approximates uniform probing, minimizing both primary and secondary clustering by making probe sequences key-dependent and pseudo-random. Introduced in Knuth's work and analyzed in subsequent papers, double hashing achieves near-ideal performance up to load factors around 0.8, with average unsuccessful search probes close to 1/(1 − α).[21] Open addressing variants excel in cache efficiency due to localized, predictable memory accesses, outperforming chaining in modern hardware for small-to-medium tables. However, they demand load factors below 0.7–0.8 to keep probe lengths reasonable—beyond this, clustering causes severe degradation, with the expected number of probes for unsuccessful searches growing without bound as α approaches 1. Deletion via tombstones adds overhead, as they occupy space without utility, often requiring compaction or resizing to control effective load. In contrast to separate chaining, open addressing avoids pointer overhead but imposes stricter capacity limits.[20]
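A sketch combining double hashing with tombstone deletion as described above, in Python; the secondary hash h2(k) = 1 + (k mod (m − 1)), the table size m = 11 and the restriction to integer keys are illustrative assumptions.

class DoubleHashingTable:
    _TOMBSTONE = object()    # marks a deleted slot without breaking probe chains

    def __init__(self, m=11):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        # Probe sequence (h1(k) + i * h2(k)) mod m, with h2 never zero.
        h1 = key % self.m
        h2 = 1 + (key % (self.m - 1))
        for i in range(self.m):
            yield (h1 + i * h2) % self.m

    def insert(self, key, value):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None or slot is self._TOMBSTONE or slot[0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("table full; resize in practice")

    def search(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None:                      # a truly empty slot ends the search
                return None
            if slot is not self._TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None:
                return False
            if slot is not self._TOMBSTONE and slot[0] == key:
                self.slots[j] = self._TOMBSTONE   # keep later entries reachable
                return True
        return False

t = DoubleHashingTable()
t.insert(3, "a")
t.insert(14, "b")       # 3 and 14 collide under mod 11
t.delete(3)
print(t.search(14))     # "b" is still reachable past the tombstone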
Resizing Strategies

Standard Resizing
In standard hash table resizing, the process is triggered when the load factor α, defined as the ratio of the number of elements n to the table size m (α = n/m), exceeds a maximum threshold α_max, typically set between 0.5 and 1 to maintain performance.[22][23] Upon exceeding this threshold, the table size m is doubled to 2m, and all existing n elements are reinserted into the new table using the hash function adjusted for the updated size, often by recomputing h(k) mod 2m where h is the same underlying hash function.[22][23] This rehashing step ensures the load factor returns to approximately α_max / 2, preserving the uniform distribution of keys under the assumption of a good hash function.[23] The resizing operation incurs a temporary cost of Θ(n + m) time and additional space for the new table, as every element must be rehashed and reinserted, which can lead to pauses in performance during the rebuild.[23] However, by doubling the size each time, the frequency of resizes decreases geometrically—each subsequent resize handles roughly twice as many elements before the next trigger—resulting in an amortized O(1) cost per insertion or deletion operation when analyzed using the potential method.[22][23] For a sequence of n insertions starting from an empty table, the total rehashing cost sums to O(n), as the series of resize costs (n + n/2 + n/4 + ...) converges to less than 2n operations.[22] To prevent excessive space usage when the table becomes underutilized, shrinking is performed by halving the table size m to m/2 when the load factor falls below a minimum threshold, such as α_max / 4 (e.g., 0.25 if α_max = 1), followed by rehashing all elements into the smaller table.[22] This threshold avoids oscillation between frequent grows and shrinks, maintaining efficiency while reclaiming memory.[22] The standard resizing approach applies to both separate chaining and open addressing implementations, as the rehashing process rebuilds the underlying structure regardless of collision resolution method.[23] The following pseudocode illustrates a resize-integrated insert operation for a separate chaining hash table, where resizing is checked at the start of each insertion (adaptable to open addressing by modifying the insert step):

function INSERT(key, value):
if n / m > α_max:
RESIZE(2 * m)
index = HASH(key) mod m
// For separate chaining: prepend to chain at T[index]
new_node = new Node(key, value)
new_node.next = T[index]
T[index] = new_node
n = n + 1
function RESIZE(new_m):
    old_T = T
    old_m = m
    m = new_m
T = new array of size m // Initialize empty chains
for i = 0 to old_m - 1:
current = old_T[i]
while current != null:
index = HASH(current.key) mod m
temp = current.next
current.next = T[index]
T[index] = current
current = temp
// For shrinking, call RESIZE(m / 2) in DELETE if n / m < α_min
delete old_T
Incremental Approaches
Incremental approaches to hash table resizing aim to mitigate the performance pauses associated with standard doubling or halving operations by distributing the rehashing workload over multiple insertions or operations, enabling gradual adaptation to changing data volumes.[25] These methods are particularly valuable in environments requiring low-latency responses, such as databases and distributed systems, where full rehashing could introduce unacceptable delays.[26]

Linear Hashing
Linear hashing, introduced by Witold Litwin in 1980, facilitates dynamic growth by splitting buckets incrementally using a split pointer, avoiding the need for a complete table reorganization.[25] The algorithm employs a family of hash functions h_i(k) = h(k) mod (2^i * N), where N is the initial number of buckets, i is the current level (starting at 0), and h is a base hash function producing values much larger than the table size.[26] A split pointer, denoted Next, indicates the next bucket to split and advances linearly through the table. During insertion, the key is hashed with h_level to locate its primary bucket; if the resulting index is less than Next, that bucket has already been split and the key is re-addressed with h_{level+1}.[26] If the bucket has space, the key is inserted directly; otherwise, an overflow page is chained to it temporarily.[25] To resolve the overflow, the bucket at the Next position is split: its contents are rehashed using h_{level+1}, distributing records between the original bucket and a newly allocated one, after which Next increments by 1.[26] This process continues per insertion as needed, with one split per overflow event, maintaining a near-constant load factor (up to 90% for file-oriented implementations).[25] When Next reaches the end of the current range (Next = 2^level * N), the level increments, Next resets to 0, and a new round of splits begins, effectively doubling the address space over time without batch reprocessing.[26] Overflow handling relies on chaining pages, but the round-robin splitting prevents long chains by proactively redistributing load.[25]
Example: Step-by-Step Insertion with Split in Linear Hashing
Consider an initial table with N = 2 buckets (indices 0 and 1), level = 0, Next = 0, and h(k) = k for simplicity (actual implementations use a strong hash function). Assume each bucket holds one record.[26]
- Insert 10: h_0(10) = 10 mod 2 = 0; bucket 0 is empty → insert into bucket 0 (now: [10]).
- Insert 12: h_0(12) = 12 mod 2 = 0; bucket 0 is full → chain an overflow page to bucket 0 (now: [10, 12] via overflow), and trigger a split at Next = 0.
- Split bucket 0 using h_1(k) = k mod 4: rehash its contents, so 10 (10 mod 4 = 2) moves to the newly allocated bucket 2 while 12 (12 mod 4 = 0) stays in bucket 0. Post-split: bucket 0: [12], bucket 2: [10], the overflow page is freed, and Next = 1.
- Subsequent insertions proceed similarly: a search first applies h_level and, if the addressed bucket index is less than Next (already split), reapplies h_{level+1}. This example illustrates one split per overflow, spreading the rehashing cost across insertions; a minimal code sketch of the split mechanics follows the list.[26]
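A minimal Python sketch of the split-pointer mechanics described above (integer keys, a bucket capacity of one, overflow records kept in the same list rather than on separate pages, and the identity function as the base hash are simplifying assumptions):
# Linear hashing sketch: h_i(k) = k mod (N * 2^i), with a split pointer that
# advances round-robin and a level that increments once every bucket has split.
class LinearHashTable:
    def __init__(self, n0=2, capacity=1):
        self.n0 = n0                  # initial number of buckets (N)
        self.level = 0                # current level i
        self.next = 0                 # split pointer
        self.capacity = capacity
        self.buckets = [[] for _ in range(n0)]

    def _address(self, key):
        # apply h_level; buckets below the split pointer have already been split,
        # so their keys are re-addressed with h_{level+1}
        j = key % (self.n0 * (1 << self.level))
        if j < self.next:
            j = key % (self.n0 * (1 << (self.level + 1)))
        return j

    def insert(self, key):
        j = self._address(key)
        self.buckets[j].append(key)   # records beyond capacity model an overflow page
        if len(self.buckets[j]) > self.capacity:
            self._split()

    def _split(self):
        # split the bucket at the split pointer (not necessarily the one that overflowed)
        old_records = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])       # new image bucket at index next + N * 2^level
        new_mod = self.n0 * (1 << (self.level + 1))
        for k in old_records:
            self.buckets[k % new_mod].append(k)
        self.next += 1
        if self.next == self.n0 * (1 << self.level):
            self.level += 1           # every original bucket has split: new round
            self.next = 0

    def contains(self, key):
        return key in self.buckets[self._address(key)]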
Extendible Hashing
Extendible hashing, proposed by Ronald Fagin et al. in 1979, uses a directory of pointers to bucket pages, enabling dynamic resizing through selective directory expansion without rehashing the entire dataset. The directory is an array of size 2^d, where d is the global depth, the number of high-order bits of the hash value used to index it. Each bucket has a local depth d' ≤ d, indicating how many of those bits uniquely identify the records stored in it.[26] For insertion, the hash value provides a binary address, and its first d bits index the directory to find the bucket pointer. If space exists, the record is inserted; if the bucket is full, it is split: its local depth is incremented to d' + 1, a sibling bucket is allocated for the newly distinguished bit, and the records are redistributed based on that bit. If the new local depth would exceed the global depth (d' = d before the split), the directory is doubled and the global depth incremented, with pointers duplicated to maintain the mapping. Multiple directory entries may point to the same bucket when its local depth is less than the global depth, which also allows coalescing during contraction.[26] This approach guarantees at most two disk accesses per lookup (one for the directory, one for the bucket) and confines rehashing to the split bucket's records, making it suitable for disk-based systems.
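A minimal Python sketch of the directory-and-split mechanism under simplifying assumptions (low-order rather than high-order hash bits index the directory, a bucket capacity of two, and Python's built-in hash as a stand-in for a real hash function):
# Extendible hashing sketch: a directory of 2^global_depth pointers to buckets,
# each bucket carrying a local depth; full buckets split, doubling the directory
# only when the local depth would otherwise exceed the global depth.
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []

class ExtendibleHashTable:
    BUCKET_CAPACITY = 2

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # low-order global_depth bits of the hash select the directory entry
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.items) < self.BUCKET_CAPACITY:
            bucket.items.append(key)
            return
        # bucket is full: double the directory first if its depth is exhausted
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory   # duplicate pointers
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        distinguishing_bit = 1 << (bucket.local_depth - 1)
        # entries whose index has the newly distinguished bit set move to the sibling
        for i, b in enumerate(self.directory):
            if b is bucket and (i & distinguishing_bit):
                self.directory[i] = sibling
        # redistribute only the split bucket's records, then retry the new key
        old_items, bucket.items = bucket.items, []
        for k in old_items + [key]:
            self.insert(k)            # assumes not all keys share one hash value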
Consistent Hashing
Consistent hashing, developed by David Karger et al. in 1997, addresses incremental resizing in distributed hash tables by minimizing key migrations when nodes are added or removed.[29] Keys and nodes are mapped via independent hash functions to points on a fixed circle (e.g., the unit interval [0, 1) or integers modulo 2^32), with each key assigned to the nearest succeeding node in clockwise order.[29] To improve load balance, each physical node is represented by multiple virtual nodes (typically O(log n) copies, where n is the number of nodes), hashed separately to create evenly spaced points.[29] Upon adding or removing a node, only the keys in the affected arcs (between the added or removed node's virtual points and their predecessors) are reassigned, an expected 1/n fraction of all keys.[29] This ensures smooth load distribution, with the maximum load exceeding the ideal share by at most a constant factor with high probability, and with spread (the number of nodes holding copies of a key across client views) bounded by O(log n).[29] Deletions follow symmetrically, reassigning keys to their successors. These incremental methods offer lower latency during resizes for large-scale tables compared to batch rehashing, as costs are amortized over operations, but they introduce greater implementation complexity due to split pointers, levels, directories, and virtual mappings.[25][29] They are especially suitable for database indexing and distributed caching, where long rehashing pauses are unacceptable.[26]
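A minimal Python sketch of a consistent-hashing ring with virtual nodes (the MD5-based point function, the replica count, and the node names are illustrative assumptions); it also counts how few keys move when a node is added:
import bisect
import hashlib

def _point(label):
    # map a string to a point on the ring [0, 2^32)
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % (1 << 32)

class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas          # virtual nodes per physical node
        self.ring = []                    # sorted list of (point, node) pairs

    def add_node(self, node):
        for r in range(self.replicas):
            bisect.insort(self.ring, (_point(f"{node}#{r}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key):
        # a key is owned by the first virtual node clockwise from its point
        p = _point(key)
        i = bisect.bisect_right(self.ring, (p, ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing()
for node in ("node-a", "node-b", "node-c"):
    ring.add_node(node)
before = {k: ring.node_for(k) for k in map(str, range(1000))}
ring.add_node("node-d")
moved = sum(1 for k, owner in before.items() if ring.node_for(k) != owner)
print(f"{moved} of 1000 keys remapped after adding node-d")   # roughly a quarter of the keys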
Performance Analysis
Time Complexity
The time complexity of hash table operations is analyzed under the simple uniform hashing assumption, where each key is equally likely to hash to any slot independently of other keys. For both separate chaining and open addressing, the average-case time complexity of insertion, search, and deletion is O(1) when the load factor α = n/m (with n keys and m slots) is held constant. On modern CPUs, these operations achieve practical latencies dominated by cache access times, with L1 cache hits around 1 ns and higher cache levels up to 4 ns, enabling lookups as low as 2 ns and iterations or probes in 2–3 ns under ideal cache-resident conditions and minimal collisions.[30]
In separate chaining, the expected length of each chain is α, leading to expected Θ(1 + α) time for an unsuccessful search and Θ(1 + α/2) for a successful search; with constant α, both simplify to O(1). For open addressing with linear probing, the expected number of probes for an unsuccessful search is approximately (1/2)(1 + 1/(1 - α)^2), while for a successful search it is approximately (1/2)(1 + 1/(1 - α)); again, a constant α < 1 yields O(1) average time.[20] Amortized analysis accounts for resizing, which doubles the table size when α exceeds a threshold (e.g., 0.5–1.0) to maintain performance. Over a sequence of n insertions starting from an empty table, the total time is O(n) because each key is moved a constant expected number of times during resizes, yielding amortized O(1) time per operation via the aggregate method.[20]
In the worst case, all keys may hash to the same slot, resulting in Θ(n) time per operation due to linear scanning of the chain or probe sequence, exacerbated by poor hash functions or adversarial inputs that defeat uniformity. In 2025, research showed that open-addressing hash tables built without reordering elements during construction can achieve shorter expected probe sequences than classical analyses suggested, improving upon long-standing bounds under certain conditions.[20][31] A key space-time trade-off arises from the load factor: a lower α (achieved by increasing m) reduces the expected number of probes and thus average time, but at the cost of higher space usage, as the table holds more empty slots.[20] Certain variants improve worst-case guarantees. Replacing the linked lists of separate chaining with balanced binary search trees yields O(log n) worst-case time for operations, though at higher constant factors. Robin Hood hashing, an open-addressing technique, balances probe lengths by displacing keys with shorter probe distances during insertion, reducing maximum probe sequences and improving practical worst-case performance without asymptotic changes.[20][32]
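As a worked illustration of the linear-probing estimates above (a small sketch; the formulas are the classical approximations quoted in this subsection):
# Expected number of probes under linear probing as a function of load factor.
def probes_successful(a):
    return 0.5 * (1 + 1 / (1 - a))

def probes_unsuccessful(a):
    return 0.5 * (1 + 1 / (1 - a) ** 2)

for a in (0.5, 0.7, 0.9):
    print(f"alpha={a}: successful ~ {probes_successful(a):.2f}, "
          f"unsuccessful ~ {probes_unsuccessful(a):.2f}")
# alpha=0.5 gives about 1.5 and 2.5 probes; alpha=0.9 gives about 5.5 and 50.5,
# which is why open-addressing tables keep the load factor well below 1.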
Space Efficiency and Trade-offs
Hash tables achieve linear space complexity in the number of elements n, denoted O(n), but incur overhead that raises the constant factor above 1. This overhead arises from the allocation of m buckets plus storage for keys and values; typically each bucket requires space for a pointer or a fixed-size entry, so the table structure alone costs on the order of m pointers or entries.[33] In separate chaining, collisions are resolved by linking elements in lists, adding one pointer per element beyond the initial bucket pointers, which can increase space usage by approximately 50% or more depending on the load factor, as each node in the chain requires additional memory for links. Open addressing, by contrast, avoids these per-element pointers by probing within the array but wastes slots due to clustering, necessitating a larger m to maintain performance and thus higher overall space for the same n. This pointer overhead in chaining precludes full space efficiency in stable hash tables, while open addressing trades density for simplicity but still leaves allocated space underutilized at high loads.[34][33]
The load factor α critically influences space usage, with values of 0.5 to 0.7 recommended for open addressing to balance probe lengths and avoid excessive clustering, resulting in a space multiplier of roughly 1/α (about 1.4 to 2 times the minimum number of slots needed). For separate chaining, higher loads up to 0.7 or 0.8 are feasible with minimal space penalty beyond the pointers, as chains grow linearly without wasted slots, though performance degrades if α significantly exceeds 1. These ranges preserve space efficiency without disproportionate time costs from collisions.[35]
Compression techniques further mitigate overhead in space-constrained settings. Using power-of-two table sizes enables fast modulo operations via bit masking (e.g., h(k) & (m - 1), illustrated in the snippet at the end of this subsection), avoiding costly division while maintaining even distribution and allowing efficient resizing by factors of 2. Bit-packing of keys and fingerprints in compact hashing schemes reduces per-entry size by storing multiple elements per word or using succinct representations, achieving near-optimal space (close to n times the key size) while supporting constant-time access in expectation.[33]
In distributed systems, consistent hashing introduces space trade-offs for availability and load balancing; virtual nodes or explicit replication assign multiple copies of data across nodes to handle failures or hotspots, increasing total storage by a factor equal to the replication degree (often 3 or more) at the cost of reduced per-node efficiency. This ensures minimal remapping on node changes but raises overall space usage compared to non-replicated single-node hash tables. Compared to balanced binary search trees, which also require O(n) space due to per-node pointers and keys, hash tables exhibit a higher constant factor from load-factor underutilization and collision structures, typically 1.5 to 2 times the raw element storage versus about 3 times for trees (two pointers per node plus key and value). However, hash tables provide O(1) average time without the O(log n) scaling of tree-based structures, making them preferable when average-case performance and density are prioritized over worst-case guarantees.
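A small illustration of the power-of-two masking trick mentioned above (the table size and key are arbitrary assumptions):
# With m a power of two, h(k) mod m equals h(k) & (m - 1), replacing a division
# with a single bitwise operation.
m = 1 << 16                      # 65,536 buckets
key_hash = hash("example-key")   # any integer hash value
slot = key_hash & (m - 1)
assert slot == key_hash % m      # holds in Python even for negative hashes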
Applications
Core Data Structures
Hash tables provide an efficient implementation for associative arrays, also known as maps or dictionaries, which store pairs of unique keys and associated values. A hash function maps each key to an array index, allowing average constant-time access, insertion, and deletion operations by storing or retrieving values directly from the computed slot, subject to collision resolution.[36][37] Hash sets, a keys-only variant of maps, leverage hash tables by storing elements as keys with implicit or dummy values, enforcing uniqueness since identical keys result in overwrites rather than duplicates. This structure supports efficient membership testing in average O(1) time, while operations like union and intersection can be computed by hashing elements from one set into a temporary table and checking overlaps with another.[37][38] For ordered variants, hash tables can be combined with balanced search trees, such as treaps, either by using trees within overflow buckets or by augmenting the overall structure to maintain sorted order alongside hashed indexing, enabling both fast lookups and ordered traversal.[39][40] Unordered collections like Python's built-in dict and set rely on hash tables for their core implementation, where keys are hashed to enable rapid access; set does not preserve insertion order, while dict has preserved insertion order since Python 3.7.[41][38]
A key limitation of hash table-based structures is the absence of inherent ordering, requiring separate mechanisms for sorted access, and the prohibition of duplicates, as the mapping enforces unique keys by design.[36][37]
Specialized Uses
Hash tables find specialized application in database indexing, where they support rapid equality-based queries. Hash indexes map keys to data locations using a hash function, enabling average O(1) lookup times for exact matches, which is particularly advantageous in scenarios like point queries in relational databases.[42] In contrast to ordered structures like B-trees, which facilitate range scans and inequality operations through their sorted leaf nodes, hash indexes do not preserve order and thus perform poorly on range queries, making B-trees the standard for versatile indexing needs.[42] Extendible hashing addresses the limitations of static hashing by dynamically adjusting bucket sizes via a directory that points to data pages based on hash prefixes, allowing the structure to grow or shrink without full reorganization as the database expands.[43] This technique, introduced by Fagin, Nievergelt, Pippenger, and Strong in 1979, is employed in filesystems and databases to maintain performance under varying loads, with directory splits occurring only when overflows demand deeper prefix resolution.[43]
In caching mechanisms, hash tables provide the foundation for constant-time key access, often combined with other structures for eviction policies. Least Recently Used (LRU) caches integrate a hash table for O(1) lookups with a doubly-linked list that tracks usage order: accessed items are moved to the front, and the tail item is evicted when capacity is reached, achieving amortized O(1) operations overall (a minimal sketch follows below). Bloom filters extend hash-based sets probabilistically, using multiple independent hash functions to set bits in a bit array for membership representation; they offer space savings over traditional hash sets by tolerating a tunable false-positive rate (typically 1-10%) while guaranteeing no false negatives, making them ideal for applications like duplicate detection in large streams where exact storage is impractical.
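A minimal Python sketch of the LRU cache design described above, using collections.OrderedDict to pair the hash table with a usage-ordered list (the capacity and API are illustrative assumptions):
from collections import OrderedDict

# LRU cache: hash-table lookups plus a doubly-linked usage order, both provided
# here by OrderedDict, giving amortized O(1) get and put.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes most recently used
cache.put("c", 3)    # evicts "b"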
Transposition tables in game artificial intelligence, such as chess engines, leverage hash tables to cache search results and avoid redundant computations for equivalent board states. These tables store evaluation scores, depths, and best moves keyed by position hashes, with collisions resolved by depth or exact-match checks to prune search trees. Zobrist hashing, a seminal method for generating these keys, assigns random 64-bit values to each piece-type-square combination and XORs them across the board, producing a near-unique signature with low collision probability (about 2^{-64}) that updates incrementally with moves.[44] Developed by Albert Zobrist in 1970, this approach significantly accelerates alpha-beta search by detecting transpositions (positions reached via different move sequences).[44]
Beyond these, hash tables underpin symbol tables in compilers, mapping identifiers to attributes like types, scopes, and memory offsets for efficient semantic analysis and code generation across multiple compilation phases.[45] In storage systems, they facilitate data deduplication by computing cryptographic hashes (e.g., SHA-256) of file chunks as unique fingerprints, storing only one copy per unique hash in a table to eliminate redundancies and achieve up to 95% space reduction in backup scenarios.[46][47] Cryptographically, rainbow tables precompute chains of hash reductions to invert one-way functions like password hashes, a time-memory trade-off that substantially reduces the online cost of an attack compared with exhaustive search; however, their effectiveness is severely limited by salting and peppering in modern systems, leaving them largely ineffective against properly salted password stores and primarily of educational interest for demonstrating hash-inversion risks.
In contemporary big data processing, Apache Spark employs hash joins to merge datasets efficiently, building in-memory hash tables on the smaller relation's partitions for probing by the larger one, with broadcast variants distributing compact tables cluster-wide to minimize shuffling.[48] Distributed hash tables (DHTs), exemplified by Chord, scale hash table semantics across peer-to-peer networks by mapping keys to a circular identifier space via consistent hashing, where nodes maintain finger tables for O(log N) lookups and automatic load balancing under churn.[49] Introduced by Stoica et al. in 2001, Chord underpins decentralized storage in systems like file-sharing overlays, ensuring resilience with O(log N) stabilization after node joins or failures.[49]
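Returning to the transposition-table example above, a minimal Python sketch of Zobrist keying (the piece encoding and board representation are illustrative assumptions):
import random

# Zobrist hashing: one random 64-bit value per (piece type, square); a position's
# key is the XOR of the values for all occupied squares and can be updated
# incrementally when a piece moves.
PIECES = "PNBRQKpnbrqk"          # 12 piece types (white and black)
ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in PIECES]

def position_key(board):
    # board: dict mapping square index (0-63) to a piece character
    key = 0
    for square, piece in board.items():
        key ^= ZOBRIST[PIECES.index(piece)][square]
    return key

def apply_move(key, piece, from_sq, to_sq):
    # incremental update: XOR the piece out of its old square and into the new one
    p = PIECES.index(piece)
    return key ^ ZOBRIST[p][from_sq] ^ ZOBRIST[p][to_sq]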
Implementations
Pseudocode Examples
Hash table implementations typically involve an array of fixed size m, a hash function h(key) mapping keys to indices 0 to m - 1, and maintenance of the load factor α = n/m (where n is the number of elements) below a threshold (e.g., 0.75 for chaining, 0.5 for open addressing) to ensure efficiency. Core operations include:
- Insert(key, value): Compute index = h(key) mod m; if the key exists, update its value; otherwise add a new entry and increment n; if α exceeds the threshold, resize.
- Search(key): Compute index = h(key) mod m; probe the bucket or slot until the key is found or the end of the chain/probe sequence is reached.
- Delete(key): Compute index = h(key) mod m; locate and remove the entry and decrement n; for open addressing, mark the slot as deleted (a tombstone) to preserve probe chains.
Chaining Implementation
A basic hash table using chaining resolves collisions by storing multiple elements in linked lists at each bucket. The structure consists of an array of lists, initialized with a number of buckets m, typically a power of 2 for efficient resizing. The hash function is denoted h(key) and maps a key to an index between 0 and m - 1. Operations maintain a load factor α = n/m, where n is the number of elements, and resize when α exceeds a threshold such as 0.75 to ensure amortized constant-time performance. The following pseudocode outlines the core class and operations for chaining, including insertion that checks for resizing.
class HashTable
list[] table // array of m lists
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.75
function HashTable(integer initial_m)
m = initial_m
n = 0
table = new list[m] // each entry is an empty list
for i = 0 to m-1
table[i] = empty list
end for
end function
function integer hash(key)
return h(key) mod m // stub for hash function
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
index = hash(key)
// Check if key exists and update value
for each entry in table[index]
if entry.key == key
entry.value = value
return
end if
end for
// Insert new entry at head of list
table[index].insert_front(new entry(key, value))
n = n + 1
end function
function search(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
return entry.value
end if
end for
return null // not found
end function
function delete(key)
index = hash(key)
for each entry in table[index]
if entry.key == key
remove entry from table[index]
n = n - 1
return
end if
end for
// Key not found
end function
function resize(integer new_m)
old_table = table
old_m = m
m = new_m
n = 0
table = new list[m]
for i = 0 to m-1
table[i] = empty list
end for
for i = 0 to old_m - 1
for each entry in old_table[i]
insert(entry.key, entry.value) // reinsert using new hash
end for
end for
end function
Open Addressing Implementation
Open addressing stores elements directly in the array without lists, using probing sequences to resolve collisions. Linear probing, a common method, computes the position of the i-th probe as (h(key) + i) mod m, where i is the probe number starting at 0. Deletions use tombstones (slots marked as "deleted") to avoid breaking search chains, allowing searches to continue past them while treating them as available slots for insertions. The table resizes similarly when the load factor exceeds a threshold, typically a lower one (e.g., 0.5) to mitigate clustering.[50] The pseudocode below shows operations for linear probing open addressing.
class OpenHashTable
entry[] table // array of m entries, each may be null, key-value pair, or deleted
integer m // number of buckets
integer n // number of elements
real load_threshold = 0.5
function OpenHashTable(integer initial_m)
m = initial_m
n = 0
table = new entry[m] // all null initially
end function
function integer hash(key, integer i)
return (h(key) + i) mod m // linear probing stub
end function
function insert(key, value)
if n / m > load_threshold
resize(2 * m)
end if
i = 0
first_free = -1 // first tombstone or empty slot seen during probing
repeat
j = hash(key, i)
if table[j] == null
if first_free == -1
first_free = j
end if
break // the key cannot appear beyond an empty slot
end if
if table[j] == deleted
if first_free == -1
first_free = j // reusable, but keep probing in case the key exists later
end if
else if table[j].key == key
table[j].value = value // key already present: update in place
return
end if
i = i + 1
until i == m
if first_free != -1
table[first_free] = new entry(key, value)
n = n + 1
return
end if
error "table overflow"
end function
function search(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return null // not found, end of chain
end if
if table[j] != deleted and table[j].key == key
return table[j].value
end if
i = i + 1
until i == m
return null
end function
function delete(key)
i = 0
repeat
j = hash(key, i)
if table[j] == null
return // not found
end if
if table[j] != deleted and table[j].key == key
table[j] = deleted // tombstone
n = n - 1
return
end if
i = i + 1
until i == m
end function
function resize(integer new_m)
old_table = table
m = new_m
n = 0
table = new entry[m] // all null
for each old_entry in old_table
if old_entry != null and old_entry != deleted
insert(old_entry.key, old_entry.value) // reinsert
end if
end for
end function
Language-Specific Considerations
In Java, the HashMap class implements a hash table using separate chaining for collision resolution, where each bucket contains a linked list of entries. The default load factor is set to 0.75, providing a balance between time and space efficiency by triggering resizing when the table reaches 75% capacity. Additionally, to mitigate performance degradation from long chains, HashMap converts a bucket's linked list to a balanced binary search tree (a red-black tree) when the chain exceeds a threshold of 8 elements, a feature introduced in Java 8 for improved worst-case lookup times. For concurrent environments, ConcurrentHashMap extends this design with fine-grained locking (lock striping in early versions; per-bin synchronization and CAS operations since Java 8), supporting fully concurrent reads and a high degree of concurrency for updates without blocking retrievals.
Python's built-in dict type (in the CPython implementation) employs open addressing with a perturbation-based probing strategy for collision resolution, implemented in C for efficiency. The probing sequence uses the recurrence index = ((5 * index) + 1 + perturb) % size, where perturb is derived from the key's hash and right-shifted by 5 bits on each iteration to add randomness and reduce clustering.[51] Hash randomization, enabled by default since Python 3.3, seeds the hash function for strings and bytes with a random value at process startup, making hash values unpredictable across runs and preventing hash-flooding denial-of-service attacks based on precomputed collisions. The set type is essentially an optimized dict variant, using the same underlying hash table structure but storing only keys without associated values, which allows for efficient membership testing and set operations.
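For user-defined key types, dict and set rely on __hash__ and __eq__ being consistent with each other; a brief sketch (the Point class is hypothetical):
# Equal objects must produce equal hash values, or dict lookups will miss.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))    # derived from the same fields as __eq__

index = {Point(1, 2): "checkpoint"}
print(index[Point(1, 2)])                # found: equal key, equal hash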
In C++, the standard std::unordered_map uses separate chaining, maintaining per-bucket lists of nodes whose buckets are determined by a user-supplied or default hash function. It supports customizable hash policies via the Hash template parameter and equality predicates via KeyEqual, enabling adaptations for user-defined types. The reserve method preallocates buckets to accommodate a specified number of elements without exceeding the maximum load factor, avoiding intermediate rehashes during insertion and improving performance for known sizes.
Implementing hash tables in these languages often involves challenges such as providing custom hashers and equality predicates to ensure proper distribution and collision handling for non-standard key types, particularly in C++, where the standard requires hash functions to be consistent with equality. Move semantics, introduced in C++11, add a further consideration: containers like std::unordered_map must transfer ownership of elements efficiently, avoiding unnecessary copies during resizing and insertion.
Specialized libraries address these issues with alternative strategies; for instance, Google's dense_hash_map from the Sparsehash library uses open addressing with quadratic probing, optimizing for dense storage and faster lookups at the cost of higher space usage under low loads. Similarly, Abseil's SwissTable implementation in C++ employs open addressing with SIMD-accelerated lookups over groups of metadata control bytes, offering superior performance to standard chaining in high-throughput scenarios.
Best practices for hash table usage include seeding hash functions with random values to defend against collision-based attacks like hash flooding, where adversaries craft inputs to degrade performance to O(n). Monitoring and maintaining load factors below 0.75 ensures efficient operation, with proactive resizing or reservation to prevent excessive collisions and rehashing overhead.
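As an illustration of seeding, a keyed hash drawn from a per-process random secret makes bucket indices unpredictable to an attacker; a minimal sketch using BLAKE2's keyed mode (the seed length and digest size are arbitrary assumptions):
import os
import hashlib

_SEED = os.urandom(16)   # per-process random key, never exposed to clients

def seeded_bucket(key: bytes, m: int) -> int:
    # keyed hash: without the seed, an attacker cannot precompute colliding keys
    digest = hashlib.blake2b(key, key=_SEED, digest_size=8).digest()
    return int.from_bytes(digest, "big") % m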