Linear probing

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel (and, independently, by Andrey Yershov[1]) and first analyzed in 1963 by Donald Knuth.
Along with quadratic probing and double hashing, linear probing is a form of open addressing. In these schemes, each cell of a hash table stores a single key–value pair. When the hash function causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a cell with a matching key or an empty cell.
As Thorup & Zhang (2012) write, "Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple."[2] Linear probing can provide high performance because of its good locality of reference, but is more sensitive to the quality of its hash function than some other collision resolution schemes. It takes constant expected time per search, insertion, or deletion when implemented using a random hash function, a 5-independent hash function, or tabulation hashing. Good results can also be achieved in practice with other hash functions such as MurmurHash.[3]
Operations
Linear probing is a component of open addressing schemes for using a hash table to solve the dictionary problem. In the dictionary problem, a data structure should maintain a collection of key–value pairs subject to operations that insert or delete pairs from the collection or that search for the value associated with a given key. In open addressing solutions to this problem, the data structure is an array T (the hash table) whose cells T[i] (when nonempty) each store a single key–value pair. A hash function is used to map each key into the cell of T where that key should be stored, typically scrambling the keys so that keys with similar values are not placed near each other in the table. A hash collision occurs when the hash function maps a key into a cell that is already occupied by a different key. Linear probing is a strategy for resolving collisions, by placing the new key into the closest following empty cell.[4][5]
Search
To search for a given key x, the cells of T are examined, beginning with the cell at index h(x) (where h is the hash function) and continuing to the adjacent cells h(x) + 1, h(x) + 2, ..., until finding either an empty cell or a cell whose stored key is x. If a cell containing the key is found, the search returns the value from that cell. Otherwise, if an empty cell is found, the key cannot be in the table, because it would have been placed in that cell in preference to any later cell that has not yet been searched. In this case, the search returns as its result that the key is not present in the dictionary.[4][5]
Insertion
To insert a key–value pair (x,v) into the table (possibly replacing any existing pair with the same key), the insertion algorithm follows the same sequence of cells that would be followed for a search, until finding either an empty cell or a cell whose stored key is x. The new key–value pair is then placed into that cell.[4][5]
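These search and insertion rules can be sketched in Python (illustrative only: a fixed-size table, Python's built-in hash standing in for the hash function h, and no deletion or resizing):

```python
# Minimal linear-probing table: each cell holds None or a (key, value) pair.
class LinearProbingTable:
    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size

    def _probe(self, key):
        # Yield the probe sequence h(x), h(x)+1, h(x)+2, ... modulo table size.
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key, value):
        for j in self._probe(key):
            # Stop at the first empty cell, or at a cell storing the same key.
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("table full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return None              # empty cell: key cannot be present
            if self.slots[j][0] == key:
                return self.slots[j][1]  # matching key found
        return None

t = LinearProbingTable()
t.insert("a", 1)
t.insert("b", 2)
assert t.search("a") == 1 and t.search("c") is None
```

Note that insertion follows exactly the same probe sequence as search, so a repeated insert of an existing key replaces its value in place rather than creating a duplicate.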
If the insertion would cause the load factor of the table (its fraction of occupied cells) to grow above some preset threshold, the whole table may be replaced by a new table, larger by a constant factor, with a new hash function, as in a dynamic array. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates. A common choice would be to double the table size when the load factor would exceed 1/2, causing the load factor to stay between 1/4 and 1/2.[6]
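This growth policy (threshold 1/2, growth factor 2, as described above) can be sketched as follows; the helper name and table representation are illustrative:

```python
# Grow the table when the load factor (fraction of occupied cells) would
# exceed 1/2: allocate a table twice as large and reinsert every pair.
def rehash_if_needed(slots, count):
    if (count + 1) / len(slots) <= 0.5:
        return slots                     # next insertion stays within threshold
    new = [None] * (2 * len(slots))      # constant growth factor of 2
    for pair in slots:                   # reinsert each pair at its new position
        if pair is not None:
            j = hash(pair[0]) % len(new)
            while new[j] is not None:    # linear probing in the new table
                j = (j + 1) % len(new)
            new[j] = pair
    return new

slots = [("x", 1), ("y", 2), None, None]  # 4-cell table at load factor 1/2
slots = rehash_if_needed(slots, 2)        # a third pair would exceed 1/2
assert len(slots) == 8
```

After doubling, the load factor drops back to at most 1/4, matching the 1/4–1/2 range noted above.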
Deletion
It is also possible to remove a key–value pair from the dictionary. However, it is not sufficient to do so by simply emptying its cell. This would affect searches for other keys that have a hash value earlier than the emptied cell, but that are stored in a position later than the emptied cell. The emptied cell would cause those searches to incorrectly report that the key is not present.
Instead, when a cell i is emptied, it is necessary to search forward through the following cells of the table until finding either another empty cell or a key that can be moved to cell i (that is, a key whose hash value is equal to or earlier than i). When an empty cell is found, then emptying cell i is safe and the deletion process terminates. But, when the search finds a key that can be moved to cell i, it performs this move. This has the effect of speeding up later searches for the moved key, but it also empties out another cell, later in the same block of occupied cells. The search for a movable key continues for the new emptied cell, in the same way, until it terminates by reaching a cell that was already empty. In this process of moving keys to earlier cells, each key is examined only once. Therefore, the time to complete the whole process is proportional to the length of the block of occupied cells containing the deleted key, matching the running time of the other hash table operations.[4]
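This deletion-by-shifting procedure can be sketched in Python for a table of (key, value) pairs with home position hash(key) % m (a sketch assuming the key is present and the table has at least one empty cell):

```python
# Delete a key, then repair the block of occupied cells that follows it.
def delete(slots, key):
    m = len(slots)
    j = hash(key) % m
    while slots[j][0] != key:          # find the cell holding the key
        j = (j + 1) % m
    slots[j] = None                    # empty it, then search for movable keys
    i, k = j, (j + 1) % m
    while slots[k] is not None:
        h = hash(slots[k][0]) % m
        # The key at k may move back to cell i only if its hash position is
        # not in the cyclic interval (i, k]; otherwise a later search for it,
        # starting at h, would stop early at the emptied cell i.
        if (k - h) % m >= (k - i) % m:
            slots[i], slots[k] = slots[k], None
            i = k                      # continue repairing from the new hole
        k = (k + 1) % m

# Example: keys 0 and 5 both hash to cell 0 in a 5-cell table.
slots = [(0, "a"), (5, "b"), None, None, None]
delete(slots, 0)                       # 5 moves back into cell 0
assert slots == [(5, "b"), None, None, None, None]
```

The modular comparison handles blocks that wrap around the end of the table; each occupied cell in the block is examined once, matching the running-time claim above.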
Alternatively, it is possible to use a lazy deletion strategy in which a key–value pair is removed by replacing the value by a special flag value indicating a deleted key. However, these flag values will contribute to the load factor of the hash table. With this strategy, it may become necessary to clean the flag values out of the array and rehash all the remaining key–value pairs once too large a fraction of the array becomes occupied by deleted keys.[4][5]
Properties
Linear probing provides good locality of reference, which causes it to require few uncached memory accesses per operation. Because of this, for low to moderate load factors, it can provide very high performance. However, compared to some other open addressing strategies, its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions.[4] Additionally, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes.[7] When used with low-quality hash functions that fail to eliminate nonuniformities in the input distribution, linear probing can be slower than other open-addressing strategies such as double hashing, which probes a sequence of cells whose separation is determined by a second hash function, or quadratic probing, where the size of each step varies depending on its position within the probe sequence.[8]
Analysis
Using linear probing, dictionary operations can be implemented in constant expected time. In other words, insert, remove and search operations can be implemented in O(1), as long as the load factor of the hash table is a constant strictly less than one.[9]
In more detail, the time for any particular operation (a search, insertion, or deletion) is proportional to the length of the contiguous block of occupied cells at which the operation starts. If all starting cells are equally likely, in a hash table with N cells, then a maximal block of k occupied cells will have probability k/N of containing the starting location of a search, and will take time O(k) whenever it is the starting location. Therefore, the expected time for an operation can be calculated as the product of these two terms, O(k2/N), summed over all of the maximal blocks of contiguous cells in the table. A similar sum of squared block lengths gives the expected time bound for a random hash function (rather than for a random starting location into a specific state of the hash table), by summing over all the blocks that could exist (rather than the ones that actually exist in a given state of the table), and multiplying the term for each potential block by the probability that the block is actually occupied. That is, defining Block(i,k) to be the event that there is a maximal contiguous block of occupied cells of length k beginning at index i, the expected time per operation is

$\sum_{i=1}^{N} \sum_{k \ge 0} \frac{k}{N} \, O(k) \Pr[\operatorname{Block}(i,k)].$
This formula can be simplified by replacing Block(i,k) by a simpler necessary condition Full(k), the event that at least k elements have hash values that lie within a block of cells of length k. After this replacement, the value within the sum no longer depends on i, and the 1/N factor cancels the N terms of the outer summation. These simplifications lead to the bound

$\sum_{k \ge 0} O(k^2) \Pr[\operatorname{Full}(k)].$
But by the multiplicative form of the Chernoff bound, when the load factor is bounded away from one, the probability that a block of length k contains at least k hashed values is exponentially small as a function of k, causing this sum to be bounded by a constant independent of n.[4] It is also possible to perform the same analysis using Stirling's approximation instead of the Chernoff bound to estimate the probability that a block contains exactly k hashed values.[5][10]
In terms of the load factor α, the expected time to perform a successful search on a random key is O(1 + 1/(1 − α)), and the expected time to perform an unsuccessful search (or the insertion of a new key) is O(1 + 1/(1 − α)2).[11] For constant load factors, with high probability, the longest probe sequence (among the probe sequences for all keys stored in the table) has logarithmic length.[12]
Primary Clustering
Linear probing hash tables suffer from a problem known as primary clustering, in which elements tend to group together into long contiguous runs.[13][11] In terms of the load factor α, the expected length of the run containing a given element is Θ(1/(1 − α)2).[11] This is why insertions take time Θ(1/(1 − α)2), as opposed to the Θ(1/(1 − α)) running time achieved by other open-addressed hash tables such as uniform probing and double hashing.[11] Primary clustering also affects unsuccessful searches, since like insertions, they must travel to the end of a run.[11] Some variations of linear probing are able to achieve better bounds for unsuccessful searches and insertions, by using techniques that reduce primary clustering.[14]
Choice of hash function
Because linear probing is especially sensitive to unevenly distributed hash values,[8] it is important to combine it with a high-quality hash function that does not produce such irregularities.
The analysis above assumes that each key's hash is a random number independent of the hashes of all the other keys. This assumption is unrealistic for most applications of hashing. However, random or pseudorandom hash values may be used when hashing objects by their identity rather than by their value. For instance, this is done using linear probing by the IdentityHashMap class of the Java collections framework.[15] The hash value that this class associates with each object, its identityHashCode, is guaranteed to remain fixed for the lifetime of an object but is otherwise arbitrary.[16] Because the identityHashCode is constructed only once per object, and is not required to be related to the object's address or value, its construction may involve slower computations such as the call to a random or pseudorandom number generator. For instance, Java 8 uses an Xorshift pseudorandom number generator to construct these values.[17]
For most applications of hashing, it is necessary to compute the hash function for each value every time that it is hashed, rather than once when its object is created. In such applications, random or pseudorandom numbers cannot be used as hash values, because then different objects with the same value would have different hashes. And cryptographic hash functions (which are designed to be computationally indistinguishable from truly random functions) are usually too slow to be used in hash tables.[18] Instead, other methods for constructing hash functions have been devised. These methods compute the hash function quickly, and can be proven to work well with linear probing. In particular, linear probing has been analyzed from the framework of k-independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any k-tuple of distinct keys to any k-tuple of indexes. The parameter k can be thought of as a measure of hash function quality: the larger k is, the more time it will take to compute the hash function but it will behave more similarly to completely random functions. For linear probing, 5-independence is enough to guarantee constant expected time per operation,[19] while some 4-independent hash functions perform badly, taking up to logarithmic time per operation.[7]
Another method of constructing hash functions with both high quality and practical speed is tabulation hashing. In this method, the hash value for a key is computed by using each byte of the key as an index into a table of random numbers (with a different table for each byte position). The numbers from those table cells are then combined by a bitwise exclusive or operation. Hash functions constructed this way are only 3-independent. Nevertheless, linear probing using these hash functions takes constant expected time per operation.[5][20] Both tabulation hashing and standard methods for generating 5-independent hash functions are limited to keys that have a fixed number of bits. To handle strings or other types of variable-length keys, it is possible to compose a simpler universal hashing technique that maps the keys to intermediate values and a higher quality (5-independent or tabulation) hash function that maps the intermediate values to hash table indices.[2][21]
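A sketch of tabulation hashing along these lines, for 32-bit keys split into four bytes (the seed and word width here are illustrative, not prescribed by the text):

```python
import random

# One table of 256 random 32-bit words per byte position of the key.
random.seed(1)
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tab_hash(key):
    h = 0
    for pos in range(4):
        byte = (key >> (8 * pos)) & 0xFF   # extract one byte of the key
        h ^= TABLES[pos][byte]             # index the per-position table, XOR
    return h
```

Each hash costs four table look-ups and three XOR operations; the resulting family is only 3-independent, which, as noted above, nevertheless suffices for constant expected time per operation with linear probing.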
In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined, for w-bit keys and a table with 2d cells, as h(x) = (ax mod 2w) >> (w − d) for a random odd multiplier a) was "the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs and also of good quality" whereas tabulation hashing produced "the lowest throughput".[3] They point out that each table look-up requires several cycles, making it more expensive than simple arithmetic operations. They also found MurmurHash to be superior to tabulation hashing: "By studying the results provided by Mult and Murmur, we think that the trade-off by tabulation (...) is less attractive in practice".
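For reference, the Multiply-Shift scheme from this comparison can be sketched as follows (the word size, table exponent, and seed are illustrative): multiply the key by a random odd constant, truncate the product to w bits, and keep the top d of those bits as the table index.

```python
import random

# Multiply-Shift for 64-bit keys and a table of 2**D cells.
random.seed(0)
W, D = 64, 10                        # word size and table-size exponent
A = random.getrandbits(W) | 1        # random odd multiplier

def multiply_shift(x):
    # Low W bits of A*x, then the top D of those bits.
    return ((A * x) & ((1 << W) - 1)) >> (W - D)

assert all(0 <= multiply_shift(x) < 2 ** D for x in range(1000))
```

A single multiplication and shift per key is what makes this family fast in practice, at some cost in independence guarantees compared to tabulation.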
History
The idea of an associative array that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of Konrad Zuse and Vannevar Bush,[22] but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn. Luhn used a different collision resolution method, chaining, rather than linear probing.[23]
Knuth (1963) summarizes the early history of linear probing. It was the first open addressing method, and was originally synonymous with open addressing. According to Knuth, it was first used by Gene Amdahl, Elaine M. McGraw (née Boehme), and Arthur Samuel in 1954, in an assembler program for the IBM 701 computer.[9] The first published description of linear probing is by Peterson (1957),[9] who also credits Samuel, Amdahl, and Boehme but adds that "the system is so natural, that it very likely may have been conceived independently by others either before or since that time".[24] Another early publication of this method was by Soviet researcher Andrey Ershov, in 1958.[25]
The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth.[9] Sedgewick calls Knuth's work "a landmark in the analysis of algorithms".[11] Significant later developments include a more detailed analysis of the probability distribution of the running time,[26][27] and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.[19][20]
References
- ^ Knuth, Donald E. (1996). "Memories of Andrei Ershov". Don Knuth's Home Page.
- ^ a b Thorup, Mikkel; Zhang, Yin (2012), "Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation", SIAM Journal on Computing, 41 (2): 293–331, doi:10.1137/100800774, MR 2914329
- ^ a b Richter, Stefan; Alvarez, Victor; Dittrich, Jens (2015), "A seven-dimensional analysis of hashing methods and its implications on query processing" (PDF), Proceedings of the VLDB Endowment, 9 (3): 293–331, doi:10.14778/2850583.2850585
- ^ a b c d e f g Goodrich, Michael T.; Tamassia, Roberto (2015), "Section 6.3.3: Linear Probing", Algorithm Design and Applications, Wiley, pp. 200–203
- ^ a b c d e f Morin, Pat (February 22, 2014), "Section 5.2: LinearHashTable: Linear Probing", Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 108–116, retrieved 2016-01-15
- ^ Sedgewick, Robert; Wayne, Kevin (2011), Algorithms (4th ed.), Addison-Wesley Professional, p. 471, ISBN 9780321573513. Sedgewick and Wayne also halve the table size when a deletion would cause the load factor to become too low, causing them to use a wider range [1/8,1/2] in the possible values of the load factor.
- ^ a b Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6–10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, vol. 6198, Springer, pp. 715–726, arXiv:1302.5127, doi:10.1007/978-3-642-14165-2_60
- ^ a b Heileman, Gregory L.; Luo, Wenbin (2005), "How caching affects hashing" (PDF), Seventh Workshop on Algorithm Engineering and Experiments (ALENEX 2005), pp. 141–154
- ^ a b c d Knuth, Donald (1963), Notes on "Open" Addressing, archived from the original on 2016-03-03
- ^ Eppstein, David (October 13, 2011), "Linear probing made easy", 0xDE
- ^ a b c d e f Sedgewick, Robert (2003), "Section 14.3: Linear Probing", Algorithms in Java, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.), Addison Wesley, pp. 615–620, ISBN 9780321623973
- ^ Pittel, B. (1987), "Linear probing: the probable largest search time grows logarithmically with the number of records", Journal of Algorithms, 8 (2): 236–249, doi:10.1016/0196-6774(87)90040-X, MR 0890874
- ^ Cormen, Thomas H.; Leiserson, Charles Eric; Rivest, Ronald Linn; Stein, Clifford (2009). Introduction to algorithms (3rd ed.). Cambridge, Massachusetts London, England: MIT Press. ISBN 978-0-262-53305-8.
- ^ Bender, Michael A.; Kuszmaul, Bradley C.; Kuszmaul, William (2022). "Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering". 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE. pp. 1171–1182. doi:10.1109/focs52979.2021.00115. ISBN 978-1-6654-2055-6. S2CID 260502788.
- ^ "IdentityHashMap", Java SE 7 Documentation, Oracle, retrieved 2016-01-15
- ^ Friesen, Jeff (2012), Beginning Java 7, Expert's voice in Java, Apress, p. 376, ISBN 9781430239109
- ^ Kabutz, Heinz M. (September 9, 2014), "Identity Crisis", The Java Specialists' Newsletter, 222
- ^ Weiss, Mark Allen (2014), "Chapter 3: Data Structures", in Gonzalez, Teofilo; Diaz-Herrera, Jorge; Tucker, Allen (eds.), Computing Handbook, vol. 1 (3rd ed.), CRC Press, p. 3-11, ISBN 9781439898536.
- ^ a b Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing, 39 (3): 1107–1120, arXiv:cs/0612055, doi:10.1137/070702278, MR 2538852
- ^ a b Pătraşcu, Mihai; Thorup, Mikkel (2011), "The power of simple tabulation hashing", Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638, S2CID 195776653
- ^ Thorup, Mikkel (2009), "String hashing for linear probing", Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA: SIAM, pp. 655–664, CiteSeerX 10.1.1.215.4253, doi:10.1137/1.9781611973068.72, MR 2809270
- ^ Parhami, Behrooz (2006), Introduction to Parallel Processing: Algorithms and Architectures, Series in Computer Science, Springer, 4.1 Development of early models, p. 67, ISBN 9780306469640
- ^ Morin, Pat (2004), "Hash tables", in Mehta, Dinesh P.; Sahni, Sartaj (eds.), Handbook of Data Structures and Applications, Chapman & Hall / CRC, p. 9-15, ISBN 9781420035179.
- ^ Peterson, W. W. (April 1957), "Addressing for random-access storage", IBM Journal of Research and Development, 1 (2), Riverton, NJ, USA: IBM Corp.: 130–146, doi:10.1147/rd.12.0130
- ^ Ershov, A. P. (1958), "On Programming of Arithmetic Operations", Communications of the ACM, 1 (8): 3–6, doi:10.1145/368892.368907, S2CID 15986378. Translated from Doklady AN USSR 118 (3): 427–430, 1958, by Morris D. Friedman. Linear probing is described as algorithm A2.
- ^ Flajolet, P.; Poblete, P.; Viola, A. (1998), "On the analysis of linear probing hashing" (PDF), Algorithmica, 22 (4): 490–515, doi:10.1007/PL00009236, MR 1701625, S2CID 5436036
- ^ Knuth, D. E. (1998), "Linear probing and graphs", Algorithmica, 22 (4): 561–568, arXiv:cs/9801103, doi:10.1007/PL00009240, MR 1701629, S2CID 12254909
Background
Hash tables and collision resolution
A hash table is a data structure that stores key-value pairs in an array of fixed size, using a hash function to compute an index for each key, thereby enabling average-case constant-time O(1) performance for insertions, deletions, and lookups.[11][12] The hash function transforms a key into a non-negative integer within the range of array indices, ideally distributing keys uniformly to minimize overlaps.[11] Collisions occur when two or more distinct keys map to the same array index via the hash function, a situation inevitable given the finite table size relative to the potentially vast key space.[11][13] For example, in a table of size 10, the keys "David" and "Eva" might both hash to index 4, requiring a resolution strategy to maintain data integrity and accessibility.[11]

To address collisions, hash tables employ resolution techniques such as separate chaining or open addressing. In separate chaining, each array slot holds a linked list, with colliding keys appended to the list at the hashed index; this approach simplifies implementation and supports high occupancy without severe degradation, though it incurs extra space for list pointers and may involve traversal costs.[11][13] Conversely, open addressing resolves collisions by searching for an unoccupied slot within the array itself, promoting space efficiency by eliminating pointer overhead, but risking longer probe sequences as the table fills, which can amplify access times.[11][12]

The load factor α, defined as the number of keys n divided by the table size m (α = n/m), quantifies table utilization and directly influences collision frequency; values approaching 1 heighten resolution demands, potentially eroding the O(1) performance guarantee.[11][12] Linear probing serves as one technique within the open addressing paradigm.[12]

Open addressing overview
Open addressing, also known as closed hashing, is a collision resolution technique in hash tables where all elements are stored directly within the table's array, without using external lists or chains. When a collision occurs—meaning the computed hash index for a key is already occupied—the method resolves it by systematically probing subsequent slots in the array to find an empty one, guided by a probe function $ h(k, i) $, where $ k $ is the key and $ i $ is the probe count starting from 0. This approach was first formalized as a practical method for random-access storage systems.[14]

Unlike separate chaining, which appends collided keys to linked lists at each array slot and incurs overhead from pointer storage, open addressing maintains a single contiguous array, reducing memory usage by avoiding pointers and enabling more compact data representation. However, this efficiency comes at the cost of potentially longer probe sequences as the load factor—the ratio of stored elements to array size—increases, since unsuccessful searches must traverse occupied slots until an empty one or the key is found.[15]

Several variants of open addressing differ in how the probe sequence is generated to mitigate clustering, the tendency of consecutive probes to examine nearby slots. Linear probing uses a constant step size of 1, advancing sequentially from the initial hash position. Quadratic probing employs a step size of $ i^2 $, producing a nonlinear sequence that spreads probes more evenly. Double hashing utilizes a secondary hash function $ h_2(k) $ to determine the step size, yielding the probe offset $ i \times h_2(k) \bmod m $, where $ m $ is the table size, for greater permutation quality.

The general insertion procedure in open addressing follows this skeleton:

function insert(key k):
i ← 0
repeat:
slot ← (h₁(k) + probe(k, i)) mod m
if table[slot] is empty or is deleted:
table[slot] ← k
return
i ← i + 1
until i == m // table full, handle resize or failure
Here, $ h_1 $ is the primary hash, and probe is the variant-specific function (e.g., probe(k, i) = i for linear). Search and verification follow a similar loop, terminating on match, empty slot, or exhaustion.[15][16]
Open addressing offers advantages in modern systems, particularly cache efficiency, as probe sequences often access contiguous memory locations, improving locality and reducing cache misses compared to scattered chain traversals. Nonetheless, it presents challenges in deletion, where simply removing an element could disrupt probe paths for other keys, necessitating markers like tombstones to preserve sequence integrity during searches. Additionally, the method is prone to clustering, where insertions form dense groups of occupied slots, exacerbating probe lengths and limiting effective load factors to around 0.7–0.8 for practicality.[17][18][15]
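The three probe-sequence variants described above differ only in the offset added to the home position; a compact sketch (the secondary hash h2 used for double hashing is illustrative, kept in [1, m−1] so the sequence always advances):

```python
# Generate the first m probe positions for each open-addressing variant.
def probe_sequence(key, m, kind):
    h1 = hash(key) % m
    h2 = 1 + (hash((key, "odd")) % (m - 1))   # illustrative secondary hash
    for i in range(m):
        if kind == "linear":
            yield (h1 + i) % m        # constant step of 1
        elif kind == "quadratic":
            yield (h1 + i * i) % m    # step grows quadratically with i
        else:
            yield (h1 + i * h2) % m   # step determined by a second hash

# Linear probing visits every slot exactly once over m probes.
assert sorted(probe_sequence(42, 8, "linear")) == list(range(8))
```

Linear probing's full coverage of the table in m consecutive probes is what its search and insertion termination conditions rely on; quadratic probing and double hashing trade some of that simplicity for reduced clustering.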
Core operations
Search procedure
In linear probing, the search procedure begins by computing the initial hash position $ h(k) = \operatorname{hash}(k) \bmod m $, where $ k $ is the key, $ \operatorname{hash} $ is the hash function, and $ m $ is the table size. The algorithm then probes sequentially forward from this position using the probe sequence $ h(k, i) = (h(k) + i) \bmod m $ for $ i = 0, 1, 2, \dots $, examining each slot until it either locates the key $ k $, encounters an empty slot (indicating the key is absent), or completes a full cycle through the table (indicating the table is full and the key is not present).[19][20]

The following pseudocode illustrates the search algorithm, incorporating the loop termination conditions:

Algorithm Search(table T, key k, size m)
i ← 0
while true do
j ← (hash(k) + i) mod m
if T[j] is empty then
return not found // Key absent
if T[j].key = k then
return found // Key located at j
i ← i + 1
if i = m then
return not found // Table full, key absent
end while
This procedure ensures that all potential positions in the probe sequence are checked without prematurely stopping unless justified by an empty slot or full cycle.[19][20]
Consider a hash table of size 4 with initial empty slots, after insertions of "apple" (hash value 1) and "banana" (hash value 3), resulting in slots: [empty, "apple", empty, "banana"]. Searching for "apple" starts at index 1, finds the key immediately, and returns success after one probe. In contrast, searching for "cherry" (hash value 0) probes index 0 (empty), terminates immediately, and returns not found.[21][22]
To support deletions without disrupting searches, deleted slots are marked as occupied (e.g., with a special "deleted" indicator) rather than empty, requiring the search to continue probing past them until an empty slot or the key is found.[19][20]
Insertion process
In linear probing, the insertion process begins by computing the initial hash value $ h'(k) $ for the key $ k $, where the hash table $ T $ has size $ m $. The algorithm then follows the probe sequence $ h(k, i) = (h'(k) + i) \bmod m $ for $ i = 0, 1, \dots $, sequentially checking each slot starting from $ h'(k) $. If the key $ k $ is found at any slot, the insertion terminates without adding a duplicate. Otherwise, the key is placed in the first empty slot encountered (typically denoted as NIL). If the probing completes a full cycle through all $ m $ slots without finding an empty one, the table is considered full, and an overflow error is raised.[23]

The following pseudocode outlines the insertion procedure, adapted from standard open-addressing implementations:

HASH-INSERT(T, k)
    i ← 0
    repeat
        j ← (h'(k) + i) mod m
        if T[j] is empty
            T[j] ← k
            return j
        else if T[j] = k
            return j    // key already present
        i ← i + 1
    until i = m
    error "hash table overflow"
This procedure ensures that insertions respect the probe sequences used for searches, preventing overwrites of existing keys.[23]
To maintain performance, practical implementations monitor the load factor $ \alpha = n/m $ (where $ n $ is the number of elements) and resize the table—typically by doubling $ m $ and rehashing all elements—when $ \alpha $ exceeds a threshold like 0.7, as higher loads significantly degrade probe lengths in linear probing.[24]
For example, consider a hash table of size $ m = 5 $ with an initial hash function where $ h'(\text{"dog"}) = 2 $, and slots 2 and 3 already occupied by other keys. The probe sequence starts at slot 2 (occupied), moves to slot 3 (occupied), then slot 4 (empty), inserting "dog" there.[23]
The insertion process relies on the same probing mechanism as search to detect duplicates and locate empty slots, ensuring consistency across operations without overwriting valid entries.[23]
Deletion handling
Deletion in linear probing hash tables presents a challenge because simply setting a slot to empty after removing a key can disrupt the probe sequences of other keys that would have continued searching through that position, potentially causing those keys to become unfindable during subsequent searches. To address this, the standard approach uses special markers known as tombstones or "deleted" flags placed in the slots of removed keys. These markers indicate that the slot is available for future insertions but is treated as occupied during searches to preserve the integrity of probe sequences.

The deletion algorithm begins by performing the standard linear probing search to locate the key. If the key is found, the slot's value is replaced with the deleted marker rather than being set to empty. Searches and insertions are modified accordingly: during a search, if a tombstone is encountered, the probing continues as if the slot were occupied by a non-matching key; for insertion, tombstones are considered available slots, allowing new keys to be placed there and potentially reusing the space without breaking existing chains.

The following pseudocode illustrates the deletion process:

function DELETE(key):
i ← 0
start ← HASH(key) mod m
while true do
j ← (start + i) mod m
if table[j] is empty:
return // key not found
if table[j] = key:
table[j] ← deleted // place tombstone
return
i ← i + 1
if i = m:
return // table full, key not found
end while
This ensures that probe sequences remain continuous for other elements.
However, repeated deletions lead to an accumulation of tombstones, which effectively increase the load factor by occupying slots without holding useful data, thereby lengthening average probe sequences and degrading overall performance over time. To mitigate this, implementations often monitor the number of tombstones and trigger periodic rehashing into a larger table or compaction to remove them when they reach a threshold, such as half the table size.
Consider an example with a hash table using linear probing and a simple hash function under which both "apple" and "banana" hash to position 2. "Apple" is inserted first, at position 2. "Banana" collides at 2, probes to 3 (empty), and is inserted there, with positions 1 and 4 onward still empty. Deleting "apple" by replacing position 2 with a tombstone allows a search for "banana" to start at 2 (a tombstone, so probing continues), reach 3, and find it correctly, whereas setting position 2 to empty would stop the search prematurely. A subsequent insertion of a new key hashing to 2 could then reuse the tombstone slot.
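The scenario above can be reproduced in code. The following is a minimal sketch, not taken from the cited sources: the table size, the tombstone sentinel, and the hash function that forces "apple" and "banana" to collide at slot 2 are all illustrative assumptions.

```java
// Minimal sketch of tombstone deletion in a linear-probing table.
// Slot states: null = empty, TOMBSTONE = deleted, anything else = a live key.
public class TombstoneDemo {
    static final String TOMBSTONE = new String("<deleted>"); // sentinel, compared by identity
    static final int M = 7;                                  // assumed table size
    static String[] table = new String[M];

    // Hypothetical hash: forces "apple" and "banana" to collide at slot 2.
    static int hash(String key) {
        return (key.equals("apple") || key.equals("banana"))
                ? 2 : ((key.hashCode() & 0x7fffffff) % M);
    }

    static void insert(String key) {
        int j = hash(key);
        while (table[j] != null && table[j] != TOMBSTONE)
            j = (j + 1) % M;       // probe past live keys
        table[j] = key;            // reuse an empty slot or a tombstone
    }

    static int find(String key) {  // returns the slot index, or -1 if absent
        int j = hash(key);
        for (int i = 0; i < M; i++) {
            if (table[j] == null) return -1;  // truly empty: search stops here
            if (table[j] != TOMBSTONE && table[j].equals(key)) return j;
            j = (j + 1) % M;       // tombstone or other key: keep probing
        }
        return -1;
    }

    static void delete(String key) {
        int j = find(key);
        if (j >= 0) table[j] = TOMBSTONE;     // mark, don't empty
    }

    public static void main(String[] args) {
        insert("apple");                      // lands at slot 2
        insert("banana");                     // collides at 2, lands at 3
        delete("apple");                      // slot 2 becomes a tombstone
        System.out.println(find("banana"));   // still reachable past the tombstone
        insert("apple");                      // re-insertion reuses the tombstone slot
        System.out.println(find("apple"));
    }
}
```

Note how the re-inserted "apple" lands back in the tombstone's slot, while the search for "banana" probes through it rather than stopping.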
Performance analysis
Load factor impact
The load factor α in linear probing is defined as the ratio of the number of stored elements n to the table size m, or α = n/m.[25] Linear probing achieves efficient performance when α remains below 0.7–0.8, but degrades sharply as α nears 1.0 due to the substantially longer probe sequences required for operations.[5][25] To sustain efficiency, implementations typically resize the table by doubling its size and rehashing all elements when α exceeds a threshold such as 0.7, which keeps the load factor in a range where probe lengths stay manageable.[26] Empirical analysis under uniform hashing shows that at α = 0.5, a successful search averages about 1.5 probes while an unsuccessful search or insertion requires around 2.5; at α = 0.9, these rise to approximately 5.5 and over 50 probes, respectively, highlighting the method's vulnerability at high loads.[25]

| Load Factor α | Successful Search (Probes) | Unsuccessful Search (Probes) |
|---|---|---|
| 0.25 | 1.2 | 1.4 |
| 0.50 | 1.5 | 2.5 |
| 0.67 | 2.0 | 5.0 |
| 0.75 | 2.5 | 8.5 |
| 0.90 | 5.5 | 50.5 |
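The table's values follow from the classical approximations for uniform hashing: about ½(1 + 1/(1 − α)) probes for a successful search and ½(1 + 1/(1 − α)²) for an unsuccessful one. A small sketch (illustrative, not from the cited sources) evaluates them:

```java
// Sketch: evaluate the classical approximations for expected probe counts
// in linear probing under uniform hashing, as tabulated above.
public class ProbeEstimates {
    // Expected probes for a successful search at load factor a.
    static double successful(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Expected probes for an unsuccessful search (or insertion) at load factor a.
    static double unsuccessful(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    public static void main(String[] args) {
        for (double a : new double[]{0.25, 0.50, 0.75, 0.90}) {
            System.out.printf("alpha=%.2f  successful~%.1f  unsuccessful~%.1f%n",
                              a, successful(a), unsuccessful(a));
        }
    }
}
```

At α = 0.5 this yields 1.5 and 2.5 probes, and at α = 0.9 it yields 5.5 and 50.5, reproducing the corresponding rows of the table.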
Probe sequence length
In linear probing, the expected number of probes for an unsuccessful search is approximately $\tfrac{1}{2}\left(1 + \tfrac{1}{(1-\alpha)^2}\right)$, where $\alpha$ denotes the load factor, the ratio of occupied slots to total table size. This formula arises from probabilistic analysis assuming uniform hashing, where the positions of empty slots follow a geometric distribution with success probability $1-\alpha$. Specifically, under the approximation of independent probes, the probability that a search requires exactly $k$ probes is $\alpha^{k-1}(1-\alpha)$, leading to the expected value $\sum_{k \ge 1} k\,\alpha^{k-1}(1-\alpha) = \tfrac{1}{1-\alpha}$; the more precise derivation for linear probing accounts for the sequential nature of probing by averaging over the squared term, yielding the quadratic form above.[27] For a successful search, the expected probe count simplifies to $\tfrac{1}{2}\left(1 + \tfrac{1}{1-\alpha}\right)$, derived similarly by considering the average displacement of inserted keys from their hash locations, again rooted in the geometric distribution of empty slots encountered during insertions. This reflects the fact that successful searches tend to probe fewer locations on average, as they terminate upon finding the key rather than an empty slot. In the worst case, a probe sequence can require up to $m$ probes, where $m$ is the table size, occurring when the table is nearly full or when long clusters force traversal of most slots. The distribution of probe lengths approximates an exponential (or discrete geometric) form at moderate load factors.[27] Theoretical predictions from these equations align closely with empirical results from simulations for load factors up to about 0.7, beyond which clustering effects begin to deviate outcomes from the ideal model.[28]

Clustering issues
Primary clustering explanation
Primary clustering is a phenomenon observed in linear probing hash tables in which keys that hash to nearby indices form long, contiguous runs of occupied slots, extending the probe sequences for subsequent insertions and searches in that region of the table. It occurs because the probe sequence in linear probing advances by a fixed step size of 1, so collided keys sequentially fill adjacent slots and create dense clusters around the initial hash locations. The root cause is this uniform linear progression: when multiple keys hash to slots in close proximity (say, indices 5, 6, or 7), they occupy a continuous block starting from the earliest available slot in that range, forcing later probes to traverse the entire cluster before finding an empty slot or the target key. For instance, in a hash table of size 13, keys hashing to 5 (occupies 5), then 6 (occupies 6), and another to 5 (collides, occupies 7) form a run over slots 5–7; a subsequent key hashing to 7 must now probe through 7 (occupied) to 8 and beyond, growing the cluster further with each similar collision.[4]

To illustrate cluster growth, envision a linear array representing the hash table slots (0 to 12), initially all empty (denoted ·). After inserting a key hashing to slot 3 (occupies 3: · · · X · · · · · · · · ·), a second key hashing to 4 (occupies 4: · · · X X · · · · · · · ·), and a third hashing to 3 (probes to 5: · · · X X X · · · · · · ·), a cluster forms at positions 3–5. A fourth key hashing to 2 occupies slot 2 and joins the run (· · X X X X · · · · · · ·).
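The slot diagrams above can be generated with a short simulation. This is an illustrative sketch; the table size of 13 and the hash values 3, 4, 3, 2 mirror the example, with '.' standing in for '·':

```java
// Sketch of the cluster-growth example: keys hashing to 3, 4, 3, then 2
// fill a contiguous run of slots in a size-13 linear-probing table.
public class ClusterDemo {
    // Insert a key whose hash value is h into table occ; returns the slot used.
    static int insertAt(boolean[] occ, int h) {
        int j = h;
        while (occ[j]) j = (j + 1) % occ.length;  // linear probe: fixed step of 1
        occ[j] = true;
        return j;
    }

    // Render the table: 'X' = occupied, '.' = empty.
    static String picture(boolean[] occ) {
        StringBuilder sb = new StringBuilder();
        for (boolean o : occ) sb.append(o ? 'X' : '.');
        return sb.toString();
    }

    public static void main(String[] args) {
        boolean[] table = new boolean[13];        // table size from the example
        for (int h : new int[]{3, 4, 3, 2})       // the example's hash values
            System.out.println("hash " + h + " -> slot "
                               + insertAt(table, h) + "  " + picture(table));
    }
}
```

Running it shows the third insertion (hash 3) being pushed to slot 5 and the fourth (hash 2) closing the run, leaving the contiguous cluster at slots 2–5.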
Over repeated insertions near positions 3–5, the cluster expands contiguously, as shown by the progressive filling of sequential slots.[29][30] This primary clustering differs from the secondary clustering seen in quadratic probing, where the issue arises specifically when multiple keys share the exact same initial hash value and thus follow identical probe sequences, rather than from the adjacency of different hash values; double hashing largely avoids both effects.[31]

Effects on performance
Primary clustering degrades the performance of linear probing by creating long runs of occupied slots, which significantly increase probe sequence lengths in affected regions. In the worst case, insertions or searches starting near a cluster can require traversing the entire run, leading to expected probe counts on the order of $1/(1-\alpha)^2$ for unsuccessful operations at load factor $\alpha$, far exceeding the constant expected time attainable under uniform-distribution assumptions. In practice, average probe lengths can be two to three times longer than in clustering-resistant methods such as double hashing at moderate load factors, and the variance of access times rises because some operations resolve quickly while others suffer extended traversals through clusters.[19] The sequential probes of linear probing benefit cache locality by accessing contiguous memory locations, often yielding fewer cache misses than scattered probing schemes. In clustered regions, however, long probe sequences can span multiple cache lines, incurring misses at cluster boundaries and amplifying the overall slowdown, particularly for larger tables where the spatial-locality advantage diminishes.[17] Empirical analyses confirm these effects, showing linear probing's average unsuccessful probe length rising sharply with clustering. For instance, at $\alpha = 0.7$, linear probing requires approximately 6 probes on average for an unsuccessful search, compared to about 3.3 for double hashing, roughly 1.85 times as many probes due to clustering. At higher loads such as $\alpha = 0.9$, the gap widens to about five times as many probes (50.5 vs. 10.0). The following table summarizes average unsuccessful probe lengths from theoretical models and simulations, highlighting the degradation:

| Load Factor $\alpha$ | Linear Probing (Clustered) | Double Hashing (Uniform) | Slowdown Factor |
|---|---|---|---|
| 0.5 | 2.5 | 2.0 | 1.25x |
| 0.7 | 6.1 | 3.3 | 1.85x |
| 0.9 | 50.5 | 10.0 | 5.05x |
Hash function selection
Ideal hash function properties
For optimal performance in linear probing, a hash function must generate values that are uniformly distributed across the table indices [0, m − 1], where m denotes the table size, reducing the probability of initial collisions among inserted keys.[33] This property aligns with the simple uniform hashing assumption, under which each key hashes independently and with equal probability to any slot, promoting even spread and limiting early clustering.[27]

Furthermore, the hash function should exhibit the avalanche effect: even a minor alteration in the input key, such as flipping a single bit, should produce a markedly different hash value, on average changing roughly half the output bits, so as to disrupt patterns among similar keys.[34] This diffusion quality ensures that related keys do not predictably map to nearby slots, which is particularly important in linear probing, where sequential searches amplify localized biases. To avoid inherent correlations between hash outputs and the linear probe increments (which could exacerbate clustering under modular arithmetic), the function also requires a degree of statistical independence from the probing mechanism; research establishes that 5-wise independent hashing suffices to achieve expected constant-time operations by bounding probe sequence lengths.[35]

Well-designed hash functions interact favorably with the load factor α (the ratio of occupied slots to table size): their uniformity and independence postpone the buildup of probe chains, sustaining efficient insertions and searches at higher values of α (typically up to around 0.5 to 0.7) before degradation sets in.[33] In implementations, the uniformity of candidate hash functions can be assessed empirically with a chi-squared goodness-of-fit test, which compares the observed frequency of hash values across bins against the expected uniform distribution; low p-values indicate non-random deviations that could impair probing efficiency.[36]

Practical choices and examples
For integer keys, a simple and common hash function in linear probing tables is the modulo operation, defined as $h(k) = k \bmod m$, where $k$ is the key and $m$ is the table size; this maps keys uniformly to slots when $m$ is chosen appropriately.[11] For example, in a table of size 10, the keys 5, 15, and 25 all hash to slot 5, requiring linear probing to resolve the collisions by checking subsequent slots.[4] For string keys, polynomial rolling hashes are widely used to treat the string as a base-$b$ number reduced modulo $m$, computed as $h(s) = \left( \sum_{i=0}^{n-1} s[i] \cdot b^{n-1-i} \right) \bmod m$, where $s[i]$ is the character code at position $i$, $n$ is the string length, and $b$ is typically 31 or 37 to balance computation speed and distribution quality.[37][38] Overflow is avoided by computing incrementally: initialize the hash to 0, then for each character $c$ update it as $\text{hash} = (\text{hash} \cdot b + c) \bmod m$:[37]

int hash = 0;
for (int i = 0; i < s.length(); i++) {
    hash = (hash * 31 + s.charAt(i)) % m;
}
In implementations like those in educational algorithms libraries, Java's built-in String.hashCode() method—which uses a polynomial rolling hash with base 31—is directly adapted for linear probing by taking the absolute value and modulo $ m $, ensuring compatibility with the table size while leveraging the language's standard for string hashing.[39] For instance, the LinearProbingHashST class computes the initial probe as hash = (key.hashCode() & 0x7fffffff) % m, then applies linear offsets.[39]
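The masking-and-modulo idiom described above can be sketched as follows (the table size of 16 and the key string are arbitrary illustrations, not from the cited implementation):

```java
// Sketch of the initial-probe idiom: mask off the sign bit of Java's
// hashCode(), then reduce modulo the table size m.
public class InitialProbe {
    // Returns a non-negative index in [0, m - 1] for the given key.
    static int probe(String key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 16;  // assumed table size for illustration
        System.out.println(probe("linear probing", m));
    }
}
```

The mask (rather than Math.abs) is used because Math.abs(Integer.MIN_VALUE) is still negative; the bitwise AND guarantees a non-negative value before the modulo.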
Practical trade-offs favor simple modulo or polynomial hashes for their computational speed in general-purpose linear probing tables, as they require only basic arithmetic operations and achieve good uniformity without excessive overhead.[22] In contrast, non-cryptographic hashes like MurmurHash offer superior avalanche properties for better distribution against patterned inputs, making them suitable for high-performance scenarios, but they involve more complex bit manipulations that increase computation time, rendering them overkill for typical non-security-sensitive uses. Cryptographic hashes, such as SHA-256, provide strong collision resistance but are significantly slower due to their security-focused design, and are rarely used in linear probing unless data integrity is paramount.[22]
During rehashing upon table resize, selecting a new size $m$ that is prime helps preserve hash uniformity, as primes minimize systematic biases in the modulo distribution for common key patterns, reducing clustering in linear probing.[22][40] For example, when doubling a table of size 100, choosing 211 (the next prime above 200) rather than 200 itself lets previously clustered probes spread more evenly under the same hash function.[22]
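A helper along these lines is a common way to implement prime-sized resizing; the doubling-then-next-prime policy below is one possible choice (an illustrative sketch, not prescribed by the cited sources):

```java
// Sketch: when resizing, grow the capacity (here by doubling) and round the
// new size up to the next prime to reduce modulo bias.
public class PrimeResize {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Smallest prime greater than or equal to n.
    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    public static void main(String[] args) {
        int oldSize = 100;
        int newSize = nextPrime(2 * oldSize);  // double, then round up to a prime
        System.out.println(newSize);           // prints 211
    }
}
```

After choosing the new prime size, every stored key is rehashed into the fresh table, which also discards any accumulated tombstones.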