Hubbry Logo
search
logo

Universal hashing

logo
Community Hub0 Subscribers
Read side by side
from Wikipedia

In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction

[edit]

Assume we want to map keys from some universe into bins (labelled ). The algorithm will have to handle some data set of keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if , since the adversary may choose to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions is called a universal family if, .

In other words, any two different keys of the universe collide with probability at most when the hash function is drawn uniformly at random from . This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key.

Sometimes, the definition is relaxed by a constant factor, only requiring collision probability rather than . This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example[2]).

If we have an upper bound of on the collision probability, we say that we have -almost universality. So for example, a universal family has -almost universality.

Many, but not all, universal families have the following stronger uniform difference property:

, when is drawn randomly from the family , the difference is uniformly distributed in .

Note that the definition of universality is only concerned with whether , which counts collisions. The uniform difference property is stronger.

(Similarly, a universal family can be XOR universal if , the value is uniformly distributed in where is the bitwise exclusive or operation. This is only possible if is a power of two.)

An even stronger condition is pairwise independence: we have this property when we have the probability that will hash to any pair of hash values is as if they were perfectly random: . Pairwise independence is sometimes called strong universality.

Another property is uniformity. We say that a family is uniform if all hash values are equally likely: for any hash value . Universality does not imply uniformity. However, strong universality does imply uniformity.

Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in to the hash functions. (Similarly, if is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independent is sometimes not made.[3]

For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if is a strongly universal family with , then the family made of the functions for all is also strongly universal for . Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function is clearly universal, but the family made of the function fails to be universal.

UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.

Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function). A survey of fastest known universal and strongly universal hash functions for integers, vectors, and strings is found in.[6]

Mathematical guarantees

[edit]

For any fixed set of keys, using a universal family guarantees the following properties.

  1. For any fixed in , the expected number of keys in the bin is . When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key (for example a query, insertion or deletion).
  2. The expected number of pairs of keys in with that collide () is bounded above by , which is of order . When the number of bins, is chosen linear in (i.e., is determined by a function in ), the expected number of collisions is . When hashing into bins, there are no collisions at all with probability at least a half.
  3. The expected number of keys in bins with at least keys in them is bounded above by .[7] Thus, if the capacity of each bin is capped to three times the average size (), the total number of keys in overflowing bins is at most . This only holds with a hash family whose collision probability is bounded above by . If a weaker definition is used, bounding it by , this result is no longer true.[7]

As the above guarantees hold for any fixed set , they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.

The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some number of collisions. If it observes too many collisions, it chooses another random from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.

Constructions

[edit]

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").

Hashing integers

[edit]

This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be .

The original proposal of Carter and Wegman[1] was to pick a prime and define

where are randomly chosen integers modulo with . (This is a single iteration of a linear congruential generator.)

To see that is a universal family, note that only holds when

for some integer between and . Since , if their difference is nonzero and has an inverse modulo . Solving for yields

.

There are possible choices for (since is excluded) and, varying in the allowed range, possible non-zero values for the right hand side. Thus the collision probability is

.

Another way to see is a universal family is via the notion of statistical distance. Write the difference as

.

Since is nonzero and is uniformly distributed in , it follows that modulo is also uniformly distributed in . The distribution of is thus almost uniform, up to a difference in probability of between the samples. As a result, the statistical distance to a uniform family is , which becomes negligible when .

The family of simpler hash functions

is only approximately universal: for all .[1] Moreover, this analysis is nearly tight; Carter and Wegman [1] show that whenever .

Avoiding modular arithmetic

[edit]

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.[8] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[9]). The scheme assumes the number of bins is a power of two, . Let be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers (that fit in a word of bits). To evaluate , multiply by modulo and then keep the high order bits as the hash code. In mathematical notation, this is

This scheme does not satisfy the uniform difference property and is only -almost-universal; for any , .

To understand the behavior of the hash function, notice that, if and have the same highest-order 'M' bits, then has either all 1's or all 0's as its highest order M bits (depending on whether or is larger). Assume that the least significant set bit of appears on position . Since is a random odd integer and odd integers have inverses in the ring , it follows that will be uniformly distributed among -bit integers with the least significant set bit on position . The probability that these bits are all 0's or all 1's is therefore at most . On the other hand, if , then higher-order M bits of contain both 0's and 1's, so it is certain that . Finally, if then bit of is 1 and if and only if bits are also 1, which happens with probability .

This analysis is tight, as can be shown with the example and . To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme that picks higher-order bits

where is a random positive integer with and is a random non-negative integer with . This requires doing arithmetic on -bit unsigned integers. This version of multiply-shift is due to Dietzfelbinger, and was later analyzed more precisely by Woelfel.[10]

Hashing vectors

[edit]

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector of machine words (integers of bits each). If is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):

, where each is chosen independently at random.

If is a power of two, one may replace summation by exclusive or.[11]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of hash functions.[12] Initialize the hash function with a vector of random odd integers on bits each. Then if the number of bins is for :

.

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[11] Initialize the hash function with a vector of random odd integers on bits each. The following hash family is universal:[13]

.

If double-precision operations are not available, one can interpret the input as a vector of half-words (-bit integers). The algorithm will then use multiplications, where was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[14]

Strong universality at high speed is also possible.[15] Initialize the hash function with a vector of random integers on bits. Compute

.

The result is strongly universal on bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for .

Hashing strings

[edit]

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate is just the length of . As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.[11] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[15]

Now assume we want to hash , where a good bound on is not known a priori. A universal family proposed by [12] treats the string as the coefficients of a polynomial modulo a large prime. If , let be a prime and define:

, where is uniformly random and is chosen randomly from a universal family mapping integer domain .

Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows:[16]

uint hash(String x, int a, int p)
	uint h = INITIAL_VALUE
	for (uint i = 0; i < x.length; ++i)
		h = ((h*a) + x[i]) mod p
	return h

This Rabin-Karp rolling hash is based on a linear congruential generator.[17] Above algorithm is also known as Multiplicative hash function.[18] In practice, the mod operator and the parameter p can be avoided altogether by allowing integer to overflow because it is equivalent to mod (Max-Int-Value + 1) in many programming languages. However, using the non-prime modulus is prone to collisions with certain inputs, regardless of the value of a. Below table shows values chosen to initialize h and a for some of the popular implementations.

Implementation INITIAL_VALUE a
Bernstein's hash function djb2[19] 5381 33
STLPort 4.6.2 0 5
Kernighan and Ritchie's hash function[20] 0 31
java.lang.String.hashCode()[21] 0 31

Consider two strings and let be length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length . A collision before applying implies that is a root of the polynomial with coefficients . This polynomial has at most roots modulo , so the collision probability is at most . The probability of collision through the random brings the total collision probability to . Thus, if the prime is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).

Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.

Avoiding modular arithmetic

[edit]

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:[11]

  1. One chooses the prime to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with , while 's are 32-bit values.
  2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.
  3. One chooses a power-of-two as the divisor, allowing arithmetic modulo to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.

See also

[edit]

References

[edit]

Further reading

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
Universal hashing is a technique in computer science for selecting a hash function randomly from a family of such functions, where the family is designed to ensure that for any two distinct keys, the probability of them mapping to the same hash value is at most 1/m1/m, with mm denoting the number of possible hash outputs.[1] This property, known as 2-universality, guarantees that collisions occur with probability no higher than that of a truly random function, making it robust against worst-case inputs.[1] The concept was introduced by J. Lawrence Carter and Mark N. Wegman in their seminal 1977 paper, "Universal Classes of Hash Functions," presented at the 9th Annual ACM Symposium on Theory of Computing (STOC).[1] In this work, they defined universal classes (such as H1H_1, H2H_2, and H3H_3) that are both theoretically sound and practically evaluable, enabling input-independent average-case linear time for storage and retrieval operations in hash tables.[1] Their approach built on earlier ideas in randomized algorithms, including probabilistic selection methods by researchers like Michael Rabin, to provide performance bounds that hold regardless of data distribution.[1] Universal hashing underpins efficient implementations of data structures like chained hash tables, where it ensures expected O(1) lookup and insertion times even under adversarial conditions.[2] It has also influenced streaming algorithms for tasks such as frequency estimation and heavy hitters detection, leveraging fast 4-universal variants for high-speed processing.[3] In cryptography, universal hashing enables information-theoretically secure message authentication codes (e.g., the Wegman–Carter MAC) when combined with a secret key, while extensions like universal one-way hash functions provide computational security for applications such as digital signatures and compression.[4] Over time, refinements such as strongly universal hashing have emerged, offering tighter collision bounds (e.g., pairwise independence) while maintaining computational efficiency, with modern implementations achieving rates of 0.2 CPU cycles per byte for string hashing.[5]

Fundamentals

Definition and motivation

Universal hashing refers to a method of selecting a hash function randomly from a carefully designed family of functions to achieve desirable probabilistic properties in hash tables and related data structures. A universal hash family H\mathcal{H} is defined as a set of functions from a key universe UU to a slot universe MM such that, for any two distinct keys x,yUx, y \in U, the probability that a randomly chosen function hHh \in \mathcal{H} maps them to the same slot satisfies PrhH[h(x)=h(y)]1/M\Pr_{h \in \mathcal{H}}[h(x) = h(y)] \leq 1/|M|.[1] This bound ensures that collisions occur no more frequently than if the keys were mapped independently and uniformly at random to the slots. The primary motivation for universal hashing arises from the vulnerabilities of fixed hash functions, which can suffer from poor performance when the input distribution is unknown or adversarial. With a fixed hash function, an adversary aware of the function can deliberately choose keys that concentrate in few slots, leading to many collisions and degraded efficiency, such as Θ(n)\Theta(n) time per operation in the worst case for nn keys.[6] Universal hashing addresses this by randomizing the choice of function from the family at runtime (or preprocessing), providing worst-case expected performance guarantees that hold independently of the input, thus mitigating adversarial exploitation and ensuring average-case linear-time operations without assuming benign key distributions.[1] In hashing, collisions inevitably occur when multiple keys map to the same slot, but their probability determines the overall efficiency of structures like hash tables using chaining or open addressing. Classical analyses often assume uniform key distribution, but universal families bound the pairwise collision probability uniformly at 1/M1/|M| for any distinct pair, yielding an expected number of collisions per slot proportional to the load factor, regardless of key correlations.[1] This uniform bound simplifies performance proofs and enhances robustness in practice. To illustrate, consider a simple scenario with a hash table of 2 slots (M=2|M|=2) and keys from a small universe. A non-universal family might consist of two functions where a specific pair of keys always collides (e.g., both map the pair to slot 0 in both functions), yielding Pr[h(x)=h(y)]=1>1/2\Pr[h(x)=h(y)]=1 > 1/2 for that pair and allowing an adversary to exploit it. In contrast, a universal family ensures that for every distinct pair, at most half the functions cause a collision, bounding the probability at 1/21/2 and distributing keys more evenly in expectation.[1]

Historical development

Universal hashing was introduced by J. Lawrence Carter and Mark N. Wegman in the late 1970s as a response to the limitations of traditional hashing methods, which relied on heuristic choices and assumptions about input distributions that often failed in practice. Their work aimed to provide rigorous performance guarantees for hashing in randomized algorithms, ensuring average-case efficiency independent of the input data by selecting a hash function randomly from a carefully designed class. An extended abstract appeared in 1977 at the Symposium on the Theory of Computing, followed by the seminal full paper "Universal Classes of Hash Functions" published in 1979, where they defined universal hash families and presented efficient constructions for hashing integers.[7] In a follow-up publication in 1981, Wegman and Carter extended the framework by introducing strongly universal hash functions, which offer even tighter collision probability bounds and enable applications beyond basic storage, such as authentication codes and set equality testing. This paper, "New Hash Functions and Their Use in Authentication and Set Equality," demonstrated how these functions could achieve information-theoretically secure message authentication without relying on computational assumptions, marking a key milestone in connecting universal hashing to cryptography.[4] The concept gained practical traction in the 1990s through refinements focused on efficiency and implementation, influencing the design of hash tables and parallel algorithms. For instance, Dietzfelbinger et al. proposed high-performance universal hashing schemes in 1992 that supported constant-time evaluation and fast construction, facilitating their use in shared memory emulation and other data structures. Post-2000 developments emphasized hardware and software optimizations, such as Rogaway's PolyR construction in 2001, which enabled fast hashing of short messages with minimal key sizes, paving the way for efficient deployments in modern systems.[8][9]

Properties

Universal hash families

A universal hash family is a collection $ \mathcal{H} $ of hash functions from a universe $ U $ to a range $ M $ with $ |M| = m $, such that for any distinct $ x, y \in U $, the fraction of functions in $ \mathcal{H} $ that map $ x $ and $ y $ to the same value is at most $ 1/m $. Formally, $ \left| { h \in \mathcal{H} : h(x) = h(y) } \right| / |\mathcal{H}| \leq 1/m $. Equivalently, if $ h $ is selected uniformly at random from $ \mathcal{H} $, then $ \Pr[ h(x) = h(y) ] \leq 1/m $. This property ensures that collisions are no more likely than under a perfectly random mapping, providing a probabilistic guarantee independent of the specific keys. The definition implies a bound on the expected collision probability when using a random hash from the family. For any distinct pair $ x, y $, the probability of collision is directly upper-bounded by $ 1/m $, matching the collision rate of an ideal random function. To see the broader implication, consider a fixed key $ x $ and a set $ S \subseteq U \setminus {x} $ with $ |S| = k $. By linearity of expectation, the expected number of collisions between $ x $ and elements of $ S $ is $ \sum_{y \in S} \Pr[ h(x) = h(y) ] \leq k / m $. A proof sketch proceeds as follows: the indicator random variable $ I_y = 1 $ if $ h(x) = h(y) $ and 0 otherwise has expectation $ \mathbb{E}[I_y] = \Pr[ h(x) = h(y) ] \leq 1/m $, so the total expectation $ \mathbb{E}[ \sum I_y ] \leq k/m $. This average-case bound holds regardless of the distribution of keys in $ S $. While stronger notions like pairwise independence require $ h(x) $ and $ h(y) $ to be uniformly distributed and independent (yielding exactly $ 1/m $ collision probability), universality only demands the upper bound and thus allows for more efficient constructions with smaller families. Universality suffices for many applications, as the average collision rate of $ 1/m $ ensures good performance in expectation without the overhead of full independence. A simple example of a universal hash family for small universes is the shift-based family over $ U = M = {0, 1, \dots, m-1} $, defined as $ \mathcal{H} = { h_a \mid a = 0, 1, \dots, m-1 } $ where $ h_a(x) = (x + a) \mod m $. For any distinct $ x, y \in U $, $ h_a(x) = h_a(y) $ implies $ x + a \equiv y + a \pmod{m} $, or $ x \equiv y \pmod{m} $, which contradicts $ x \neq y $ since both are in $ {0, \dots, m-1} $. Thus, no function in $ \mathcal{H} $ causes a collision for this pair, so $ |{ h \in \mathcal{H} : h(x) = h(y) }| = 0 \leq m / m = 1 $, and the collision probability is $ 0 \leq 1/m $. This family achieves the universality bound (trivially, with zero collisions) and can be evaluated efficiently via modular arithmetic.

Strong and k-wise universality

A strongly universal hash family, also known as a 2-universal family with exact pairwise independence, requires that for any two distinct keys xyx \neq y in the universe UU and any two values a,ba, b in the range [m]={0,1,,m1}[m] = \{0, 1, \dots, m-1\}, the probability that a randomly selected hash function hh from the family satisfies h(x)=ah(x) = a and h(y)=bh(y) = b is exactly 1/m21/m^2. [1] This ensures that the hash values for distinct keys are perfectly independent and uniformly distributed, implying that the collision probability Pr[h(x)=h(y)]=1/m\Pr[h(x) = h(y)] = 1/m holds exactly. [10] This concept generalizes to k-wise universal hash families, where for any kk distinct keys x1,,xkUx_1, \dots, x_k \in U and any kk values v1,,vk[m]v_1, \dots, v_k \in [m], the probability Pr[h(x1)=v1,,h(xk)=vk]=1/mk\Pr[h(x_1) = v_1, \dots, h(x_k) = v_k] = 1/m^k exactly, mimicking the behavior of fully independent uniform random variables restricted to kk samples. [11] Such families provide progressively stronger guarantees of independence as kk increases, enabling analysis of higher-order interactions in hashing schemes. The primary benefits of strong and k-wise universality lie in their ability to yield tighter probabilistic bounds on multi-key collisions and dependent events compared to basic universality; for instance, strong universality directly implies the standard universal property, as Pr[h(x)=h(y)]=a=0m1Pr[h(x)=ah(y)=a]=a=0m1(1/m)2=1/m\Pr[h(x) = h(y)] = \sum_{a=0}^{m-1} \Pr[h(x) = a \land h(y) = a] = \sum_{a=0}^{m-1} (1/m)^2 = 1/m. [1] These enhanced properties support more precise performance guarantees in structures sensitive to variance or tail probabilities, such as those involving multiple insertions or queries. However, achieving k-wise universality for k>2k > 2 typically requires larger family sizes than 2-universal families, often scaling polynomially with kk and the universe size to maintain the exact independence, which can increase storage and selection overhead. [11] A high-level outline for constructing a 3-universal family involves parameterizing hash functions as low-degree polynomials over a suitable finite field, with random coefficients chosen to enforce the triple-wise independence condition. [12]

Constructions

Hashing integers

One of the foundational constructions for universal hash families targeting integer keys is the Carter-Wegman method, which maps elements from a universe $ U $ of integers to a range $ M = {0, 1, \dots, m-1} $. The hash function is defined as $ h_{a,b}(x) = ((a x + b) \mod p) \mod m $, where $ p $ is a prime number greater than $ |U| $, $ a \in {1, 2, \dots, p-1} $ (ensuring $ a \not\equiv 0 \mod p $), and $ b \in {0, 1, \dots, p-1} $, with $ a $ and $ b $ chosen uniformly at random.[1] This family $ H = { h_{a,b} } $ achieves 2-universality by bounding collisions to at most $ 1/m $.[1] The 2-universality of this construction is established in the original paper, showing that for any distinct integers $ x, y \in U $ with $ x \neq y $, the collision probability $ \Pr[h_{a,b}(x) = h_{a,b}(y)] \leq \frac{1}{m} $.[1] For practical implementation with $ w $-bit integers (where $ |U| \leq 2^w $), a suitable prime $ p $ is chosen greater than $ 2^w $, often a large prime to minimize computational overhead while ensuring $ p > |U| $. This choice allows efficient modular arithmetic using integer operations, achieving $ O(1) $ time complexity per hash evaluation on standard hardware, as multiplications and reductions modulo $ p $ (such as a Mersenne prime like $ 2^{61} - 1 $ for larger keys) can leverage bit shifts and additions. As an illustrative example, consider hashing 32-bit integers ($ w = 32 $, $ |U| = 2^{32} $) into $ m = 1024 $ buckets using $ p = 4294967311 $ (a prime). For distinct $ x $ and $ y $, the collision probability is at most $ \frac{1}{1024} \approx 0.000976 $, meaning that in a set of $ n $ keys, the expected number of colliding pairs is at most $ \binom{n}{2} / 1024 $, which remains manageable for moderate $ n $ (e.g., for 10,000 keys, at most around 48,800).

Hashing strings and sequences

A prominent construction for universal hashing of strings and sequences interprets the input as coefficients of a polynomial over a finite field, extending the linear hashing methods for integers discussed previously. Consider a string $ s = s_{l-1} \dots s_0 $ of length $ l $ over an alphabet $ \Sigma $ embedded into the finite field $ \mathbb{F}p $, where $ p $ is a large prime. The hash function evaluates the polynomial $ P_s(r) = \sum{i=0}^{l-1} s_i r^i \mod p $, with $ r $ chosen uniformly at random from $ {1, 2, \dots, p-1} $, and then applies a final modulo $ m $ operation to map into the range $ {0, 1, \dots, m-1} $, where $ m $ is the desired output size. This approach, originally proposed in the context of universal hash families, achieves strong (2-)universality because for any two distinct strings, the polynomials differ, ensuring the probability that $ P_s(r) \equiv P_t(r) \pmod{p} $ is at most $ 1/(p-1) $, and thus the overall collision probability is approximately $ 1/m $.[13] To enhance practicality, $ r $ is often selected as a primitive root modulo $ p $ to promote uniform distribution across the field, particularly when $ \Sigma $ is the ASCII alphabet with 128 symbols. The method draws inspiration from the Rabin-Karp algorithm for string matching, which employs a similar polynomial rolling hash but fixes parameters for deterministic use; randomizing $ r $ elevates it to a universal family suitable for hashing applications. For fixed-length strings, this directly yields the hash value; however, the construction preserves universality even without padding, as distinct sequences produce distinct polynomials. Strings of variable lengths are accommodated by padding shorter ones with zero symbols to a maximum length or, more efficiently, via the iterative form of the polynomial evaluation: start with $ h_0 = 0 $ and compute $ h_i = (r \cdot h_{i-1} + s_i) \mod p $ for $ i = 1 $ to $ l $, followed by modulo $ m $. This rolling hash variant not only handles variable lengths but also enables O(1) updates for sliding windows over sequences, maintaining a collision probability of at most $ 1/m $ for distinct keys under random $ r $. Such flexibility is crucial for sequence data where lengths vary, ensuring the hash family remains strongly universal across all possible inputs.[13] In practice, this polynomial hashing serves as a fingerprinting mechanism for applications like plagiarism detection, where ASCII text documents are processed to generate compact hashes of passages for similarity comparison; typical parameters include a 64-bit prime $ p \approx 2^{64} - 2^{32} + 1 $ and $ r = 31 $ or another small primitive root to balance speed and collision resistance.

Hashing vectors and tuples

One approach to hashing vectors involves applying universal hash functions component-wise. Consider a vector v=(v1,v2,,vd)\mathbf{v} = (v_1, v_2, \dots, v_d) where each viv_i belongs to a universe UU of integers. Select dd independent universal hash functions hi:U{0,1,,m1}h_i: U \to \{0, 1, \dots, m-1\} from a universal family, and define the hash as
h(v)=(i=1dhi(vi))modm. h(\mathbf{v}) = \left( \sum_{i=1}^d h_i(v_i) \right) \mod m.
This construction, known as vector multiply-shift when the hih_i are linear, ensures efficient computation for moderate dd, and can approximate low collision probabilities when vectors differ in few components. The universality may be preserved under stronger conditions on the hih_i, such as pairwise independence. For guaranteed 2-universality, more robust methods are preferred. For fixed-size vectors over finite fields or rings, the Toeplitz matrix method provides a compact representation. A Toeplitz matrix TT of size k×dk \times d (with m=pkm = p^k for prime pp) has constant values along each diagonal, allowing specification with O(k+d)O(k + d) random elements from Zm\mathbb{Z}_m. The hash is computed as h(v)=Tvmodmh(\mathbf{v}) = T \mathbf{v} \mod m, where the multiplication yields a vector result projected modulo mm. When TT is chosen uniformly at random from Toeplitz matrices, this family achieves strong universality, with collision probability exactly 1/m1/m for distinct v,w\mathbf{v}, \mathbf{w}, due to the linear independence properties over the ring.[14] These methods are particularly useful for hashing multi-attribute data, such as IP addresses represented as 4-tuples of 32-bit integers (e.g., v=(a.b.c.d)\mathbf{v} = (a.b.c.d) with each component hashed via multiply-shift modulo mm) or database records as fixed-length tuples of fields. In networking applications, this enables uniform distribution of 128-bit IPv6 addresses into smaller tables without pathological collisions.[15][16]

Applications and extensions

Use in data structures

Universal hashing plays a crucial role in enhancing the performance and robustness of hash tables by mitigating the risks of poor hash function choices that could lead to excessive collisions. In standard hash table implementations, a random hash function is selected from a universal family at initialization. This approach ensures that for any two distinct keys, the probability of collision is at most 1/m, where m is the table size, leading to an expected constant-time O(1) performance for insertions, deletions, and lookups when the load factor is kept below 1.[1] To maintain this efficiency, tables often rehash all elements using a new random function from the family upon reaching a high load factor, with the overall amortized cost remaining O(1) per operation due to the bounded expected chain lengths.[1] In open-addressing hash tables, universal hashing addresses the issue of clustering, where collisions cause long probe sequences. Double hashing serves as a practical variant, employing two independent universal hash functions h_1 and h_2 to generate probe sequences defined as h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod m for the i-th probe. This method approximates the behavior of ideal random probing, distributing probes more uniformly and avoiding the primary clustering seen in linear probing with fixed hashes, thereby achieving expected O(1/\alpha) probe lengths, where \alpha is the load factor. The universality of h_1 and h_2 ensures that the probe sequences for different keys are sufficiently independent, preventing worst-case degenerations. Cuckoo hashing further leverages universal hashing by using multiple (typically two) hash functions from a universal family to assign each key to one of several possible slots across two or more tables. During insertion, if a slot is occupied, the existing key is evicted and "kicked" to an alternative slot determined by its own hash functions, potentially triggering a chain of reassignments. This process succeeds with high probability, as the universal property guarantees that the graph of possible placements remains sparse, resulting in a failure probability that is exponentially small in the table size for load factors below 0.5.[17] Empirically, universal hashing in open-addressing tables exhibits reduced clustering and probe sequence lengths compared to fixed hash functions, particularly under adversarial key distributions, leading to more predictable and efficient performance in real-world applications like databases and caches.

Role in cryptography and beyond

Universal hashing plays a pivotal role in cryptography, particularly in constructing message authentication codes (MACs) that provide information-theoretic security. The Carter-Wegman MAC, introduced in the late 1970s, leverages strongly universal hash functions combined with a pseudorandom function or block cipher to authenticate messages, ensuring that an adversary cannot forge a valid tag with probability better than 1/2^k for a k-bit tag, even after seeing polynomially many message-tag pairs. This construction achieves information-theoretic security against existential forgery under chosen-message attacks, providing unconditional security even against computationally unbounded adversaries, making it a foundational primitive for unconditionally secure authentication.[18][19] Universal hash functions also underpin practical keyed hashing schemes for data integrity in protocols like HMAC and various stream ciphers. While standard HMAC relies on collision-resistant cryptographic hashes, variants and related constructions such as UMAC employ universal hash families to enable fast, parallelizable authentication with provable security bounds against forgery, processing messages at rates exceeding gigabits per second on modern hardware. In stream cipher-based authenticated encryption, such as ChaCha20-Poly1305, a universal hash like Poly1305 computes tags over ciphertexts, providing integrity while resisting quantum attacks when paired with appropriate keys, as the hash's security holds information-theoretically.[20][21][22] Beyond cryptography, universal hashing enhances probabilistic data structures for approximate membership queries, such as Bloom filters, where multiple universal hash functions map elements to bit positions, minimizing false positives through pairwise independence guarantees that bound collision probabilities to 1/m for an m-bit array. In streaming algorithms, universal hashes facilitate efficient estimation of distinct elements (F_0 moment) in massive data streams; for instance, the optimal space O(1/ε^2 log(1/δ) log n) algorithm uses a pairwise independent hash to sample and count unique items with (1±ε) approximation and failure probability δ, enabling real-time analysis without storing the entire stream.[23][24] Despite these strengths, universal hashing faces limitations in quantum settings, where Grover's algorithm can accelerate collision searches, necessitating quantum-resistant variants. Recent work in the 2020s has developed post-quantum universal hash families, such as those based on lattice problems or ε-almost universal constructions over finite fields, ensuring collision resistance against quantum adversaries while maintaining efficiency for MACs and sketches in hybrid classical-quantum protocols.[25]

References

User Avatar
No comments yet.