Rolling hash
from Wikipedia

A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters; and similar to the way a Zobrist hash can be rapidly updated from the old hash value.

One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the rsync program, which uses a checksum based on Mark Adler's adler-32 as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.

At best, rolling hash values are pairwise independent[1] or strongly universal. They cannot be 3-wise independent, for example.

Polynomial rolling hash

The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:

H = c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_k a^{0},

where a is a constant, and c_1, \dots, c_k are the input characters (but this function is not a Rabin fingerprint, see below).

In order to avoid manipulating huge H values, all math is done modulo n. The choice of a and n is critical to get good hashing; in particular, the modulus n is typically a prime number. See linear congruential generator for more discussion.

Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum H by a. Shifting all characters by one position to the right requires dividing the entire sum H by a. Note that in modulo arithmetic, a can be chosen to have a multiplicative inverse a^{-1} by which H can be multiplied to get the result of the division without actually performing a division.
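
As a concrete illustration of the update rule above, the following Python sketch rolls the hash one character at a time (a hedged example: the function names and the choices a = 256 and n = 10^9 + 7 are assumptions for illustration, not part of the original description):

A = 256              # base a: any constant larger than the alphabet size (example choice)
N = 1_000_000_007    # modulus n: typically a large prime (example choice)

def hash_window(chars):
    # Hash c_1..c_k as c_1*a^(k-1) + ... + c_k, all modulo n (Horner's rule).
    h = 0
    for c in chars:
        h = (h * A + c) % N
    return h

def roll(h, outgoing, incoming, a_pow_k_minus_1):
    # Slide the window right by one character in O(1).
    h = (h - outgoing * a_pow_k_minus_1) % N   # subtract the first term
    return (h * A + incoming) % N              # shift left and add the new last term

data = b"abracadabra"
k = 4
apk = pow(A, k - 1, N)                         # precomputed a^(k-1) mod n
h = hash_window(data[:k])
for i in range(k, len(data)):
    h = roll(h, data[i - k], data[i], apk)
    assert h == hash_window(data[i - k + 1 : i + 1])   # matches a from-scratch hash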

Rabin fingerprint

The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2). Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC-32). The hash is the remainder after the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.

Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because this simpler rolling hash is also a polynomial, the two rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob sizes varying between 512KiB and 8MiB and a 64-byte window.[2][3]

Cyclic polynomial

Hashing by cyclic polynomial[4]—sometimes called Buzhash[5]—is also simple and it has the benefit of avoiding multiplications, using circular shifts instead. It is a form of tabulation hashing: it presumes that there is some substitution function h from characters to integers in the interval [0, 2^L). This substitution function might be a lookup table mapping characters to random integers. Let the function s be a cyclic binary rotation (or circular shift): it rotates the bits by 1 to the left, pushing the latest bit into the first position. E.g., s(10011) = 00111. Let \oplus be the bitwise exclusive or. The hash values are defined as

H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \cdots \oplus s(h(c_{k-1})) \oplus h(c_k),

resulting in a number in [0, 2^L).

Computing the hash values in a rolling fashion is done as follows. Let H be the previous hash value. Rotate H once: H \leftarrow s(H). If c_1 is the character to be removed, rotate it k times: s^k(h(c_1)). Then simply set

H \leftarrow s(H) \oplus s^k(h(c_1)) \oplus h(c_{k+1}),

where c_{k+1} is the new character.

Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first L - k + 1 bits. That is, take the result H and dismiss any k - 1 consecutive bits.[1] In practice, this can be achieved by an integer division H \rightarrow H \div 2^{k-1}.

The backup software Borg uses a Buzhash for splitting files, with blob sizes varying between 512KiB and 8MiB, a 4095-byte window,[6] and a non-balanced substitution function.[7]
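
A minimal Python sketch of the cyclic-polynomial (Buzhash) scheme just described (illustrative only, not Borg's implementation; the 64-bit word size, the random substitution table, and the window length are assumptions):

import random

L = 64                                                   # word size in bits
MASK = (1 << L) - 1
random.seed(0)
H_TABLE = [random.getrandbits(L) for _ in range(256)]    # substitution function h

def s(x, r=1):
    # circular left rotation of an L-bit word by r bits
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK

def buzhash(window):
    # H = s^(k-1)(h(c_1)) xor ... xor s(h(c_{k-1})) xor h(c_k)
    k = len(window)
    h = 0
    for i, c in enumerate(window):
        h ^= s(H_TABLE[c], k - 1 - i)
    return h

def roll(h, outgoing, incoming, k):
    # rotate once, remove the outgoing character (rotated k times), add the new one
    return s(h, 1) ^ s(H_TABLE[outgoing], k) ^ H_TABLE[incoming]

data = b"the quick brown fox"
k = 8
h = buzhash(data[:k])
for i in range(k, len(data)):
    h = roll(h, data[i - k], data[i], k)
    assert h == buzhash(data[i - k + 1 : i + 1])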

Content-based slicing using a rolling hash

One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed size windows to become updated, while in reality, only the first "chunk" has been modified.[3]

A simple approach to making dynamic chunks is to calculate a rolling hash, and if the hash value matches an arbitrary pattern (e.g. all zeroes) in the lower N bits (with a probability of 2^{-N}, given the hash has a uniform probability distribution) then it's chosen to be a chunk boundary. Each chunk will thus have an average size of 2^N bytes. This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries.
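
The following Python sketch shows one way such a chunker could be put together (a hedged illustration, not any particular tool's algorithm; the window length, the Buzhash-style rolling hash, the 13-bit mask, and the minimum/maximum chunk sizes are all assumed example parameters):

import random

WINDOW = 48                  # rolling-window length in bytes (example choice)
N_BITS = 13                  # cut when the lowest 13 bits are zero -> ~8 KiB average chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536
L = 64
WORD = (1 << L) - 1
random.seed(1)
TABLE = [random.getrandbits(L) for _ in range(256)]

def _rol(x, r):
    r %= L
    return ((x << r) | (x >> (L - r))) & WORD

def chunks(data):
    # Split data into content-defined chunks: a boundary is declared where the
    # rolling hash of the last WINDOW bytes has its lowest N_BITS bits all zero.
    start, h = 0, 0
    for i, b in enumerate(data):
        h = _rol(h, 1) ^ TABLE[b]                        # bring the new byte in
        if i - start >= WINDOW:
            h ^= _rol(TABLE[data[i - WINDOW]], WINDOW)   # drop the byte leaving the window
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & ((1 << N_BITS) - 1)) == 0) or size >= MAX_CHUNK:
            yield data[start : i + 1]
            start, h = i + 1, 0                          # restart the window inside the new chunk
    if start < len(data):
        yield data[start:]

Because the boundary decision depends only on the bytes inside the small window, an insertion near the start of the input moves at most the first boundary or two; later cut points are rediscovered at the same content positions.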

Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes.[8] The backup software Borg uses the Buzhash algorithm with a customizable chunk size range for splitting file streams.[6]

Such content-defined chunking is often used for data deduplication.[3][6]

Content-based slicing using moving sum

Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum:[9]

S(n) = \sum_{i = n - 8195}^{n} c_i,

H(n) = S(n) \bmod 4096,

where

  • S(n) is the sum of 8196 consecutive bytes ending with byte n (requires 21 bits of storage),
  • c_i is byte i of the file,
  • H(n) is a "hash value" consisting of the bottom 12 bits of S(n).

Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.

For every n where H(n) == 0, these programs cut the file between n and n + 1. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but no other chunk.
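
A short Python sketch of this moving-sum slicing (illustrative only; the 8196-byte window and 12-bit check come from the description above, the rest is assumed):

WINDOW = 8196          # number of consecutive bytes summed per position
H_MASK = 0xFFF         # bottom 12 bits of S(n)

def rsyncable_cut_points(data):
    # Return every position n at which H(n) == 0, i.e. where the file would be cut.
    cuts = []
    s = 0
    for n, byte in enumerate(data):
        s += byte                        # add the incoming byte
        if n >= WINDOW:
            s -= data[n - WINDOW]        # subtract the byte that left the window
        if (s & H_MASK) == 0:            # H(n) = S(n) mod 4096
            cuts.append(n)
    return cuts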

Gear fingerprint and content-based chunking algorithm FastCDC

Chunking is a technique to divide a data stream into a set of blocks, also called chunks. Content-defined chunking (CDC) is a chunking technique in which the division of the data stream is not based on fixed chunk size, as in fixed-size chunking, but on its content.

The content-defined chunking algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined value. However, comparing a string byte-by-byte introduces heavy computation overhead. FastCDC[10] proposes a new and efficient content-defined chunking approach. It uses a fast rolling Gear hash algorithm,[11] skipping the minimum length, normalizing the chunk-size distribution, and, last but not least, rolling two bytes each time to speed up the CDC algorithm, which can achieve about 10X higher throughput than the Rabin-based CDC approach.[12]

The basic version pseudocode is provided as follows:

algorithm FastCDC
    input: data buffer src,
           data length n
    output: cut point i

    MinSize ← 2KB     // split minimum chunk size is 2 KB
    MaxSize ← 64KB    // split maximum chunk size is 64 KB
    Mask ← 0x0000d93003530000LL
    fp ← 0
    i ← 0

    // buffer size is less than minimum chunk size
    if n ≤ MinSize then
        return n
    if n ≥ MaxSize then
        n ← MaxSize

    // Skip the first MinSize bytes, and kickstart the hash
    while i < MinSize do
        fp ← (fp << 1) + Gear[src[i]]
        i ← i + 1

    while i < n do
        fp ← (fp << 1) + Gear[src[i]]
        if !(fp & Mask) then
            return i
        i ← i + 1

    return i

Here the Gear array is a pre-calculated hashing array. FastCDC uses the Gear hashing algorithm, which can calculate the rolling hash results quickly while keeping the hash values as uniformly distributed as Rabin hashing. Compared with the traditional Rabin hashing algorithm, it achieves a much faster speed. Experiments suggest that it can generate nearly the same chunk-size distribution in a much shorter time (about 1/10 of Rabin-based chunking[12]) when segmenting the data stream.
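
For illustration, a Gear table of this kind can be built by assigning one random 64-bit value to each possible byte, after which every rolling step is a single shift and add (a hedged Python sketch under those assumptions, not the table or code used by FastCDC itself):

import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # one random 64-bit entry per byte value
MASK64 = (1 << 64) - 1

def gear_roll(fp, byte):
    # One Gear step: fp = (fp << 1) + Gear[byte], truncated to 64 bits.
    return ((fp << 1) + GEAR[byte]) & MASK64

fp = 0
for b in b"example data stream":
    fp = gear_roll(fp, b)
    # a chunker would now test fp against a mask, e.g. cut when fp & Mask == 0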

Computational complexity

All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position. In particular, computing the Rabin–Karp rolling hash of a string of length k requires O(k) modular arithmetic operations, and hashing by cyclic polynomials requires O(k) bitwise exclusive ors and circular shifts.[1]

from Grokipedia
A rolling hash is a hash function designed to compute hash values for consecutive substrings or sliding windows of a string or data stream in constant time per update, by reusing previous computations as the window shifts, typically through polynomial or checksum-based methods. In the polynomial rolling hash approach, commonly used in string processing, a string s of length n is mapped to an integer via the formula H(s) = \sum_{i=0}^{n-1} s_i \cdot p^i \mod m, where s_i represents the numeric value of the i-th character, p is a base (often a prime like 31), and m is a large prime modulus (e.g., 10^9 + 9) to reduce collisions. Substring hashes are efficiently derived from precomputed prefix hashes using modular arithmetic and powers of p, enabling O(1) queries after O(n) preprocessing. To minimize collision probability (approximately 1/m), multiple independent hash functions are often combined, such as using two different bases and moduli. Rolling hashes power key algorithms in pattern matching and data synchronization; for instance, the Rabin–Karp algorithm employs them to search for a pattern of length m in a text of length n in expected O(n + m) time by comparing rolling hashes of text windows against the pattern's hash, verifying exact matches only on hash collisions. Beyond string search, they facilitate applications like plagiarism detection through substring similarity checks and counting distinct substrings in large texts. In file synchronization, tools like rsync use a weak 32-bit rolling checksum alongside a strong hash (e.g., MD5) to identify matching blocks across files efficiently over networks, transmitting only differing portions to minimize bandwidth usage. This technique, introduced in the context of efficient string matching by Michael O. Rabin and Richard M. Karp in 1987, remains foundational in computer science due to its balance of simplicity and performance.

Fundamentals

Definition and Motivation

A rolling hash function is a specialized hashing technique designed to compute the hash value of a fixed-length substring or sliding window within a larger string or data stream by incrementally updating the previous hash value as the window advances by one position, thereby avoiding the computational overhead of recalculating the entire hash from scratch for each new window. This approach leverages modular arithmetic to ensure efficient updates, typically in constant time per slide, making it suitable for processing sequential data where overlapping substrings need repeated hashing. The primary motivation for rolling hashes stems from the inefficiency of naive substring hashing methods in applications requiring multiple overlapping window comparisons, such as string pattern matching or data deduplication. In a naive implementation, hashing each of the (n - m + 1) possible windows of size m in an n-length string would incur O((n - m + 1) * m) time complexity due to the per-window recomputation cost. By contrast, rolling hashes reduce this to O(n + m) overall time through incremental updates, enabling scalable processing of large datasets where frequent substring evaluations are essential. Rolling hashes originated in the 1980s amid developments in efficient string processing algorithms, with a seminal application in the Rabin-Karp string-searching algorithm introduced by Richard M. Karp and Michael O. Rabin in 1987. This work built on earlier linear-time matching techniques like Knuth-Morris-Pratt (1977) but innovated by using randomized fingerprinting—essentially an early form of rolling hashing—to achieve constant-space efficiency and near-real-time performance for pattern detection in texts and multidimensional arrays. As a basic illustration, consider hashing consecutive windows of length k in a string over an alphabet of size b by treating each window as the evaluation of a polynomial of degree (k-1) with coefficients from the character values, computed modulo a large prime p; the hash for the next window is then obtained by subtracting the outgoing character's contribution, multiplying by b, and adding the incoming character's value, all modulo p.

Advantages over Naive Methods

Rolling hash provides significant efficiency gains over naive methods for computing hashes of overlapping substrings, primarily through its incremental update mechanism. In naive approaches, hashing each possible window of length k in a string of length n requires recomputing the hash from scratch for every position, resulting in a time complexity of O(nk). By contrast, rolling hash computes an initial hash in O(k) time and then updates the hash value in O(1) time per window slide by subtracting the contribution of the outgoing character and adding the incoming one, multiplied by appropriate powers of the base, yielding a total of O(n + k) or simply O(n) for large n. This linear-time performance makes rolling hash particularly suitable for large-scale processing tasks where k is substantial relative to n. In terms of space efficiency, rolling hash maintains only a single running hash value at any time, requiring constant additional memory beyond the input string itself. Naive methods, while not necessarily storing all hashes, incur repeated full computations that indirectly burden memory through redundant operations on the entire window data for each position. This minimal memory footprint of rolling hash facilitates its use in resource-constrained environments, such as embedded systems or when processing massive datasets where storing intermediate results for all windows would be prohibitive. Practically, rolling hash enables real-time processing of data streams by allowing continuous updates without restarting computations, which is infeasible with naive full rehashes at each step. For instance, in plagiarism detection, rolling hash can scan long documents to identify similar passages by fingerprinting overlapping windows efficiently, avoiding the exhaustive hashing of every possible window that would render naive methods impractical for texts exceeding thousands of characters. Additionally, when implemented with modular arithmetic over large primes, rolling hash supports probabilistic matching with low collision rates, further enhancing its reliability in applications requiring quick filtering before exact verification.

Core Techniques

Polynomial Rolling Hash

The polynomial rolling hash function computes a numeric value for a fixed-length window by interpreting the characters as coefficients of a polynomial evaluated at a chosen base b, taken modulo a large prime p. For a substring s = s_1 s_2 \dots s_k where each s_i is mapped to an integer between 0 and the alphabet size minus 1, the initial hash h for the window of length k is

h = s_1 \cdot b^{k-1} + s_2 \cdot b^{k-2} + \dots + s_k \cdot b^0 \pmod{p}.

This formulation allows the hash to represent the substring uniquely with high probability when p is sufficiently large, facilitating quick comparisons in applications like pattern matching. The initial computation can be performed efficiently in O(k) time using Horner's rule, which rewrites the expression as h = ((s_1 \cdot b + s_2) \cdot b + \dots + s_k) \pmod{p}. To slide the window by one position to the substring s_2 s_3 \dots s_{k+1}, the hash is updated in constant time without recomputing from scratch. The new hash h' is calculated as

h' = (h - s_1 \cdot b^{k-1}) \cdot b + s_{k+1} \pmod{p}.

If the subtraction yields a negative value, add p to ensure non-negativity before the multiplication and reduction. This update leverages the polynomial structure, effectively "shifting" the coefficients by removing the leading term, multiplying by b, and appending the new character. Precomputing b^{k-1} \pmod{p} once allows all updates to run in O(1) time after the initial hash. The parameters b and p are selected to balance computational efficiency, hash range, and collision resistance. The base b is typically a small prime larger than the alphabet size, such as 131 for general text or 256 for direct byte values in ASCII strings, ensuring the polynomial spreads values evenly. The modulus p is chosen as a large prime, often 10^9 + 7, to confine the hash to about 30 bits while approximating uniform distribution over [0, p-1]. For random strings, the probability of two distinct substrings having the same hash (a collision) is roughly 1/p; to improve safety in sensitive applications, double-hashing with two independent moduli p_1 and p_2 can reduce this to approximately 1/(p_1 p_2). The following pseudocode illustrates initialization and sliding for a string s (0-indexed) of length n with window size k:

function initialize_hash(s, k, b, p):
    h = 0
    for i = 0 to k-1:
        h = (h * b + ord(s[i])) % p
    return h

function slide_hash(h, s_old, s_new, b, bk_minus_1, p):
    h = (h - (ord(s_old) * bk_minus_1) % p + p) % p   // reduce the product first so +p guarantees a non-negative result
    h = (h * b + ord(s_new)) % p                      // shift and add the incoming character
    return h

// Usage example:
bk_minus_1 = pow(b, k-1, p)                           // Precompute b^{k-1} mod p
h = initialize_hash(s, k, b, p)
for i = k to n-1:
    h = slide_hash(h, s[i-k], s[i], b, bk_minus_1, p)
    // Use h for window s[i-k+1 .. i]


This approach enables linear-time processing of all n - k + 1 windows after O(k) setup. The method forms the basis for more advanced techniques like the Rabin fingerprint, which provides rigorous error bounds for probabilistic string comparison.

Rabin Fingerprint

The Rabin fingerprint extends the polynomial rolling hash by computing the hash value as a pair (h mod p₁, h mod p₂), where p₁ and p₂ are two large distinct primes, providing a compact representation for probabilistic string comparison. The update mechanism follows the same sliding-window principle as the base polynomial hash but is performed independently modulo each prime: subtract the outgoing character's contribution scaled by b^{k-1} modulo p_i, multiply the result by b modulo p_i, and add the incoming character modulo p_i, for i = 1, 2, enabling constant-time shifts across overlapping substrings. In the Rabin-Karp algorithm for string searching, this fingerprint pair for the pattern is compared against pairs for each m-length substring of the text, where m is the pattern length, achieving O(n + m) average-case time complexity for text length n, since equal fingerprints suggest potential matches (verified exactly) while unequal pairs confirm non-matches with overwhelming probability. The collision probability, where distinct strings produce the same fingerprint pair, is roughly 1/(p₁ p₂); for instance, with p₁ ≈ p₂ ≈ 10⁹, this yields an error rate of approximately 10⁻¹⁸, sufficiently low for most applications without increasing computational overhead significantly. This method originated in Michael O. Rabin's 1981 work on fingerprinting via random polynomials and was formalized for string searching by Richard M. Karp and Michael O. Rabin in 1987.
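
A brief Python sketch of the double-modulus fingerprinting described here (an illustration, not the GF(2) Rabin fingerprint of the Wikipedia section above; the base and the two primes are arbitrary example values):

P1, P2 = 1_000_000_007, 998_244_353   # two large distinct primes (example values)
B = 256                               # base larger than the byte alphabet

def fingerprint(window):
    # Return the pair (h mod p1, h mod p2) for a window of bytes.
    h1 = h2 = 0
    for c in window:
        h1 = (h1 * B + c) % P1
        h2 = (h2 * B + c) % P2
    return h1, h2

def roll(fp, outgoing, incoming, pow1, pow2):
    # Slide the window by one byte, updating each component independently.
    h1, h2 = fp
    h1 = ((h1 - outgoing * pow1) * B + incoming) % P1
    h2 = ((h2 - outgoing * pow2) * B + incoming) % P2
    return h1, h2

text, pattern = b"abracadabra", b"cada"
m = len(pattern)
pow1, pow2 = pow(B, m - 1, P1), pow(B, m - 1, P2)
target = fingerprint(pattern)
fp = fingerprint(text[:m])
for i in range(m, len(text) + 1):
    if fp == target and text[i - m : i] == pattern:   # verify exactly on a fingerprint match
        print("match at", i - m)
    if i < len(text):
        fp = roll(fp, text[i - m], text[i], pow1, pow2)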

Variants and Extensions

Cyclic Polynomial Hash

The cyclic polynomial hash, also known as Buzhash, modifies the standard rolling hash by treating the sliding window as a cyclic sequence of bits, employing bit rotations and XOR operations instead of multiplications to achieve efficient updates and improved handling of wrap-around effects. This approach draws from polynomial representations over finite fields, where cyclic shifts correspond to multiplication by primitive elements, ensuring the hash value remains consistent under rotations of the input. Unlike linear polynomial methods, it avoids edge biases in periodic or repeating structures by modeling the window as a cycle, which enhances uniformity in hash distributions for applications involving circular or modular data. The formulation computes the hash of a k-length window s_0, s_1, \dots, s_{k-1} as

H = \bigoplus_{i=0}^{k-1} \text{rol}_{k-1-i}\left( h(s_i) \right),

where h(\cdot) maps each character s_i to a random fixed-width integer (e.g., 64 bits), \text{rol}_j denotes a left circular rotation by j bits, and \bigoplus is bitwise XOR. To update the hash for a sliding window—removing s_0 and adding s_k—the new value is obtained via

H' = \text{rol}_1(H) \oplus \text{rol}_k(h(s_0)) \oplus h(s_k),

which requires only two rotations and two XORs, maintaining constant time per update. This recursive structure leverages the cyclic nature to simulate polynomial multiplication under rotation, where the rotation operator acts analogously to multiplication by a primitive element in the underlying field. A key advantage of the cyclic polynomial hash lies in its avoidance of costly multiplications, relying instead on fast bitwise operations, which can yield up to twice the speed of alternatives on hardware with limited multiplication throughput, while achieving pairwise independence for low collision probabilities in n-gram hashing—which represents the best achievable level for recursive n-gram hashing functions like cyclic polynomial hashes. It excels with periodic or rotational data, such as in cyclic redundancy checks adapted for probabilistic hashing, by reducing sensitivity to boundaries and providing more uniform distributions compared to linear shifts in standard hashes. This makes it particularly suitable for scenarios where data exhibits repetitive patterns, minimizing biases that could cluster similar hashes. In practice, cyclic polynomial hashing finds application in content deduplication systems with variable chunk boundaries, such as backup tools like Borg, where it identifies duplicate blocks by computing rolling hashes over data streams to place chunk boundaries at natural data seams, thereby reducing storage overhead without fixed-size partitioning. It also supports bioinformatics tasks, including k-mer counting and sequence alignment, by enabling efficient rolling computations over genomic data to filter duplicates and build probabilistic structures like Bloom filters with minimal collisions. These uses highlight its role in reducing hash distribution biases for irregular or cyclic inputs, distinct from standard polynomial rolling hashes that assume linear progression and may exhibit edge effects in sliding windows.

Gear Fingerprint

The Gear fingerprint is a specialized rolling hash variant tailored for efficient variable-length chunking in data deduplication tasks, particularly in backup and storage systems. It leverages a lightweight recursive update mechanism to compute hash values over sliding windows, facilitating the identification of content-defined boundaries without relying on fixed block sizes. First introduced in the Ddelta delta compression framework, the Gear fingerprint was refined in subsequent work on content-defined chunking algorithms during the mid-2010s to address performance bottlenecks in deduplication pipelines. At its core, the Gear fingerprint operates by maintaining a rolling hash computed over the sliding window using the update formula fp = (fp \ll 1) + G(b), where fp is the current fingerprint, \ll 1 denotes a left bit-shift by one position, and G(b) is a precomputed 256-entry table assigning random 64-bit integers to each possible byte value b. A chunk boundary is selected when the Gear hash matches a predefined mask or threshold, such as checking whether selected bits equal a target value (e.g., fp \& (2^{13} - 1) = 0 for an average 8 KB chunk size). This approach allows efficient boundary detection as the window slides byte-by-byte. The primary advantages of the Gear fingerprint lie in its computational efficiency and adaptability for content-aware partitioning, which enhances deduplication ratios by producing variable-length chunks aligned with semantic shifts in the data, such as file format transitions or repetitive patterns. Unlike naive fixed-size chunking, it avoids boundary misalignment issues that reduce storage savings, achieving up to 10× speedup over Rabin-based methods and 3× over prior Gear/AE-based methods while preserving hash uniformity and low collision rates. In practical deployments, such as cloud storage systems, this results in significant overall chunking throughput improvements with negligible impact on deduplication effectiveness (e.g., less than 1% ratio loss compared to Rabin hashing). In a typical application to a file or data stream, the algorithm slides a window byte-by-byte to generate successive Gear hash values, accumulating content influence through the shifting mechanism. At each position after a minimum chunk size, the Gear hash is evaluated against the threshold; a match signals a boundary, segmenting the file into content-coherent chunks suitable for storage optimization. This ensures balanced chunk sizes (e.g., 2–64 KB) while minimizing overhead from excessive small or oversized fragments.

Applications

Content-Based Chunking

Content-based chunking, also known as content-defined chunking (CDC), leverages rolling hashes to partition data streams into variable-sized chunks based on inherent content patterns rather than fixed byte offsets. This approach scans the input data sequentially using a sliding window of fixed length, typically 32 to 64 bytes, over which a rolling hash function—often a Rabin fingerprint—is computed incrementally as each new byte is incorporated and the oldest byte is removed. Chunk boundaries are dynamically selected when the rolling hash satisfies a predefined condition, such as the low-order bits of the hash value being zero, ensuring that similar content produces identical chunk divisions regardless of positional shifts. This method was pioneered in systems like the Low-Bandwidth File System (LBFS), where it enables efficient deduplication by aligning chunks to semantic boundaries in the data. The specific technique involves applying a bitmask to the rolling hash output to control the chunk size distribution. For instance, if the hash value bitwise ANDed with a mask (e.g., 2^k - 1 for checking the lowest k bits) equals zero, the current window position marks the end of a chunk, balancing chunk sizes around 4-64 KB while allowing variation based on content. Minimum and maximum chunk size constraints are enforced to prevent overly small or large fragments, with the mask tuned to achieve the desired average chunk size (e.g., k = 13 for approximately 8 KB chunks). Variants like the Gear fingerprint can further optimize this by using multiple masks for faster hash judgment without sacrificing boundary detection accuracy. In practice, the full chunk is then fingerprinted using a cryptographic hash such as SHA-256 for deduplication lookup, but the rolling hash serves solely for boundary detection. This content-based approach yields higher deduplication rates compared to fixed-size chunking. By aligning boundaries to content features, it achieves better similarity detection in diverse workloads like email archives or file backups, where duplicates often span shifted regions. In contrast to fixed chunking, which suffers from boundary-shift errors—where even a single-byte insertion causes all subsequent chunks to differ, triggering unnecessary rehashing and storage—content-based chunking localizes changes to affected regions only. A small edit early in a file might shift only one or two chunks, preserving the rest for reuse in deduplication. This resilience is particularly valuable in incremental backups or versioned storage, minimizing data transfer and recomputation. Real-world implementations include backup tools like Restic, which employs 64-bit fingerprints over 64-byte windows with a mask checking the lowest 21 bits for boundaries, enabling efficient inter-file and inter-snapshot deduplication. Distributed storage systems such as GlusterFS also integrate Rabin-Karp rolling hashes for variable chunking in their deduplication translator, optimizing space in scale-out environments. These applications underscore the technique's role in modern storage efficiency, from personal backups to enterprise file systems.

FastCDC Algorithm

FastCDC is an efficient content-defined chunking (CDC) algorithm designed for high-performance data deduplication, leveraging a rolling hash mechanism to identify chunk boundaries without requiring full file hashing. Introduced by Xia et al. in 2016, it builds on prior Gear-based approaches by incorporating three key optimizations: simplified hash judgment, skipping of sub-minimum cut-points, and chunk-size normalization, all integrated with a fast rolling hash to achieve near-linear throughput. This enables FastCDC to process large datasets rapidly while maintaining deduplication ratios comparable to more computationally intensive methods. The algorithm proceeds in structured steps to detect boundaries efficiently. First, it normalizes chunk sizes by dynamically adjusting the hash mask based on an expected average chunk size (e.g., 8KB), using a stricter mask (with more '1' bits, corresponding to a smaller effective sliding window) for the initial phase up to this size, and a looser mask (fewer '1' bits, larger window) thereafter; this ensures chunks fall within a target range (e.g., 8KB–16KB) without excessive small or large fragments. Second, it employs a two-level rolling hash: a small-window hash to identify candidate cut-points starting from a minimum chunk size (e.g., 2KB), followed by confirmation with a larger-window hash to validate boundaries and skip invalid small candidates. Third, boundaries are selected where the rolling hash value matches a predefined mask pattern (e.g., specific low-order bits are zero), using the hash to slide incrementally over the data stream. A core innovation in FastCDC is its find_boundary function, which integrates cut-point skipping to pass over sub-minimum chunks efficiently by advancing the hash computation from the minimum size threshold, avoiding redundant evaluations of sub-minimum cut-points and enabling linear-time boundary detection across the data. The rolling hash itself, known as the Gear hash, uses an array of 256 precomputed 64-bit random integers (Gear[b] for byte b) and updates via a left-shift and addition: fp = (fp << 1) + Gear[src[i]], allowing constant-time incremental computation as the window slides byte-by-byte. This combination yields up to 10× speedup over traditional Rabin-based chunkers and 3× over other Gear- or AE-based methods, with deduplication ratios remaining similar (e.g., 47.64% for FastCDC vs. 47.58% for Rabin-based chunking on TAR datasets at 8KB average size). The following pseudocode illustrates the rolling hash integration in the boundary detection process (adapted from Algorithm 1 in Xia et al.):

Input: data buffer src, length n
MaskS ← 0x0003590703530000LL   // 15 '1' bits for small window
MaskL ← 0x0000d90003530000LL   // 11 '1' bits for large window
MinSize ← 2KB; NormalSize ← 8KB
fp ← 0; i ← 0
// Initialize hash up to MinSize
for j from 0 to MinSize-1:
    fp = (fp << 1) + Gear[src[j]]
i ← MinSize
while i < n:
    // Advance hash one byte
    fp = ((fp << 1) + Gear[src[i]]) & 0xFFFFFFFFFFFFFFFFLL   // 64-bit mask
    if i < NormalSize:
        if !(fp & MaskS):
            return i   // Candidate boundary in small phase
    else:
        if !(fp & MaskL):
            return i   // Confirmed boundary in large phase
    i ← i + 1
return n   // End of buffer


Analysis

Computational Complexity

The computational complexity of rolling hash functions centers on their ability to update hash values efficiently as the window slides over the input stream. For core techniques like the polynomial rolling hash and Rabin fingerprinting, preprocessing to compute the initial hash for a window of size k requires O(k) time, while each subsequent update for sliding the window by one position takes O(1) time, resulting in O(n) total time to process a stream of length n. Space requirements are minimal, using O(1) additional space beyond the input stream, as the algorithm maintains only the current hash value, the window size k, the base, and the modulus. The core operations rely on modular arithmetic, where multiplication of two numbers modulo a prime p has bit complexity O((\log p)^2), though in practical implementations using 64-bit integers, these operations execute in constant time via optimized hardware instructions. Probabilistically, rolling hashes exhibit strong average-case performance due to low collision probabilities when using large primes for the modulus. In the worst case, a collision triggers verification via direct string comparison, costing O(k) time, but with a suitably large prime p (e.g., around 2^{64}), the collision probability is approximately 1/p, ensuring expected O(1) time per update. Comparisons across variants reveal consistent efficiency: both the polynomial rolling hash and Rabin fingerprinting achieve O(1) time per update. The Gear fingerprint, employed in content-defined chunking, also supports O(1) updates through simple shift and addition operations.

Limitations and Trade-offs

Despite their efficiency, rolling hashes face significant collision risks, even with large prime moduli. Adversarial inputs or low-entropy data can exploit uneven feature coverage in Rabin's randomized fingerprinting model, leading to high false positive rates that exceed 10% for entropy scores up to 180 in typical text corpora. Mitigation strategies include employing multiple independent hash functions—such as double hashing with distinct bases and moduli—or performing exact verification on potential matches, which can reduce collision probabilities to approximately 10^{-18} using two large prime moduli around 10^9. Parameter selection presents key trade-offs in rolling hash design. A larger modulus, such as 10^9 + 9, minimizes collision likelihood by expanding the hash range but necessitates careful handling of arithmetic operations to prevent degradation from big-integer simulations. Conversely, a small base (e.g., 31 for lowercase letters) risks patterned distributions and elevated collisions in repetitive or structured inputs, while larger bases (e.g., 53 for full ASCII) promote uniformity at the expense of higher computational costs during rolling updates. Rolling hashes lack cryptographic security and are unsuitable for adversarial environments like message authentication or integrity verification against tampering; for such purposes, secure alternatives like SHA-256 must be used instead. In certain applications, such as incremental message authentication, they remain vulnerable to length-extension attacks, where an attacker can predictably append data to a known hash value without access to the original input. Practical implementation introduces challenges like integer overflow in polynomial computations, which can be mitigated by using unsigned 64-bit integers and explicit modular reduction to maintain non-negative values. Additionally, processing non-ASCII data requires robust character mapping—such as using full ordinal values—to avoid equivalence collisions, with the base selected to exceed the effective alphabet size for uniform distribution. Emerging applications in large-scale data processing, where rolling hashes preprocess high-volume datasets, highlight opportunities for improvement: hybrid designs integrating rolling hashes with cryptographically secure primitives can balance speed and robustness.
