Rolling hash
A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.
A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters; and similar to the way a Zobrist hash can be rapidly updated from the old hash value.
One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the rsync program, which uses a checksum based on Mark Adler's adler-32 as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.
At best, rolling hash values are pairwise independent[1] or strongly universal. They cannot be 3-wise independent, for example.
Polynomial rolling hash
The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:

$$H = c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_k a^{0},$$

where $a$ is a constant, and $c_1, \ldots, c_k$ are the input characters (but this function is not a Rabin fingerprint, see below).
In order to avoid manipulating huge values, all math is done modulo $n$. The choice of $a$ and $n$ is critical to get good hashing; in particular, the modulus $n$ is typically a prime number. See linear congruential generator for more discussion.
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum $H$ by $a$. Shifting all characters by one position to the right requires dividing the entire sum $H$ by $a$. Note that in modulo arithmetic, $a$ can be chosen to have a multiplicative inverse $a^{-1}$ by which $H$ can be multiplied to get the result of the division without actually performing a division.
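As a concrete illustration, the following Python sketch implements this rolling update; the base (playing the role of the constant $a$), the prime modulus, and the window length are illustrative choices rather than values prescribed by the method.

    # Sketch of the polynomial rolling hash described above; BASE plays the role
    # of the constant a and MOD the modulus n. Both values are arbitrary choices.
    BASE = 256
    MOD = 1_000_000_007          # a large prime

    def full_hash(window: bytes) -> int:
        """Hash a whole window: c1*a^(k-1) + c2*a^(k-2) + ... + ck, modulo n."""
        h = 0
        for c in window:
            h = (h * BASE + c) % MOD
        return h

    def roll(h: int, out_byte: int, in_byte: int, base_pow: int) -> int:
        """Slide by one byte; base_pow must equal pow(BASE, k - 1, MOD)."""
        h = (h - out_byte * base_pow) % MOD   # remove the first (oldest) term
        return (h * BASE + in_byte) % MOD     # shift left and add the new character

    data = b"abracadabra"
    k = 4
    base_pow = pow(BASE, k - 1, MOD)
    h = full_hash(data[:k])
    for i in range(k, len(data)):
        h = roll(h, data[i - k], data[i], base_pow)
        assert h == full_hash(data[i - k + 1 : i + 1])   # rolling matches recomputing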
Rabin fingerprint
The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2). Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC-32). The hash is the remainder after the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because this simpler rolling hash is also a polynomial, the two rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob sizes varying between 512 KiB and 8 MiB and a 64-byte window.[2][3]
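The Python sketch below illustrates the rolling update of such a fingerprint, treating the window as a polynomial of bits reduced modulo a fixed polynomial P. The polynomial P, the window length, and the byte-at-a-time rolling step are illustrative assumptions; a genuine Rabin fingerprint requires P to be a randomly chosen irreducible polynomial over GF(2).

    # Sketch of a rolling fingerprint over GF(2). P below is only a placeholder;
    # a real Rabin fingerprint uses a randomly chosen irreducible polynomial.
    P = (1 << 53) | 0x1B          # placeholder modulus polynomial of degree 53
    DEG = 53                      # degree of P
    WINDOW = 16                   # window length in bytes (arbitrary)

    def gf2_mod(x: int) -> int:
        """Reduce the bit-polynomial x modulo P (XOR-based long division)."""
        while x.bit_length() > DEG:
            x ^= P << (x.bit_length() - 1 - DEG)
        return x

    # For every byte value, its contribution at the window's leading position,
    # i.e. b * x^(8*(WINDOW-1)) mod P, so it can be XORed out when it leaves.
    OUT = [gf2_mod(b << (8 * (WINDOW - 1))) for b in range(256)]

    def full_fingerprint(window: bytes) -> int:
        fp = 0
        for b in window:
            fp = gf2_mod((fp << 8) | b)
        return fp

    def roll(fp: int, out_byte: int, in_byte: int) -> int:
        fp ^= OUT[out_byte]                  # subtract (= XOR) the leaving byte
        return gf2_mod((fp << 8) | in_byte)  # shift in the entering byte

    data = bytes(range(256)) * 2
    fp = full_fingerprint(data[:WINDOW])
    for i in range(WINDOW, len(data)):
        fp = roll(fp, data[i - WINDOW], data[i])
        assert fp == full_fingerprint(data[i - WINDOW + 1 : i + 1])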
Cyclic polynomial
Hashing by cyclic polynomial[4]—sometimes called Buzhash[5]—is also simple and it has the benefit of avoiding multiplications, using circular shifts instead. It is a form of tabulation hashing: it presumes that there is some substitution function $h$ from characters to integers in the interval $[0, 2^L)$. This substitution function might be a lookup table mapping characters to random integers. Let the function $s$ be a bitwise rotation (circular shift) by one bit to the left; e.g., $s(10011) = 00111$ on 5-bit words. Let $\oplus$ be the bitwise exclusive or. The hash values are defined as

$$H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \cdots \oplus s(h(c_{k-1})) \oplus h(c_k),$$

resulting in a number in $[0, 2^L)$.
Computing the hash values in a rolling fashion is done as follows. Let $H$ be the previous hash value. Rotate $H$ once: $H \leftarrow s(H)$. If $c_1$ is the character to be removed, rotate it $k$ times: $s^k(h(c_1))$. Then simply set

$$H \leftarrow s(H) \oplus s^k(h(c_1)) \oplus h(c_{k+1}),$$

where $c_{k+1}$ is the new character.
Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first $L - k + 1$ bits. That is, take the result $H$ and dismiss any $k - 1$ consecutive bits.[1] In practice, this can be achieved by an integer division $H \rightarrow H \div 2^{k-1}$.
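A minimal Python sketch of this cyclic-polynomial (Buzhash) scheme follows; the 64-bit word width, the window length, and the randomly filled substitution table are illustrative choices.

    import random

    L = 64                       # word width in bits
    MASK = (1 << L) - 1
    K = 32                       # window length in characters (arbitrary)

    _rng = random.Random(42)
    H_TABLE = [_rng.getrandbits(L) for _ in range(256)]   # substitution function h

    def rot(x: int, r: int) -> int:
        """Rotate an L-bit word left by r bits."""
        r %= L
        return ((x << r) | (x >> (L - r))) & MASK

    def full_hash(window: bytes) -> int:
        h = 0
        for c in window:
            h = rot(h, 1) ^ H_TABLE[c]
        return h

    def roll(h: int, out_byte: int, in_byte: int) -> int:
        # Rotate once, XOR out the leaving character (rotated k times), XOR in the new one.
        return rot(h, 1) ^ rot(H_TABLE[out_byte], K) ^ H_TABLE[in_byte]

    data = bytes(_rng.getrandbits(8) for _ in range(200))
    h = full_hash(data[:K])
    for i in range(K, len(data)):
        h = roll(h, data[i - K], data[i])
        assert h == full_hash(data[i - K + 1 : i + 1])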
The backup software Borg uses Buzhash for splitting files, with blob sizes varying between 512 KiB and 8 MiB, a 4095-byte window,[6] and a non-balanced substitution function.[7]
Content-based slicing using a rolling hash
One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed-size windows to become updated, while in reality, only the first "chunk" has been modified.[3]
A simple approach to making dynamic chunks is to calculate a rolling hash, and if the hash value matches an arbitrary pattern (e.g. all zeroes) in the lower $N$ bits (with a probability of $2^{-N}$, given the hash has a uniform probability distribution), then it is chosen to be a chunk boundary. Each chunk will thus have an average size of $2^N$ bytes. This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries.
Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes.[8] The backup software Borg uses the Buzhash algorithm with a customizable chunk size range for splitting file streams.[6]
Such content-defined chunking is often used for data deduplication.[3][6]
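The sketch below shows one way to implement this boundary rule in Python, using a Buzhash-style rolling hash; the window size, the number of boundary bits N, the random substitution table, and the added minimum/maximum chunk-size limits are all illustrative assumptions rather than part of the technique itself.

    import random
    from collections import deque

    N = 12                  # expected average chunk size of about 2**N bytes
    WINDOW = 48             # rolling-hash window in bytes
    MIN_SIZE, MAX_SIZE = 1 << (N - 2), 1 << (N + 2)
    L = 64
    MASK64 = (1 << L) - 1
    _rng = random.Random(7)
    TABLE = [_rng.getrandbits(L) for _ in range(256)]

    def _rot(x: int, r: int) -> int:
        r %= L
        return ((x << r) | (x >> (L - r))) & MASK64

    def chunks(data: bytes):
        """Yield content-defined chunks: cut where the low N bits of the hash are zero."""
        boundary_mask = (1 << N) - 1
        window = deque()
        h = 0
        start = 0
        for i, b in enumerate(data):
            window.append(b)
            h = _rot(h, 1) ^ TABLE[b]
            if len(window) > WINDOW:
                h ^= _rot(TABLE[window.popleft()], WINDOW)   # drop the byte that left
            size = i + 1 - start
            if (size >= MIN_SIZE and (h & boundary_mask) == 0) or size >= MAX_SIZE:
                yield data[start:i + 1]
                start = i + 1
                window.clear()     # restarting the hash per chunk is a simplification
                h = 0
        if start < len(data):
            yield data[start:]

Because a boundary depends only on the bytes inside the rolling window, an edit early in the stream can move at most the chunks overlapping it; boundaries more than a window away are reproduced unchanged, which is what makes the chunks reusable for deduplication.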
Content-based slicing using moving sum
Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum:[9]

$$S(n) = \sum_{i = n - 8195}^{n} c_i,$$
$$H(n) = S(n) \bmod 4096,$$

where
- $S(n)$ is the sum of 8196 consecutive bytes ending with byte $n$ (requires 21 bits of storage),
- $c_i$ is byte $i$ of the file,
- $H(n)$ is a "hash value" consisting of the bottom 12 bits of $S(n)$.
Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.
For every $n$ where $H(n) = 4095$, these programs cut the file between $n$ and $n + 1$. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but no other chunk.
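A minimal Python sketch of this moving-sum slicing follows; it emits the offsets at which the cut rule above would split the file.

    WINDOW = 8196
    MOD = 4096

    def cut_points(data: bytes):
        """Yield offsets n such that the file is cut between byte n and byte n+1."""
        s = 0
        for n, byte in enumerate(data):
            s += byte
            if n >= WINDOW:
                s -= data[n - WINDOW]      # byte that just left the window
            if s % MOD == MOD - 1:         # bottom 12 bits of the sum are all ones
                yield n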
Gear fingerprint and content-based chunking algorithm FastCDC
Chunking is a technique to divide a data stream into a set of blocks, also called chunks. Content-defined chunking (CDC) is a chunking technique in which the division of the data stream is not based on a fixed chunk size, as in fixed-size chunking, but on its content.
A content-defined chunking algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined condition. However, evaluating a hash at every byte position can introduce heavy computation overhead. FastCDC[10] proposes a fast and efficient content-defined chunking approach. It uses a fast rolling Gear hash algorithm,[11] skips the minimum chunk length, normalizes the chunk-size distribution, and, last but not least, rolls two bytes each time to speed up the CDC algorithm, which can achieve about 10× higher throughput than Rabin-based CDC approaches.[12]
The basic version pseudocode is provided as follows:
    algorithm FastCDC
        input: data buffer src,
               data length n
        output: cut point i

        MinSize ← 2KB     // split minimum chunk size is 2 KB
        MaxSize ← 64KB    // split maximum chunk size is 64 KB
        Mask ← 0x0000d93003530000LL
        fp ← 0
        i ← 0

        // buffer size is less than minimum chunk size
        if n ≤ MinSize then
            return n
        if n ≥ MaxSize then
            n ← MaxSize

        // Skip the first MinSize bytes, and kickstart the hash
        while i < MinSize do
            fp ← (fp << 1) + Gear[src[i]]
            i ← i + 1

        while i < n do
            fp ← (fp << 1) + Gear[src[i]]
            if !(fp & Mask) then
                return i
            i ← i + 1

        return i
Here, Gear is a pre-calculated hashing array. FastCDC uses the Gear hashing algorithm, which can calculate rolling hash results quickly while keeping the hash values as uniformly distributed as Rabin fingerprints. Compared with the traditional Rabin hashing algorithm, it achieves a much faster speed: experiments suggest that it can generate nearly the same chunk-size distribution in a much shorter time (about 1/10 of Rabin-based chunking[12]) when segmenting a data stream.
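For illustration, the basic pseudocode above translates directly into Python. The way the Gear table is filled here (Python's random module with an arbitrary seed) is an assumption; any fixed table of 256 random 64-bit values works.

    import random

    _rng = random.Random(1)
    GEAR = [_rng.getrandbits(64) for _ in range(256)]   # pre-calculated random table
    MASK64 = (1 << 64) - 1

    MIN_SIZE = 2 * 1024          # 2 KB
    MAX_SIZE = 64 * 1024         # 64 KB
    MASK = 0x0000D93003530000    # same mask as in the pseudocode

    def cut_point(src: bytes) -> int:
        """Return the length of the chunk starting at the beginning of src."""
        n = len(src)
        if n <= MIN_SIZE:
            return n
        n = min(n, MAX_SIZE)
        fp = 0
        i = 0
        while i < MIN_SIZE:      # skip the first MinSize bytes, warming up the hash
            fp = ((fp << 1) + GEAR[src[i]]) & MASK64
            i += 1
        while i < n:
            fp = ((fp << 1) + GEAR[src[i]]) & MASK64
            if not (fp & MASK):
                return i
            i += 1
        return i

    def chunk(data: bytes):
        """Split data into chunks by repeatedly applying cut_point."""
        pos = 0
        while pos < len(data):
            cut = cut_point(data[pos:])
            yield data[pos:pos + cut]
            pos += cut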
Computational complexity
All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position. In particular, computing the Rabin–Karp rolling hash of a string of length $k$ requires $O(k)$ modular arithmetic operations, and hashing by cyclic polynomials requires $O(k)$ bitwise exclusive ors and circular shifts.[1]
See also
- MinHash – Data mining technique
- w-shingling
Footnotes
1. Daniel Lemire, Owen Kaser: "Recursive n-gram hashing is pairwise independent, at best", Computer Speech & Language 24 (4), pages 698–710, 2010. arXiv:0705.4676.
2. "References — restic 0.9.0 documentation". restic.readthedocs.io. Retrieved 2018-05-24.
3. "Foundation – Introducing Content Defined Chunking (CDC)". 2015.
4. Jonathan D. Cohen: "Recursive Hashing Functions for n-Grams", ACM Trans. Inf. Syst. 15 (3), 1997.
5. Uzgalis, Robert. "BUZ Hash".
6. "Data structures and file formats — Borg – Deduplicating Archiver 1.1.5 documentation". borgbackup.readthedocs.io. Retrieved 2018-05-24.
7. Waldmann, Thomas. "borg docs: buzhash insights / discussion". GitHub.
8. Horvath, Adam (October 24, 2012). "Rabin Karp rolling hash – dynamic sized chunks based on hashed content".
9. "Rsyncrypto Algorithm".
10. Xia, Wen; Zhou, Yukun; Jiang, Hong; Feng, Dan; Hua, Yu; Hu, Yuchong; Liu, Qing; Zhang, Yucheng (2016). "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication". USENIX Association. ISBN 9781931971300. Retrieved 2020-07-24.
11. Xia, Wen; Jiang, Hong; Feng, Dan; Tian, Lei; Fu, Min; Zhou, Yukun (2014). "Ddelta: A deduplication-inspired fast delta compression approach". Performance Evaluation. 79: 258–272. doi:10.1016/j.peva.2014.07.016. ISSN 0166-5316.
12. Xia, Wen; Zou, Xiangyu; Jiang, Hong; Zhou, Yukun; Liu, Chuanyi; Feng, Dan; Hua, Yu; Hu, Yuchong; Zhang, Yucheng (2020-06-16). "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems". IEEE Transactions on Parallel and Distributed Systems. 31 (9): 2017–2031. doi:10.1109/TPDS.2020.2984632. S2CID 215817722.
Rolling hash

Fundamentals
Definition and Motivation
A rolling hash function is a specialized hashing technique designed to compute the hash value of a fixed-length substring or sliding window within a larger string or data stream by incrementally updating the previous hash value as the window advances by one position, thereby avoiding the computational overhead of recalculating the entire hash from scratch for each new window.[4] This approach leverages modular arithmetic to ensure efficient updates, typically in constant time per slide, making it suitable for processing sequential data where overlapping substrings need repeated hashing.[5]

The primary motivation for rolling hashes stems from the inefficiency of naive substring hashing methods in applications requiring multiple overlapping window comparisons, such as string pattern matching or data deduplication. In a naive implementation, hashing each of the (n - m + 1) possible windows of size m in an n-length string would incur O((n - m + 1) * m) time complexity due to the per-window recomputation cost.[4] By contrast, rolling hashes reduce this to O(n + m) overall time through incremental updates, enabling scalable processing of large datasets where frequent substring evaluations are essential.[4]

Rolling hashes originated in the 1980s amid developments in efficient string processing algorithms, with a seminal application in the Rabin-Karp string-searching algorithm introduced by Richard M. Karp and Michael O. Rabin in 1987. This work built on earlier linear-time matching techniques like Knuth-Morris-Pratt (1977) but innovated by using randomized fingerprinting—essentially an early form of rolling hashing—to achieve constant-space efficiency and near-real-time performance for pattern detection in texts and multidimensional arrays.

As a basic illustration, consider hashing consecutive k-mers in a string over an alphabet of size b by treating each k-mer as the evaluation of a polynomial of degree (k-1) with coefficients from the character values, computed modulo a large prime p; the hash for the next k-mer is then obtained by subtracting the outgoing character's contribution, multiplying by b, and adding the incoming character's value, all modulo p.[5]

Advantages over Naive Methods
Rolling hash provides significant efficiency gains over naive methods for computing hashes of overlapping substrings, primarily through its incremental update mechanism. In naive approaches, hashing each possible substring of length m in a string of length n requires recomputing the hash from scratch for every position, resulting in a time complexity of O((n - m + 1) * m). By contrast, rolling hash computes an initial hash in O(m) time and then updates the hash value in O(1) time per window slide by subtracting the contribution of the outgoing character and adding the incoming one, multiplied by appropriate powers of the base, yielding a total time complexity of O(n + m), or simply O(n) for large n. This linear-time performance makes rolling hash particularly suitable for large-scale string processing tasks where n is substantial relative to m.[6][7][8]

In terms of space efficiency, rolling hash maintains only a single running hash value at any time, requiring constant additional memory beyond the input string itself. Naive methods, while not necessarily storing all hashes, incur repeated full computations that indirectly burden memory through redundant operations on the entire window data for each position. This minimal memory footprint of rolling hash facilitates its use in resource-constrained environments, such as embedded systems or when processing massive datasets where storing intermediate results for all windows would be prohibitive.[9][10]

Practically, rolling hash enables real-time processing of streaming data by allowing continuous updates without restarting computations, which is infeasible with naive full rehashes at each step. For instance, in plagiarism detection, rolling hash can scan long documents to identify similar passages by fingerprinting overlapping windows efficiently, avoiding the exhaustive hashing of every possible substring that would render naive methods impractical for texts exceeding thousands of characters. Additionally, when implemented with modular arithmetic over large primes, rolling hash supports probabilistic matching with low collision rates, further enhancing its reliability in applications requiring quick filtering before exact verification.[11][12]

Core Techniques
Polynomial Rolling Hash
The polynomial rolling hash function computes a numeric value for a fixed-length substring by interpreting the characters as coefficients of a polynomial evaluated at a chosen base b, taken modulo a large prime p. For a string s = s[0] s[1] … s[n-1] where each character is mapped to an integer between 0 and the alphabet size minus 1, the initial hash for the window s[0..k-1] of length k is

H = (s[0]·b^(k-1) + s[1]·b^(k-2) + ⋯ + s[k-1]) mod p.

This formulation allows the hash to represent the substring uniquely with high probability when p is sufficiently large, facilitating quick comparisons in applications like pattern matching. The initial computation can be performed efficiently in O(k) time using Horner's rule, which rewrites the expression as (((s[0]·b + s[1])·b + s[2])·b + ⋯)·b + s[k-1], reduced mod p at each step.[6]

To slide the window by one position to the substring s[1..k], the hash is updated in constant time without recomputing from scratch. The new hash is calculated as

H_new = ((H_old − s[0]·b^(k-1)) · b + s[k]) mod p.

If the subtraction yields a negative value, add p to ensure non-negativity before multiplication and addition. This update leverages the polynomial structure, effectively "shifting" the coefficients by removing the leading term, multiplying by b, and appending the new character. Precomputing b^(k-1) mod p once allows all updates to run in O(1) time after the initial hash.[6]

The parameters b and p are selected to balance computational efficiency, hash range, and collision resistance. The base b is typically a small prime larger than the alphabet size, such as 131 for general text processing or 256 for direct byte values in ASCII strings, ensuring the polynomial evaluation spreads values evenly. The modulus p is chosen as a large prime, often 10⁹ + 7, to confine the hash to 30 bits while approximating uniform distribution over [0, p). For random strings, the probability of two distinct substrings having the same hash (a collision) is roughly 1/p; to improve safety in sensitive applications, double-hashing with two independent moduli p₁ and p₂ can reduce this to approximately 1/(p₁·p₂).[6][13]

The following pseudocode illustrates initialization and sliding for a string s (0-indexed) of length n with window size k:

    function initialize_hash(s, k, b, p):
        h = 0
        for i = 0 to k-1:
            h = (h * b + ord(s[i])) % p
        return h

    function slide_hash(h, s_old, s_new, b, bk_minus_1, p):
        h = (h - ord(s_old) * bk_minus_1 + p) % p   // +p ensures non-negative
        h = (h * b + ord(s_new)) % p
        return h

    // Usage example:
    bk_minus_1 = pow(b, k-1, p)   // Precompute b^{k-1} mod p
    h = initialize_hash(s, k, b, p)
    for i = k to n-1:
        h = slide_hash(h, s[i-k], s[i], b, bk_minus_1, p)
        // Use h for window s[i-k+1 .. i]
Rabin Fingerprint
The Rabin fingerprint extends the polynomial rolling hash by computing the hash value as a pair (h mod p₁, h mod p₂), where p₁ and p₂ are two large distinct primes, providing a compact representation for probabilistic string comparison.[14] The update mechanism follows the same sliding window principle as the base polynomial hash but is performed independently modulo each prime: subtract the outgoing character's contribution scaled by b^(m-1) modulo pᵢ, multiply the result by b modulo pᵢ, and add the incoming character modulo pᵢ, for i = 1, 2, enabling constant-time shifts across overlapping substrings.

In the Rabin-Karp algorithm for string searching, this fingerprint pair for the pattern is compared against pairs for each m-length substring of the text, where m is the pattern length, achieving O(n + m) average-case time complexity for text length n, since equal fingerprints suggest potential matches (verified exactly) while unequal pairs confirm non-matches with overwhelming probability. The collision probability, where distinct strings produce the same fingerprint pair, is roughly 1/(p₁ p₂); for instance, with p₁ ≈ p₂ ≈ 10⁹, this yields an error rate of approximately 10⁻¹⁸, sufficiently low for most applications without increasing computational overhead significantly.[14] This method originated in Michael O. Rabin's 1981 work on fingerprinting via random polynomials and was formalized for pattern matching by Richard M. Karp and Michael O. Rabin in 1987.[14]
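A minimal Python sketch of this fingerprint-pair search follows; the base B and the two primes are illustrative choices, and candidate matches are verified exactly, as described above.

    B = 256
    P1, P2 = 1_000_000_007, 998_244_353   # two large distinct primes (illustrative)

    def fingerprint(s: bytes) -> tuple:
        h1 = h2 = 0
        for c in s:
            h1 = (h1 * B + c) % P1
            h2 = (h2 * B + c) % P2
        return h1, h2

    def find(pattern: bytes, text: bytes) -> int:
        """Return the first index where pattern occurs in text, or -1."""
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return -1 if m > n else 0
        target = fingerprint(pattern)
        h1, h2 = fingerprint(text[:m])
        pow1, pow2 = pow(B, m - 1, P1), pow(B, m - 1, P2)
        for i in range(n - m + 1):
            if (h1, h2) == target and text[i:i + m] == pattern:  # verify on match
                return i
            if i + m < n:
                out_c, in_c = text[i], text[i + m]
                h1 = ((h1 - out_c * pow1) * B + in_c) % P1
                h2 = ((h2 - out_c * pow2) * B + in_c) % P2
        return -1

    assert find(b"needle", b"haystack with a needle inside") == 16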
Variants and Extensions

Cyclic Polynomial Hash
The cyclic polynomial hash, also known as Buzhash, modifies the standard polynomial rolling hash by treating the sliding window as a circular buffer, employing bit rotations and XOR operations instead of multiplications to achieve efficient updates and improved handling of wrap-around effects. This approach draws from polynomial representations over finite fields, where cyclic shifts correspond to multiplication by primitive elements, ensuring the hash value remains consistent under rotations of the input window. Unlike linear polynomial methods, it avoids edge biases in periodic or repeating data structures by modeling the window as a cycle, which enhances uniformity in hash distributions for applications involving circular or modular data.[15][16]

The formulation computes the hash of a k-length window c_1 c_2 … c_k as

H = s^(k-1)(h(c_1)) ⊕ s^(k-2)(h(c_2)) ⊕ ⋯ ⊕ h(c_k),

where h maps each symbol to a random fixed-width integer (e.g., 64 bits), s^j denotes a left rotation by j bits, and ⊕ is bitwise XOR. To update the hash for a sliding window—removing c_1 and adding c_(k+1)—the new value is obtained via

H ← s(H) ⊕ s^k(h(c_1)) ⊕ h(c_(k+1)),

which requires only one rotation and two XORs, maintaining constant-time complexity per update. This recursive structure leverages the cyclic nature to simulate polynomial evaluation under rotation, where the rotation operator acts analogously to multiplication by a primitive root in the underlying field.[16][15]

A key advantage of the cyclic polynomial hash lies in its avoidance of costly multiplications, relying instead on fast bitwise operations, which can yield up to twice the speed of irreducible polynomial alternatives on hardware with limited multiplication throughput, while achieving pairwise independence for low collision probabilities in n-gram hashing—which represents the best achievable level for recursive n-gram hashing functions like cyclic polynomial hashes.[17] It excels with periodic or rotational data, such as in cyclic redundancy checks adapted for probabilistic hashing, by reducing sensitivity to window boundaries and providing more uniform distributions compared to linear shifts in standard polynomial hashes. This makes it particularly suitable for scenarios where data exhibits repetitive patterns, minimizing biases that could cluster similar hashes.[15]

In practice, cyclic polynomial hashing finds application in content deduplication systems with variable chunk boundaries, such as backup tools like Borg, where it identifies duplicate blocks by computing rolling hashes over streams to trigger chunks at natural data seams, thereby reducing storage overhead without fixed-size partitioning. It also supports bioinformatics tasks, including k-mer counting and sequence alignment, by enabling efficient rolling computations over genomic data to filter duplicates and build probabilistic structures like Bloom filters with minimal collisions. These uses highlight its role in reducing hash distribution biases for irregular or cyclic inputs, distinct from standard polynomial rolling hashes that assume linear progression and may exhibit edge effects in sliding windows.[16][15]

Gear Fingerprint
The Gear fingerprint is a specialized rolling hash variant tailored for efficient variable-length chunking in data processing tasks, particularly in backup and cloud storage systems. It leverages a lightweight recursive update mechanism to compute hash values over sliding windows, facilitating the identification of content-defined boundaries without relying on fixed block sizes. First introduced in the Ddelta delta compression framework in 2014, the Gear fingerprint was refined in subsequent work on content-defined chunking algorithms during the mid-2010s to address performance bottlenecks in deduplication pipelines.[18]

At its core, the Gear fingerprint operates by maintaining a rolling hash computed over the data stream using the update formula fp_new = (fp << 1) + Gear[b], where fp is the current fingerprint, "<< 1" denotes a left bit-shift by one position, and Gear is a precomputed 256-entry table assigning random 64-bit integers to each possible byte value b. A chunk boundary is selected when the Gear hash matches a predefined mask or threshold, such as verifying whether the most significant N bits equal a target value (e.g., N = 13 for an average 8 KB chunk size). This approach allows efficient boundary detection as the window slides byte-by-byte.[18]

The primary advantages of the Gear fingerprint lie in its computational efficiency and adaptability for content-aware partitioning, which enhances deduplication ratios by producing variable-length chunks aligned with semantic shifts in the data stream, such as file format transitions or repetitive patterns. Unlike naive fixed-size chunking, it avoids boundary misalignment issues that reduce storage savings, achieving up to 10× speedup over Rabin-based methods and 3× over prior Gear/AE-based methods while preserving hash uniformity and low collision rates. In practical deployments, such as cloud backup systems, this results in significant overall chunking throughput improvements with negligible impact on deduplication effectiveness (e.g., less than 1% ratio loss compared to Rabin hashing).[18]

In a typical application to a data file, the algorithm slides a window byte-by-byte to generate successive Gear hash values, accumulating content influence through the shifting mechanism. At each position after a minimum chunk length, the Gear hash is evaluated against the threshold; a match signals a boundary, segmenting the file into content-coherent chunks suitable for storage optimization. This process ensures balanced chunk sizes (e.g., 2–64 KB) while minimizing overhead from excessive small or oversized fragments.[18]

Applications
Content-Based Chunking
Content-based chunking, also known as content-defined chunking (CDC), leverages rolling hashes to partition data streams into variable-sized chunks based on inherent content patterns rather than fixed byte offsets. This approach scans the input data sequentially using a sliding window of fixed length, typically 32 to 64 bytes, over which a rolling hash function—often a Rabin fingerprint—is computed incrementally as each new byte is incorporated and the oldest byte is removed. Chunk boundaries are dynamically selected when the rolling hash satisfies a predefined condition, such as the low-order bits of the hash value being zero, ensuring that similar content produces identical chunk divisions regardless of positional shifts. This method was pioneered in systems like the Low-Bandwidth File System (LBFS), where it enables efficient deduplication by aligning chunks to semantic boundaries in the data.[19]

The specific technique involves applying a mask to the rolling hash output to control chunk size distribution. For instance, if the hash value bitwise ANDed with a mask covering the lowest N bits equals zero, the current window position marks the end of a chunk, balancing average chunk sizes around 4–64 KB while allowing variation based on content. Minimum and maximum chunk size constraints are enforced to prevent overly small or large fragments, with the mask width N tuned to achieve the desired average of about 2^N bytes (e.g., N = 13 for approximately 8 KB chunks). Variants like the gear fingerprint can further optimize this process by using multiple hash gears for faster computation without sacrificing boundary detection quality. In practice, the full chunk is then fingerprinted using a cryptographic hash like SHA-1 for deduplication lookup, but the rolling hash serves solely for boundary detection. This content-based approach yields higher deduplication rates compared to fixed-size chunking.[20]

By aligning boundaries to content features, it achieves better similarity detection in diverse workloads like email archives or file backups, where duplicates often span shifted regions. In contrast to fixed chunking, which suffers from boundary-shift errors—where even a single-byte insertion causes all subsequent chunks to differ, triggering unnecessary rehashing and storage—content-based chunking localizes changes to affected regions only. A small edit early in a file might shift only one or two chunks, preserving the rest for reuse in deduplication. This resilience is particularly valuable in incremental backups or versioned storage, minimizing data transfer and recomputation.

Real-world implementations include backup tools like Restic, which employs 64-bit Rabin fingerprints over 64-byte windows with a mask checking the lowest 21 bits for boundaries, enabling efficient inter-file and inter-snapshot deduplication. Distributed storage systems such as GlusterFS also integrate Rabin-Karp rolling hashes for variable chunking in their deduplication translator, optimizing space in scale-out environments.[21] These applications underscore the technique's role in modern storage efficiency, from personal backups to enterprise file systems.[22]

FastCDC Algorithm
FastCDC is an efficient content-defined chunking (CDC) algorithm designed for high-performance data deduplication, leveraging a rolling hash mechanism to identify chunk boundaries without requiring full file hashing. Introduced by Xia et al. in 2016, it builds on prior Gear-based approaches by incorporating three key optimizations: simplified hash judgment, skipping of sub-minimum cut-points, and chunk-size normalization, all integrated with a fast rolling hash to achieve near-linear time complexity.[18] This enables FastCDC to process large datasets rapidly while maintaining deduplication ratios comparable to more computationally intensive methods.[18]

The algorithm proceeds in structured steps to detect boundaries efficiently. First, it normalizes chunk sizes by dynamically adjusting the hash mask based on an expected average chunk size (e.g., 8KB), using a stricter mask (with more '1' bits, corresponding to a smaller sliding window) for the initial phase up to this size, and a looser mask (fewer '1' bits, larger window) thereafter; this ensures chunks fall within a target range (e.g., 8KB–16KB) without excessive small or large fragments.[18] Second, it employs a two-level rolling hash: a small-window hash to identify candidate cut-points starting from a minimum chunk size (e.g., 2KB), followed by confirmation with a larger-window hash to validate boundaries and skip invalid small candidates.[18] Third, boundaries are selected where the rolling hash value matches a predefined mask pattern (e.g., specific low-order bits are zero), using the hash to slide incrementally over the data stream.[18]

A core innovation in FastCDC is its find_boundary function, which integrates the rolling hash to skip potential chunks efficiently by advancing the hash computation from the minimum size threshold, avoiding redundant evaluations of sub-minimum cut-points and enabling linear-time boundary detection across the data.[18] The rolling hash itself, known as the Gear hash, uses an array of 256 precomputed 64-bit random integers (Gear[b] for byte b) and updates via a left-shift and addition: fp = (fp << 1) + Gear[src[i]], allowing constant-time incremental computation as the window slides byte-by-byte.[18] This combination yields up to 10× speedup over traditional Rabin-based chunkers and 3× over other Gear- or AE-based methods, with deduplication ratios remaining similar (e.g., 47.64% for FastCDC vs. 47.58% for Rabin on TAR datasets at 8KB average size).[18]
The following pseudocode illustrates the rolling hash integration in the boundary detection process (adapted from Algorithm 1 in Xia et al.):[18]
    Input: data buffer src, length n
    MaskS ← 0x0003590703530000LL   // 15 '1' bits for small window
    MaskL ← 0x0000d90003530000LL   // 11 '1' bits for large window
    MinSize ← 2KB; NormalSize ← 8KB
    fp ← 0; i ← 0

    // Initialize hash up to MinSize
    for j from 0 to MinSize-1:
        fp = (fp << 1) + Gear[src[j]]
    i ← MinSize

    while i < n:
        // Advance hash one byte
        fp = ((fp << 1) + Gear[src[i]]) & 0xFFFFFFFFFFFFFFFFLL   // 64-bit mask
        if i < NormalSize:
            if !(fp & MaskS):
                return i   // Candidate boundary in small phase
        else:
            if !(fp & MaskL):
                return i   // Confirmed boundary in large phase
        i ← i + 1
    return n   // End of buffer
