Rabin–Karp algorithm
from Wikipedia
Class: String searching
Worst-case performance: O(mn), plus O(m) preprocessing time
Average performance: O(n + m)
Worst-case space complexity: O(1)

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin (1987) that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

To find a single match of a single pattern, the expected time of the algorithm is linear in the combined length of the pattern and text, although its worst-case time complexity is the product of the two lengths. To find multiple matches, the expected time is linear in the input lengths, plus the combined length of all the matches, which could be greater than linear. In contrast, the Aho–Corasick algorithm can find all matches of multiple patterns in worst-case time and space linear in the input length and the number of matches (instead of the total length of the matches).

A practical application of the algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

Overview


A naive string matching algorithm compares the given pattern against all positions in the given text. Each comparison takes time proportional to the length of the pattern, and the number of positions is proportional to the length of the text. Therefore, the worst-case time for such a method is proportional to the product of the two lengths. In many practical cases, this time can be significantly reduced by cutting short the comparison at each position as soon as a mismatch is found, but this idea cannot guarantee any speedup.

Several string-matching algorithms, including the Knuth–Morris–Pratt algorithm and the Boyer–Moore string-search algorithm, reduce the worst-case time for string matching by extracting more information from each mismatch, allowing them to skip over positions of the text that are guaranteed not to match the pattern. The Rabin–Karp algorithm instead achieves its speedup by using a hash function to quickly perform an approximate check for each position, and then only performing an exact comparison at the positions that pass this approximate check.

A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello") = 5. If two strings are equal, their hash values are also equal. For a well-designed hash function, the converse is true in an approximate sense: strings that are unequal are very unlikely to have equal hash values. The Rabin–Karp algorithm proceeds by computing, at each position of the text, the hash value of a string starting at that position with the same length as the pattern. If this hash value equals the hash value of the pattern, it performs a full comparison at that position.

In order for this to work well, the hash function should be selected randomly from a family of hash functions that are unlikely to produce many false positives, that is, positions of the text which have the same hash value as the pattern but do not actually match the pattern. These positions contribute to the running time of the algorithm unnecessarily, without producing a match. Additionally, the hash function used should be a rolling hash, a hash function whose value can be quickly updated from each position of the text to the next. Recomputing the hash function from scratch at each position would be too slow.

The algorithm


The algorithm is as shown:

function RabinKarp(string s[1..n], string pattern[1..m])
    hpattern := hash(pattern[1..m]);
    for i from 1 to n-m+1
        hs := hash(s[i..i+m-1])
        if hs = hpattern
            if s[i..i+m-1] = pattern[1..m]
                return i
    return not found

Lines 2, 4, and 6 each require O(m) time. However, line 2 is only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed O(n) times, but each comparison only requires constant time, so its impact is O(n). The issue is line 4.

Naively computing the hash value for the substring s[i+1..i+m] requires O(m) time because each character is examined. Since the hash computation is done on each loop iteration, the algorithm with a naive hash computation requires O(mn) time, the same complexity as a straightforward string matching algorithm. For speed, the hash must be computed in constant time. The trick is that the variable hs already contains the hash value of the previous window s[i..i+m-1]. If that value can be used to compute the next hash value in constant time, then computing successive hash values will be fast.

The trick can be exploited using a rolling hash. A rolling hash is a hash function specially designed to enable this operation. A trivial (but not very good) rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time:

hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]

This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.
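
To make the constant-time update concrete, here is a minimal Python sketch of this additive rolling hash (the function name is illustrative, and the hash itself is deliberately weak, as noted above):

def additive_hashes(s, m):
    # Yield the additive hash of every length-m substring of s.
    h = sum(ord(c) for c in s[:m])       # hash of the first window, O(m)
    yield h
    for i in range(len(s) - m):
        # Constant-time update: drop the outgoing character, add the incoming one.
        h = h - ord(s[i]) + ord(s[i + m])
        yield h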

Good performance requires a good hashing function for the encountered data. If the hashing is poor (such as producing the same hash value for every input), then line 6 would be executed O(n) times (i.e. on every iteration of the loop). Because character-by-character comparison of strings with length m takes O(m) time, the whole algorithm then takes a worst-case O(mn) time.

Hash function used


The key to the Rabin–Karp algorithm's performance is the efficient computation of hash values of the successive substrings of the text. The Rabin fingerprint is a popular and effective rolling hash function. The hash function described here is not a Rabin fingerprint, but it works equally well. It treats every substring as a number in some base, the base being usually the size of the character set.

For example, if the substring is "hi", the base is 256, and the prime modulus is 101, then the hash value would be

 [(104 × 256) % 101 + 105] % 101 = 65
 (ASCII of 'h' is 104 and of 'i' is 105)

Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for a much more detailed discussion. The essential benefit achieved by using a rolling hash such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.

For example, if we have text "abracadabra" and we are searching for a pattern of length 3, the hash of the first substring, "abr", using 256 as the base, and 101 as the prime modulus is:

// ASCII a = 97, b = 98, r = 114. 
hash("abr") =  [ ( [ ( [  (97 × 256) % 101 + 98 ] % 101 ) × 256 ] %  101 ) + 114 ]   % 101   =  4

We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 256², multiplying by the base, and adding the number for the last 'a' of "bra", i.e. 97 × 256⁰. Like so:

//            old hash   (avoids negative)   old 'a' × 256² mod 101            base shift   new 'a'   prime modulus
hash("bra") =     [ ( 4   + 101            -  97 × [(256 % 101) × 256] % 101 ) × 256        +   97 ]  % 101        = 30

If we are matching against the search string "bra", we compute hash("bra") directly, by the same calculation used for hash("abr"):

hash'("bra") = [([([(98 × 256) % 101 + 114] % 101) × 256] % 101) + 97] % 101 = 30

If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
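
The worked example above translates directly into code. The following Python sketch reproduces the base-256, modulus-101 computations (these parameters come from the example; a real implementation would use a much larger prime, and the function names are illustrative):

BASE, MOD = 256, 101

def poly_hash(s):
    # Hash a string as a base-256 number reduced modulo 101.
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

def roll(h, out_char, in_char, m):
    # Update the hash of a length-m window in constant time.
    out_weight = pow(BASE, m - 1, MOD)            # 256^(m-1) mod 101
    h = (h - ord(out_char) * out_weight) % MOD    # remove the outgoing character
    return (h * BASE + ord(in_char)) % MOD        # shift and add the incoming character

print(poly_hash("abr"))                      # 4
print(roll(poly_hash("abr"), 'a', 'a', 3))   # 30, matches poly_hash("bra")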

Theoretically, there exist other hash functions that could provide convenient recomputation, e.g. multiplying together the ASCII values of all characters, so that shifting the substring would only entail dividing the previous hash by the first character's value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer data type and the necessity of using modular arithmetic to scale down the hash results. Meanwhile, naive hash functions do not produce large numbers quickly, but, just like adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm.

Multiple pattern search

The Rabin–Karp algorithm is inferior for single-pattern searching to the Knuth–Morris–Pratt algorithm, the Boyer–Moore string-search algorithm, and other faster single-pattern string-searching algorithms because of its slow worst-case behavior. However, it is a useful algorithm for multiple-pattern search.

To find any of a large number, say k, of fixed-length patterns in a text, a simple variant of the Rabin–Karp algorithm uses a Bloom filter or a set data structure to check whether the hash of a given string belongs to the set of hash values of the patterns we are looking for:

function RabinKarpSet(string s[1..n], set of string subs, m):
    set hsubs := emptySet
    foreach sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs and s[i..i+m-1] ∈ subs
            return i
        if i < n-m+1
            hs := hash(s[i+1..i+m])
    return not found

We assume all the substrings have a fixed length m.

A naïve way to search for k patterns is to repeat a single-pattern search taking O(n+m) time, for a total of O((n+m)k) time. In contrast, the variant above can find all k patterns in O(n+km) expected time, assuming that a hash-table lookup takes O(1) expected time.
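
Under the same illustrative base-256, modulus-101 parameters used earlier, the set-based variant can be sketched in Python as follows (a production version would choose a much larger random prime to keep collisions rare):

BASE, MOD = 256, 101

def poly_hash(s):
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

def rabin_karp_set(s, subs, m):
    # Return the first index where any pattern in subs (each of length m) occurs in s.
    hsubs = {poly_hash(p) for p in subs}
    patterns = set(subs)
    out_weight = pow(BASE, m - 1, MOD)
    h = poly_hash(s[:m])
    for i in range(len(s) - m + 1):
        # Hash filter first, then exact comparison to rule out collisions.
        if h in hsubs and s[i:i + m] in patterns:
            return i
        if i + m < len(s):
            h = ((h - ord(s[i]) * out_weight) * BASE + ord(s[i + m])) % MOD
    return -1  # not found

print(rabin_karp_set("abracadabra", {"bra", "cad"}, 3))  # 1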

from Grokipedia
The Rabin–Karp algorithm is a string-searching algorithm that uses hash functions to efficiently detect occurrences of a given pattern (or multiple patterns) within a longer text by comparing hash values of the pattern against those of sliding substrings in the text, verifying matches through direct string comparison only when hashes collide. It was developed by Richard M. Karp and Michael O. Rabin in 1987 as part of their work on efficient randomized pattern-matching techniques. The algorithm begins by selecting a hash function that treats strings as numbers in a given base (typically a prime larger than the alphabet size) and computes a polynomial hash modulo a large prime to minimize collisions. For a pattern of length m and a text of length n, it precomputes the pattern's hash in O(m) time and the hash of the first length-m substring of the text, then slides a window across the text, updating each subsequent hash in O(1) time using the recurrence h_{i+1} = (b · (h_i - s_i · b^{m-1}) + s_{i+m}) mod q, where b is the base, q the modulus, and s the text characters. Hash matches trigger an O(m) string comparison; with a good hash function, the expected number of such verifications is low, yielding an average-case running time of O(n + m). In the worst case, it can degrade to O(nm) if many collisions occur, but probabilistic analysis shows this is unlikely with appropriate parameters. A key strength of the Rabin–Karp algorithm lies in its adaptability to multiple-pattern matching, where all patterns are hashed into a set or hash table for O(1) lookups against text hashes, making it particularly efficient for applications like plagiarism detection, bioinformatics sequence searching, and related text-processing tools. It outperforms naive matching, which runs in O(nm) time, especially when patterns share common prefixes or when the text has few matches; in empirical tests it has achieved speeds up to 21 times faster than standard library functions on repetitive data. Modern implementations often enhance it with double hashing or larger moduli to further reduce collision probabilities, ensuring robustness in practice.

Introduction

Overview

The Rabin–Karp algorithm is a string-searching algorithm that employs hashing techniques to locate all occurrences of a given pattern within a larger text. Developed by Michael O. Rabin and Richard M. Karp, it was first published in 1987 as part of a broader exploration of randomized pattern-matching methods. At its core, the algorithm computes a hash value for the pattern and corresponding hash values for overlapping substrings of the text, using these to rapidly identify candidate matches, which then require exact character-by-character verification to confirm or rule out false positives. This approach leverages a rolling hash mechanism to update substring hashes incrementally without recomputing from scratch each time. The algorithm demonstrates expected linear-time performance of O(n + m) in practice, where n is the text length and m is the pattern length, making it particularly efficient for processing large texts or searching multiple patterns simultaneously. However, it carries a potential worst-case time complexity of O(nm) if hash collisions occur frequently, necessitating robust hash functions to mitigate this risk.

History

The Rabin–Karp algorithm was invented by Michael O. Rabin and Richard M. Karp in 1987 as a probabilistic approach to string matching. Their work introduced randomized techniques to achieve expected linear-time performance for finding occurrences of a pattern within a larger text, addressing limitations of deterministic methods prevalent at the time. The algorithm was detailed in their paper "Efficient Randomized Pattern-Matching Algorithms," published in the IBM Journal of Research and Development. This publication built upon a broader context of advancements in string algorithms, including the Knuth–Morris–Pratt algorithm (1977) and the Boyer–Moore algorithm (1977), which had established efficient deterministic searching but often required complex preprocessing. Rabin and Karp's contribution leveraged earlier ideas in universal hashing from Carter and Wegman (1977), adapting them to enable rapid comparisons via rolling hashes. The initial motivation stemmed from the need to surpass the quadratic time complexity of naive string matching, which was inefficient for the growing text corpora in early computing environments such as database systems and text-processing tools at IBM. By employing hashing to filter potential matches probabilistically, the algorithm offered practical speedups with low error probabilities, tunable through hash-function choices. By the late 1980s and into the 1990s, the Rabin–Karp method gained traction in algorithms textbooks and software implementations, influencing tools for text search; for instance, Sedgewick's 1988 textbook on algorithms highlighted it as a key technique for string processing. Its adoption extended to open-source libraries and search utilities, where it supported efficient handling of multiple patterns and large datasets.

Core Concepts

Rolling Hash Functions

A rolling hash function is a technique for efficiently computing hash values of substrings of a string by representing the string as a polynomial evaluated at a chosen base b, allowing incremental updates as the substring window slides across the text. This approach treats each character as a coefficient of the polynomial, with alphabet symbols mapped to numerical values (e.g., using base b = 256 for ASCII characters). The method enables O(1)-time updates for consecutive substrings, making it suitable for pattern-matching tasks like those in the Rabin–Karp algorithm. For a pattern string P of length m, the initial hash value h_P is computed as h_P = (Σ_{i=0}^{m-1} P[i] · b^{m-1-i}) mod p, where P[i] is the numerical value of the i-th character, b is the base, and p is a large prime modulus chosen to reduce the likelihood of collisions. This captures the entire pattern in a single value modulo p, serving as a fingerprint that approximates the string's identity. To compute the hash for the next sliding window in the text T, starting from position j, the update formula is h_next = ((h_current - T[j] · b^{m-1}) · b + T[j+m]) mod p. This removes the contribution of the outgoing character T[j] (weighted by the highest power of the base), multiplies by b to shift positions, and adds the incoming character T[j+m], all modulo p to keep values manageable. The operation exploits the polynomial structure to avoid recomputing the entire sum, ensuring constant-time transitions. Key parameters include the modulus p, typically a large prime such as 10^9 + 7, or 64-bit unsigned arithmetic whose natural overflow amounts to working mod 2^64, and the base b, usually selected as a prime larger than the alphabet size to enhance uniformity (e.g., b = 131 or 257). These choices balance computational efficiency with collision resistance, as a good p distributes hashes evenly across the range, minimizing spurious matches. The primary property of this rolling hash is its use in fingerprinting, where the modulo operation produces a compact signature with a low probability of collisions for distinct strings, theoretically bounded by about 1/p for random inputs. To further lower the false-positive rate, especially for long texts, double hashing employs two independent hash functions with distinct primes p_1 and p_2, requiring both hashes to match for a candidate position, which reduces the collision probability to approximately 1/(p_1 · p_2). This enhances reliability without significantly increasing computation.
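
A quick way to validate the update formula is to check, at every window position, that the O(1) rolling update reproduces the hash recomputed from scratch. A short Python sketch, under the illustrative parameters b = 256 and p = 10^9 + 7:

B, P = 256, 10**9 + 7

def full_hash(s):
    # Recompute the polynomial hash of s from scratch.
    h = 0
    for c in s:
        h = (B * h + ord(c)) % P
    return h

text, m = "the quick brown fox", 5
w = pow(B, m - 1, P)                  # b^(m-1) mod p, weight of the outgoing character
h = full_hash(text[:m])
for j in range(len(text) - m):
    h = ((h - ord(text[j]) * w) * B + ord(text[j + m])) % P   # rolling update
    assert h == full_hash(text[j + 1 : j + 1 + m])            # matches recomputation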

Collision Resolution

In the Rabin–Karp algorithm, potential matches identified by equal hash values between the pattern and a sliding window of the text are verified through an exact character-by-character comparison to resolve collisions, ensuring that only true matches are reported. This verification step is essential because hash collisions can occur when distinct strings produce the same hash value, leading to false positives if not checked. The probability of a collision at any given position is approximately 1/p for random strings, where p is the chosen modulus of the hash function, assuming a uniform distribution over the hash space. In the worst case, however, adversarial inputs can cause collisions at all n - m + 1 positions, forcing verification at every step and degrading performance. To mitigate this, the algorithm can employ multiple independent hash functions, such as two with distinct primes p and q, reducing the collision probability to roughly 1/(p · q) and making false positives exponentially unlikely. This verification process adds O(m) time per candidate match, where m is the pattern length, resulting in an expected overall runtime of O(n + m) under random inputs, but potentially O(nm) in the worst case due to frequent collisions. In practice, using 64-bit hash functions renders collisions negligible for most texts shorter than 10^18 characters, as the per-position collision probability falls below 10^-18, ensuring high reliability without excessive verification overhead.
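
The double-hashing idea can be sketched in Python as follows; the two primes below are common illustrative choices, not prescribed by the algorithm, and for brevity each window hash is recomputed rather than rolled:

BASE = 256
P1, P2 = 10**9 + 7, 998_244_353   # two distinct large primes (assumed choices)

def double_hash(s):
    # Compute two independent polynomial hashes of s.
    h1 = h2 = 0
    for c in s:
        h1 = (h1 * BASE + ord(c)) % P1
        h2 = (h2 * BASE + ord(c)) % P2
    return h1, h2

pattern = "needle"
text = "in a haystack lies a needle somewhere"
target = double_hash(pattern)
m = len(pattern)
for j in range(len(text) - m + 1):
    # Both hashes must match before the O(m) exact verification runs,
    # so the false-positive probability is roughly 1/(P1*P2).
    if double_hash(text[j:j + m]) == target and text[j:j + m] == pattern:
        print("match at", j)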

Algorithm Mechanics

Preprocessing

In the preprocessing phase of the Rabin–Karp algorithm, the hash value of the entire pattern string P of length m is computed to enable efficient comparisons during the search. This hash, denoted h_P, is derived from a polynomial rolling hash function: h_P = (Σ_{i=0}^{m-1} P[i] · b^{m-1-i}) mod p, where P[i] represents the numeric value of the i-th character (e.g., its ASCII code), b is the base (often 256 for ASCII characters), and p is a large prime modulus chosen to minimize collisions. The computation is typically performed iteratively to avoid overflow: initialize h_P = 0, then for each i from 0 to m-1, update h_P = (b · h_P + P[i]) mod p. This step takes O(m) time. Additionally, a scaling factor h = b^{m-1} mod p is precomputed, which facilitates the efficient update of substring hashes in the text by allowing the removal of the leading character's contribution and the addition of the new trailing character in constant time per window shift. This precomputation also requires at most O(m) time, but since it is done once, the overall preprocessing remains linear. The additional space used is O(1) beyond storing the input pattern. For illustration, consider the pattern "abc" with character codes 97, 98, and 99, base 256, and p = 101. The hash is h_P = (97 · 256² + 98 · 256 + 99) mod 101 = 90, and the scaling factor is h = 256² mod 101 = 88. These values prepare the algorithm for sliding-window comparisons without recomputing full hashes.
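
The preprocessing step, with the figures from the example above (base 256, modulus 101, pattern "abc"), can be sketched in Python as:

BASE, MOD = 256, 101

def preprocess(pattern):
    # Compute the pattern hash by Horner's rule, plus the scaling factor b^(m-1) mod p.
    m = len(pattern)
    h_p = 0
    for c in pattern:
        h_p = (BASE * h_p + ord(c)) % MOD
    scale = pow(BASE, m - 1, MOD)
    return h_p, scale

print(preprocess("abc"))   # (90, 88), matching the worked example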

Searching Process

The searching process of the Rabin–Karp algorithm begins after the pattern hash h_P has been precomputed during preprocessing. For a text T of length n and pattern P of length m, the algorithm first initializes the hash value for the initial window consisting of the first m characters of T, denoted T[0..m-1]. This initial hash, h_0, is calculated using the same hash function applied to the pattern. The core of the searching phase is an iterative loop over possible starting positions in the text. For each position j ranging from 0 to n - m, the algorithm compares the current window hash h_j (representing T[j..j+m-1]) with h_P. If the hashes match, an exact character-by-character verification is performed on the window T[j..j+m-1] against P to confirm a true match and avoid false positives due to hash collisions. If the verification succeeds, the starting index j is recorded in a list of matches. The window is then slid forward by one position, updating h_j to h_{j+1} via the rolling hash mechanism, and the process repeats until all positions are checked. This iterative approach efficiently scans the text by using hash comparisons to skip unnecessary exact checks in most cases. The output of the searching process is a list of all starting indices where P occurs in T; if no matches are found, the list is empty. Edge cases are handled straightforwardly: if n < m, no windows of size m exist in T, so the algorithm immediately returns an empty list without entering the loop. The following pseudocode outlines the searching process:

function search(T, n, P, m, h_P):
    if n < m:
        return empty list
    matches := empty list
    h := hash of T[0..m-1]                        // Initialize first window hash
    for j = 0 to n - m:
        if h == h_P:
            if T[j..j+m-1] == P:                  // Exact verification
                append j to matches
        if j < n - m:
            h := update_hash(h, T[j], T[j+m], m)  // Slide window
    return matches
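
Combining the preprocessing and the rolling update gives a runnable version of this pseudocode; the Python sketch below keeps the illustrative base-256, modulus-101 parameters from earlier sections:

BASE, MOD = 256, 101

def search(T, P):
    # Return all starting indices where P occurs in T.
    n, m = len(T), len(P)
    if n < m:
        return []
    h_p = h = 0
    for i in range(m):                       # hashes of the pattern and first window
        h_p = (BASE * h_p + ord(P[i])) % MOD
        h = (BASE * h + ord(T[i])) % MOD
    scale = pow(BASE, m - 1, MOD)
    matches = []
    for j in range(n - m + 1):
        if h == h_p and T[j:j + m] == P:     # hash filter, then exact verification
            matches.append(j)
        if j < n - m:                        # slide the window
            h = ((h - ord(T[j]) * scale) * BASE + ord(T[j + m])) % MOD
    return matches

print(search("abracadabra", "abra"))   # [0, 7]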

Extensions and Variants

Multiple Pattern Matching

The Rabin–Karp algorithm can be extended to efficiently search for multiple patterns in a text by leveraging hashing techniques to avoid exhaustive comparisons. In this variant, the algorithm processes a set of k patterns simultaneously, making it suitable for scenarios involving dictionaries or keyword lists. During preprocessing, hashes are computed for each of the k patterns using the same hash function as in the single-pattern case, and these hashes are stored in a set or hash table to enable O(1) average-case lookups. The total preprocessing time is O(Σ m_i), where m_i denotes the length of the i-th pattern, as each pattern requires a single hash computation proportional to its length. This storage allows quick checks without scanning all patterns at every text position. In the searching phase, the text is scanned using sliding windows of the appropriate lengths, computing the rolling hash for each window. For each window hash, the algorithm queries the pattern hash set; if a match is found, it verifies exact string equality against the patterns sharing that hash value (typically few, due to low collision rates). This step maintains the efficiency of the original algorithm, with expected verification cost proportional only to actual or spurious matches. The overall expected search time remains O(n + Σ m_i), where n is the text length, under the assumption of a hash function with negligible collision probability. To accommodate patterns of variable lengths, they are grouped by length during preprocessing, and separate rolling-hash computations are performed for each group using windows matching those lengths, as sketched below. This approach avoids computing hashes for irrelevant window sizes, keeping the expected time complexity at O(n + Σ m_i) while handling diverse pattern sets without quadratic overhead. For extremely large pattern sets where space is a concern, such as millions of entries, a Bloom filter can approximate the hash set during preprocessing and querying. Pattern hashes are inserted into the Bloom filter, which uses multiple hash functions for probabilistic membership testing with no false negatives but possible false positives. Candidate windows triggering a positive filter result undergo exact verification, trading minor space savings for controlled error rates tunable via filter parameters. This variant is particularly effective for approximate matching in resource-constrained environments. A representative example is searching for dictionary words in a large document, such as for plagiarism detection: the patterns (the dictionary words) are grouped by length, their hashes stored in a set, and the document scanned to report positions where any word occurs, enabling rapid identification of matches across varied term lengths.
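
The grouping-by-length strategy can be sketched in Python as follows (the helper names are illustrative, the parameters are the small ones used throughout, and each distinct pattern length gets its own rolling scan):

from collections import defaultdict

BASE, MOD = 256, 101

def poly_hash(s):
    h = 0
    for c in s:
        h = (BASE * h + ord(c)) % MOD
    return h

def find_all(text, patterns):
    # Yield (index, pattern) for every occurrence of any pattern in text.
    by_len = defaultdict(set)
    for p in patterns:
        by_len[len(p)].add(p)
    for m, group in by_len.items():          # one rolling scan per distinct length
        if len(text) < m:
            continue
        hashes = {poly_hash(p) for p in group}
        scale = pow(BASE, m - 1, MOD)
        h = poly_hash(text[:m])
        for j in range(len(text) - m + 1):
            if h in hashes and text[j:j + m] in group:
                yield j, text[j:j + m]
            if j + m < len(text):
                h = ((h - ord(text[j]) * scale) * BASE + ord(text[j + m])) % MOD

print(sorted(find_all("abracadabra", {"a", "bra", "cada"})))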

Practical Applications

The Rabin–Karp algorithm is widely employed in plagiarism detection systems, where it hashes chunks of text to identify similarities by comparing rolling hash values across documents. This approach enables efficient scanning of large corpora, such as academic papers or student submissions, to flag potential overlaps without exhaustive character-by-character comparisons. In computational biology, the algorithm facilitates the search for motifs (short, recurring DNA sequences) in massive genomic datasets, leveraging its hashing efficiency to handle texts where the length n can exceed billions of nucleotides. For instance, it supports pattern matching in sequence-alignment tools, accelerating the identification of regulatory elements or genetic markers in human or microbial genomes. The algorithm also finds application in search engines for substring matching during web-content indexing and querying, where rolling hashes quickly filter potential matches in vast textual data. Similarly, in software development tools, it aids in source-code analysis, such as detecting repeated code patterns. Implementations of the Rabin–Karp algorithm appear in various programming libraries for string searching, often hybridized with other techniques to optimize across pattern lengths; for example, educational and reference implementations in languages such as Python demonstrate its integration for plagiarism detection. Its principles support duplicate detection in modern text-processing pipelines. Recent GPU-accelerated variants enable multi-pattern matching in network intrusion detection systems for cybersecurity applications.

Performance Evaluation

Complexity Analysis

The preprocessing phase of the Rabin–Karp algorithm involves computing the hash value of the pattern string of length m, which requires O(m) time. In the searching phase, the algorithm computes rolling hashes for substrings of the text of length n in constant time per update, leading to an overall expected running time of O(n + m), because hash collisions necessitating full pattern verification are rare. However, in the worst case, if hashes collide frequently across all positions, the algorithm may require up to O(nm) time for exhaustive character-by-character comparisons. The space complexity is O(1) for searching a single pattern, as it only stores a fixed number of registers for hashes and positions. For the multiple-pattern variant, space increases to O(k) to accommodate the hashes of k patterns. Performance is influenced by the choice of modulus size in the hash function, which directly affects the collision probability; larger primes reduce the likelihood of false positives to negligible levels for typical string lengths, such as a probability below 10^-9 for n ≈ 10^9 when using a 64-bit modulus. A proof sketch for the expected number of verifications uses linearity of expectation: with n - m + 1 hash comparisons and collision probability 1/p per comparison (where p is the modulus), the expected number of false matches follows approximately a binomial distribution with mean (n - m + 1)/p; for p ≫ n, this mean is o(1), ensuring the additional verification cost is O(m) with high probability and yielding the linear expected time. In practice, selecting a large prime modulus (e.g., near 2^64) results in near-linear O(n) behavior on random data, as collisions are exceedingly rare and the algorithm avoids worst-case degradation.
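
The expected-verification bound is easy to evaluate numerically. A small Python illustration (the parameter values are chosen only for the example):

n, m = 10**9, 20
for p in (101, 2**31 - 1, 2**61 - 1):
    # Expected number of spurious hash matches over n - m + 1 window comparisons.
    print(f"p = {p}: ~{(n - m + 1) / p:.3g} expected false matches")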

Comparisons to Other Algorithms

The Rabin–Karp algorithm improves upon the naive string-matching approach by employing hashing to compare substrings, achieving an average-case running time of O(n + m) rather than the naive method's worst-case O(nm), where n is the text length and m is the pattern length, thus avoiding exhaustive character comparisons in most scenarios. This makes Rabin–Karp particularly advantageous for large texts, where hash mismatches quickly eliminate non-candidate positions, though it reverts to O(nm) in the rare worst case of numerous collisions. In contrast to the Knuth–Morris–Pratt (KMP) algorithm, which guarantees O(n + m) worst-case performance through O(m) preprocessing to build a prefix table and avoid backtracking in the text, Rabin–Karp relies on probabilistic hashing without such deterministic shifts, offering simpler preprocessing but the potential for collisions that require full verification. KMP excels in scenarios with repetitive patterns due to its failure function, while Rabin–Karp's expected linear time suits random texts but lacks KMP's consistency across all inputs. Compared to the Boyer–Moore algorithm, which achieves sublinear average-case performance of O(n/m) by skipping ahead using bad-character and good-suffix heuristics, especially effective for large alphabets and long patterns, Rabin–Karp provides more uniform efficiency across pattern types without relying on alphabet-specific rules, though it does not match Boyer–Moore's skipping capabilities in practice for typical texts. Boyer–Moore's right-to-left scanning often outperforms Rabin–Karp for single patterns, but Rabin–Karp's hashing scales better with multiple patterns. For multiple-pattern matching, the Aho–Corasick algorithm delivers deterministic O(n + Σ m_i) time, where Σ m_i is the total pattern length, by constructing a finite automaton that processes the text in a single pass, contrasting with Rabin–Karp extensions that use hashing for faster O(Σ m_i) preprocessing but introduce collision risks requiring verification. Aho–Corasick's trie-based structure is memory-intensive for large pattern sets, whereas Rabin–Karp variants (detailed in the multiple pattern matching section) prioritize preprocessing speed at the cost of probabilistic guarantees. Rabin–Karp is preferable for applications leveraging hardware-supported hashing or processing large random texts, such as plagiarism detection or genomic searches, while hybrids combining it with KMP or Boyer–Moore are common in standard libraries, such as those of Go or Python, for balanced performance.
