String-searching algorithm
A string-searching algorithm, sometimes called a string-matching algorithm, is an algorithm that searches a body of text for portions that match a given pattern.
A basic example of string searching is when the pattern and the searched text are arrays of elements of an alphabet (finite set) Σ. Σ may be a human language alphabet, for example the letters A through Z; other applications may use a binary alphabet (Σ = {0,1}) or, in bioinformatics, a DNA alphabet (Σ = {A,C,G,T}).
In practice, the choice of a feasible string-search algorithm may be affected by the string encoding. In particular, if a variable-width encoding is in use, then it may be slower to find the Nth character, perhaps requiring time proportional to N. This may significantly slow some search algorithms. One of many possible solutions is to search for the sequence of code units instead, but doing so may produce false matches unless the encoding is specifically designed to avoid it.[citation needed]
Overview
The most basic case of string searching involves one (often very long) string, sometimes called the haystack, and one (often very short) string, sometimes called the needle. The goal is to find one or more occurrences of the needle within the haystack. For example, one might search for to within:
Some books are to be tasted, others to be swallowed, and some few to be chewed and digested.
One might request the first occurrence of "to", which is the fourth word; or all occurrences, of which there are 3; or the last, which is the fifth word from the end.
Very commonly, however, various constraints are added. For example, one might want to match the "needle" only where it consists of one (or more) complete words—perhaps defined as not having other letters immediately adjacent on either side. In that case a search for "hew" or "low" should fail for the example sentence above, even though those literal strings do occur.
Another common example involves "normalization". For many purposes, a search for a phrase such as "to be" should succeed even in places where there is something else intervening between the "to" and the "be":
- More than one space
- Other "whitespace" characters such as tabs, non-breaking spaces, line-breaks, etc.
- Less commonly, a hyphen or soft hyphen
- In structured texts, tags or even arbitrarily large but "parenthetical" things such as footnotes, list-numbers or other markers, embedded images, and so on.
Many symbol systems include characters that are synonymous (at least for some purposes):
- Latin-based alphabets distinguish lower-case from upper-case, but for many purposes string search is expected to ignore the distinction.
- Many languages include ligatures, where one composite character is equivalent to two or more other characters.
- Many writing systems involve diacritical marks such as accents or vowel points, which may vary in their usage, or be of varying importance in matching.
- DNA sequences can involve non-coding segments which may be ignored for some purposes, or polymorphisms that lead to no change in the encoded proteins, which may not count as a true difference for some other purposes.
- Some languages have rules where a different character or form of character must be used at the start, middle, or end of words.
Finally, for strings that represent natural language, aspects of the language itself become involved. For example, one might wish to find all occurrences of a "word" despite it having alternate spellings, prefixes or suffixes, etc.
Another more complex type of search is regular expression searching, where the user constructs a pattern of characters or other symbols, and any match to the pattern should fulfill the search. For example, to catch both the American English word "color" and the British equivalent "colour", instead of searching for two different literal strings, one might use a regular expression such as:
colou?r
where the "?" conventionally makes the preceding character ("u") optional.
This article mainly discusses algorithms for the simpler kinds of string searching.
A similar problem introduced in the field of bioinformatics and genomics is the maximal exact matching (MEM).[1] Given two strings, MEMs are common substrings that cannot be extended left or right without causing a mismatch.[2]
Examples of search algorithms
Naive string search
A simple and inefficient way to see where one string occurs inside another is to check at each index, one by one. First, we see if there is a copy of the needle starting at the first character of the haystack; if not, we look to see if there's a copy of the needle starting at the second character of the haystack, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a string like "aaaaaaaaab", it takes O(nm) steps.
Finite-state-automaton-based search
In this approach, backtracking is avoided by constructing a deterministic finite automaton (DFA) that recognizes the stored search string. These are expensive to construct—they are usually created using the powerset construction—but are very quick to use. For example, a DFA built for the word "MOMMY" recognizes exactly that word and can be run over the text one character at a time. This approach is frequently generalized in practice to search for arbitrary regular expressions.
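The sketch below builds such a matching automaton for a single literal needle. Rather than the general powerset construction mentioned above, it uses the incremental restart-state construction familiar from KMP-style automata, which produces the same DFA for a plain string; the function names, alphabet, and "MOMMY" example are illustrative.

```python
def build_dfa(pat, alphabet):
    """dfa[state][c] = next state after reading c; reaching state len(pat) means 'found'."""
    m = len(pat)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(m)]
    dfa[0][pat[0]] = 1
    x = 0                               # restart state: where we would be after pat[1:j]
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]       # on a mismatch, behave like the restart state
        dfa[j][pat[j]] = j + 1          # on a match, advance
        x = dfa[x][pat[j]]              # advance the restart state
    return dfa

def dfa_find(text, pat, alphabet):
    """Return the index of the first occurrence of pat in text, or -1."""
    dfa, m, state = build_dfa(pat, alphabet), len(pat), 0
    for i, c in enumerate(text):
        state = dfa[state].get(c, 0)    # characters outside the alphabet reset the automaton
        if state == m:
            return i - m + 1
    return -1

print(dfa_find("MOMMOMMY", "MOMMY", set("MOY")))   # -> 3
```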
Stubs
Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix. Boyer–Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates' approach.
Index methods
Faster search algorithms preprocess the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be built in Θ(n) time, and all occurrences of a pattern can be found in O(m) time, under the assumption that the alphabet has a constant size and all inner nodes in the suffix tree know what leaves are underneath them. The latter can be accomplished by running a DFS algorithm from the root of the suffix tree.
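As a rough sketch of the indexing idea (not the linear-time constructions referenced above), a suffix array can be built by simply sorting suffix start positions and then queried with binary search. The naive construction here costs O(n² log n); the names and example text are illustrative.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Naive construction: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All start positions of pattern, via binary search over the sorted suffixes."""
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]      # only the first m characters matter
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

text = "banana"
sa = suffix_array(text)               # [5, 3, 1, 0, 4, 2]
print(occurrences(text, sa, "ana"))   # [1, 3]
```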
Other variants
Some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
Classification of search algorithms
Classification by the number of patterns
The various algorithms can be classified by the number of patterns each uses.
Single-pattern algorithms
In the following compilation, m is the length of the pattern, n the length of the searchable text, and k = |Σ| is the size of the alphabet.
| Algorithm | Preprocessing time | Matching time[1] | Space |
|---|---|---|---|
| Naïve algorithm | none | Θ(n+m) in average, O(mn) | none |
| Automaton-based matching | Θ(km) | Θ(n) | Θ(km) |
| Rabin–Karp | Θ(m) | Θ(n) in average, O(mn) at worst | O(1) |
| Knuth–Morris–Pratt | Θ(m) | Θ(n) | Θ(m) |
| Boyer–Moore | Θ(m + k) | O(n/m) at best, O(mn) at worst | Θ(k) |
| Two-way algorithm[3][2] | Θ(m) | O(n) | O(log(m)) |
| Backward Non-Deterministic DAWG Matching (BNDM)[4][3] | O(m) | Ω(n/m) at best, O(mn) at worst | |
| Backward Oracle Matching (BOM)[5] | O(m) | O(mn) | |
- 1.^ Asymptotic times are expressed using O, Ω, and Θ notation.
- 2.^ Used to implement the memmem and strstr search functions in the glibc[6] and musl[7] C standard libraries.
- 3.^ Can be extended to handle approximate string matching and (potentially-infinite) sets of patterns represented as regular languages.[citation needed]
The Boyer–Moore string-search algorithm has been the standard benchmark for the practical string-search literature.[8]
Algorithms using a finite set of patterns
In the following compilation, M is the length of the longest pattern, m their total length, n the length of the searchable text, o the number of occurrences.
| Algorithm | Extension of | Preprocessing time | Matching time[4] | Space |
|---|---|---|---|---|
| Aho–Corasick | Knuth–Morris–Pratt | Θ(m) | Θ(n + o) | Θ(m) |
| Commentz-Walter | Boyer–Moore | Θ(m) | Θ(M · n) worst case, sublinear in average[9] | Θ(m) |
| Set-BOM | Backward Oracle Matching | | | |
Algorithms using an infinite number of patterns
Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or regular expression.
Classification by the use of preprocessing programs
Other classification approaches are possible. One of the most common uses preprocessing as the main criterion.
| | Text not preprocessed | Text preprocessed |
|---|---|---|
| Patterns not preprocessed | Elementary algorithms | Index methods |
| Patterns preprocessed | Constructed search engines | Signature methods[11] |
Classification by matching strategies
Another classification groups the algorithms by their matching strategy:[12]
- Match the prefix first (Knuth–Morris–Pratt, Shift-And, Aho–Corasick)
- Match the suffix first (Boyer–Moore and variants, Commentz-Walter)
- Match the best factor first (BNDM, BOM, Set-BOM)
- Other strategy (Naïve, Rabin–Karp, Vectorized)
Real-time string matching
In real-time string matching, the matcher is required to output a response after reading each character of the text, indicating whether this is the last character of a match. The response has to be given within constant time. The requirements regarding preprocessing vary: O(m) preprocessing may be allowed after the pattern is read (but before the reading of the text), or a stricter requirement may be posed, according to which the matcher must also pause for at most a constant time after reading any character of the pattern (including the last). For the more lenient version, if one does not mind that the preprocessing time and memory requirement depend on the size of the alphabet, a real-time solution is provided by automaton matching. Zvi Galil developed a method to turn certain algorithms into real-time algorithms, and applied it to produce a variant of the KMP matcher that runs in real time under the strict requirement.[13]
String searching with don't cares
In this version of the string searching problem, there is a special symbol, ø (read: don't care), which can match any other symbol (including another ø). Don't care symbols can appear either in the pattern or in the text. In 2002, an algorithm for this problem that runs in O(n log m) time was given by Richard Cole and Ramesh Hariharan, improving on a solution from 1973 by Fischer and Paterson that has complexity O(n log m log k), where k is the size of the alphabet.[14] Another algorithm, claimed simpler, has been proposed by Clifford and Clifford.[15]
See also
References
- ^ Kurtz, Stefan; Phillippy, Adam; Delcher, Arthur L; Smoot, Michael; Shumway, Martin; Antonescu, Corina; Salzberg, Steven L (2004). "Versatile and open software for comparing large genomes". Genome Biology. 5 (2): R12. doi:10.1186/gb-2004-5-2-r12. ISSN 1465-6906. PMC 395750. PMID 14759262.
- ^ Khan, Zia; Bloom, Joshua S.; Kruglyak, Leonid; Singh, Mona (2009-07-01). "A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays". Bioinformatics. 25 (13): 1609–1616. doi:10.1093/bioinformatics/btp275. PMC 2732316. PMID 19389736.
- ^ Crochemore, Maxime; Perrin, Dominique (1 July 1991). "Two-way string-matching" (PDF). Journal of the ACM. 38 (3): 650–674. doi:10.1145/116825.116845. S2CID 15055316. Archived (PDF) from the original on 24 November 2021. Retrieved 5 April 2019.
- ^ Navarro, Gonzalo; Raffinot, Mathieu (1998). "A bit-parallel approach to suffix automata: Fast extended string matching" (PDF). Combinatorial Pattern Matching. Lecture Notes in Computer Science. Vol. 1448. Springer Berlin Heidelberg. pp. 14–33. doi:10.1007/bfb0030778. ISBN 978-3-540-64739-3. Archived (PDF) from the original on 2019-01-05. Retrieved 2019-11-22.
- ^ Fan, H.; Yao, N.; Ma, H. (December 2009). "Fast Variants of the Backward-Oracle-Marching Algorithm" (PDF). 2009 Fourth International Conference on Internet Computing for Science and Engineering. pp. 56–59. doi:10.1109/ICICSE.2009.53. ISBN 978-1-4244-6754-9. S2CID 6073627. Archived from the original on 2022-05-10. Retrieved 2019-11-22.
- ^ "glibc/string/str-two-way.h". Archived from the original on 2020-09-20. Retrieved 2022-03-22.
- ^ "musl/src/string/memmem.c". Archived from the original on 1 October 2020. Retrieved 23 November 2019.
- ^ Hume; Sunday (1991). "Fast String Searching". Software: Practice and Experience. 21 (11): 1221–1248. doi:10.1002/spe.4380211105. S2CID 5902579.
- ^ Commentz-Walter, Beate (1979). A String Matching Algorithm Fast on the Average (PDF). International Colloquium on Automata, Languages and Programming. LNCS. Vol. 71. Graz, Austria: Springer. pp. 118–132. doi:10.1007/3-540-09510-1_10. ISBN 3-540-09510-1. Archived from the original (PDF) on 2017-10-10.
- ^ Melichar, Borivoj, Jan Holub, and J. Polcar. Text Searching Algorithms. Volume I: Forward String Matching. Vol. 1. 2 vols., 2005. http://stringology.org/athens/TextSearchingAlgorithms/ Archived 2016-03-04 at the Wayback Machine.
- ^ Litwin, Witold; Mokadem, Riad; Rigaux, Philippe; Schwarz, Thomas (2007), Fast nGram-Based String Search Over Data Encoded Using Algebraic Signatures (PDF), International Conference on Very Large Data Bases
- ^ Gonzalo Navarro; Mathieu Raffinot (2008), Flexible Pattern Matching Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, ISBN 978-0-521-03993-2
- ^ Galil, Zvi (1981). "String matching in real time". Journal of the ACM. 28 (1): 134–149. doi:10.1145/322234.322244.
- ^ Cole, Richard; Hariharan, Ramesh (2002). "Verifying candidate matches in sparse and wildcard matching". Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. pp. 592–601.
- ^ Clifford, Peter; Clifford, Raphaël (January 2007). "Simple deterministic wildcard matching". Information Processing Letters. 101 (2): 53–54. doi:10.1016/j.ipl.2006.08.002.
Further reading
- R. S. Boyer and J. S. Moore, A fast string searching algorithm, Commun. ACM 20 (10), 762–772 (1977).
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. MIT Press and McGraw-Hill, 2009. ISBN 0-262-03293-7. Chapter 32: String Matching, pp. 985–1013.
External links
- Huge list of pattern matching links Last updated: 12/27/2008 20:18:38
- Large (maintained) list of string-matching algorithms
- NIST list of string-matching algorithms
- StringSearch – high-performance pattern matching algorithms in Java – Implementations of many String-Matching-Algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
- StringsAndChars – Implementations of many String-Matching-Algorithms (for single and multiple patterns) in Java
- Exact String Matching Algorithms — Animation in Java, Detailed description and C implementation of many algorithms.
- (PDF) Improved Single and Multiple Approximate String Matching Archived 2017-03-11 at the Wayback Machine
- Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features
- NyoTengu – high-performance pattern matching algorithm in C – Implementations of Vector and Scalar String-Matching-Algorithms in C
- Nathaniel K. Brown, et al.: "KeBaB: k-mer based breaking for finding long MEMs", arXiv:2502.20338v3 (09 Jun 2025).
String-searching algorithm
Introduction
Definition and Problem Statement
A string-searching algorithm, also known as exact string matching, is a method to locate all occurrences of a given pattern string P within a larger text string T.[6] Both strings are sequences of characters drawn from a finite alphabet Σ, with P having length m and T having length n, where typically m ≤ n.[7] The core problem is formally defined as follows: given the input pair (T, P), identify and report all starting positions i (where 0 ≤ i ≤ n − m) such that the substring T[i..i+m−1] exactly equals P.[8] The output is a list of these matching indices i, representing the positions where P appears in T.[6] This formulation assumes exact matching, meaning no errors or approximations are allowed in the comparison.[7] Standard exact string matching reports all occurrences, which may overlap—for instance, in the text "aaa" searching for "aa" yields matches at positions 0 and 1.[9] Variants of the problem include determining only the existence of any match (without listing positions) or reporting a maximal set of non-overlapping occurrences, where no two matches share characters.[10] These distinctions arise in applications such as text processing, where the choice depends on whether complete coverage or disjoint segments are required.[9]
Historical Development and Importance
The development of string-searching algorithms began in the early days of computing, with early naive methods performing brute-force comparisons, resulting in quadratic time complexity in the worst case.[11] These approaches, while simple, were insufficient for growing text-processing needs in early systems. A significant breakthrough came in 1977 with the Knuth-Morris-Pratt (KMP) algorithm, introduced by Donald E. Knuth, James H. Morris, and Vaughan R. Pratt, which achieved linear-time performance O(n + m) for text length n and pattern length m by preprocessing the pattern to avoid redundant comparisons.

Key milestones followed closely. In 1975, Alfred V. Aho and Margaret J. Corasick developed an algorithm for multiple-pattern matching, building a finite automaton to locate all occurrences of a set of keywords in linear time, enhancing efficiency for dictionary-based searches.[12] Also in 1977, Robert S. Boyer and J Strother Moore proposed the Boyer-Moore algorithm, which skips portions of the text by comparing from the end of the pattern, often achieving sublinear average-case performance and proving highly practical for large texts.[13] Concurrently, suffix tree structures revolutionized the field; Peter Weiner introduced the concept in 1973 as a "position tree" for linear-time pattern matching via compacted suffix representations.[14] This was refined in the 1980s and culminated in Esko Ukkonen's 1995 online linear-time construction algorithm, making suffix trees more accessible for dynamic applications.[15]

String-searching algorithms hold central importance in computing, underpinning tools like compilers for lexical analysis and syntax parsing, text editors for find-and-replace operations, and databases for query processing and indexing.[16] In bioinformatics, they enable efficient DNA and protein sequence alignment, crucial for genome assembly and motif discovery in large biological datasets.[16] Information retrieval systems rely on them for full-text search in search engines, while security applications, such as intrusion detection systems like Snort, use multi-pattern variants to scan network traffic for malicious signatures in real time.[17] These advancements profoundly impacted computational complexity theory, shifting the paradigm from quadratic O(nm) naive methods to linear and sublinear bounds, which influenced broader algorithm design by emphasizing preprocessing and skipping techniques to exploit string redundancies.[16] This evolution not only improved practical performance but also inspired research in related areas like compressed indexing and approximate matching.[11]
Core Concepts
Text and Pattern Representation
In computer science, strings used in searching algorithms are typically represented as sequences of characters stored in contiguous memory locations, such as character arrays. In languages like C, strings are implemented as null-terminated arrays of characters, where a special null byte (ASCII 0) marks the end of the string, allowing functions to determine length without explicit storage.[18] This approach facilitates efficient traversal but requires careful management to avoid buffer overflows. In contrast, higher-level languages like Java and Python treat strings as immutable objects, preventing in-place modifications to ensure thread safety and enable string sharing across the program. In Java, strings are stored as arrays of 16-bit UTF-16 code units, supporting Unicode characters where supplementary characters are represented by surrogate pairs.[19] Similarly, Python represents strings as sequences of Unicode code points, with immutability enforced to promote functional programming practices and optimize memory usage through interning.[20]

Handling Unicode and multibyte characters adds complexity to string representation, as encodings like UTF-8 use variable-length byte sequences (1 to 4 bytes per character) to represent the full range of Unicode code points. This allows efficient storage for ASCII-dominant text while accommodating international scripts, but searching algorithms must account for code unit boundaries to avoid partial character matches. The Unicode encoding model structures this through layers including character encoding forms (e.g., UTF-8, UTF-16) that map code points to code units, ensuring consistent processing across systems. For multibyte scenarios, such as in Chinese text processing, algorithms treat mixed single- and multi-byte characters by normalizing boundaries during representation.[21][22]

Preprocessing of text and pattern begins with computing their lengths, denoted n for the text and m for the pattern, which establishes the scope for subsequent comparisons. Optional transformations, such as converting both to lowercase, enable case-insensitive matching by standardizing character representations without altering the underlying structure. This step is particularly useful in natural language applications where case variations should not affect results.[23][24]

Beyond basic arrays, data structures for text storage include linked lists, which offer advantages in dynamic environments by allowing efficient insertions and deletions at arbitrary positions without shifting elements, though they incur overhead from pointer storage and slower random access compared to arrays. Arrays, conversely, provide constant-time access via indexing, making them ideal for fixed-size texts in performance-critical searches, but resizing requires reallocation. For streaming inputs, where text arrives incrementally, buffers—such as circular queues or fixed-size arrays—temporarily hold data chunks to enable processing without loading the entire text into memory. In finite automata underlying some matching techniques, states are briefly represented using adjacency lists to model transitions over the alphabet, facilitating compact storage of sparse graphs.[25][26][27]

Memory considerations in preprocessing involve space trade-offs, where pattern-specific structures often require O(m) additional space to store auxiliary data like shift tables, balancing preprocessing time against search efficiency without excessive overhead for the text.
This linear space usage for the pattern enables optimizations in algorithms while keeping overall memory proportional to input sizes.[28]
Performance Metrics and Complexity Analysis
Performance in string-searching algorithms is primarily evaluated through time and space complexities, which quantify the computational resources required as functions of the input sizes: the text length n and the pattern length m, where typically m ≤ n. Time complexity is divided into preprocessing (building auxiliary structures from the pattern) and searching (scanning the text for matches) phases. Preprocessing often requires O(m) time to analyze the pattern, while searching aims for efficiency across varying input distributions. In the worst case, searching can degrade to O(nm) time, as seen in basic approaches where every possible alignment of the pattern against the text is exhaustively checked, leading to up to m(n − m + 1) character comparisons.[29] For average-case analysis assuming random strings over an alphabet Σ, searching time is often sublinear or linear, such as O(n + m) or better, due to early mismatches reducing unnecessary inspections.[30]

Space complexity measures the additional memory beyond the input text and pattern. Auxiliary space is commonly O(m) for storing preprocessing tables or failure functions, resulting in total space O(n + m). Some methods achieve constant auxiliary space O(1), relying solely on the input, though this may limit preprocessing benefits. Practical implementations must balance this with memory hierarchies, where excessive auxiliary space can increase cache misses.[29]

Key performance metrics beyond asymptotic bounds include the number of character comparisons (both text and pattern characters inspected) and the total number of text character inspections, which directly impact runtime on real hardware. These are often lower in average cases for random inputs, where mismatches occur after few comparisons, but adversarial inputs can force maximum inspections. Empirical benchmarks on test suites, such as the Canterbury Corpus or DNA sequences, reveal practical speeds, with algorithms favoring fewer inspections outperforming others by factors of 2-10 on large texts.[31][29]

Practical factors significantly influence efficiency. The alphabet size affects skipping potential: larger alphabets enable more mismatches and shifts, reducing inspections, while small alphabets (e.g., binary or DNA) demand specialized handling to avoid quadratic degradation. Cache efficiency is crucial for large n, as algorithms with sequential text access and localized pattern checks minimize cache misses, improving throughput by up to 20-50% on modern processors compared to jump-heavy methods.[32][31]

Analysis models distinguish theoretical guarantees from real-world behavior. Worst-case analysis uses adversarial inputs to bound maximum resource use, ensuring reliability in hostile scenarios. Average-case models assume uniform random strings, yielding probabilistic bounds on the expected number of inspections. Empirical evaluation via benchmarks on diverse datasets (e.g., English text, genomes) provides context-specific insights, often prioritizing metrics like inspections over pure time due to hardware variability.[30][31]
Single-Pattern Algorithms
Single-pattern string-searching algorithms are designed to locate all occurrences of a single pattern string P of length m within a text string T of length n (where m ≤ n). These algorithms focus on exact matching for one fixed pattern and employ a variety of strategies—including direct comparison, pattern preprocessing to compute auxiliary tables, heuristic skipping rules, and hashing—to optimize performance beyond the basic brute-force method. They offer different trade-offs: some prioritize simplicity with no preprocessing but risk quadratic time in the worst case, while others achieve linear or sublinear average-case performance through preprocessing or clever mismatch handling. The following subsections describe four classic examples: the naive string matching algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer-Moore algorithm, and the Rabin-Karp algorithm.
Naive String Matching
The naive string matching algorithm, also known as the brute-force approach, searches for occurrences of a pattern string P of length m within a text string T of length n (where m ≤ n) by systematically checking each possible alignment of P against substrings of T. For each starting position i ranging from 0 to n − m, the algorithm compares characters of P with the corresponding characters in T sequentially from left to right, halting the comparison at the first mismatch and advancing to the next position only after a full match or complete mismatch check. This process reports all starting indices where a full match is found.[4] The algorithm's implementation uses two nested loops: an outer loop over possible starting positions i, and an inner loop over pattern positions j from 0 to m − 1. If all characters match for a given i, the position is recorded as an occurrence. A direct rendering in Python:

def naive_match(T, P):
    n = len(T)
    m = len(P)
    for i in range(n - m + 1):
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            print(i)  # report i as a match
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm is a linear-time string-searching method for finding all occurrences of a pattern string P of length m within a text string T of length n, where m ≤ n. Developed independently by Donald Knuth, Vaughan Pratt, and James H. Morris, it improves upon naive string matching by leveraging internal structure in the pattern to skip unnecessary comparisons, ensuring that each position in the text is examined at most once.[4] This approach is particularly effective for exact matching in single-pattern scenarios, as it preprocesses the pattern to build a failure function that guides efficient shifts during mismatches.[4]

The core of the KMP algorithm lies in its preprocessing step, which computes the failure function (also called the prefix table or partial match table) for the pattern P. For each position q in P (where 0 ≤ q < m), π[q] stores the length of the longest proper prefix of the substring P[0..q] that is also a suffix of P[0..q].[4] This function captures overlapping redundancies in the pattern, enabling the algorithm to "fail" gracefully without backtracking in the text. The failure function is initialized with π[0] = 0. For q = 1 to m − 1, π[q] is determined by finding the largest k such that the prefix P[0..k−1] equals the suffix P[q−k+1..q]; if no such k exists, then π[q] = 0. The computation itself runs in O(m) time by iteratively extending candidate matches and falling back using previously computed values to avoid quadratic comparisons.[4]

In the searching phase, a state variable q tracks the length of the current partial match as the text is scanned at position i. The algorithm advances while characters match (T[i] = P[q]), incrementing both i and q. On a mismatch, it sets q = π[q − 1], effectively shifting the pattern forward and resuming matching without re-examining prior text characters.[4] A match is found whenever q = m, at which point the occurrence starts at i − m + 1, and q is reset to π[m − 1] to continue searching for overlaps.

The overall time complexity of KMP is O(m) for preprocessing plus O(n) for searching, yielding a total of O(n + m), independent of the alphabet size.[4] It requires O(m) additional space to store the π array. This linear bound holds because the searching phase amortizes shifts such that the total number of operations is bounded by 2n.[4]

To illustrate, consider the pattern "ababaca" (m = 7). The failure function is computed as π = [0, 0, 1, 2, 3, 0, 1]. For the prefix ending at index 4 ("ababa"), π[4] = 3 because "aba" (length 3) is the longest proper prefix that matches the suffix. Similarly, for "ababac" at index 5, π[5] = 0 since "c" shares no prefix-suffix overlap with earlier parts. This array allows the algorithm, during searching, to skip to a state of 3 after matching "ababa" and mismatching on the next character, recognizing the overlapping "aba".[4]
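A compact Python sketch of the two phases just described (0-indexed, reporting overlapping occurrences; the driver strings are illustrative):

```python
def kmp_prefix(P):
    """pi[q] = length of the longest proper prefix of P[:q+1] that is also its suffix."""
    pi, k = [0] * len(P), 0
    for q in range(1, len(P)):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(T, P):
    """Yield every start index of P in T in O(n + m) time."""
    pi, q = kmp_prefix(P), 0
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = pi[q - 1]             # fall back using the failure function
        if P[q] == c:
            q += 1
        if q == len(P):
            yield i - len(P) + 1
            q = pi[q - 1]             # keep scanning for overlapping matches

print(kmp_prefix("ababaca"))                         # [0, 0, 1, 2, 3, 0, 1]
print(list(kmp_search("bacbababacaab", "ababaca")))  # [4]
```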
Boyer-Moore Algorithm
The Boyer-Moore algorithm is a heuristic-based method for efficiently searching a single pattern string P of length m within a text string T of length n, where m ≤ n. Developed by Robert S. Boyer and J Strother Moore, it compares characters from right to left, leveraging mismatches to skip large portions of the text, which makes it particularly effective for natural language texts or patterns over large alphabets.[5] Unlike left-to-right approaches, this rightward alignment allows the algorithm to exploit information from partial matches to compute maximal safe shifts, often achieving sublinear time performance in practice.

Preprocessing in the Boyer-Moore algorithm constructs two tables to guide shifts during searching. The bad-character rule table, denoted bc[c] for each character c in the alphabet Σ, stores the maximum shift distance when c causes a mismatch; specifically, bc[c] is m if c does not appear in P, and otherwise m − 1 − r, where r is the rightmost position of c in P (0-based from the left).[5] The good-suffix rule table gs[j], for positions j from 0 to m (where j = 0 indicates a full match), computes shifts based on the longest suffix of P that matches a prefix of P; more precisely, gs[j] is the smallest shift that aligns the next occurrence of the matched suffix starting after position j in P, or m if no such occurrence exists.[33] These tables are built in O(m + |Σ|) time, enabling rapid lookups during the search phase.[1]

The searching phase aligns the right end of P with position i in T (initially i = m − 1) and compares characters from right to left, starting at P[m − 1] against T[i]. If all characters match, an occurrence is reported at i − m + 1, and the search continues from the next alignment.[34] On a mismatch at pattern position j, the pattern shifts right by the larger of the bad-character and good-suffix shifts, ensuring the shift is safe and maximal based on both rules.[34] This process repeats until i reaches the end of the text. For illustration, consider searching P = "he" (m = 2) in T = "hers": the initial alignment at i = 1 places P over T[0..1] = "he", matching fully and reporting position 0; the next alignment at i = 2 mismatches on T[2] = 'r' against P[1] = 'e', and with bc['r'] = 2 ('r' is absent in P) the pattern shifts to i = 4, past the end.[5]

The bad-character heuristic shifts the pattern to align the mismatched text character with its rightmost occurrence in P, or beyond it if the character is absent, preventing redundant comparisons of that character.[33] The good-suffix heuristic, applied when the bad-character shift is insufficient, shifts to the next plausible position where the matched suffix from the end of P can reoccur, considering border alignments in P to avoid rechecking verified suffixes.[34] Together, these heuristics prioritize larger skips, making the algorithm excel in scenarios with distinct patterns or large alphabets like English text, where mismatches are informative.[37]

In terms of performance, the Boyer-Moore algorithm has a worst-case time complexity of O(nm) due to potential exhaustive backtracking in adversarial inputs, though the good-suffix rule bounds comparisons to about 3n in some refined versions.[5] Its average-case time is O(n/m), as shifts often exceed 1, inspecting far fewer than n characters—empirically, it inspects only a fraction of the text characters for m = 5 in English prose.[37] Space usage is O(m + |Σ|) for the tables, independent of n.[1] A notable variant is the Boyer-Moore-Horspool algorithm, proposed by Nigel Horspool, which simplifies the original by using only the bad-character rule, applied to the text character aligned with the pattern's last position on any mismatch, reducing preprocessing complexity while retaining much of the efficiency for practical texts.[38] This trade-off makes it easier to implement and nearly as fast in average cases, with the same worst-case O(nm) time.[39]
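A minimal Python sketch of the Boyer-Moore-Horspool simplification just mentioned, which keeps only the bad-character rule; the full Boyer-Moore algorithm adds the good-suffix table on top of this. Names and example strings are illustrative.

```python
def horspool(T, P):
    """Boyer-Moore-Horspool: right-to-left comparisons, shifts taken from a bad-character table."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from the rightmost occurrence of c in P[:-1] to the end of P
    shift = {c: m - 1 - j for j, c in enumerate(P[:-1])}
    matches, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and T[i + j] == P[j]:     # compare the window right to left
            j -= 1
        if j < 0:
            matches.append(i)
        i += shift.get(T[i + m - 1], m)        # skip based on the char under P's last position
    return matches

print(horspool("here is a simple example", "example"))   # [17]
```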
Rabin-Karp Algorithm
The Rabin-Karp algorithm is a probabilistic string-searching method that uses hashing to identify potential matches of a pattern string P of length m within a text string T of length n, where typically m ≤ n. Introduced by Richard M. Karp and Michael O. Rabin in 1987, it computes hash values for P and sliding windows of T to filter candidates quickly, followed by exact character-by-character verification to resolve any hash collisions. This approach achieves linear-time performance in the average case by leveraging a rolling hash mechanism that avoids recomputing hashes from scratch for each substring.[40]

During preprocessing, the algorithm computes the hash value of the pattern using a polynomial rolling hash function:

h_P = \left( P[0] \cdot b^{m-1} + P[1] \cdot b^{m-2} + \cdots + P[m-1] \cdot b^0 \right) \bmod q

Here, b is a base (often a prime larger than the alphabet size, such as 256 for ASCII), and q is a large prime modulus (e.g., a 64-bit prime) chosen to reduce collision probability. This step takes O(m) time. The algorithm also precomputes b^{m−1} mod q for efficient updates. For the text, the initial hash for the first substring of length m is computed similarly, but subsequent hashes use a rolling update to slide the window by one position:

h_{i+1} = \left( (h_i - T[i] \cdot b^{m-1}) \cdot b + T[i+m] \right) \bmod q

This update removes the leading character, shifts the remaining hash left by one base position, and appends the new trailing character, all in O(1) time per step. If a substring hash matches h_P, the algorithm performs a direct comparison of the m characters to confirm the match, ensuring correctness despite possible collisions.

The time complexity is O(n + m) in the average and expected case, as hash computations and comparisons dominate but occur linearly, with verification happening rarely due to low collision rates. In the worst case, it degrades to O(nm) if many false positives trigger full verifications (e.g., adversarial inputs causing frequent collisions). Space usage is O(1) beyond the input strings. To further mitigate collisions, double hashing—computing two independent hashes with different bases and moduli and requiring both to match—can reduce the probability to negligible levels. The collision probability for a single pair is at most 1/q, and over n comparisons it is bounded by n/q, which is made arbitrarily small by selecting a sufficiently large q. As an example, to search for a short pattern in a text, one precomputes h_P, slides a length-m window over the text while updating its hash in O(1) per step, and verifies characters only when a window hash equals h_P, reporting true matches and discarding collisions; the sketch below walks through a concrete case.
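A minimal Python sketch of the hashing and rolling update described above; the base, modulus, and example strings are illustrative choices rather than prescribed values.

```python
def rabin_karp(T, P, base=256, q=(1 << 61) - 1):
    """Rabin-Karp: compare rolling hashes, verify characters only on hash equality."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h_p = h_t = 0
    high = pow(base, m - 1, q)                 # base^(m-1) mod q, to remove the leading char
    for i in range(m):
        h_p = (h_p * base + ord(P[i])) % q
        h_t = (h_t * base + ord(T[i])) % q
    matches = []
    for i in range(n - m + 1):
        if h_p == h_t and T[i:i + m] == P:     # verification guards against collisions
            matches.append(i)
        if i < n - m:                          # roll the window one character to the right
            h_t = ((h_t - ord(T[i]) * high) * base + ord(T[i + m])) % q
    return matches

print(rabin_karp("abracadabra", "abra"))       # [0, 7]
```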
Multi-Pattern Algorithms
Aho-Corasick Algorithm
The Aho-Corasick algorithm is a dictionary-matching technique designed to find all occurrences of multiple patterns within a given text string efficiently. Invented by Alfred V. Aho and Margaret J. Corasick in 1975, it constructs a finite automaton from a set of patterns and processes the text in a single pass, making it particularly suitable for applications like bibliographic search, intrusion detection, plagiarism detection, and bioinformatics where multiple keywords must be located simultaneously.[12][41][42] The construction begins by building a trie (prefix tree) from the set of patterns. Each pattern is inserted into the trie, where nodes represent prefixes of the patterns, and edges are labeled with characters from a fixed alphabet. The root node corresponds to the empty prefix, and terminal nodes mark the ends of complete patterns, often storing the matched pattern's index in an output list. This trie structure allows for O(1) transitions per character during matching, assuming a constant-sized alphabet. The total time to build the trie is linear in the sum of the patterns' lengths, denoted as m.[43][44] Preprocessing extends the trie into a full automaton by adding failure links and output links using a breadth-first search (BFS) traversal starting from the root. For each node representing a string α, the failure link points to the node representing the longest proper suffix of α that is also a prefix of some pattern in the set; if no such suffix exists, it points to the root. These links are computed level by level: for a node at depth d, the failure link is found by following the parent's failure link and matching the corresponding character, ensuring the computation visits each node and edge a constant number of times for O(m) total time. Output links, or the output function, chain together matches: at a node, the outputs include not only any patterns ending there but also those from the node reached by following failure links until a pattern-ending node is found. This merging allows reporting all overlapping or embedded matches efficiently during search.[43][41][44] The searching phase processes the text of length n by starting at the root and advancing through the automaton for each character. On a successful goto transition, the current state updates accordingly. On a mismatch, the algorithm follows the failure link repeatedly until a valid transition is found or the root is reached, effectively skipping irrelevant prefixes. Whenever the current state (or its output chain) contains pattern endings, all associated patterns are reported as matches, along with their starting positions in the text, which can be tracked via depth or counters. This process handles overlaps naturally, as failure links propagate matches from suffixes. The total search time is O(n + z), where z is the number of matches (or output size), since each text character triggers at most a constant number of transitions amortized over the failure chains, and reporting takes time proportional to z. The space complexity is O(m), dominated by the trie and link structures.[43][41][44] For example, consider patterns {"he", "she", "his", "hers"} and text "ushers". The trie has branches for "sh" → "e", "h" → "e" and "h" → "i" → "s", and "he" → "r" → "s" for "hers". Failure links connect "sh" to "h" (since "h" is a prefix), "she" fails to "he", and so on. 
Processing "ushers" traverses: u (root, no transition), s (to "s" of "she"), h (to "sh"), e (to "she", output "she" and via output link "he"), r (from "she" follow failure to "he", then to "her"), s (to "hers", output "hers"). This demonstrates overlap detection for "she" and "he", and the full match of "hers" without backtracking.[43]Suffix Tree and Suffix Array Methods
Suffix Tree and Suffix Array Methods
Suffix trees and suffix arrays are powerful indexing structures that preprocess a static text string T of length n to enable efficient single-pattern or multi-pattern searches, as well as other string processing tasks. These methods focus on organizing all suffixes of T to allow rapid prefix matching, which directly supports pattern queries by identifying suffixes that begin with the search pattern P of length m. Unlike pattern-preprocessing approaches, suffix-based methods build an index on the text once, supporting arbitrary queries thereafter with minimal additional overhead. This makes them particularly suitable for large, fixed corpora such as genomic sequences or document collections, where preprocessing time is amortized over multiple searches.[15][45]

A suffix tree is a compressed trie representing all suffixes of T, where each leaf corresponds to the starting position of a unique suffix, and internal nodes represent the longest common prefixes shared among groups of suffixes. Edges are labeled with substrings of T, ensuring no node has more than one outgoing edge per character, and the tree is typically terminated with a unique end symbol to distinguish suffixes. This structure implicitly encodes the suffix relationships through shared paths from the root, allowing compact representation in space linear in n. The seminal linear-time construction algorithm, known as Ukkonen's algorithm, builds the suffix tree online by incrementally adding suffixes while maintaining active points and using suffix links to avoid redundant traversals, achieving O(n) time and space complexity over a constant-sized alphabet.[15][46]

To search for a pattern P in a suffix tree, traverse from the root following the edge labels that match P; if the full path is found, the subtree rooted at the locus node (the endpoint of the matching path) contains leaves corresponding to all occurrences of P in T, which can be enumerated in O(m + occ) time, where occ is the number of matches. Pattern counting is straightforward by computing the size of this subtree, often precomputed during construction for O(1) queries per node. For more advanced queries like longest common extensions between suffixes, the lowest common ancestor of two leaf nodes in the tree gives the length of their shared prefix, enabling O(1) queries after linear preprocessing. These operations highlight the suffix tree's versatility for exact matching in static texts.[15][47]

A suffix array is an alternative, space-efficient representation: a sorted array of integers indexing the starting positions of all suffixes of T, effectively permuting the suffixes in lexicographic order without the full tree structure. To enhance search efficiency, it is often paired with the longest common prefix (LCP) array, which stores the length of the longest common prefix between every pair of consecutive suffixes in the sorted order. The LCP array can be constructed in O(n) time from the suffix array using a backward scan that compares characters while leveraging the inverse suffix array to track positions. Suffix arrays require O(n) space, similar to suffix trees, but avoid the overhead of explicit tree nodes, making them preferable in memory-constrained environments. Linear-time construction algorithms, such as the Kärkkäinen-Sanders method, achieve O(n) preprocessing by recursively sorting suffixes based on sampled ranks and merging with doubling.[45][48][49]

Searching in a suffix array typically uses binary search to find the range of suffixes prefixed by P, comparing P against selected suffixes in O(m log n) time per query. The LCP array accelerates this by skipping unnecessary comparisons: during binary search, the LCP between the current midpoint suffix and the range boundaries provides a lower bound on mismatches, allowing the search to skip segments of the array and reducing query time to O(m + log n). For pattern counting, the size of the matching range in the suffix array directly gives the number of occurrences, retrievable in constant additional time once the range is located. This enhancement makes suffix arrays competitive with suffix trees for search while being simpler to implement.[45][48][50]

Both structures support key applications beyond basic searching, such as finding the longest repeated substring in T, which corresponds to the deepest internal node in the suffix tree with at least two descendant leaves (its string depth giving the length) or the maximum LCP value greater than zero in the suffix array. This can be computed in O(n) time post-construction, with applications in data compression and bioinformatics for identifying tandem repeats. Pattern counting extends to multiple patterns by querying each independently, enabling efficient indexing for static text corpora. Overall, suffix trees and arrays provide O(n) preprocessing, O(m + occ) or O(m + log n) query times, and O(n) space, balancing efficiency and practicality for large-scale string analysis.[15][45][48][51]
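The backward-scan LCP construction mentioned above is a linear-time pass in the style of Kasai's algorithm; a minimal sketch, assuming the suffix array is already available (here the "banana" example), is:

```python
def lcp_array(text, sa):
    """LCP of each suffix with its predecessor in suffix-array order, in O(n) total time."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):                 # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]        # suffix immediately before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                 # the next LCP can shrink by at most one
        else:
            h = 0
    return lcp

text = "banana"
sa = [5, 3, 1, 0, 4, 2]               # suffixes: a, ana, anana, banana, na, nana
print(lcp_array(text, sa))            # [0, 1, 3, 0, 0, 2]
```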
Algorithm Classifications
By Number of Patterns
String-searching algorithms are classified based on the number of patterns they handle, which influences their preprocessing, efficiency, and applicability. Single-pattern algorithms search for one fixed pattern of length m in a text of length n, with many achieving linear worst-case time complexity through preprocessing that avoids redundant comparisons, while others like Boyer-Moore offer sublinear average-case performance but O(nm) worst case.[4] Examples include the Knuth-Morris-Pratt (KMP) algorithm, which uses a prefix table to skip mismatches efficiently, and the Boyer-Moore algorithm, which leverages heuristics like bad character and good suffix rules for sublinear average-case performance.[13] These methods require minimal preprocessing proportional only to m, making them suitable for scenarios with a solitary query pattern.

For a finite set of patterns with total length m, multi-pattern algorithms preprocess the dictionary to enable efficient simultaneous searches, often in O(n + z) time, where z is the number of matches reported.[12] The Aho-Corasick algorithm constructs a trie with failure links to simulate multiple finite automata, allowing linear-time scanning of the text after building the structure.[43] Similarly, the Commentz-Walter algorithm extends Boyer-Moore heuristics to multiple patterns by combining a pattern trie with shift tables, balancing preprocessing cost with search speed for moderate-sized pattern sets.[52] Preprocessing scales with m, which can become prohibitive for very large dictionaries, but these algorithms excel in applications like spell-checkers or intrusion detection systems where a fixed vocabulary is repeatedly queried against varying texts.

Algorithms handling an infinite set of patterns, typically specified by regular expressions, convert the expression into an automaton for matching, such as an NFA via Thompson's construction, followed by simulation on the text. This approach supports expressive queries like wildcards or alternations, but efficiency varies: NFA simulation runs in O(nm) time, while backtracking implementations can degrade to exponential time in worst cases due to ambiguity in the pattern language.[53] Trade-offs highlight the progression: single-pattern methods offer simplicity and low overhead but lack flexibility; finite multi-pattern techniques scale well for bounded sets yet require dictionary construction; infinite-pattern methods provide maximum expressiveness for tools like grep but incur higher computational costs, especially for complex expressions.[54]
By Preprocessing Requirements
String-searching algorithms are classified according to their preprocessing requirements, which specify whether and how the pattern (of length m), the text (of length n), or both are prepared before the search phase begins. This categorization emphasizes trade-offs in initial computation time, auxiliary space, and search efficiency, influencing suitability for online scenarios—where the text arrives incrementally without full prior access—and offline scenarios, where the entire text is available upfront.[55]

Algorithms with no preprocessing, such as the naive string-matching method, incur no setup cost, shifting the full computational cost to the search phase, which can reach O(nm) in the worst case. The naive approach compares the pattern directly against every substring of the text starting at each position, without any auxiliary data structures. In contrast, pattern-preprocessing-only algorithms, exemplified by the Knuth-Morris-Pratt (KMP) and Boyer-Moore methods, require O(m) preprocessing time on the pattern to construct tables that enable O(n + m) overall time complexity. The KMP algorithm builds a failure function (or prefix table) that identifies the longest proper prefix matching a suffix of the pattern, allowing skips over mismatched portions during search.[4] The Boyer-Moore algorithm preprocesses shift tables based on bad-character and good-suffix heuristics, often skipping multiple characters per mismatch for sublinear average-case performance.[5] The Rabin-Karp algorithm also falls here, with O(m) preprocessing to compute a hash value for the pattern, enabling efficient rolling hash comparisons during the search phase.[56]

Text-preprocessing-only approaches, such as those using suffix trees or suffix arrays, demand O(n) build time on the text to support rapid pattern queries, typically in O(m) time, but assume a static text since modifications require rebuilding the structure. Suffix trees compress all suffixes into a trie-like form, facilitating longest common prefix queries essential for matching; the linear-time construction was achieved by Ukkonen's online algorithm, which incrementally adds suffixes while maintaining the tree. Suffix arrays, a space-efficient alternative, sort the suffixes implicitly and pair with LCP arrays for similar query speeds.[46]

Algorithms preprocessing both the pattern and text, like the Aho-Corasick and Wu-Manber methods, are designed for multi-pattern or hybrid dynamic environments, combining pattern preparation with text partitioning or indexing for overall linear-time performance. The Aho-Corasick algorithm constructs a trie of all patterns augmented with failure links akin to KMP, enabling simultaneous searches across multiple patterns in O(n + m + z) time, where z is the number of matches. Wu-Manber extends Boyer-Moore ideas by preprocessing hash tables for pattern blocks and text shifts, partitioning the text into fixed-size windows to handle multiple patterns efficiently in large corpora.[43][57] Key considerations in this classification involve balancing time and space: preprocessing trades initial overhead for faster searches, often requiring O(n) or more space, which suits offline applications but challenges online ones where text preprocessing must be minimal or incremental to accommodate streaming data.
By Matching Strategies
String-searching algorithms can be classified by their core matching strategies, which determine how comparisons between the pattern and text are performed during the search phase. These strategies include forward matching, which processes the text sequentially from left to right; backward matching, which starts from the right and leverages suffix information for skips; hybrid approaches that combine techniques like hashing with verification; and automaton-based methods that model the search as state transitions in a finite automaton. This classification highlights differences in efficiency, particularly in how each strategy handles mismatches and exploits textual properties.

Forward matching algorithms scan the text from left to right, comparing the pattern starting at each position sequentially or with optimizations to avoid redundant work. The naive string-matching algorithm exemplifies this by directly comparing characters until a mismatch or full match occurs, resulting in O((n − m + 1)m) worst-case time complexity, where n is the text length and m is the pattern length. The Knuth-Morris-Pratt (KMP) algorithm improves upon this by preprocessing the pattern to build a prefix table (or failure function) that indicates the longest proper prefix which is also a suffix, allowing the search to shift the pattern by more than one position on mismatches without re-examining previously matched text characters. This achieves O(n + m) time in both average and worst cases, making it suitable for scenarios requiring linear-time guarantees.[4] Forward strategies are conceptually simple and well-suited for streaming applications where text arrives incrementally, as they process data in a single pass without needing to revisit earlier portions.

Backward matching algorithms reverse the comparison direction, aligning the pattern's end with potential text positions and scanning right-to-left to identify mismatches quickly, often enabling larger skips. The Boyer-Moore algorithm is a seminal example, using two heuristics: the bad-character rule, which skips based on the rightmost occurrence of a mismatching text character in the pattern, and the good-suffix rule, which shifts using information about matched suffixes. This approach yields average-case O(n/m) time for large patterns and alphabets, though the worst case remains O(nm).[5] Backward strategies excel in practice on natural language texts like English, where large alphabets and low character frequencies allow frequent long skips, outperforming forward methods by factors of 2 to 10 in benchmarks on such data.[58]

Hybrid strategies integrate multiple techniques to balance preprocessing, comparison speed, and verification. The Rabin-Karp algorithm hashes the pattern and rolling hashes of text windows forward, comparing hashes first to filter candidates before exact character verification, achieving average O(n + m) time but O(nm) worst-case due to hash collisions. Bit-parallel algorithms like Shift-Or enhance forward matching by simulating multiple comparisons using bitwise operations on word-sized registers, effectively processing up to 64 (or more) characters per step on modern hardware for small patterns, in O(n⌈m/w⌉) time, where w is the word size. These hybrids adapt to hardware and data characteristics, such as SIMD instructions for acceleration.

Automaton-based strategies model the search as transitions in a finite-state machine, often deterministic for efficiency. The Aho-Corasick algorithm constructs a trie of patterns with failure links resembling a deterministic finite automaton (DFA), allowing simultaneous matching of multiple patterns by traversing states on each text character and reporting outputs at accepting states, in O(n + z) time, where z is the number of matches. While the core automaton is deterministic, implementations may simulate non-deterministic finite automata (NFAs) for space savings in multi-pattern scenarios. This approach is powerful for dictionary-based searches but requires more preprocessing than direct comparison methods.[12] The choice of matching strategy impacts practical performance: backward methods like Boyer-Moore often dominate on random or natural-language texts due to skip heuristics, while forward and automaton-based ones provide robustness for worst-case or multi-pattern needs, and hybrids offer versatility across applications.[58]
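A minimal sketch of the Shift-Or bit-parallel technique mentioned above, for patterns no longer than the machine word (Python integers are unbounded, so the word-size limit is not enforced here); names and example strings are illustrative.

```python
def shift_or(text, pattern):
    """Bit-parallel Shift-Or: one word-sized update per text character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # mask[c] has a 0 bit at position j exactly when pattern[j] == c
    mask = {c: all_ones for c in set(pattern)}
    for j, c in enumerate(pattern):
        mask[c] &= ~(1 << j)
    state, matches = all_ones, []
    for i, c in enumerate(text):
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:        # bit m-1 clear: a full match ends at position i
            matches.append(i - m + 1)
    return matches

print(shift_or("abracadabra", "abra"))         # [0, 7]
```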
Advanced and Specialized Variants
Real-Time String Matching
Real-time string matching, also known as online string matching, operates under a model in which the text arrives incrementally from left to right and the algorithm must process each character as it becomes available, reporting any matches immediately and without lookahead into future characters. This contrasts with offline algorithms that require the entire text for preprocessing, whereas online variants maintain only a finite-state machine or failure function, independent of the text length, to track partial matches. The primary goal is to achieve amortized O(1) time per input character while using O(m) space for a pattern of length m, making these methods suitable for streaming data where the text length n can be arbitrarily large and unbounded.[59]

For single-pattern matching, the Knuth-Morris-Pratt (KMP) algorithm serves as a foundational adaptation for real-time scenarios: the stream is processed through its prefix table (failure function), which skips redundant comparisons and maintains the current matching state after each character. The Morris-Pratt variant similarly precomputes shift values, enabling efficient online advancement through the text without backtracking beyond the current position. These adaptations ensure linear total time O(n + m) for the entire stream, with no global data structures beyond the pattern itself.

In multi-pattern settings, the Aho-Corasick automaton extends to streaming by constructing a trie of all patterns up front and traversing it character by character on the input stream, using output links to report overlapping matches as they complete. For bounded-buffer environments such as network processing, the Wu-Manber algorithm adapts by grouping patterns into blocks and using hash tables to compute quick shifts, allowing real-time inspection without buffering the full text. These methods maintain O(n) total time across patterns of total length z, with preprocessing of O(z |Σ|) for Aho-Corasick and O(z) for Wu-Manber, which remains fixed during streaming.[12]

Key challenges in real-time string matching include enforcing constant space to handle unbounded streams and achieving O(1) amortized processing per character to meet throughput demands, particularly in resource-constrained systems. Unlike offline approaches with global preprocessing, online variants forgo extensive indexing to avoid delays, prioritizing immediate responsiveness. A prominent application is network packet inspection, where algorithms such as Aho-Corasick detect intrusion signatures in arriving payloads for real-time threat mitigation. Similarly, in real-time log analysis, these techniques scan streaming logs for keywords such as error indicators, enabling instant alerting without batching.[59]
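The streaming adaptation of KMP described above can be sketched as a small matcher object that keeps only the failure function and the current state, consuming one character at a time in amortized O(1) with O(m) space; the class and method names here are illustrative only.

```python
class StreamMatcher:
    def __init__(self, pattern: str):
        self.pattern = pattern
        self.state = 0                        # length of the current partial match
        self.fail = [0] * len(pattern)        # KMP failure function
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = self.fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            self.fail[i] = k

    def feed(self, ch: str) -> bool:
        """Consume one stream character; return True when a full match completes."""
        while self.state > 0 and ch != self.pattern[self.state]:
            self.state = self.fail[self.state - 1]
        if ch == self.pattern[self.state]:
            self.state += 1
        if self.state == len(self.pattern):
            self.state = self.fail[self.state - 1]   # stay ready for overlapping matches
            return True
        return False

# matcher = StreamMatcher("error")
# any(matcher.feed(c) for c in "fatal error: disk full")  # -> True
```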
Approximate and Fuzzy Matching
Approximate string matching, also known as fuzzy string matching, addresses the problem of finding occurrences of a pattern string of length m in a text string of length n where the match may contain up to k errors, such as substitutions, insertions, or deletions, typically measured by the Levenshtein edit distance for some small integer k.[60] This problem extends exact string matching by tolerating imperfections, which is essential when dealing with noisy data, typographical errors, or biological variation.[61]

The foundational approach to computing the Levenshtein distance between two strings relies on dynamic programming, as formalized in the Wagner-Fischer algorithm, which fills a matrix giving the minimum number of edit operations required to transform one string into the other. For approximate pattern matching, the same recurrence is evaluated over every alignment of the pattern against the text, resulting in O(mn) time to find all occurrences with distance at most k, though early termination can prune computations once distances exceed k.[60] Space usage can be reduced to O(m) by retaining only two rows of the matrix at a time.

To improve efficiency when k is small compared to m, banded dynamic programming restricts computation to a diagonal band of width proportional to k around the main diagonal of the DP matrix, reducing the time complexity to approximately O(kn), which behaves as O(n) in practice for fixed, small k.[60] This technique is particularly effective for long texts, as it avoids calculations far from any viable alignment.

Heuristic methods further accelerate approximate matching for specific error models. Myers' bit-parallel algorithm simulates the DP using bitwise operations on word-sized integers, computing Levenshtein distances in O(⌈m/w⌉n) worst-case time, where w is the machine word size (typically 64 bits); it is especially fast when the pattern fits in a single word, since bit vectors track many DP cells simultaneously.[62] In bioinformatics, seed-and-extend heuristics such as those in the BLAST algorithm first identify short exact matches (seeds) between the pattern and text and then extend them by local alignment with scoring penalties for gaps and mismatches, approximating edit distances while achieving sublinear time in practice on large texts.[63]

These methods find wide application in spell checking, where dictionaries are searched for words within edit distance 1 or 2 of a query to suggest corrections, and in DNA sequence alignment, where evolutionary mutations introduce errors that must be tolerated to identify homologous regions.[60] For example, the Levenshtein distance between "kitten" and "sitting" is 3, achieved by substituting 'k' with 's', 'e' with 'i', and inserting 'g' at the end:
kitten → sitten (substitute k→s)
sitten → sittin (substitute e→i)
sittin → sitting (insert g)
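The distance itself can be computed with a compact two-row form of the Wagner-Fischer recurrence described above; the following sketch (function name illustrative) reproduces the distance of 3 for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the two-row Wagner-Fischer dynamic program."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

# levenshtein("kitten", "sitting") -> 3
```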
Searching with Wildcards and Don't Cares
In string searching with wildcards and don't cares, the pattern contains special symbols that provide flexibility in matching: a wildcard (often denoted '?') matches any single character from the alphabet Σ, while a don't care symbol marks a position in the pattern that is ignored during comparison, effectively allowing skips in the alignment without penalizing the match. This variant of exact string matching is useful where partial or flexible pattern recognition is required, such as querying databases with variable fields or scanning files with glob patterns, but it differs from approximate matching in that it enforces exact correspondence everywhere except at wildcard or don't care positions. The problem is to find all occurrences of a pattern P of length m in a text T of length n, where positions in P marked with wildcards or don't cares impose no equality constraint.[64]

Automaton-based algorithms extend classical methods such as the Knuth-Morris-Pratt (KMP) or Aho-Corasick automata to handle these symbols. In the KMP extension for wildcards, the failure function is modified so that a '?' acts as a transition accepting any input character, effectively creating a universal transition for that position in the automaton. For multi-pattern wildcard matching, the Aho-Corasick trie incorporates wildcard nodes with transitions on all alphabet symbols or to a universal state, while don't cares are handled by skipping the comparison at those positions during state advancement. Preprocessing builds this extended automaton in O(m |Σ|) time for a single pattern, or O(total pattern length · |Σ|) for multiple patterns, enabling linear-time scanning of the text thereafter.

Bit-parallel techniques, such as variants of the Shift-And algorithm, optimize this for short patterns by using bit vectors to encode wildcard positions: character masks are precomputed in which the bits corresponding to '?' positions are always set, so any text character keeps those states alive during the bitwise update.[65] The time complexity of naive implementations with k wildcards or don't cares is O(n |Σ|^k) because of branching at the flexible positions, but the bit-parallel and automaton optimizations reduce it to O(n + occ) in the worst case, where occ is the number of matches, provided the pattern fits within a machine word.

For example, the pattern P = "a?c?" matches the substring "abcx" in a text T because the first '?' aligns with 'b' (any character) and the second with 'x', while 'a' and 'c' match exactly; don't cares would skip the comparison at those indices entirely. These methods find applications in file-system globbing (e.g., matching "*.txt", where the wildcard covers any prefix) and in lightweight regular-expression engines where full regex machinery is unnecessary. Recent work on compressed data structures, such as wavelet trees adapted for wildcard queries, achieves space usage of O(n log |Σ| / log n) bits while supporting O(|P| log n + occ) query time, addressing storage challenges in large repetitive texts such as genomic data.[66][67]
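A minimal bit-parallel sketch of the Shift-And idea with '?' wildcards is shown below; it assumes the pattern fits in a machine word, and the function name is illustrative only.

```python
from collections import defaultdict

def shift_and_wildcard(text: str, pattern: str) -> list[int]:
    """Find all occurrences of pattern in text, where '?' matches any character."""
    m = len(pattern)
    wild = 0                               # bits set at '?' positions
    masks = defaultdict(int)               # character -> bitmask of matching positions
    for i, pc in enumerate(pattern):
        if pc == '?':
            wild |= 1 << i
        else:
            masks[pc] |= 1 << i
    matches, state = [], 0
    for j, c in enumerate(text):
        # advance every partial match by one position; '?' bits accept any character
        state = ((state << 1) | 1) & (masks[c] | wild)
        if state & (1 << (m - 1)):         # highest pattern bit set: full match
            matches.append(j - m + 1)      # starting index of the occurrence
    return matches

# shift_and_wildcard("zabcxz", "a?c?") -> [1]   ('b' and 'x' absorbed by the '?'s)
```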
