Two-way string-matching algorithm

Class	String-searching algorithm
Data structure	Any string with an ordered alphabet
Worst-case performance	O(n)
Best-case performance	O(n)
Worst-case space complexity	⌈log₂ m⌉

In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991.^[1] It takes a pattern of size m, called a “needle”, preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(n) with n being the haystack's length.

The two-way algorithm can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm (KMP) and the backward-running Boyer–Moore string-search algorithm (BM). Like those two, the two-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered.

Unlike BM and KMP, it uses only O(log m) additional space to store information about those partial repeats: the search pattern is split into two parts (its critical factorization), represented only by the position of that split. Being a number less than m, it can be represented in ⌈log₂ m⌉ bits. This is sometimes treated as "close enough to O(1) in practice", as the needle's size is limited by the size of addressable memory; the overhead is a number that can be stored in a single register, and treating it as O(1) is like treating the size of a loop counter as O(1) rather than log of the number of iterations. The actual matching operation performs at most 2n − m comparisons.^[2]

Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle:^[3]

The first one performs at most n + ⌊(n − m)/2⌋ comparisons, ⌈(n − m)/2⌉ fewer than the original. It must however store $\lceil \log_{\varphi} m \rceil$ additional offsets in the needle, using O(log² m) space.
The second adapts it to only store a constant number of such offsets, denoted c, but must perform n + ⌊(1⁄2 + ε)(n − m)⌋ comparisons, with $\varepsilon = \tfrac{1}{2}(F_{c+2} - 1)^{-1} = O(\varphi^{-c})$ going to zero exponentially quickly as c increases.
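
To make the three comparison bounds concrete, the following Python sketch evaluates them for given n, m, and c. It assumes n ≥ m, c ≥ 1, and the Fibonacci convention F_1 = F_2 = 1, which the text does not state; the function names are illustrative only.

```python
from fractions import Fraction
from math import floor


def fib(k):
    """Return F_k, assuming the convention F_1 = F_2 = 1 (an assumption)."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def comparison_bounds(n, m, c):
    """Worst-case comparison bounds for haystack length n and needle length m.

    Returns (original, first_variant, second_variant), where c is the
    constant number of stored offsets in the second variant (c >= 1).
    """
    original = 2 * n - m
    first_variant = n + (n - m) // 2
    # Exact rational arithmetic avoids floating-point error inside the floor.
    eps = Fraction(1, 2 * (fib(c + 2) - 1))
    second_variant = n + floor((Fraction(1, 2) + eps) * (n - m))
    return original, first_variant, second_variant


print(comparison_bounds(1000, 10, 3))  # (1990, 1495, 1618)
```

Under this convention, c = 1 gives ε = 1⁄2, so the second bound coincides with the original 2n − m; larger c moves it toward the first variant's bound.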

The algorithm is considered fairly efficient in practice, being cache-friendly and using several operations that can be implemented in well-optimized subroutines. It is used by the C standard libraries glibc, newlib, and musl, to implement the memmem and strstr family of substring functions.^[4]^[5]^[6] As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances;^[7] this is especially so if the needle isn't searched in multiple haystacks, which would amortize the preprocessing cost.

Critical factorization

Before we define critical factorization, we should define:^[1]

A factorization is a partition $(u,v)$ of a string $x$ into two consecutive parts, so that $x = uv$. For example, ("Wiki","pedia") is a factorization of "Wikipedia".
A period of a string $x$ is an integer $p$ such that all characters $p$ positions apart are equal. More precisely, $x[i] = x[i + p]$ holds for every integer $i$ with $0 < i \leq \mathrm{len}(x) - p$. This definition is allowed to be vacuously true, so that any word of length $n$ has a period of $n$. To illustrate, the 8-letter word "educated" has period 6 in addition to the trivial periods of 8 and above. The minimum period of $x$ is denoted as $p(x)$.
A repetition $w$ in $(u,v)$ is a non-empty string such that:
- $w$ is a suffix of $u$ or $u$ is a suffix of $w$;
- $w$ is a prefix of $v$ or $v$ is a prefix of $w$.
In other words, $w$ occurs on both sides of the cut with a possible overflow on either side. Examples include "an" for ("ban","ana") and "voca" for ("a","vocado"). Each factorization trivially has at least one repetition: the string $vu$.^[2]
A local period is the length of a repetition in $(u,v)$. The smallest local period in $(u,v)$ is denoted as $r(u,v)$. Because the trivial repetition $vu$ is guaranteed to exist and has the same length as $x$, we see that $1 \leq r(u,v) \leq \mathrm{len}(x)$.

Finally, a critical factorization is a factorization $(u,v)$ of $x$ such that $r(u,v)=p(x)$. The existence of a critical factorization is provably guaranteed.^[1] For a needle of length $m$ in an ordered alphabet, it can be computed in $2m$ comparisons by computing two maximal suffixes of the needle, one under the alphabet order ≤ and one under the reversed order ≥, and cutting at the start of the shorter one.^[6]
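
The definitions above can be checked directly with a brute-force Python sketch. It is written for clarity rather than speed and is not how the algorithm computes its factorization. Indices are 0-based here (the text uses 1-based indices), a cut is identified by the length of $u$, and cuts with an empty $u$ are allowed; the function names are illustrative.

```python
def smallest_period(x):
    """Smallest p >= 1 with x[i] == x[i + p] wherever both indices exist."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string


def local_period(x, cut):
    """Smallest local period r(u, v) for u = x[:cut], v = x[cut:].

    A length r is a local period when the r characters before the cut
    agree with the r characters after it, ignoring any overflow past
    either end of the string.
    """
    n = len(x)
    for r in range(1, n + 1):
        lo, hi = max(0, cut - r), min(cut, n - r)
        if all(x[i] == x[i + r] for i in range(lo, hi)):
            return r
    return 0  # only reached for the empty string


def critical_positions(x):
    """All cuts whose smallest local period equals the smallest period."""
    p = smallest_period(x)
    return [cut for cut in range(len(x)) if local_period(x, cut) == p]


print(smallest_period("educated"))     # 6
print(local_period("banana", 3))       # 2, from the repetition "an"
print(critical_positions("banana"))    # [1, 2]
```

For "banana" the smallest period is 6, so ("b","anana") and ("ba","nana") are critical factorizations while ("ban","ana"), with local period 2, is not.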

The algorithm

The algorithm starts by computing a critical factorization of the needle as the preprocessing step. (In the pseudocode below, the parameter n is the needle string itself, not the haystack length used earlier.) This step produces the position of the split, which marks where the periodic right half begins, and the period of that half. The suffix computation here follows the authors' formulation. It can alternatively be computed using Duval's algorithm, which is simpler and still linear time but slower in practice.^[8]

// Three-way comparison; negating its result inverts the order.
function cmp(a, b)
    if a > b return 1
    if a = b return 0
    if a < b return -1

// Maximal suffix of n under the alphabet order (rev = false) or its inverse (rev = true).
function maxsuf(n, rev)
    length ← len(n)
    cur_period ← 1       // currently known period.
    period_test_idx ← 1  // index for period testing, 0 < period_test_idx <= cur_period.
    maxsuf_test_idx ← 0  // index for maxsuf testing, greater than maxsuf_idx.
    maxsuf_idx ← -1      // one less than the proposed starting index of maxsuf.

    while maxsuf_test_idx + period_test_idx < length
        cmp_val ← cmp(
              n[maxsuf_test_idx + period_test_idx],
              n[maxsuf_idx      + period_test_idx]
        )
        if rev
            cmp_val *= -1
        if cmp_val < 0
            // Suffix (maxsuf_test_idx + period_test_idx) is smaller. Period is the entire prefix so far.
            maxsuf_test_idx += period_test_idx
            period_test_idx ← 1
            cur_period ← maxsuf_test_idx - maxsuf_idx
        else if cmp_val == 0
            // They are the same - we should go on.
            if period_test_idx == cur_period
                // We are done checking this stretch of cur_period. Reset period_test_idx.
                maxsuf_test_idx += cur_period
                period_test_idx ← 1
            else
                period_test_idx += 1
        else
            // Suffix is larger. Start over from here.
            maxsuf_idx ← maxsuf_test_idx
            maxsuf_test_idx += 1
            cur_period ← 1
            period_test_idx ← 1
    return [maxsuf_idx, cur_period]

function crit_fact(n)
    [idx1, per1] ← maxsuf(n, false)
    [idx2, per2] ← maxsuf(n, true)
    if idx1 > idx2
        return [idx1, per1]
    else
        return [idx2, per2]

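The search first compares the right-hand part of the needle against the haystack, scanning left to right, and compares the left-hand part, scanning right to left, only if the right-hand part matches. After a mismatch or a match, the period determines how far the needle can be shifted, which keeps the total running time linear.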

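function match(needle, haystack)
    needle_len   ← len(needle)
    haystack_len ← len(haystack)
    [idx, period] ← crit_fact(needle)        // idx is the index just before the right part.
    split ← idx + 1                          // length of the left part.
    matches ← {}                             // set of matches.

    // Test whether the period of the right part is a period of the whole needle.
    // Use a library function like memcmp, or write your own loop.
    periodic ← needle[0] ... needle[split - 1] == needle[period] ... needle[period + split - 1]
    if not periodic
        period ← max(split, needle_len - split) + 1   // a lower bound on the period, safe as a shift.
    pos ← 0                                  // current alignment in the haystack.
    mem ← 0                                  // length of the needle prefix known to match.
    while pos + needle_len <= haystack_len
        i ← max(split, mem)                  // match the right part, left to right.
        while i < needle_len and needle[i] == haystack[pos + i]
            i += 1
        if i < needle_len
            pos += i - split + 1             // mismatch in the right part: skip past it.
            mem ← 0
        else
            j ← split - 1                    // match the left part, right to left.
            while j >= mem and needle[j] == haystack[pos + j]
                j -= 1
            if j < mem
                matches ← matches ∪ {pos}
            pos += period                    // skip by the period.
            mem ← needle_len - period if periodic else 0
    return matches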
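
The preprocessing and search steps of this section can be sketched in runnable Python. This is a sketch modelled on the pseudocode in this section and on the usual formulation of the algorithm, not a reference implementation: the names `maximal_suffix`, `critical_factorization`, `two_way_search` and the variable `mem` are illustrative, the split is reported as the length of the left part, all match positions are returned, and an empty needle is treated as matching at every position (an assumption).

```python
def maximal_suffix(x, reverse=False):
    """Return (i, p): the maximal suffix of x starts at i + 1 and has period p."""
    i, j, k, p = -1, 0, 1, 1
    while j + k < len(x):
        a, b = x[j + k], x[i + k]
        if reverse:              # compare under the reversed alphabet order
            a, b = b, a
        if a < b:                # candidate suffix is smaller: extend the period
            j += k
            k = 1
            p = j - i
        elif a == b:             # still equal: advance within the current period
            if k == p:
                j += p
                k = 1
            else:
                k += 1
        else:                    # candidate suffix is larger: restart from it
            i = j
            j = i + 1
            k = p = 1
    return i, p


def critical_factorization(x):
    """Return (split, period), where split is the length of the left part."""
    i1, p1 = maximal_suffix(x)
    i2, p2 = maximal_suffix(x, reverse=True)
    return (i1 + 1, p1) if i1 > i2 else (i2 + 1, p2)


def two_way_search(needle, haystack):
    """Return the start positions of all occurrences of needle in haystack."""
    m, n = len(needle), len(haystack)
    if m == 0:                   # assumption: empty needle matches everywhere
        return list(range(n + 1))
    split, period = critical_factorization(needle)
    # Is the period of the right part also a period of the whole needle?
    periodic = needle[:split] == needle[period:period + split]
    if not periodic:
        period = max(split, m - split) + 1   # lower bound on the true period
    matches = []
    pos = mem = 0                # mem: length of needle prefix known to match
    while pos + m <= n:
        i = max(split, mem)      # right part, scanned left to right
        while i < m and needle[i] == haystack[pos + i]:
            i += 1
        if i < m:                # mismatch in the right part
            pos += i - split + 1
            mem = 0
        else:
            j = split - 1        # left part, scanned right to left
            while j >= mem and needle[j] == haystack[pos + j]:
                j -= 1
            if j < mem:
                matches.append(pos)
            pos += period
            mem = m - period if periodic else 0
    return matches


print(two_way_search("aba", "ababa"))  # [0, 2]
```

Only the split position, the period, and the `mem` counter are kept between iterations, which reflects the small extra space the algorithm needs beyond the needle and haystack themselves.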

References

[1] Crochemore, Maxime; Perrin, Dominique (1 July 1991). "Two-way string-matching" (PDF). Journal of the ACM. 38 (3): 650–674. doi:10.1145/116825.116845. S2CID 15055316.
[2] "Two Way algorithm".
[3] Breslauer, Dany (May 1996). "Saving comparisons in the Crochemore-Perrin string-matching algorithm". Theoretical Computer Science. 158 (1–2): 177–192. doi:10.1016/0304-3975(95)00068-2.
[4] "musl/src/string/memmem.c". Retrieved 23 November 2019.
[5] "newlib/libc/string/memmem.c". Retrieved 23 November 2019.
[6] "glibc/string/str-two-way.h".
[7] "Eric Blake - Re: PATCH] Improve performance of memmem". Newlib mailing list.
[8] Adamczyk, Zbigniew; Rytter, Wojciech (May 2013). "A note on a simple computation of the maximal suffix of a string". Journal of Discrete Algorithms. 20: 61–64. doi:10.1016/j.jda.2013.03.002.
