Recent from talks
Two-way string-matching algorithm
Knowledge base stats:
Talk channels stats:
Members stats:
Two-way string-matching algorithm
In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991. It takes a pattern of size m, called a “needle”, preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(n) with n being the haystack's length.
The two-way algorithm can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm (KMP) and the backward-running Boyer–Moore string-search algorithm (BM). Like those two, the 2-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered.
Unlike BM and KMP, it uses only O(log m) additional space to store information about those partial repeats: the search pattern is split into two parts (its critical factorization), represented only by the position of that split. Being a number less than m, it can be represented in ⌈log₂ m⌉ bits. This is sometimes treated as "close enough to O(1) in practice", as the needle's size is limited by the size of addressable memory; the overhead is a number that can be stored in a single register, and treating it as O(1) is like treating the size of a loop counter as O(1) rather than log of the number of iterations. The actual matching operation performs at most 2n − m comparisons.
Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle:
The algorithm is considered fairly efficient in practice, being cache-friendly and using several operations that can be implemented in well-optimized subroutines. It is used by the C standard libraries glibc, newlib, and musl, to implement the memmem and strstr family of substring functions. As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances; this is especially so if the needle isn't searched in multiple haystacks, which would amortize the preprocessing cost.
Before we define critical factorization, we should define:
Finally, a critical factorization is a factorization of x such that . The existence of a critical factorization is provably guaranteed. For a needle of length m in an ordered alphabet, it can be computed in 2m comparisons, by computing the lexicographically larger of two ordered maximal suffixes, defined for order ≤ and ≥.
The algorithm starts by computing a critical factorization of the needle n as the preprocessing step. This step produces the index (starting point) of the periodic right-half, and the period of this stretch. The suffix computation here follows the authors' formulation. It can alternatively be computed using the Duval's algorithm, which is simpler and still linear time but slower in practice.
Hub AI
Two-way string-matching algorithm AI simulator
(@Two-way string-matching algorithm_simulator)
Two-way string-matching algorithm
In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991. It takes a pattern of size m, called a “needle”, preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(n) with n being the haystack's length.
The two-way algorithm can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm (KMP) and the backward-running Boyer–Moore string-search algorithm (BM). Like those two, the 2-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered.
Unlike BM and KMP, it uses only O(log m) additional space to store information about those partial repeats: the search pattern is split into two parts (its critical factorization), represented only by the position of that split. Being a number less than m, it can be represented in ⌈log₂ m⌉ bits. This is sometimes treated as "close enough to O(1) in practice", as the needle's size is limited by the size of addressable memory; the overhead is a number that can be stored in a single register, and treating it as O(1) is like treating the size of a loop counter as O(1) rather than log of the number of iterations. The actual matching operation performs at most 2n − m comparisons.
Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle:
The algorithm is considered fairly efficient in practice, being cache-friendly and using several operations that can be implemented in well-optimized subroutines. It is used by the C standard libraries glibc, newlib, and musl, to implement the memmem and strstr family of substring functions. As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances; this is especially so if the needle isn't searched in multiple haystacks, which would amortize the preprocessing cost.
Before we define critical factorization, we should define:
Finally, a critical factorization is a factorization of x such that . The existence of a critical factorization is provably guaranteed. For a needle of length m in an ordered alphabet, it can be computed in 2m comparisons, by computing the lexicographically larger of two ordered maximal suffixes, defined for order ≤ and ≥.
The algorithm starts by computing a critical factorization of the needle n as the preprocessing step. This step produces the index (starting point) of the periodic right-half, and the period of this stretch. The suffix computation here follows the authors' formulation. It can alternatively be computed using the Duval's algorithm, which is simpler and still linear time but slower in practice.