CYK algorithm
| Class | Parsing with context-free grammars |
|---|---|
| Data structure | String |
| Worst-case performance | O(n³ · \|G\|), where n is the length of the string and \|G\| is the size of the CNF grammar |
In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961.[1][2] The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.
The standard version of CYK operates only on context-free grammars given in Chomsky normal form (CNF). However, any context-free grammar may be algorithmically transformed into a CNF grammar expressing the same language (Sipser 1997).
The importance of the CYK algorithm stems from its high efficiency in certain situations. Using big O notation, the worst case running time of CYK is O(n³ · |G|), where n is the length of the parsed string and |G| is the size of the CNF grammar (Hopcroft & Ullman 1979, p. 140). This makes it one of the most efficient [citation needed] parsing algorithms in terms of worst-case asymptotic complexity, although other algorithms exist with better average running time in many practical scenarios.
Standard form
The dynamic programming algorithm requires the context-free grammar to be rendered into Chomsky normal form (CNF), because it tests for possibilities to split the current sequence into two smaller sequences. Any context-free grammar that does not generate the empty string can be represented in CNF using only production rules of the forms A → BC and A → a; to allow for the empty string, one can explicitly allow S → ε, where S is the start symbol.[3]
Algorithm
As pseudocode
The algorithm in pseudocode is as follows:
let the input be a string I consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr, with start symbol R1.
let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
let back[n,n,r] be an array of lists of backpointing triples. Initialize all elements of back to the empty list.

for each s = 1 to n
    for each unit production Rv → as
        set P[1,s,v] = true

for each l = 2 to n -- Length of span
    for each s = 1 to n-l+1 -- Start of span
        for each p = 1 to l-1 -- Partition of span
            for each production Ra → Rb Rc
                if P[p,s,b] and P[l-p,s+p,c] then
                    set P[l,s,a] = true,
                    append <p,b,c> to back[l,s,a]

if P[n,1,1] is true then
    I is member of language
    return back -- by retracing the steps through back, one can easily construct all possible parse trees of the string.
else
    return "not a member of language"
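A direct translation of this pseudocode into Python might look as follows (a sketch, not part of the original article; the grammar encoding via `terminal_rules` and `binary_rules` is an assumed representation):

```python
from collections import defaultdict

def cyk(tokens, terminal_rules, binary_rules, start):
    """CYK recognizer for a grammar in Chomsky normal form.

    terminal_rules: dict mapping a terminal to the set of nonterminals A with A -> terminal
    binary_rules:   iterable of (A, B, C) triples for productions A -> B C
    start:          the start nonterminal
    """
    n = len(tokens)
    # P[l][s]: set of nonterminals deriving the span of length l starting at s (0-based)
    P = [[set() for _ in range(n)] for _ in range(n + 1)]
    back = defaultdict(list)          # (l, s, A) -> list of (p, B, C) backpointers

    for s, tok in enumerate(tokens):  # spans of length 1
        P[1][s] |= terminal_rules.get(tok, set())

    for l in range(2, n + 1):         # length of span
        for s in range(n - l + 1):    # start of span
            for p in range(1, l):     # partition of span
                for A, B, C in binary_rules:
                    if B in P[p][s] and C in P[l - p][s + p]:
                        P[l][s].add(A)
                        back[(l, s, A)].append((p, B, C))

    return start in P[n][0], back
```

As in the pseudocode, retracing the backpointers in `back` recovers all parse trees of an accepted string.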
Probabilistic CYK (for finding the most probable parse)
This variant recovers the most probable parse given the probabilities of all productions.
let the input be a string I consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr, with start symbol R1.
let P[n,n,r] be an array of real numbers. Initialize all elements of P to zero.
let back[n,n,r] be an array of backpointing triples.

for each s = 1 to n
    for each unit production Rv → as
        set P[1,s,v] = Pr(Rv → as)

for each l = 2 to n -- Length of span
    for each s = 1 to n-l+1 -- Start of span
        for each p = 1 to l-1 -- Partition of span
            for each production Ra → Rb Rc
                prob_splitting = Pr(Ra → Rb Rc) * P[p,s,b] * P[l-p,s+p,c]
                if prob_splitting > P[l,s,a] then
                    set P[l,s,a] = prob_splitting
                    set back[l,s,a] = <p,b,c>

if P[n,1,1] > 0 then
    find the parse tree by retracing through back
    return the parse tree
else
    return "not a member of language"
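The probabilistic variant can be sketched the same way (an illustrative Python rendering; the dictionary-based grammar encoding is an assumed representation, not part of the original article):

```python
def pcyk(tokens, terminal_rules, binary_rules, start):
    """Probabilistic CYK: probability of the best parse, plus backpointers.

    terminal_rules: dict mapping (A, terminal) -> Pr(A -> terminal)
    binary_rules:   dict mapping (A, B, C) -> Pr(A -> B C)
    start:          the start nonterminal
    """
    n = len(tokens)
    P = [[dict() for _ in range(n)] for _ in range(n + 1)]   # P[l][s][A] = best probability
    back = {}                                                 # (l, s, A) -> (p, B, C)

    for s, tok in enumerate(tokens):                          # spans of length 1
        for (A, a), pr in terminal_rules.items():
            if a == tok:
                P[1][s][A] = pr

    for l in range(2, n + 1):                                 # length of span
        for s in range(n - l + 1):                            # start of span
            for p in range(1, l):                             # partition of span
                for (A, B, C), pr in binary_rules.items():
                    if B in P[p][s] and C in P[l - p][s + p]:
                        cand = pr * P[p][s][B] * P[l - p][s + p][C]
                        if cand > P[l][s].get(A, 0.0):        # keep only the best split
                            P[l][s][A] = cand
                            back[(l, s, A)] = (p, B, C)

    return P[n][0].get(start, 0.0), back
```

A returned probability of 0.0 means the string is not in the language; otherwise retracing `back` from the start symbol yields the single most probable parse tree.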
As prose
In informal terms, this algorithm considers every possible substring of the input string and sets P[l,s,v] to be true if the substring of length l starting from s can be generated from the nonterminal Rv. Once it has considered substrings of length 1, it goes on to substrings of length 2, and so on. For substrings of length 2 and greater, it considers every possible partition of the substring into two parts, and checks to see if there is some production Ra → Rb Rc such that Rb matches the first part and Rc matches the second part. If so, it records Ra as matching the whole substring. Once this process is completed, the input string is generated by the grammar if the substring containing the entire input string is matched by the start symbol.
Example
This is an example grammar:

S → NP VP
VP → VP PP
VP → V NP
VP → eats
PP → P NP
NP → Det N
NP → she
V → eats
P → with
N → fish
N → fork
Det → a

Now the sentence she eats a fish with a fork is analyzed using the CYK algorithm. In the following table, in P[i,j], i is the number of the row (starting at the bottom at 1), and j is the number of the column (starting at the left at 1).
| i \ j | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 7 | S |  |  |  |  |  |  |
| 6 |  | VP |  |  |  |  |  |
| 5 |  |  |  |  |  |  |  |
| 4 | S |  |  |  |  |  |  |
| 3 |  | VP |  |  | PP |  |  |
| 2 | S |  | NP |  |  | NP |  |
| 1 | NP | V, VP | Det | N | P | Det | N |
|  | she | eats | a | fish | with | a | fork |
For readability, the CYK table for P is represented here as a 2-dimensional matrix M containing a set of non-terminal symbols, such that Rk is in M[i,j] if, and only if, P[i,j,k] is true. In the above example, since the start symbol S is in M[7,1], the sentence can be generated by the grammar.
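The example can be checked mechanically with a small recognizer (a sketch; the helper name `cyk_accepts` and the rule encoding are illustrative, and the productions are the CNF grammar used in this example):

```python
def cyk_accepts(tokens, terminal_rules, binary_rules, start):
    # Minimal CYK membership test; T[l][s] holds the nonterminals deriving
    # the span of length l starting at 0-based position s.
    n = len(tokens)
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for s, tok in enumerate(tokens):
        T[1][s] = {A for A, a in terminal_rules if a == tok}
    for l in range(2, n + 1):
        for s in range(n - l + 1):
            for p in range(1, l):
                for A, B, C in binary_rules:
                    if B in T[p][s] and C in T[l - p][s + p]:
                        T[l][s].add(A)
    return start in T[n][0]

# The example grammar, split into unit and binary productions.
terminal_rules = [("NP", "she"), ("V", "eats"), ("VP", "eats"),
                  ("Det", "a"), ("N", "fish"), ("P", "with"), ("N", "fork")]
binary_rules = [("S", "NP", "VP"), ("VP", "VP", "PP"), ("VP", "V", "NP"),
                ("PP", "P", "NP"), ("NP", "Det", "N")]

sentence = "she eats a fish with a fork".split()
print(cyk_accepts(sentence, terminal_rules, binary_rules, "S"))  # True
```

The result matches the table above: the start symbol S derives the full seven-word span.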
Extensions
Generating a parse tree
The above algorithm is a recognizer that will only determine if a sentence is in the language. It is simple to extend it into a parser that also constructs a parse tree, by storing parse tree nodes as elements of the array, instead of booleans. The node is linked to the array elements that were used to produce it, so as to build the tree structure. Only one such node in each array element is needed if only one parse tree is to be produced. However, if all parse trees of an ambiguous sentence are to be kept, it is necessary to store in the array element a list of all the ways the corresponding node can be obtained in the parsing process. This is sometimes done with a second table B[n,n,r] of so-called backpointers. The end result is then a shared forest of possible parse trees, where common tree parts are factored between the various parses. This shared forest can conveniently be read as an ambiguous grammar generating only the sentence parsed, but with the same ambiguity as the original grammar, and the same parse trees up to a very simple renaming of non-terminals, as shown by Lang (1994).
Parsing non-CNF context-free grammars
As pointed out by Lange & Leiß (2009), the drawback of all known transformations into Chomsky normal form is that they can lead to an undesirable bloat in grammar size. The size of a grammar is the sum of the sizes of its production rules, where the size of a rule is one plus the length of its right-hand side. Using |G| to denote the size of the original grammar, the size blow-up in the worst case may range from |G|² to 4^|G|, depending on the transformation algorithm used. For use in teaching, Lange and Leiß propose a slight generalization of the CYK algorithm, "without compromising efficiency of the algorithm, clarity of its presentation, or simplicity of proofs" (Lange & Leiß 2009).
Parsing weighted context-free grammars
It is also possible to extend the CYK algorithm to parse strings using weighted and stochastic context-free grammars. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).
Numerical stability
When the probabilistic CYK algorithm is applied to a long string, the splitting probability can become very small due to multiplying many probabilities together. This can be dealt with by summing log-probabilities instead of multiplying probabilities.
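A minimal sketch of the log-space trick (the helper `log_prob` is illustrative, not part of the original article; zero probabilities map to negative infinity so maximization still works):

```python
import math

def log_prob(p):
    # Map a probability to log space; zero becomes -inf, so max() comparisons
    # behave exactly as they would with the raw probabilities.
    return math.log(p) if p > 0.0 else float("-inf")

# Instead of  cand = pr * left * right  (which underflows on long strings),
# compute     cand = log(pr) + log(left) + log(right)  and compare with max.
pr, left, right = 1e-200, 1e-200, 1e-200
underflowed = pr * left * right                         # 0.0: double-precision underflow
stable = log_prob(pr) + log_prob(left) + log_prob(right)  # a finite negative log-probability
print(underflowed, stable)
```

Because log is monotonic, the split maximizing the summed log-probabilities is the same split that maximizes the product, so the recovered parse is unchanged.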
Valiant's algorithm
The worst case running time of CYK is O(n³ · |G|), where n is the length of the parsed string and |G| is the size of the CNF grammar G. This makes it one of the most efficient algorithms for recognizing general context-free languages in practice. Valiant (1975) gave an extension of the CYK algorithm. His algorithm computes the same parsing table as the CYK algorithm; yet he showed that algorithms for efficient multiplication of matrices with 0-1-entries can be utilized for performing this computation.
Using the Coppersmith–Winograd algorithm for multiplying these matrices, this gives an asymptotic worst-case running time of O(n^2.38). However, the constant term hidden by the big O notation is so large that the Coppersmith–Winograd algorithm is only worthwhile for matrices that are too large to handle on present-day computers (Knuth 1997), and this approach requires subtraction and so is only suitable for recognition. The dependence on efficient matrix multiplication cannot be avoided altogether: Lee (2002) has proved that any parser for context-free grammars working in time O(n^(3−ε)) can be effectively converted into an algorithm computing the product of n×n matrices with 0-1-entries in time O(n^(3−ε/3)), and this was extended by Abboud et al.[4] to apply to a constant-size grammar.
References
- ^ Grune, Dick (2008). Parsing techniques : a practical guide (2nd ed.). New York: Springer. p. 579. ISBN 978-0-387-20248-8.
- ^ Itiroo Sakai, “Syntax in universal translation”. In Proceedings 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Her Majesty’s Stationery Office, London, p. 593-608, 1962.
- ^ Sipser, Michael (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology. Definition 2.8. ISBN 0-534-95097-3. OCLC 58544333.
- ^ Abboud, Amir; Backurs, Arturs; Williams, Virginia Vassilevska (2015-11-05). "If the Current Clique Algorithms are Optimal, so is Valiant's Parser". arXiv:1504.01431 [cs.CC].
Sources
- Sakai, Itiroo (1962). Syntax in universal translation. 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Teddington, England. Vol. II. London: Her Majesty’s Stationery Office. pp. 593–608.
- Cocke, John; Schwartz, Jacob T. (April 1970). Programming languages and their compilers: Preliminary notes (PDF) (Technical report) (2nd revised ed.). CIMS, NYU.
- Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading/MA: Addison-Wesley. ISBN 0-201-02988-X.
- Kasami, T. (1965). An efficient recognition and syntax-analysis algorithm for context-free languages (Technical report). AFCRL. 65-758.
- Knuth, Donald E. (November 14, 1997). The Art of Computer Programming Volume 2: Seminumerical Algorithms (3rd ed.). Addison-Wesley Professional. p. 501. ISBN 0-201-89684-2.
- Lang, Bernard (1994). "Recognition can be harder than parsing". Comput. Intell. 10 (4): 486–494. CiteSeerX 10.1.1.50.6982. doi:10.1111/j.1467-8640.1994.tb00011.x. S2CID 5873640.
- Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm". Informatica Didactica. 8.
- Lee, Lillian (2002). "Fast context-free grammar parsing requires fast Boolean matrix multiplication". J. ACM. 49 (1): 1–15. arXiv:cs/0112018. doi:10.1145/505241.505242. S2CID 1243491.
- Sipser, Michael (1997). Introduction to the Theory of Computation (1st ed.). IPS. p. 99. ISBN 0-534-94728-X.
- Valiant, Leslie G. (1975). "General context-free recognition in less than cubic time". J. Comput. Syst. Sci. 10 (2): 308–314. doi:10.1016/s0022-0000(75)80046-8.
- Younger, Daniel H. (February 1967). "Recognition and parsing of context-free languages in time n3". Inform. Control. 10 (2): 189–208. doi:10.1016/s0019-9958(67)80007-x.
Fundamentals
Definition and Purpose
The CYK algorithm, also known as the Cocke–Younger–Kasami algorithm, is a bottom-up dynamic programming parser for determining membership in the language generated by a context-free grammar (CFG) when the grammar is in Chomsky Normal Form (CNF).[6][2][7] It operates by systematically building possible parses from smaller substrings to the full input string, enabling efficient recognition of valid derivations.[6][2] The algorithm was independently developed by Tadao Kasami in a 1965 technical report, Daniel H. Younger in a 1967 journal article, and John Cocke in an unpublished IBM technical report from around 1970.[6][2][7] These contributions established a foundational method for CFG parsing that remains influential in theoretical computer science.[6][2] Its primary purpose is to solve the membership problem for CFGs in CNF—verifying if a string of length n belongs to the grammar's language—in cubic time, specifically O(n³) worst-case complexity.[6][2] This efficiency stems from the algorithm's reliance on a triangular table structure that stores nonterminal sets for all substrings, consuming O(n²) space.[8] It solves the membership problem and, with appropriate bookkeeping, supports the construction of parse trees for valid derivations, accommodating ambiguity by tracking multiple possible nonterminals.[2][8] The CYK algorithm underpins applications in natural language processing for syntactic analysis of sentences, compiler design for efficient syntax checking during code compilation, and formal verification for validating properties in systems modeled by context-free languages.[9][10][2]
Context-Free Grammars and Chomsky Normal Form
A context-free grammar (CFG) is a formal system for generating strings in a language, defined as a quadruple G = (V, Σ, P, S), where V is a finite set of nonterminal symbols (variables), Σ is a finite set of terminal symbols (the alphabet), P is a finite set of production rules each of the form A → α with A ∈ V and α ∈ (V ∪ Σ)*, and S ∈ V is the designated start symbol.[11] The language generated by G, denoted L(G), consists of all terminal strings derivable from S via applications of the productions in P.[11] In the Chomsky hierarchy of formal grammars, CFGs occupy Type-2, positioned between Type-3 regular grammars and Type-1 context-sensitive grammars.[12] The class of context-free languages (CFLs) they generate exhibits closure properties under key operations: specifically, if L1 and L2 are CFLs, then so are their union L1 ∪ L2, concatenation L1L2, and Kleene star L1*.[11] These properties facilitate the construction of CFGs for combined languages without altering the context-free nature. Chomsky normal form (CNF) restricts a CFG to simplify analysis while preserving the generated language. A grammar is in CNF if every production rule is either A → BC (where B and C are nonterminals) or A → a (where a ∈ Σ), except possibly for the start symbol production S → ε if the empty string is in the language.[11] This binary or unary structure eliminates longer chains, epsilon rules (beyond the start), and unit productions. Converting an arbitrary CFG to CNF involves a systematic sequence of transformations that maintain equivalence.
First, eliminate all ε-productions by identifying nullable variables (those deriving ε) and substituting them in other rules, adjusting for the start symbol if necessary.[13] Next, remove unit productions (A → B) by replacing each with the productions of B, iteratively until none remain.[13] Finally, handle productions with more than two nonterminals or intermixed terminals by introducing new nonterminals to binarize chains (e.g., A → BCD becomes A → BX, X → CD) and replace terminals in non-unary rules (e.g., A → aB becomes A → XB, X → a).[13] The CNF restriction is essential for efficient parsing of CFLs, as it enables dynamic programming algorithms to recognize membership in O(n³) time for strings of length n by limiting combinations to pairs of nonterminals, mirroring binary derivation trees.[2] This form underpins the CYK algorithm's table-filling procedure, where subproblem solutions build upon binary splits of the input.[2]
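The binarization and terminal-lifting steps can be sketched as follows (ε-rule and unit-rule elimination are omitted for brevity; the rule encoding and the fresh-nonterminal naming scheme are assumptions for illustration):

```python
def binarize(rules):
    """Split rules with long right-hand sides and lift terminals out of
    non-unary rules. Rules are (lhs, rhs_tuple) pairs; nonterminals are
    uppercase strings and terminals lowercase (an assumed encoding)."""
    out, fresh = [], 0

    def new_nt():
        nonlocal fresh
        fresh += 1
        return f"X{fresh}"

    for lhs, rhs in rules:
        if len(rhs) == 1:                 # already of the form A -> a
            out.append((lhs, rhs))
            continue
        # Lift terminals: A -> B c  becomes  A -> B X,  X -> c
        rhs = list(rhs)
        for i, sym in enumerate(rhs):
            if sym.islower():
                nt = new_nt()
                out.append((nt, (sym,)))
                rhs[i] = nt
        # Binarize: A -> B C D  becomes  A -> B X,  X -> C D
        while len(rhs) > 2:
            nt = new_nt()
            out.append((nt, tuple(rhs[-2:])))
            rhs = rhs[:-2] + [nt]
        out.append((lhs, tuple(rhs)))
    return out

print(binarize([("S", ("A", "B", "c"))]))
# [('X1', ('c',)), ('X2', ('B', 'X1')), ('S', ('A', 'X2'))]
```

Every rule in the output has a right-hand side of length at most two, as CNF requires.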
Core Algorithm
Dynamic Programming Approach
The CYK algorithm utilizes a bottom-up dynamic programming strategy to solve the membership problem for context-free grammars in Chomsky normal form (CNF), constructing parses for increasingly longer substrings of the input string. This approach systematically builds solutions to larger parsing subproblems by combining results from smaller ones, ensuring that each substring is analyzed only once to avoid redundant computations. The core data structure is a triangular table T, where each entry T[i,j] contains the set of non-terminals that derive the substring w_i … w_j of the input string w of length n, according to the grammar. For base cases where i = j, T[i,i] is populated with non-terminals A such that there exists a production A → w_i.[1][14] The fundamental recurrence relation for filling the table when i < j is defined as follows:

T[i,j] = { A | (A → BC) ∈ P and there exists k with i ≤ k < j such that B ∈ T[i,k] and C ∈ T[k+1,j] }

where P is the set of productions in the CNF grammar. This relation composes binary productions to derive non-terminals for longer spans by considering all possible split points k, effectively building a parse bottom-up from terminal symbols to the full string. The table is filled in order of increasing span length, and for each length, over all starting positions from 1 upward. This process captures the compositional nature of context-free languages, where substrings are parsed independently before integration into larger constituents.[1][14] The time complexity of the CYK algorithm is O(n³ · |P|), where |P| represents the number of productions in the grammar, arising from the O(n²) table entries, each requiring up to O(n · |P|) work to compute via the recurrence. Space complexity is O(n²), dominated by the storage of the table itself. A string is accepted by the grammar if and only if the start symbol S appears in T[1,n], confirming that the entire input derives from S. This dynamic programming framework provides an efficient, tabular method for parsing, leveraging memoization to eliminate overlapping subproblems inherent in recursive descent approaches.[1][14]
Pseudocode Implementation
The CYK algorithm is typically implemented using a dynamic programming table T, where each entry T[i,j] is a set of non-terminals that can derive the substring w_i … w_j of the input string w. This set-based representation accommodates ambiguity by allowing multiple non-terminals per cell without duplication.[15] For substrings of length 1, the table is initialized by adding each non-terminal A to T[i,i] if there is a production A → w_i in the grammar.[15] The full pseudocode for the standard membership version, assuming a context-free grammar in Chomsky normal form (CNF), is as follows:

algorithm CYK_membership(w, G = (V, Σ, P, S))
    n ← length(w)
    if n = 0 then
        return (S → ε) ∈ P // Handle empty string if ε-rule for start symbol exists
    T ← array[1..n, 1..n] initialized to empty sets
    // Base case: single terminals
    for i ← 1 to n do
        for each production A → a ∈ P s.t. a = w[i] do
            T[i, i] ← T[i, i] ∪ {A}
    // Inductive case: longer substrings
    for j ← 2 to n do // length of substring
        for i ← 1 to n - j + 1 do // starting index
            end ← i + j - 1
            for split ← i to end - 1 do // split position k
                for each production A → B C ∈ P do
                    if B ∈ T[i, split] and C ∈ T[split + 1, end] then
                        T[i, end] ← T[i, end] ∪ {A}
    return S ∈ T[1, n]
end
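A direct Python rendering of this set-based pseudocode might look like this (a sketch; the grammar encoding is an assumed representation, not part of the original article):

```python
def cyk_membership(w, terminal_rules, binary_rules, start, start_derives_empty=False):
    """Set-based CYK membership test; T[i][j] holds the nonterminals
    deriving w[i..j] (0-based, inclusive)."""
    n = len(w)
    if n == 0:
        return start_derives_empty      # the S -> ε case
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):                  # base case: single terminals
        for A, a in terminal_rules:
            if a == w[i]:
                T[i][i].add(A)
    for length in range(2, n + 1):      # inductive case: longer substrings
        for i in range(n - length + 1):
            end = i + length - 1
            for split in range(i, end):
                for A, B, C in binary_rules:
                    if B in T[i][split] and C in T[split + 1][end]:
                        T[i][end].add(A)
    return start in T[0][n - 1]
```

Unlike the boolean-array formulation earlier in the article, each cell here stores a set of nonterminals directly, which is the representation the pseudocode above uses.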
Step-by-Step Execution
The execution of the CYK algorithm commences with the initialization of an n × n dynamic programming table, where n denotes the length of the input string, arranged in a triangular fashion to represent substrings. All cells begin empty, and the base case for substrings of length 1 is populated first: for each position i ranging from 1 to n, the diagonal cell T[i,i] receives all non-terminals A for which a production A → w_i exists, with w_i being the terminal symbol at that position.[17] This step identifies the fundamental units derivable directly from the grammar's terminal rules.[18] Subsequent iterations proceed in a bottom-up manner, processing substrings of increasing length k from 2 to n. For each k, the algorithm iterates over all starting positions i from 1 to n − k + 1, targeting the substring spanning i to i + k − 1. Within each such substring, it examines every possible split point j from i to i + k − 2. At each split, the contents of T[i,j] and T[j+1,i+k−1] are consulted; if a binary production A → B C holds where B belongs to the first cell's set and C to the second, then A is incorporated into T[i,i+k−1]. This ordered filling—by length, then position, then split—leverages previously computed subresults to build longer derivations systematically.[17][18] The combination mechanism hinges on efficient matching of binary rules, assuming the grammar adheres to Chomsky normal form with productions limited to two non-terminals on the right-hand side or a single terminal. To expedite this, productions are typically preprocessed into indexed structures, such as sets or hash tables mapping pairs of non-terminals to deriving non-terminals, thereby avoiding exhaustive grammar scans during table filling.[17] Multiple additions to a cell may occur across splits, accumulating all viable non-terminals that can derive the substring.
Verification concludes the process by inspecting the root cell T[1,n]: the input string belongs to the language if the start symbol S appears therein. Ambiguity arises if multiple derivation paths lead to S, though the algorithm itself does not enumerate them.[19][18] A frequent implementation challenge involves inadequate rule indexing, which forces repeated full grammar traversals and inflates time complexity beyond the standard O(n³). Proper preprocessing mitigates this, ensuring lookups remain constant time per check.[17]
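The rule indexing described above can be as simple as a dictionary keyed by right-hand-side pairs (a sketch with illustrative names):

```python
from collections import defaultdict

def index_binary_rules(binary_rules):
    """Map each RHS pair (B, C) to the set of LHS nonterminals A with A -> B C,
    so the inner step of CYK becomes a constant-time dictionary lookup."""
    by_rhs = defaultdict(set)
    for A, B, C in binary_rules:
        by_rhs[(B, C)].add(A)
    return by_rhs

by_rhs = index_binary_rules([("S", "A", "B"), ("C", "A", "B"), ("B", "C", "C")])
# During table filling, instead of scanning all productions for each cell:
#   for B in T[i][k]:
#       for C in T[k+1][j]:
#           T[i][j] |= by_rhs.get((B, C), set())
print(sorted(by_rhs[("A", "B")]))  # ['C', 'S']
```

With this index, the work per split point is proportional to the number of nonterminal pairs present in the two cells rather than to the full size of the grammar.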
Illustrative Example
Grammar Setup
To illustrate the CYK algorithm, consider a context-free grammar G = (V, Σ, P, S) in Chomsky normal form (CNF), where V = {S, A, B, C} is the set of nonterminals, Σ = {a, b} is the set of terminals, S is the start symbol, and P consists of the following productions:

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

This grammar is in CNF because every production is of the form X → YZ (where X, Y, Z are nonterminals) or X → x (where x ∈ Σ).[20] The input string to parse is w = baaba, with length n = 5. The symbols are positioned as w1 = b, w2 = a, w3 = a, w4 = b, w5 = a. This example is selected for its moderate size, which facilitates a clear visualization of the dynamic programming table without excessive complexity, while demonstrating the algorithm on a string accepted by the grammar.[20]
Table Construction and Results
The CYK algorithm constructs a triangular table T, where T[i,j] represents the set of non-terminals that derive the substring starting at position i and ending at position j (with i ≤ j). For the example grammar in Chomsky normal form and input string "baaba" of length n = 5, the table is filled bottom-up by increasing substring length. For the base case of length 1 substrings, each cell T[i,i] is populated by matching the terminal symbol to applicable unit productions. Thus T[1,1] = {B}, since the first "b" matches B → b; T[2,2] = {A, C} for the first "a" via A → a and C → a; T[3,3] = {A, C} for the second "a"; T[4,4] = {B} for "b"; and T[5,5] = {A, C} for the last "a". For length 2 substrings, the algorithm examines possible splits at each k between i and j. For T[1,2] ("ba"), the only split is at k = 1, yielding T[1,1] = {B} and T[2,2] = {A, C}; matching productions include A → BA (adding A) and S → BC (adding S), so T[1,2] = {S, A}. For T[2,3] ("aa"), split at k = 2: T[2,2] = {A, C} and T[3,3] = {A, C}, matching B → CC (adding B), so T[2,3] = {B}. For T[3,4] ("ab"): T[3,3] = {A, C} and T[4,4] = {B}, matching S → AB and C → AB (adding S, C), so T[3,4] = {S, C}. For T[4,5] ("ba"): similar to T[1,2], T[4,5] = {S, A}. Higher lengths are filled similarly by checking all possible splits and applicable productions. The completed table is:

| Length \ Starting | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 5 | {A,S,C} | ||||
| 4 | ∅ | {A,S,C} | |||
| 3 | ∅ | {B} | {B} | ||
| 2 | {S,A} | {B} | {S,C} | {S,A} | |
| 1 | {B} | {A,C} | {A,C} | {B} | {A,C} |
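The table can be reproduced mechanically; the sketch below recomputes the sets for each span of "baaba" using the example productions (the rule encoding and helper name are illustrative):

```python
def cyk_table(w, terminal_rules, binary_rules):
    # T[i][j]: nonterminals deriving w[i..j] inclusive, 0-based.
    n = len(w)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        T[i][i] = {A for A, a in terminal_rules if a == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for A, B, C in binary_rules:
                    if B in T[i][k] and C in T[k + 1][j]:
                        T[i][j].add(A)
    return T

# The example productions, split into unit and binary rules.
grammar_terminal = [("A", "a"), ("B", "b"), ("C", "a")]
grammar_binary = [("S", "A", "B"), ("S", "B", "C"), ("A", "B", "A"),
                  ("B", "C", "C"), ("C", "A", "B")]

T = cyk_table("baaba", grammar_terminal, grammar_binary)
# T[0][4] is the full string, i.e. the (length 5, start 1) cell of the table.
print(sorted(T[0][4]))  # ['A', 'C', 'S']
```

Since the start symbol S appears in the full-string cell, "baaba" is accepted, in agreement with the table.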
