CYK algorithm
from Wikipedia
Cocke–Younger–Kasami algorithm (CYK)
Class: Parsing with context-free grammars
Data structure: String
Worst-case performance: O(n³ · |G|), where:
  • n is the length of the string
  • |G| is the size of the CNF grammar

In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961.[1][2] The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

The standard version of CYK operates only on context-free grammars given in Chomsky normal form (CNF). However any context-free grammar may be algorithmically transformed into a CNF grammar expressing the same language (Sipser 1997).

The importance of the CYK algorithm stems from its high efficiency in certain situations. Using big O notation, the worst case running time of CYK is O(n³ · |G|), where n is the length of the parsed string and |G| is the size of the CNF grammar (Hopcroft & Ullman 1979, p. 140). This makes it one of the most efficient parsing algorithms in terms of worst-case asymptotic complexity, although other algorithms exist with better average running time in many practical scenarios.

Standard form


The dynamic programming algorithm requires the context-free grammar to be rendered into Chomsky normal form (CNF), because it tests for possibilities to split the current sequence into two smaller sequences. Any context-free grammar that does not generate the empty string can be represented in CNF using only production rules of the forms A → BC and A → a; to allow for the empty string, one can explicitly allow S → ε, where S is the start symbol.[3]

Algorithm


As pseudocode


The algorithm in pseudocode is as follows:

let the input be a string I consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr, with start symbol R1.
let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
let back[n,n,r] be an array of lists of backpointing triples. Initialize all elements of back to the empty list.

for each s = 1 to n
    for each unit production Rv → as
        set P[1,s,v] = true

for each l = 2 to n -- Length of span
    for each s = 1 to n-l+1 -- Start of span
        for each p = 1 to l-1 -- Partition of span
            for each production Ra → Rb Rc
                if P[p,s,b] and P[l-p,s+p,c] then
                    set P[l,s,a] = true
                    append <p,b,c> to back[l,s,a]

if P[n,1,1] is true then
    I is member of language
    return back -- by retracing the steps through back, one can easily construct all possible parse trees of the string.
else
    return "not a member of language"

Probabilistic CYK (for finding the most probable parse)


This version recovers the most probable parse, given the probabilities of all productions.

let the input be a string I consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr, with start symbol R1.
let P[n,n,r] be an array of real numbers. Initialize all elements of P to zero.
let back[n,n,r] be an array of backpointing triples.
for each s = 1 to n
  for each unit production Rv → as
    set P[1,s,v] = Pr(Rv → as)
for each l = 2 to n -- Length of span
  for each s = 1 to n-l+1 -- Start of span
    for each p = 1 to l-1 -- Partition of span       
      for each production Ra → Rb Rc
        prob_splitting = Pr(Ra → Rb Rc) * P[p,s,b] * P[l-p,s+p,c]
        if prob_splitting > P[l,s,a] then 
          set P[l,s,a] = prob_splitting
          set back[l,s,a] = <p,b,c>

if P[n,1,1] > 0 then
    find the parse tree by retracing through back
    return the parse tree
else
    return "not a member of language"

As prose


In informal terms, this algorithm considers every possible substring of the input string and sets P[l,s,v] to be true if the substring of length l starting from s can be generated from the nonterminal Rv. Once it has considered substrings of length 1, it goes on to substrings of length 2, and so on. For substrings of length 2 and greater, it considers every possible partition of the substring into two parts, and checks to see if there is some production Ra → Rb Rc such that Rb matches the first part and Rc matches the second part. If so, it records Ra as matching the whole substring. Once this process is completed, the input string is generated by the grammar if the substring containing the entire input string is matched by the start symbol.

Example

[Figure: Sentence parsing using the CYK algorithm]

This is an example grammar:

S → NP VP
VP → VP PP
VP → V NP
VP → eats
PP → P NP
NP → Det N
NP → she
V → eats
P → with
N → fish
N → fork
Det → a

Now the sentence she eats a fish with a fork is analyzed using the CYK algorithm. In the following table, in P[i,j,k], i is the number of the row (starting at the bottom at 1), and j is the number of the column (starting at the left at 1).

CYK table (row i, column j: nonterminals deriving the span of i words starting at word j; blank cells are empty):

 7 | S    |      |      |      |      |      |
 6 |      | VP   |      |      |      |      |
 5 |      |      |      |      |      |      |
 4 | S    |      |      |      |      |      |
 3 |      | VP   |      |      | PP   |      |
 2 | S    |      | NP   |      |      | NP   |
 1 | NP   | V,VP | Det  | N    | P    | Det  | N
   | she  | eats | a    | fish | with | a    | fork

For readability, the CYK table for P is represented here as a 2-dimensional matrix M containing a set of non-terminal symbols, such that Rk is in M[i,j] if, and only if, P[i,j,k] is true. In the above example, since the start symbol S is in M[7,1], the sentence can be generated by the grammar.
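Assuming the cyk_recognize sketch given earlier, the analysis of this sentence can be reproduced mechanically, with the example grammar encoded as rule lists:

unit_rules = [("NP", "she"), ("V", "eats"), ("VP", "eats"), ("Det", "a"),
              ("N", "fish"), ("P", "with"), ("N", "fork")]
binary_rules = [("S", "NP", "VP"), ("VP", "VP", "PP"), ("VP", "V", "NP"),
                ("PP", "P", "NP"), ("NP", "Det", "N")]

tokens = "she eats a fish with a fork".split()
accepted, back = cyk_recognize(tokens, unit_rules, binary_rules, "S")
print(accepted)   # True: S is in the top cell, matching M[7,1] above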

Extensions


Generating a parse tree


The above algorithm is a recognizer that will only determine if a sentence is in the language. It is simple to extend it into a parser that also constructs a parse tree, by storing parse tree nodes as elements of the array, instead of booleans. The node is linked to the array elements that were used to produce it, so as to build the tree structure. Only one such node in each array element is needed if only one parse tree is to be produced. However, if all parse trees of an ambiguous sentence are to be kept, it is necessary to store in the array element a list of all the ways the corresponding node can be obtained in the parsing process. This is sometimes done with a second table B[n,n,r] of so-called backpointers. The end result is then a shared forest of possible parse trees, where common tree parts are factored between the various parses. This shared forest can conveniently be read as an ambiguous grammar generating only the sentence parsed, but with the same ambiguity as the original grammar, and the same parse trees up to a very simple renaming of non-terminals, as shown by Lang (1994).
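A minimal sketch of the retracing step, assuming the back table produced by the cyk_recognize sketch above (each cell maps a nonterminal to a list of <split, B, C> triples); it follows the first backpointer in each cell, so it recovers one tree of a possibly shared forest:

def build_tree(back, tokens, l, s, A):
    # Reconstruct one parse tree, as nested tuples, for nonterminal A
    # over the span of length l starting at 0-based position s.
    if l == 1:
        return (A, tokens[s])                  # leaf: A -> terminal
    p, B, C = back[l - 1][s][A][0]             # first recorded <p, B, C>
    return (A,
            build_tree(back, tokens, p, s, B),
            build_tree(back, tokens, l - p, s + p, C))

# e.g. build_tree(back, tokens, len(tokens), 0, "S") on the example above;
# enumerating all trees would instead iterate over every triple in each list.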

Parsing non-CNF context-free grammars


As pointed out by Lange & Leiß (2009), the drawback of all known transformations into Chomsky normal form is that they can lead to an undesirable bloat in grammar size. The size of a grammar is the sum of the sizes of its production rules, where the size of a rule is one plus the length of its right-hand side. Using |G| to denote the size of the original grammar, the size blow-up in the worst case may range from |G|² to 4^|G|, depending on the transformation algorithm used. For the use in teaching, Lange and Leiß propose a slight generalization of the CYK algorithm, "without compromising efficiency of the algorithm, clarity of its presentation, or simplicity of proofs" (Lange & Leiß 2009).

Parsing weighted context-free grammars


It is also possible to extend the CYK algorithm to parse strings using weighted and stochastic context-free grammars. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).
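A sketch of the corresponding cell update for the min-weight case (the helper name relax and the dictionary-of-weights cell representation are assumptions of this sketch; weights combine additively, e.g. negative log-probabilities):

def relax(P, i, j, k, A, B, C, rule_weight):
    # Update the minimum weight with which A derives the substring from
    # i to j, using the split at k and the production A -> B C.
    if B in P[i][k] and C in P[k][j]:
        cand = rule_weight + P[i][k][B] + P[k][j][C]
        if cand < P[i][j].get(A, float("inf")):
            P[i][j][A] = cand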

Numerical stability


When the probabilistic CYK algorithm is applied to a long string, the splitting probability can become very small due to multiplying many probabilities together. This can be dealt with by summing log-probabilities instead of multiplying probabilities.
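A sketch of the same Viterbi update rewritten in log space (names hypothetical; probabilities are assumed to enter as math.log(p), with float("-inf") standing for unreachable entries):

import math

def log_update(best_so_far, log_pr_rule, log_left, log_right):
    # The product Pr(rule) * Pr(left) * Pr(right) becomes a sum of logs,
    # which cannot underflow the way repeated multiplication can.
    candidate = log_pr_rule + log_left + log_right
    return max(best_so_far, candidate)

# e.g. log_update(float("-inf"), math.log(0.5), math.log(1e-300), math.log(1e-300))
# stays finite, whereas 0.5 * 1e-300 * 1e-300 underflows to 0.0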

Valiant's algorithm


The worst case running time of CYK is O(n³ · |G|), where n is the length of the parsed string and |G| is the size of the CNF grammar G. This makes it one of the most efficient algorithms for recognizing general context-free languages in practice. Valiant (1975) gave an extension of the CYK algorithm. His algorithm computes the same parsing table as the CYK algorithm; yet he showed that algorithms for efficient multiplication of matrices with 0-1 entries can be utilized for performing this computation.

Using the Coppersmith–Winograd algorithm for multiplying these matrices, this gives an asymptotic worst-case running time of O(n^2.38 · |G|). However, the constant term hidden by the big O notation is so large that the Coppersmith–Winograd algorithm is only worthwhile for matrices that are too large to handle on present-day computers (Knuth 1997), and this approach requires subtraction and so is only suitable for recognition. The dependence on efficient matrix multiplication cannot be avoided altogether: Lee (2002) has proved that any parser for context-free grammars working in time O(n^(3−ε)) can be effectively converted into an algorithm computing the product of (n × n)-matrices with 0-1 entries in time O(n^(3−ε/3)), and this was extended by Abboud et al.[4] to apply to a constant-size grammar.

from Grokipedia
The Cocke–Younger–Kasami (CYK) algorithm is a dynamic programming-based method for determining whether a given string belongs to the language generated by a context-free grammar (CFG) and, if so, constructing one or more corresponding parse trees. It requires the grammar to be in Chomsky normal form (CNF), where productions are limited to A → BC (two non-terminals) or A → a (a single terminal), and fills an n × n triangular table (for input length n) by iteratively combining non-terminals over increasingly longer substrings, starting from single terminals. The algorithm achieves O(n³) time complexity and O(n²) space usage, enabling efficient recognition and syntax analysis without exponential blow-up in computation relative to input size.

The algorithm was first published by Itiroo Sakai in 1961. It was independently developed in the mid-1960s by Tadao Kasami in a 1965 technical report on efficient recognition and syntax analysis for context-free languages, followed by Daniel H. Younger's 1967 publication detailing its application to recognition and parsing of context-free languages in cubic time, and later by John Cocke in approximately 1970 as part of broader work on compiler optimization. Kasami's formulation emphasized the algebraic properties of context-free languages to avoid exhaustive enumeration, while Younger's version explicitly constructed a parsing matrix to handle ambiguity by tracking multiple derivations.

In practice, the CYK algorithm underpins probabilistic parsing models like probabilistic CFGs (PCFGs) in natural language processing, where it is extended to compute the most likely parse by associating probabilities with rules and selecting the highest-scoring derivation. Its tabular structure naturally accommodates structural ambiguities in sentences, making it foundational for applications in syntax analysis, computational linguistics, and compiler design, though it is typically preceded by conversion steps that may introduce minor overhead. Despite its cubic time bound, optimizations and parallel implementations have extended its utility to larger inputs on modern hardware.

Fundamentals

Definition and Purpose

The CYK algorithm, also known as the Cocke–Younger–Kasami algorithm, is a bottom-up dynamic programming parser for determining membership in the language generated by a context-free grammar (CFG) when the grammar is in Chomsky normal form (CNF). It operates by systematically building possible parses from smaller substrings to the full input string, enabling efficient recognition of valid derivations. The algorithm was independently developed by Tadao Kasami in a 1965 technical report, Daniel H. Younger in a 1967 journal article, and John Cocke in an unpublished technical report from around 1970. These contributions established a foundational method for CFG parsing that remains influential in computational linguistics.

Its primary purpose is to solve the membership problem for CFGs in CNF—verifying if a string of length n belongs to the grammar's language—in cubic time, specifically O(n³) time complexity. This efficiency stems from the algorithm's reliance on a triangular table structure that stores nonterminal sets for all substrings, consuming O(n²) space. It solves the membership problem and, with appropriate backpointers, supports the construction of parse trees for valid derivations, accommodating ambiguity by tracking multiple possible nonterminals. The CYK algorithm underpins applications in natural language processing for syntactic analysis of sentences, compiler design for efficient syntax checking during code compilation, and formal verification for validating properties in systems modeled by context-free languages.

Context-Free Grammars and Chomsky Normal Form

A context-free grammar (CFG) is a formal grammar for generating strings in a language, defined as a quadruple G = (V, Σ, P, S), where V is a finite set of nonterminal symbols (variables), Σ is a finite set of terminal symbols (the alphabet), P is a finite set of production rules each of the form A → α with A ∈ V and α ∈ (V ∪ Σ)*, and S ∈ V is the designated start symbol. The language generated by G, denoted L(G), consists of all terminal strings derivable from S via applications of the productions in P.

In the Chomsky hierarchy of formal grammars, CFGs occupy Type-2, positioned between Type-3 regular grammars and Type-1 context-sensitive grammars. The class of context-free languages (CFLs) they generate exhibits closure properties under key operations: specifically, if L1 and L2 are CFLs, then so are their union L1 ∪ L2, concatenation L1L2, and Kleene star L1*. These properties facilitate the construction of CFGs for combined languages without altering the context-free nature.

Chomsky normal form (CNF) restricts a CFG to simplify analysis while preserving the generated language. A grammar G is in CNF if every production rule is either A → BC (where A, B, C ∈ V and B, C are not the start symbol) or A → a (where a ∈ Σ), except possibly for the start symbol production S → ε if the empty string is in the language. This binary or unary structure eliminates longer chains, epsilon rules (beyond the start), and unit productions.

Converting an arbitrary CFG to CNF involves a systematic sequence of transformations that maintain equivalence. First, eliminate all ε-productions by identifying nullable variables (those deriving ε) and substituting them in other rules, adjusting for the start symbol if necessary. Next, remove unit productions (A → B) by replacing each with the productions of B, iteratively until none remain. Finally, handle productions with more than two nonterminals or intermixed terminals by introducing new nonterminals to binarize chains (e.g., A → BCD becomes A → BX, X → CD) and replace terminals in non-unary rules (e.g., A → aB becomes A → XB, X → a). A sketch of the binarization step follows below.

The CNF restriction is essential for efficient parsing of CFLs, as it enables dynamic programming algorithms to recognize membership in O(n³) time for strings of length n by limiting combinations to pairs of nonterminals, mirroring binary derivation trees. This form underpins the CYK algorithm's table-filling procedure, where subproblem solutions build upon binary splits of the input.
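As an isolated illustration of the binarization step described above (e.g., A → BCD becomes A → BX, X → CD); the (lhs, rhs-tuple) rule representation and the fresh-name scheme are assumptions of this sketch:

def binarize(rules):
    # Split every production A -> X1 X2 ... Xk with k > 2 into binary
    # rules by introducing fresh nonterminals; fresh names are assumed
    # not to collide with existing nonterminals.
    out = []
    for lhs, rhs in rules:
        head, fresh, cur, rest = lhs, 0, lhs, rhs
        while len(rest) > 2:
            fresh += 1
            new = f"{head}_{fresh}"
            out.append((cur, (rest[0], new)))
            cur, rest = new, rest[1:]
        out.append((cur, rest))
    return out

# binarize([("A", ("B", "C", "D"))])
# -> [("A", ("B", "A_1")), ("A_1", ("C", "D"))]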

Core Algorithm

Dynamic Programming Approach

The CYK algorithm utilizes a bottom-up dynamic programming strategy to solve the membership problem for context-free grammars in Chomsky normal form (CNF), constructing parses for increasingly longer substrings of the input string. This approach systematically builds solutions to larger parsing subproblems by combining results from smaller ones, ensuring that each substring is analyzed only once to avoid redundant computations.

The core data structure is a triangular table T[1..n, 1..n], where n is the length of the input string w = w1 w2 … wn, and each entry T[i,j] contains the set of non-terminals that derive the substring wi … wi+j−1 according to the grammar. For base cases where j = 1, T[i,1] is populated with non-terminals A such that there exists a production A → wi. The fundamental recurrence relation for filling the table when j > 1 is defined as follows:

T[i,j] = { A | A → BC ∈ P, and there exists k ∈ {1, …, j−1} such that B ∈ T[i,k] and C ∈ T[i+k, j−k] }

where P is the set of productions in the CNF grammar. This relation composes binary productions to derive non-terminals for longer spans by considering all possible split points k, effectively building a parse bottom-up from terminal symbols to the full string. The table is filled in order of increasing j, and for each j, the algorithm iterates over all starting positions i from 1 to n−j+1. This process captures the compositional nature of context-free languages, where substrings are parsed independently before integration into larger constituents.

The time complexity of the CYK algorithm is O(n³ · |G|), where |G| represents the number of productions in the grammar, arising from the O(n²) table entries, each requiring up to O(n · |G|) work to compute via the recurrence. Space complexity is O(n²), dominated by the storage of the table itself. A string is accepted by the grammar if and only if the start symbol S appears in T[1,n], confirming that the entire input derives from S. This dynamic programming framework provides an efficient, tabular method for parsing, leveraging memoization to eliminate overlapping subproblems inherent in recursive descent approaches.

Pseudocode Implementation

The CYK algorithm is typically implemented using a dynamic programming table T, where each entry T[i,j] is a set of non-terminals that can derive the substring wi … wj of the input w of length n. This set-based representation accommodates ambiguity by allowing multiple non-terminals per cell without duplication. For substrings of length 1, the table is initialized by adding each non-terminal A to T[i,i] if there is a production A → wi in the grammar. The full pseudocode for the standard membership version, assuming a context-free grammar in Chomsky normal form (CNF), is as follows:

algorithm CYK_membership(w, G = (V, Σ, P, S))
    n ← length(w)
    if n = 0 then
        return (S → ε) ∈ P          // handle empty string if ε-rule for start symbol exists
    T ← array[1..n, 1..n] initialized to empty sets
    // Base case: single terminals
    for i ← 1 to n do
        for each production A → a ∈ P s.t. a = w[i] do
            T[i, i] ← T[i, i] ∪ {A}
    // Inductive case: longer substrings
    for j ← 2 to n do                       // length of substring
        for i ← 1 to n - j + 1 do           // starting index
            end ← i + j - 1
            for split ← i to end - 1 do     // split position k
                for each production A → B C ∈ P do
                    if B ∈ T[i, split] and C ∈ T[split + 1, end] then
                        T[i, end] ← T[i, end] ∪ {A}
    return S ∈ T[1, n]
end


This implementation assumes no ε-rules in the grammar except possibly S → ε for the empty-string case, as required by CNF. A basic probabilistic extension for probabilistic CFGs replaces the set unions with computations of summed or maximized probabilities over productions, yielding the inside probabilities for total likelihood or Viterbi scores for the most probable parse.

Step-by-Step Execution

The execution of the CYK algorithm commences with the initialization of an n × n dynamic programming table, where n denotes the length of the input string, arranged in a triangular fashion to represent substrings. All cells begin empty, and the base case for substrings of length 1 is populated first: for each position i ranging from 1 to n, the diagonal cell T[i,i] receives all non-terminals A for which a production A → wi exists, with wi being the terminal symbol at that position. This step identifies the fundamental units derivable directly from the grammar's terminal rules.

Subsequent iterations proceed in a bottom-up manner, processing substrings of increasing length k from 2 to n. For each k, the algorithm iterates over all starting positions i from 1 to n − k + 1, targeting the substring spanning i to i + k − 1. Within each such substring, it examines every possible split point j from i to i + k − 2. At each split, the contents of T[i,j] and T[j+1, i+k−1] are consulted; if a binary production A → BC holds where B belongs to the first cell's set and C to the second, then A is incorporated into T[i, i+k−1]. This ordered filling—by length, then position, then split—leverages previously computed subresults to build longer derivations systematically.

The combination mechanism hinges on efficient matching of binary rules, assuming the grammar adheres to Chomsky normal form with productions limited to two non-terminals on the right-hand side or a single terminal. To expedite this, productions are typically preprocessed into indexed structures, such as sets or hash tables mapping pairs of non-terminals to deriving non-terminals, thereby avoiding exhaustive grammar scans during table filling; a sketch of such an index follows below. Multiple additions to a cell may occur across splits, accumulating all viable non-terminals that can derive the substring.

Verification concludes the process by inspecting the root cell T[1,n]: the input string belongs to the language if the start symbol S appears therein. Ambiguity arises if multiple derivation paths lead to S, though the algorithm itself does not enumerate them. A frequent implementation challenge involves inadequate rule indexing, which forces repeated full grammar traversals and inflates the running time beyond the standard O(n³). Proper preprocessing mitigates this, ensuring lookups remain constant time per check.
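The indexing described above can be as simple as a hash map from right-hand-side pairs to left-hand sides (a sketch; the (A, B, C) triple representation for binary rules is an assumption carried over from the earlier sketches):

from collections import defaultdict

def index_binary_rules(binary_rules):
    # Map each right-hand-side pair (B, C) to the set of nonterminals A
    # with A -> B C, so each split check becomes a constant-time lookup
    # instead of a scan over the whole grammar.
    by_rhs = defaultdict(set)
    for A, B, C in binary_rules:
        by_rhs[(B, C)].add(A)
    return by_rhs

# the combination step for a cell then reads:
# for B in T[i][k]:
#     for C in T[k + 1][j]:
#         T[i][j] |= by_rhs.get((B, C), set())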

Illustrative Example

Grammar Setup

To illustrate the CYK algorithm, consider a context-free grammar G = (V, Σ, P, S) in Chomsky normal form (CNF), where V = {S, A, B, C} is the set of nonterminals, Σ = {a, b} is the set of terminals, S is the start symbol, and P consists of the following productions:

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

This grammar is in CNF because every production is of the form X → YZ (where X, Y, Z are nonterminals) or X → σ (where σ ∈ Σ). The input to parse is w = baaba, with length n = 5. The symbols are positioned as w1 = b, w2 = a, w3 = a, w4 = b, w5 = a. This example is selected for its moderate size, which facilitates a clear visualization of the dynamic programming table without excessive complexity, while demonstrating the algorithm on a string accepted by the grammar.

Table Construction and Results

The CYK algorithm constructs a triangular table T, where T[i,j] represents the set of non-terminals that derive the substring starting at position i and ending at position j (with j ≥ i). For the example grammar in Chomsky normal form and input string "baaba" of length n = 5, the table is filled bottom-up by increasing substring length l = j − i + 1.

For the base case of length 1 substrings, each cell is populated by matching the terminal symbol to applicable unit productions. Thus, T[1,1] = {B} since the first "b" matches B → b; T[2,2] = {A, C} for the first "a" via A → a and C → a; T[3,3] = {A, C} for the second "a"; T[4,4] = {B} for "b"; and T[5,5] = {A, C} for the last "a".

For length 2 substrings, the algorithm examines possible splits at each k between i and j. For T[1,2] ("ba"), the only split is at k = 1, yielding T[1,1] = {B} and T[2,2] = {A, C}; matching productions include A → BA (adding A) and S → BC (adding S), so T[1,2] = {S, A}. For T[2,3] ("aa"), the split at k = 2 gives T[2,2] = {A, C} and T[3,3] = {A, C}, matching B → CC (adding B), so T[2,3] = {B}. For T[3,4] ("ab"), T[3,3] = {A, C} and T[4,4] = {B} match S → AB and C → AB (adding S, C), so T[3,4] = {S, C}. For T[4,5] ("ba"), analogously to T[1,2], T[4,5] = {S, A}.

Higher lengths are filled similarly by checking all possible splits and applicable productions. The completed table is:
Length \ Start |    1    |    2    |   3   |   4   |   5
      5        | {A,S,C} |         |       |       |
      4        |         | {A,S,C} |       |       |
      3        |         | {B}     | {B}   |       |
      2        | {S,A}   | {B}     | {S,C} | {S,A} |
      1        | {B}     | {A,C}   | {A,C} | {B}   | {A,C}
The string is accepted since the start symbol S appears in T[1,n] = T[1,5]. This example also exhibits ambiguity: S enters T[1,5] through more than one split, indicating multiple possible parses.
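Reusing the cyk_recognize sketch from the pseudocode section above, the verdict for this example can be checked mechanically:

unit_rules = [("A", "a"), ("C", "a"), ("B", "b")]
binary_rules = [("S", "A", "B"), ("S", "B", "C"), ("A", "B", "A"),
                ("B", "C", "C"), ("C", "A", "B")]

accepted, _ = cyk_recognize(list("baaba"), unit_rules, binary_rules, "S")
print(accepted)   # True: S is in T[1,5], as in the completed table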

Advanced Features

Parse Tree Generation

To generate a parse tree from the CYK table after determining membership, the algorithm is modified to augment each table entry with backpointers during the filling phase. These backpointers are typically stored as tuples consisting of a production rule (e.g., A → BC) and the split position k that led to the non-terminal in the cell, allowing traceability of the derivation path.

The backtracking process begins with a recursive descent from the root cell T[1,n], where n is the input string length, assuming the start symbol S is present. Starting at this cell, a suitable production S → AB is selected from the backpointers, along with a corresponding split position k; the process then recurses on the subintervals T[1,k] for A and T[k+1,n] for B, continuing until reaching the base case of single-word spans with terminal productions. This yields a parse tree representing the derivation.

In cases of ambiguity, where multiple productions or splits apply to a cell, the standard augmented CYK recovers one valid parse tree via the backpointers. To enumerate all possible trees, the algorithm must explore every alternative backpointer in the relevant cells, potentially producing an exponential number of outputs relative to the degree of ambiguity in the grammar and input.

The resulting parse tree is output as a rooted structure, with internal nodes labeled by non-terminals and edges connecting via production rules, terminating in leaves that match the input terminals in order. For the illustrative example discussed earlier, these backpointers would enable reconstruction of the specific derivation leading to acceptance without refilling the table. Constructing a single parse tree via backtracking requires O(n²) time, proportional to the table size, following the O(n³) table-filling phase; however, full enumeration of ambiguous parses can incur exponential time due to branching over multiple backpointers.

Non-CNF Grammar Parsing

The standard CYK algorithm is limited to context-free grammars (CFGs) in strict Chomsky normal form (CNF), where all productions are of the form A → BC (with B and C nonterminals) or A → a (with a a terminal), and it does not directly handle ε-productions or unit productions (A → B), requiring preprocessing to avoid incorrect results or infinite loops. These limitations arise because the dynamic programming table in CYK relies on binary combinations of nonterminals to fill spans of length greater than 1, making mixed or longer right-hand sides incompatible without adaptation.

To apply CYK to a general CFG G, the grammar must first be converted to an equivalent CNF grammar G′ that generates the same language. To achieve polynomial size growth, the conversion follows an optimal order of steps: (1) Introduce a new start symbol S0 if the original start symbol appears on any right-hand side, adding the production S0 → S; (2) Binarize productions with more than two symbols on the right-hand side by iteratively introducing new nonterminals (e.g., for A → X1 X2 X3 X4, create A → X1 A1, A1 → X2 A2, A2 → X3 X4), handling mixed right-hand sides with terminals; (3) Remove ε-productions by computing the set of nullable nonterminals via fixed-point iteration and adding new productions for all non-empty subsets of right-hand-side symbols (now at most 3 per binary rule), while removing original ε-rules; (4) Eliminate unit productions by computing the transitive closure of unit chains and substituting direct non-unit productions; (5) Separate terminals from nonterminals in mixed productions by replacing each terminal a with a new nonterminal Aa and adding Aa → a. This process runs in O(|G|²) time overall and produces a grammar with O(|G|²) productions; suboptimal orders (e.g., ε-removal before binarization) can cause exponential growth. The conversion enlargement affects CYK's time complexity, raising it from O(n³) to O(|G| n³) with a larger constant factor, where n is the input string length.

As an alternative to full CNF conversion, a generalized CYK algorithm can parse general CFGs directly by extending the recurrence to match productions of arbitrary length and form, using precomputed sets for nullable symbols and unit chains to handle ε- and unit productions efficiently. This variant first transforms the grammar to binary normal form (allowing single terminals but requiring binary nonterminal combinations), computes nullable nonterminals and the inverse unit relation in O(|G|) time, then fills the DP table by applying all matching productions over sub-spans, yielding O(|G| n³) time with only linear grammar size increase (O(|G|)) and no need for terminal-specific nonterminals. However, it incurs higher constants from scanning longer productions during table filling compared to strict CNF.

The choice between CNF conversion and direct generalized CYK depends on the trade-off: conversion preprocessing takes O(|G|²) time but enables the optimized binary-only recurrence, suitable for small-to-medium grammars where the quadratic size increase is tolerable; direct methods are preferable for large grammars to avoid size explosion, despite slightly slower per-span computations, especially when repeated parsing with the same grammar is needed.

Probabilistic and Weighted Variants

The probabilistic variant of the CYK algorithm, often denoted as PCKY, extends the standard algorithm to handle probabilistic context-free grammars (PCFGs) in Chomsky normal form, where each production rule is assigned a probability. Instead of determining mere membership by set union, PCKY computes the probability of a nonterminal deriving a substring, addressing ambiguity by quantifying the likelihood of parses rather than just their existence. For Viterbi parsing, which finds the most probable parse tree, the table entries store the maximum probability, replacing union with maximization over split points and rules. The core recurrence for Viterbi PCKY is as follows: for a span from position i to j, the probability P(A → wi..j) is computed as

P(A → wi..j) = max over rules A → BC and splits k of [ P(A → BC) · P(B → wi..k) · P(C → wk+1..j) ]

where the maximum is taken over all rules A → BC and split points k between i and j. This formulation corresponds to the inside probabilities in the inside-outside algorithm, originally developed for training PCFGs via expectation-maximization, where Viterbi approximates the most likely derivation. For the total probability of a string (summing over all parses), the algorithm uses summation instead of maximization, yielding the inside probability α(i,j)(A) = Σk P(A → BC) · α(i,k)(B) · α(k+1,j)(C).

Weighted variants generalize PCKY further by operating over semirings, abstract algebraic structures that replace the operations of standard CYK (logical OR for union, AND for concatenation) with custom addition (⊕) and multiplication (⊗) operations. This allows the same dynamic programming table to compute diverse quantities, such as costs or counts, without altering the structure. For instance, the tropical semiring (ℝ ∪ {∞}, min, +, ∞, 0) computes minimum-cost parses by using minimization for ⊕ and addition for ⊗, applicable to optimization problems like finding the lowest-weight derivation. The probabilistic case uses the semiring (ℝ≥0, +, ×, 0, 1) for summing probabilities.

Numerical stability poses challenges in probabilistic and weighted CYK due to underflow from repeated multiplications of small probabilities, especially for long sentences. Solutions include working in log space, where products become sums of log-probabilities (e.g., log P(A → wi..j) = maxk [log P(A → BC) + log P(B → wi..k) + log P(C → wk+1..j)]), or scaling probabilities during computation to prevent overflow/underflow. Inside-outside variants also incorporate scaling factors for stability during parameter estimation.

These variants find key applications in speech and language processing, particularly speech recognition, where PCFGs model syntactic structure from acoustic inputs, and the inside-outside algorithm trains grammars on corpora for improved accuracy. In machine translation, probabilistic CYK enables efficient parsing of source sentences to guide rule-based or statistical transfer, handling ambiguity by selecting high-probability structures.
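A minimal sketch of the semiring-parameterized table fill, assuming the semiring is supplied as its additive identity zero plus two callables for ⊕ and ⊗, and that rule weights are already expressed in that semiring:

def semiring_cyk(tokens, unit_w, binary_w, start, zero, plus, times):
    # unit_w:   dict mapping (A, a)    to the weight of A -> a
    # binary_w: dict mapping (A, B, C) to the weight of A -> B C
    # (or, and) over booleans gives recognition; (max, *) gives Viterbi;
    # (+, *) gives total parse probability; (min, +) gives minimum cost.
    n = len(tokens)
    V = [[{} for _ in range(n)] for _ in range(n)]

    for s, tok in enumerate(tokens):
        for (A, a), w in unit_w.items():
            if a == tok:
                V[0][s][A] = plus(V[0][s].get(A, zero), w)

    for l in range(2, n + 1):
        for s in range(n - l + 1):
            for p in range(1, l):
                for (A, B, C), w in binary_w.items():
                    lw = V[p - 1][s].get(B, zero)
                    rw = V[l - p - 1][s + p].get(C, zero)
                    V[l - 1][s][A] = plus(V[l - 1][s].get(A, zero),
                                          times(times(w, lw), rw))

    return V[n - 1][0].get(start, zero)

# Viterbi instantiation:
# semiring_cyk(tokens, uw, bw, "S", 0.0, max, lambda x, y: x * y)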

Valiant's Generalization

In 1975, Leslie Valiant provided a significant generalization of the CYK algorithm by reformulating context-free language recognition as an algebraic computation over semirings, thereby extending its scope to a wider range of problems involving matrix operations. This approach represents the grammar using matrices indexed by non-terminals, where the entry X(A,B) is defined as the semiring sum over all production rules of the form A → wB (with w a terminal matching the input symbol) for unary contributions, while binary rules A → BC are captured through matrix multiplication, which combines substructures B and C. For an input string of length n, the algorithm computes the n-th power of this rule matrix X to determine if the start symbol derives the string.

The core computation involves a chain of multiplications on upper-triangular matrices to evaluate Xⁿ, yielding a running time of O(n³ log n) using standard algorithms, though the structure lends itself to high parallelizability with depth O(log n). Valiant's method achieves sub-cubic performance, such as O(n^2.81), when incorporating fast matrix multiplication techniques like Strassen's algorithm.

This formulation unifies context-free parsing with broader algebraic paradigms, allowing application across semirings, including the Boolean semiring for simple recognition and weighted semirings for variants like probabilistic parsing. For grammars in Chomsky normal form, it is computationally equivalent to the standard CYK algorithm but offers greater generality, enabling solutions to non-linguistic problems such as all-pairs shortest paths via the min-plus semiring. Historically, Valiant's work deepened the theoretical foundations of dynamic programming in formal languages and inspired advancements in efficient algorithms for algebraic problems.
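To make the matrix view concrete without the fast-multiplication machinery, the sketch below fills an endpoint-indexed upper-triangular matrix b, where b[i][j] holds the nonterminals deriving tokens[i:j]; the (i, k, j) loop nest has exactly the shape of a matrix product over sets, which is the part Valiant batches into Boolean matrix multiplications. The rule representation is the one assumed in the earlier sketches, and this naive version runs in cubic time, not Valiant's sub-cubic bound.

def matrix_form_recognize(tokens, unit_rules, binary_rules, start):
    n = len(tokens)
    b = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):                # spans of length 1
        b[i][i + 1] = {A for A, a in unit_rules if a == tok}

    for span in range(2, n + 1):
        for i in range(n + 1 - span):
            j = i + span
            for k in range(i + 1, j):               # "inner product" index
                for A, B, C in binary_rules:
                    if B in b[i][k] and C in b[k][j]:
                        b[i][j].add(A)

    return start in b[0][n]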
