Canonical LR parser
A canonical LR parser (also called an LR(1) parser) is a type of bottom-up parsing algorithm used in computer science to analyze and process programming languages. It is based on the LR parsing technique, which stands for "left-to-right, rightmost derivation in reverse."
Formally, a canonical LR parser is an LR(k) parser for k=1, i.e. with a single lookahead terminal. The special attribute of this parser is that any LR(k) grammar with k>1 can be transformed into an LR(1) grammar.[1] However, back-substitutions are required to reduce k, and as back-substitutions accumulate, the grammar can quickly become large, repetitive and hard to understand. LR(k) parsers can handle all deterministic context-free languages.[1] In the past, canonical LR(k) parsers were often avoided because of their huge memory requirements, in favor of less powerful alternatives such as the LALR and LL(1) parsers. Recently, however, "minimal LR(1) parsers", whose space requirements are close to those of LALR parsers,[citation needed] have been offered by several parser generators.
Like most parsers, the LR(1) parser is automatically generated by compiler-compilers like GNU Bison, MSTA, Menhir,[2] HYACC,[3] and LRSTAR.[4]
History
In 1965 Donald Knuth invented the LR(k) parser (Left to right, Rightmost derivation parser), a type of shift-reduce parser, as a generalization of existing precedence parsers. This parser can recognize all deterministic context-free languages and can produce both left and right derivations of statements encountered in the input file. Knuth proved that it reaches its maximum language recognition power for k=1 and provided a method for transforming LR(k), k > 1 grammars into LR(1) grammars.[1]
Canonical LR(1) parsers have the practical disadvantage of having enormous memory requirements for their internal parser-table representation. In 1969, Frank DeRemer suggested two simplified versions of the LR parser called LALR and SLR. These parsers require much less memory than canonical LR(1) parsers, but have slightly less language-recognition power.[5] LALR(1) parsers have been the most common implementations of the LR parser.
However, a new type of LR(1) parser, which some call a "minimal LR(1) parser", was introduced in 1977 by David Pager,[6] who showed that LR(1) parsers can be created whose memory requirements rival those of LALR(1) parsers. Recently,[when?] some parser generators have begun offering minimal LR(1) parsers, which solve not only the memory-requirement problem but also the mysterious-conflict problem inherent in LALR(1) parser generators.[citation needed] In addition, minimal LR(1) parsers can use shift-reduce actions, which makes them faster than canonical LR(1) parsers.
Overview
The LR(1) parser is a deterministic automaton and as such its operation is based on static state transition tables. These codify the grammar of the language it recognizes and are typically called "parsing tables".
The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the LR(0) parser, represent grammar rules of the form
- A1 → A B
which means that if we have input A followed by B, then we will reduce the pair to A1 regardless of what follows. After parameterizing such a rule with a lookahead we have:
- A1 → A B, a
which means that the reduction will now be performed only if the lookahead terminal is a. This allows for richer languages where a simple rule can have different meanings depending on the lookahead context. For example, in an LR(1) grammar, all of the following rules perform a different reduction in spite of being based on the same state sequence.
- A1 → A B, a
- A2 → A B, b
- A3 → A B, c
- A4 → A B, d
The same would not be true if a lookahead terminal was not being taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example,
- E1 → B C, d
can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:
- A1 → A B, a
- A1 → A B, b
- A1 → A B, c
- E1 → A B, d
In this case A B will be reduced to A1 when the lookahead is a, b or c and an error will be reported when the lookahead is d.
The lookahead can also be helpful in deciding when to reduce a rule: a reduction by a specific rule can be avoided if the lookahead is not valid for it, which usually means that the current state should be combined with the following state rather than with the previous one. That means that in the following example
- Input sequence: A B C
- Rules:
- A1 → A B
- A2 → B C
the sequence can be reduced to
- A A2
instead of
- A1 C
if the lookahead after the parser reached state B was not acceptable, i.e. no transition rule existed for it. Reductions can be produced directly from a terminal, as in
- X → y
which allows for multiple sequences to appear.
LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. as a sequence of two states with a specific lookahead. That makes simple rules such as
- X → y
require a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears when implementing non-lookahead rules such as
- A1 → A B
where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.[6]
Constructing LR(1) parsing tables
LR(1) parsing tables are constructed in the same way as LR(0) parsing tables, with the modification that each item contains a lookahead terminal. This means that, contrary to LR(0) parsers, a different action may be executed if the item to process is followed by a different terminal.
Parser items
Starting from the production rules of a language, at first the item sets for this language have to be determined. In plain words, an item set is the list of production rules of which the currently processed symbol might be part. An item set has a one-to-one correspondence to a parser state, while the items within the set, together with the next symbol, are used to decide which state transitions and which parser action are to be applied. Each item contains a marker to note at which point the currently processed symbol appears in the rule the item represents. For LR(1) parsers, each item is specific to a lookahead terminal, thus the lookahead terminal has also been noted inside each item.
For example, assume a language consisting of the terminal symbols 'n', '+', '(', ')', the nonterminals 'E', 'T', the starting rule 'S' and the following production rules:
- S → E
- E → T
- E → ( E )
- T → n
- T → + T
- T → T + n
Item sets will be generated by analogy with the procedure for LR(0) parsers. The item set 0, which represents the initial state, will be created from the starting rule:
- [S → • E, $]
The dot '•' denotes the marker of the current parsing position within this rule. The expected lookahead terminal to apply this rule is noted after the comma. The '$' sign is used to denote that 'end of input' is expected, as is the case for the starting rule.
This is not the complete item set 0, though. Each item set must be 'closed', which means all production rules for each nonterminal following a '•' have to be recursively included into the item set until all of those nonterminals are dealt with. The resulting item set is called the closure of the item set we began with.
For LR(1), an item has to be included for each production rule and for each possible lookahead terminal following the rule. For more complex languages this usually results in very large item sets, which is the reason for the large memory requirements of LR(1) parsers.
In our example, the starting symbol requires the nonterminal 'E' which in turn requires 'T', thus all production rules will appear in item set 0. At first, we ignore the problem of finding the lookaheads and just look at the case of an LR(0) parser, whose items do not contain lookahead terminals. So the item set 0 (without lookaheads) will look like this (a small Python sketch of this closure step follows the list):
- [S → • E]
- [E → • T]
- [E → • ( E )]
- [T → • n]
- [T → • + T]
- [T → • T + n]
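A minimal Python sketch of this closure step, using tuples (lhs, rhs, dot) for LR(0) items and the example grammar above (the representation is chosen purely for illustration):
GRAMMAR = [("S", ("E",)), ("E", ("T",)), ("E", ("(", "E", ")")),
           ("T", ("n",)), ("T", ("+", "T")), ("T", ("T", "+", "n"))]
NONTERMINALS = {"S", "E", "T"}
def closure(items):
    # add [B → • γ] for every nonterminal B that appears directly after a dot
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for b, gamma in GRAMMAR:
                    if b == rhs[dot] and (b, gamma, 0) not in items:
                        items.add((b, gamma, 0))
                        changed = True
    return items
# item set 0 is the closure of the start item [S → • E]; printing it reproduces the six items above
for lhs, rhs, dot in sorted(closure({("S", ("E",), 0)})):
    print("[" + lhs + " → " + " ".join(rhs[:dot] + ("•",) + rhs[dot:]) + "]")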
FIRST and FOLLOW sets
To determine lookahead terminals, so-called FIRST and FOLLOW sets are used. FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A. FOLLOW(I) of an item I [A → α • B β, x] is the set of terminals that can appear immediately after nonterminal B, where α, β are arbitrary symbol strings, and x is an arbitrary lookahead terminal. FOLLOW(k,B) of an item set k and a nonterminal B is the union of the follow sets of all items in k where '•' is followed by B. The FIRST sets can be determined directly from the closures of all nonterminals in the language, while the FOLLOW sets are determined from the items under usage of the FIRST sets.
In our example, as one can verify from the full list of item sets below, the first sets are:
- FIRST(S) = { n, '+', '(' }
- FIRST(E) = { n, '+', '(' }
- FIRST(T) = { n, '+' }
Determining lookahead terminals
Within item set 0 the follow sets can be found to be:
- FOLLOW(0,S) = { $ }
- FOLLOW(0,E) = { $ }
- FOLLOW(0,T) = { $, '+' }
From this, the full item set 0 for an LR(1) parser can be created by making, for each item in the LR(0) item set, one copy for each terminal in the follow set of the LHS nonterminal. Each element of the follow set may be a valid lookahead terminal:
- [S → • E, $]
- [E → • T, $]
- [E → • ( E ), $]
- [T → • n, $]
- [T → • n, +]
- [T → • + T, $]
- [T → • + T, +]
- [T → • T + n, $]
- [T → • T + n, +]
Creating new item sets
The rest of the item sets can be created by the following algorithm (a sketch of the loop follows the list):
- 1. For each terminal and nonterminal symbol A appearing after a '•' in each already existing item set k, create a new item set m by adding to m all the rules of k where '•' is followed by A, but only if m will not be the same as an already existing item set after step 3.
- 2. Shift all the '•'s for each rule in the new item set one symbol to the right.
- 3. Create the closure of the new item set.
- 4. Repeat from step 1 for all newly created item sets, until no more new sets appear.
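The loop structure of steps 1 to 4 can be sketched in Python as follows (an illustrative sketch only: LR(1) items are assumed to be tuples (lhs, rhs, dot, lookahead), and closure is a function implementing the closure step outlined above):
def build_item_sets(start_items, closure):
    item_sets = [closure(frozenset(start_items))]   # item set 0
    work = [0]
    while work:                                      # step 4: repeat until nothing new appears
        k = work.pop()
        for symbol in {rhs[dot] for (_, rhs, dot, _) in item_sets[k] if dot < len(rhs)}:
            # steps 1 and 2: take the items of k with '•' before symbol and shift the dot
            kernel = {(lhs, rhs, dot + 1, la)
                      for (lhs, rhs, dot, la) in item_sets[k]
                      if dot < len(rhs) and rhs[dot] == symbol}
            new_set = closure(frozenset(kernel))     # step 3: close the new item set
            if new_set not in item_sets:
                item_sets.append(new_set)
                work.append(len(item_sets) - 1)
    return item_sets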
In the example we get 5 more sets from item set 0, item set 1 for nonterminal E, item set 2 for nonterminal T, item set 3 for terminal n, item set 4 for terminal '+' and item set 5 for '('.
Item set 1 (E):
- [S → E •, $]
Item set 2 (T):
- [E → T •, $]
- [T → T • + n, $]
- [T → T • + n, +]
Item set 3 (n):
- [T → n •, $]
- [T → n •, +]
Item set 4 ('+'):
- [T → + • T, $]
- [T → + • T, +]
- [T → • n, $]
- [T → • n, +]
- [T → • + T, $]
- [T → • + T, +]
- [T → • T + n, $]
- [T → • T + n, +]
Item set 5 ('('):
- [E → ( • E ), $]
- [E → • T, )]
- [E → • ( E ), )]
- [T → • n, )]
- [T → • n, +]
- [T → • + T, )]
- [T → • + T, +]
- [T → • T + n, )]
- [T → • T + n, +]
From item sets 2, 4 and 5 several more item sets will be produced. The complete list is quite long and thus will not be stated here. Note that this particular grammar is not LR(1), as two of the item sets (not listed above) contain shift/reduce conflicts. Detailed LR(k) treatment of this grammar can e.g. be found in [1].
Goto
The lookahead of an LR(1) item is used directly only when considering reduce actions (i.e., when the • marker is at the right end).
The core of an LR(1) item [S → a A • B e, c] is the LR(0) item S → a A • B e. Different LR(1) items may share the same core.
For example, in item set 2
- [E → T •, $]
- [T → T • + n, $]
- [T → T • + n, +]
the parser is required to perform the reduction [E → T] if the next symbol is '$', but to do a shift if the next symbol is '+'. Note that an LR(0) parser would not be able to make this decision, as it only considers the core of the items, and would thus report a shift/reduce conflict.
A state containing [A → α • X β, a] will move to a state containing [A → α X • β, a] with label X.
Every state has transitions according to Goto.
Shift actions
If [A → α • b β, a] is in state Ik and Ik moves to state Im with label b, then we add the action
- action[Ik, b] = "shift m"
Reduce actions
If [A → α •, a] is in state Ik, then we add the action
- action[Ik, a] = "reduce A → α"
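Putting the shift and reduce rules together, a table-filling pass might look like the following Python sketch (illustrative only; the item representation, the goto_table mapping of (state, symbol) pairs to successor states, and the entry tuples are assumptions, and conflict checking is omitted):
def build_action_table(item_sets, goto_table, terminals):
    # item_sets: list of sets of (lhs, rhs, dot, lookahead) tuples
    # goto_table: maps (item-set index, grammar symbol) to the successor item-set index
    action = {}
    for k, items in enumerate(item_sets):
        for lhs, rhs, dot, lookahead in items:
            if dot < len(rhs) and rhs[dot] in terminals:      # [A → α • b β, a]: shift
                action[(k, rhs[dot])] = ("shift", goto_table[(k, rhs[dot])])
            elif dot == len(rhs):                             # [A → α •, a]: reduce on the lookahead
                action[(k, lookahead)] = ("reduce", lhs, rhs)
    return action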
References
[edit]- ^ a b c Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control. 8 (6): 607–639. doi:10.1016/S0019-9958(65)90426-2.
- ^ "What is Menhir?". INRIA, CRISTAL project. Retrieved 29 June 2012.
- ^ "HYACC, minimal LR(1) parser generator".
- ^ "LRSTAR parser generator".
- ^ Franklin L. DeRemer (1969). "Practical Translators for LR(k) languages" (PDF). MIT, PhD Dissertation. Archived from the original (PDF) on April 5, 2012.
- ^ a b Pager, D. (1977), "A Practical General Method for Constructing LR(k) Parsers", Acta Informatica 7, pp. 249–268
External links
- Practical LR(k) Parser Construction, HTML page, David Tribble
- First & Follow Sets Construction in Python (Narayana Chikkam)
Canonical LR parser
Overview
Definition and Characteristics
The canonical LR(1) parser is a type of deterministic bottom-up parser that employs a finite-state automaton to recognize viable prefixes of right-sentential forms in context-free grammars that are LR(1). It constructs a parsing table from sets of LR(1) items, where each item is a production with a dot indicating the parsing position and a lookahead terminal (or endmarker) specifying the context for reductions. This table drives shift and reduce actions based on the current state and the next input symbol, ensuring unambiguous parsing for grammars where rightmost derivations can be determined with one lookahead token. The approach was originally formulated by Donald E. Knuth as part of the LR(k) parsing framework for k=1.[2][3]
Key characteristics of the canonical LR(1) parser include its fully deterministic shift-reduce mechanism, which resolves actions using precise lookahead information to avoid shift-reduce or reduce-reduce conflicts in LR(1) grammars. Unlike simpler variants, it achieves the full expressive power of LR(1) languages, encompassing all deterministic context-free languages, without requiring grammar transformations or approximations. However, the resulting parsing table can exhibit exponential growth in size in the worst case, as the number of states corresponds to distinct subsets of the possible LR(1) items, whose count is governed by the set of productions and the set of terminals used as lookaheads; in practice, this often manifests as growth exponential in the grammar's complexity.[3][4][5]
The canonical LR(1) parser serves as a foundational tool for syntax analysis in compiler design, particularly for programming languages with intricate grammars that demand high precision and efficiency in linear time relative to input length. Its structure involves a stack for states and symbols, an input buffer, and table lookups for actions. The following pseudocode outlines the core parsing loop:
let stack = [0]                    // initial state
let input = w $                    // input string with endmarker
while true:
    let s = stack.top()
    let a = input.peek()           // next symbol or $
    let act = ACTION[s][a]
    if act == shift j:
        stack.push(a)
        stack.push(j)
        input.advance()
    else if act == reduce A → β:   // |β| = r symbols
        for i = 1 to 2r: stack.pop()
        let s' = stack.top()
        stack.push(A)
        stack.push(GOTO[s'][A])
    else if act == accept:
        return "successful parse"
    else:
        error("syntax error")
Relation to LR Parsing Family
The canonical LR(1) parser represents the most complete and powerful form within the LR parsing family, as it constructs parsing tables using full LR(1) items that incorporate precise lookahead information for every production, enabling it to recognize all deterministic context-free languages without conflicts when the grammar is LR(1).[2] In contrast, simpler variants like SLR(1) approximate lookahead sets by relying exclusively on FOLLOW sets derived from the grammar, which limits their ability to handle certain LR(1) grammars that require more nuanced lookahead propagation.[7] LALR(1) parsers, meanwhile, achieve smaller table sizes by merging canonical LR(1) states that share the same core LR(0) items while unioning their lookahead sets, but this merging can introduce reduce/reduce conflicts in grammars that are LR(1) but not LALR(1). This hierarchy reflects a fundamental trade-off in the LR family: greater parsing power versus computational efficiency and table compactness.[8]
Canonical LR(1) tables contain one state per distinct set of LR(1) items, often resulting in significantly more states (potentially orders of magnitude more) than LALR(1) tables, which use only the number of distinct LR(0) item sets. However, canonical LR(1) avoids the conflicts that may arise in LALR(1) due to lookahead unioning, ensuring unambiguous parsing for the full class of LR(1) grammars.[9]
Canonical LR(1) is particularly suited for applications requiring maximum precision in lookahead handling, such as parsing complex grammars in compiler tools where table size is manageable or optimizable, including extensions in generator systems like Bison that support full LR(1) modes for conflict-free results.[10] While SLR(1) and LALR(1) suffice for many practical languages due to their efficiency, and are often used in Yacc/Bison for production compilers, canonical LR(1) provides the theoretical benchmark for the family's expressive power, especially when FOLLOW sets alone prove insufficient for conflict resolution.[9]
Core Concepts
LR(1) Items and Sets
In the canonical LR(1) parser, states are represented using LR(1) items, which capture partial parses along with anticipated lookahead symbols to resolve parsing actions deterministically. An LR(1) item is formally written as [A → α • β, a], where A → αβ is a production in the context-free grammar, α is the sequence of symbols already processed (a viable prefix), β is the remaining sequence to process, the bullet marks the current position within the right-hand side, and a is a single terminal symbol serving as the lookahead (or the end-of-input marker $). This notation extends the simpler LR(0) item by incorporating the lookahead to disambiguate reduce actions when multiple productions might complete at the same position.[2]
LR(1) item sets, also called canonical collections of items, group valid LR(1) items that represent equivalence classes of viable prefixes, that is, prefixes of right-hand sides that can legitimately appear on the parser's stack during a left-to-right scan. Each set models a parser state as a collection of possible partial derivations, partitioned into kernel items (those with the bullet not at the start of the production, carried forward from prior states) and closure items (added to account for nonterminals immediately following the bullet in kernel items). These sets ensure the parser recognizes all contextually valid continuations without ambiguity.[2]
In the parsing process, each LR(1) item set corresponds to a state in the underlying deterministic finite automaton (DFA), where the lookahead terminals dictate precise shift or reduce decisions based on the next input symbol. This structure allows the canonical LR(1) parser to handle a broad class of deterministic context-free grammars by maintaining lookahead information to avoid conflicts in action selection.[2]
For example, consider a grammar containing the production A → aA. A sample LR(1) item is [A → • aA, $], indicating that the parser has recognized the start of an A production and anticipates shifting the terminal a next, with end-of-input as the lookahead. Since the bullet precedes a terminal (a), the closure of this item consists solely of itself, with no additional items added for nonterminal expansions. The closure operation briefly referenced here expands sets for nonterminals but yields a singleton set in this case, representing a state ready for a shift on a.[2]
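A minimal Python rendering of this item structure (an illustrative sketch; the class and field names are not taken from any particular tool):
from dataclasses import dataclass
from typing import Optional, Tuple
@dataclass(frozen=True)
class LR1Item:
    lhs: str                 # A
    rhs: Tuple[str, ...]     # the full right-hand side, i.e. the symbols of α followed by β
    dot: int                 # index of the bullet within rhs
    lookahead: str           # a single terminal, or "$" for end of input
    def next_symbol(self) -> Optional[str]:
        # symbol immediately after the bullet, or None if the item is complete
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None
    def advanced(self) -> "LR1Item":
        # the same item with the bullet moved one symbol to the right (used by goto)
        return LR1Item(self.lhs, self.rhs, self.dot + 1, self.lookahead)
# the sample item above, [A → • aA, $]:
item = LR1Item("A", ("a", "A"), 0, "$")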
FIRST and FOLLOW Sets
In the context of LR parsing, including the canonical LR(1) variant, FIRST and FOLLOW sets provide essential information for predicting possible terminals during derivation and for determining valid lookaheads in parser construction. The FIRST set for a grammar symbol X, denoted FIRST(X), is the set of terminals that can appear as the first symbol in some string derived from X. If X is a terminal a, then FIRST(a) = {a}. If X is a nonterminal with productions X → Y₁Y₂…Yₖ, the computation involves iteratively adding terminals from FIRST(Y₁) to FIRST(X), and if ε ∈ FIRST(Y₁), continuing with subsequent symbols until a non-ε prefix is found; ε is included in FIRST(X) only if X can derive the empty string (i.e., X is nullable). This recursive union ensures FIRST(X) captures all possible starting terminals, excluding ε in the final set unless explicitly needed for nullability checks.[11]
The FOLLOW set for a nonterminal A, denoted FOLLOW(A), consists of terminals that can immediately follow A in any valid derivation of the start symbol. Computation begins by placing the end-of-input marker $ in FOLLOW(S) for the start symbol S. For each production A → αBβ, where B is a nonterminal, FIRST(β) excluding ε is added to FOLLOW(B); if β is nullable (ε ∈ FIRST(β)), then FOLLOW(A) is also added to FOLLOW(B). These rules propagate terminals backwards through the grammar via an iterative fixed-point algorithm, repeating until no further changes occur, ensuring all possible following terminals are captured.[11][12]
Both sets are computed using iterative propagation to reach a fixed point, often implemented with queues or simple loops over productions. For FIRST sets, initialize FIRST(a) = {a} for terminals and empty sets otherwise; then, for each production X → Y₁…Yₖ, update FIRST(X) by unioning FIRST(Yᵢ) where prior symbols are nullable, marking nullability as needed, and iterate until stabilization. For FOLLOW sets, after FIRST computation, initialize FOLLOW(S) = {$}; then, for each production A → αBβ, enqueue updates to FOLLOW(B) with FIRST(β) \ {ε} and, if β is nullable, with FOLLOW(A), propagating via a queue until no additions occur. These algorithms ensure completeness in O(n²) time for n nonterminals, assuming the grammar size is polynomial.[13][12]
Consider the following augmented grammar for illustration; a sketch of the fixed-point computation follows the resulting tables:
- S' → S
- S → A x | B y | z
- A → 1 C B | 2 B
- B → 3 B | C
- C → 4 | ε
| Nonterminal | FIRST Set |
|---|---|
| S | {1, 2, 3, 4, y, z} |
| A | {1, 2} |
| B | {3, 4, ε} |
| C | {4, ε} |
| Nonterminal | FOLLOW Set |
|---|---|
| S | {$} |
| A | {x} |
| B | {x, y} |
| C | {3, 4, x, y} |
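A compact fixed-point implementation of these rules in Python (a sketch encoding the example grammar above; ε is written as an empty production and nullability is tracked separately, so the printed FIRST sets omit ε):
GRAMMAR = [("S'", ["S"]), ("S", ["A", "x"]), ("S", ["B", "y"]), ("S", ["z"]),
           ("A", ["1", "C", "B"]), ("A", ["2", "B"]),
           ("B", ["3", "B"]), ("B", ["C"]),
           ("C", ["4"]), ("C", [])]
NT = {lhs for lhs, _ in GRAMMAR}
first = {n: set() for n in NT}
follow = {n: set() for n in NT}
follow["S'"].add("$")
nullable = set()
def first_of(seq):
    # FIRST of a symbol string, with ε ("") propagation through nullable nonterminals
    out, all_nullable = set(), True
    for sym in seq:
        out |= first[sym] if sym in NT else {sym}
        if sym in NT and sym in nullable:
            continue
        all_nullable = False
        break
    if all_nullable:
        out.add("")
    return out
changed = True
while changed:  # iterate the propagation rules to a fixed point
    changed = False
    for lhs, rhs in GRAMMAR:
        f = first_of(rhs)
        if "" in f and lhs not in nullable:
            nullable.add(lhs); changed = True
        if not (f - {""}) <= first[lhs]:
            first[lhs] |= f - {""}; changed = True
        for i, sym in enumerate(rhs):
            if sym in NT:
                tail = first_of(rhs[i + 1:])
                add = (tail - {""}) | (follow[lhs] if "" in tail else set())
                if not add <= follow[sym]:
                    follow[sym] |= add; changed = True
print(first)   # e.g. first["S"] == {"1", "2", "3", "4", "y", "z"}
print(follow)  # e.g. follow["C"] == {"3", "4", "x", "y"}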
Table Construction Algorithm
Closure Operation and Item Set Creation
The closure operation is fundamental to constructing the canonical collection of LR(1) item sets in a canonical LR parser, as it expands a given set of items to include all possible nonterminal expansions that could follow the current position in a derivation. Formally, given a set I of LR(1) items, the closure CL(I) begins with the items in I and iteratively adds new items until saturation: for each item [A → α • B β, a] in CL(I), where B is a nonterminal and B → γ is a production in the grammar, include all items [B → • γ, b] such that b ∈ FIRST(β a), where FIRST is the standard first-set function for computing possible starting terminals of strings (with ε-propagation if β is nullable). This ensures that the item set captures all valid continuations from the current viable prefix, with lookaheads refined to prevent spurious conflicts.[14][15]
The initial item set I₀ kickstarts the process and is defined as the closure of the singleton set containing [S' → • S, $], where S' is an augmented start symbol with the production S' → S added to the grammar and $ is the end-of-input lookahead that guarantees proper termination.[10][15]
The full canonical collection of item sets, denoted C, comprises all distinct sets reachable from I₀ through iterative application of closure and the goto operation (though goto details are deferred). Starting with C = {I₀}, for each set J in C and each grammar symbol X, compute the goto set as the items obtained by advancing the dot over X in J's items, then take its closure; if this yields a new set, add it to C. The resulting collection C forms the states of a deterministic finite automaton (DFA), where transitions on symbols X lead from state J to goto(J, X), enabling recognition of viable prefixes in the grammar. FIRST sets are used briefly here solely for lookahead computation in closure, as detailed conceptually elsewhere.[14][15]
The time complexity for constructing this canonical collection is O(|G|³), where |G| measures the grammar size (typically the number of productions times the vocabulary size), arising from the iterative closure and goto computations over potentially up to O(|G|²) states, each processed in O(|G|) time.[10]
To illustrate the closure operation, consider a simple arithmetic grammar with productions:
- S' → E
- E → E + T
- E → T
- T → T * a
- T → a
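Working this out by hand from the closure rule above (a reconstruction for illustration, using the compact notation a/b for items that differ only in their lookahead), the initial item set I₀ = CL({[S' → • E, $]}) is:
- [S' → • E, $]
- [E → • E + T, $/+]
- [E → • T, $/+]
- [T → • T * a, $/+/*]
- [T → • a, $/+/*]
The lookahead $ on the E items comes from the kernel item, + is contributed by FIRST(+ T $) when E is expanded inside [E → • E + T, …], and * is contributed by FIRST(* a …) when T is expanded inside [T → • T * a, …].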
Goto Function and State Transitions
The goto function serves as the primary mechanism for defining transitions between sets of LR(1) items in the construction of the canonical LR parser's state machine. It computes the next set of items by advancing the parse position over a specific grammar symbol in the current set, thereby modeling how the parser progresses through viable prefixes of right-sentential forms. This function was introduced by Knuth as part of the foundational theory for LR(k) parsing, enabling the systematic generation of parser states that recognize deterministic context-free languages.[16]
Formally, given a set I of LR(1) items and a grammar symbol X (either a terminal or nonterminal), the goto function is defined as:
Goto(I, X) = CL({ [A → α X • β, a] : [A → α • X β, a] ∈ I })
Here, the set inside the closure consists of all items obtained by moving the dot across X in those items of I that have the dot immediately before X, and the closure operation then adds any new items predicted from productions of the nonterminals that follow the new dot position in this kernel set. This definition ensures that the resulting set captures all possible parsing configurations reachable after processing the symbol X.[16][17]
To build the full collection of states for the parser, the process begins with the initial state I₀, which is the closure of the single item [S' → • S, $], where S' is the augmented start symbol and $ denotes the end-of-input marker. The goto function is then applied iteratively: for each state I and each grammar symbol X, compute Goto(I, X); if this yields a new set not previously encountered, add it as a new state and repeat the process from there. This exhaustive application continues until no additional states are generated, resulting in a finite set of states that form the parser's automaton.[16][17]
In this automaton, each state represents a unique set of LR(1) items, corresponding to the possible positions in the parsing of right-sentential forms, and the goto function defines the labeled edges between states. Transitions on terminal symbols typically lead to shift states, where the parser advances by consuming input, while transitions on nonterminals facilitate goto actions that expand the parse tree without consuming input. By ensuring deterministic transitions (one unique next state per symbol from each current state), the goto function guarantees that the parser can handle any viable prefix unambiguously, forming the deterministic finite automaton essential for efficient bottom-up parsing.[16][17]
As an example, consider again the expression grammar used for the closure illustration above; a worked goto computation for it follows.
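Continuing the hand-worked example from the closure subsection (a reconstruction for illustration): the items of I₀ with the dot immediately before E are [S' → • E, $] and [E → • E + T, $/+]. Advancing the dot and closing the result (the closure adds nothing, since the dot now sits at the end of an item or before the terminal +) gives Goto(I₀, E):
- [S' → E •, $]
- [E → E • + T, $/+]
In the finished automaton this set is the state reached from state 0 on the nonterminal E; its item [S' → E •, $] is the one that will later produce the accept action on lookahead $.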
Lookahead Terminal Determination
In canonical LR(1) parsing, lookahead terminals are determined through a propagation mechanism that ensures each LR(1) item, denoted [A → α • β, a], where A → αβ is a production, the dot indicates the parsing position, and a is the lookahead terminal, carries precise predictions of the next input symbol. This process begins with kernel items, which have the dot not at the left end and carry initial lookaheads from the goto transition, and extends to non-kernel (closure) items via systematic propagation. The rules for propagation are as follows: for a kernel item [A → α • B β, a], where B is a nonterminal, each production B → γ generates a new item [B → • γ, b], with the lookahead set computed as the FIRST set of the string β a; this captures the possible first terminals derivable from β followed by a. Reduce items, where the dot is at the right end ([A → α •, a]), receive spontaneous lookaheads directly from the parent context without further propagation, as they signal completion based on the expected following terminal.[14]
The algorithm for lookahead determination integrates propagation into the closure operation on item sets, achieving a fixed point through iteration. Starting with a set of kernel items, the closure adds all predicted items as described, enqueueing them for processing; for each new item [B → • γ, b], it recursively applies the propagation rule to generate and union lookaheads for child items, continuing until no additional items or lookahead updates occur. Core lookaheads originate from the kernel (e.g., the starting lookahead $ for the augmented grammar item [S' → • S, $]), while propagated lookaheads accumulate from multiple paths, forming sets that may split across items to reflect contextual distinctions. This full propagation distinguishes canonical LR(1) from approximations like SLR(1), where reduce lookaheads rely solely on static FOLLOW sets, potentially introducing spurious conflicts; in contrast, LR(1)'s dynamic, path-specific lookaheads ensure exactness, minimizing conflicts for LR(1) grammars.[14][18]
To illustrate, consider a simplified expression grammar with productions E → E + T, E → ( E ) and T → id, with augmented start S' → E. From the kernel item [E → E + • T, $] (lookahead $ because the surrounding expression is followed by end of input), closure predicts [T → • id, $], whereas the corresponding kernel inside parentheses, [E → E + • T, )], predicts [T → • id, )]. The resulting complete item [T → id •, b] therefore reduces only on its own lookahead, b = $ in one context and b = ) in the other, while a lookahead of + in a state containing [E → E • + T, b] triggers a shift instead. This context-specific split keeps shift and reduce decisions unambiguous without approximating via FOLLOW sets.[14][18]
Parsing Table Generation
Shift and Goto Entries
In the canonical LR parser, the parsing table is divided into the ACTION part, which handles terminals including the endmarker $, and the GOTO part, which handles nonterminals. The rows of both tables correspond to the states, each representing a set of LR(1) items derived from the canonical collection. The ACTION columns are indexed by terminals, while GOTO columns are indexed by nonterminals. This structure encodes the deterministic finite automaton that recognizes viable prefixes of right-sentential forms, guiding the parser's decisions during input processing.[19][20]
Shift entries populate the ACTION table as follows: for a state i with corresponding item set Iᵢ and a terminal a, if the goto function yields Goto(Iᵢ, a) = Iⱼ, then set action[i, a] = sj, where sj denotes a shift to state j. This entry instructs the parser to shift the input symbol a onto the parse stack and transition to state j. The accept entry is a special case of the ACTION table: if Iᵢ contains the item [S' → S •, $], where S' is the augmented start symbol and $ is the endmarker, then set action[i, $] = accept. This signals successful parsing of the input.[19][2]
Goto entries fill the GOTO table: for state i with item set Iᵢ and a nonterminal A, if Goto(Iᵢ, A) = Iⱼ, then set goto[i, A] = j. These entries are consulted after a reduction action, determining the next state by pushing the reduced nonterminal onto the stack. The goto function results, computed during table construction, directly provide the indices for both shift and goto entries, ensuring the table reflects all valid transitions in the automaton.[19][2]
To illustrate, consider the ACTION and GOTO tables for the simple grammar with augmented start production S' → E and E → id, using states 0 through 2:
| State | ACTION: id | ACTION: $ | GOTO: E |
|---|---|---|---|
| 0 | s1 | | 2 |
| 1 | | r(E → id) | |
| 2 | | accept | |
Reduce Entries and Conflicts
In the canonical LR(1) parsing table, reduce entries are constructed for states that contain complete LR(1) items, which are productions where the dot appears after the entire right-hand side, denoted as [A → α •, b]. For each such item in state i, the action table is updated by setting action[i, b] = r_j (reduce using production j, i.e. A → α) for every terminal b in the lookahead set associated with that item. This process ensures reductions are context-sensitive, applying only when the upcoming input symbol matches a valid lookahead, thereby supporting deterministic parsing for LR(1) grammars.[3]
The lookaheads for these complete items are precisely computed during the automaton construction, incorporating propagation from predecessor items to avoid the overgeneralization seen in simpler variants like SLR(1). Shift-reduce conflicts occur when a state contains both a complete item with lookahead a (prompting a reduce) and a transition out of the state on that same terminal a (prompting a shift), potentially leading to nondeterminism. Reduce-reduce conflicts arise if two or more complete items in the same state share a common lookahead terminal, requiring a choice between multiple reductions. In the canonical LR(1) method, these conflicts are inherently resolved for LR(1) grammars because the item sets and lookaheads are refined to partition contexts, ensuring disjoint actions per input symbol per state.[2][3]
If table construction reveals conflicts, the grammar is not LR(1), and remediation typically involves rewriting it—such as factoring common prefixes, adjusting associativity, or introducing precedence declarations—to restore determinism without altering the language. For any terminal without a shift, reduce, or accept action in a state, the corresponding entry is marked as "error" to trigger syntax error handling during parsing.[3]
A representative example appears in the augmented expression grammar with productions like S' → E, E → E + T | T and T → id, where a state (say, Iₖ) might contain the complete item [T → id •, + / $]. Here, the reduce action for T → id is entered in the action table for the terminals + and $, reflecting lookaheads propagated from contexts expecting addition or end-of-input. This setup avoids conflicts by limiting the reduce to symbols that unambiguously follow identifiers in valid derivations.[21]
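To make the conflict check concrete, a table-filling helper can refuse to overwrite a cell with a different entry; the following Python sketch is illustrative (the function name and the entry tuples are assumptions, not taken from any particular generator), and for an LR(1) grammar it never raises:
def set_action(action, state, terminal, entry):
    # record ACTION[state, terminal], reporting a conflict if the cell already holds a different entry
    old = action.get((state, terminal))
    if old is not None and old != entry:
        kind = "shift/reduce" if "shift" in (old[0], entry[0]) else "reduce/reduce"
        raise ValueError(f"{kind} conflict in state {state} on {terminal!r}: {old} vs {entry}")
    action[(state, terminal)] = entry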
Parsing Process
Input Processing and Stack Operations
The canonical LR parser employs a stack to manage the state of the parsing process, an input buffer containing the sequence of tokens to be analyzed, and an output mechanism to record reductions performed during parsing. The stack alternates between grammar symbols (terminals and nonterminals) and parser states, representing the viable prefixes of the input recognized so far. The input buffer holds the token stream produced by a lexical analyzer, with a special endmarker symbol $ appended to indicate the end of the input, enabling one-symbol lookahead for decision-making. The output typically logs the productions used in reductions, which can contribute to constructing a parse tree or performing semantic actions.[22]
Parsing begins with the stack initialized to contain only the initial state 0, reflecting the empty prefix at the start of the input. The input buffer is positioned at the first token, with the lookahead symbol being this initial token until advanced. No symbols are initially on the output.[22]
The core of the parsing process is a table-driven lookup cycle, where the parser repeatedly examines the current top state on the stack and the current lookahead symbol from the input buffer to index into the ACTION and GOTO tables. If the ACTION entry specifies a shift to a new state s', the lookahead symbol is pushed onto the stack, followed by the new state s', and the input buffer advances to the next token. If the entry indicates a reduce by production A → β, the stack is popped for 2 × |β| entries (removing |β| symbols and their corresponding states), the nonterminal A is then pushed, and the GOTO table is consulted using the new top state and A to push the resulting goto state; the production is also outputted. This cycle continues until termination.[22]
Acceptance occurs when the ACTION table entry for the current state and lookahead $ is "accept," which typically follows a reduce of the augmented start production S' → S (where S is the grammar's start symbol) with no further input remaining, confirming the input is a valid sentence in the language. An error arises if the ACTION table has no defined entry for the current state and lookahead, at which point the parser reports a syntax error and halts, without attempting recovery in the basic algorithm.[22]
The following pseudocode outlines the table-driven parsing loop for a canonical LR(1) parser:
input: a token string w followed by the endmarker $, and the ACTION and GOTO tables;
output: the reductions performed (a rightmost derivation in reverse), or an error report;
initialize the stack to contain only the initial state 0;
repeat
    let s be the top state on the stack;
    let X be the next input symbol (the lookahead);
    if ACTION[s, X] = "shift t" then
        push X onto the stack;
        push t onto the stack;
        X := next input symbol;
    else if ACTION[s, X] = "reduce A → β" then
        pop 2r symbols from the stack, where r is the length of β;
        let s' be the top state on the stack;
        push A onto the stack;
        let t be GOTO[s', A];
        push t onto the stack;
        output the production A → β;
    else if ACTION[s, X] = "accept" then
        return;
    else
        call error;
until accept or error
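The same loop can be rendered as a small executable Python sketch (illustrative only; the ACTION and GOTO dictionaries below encode the tiny S' → E, E → id example from the Shift and Goto Entries section, and the entry format is an assumption):
ACTION = {(0, "id"): ("shift", 1), (1, "$"): ("reduce", "E", ("id",)), (2, "$"): ("accept",)}
GOTO = {(0, "E"): 2}
def parse(tokens):
    tokens = tokens + ["$"]
    stack, pos = [0], 0                          # alternating states and symbols, state on top
    while True:
        state, lookahead = stack[-1], tokens[pos]
        act = ACTION.get((state, lookahead))
        if act is None:
            raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
        if act[0] == "shift":
            stack += [lookahead, act[1]]         # push the symbol, then the new state
            pos += 1
        elif act[0] == "reduce":
            _, lhs, rhs = act
            del stack[len(stack) - 2 * len(rhs):]    # pop 2*|rhs| entries
            stack += [lhs, GOTO[(stack[-1], lhs)]]   # push A, then GOTO[s', A]
            print("reduce", lhs, "->", " ".join(rhs))
        else:                                    # accept
            return True
parse(["id"])   # prints: reduce E -> id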
Handling Shifts and Reduces
In a canonical LR parser, the shift action is executed when the parsing table's action entry for the current state and lookahead symbol specifies "shift j". This operation pushes the lookahead symbol onto the stack, followed by the new state j, and then advances the input pointer to the next symbol. This process builds the right-hand side of potential productions incrementally, simulating the rightmost derivation in reverse.[23]
The reduce action occurs when the action entry indicates "reduce A → α", where α is a production's right-hand side of length k. The parser pops 2k entries from the stack (k symbols and k corresponding states), exposing the state below the handle. It then pushes the nonterminal A onto the stack and transitions to the state given by the goto function from the exposed state on A. This replaces the handle with its nonterminal, completing a step in the reverse rightmost derivation.[23]
Semantic actions may be associated with reduce operations to perform translations or build structures, such as constructing abstract syntax tree nodes from the reduced symbols' attributes. These actions are executed immediately after popping the handle but before pushing the nonterminal, enabling bottom-up synthesis of semantic values like type checking or code generation.[23]
Consider the grammar with productions E → E + T | T, T → id, augmented with S' → E, and input "id + id $". The following table traces the stack (showing states for brevity), input, and actions:
| Stack | Input | Action |
|---|---|---|
| 0 | id + id $ | shift 5 |
| 0 5 | + id $ | reduce T → id (pop 2, push 4 via goto) |
| 0 4 | + id $ | reduce E → T (pop 2, push 1 via goto) |
| 0 1 | + id $ | shift 6 |
| 0 1 6 | id $ | shift 5 |
| 0 1 6 5 | $ | reduce T → id (pop 2, push 7 via goto) |
| 0 1 6 7 | $ | reduce E → E + T (pop 6, push 1 via goto) |
| 0 1 | $ | reduce S' → E (pop 2, accept) |
