LALR parser
In computer science, an LALR parser[a] (look-ahead, left-to-right, rightmost derivation parser) is part of the compiling process where human-readable text is converted into a structured representation to be read by computers. An LALR parser is a software tool that processes (parses) text into a specific internal representation that other programs, such as compilers, can work with. This process happens according to a set of production rules specified by a formal grammar for a computer language.
An LALR parser is a simplified version of a canonical LR parser.
The LALR parser was invented by Frank DeRemer in his 1969 PhD dissertation, Practical Translators for LR(k) languages,[1] in his treatment of the practical difficulties at that time of implementing LR(1) parsers. He showed that the LALR parser has more language recognition power than the LR(0) parser, while requiring the same number of states as the LR(0) parser for a language that can be recognized by both parsers. This makes the LALR parser a memory-efficient alternative to the LR(1) parser for languages that are LALR. It was also proven that there exist LR(1) languages that are not LALR. Despite this weakness, the power of the LALR parser is sufficient for many mainstream computer languages,[2] including Java,[3] though the reference grammars for many languages fail to be LALR due to being ambiguous.[2]
The original dissertation gave no algorithm for constructing such a parser given a formal grammar. The first algorithms for LALR parser generation were published in 1973.[4] In 1982, DeRemer and Tom Pennello published an algorithm that generated highly memory-efficient LALR parsers.[5] LALR parsers can be automatically generated from a grammar by an LALR parser generator such as Yacc or GNU Bison. The automatically generated code may be supplemented by hand-written code to augment the power of the resulting parser.
History
In 1965, Donald Knuth invented the LR parser (Left to Right, Rightmost derivation). The LR parser can recognize any deterministic context-free language in linear-bounded time.[6] However, its parsing tables have very large memory requirements, and implementing an LR parser was impractical given the limited memory of computers at that time. To address this shortcoming, in 1969, Frank DeRemer proposed two simplified versions of the LR parser, the Look-Ahead LR (LALR)[1] and the Simple LR (SLR) parsers, which have much lower memory requirements at the cost of less language-recognition power, with the LALR parser being the more powerful of the two.[1] In 1977, memory optimizations for the LR parser were invented,[7] but the LR parser remained less memory-efficient than the simplified alternatives.
In 1979, Frank DeRemer and Tom Pennello announced a series of optimizations for the LALR parser that would further improve its memory efficiency.[8] Their work was published in 1982.[5]
Overview
Generally, the LALR parser refers to the LALR(1) parser,[b] just as the LR parser generally refers to the LR(1) parser. The "(1)" denotes one-token lookahead, used to resolve differences between rule patterns during parsing. Similarly, there is an LALR(2) parser with two-token lookahead, and LALR(k) parsers with k-token lookahead, but these are rare in actual use. The LALR parser is based on the LR(0) parser, so it can also be denoted LALR(1) = LA(1)LR(0) (1 token of lookahead, LR(0)) or, more generally, LALR(k) = LA(k)LR(0) (k tokens of lookahead, LR(0)). There is in fact a two-parameter family of LA(k)LR(j) parsers for all combinations of j and k, which can be derived from the LR(j + k) parser,[9] but these do not see practical use.
As with other types of LR parsers, an LALR parser is quite efficient at finding the single correct bottom-up parse in a single left-to-right scan over the input stream, because it does not need to use backtracking. Being a lookahead parser by definition, it always uses a lookahead, with LALR(1) being the most-common case.
Relation to other parsers
LR parsers
The LALR(1) parser is less powerful than the LR(1) parser and more powerful than the SLR(1) parser, though they all use the same production rules. The simplification that the LALR parser introduces consists in merging states that have identical kernel item sets, because during the LR(0) state-construction process the lookaheads are not known. This reduces the power of the parser, because not knowing the lookahead symbols can confuse the parser as to which grammar rule to pick next, resulting in reduce/reduce conflicts. All conflicts that arise in applying an LALR(1) parser to an unambiguous LR(1) grammar are reduce/reduce conflicts. The SLR(1) parser performs further merging, which introduces additional conflicts.
The standard example of an LR(1) grammar that cannot be parsed with the LALR(1) parser, exhibiting such a reduce/reduce conflict, is:[10][11]
S → a E c
→ a F d
→ b F c
→ b E d
E → e
F → e
In the LALR table construction, two states will be merged into one state and later the lookaheads will be found to be ambiguous. The one state with lookaheads is:
E → e. {c,d}
F → e. {c,d}
An LR(1) parser will create two different states (with non-conflicting lookaheads), neither of which is ambiguous. In an LALR parser this one state has conflicting actions (given lookahead c or d, reduce to E or F), a "reduce/reduce conflict"; the above grammar will be declared ambiguous by an LALR parser generator and the conflicts will be reported.
To recover, the conflict is typically resolved by choosing E, because it occurs before F in the grammar. However, the resulting parser will not be able to recognize the valid input sequence b e c: the ambiguous sequence e c is reduced to (E → e) c rather than the correct (F → e) c, and b E c is not in the grammar.
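The effect of the merge can be shown concretely. The following sketch (in Python, with completed items written as plain strings and lookaheads as frozensets; the two LR(1) states are the ones described above, one reached after a e and one after b e) merges the states and reports the terminals on which two different reductions apply:

```python
# The two canonical LR(1) states sharing the core {E -> e., F -> e.}.
# Each item pairs a completed production with its lookahead set.
state_after_ae = {("E -> e.", frozenset({"c"})), ("F -> e.", frozenset({"d"}))}
state_after_be = {("E -> e.", frozenset({"d"})), ("F -> e.", frozenset({"c"}))}

def merge(*states):
    """LALR merge: identical cores collapse, lookaheads are unioned."""
    merged = {}
    for state in states:
        for core, la in state:
            merged[core] = merged.get(core, frozenset()) | la
    return merged

def reduce_conflicts(merged):
    """Terminals for which more than one reduction is possible
    (all items here are completed, so each lookahead triggers a reduce)."""
    by_terminal = {}
    for core, la in merged.items():
        for t in la:
            by_terminal.setdefault(t, []).append(core)
    return {t: rules for t, rules in by_terminal.items() if len(rules) > 1}

merged = merge(state_after_ae, state_after_be)
# Both items now carry lookaheads {c, d}; on either c or d the parser
# cannot decide between reducing to E and reducing to F.
```

In the merged state both reductions share the lookaheads {c, d}, which is exactly the reduce/reduce conflict an LALR generator reports for this grammar.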
LL parsers
The LALR(j) parsers are incomparable with LL(k) parsers: for any j and k, both greater than 0, there are LALR(j) grammars that are not LL(k) grammars, and vice versa. In fact, it is undecidable whether a given LL(1) grammar is LALR(k) for any k.[2]
Depending on the presence of empty derivations, an LL(1) grammar can also be an SLR(1) or an LALR(1) grammar. If the LL(1) grammar has no empty derivations, it is SLR(1); if all symbols with empty derivations also have non-empty derivations, it is LALR(1). If there exist symbols having only an empty derivation, the grammar may or may not be LALR(1).[12]
Notes
[edit]- ^ "LALR" is pronounced as the initialism "el-ay-el-arr"
- ^ "LALR(1)" is pronounced as the initialism "el-ay-el-arr-one"
References
- ^ a b c DeRemer 1969.
- ^ a b c LR Parsing: Theory and Practice, Nigel P. Chapman, pp. 86–87
- ^ "Generate the Parser". Eclipse JDT Project. Retrieved 29 June 2012.
- ^ Anderson, T.; Eve, J.; Horning, J. (1973). "Efficient LR(1) parsers". Acta Informatica (2): 2–39.
- ^ a b DeRemer, Frank; Pennello, Thomas (October 1982). "Efficient Computation of LALR(1) Look-Ahead Sets" (PDF). ACM Transactions on Programming Languages and Systems. 4 (4): 615–649. doi:10.1145/69622.357187.
- ^ Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control. 8 (6): 607–639. doi:10.1016/S0019-9958(65)90426-2.
- ^ Pager, D. (1977), "A Practical General Method for Constructing LR(k) Parsers", Acta Informatica 7, vol. 7, no. 3, pp. 249–268, doi:10.1007/BF00290336
- ^ Frank DeRemer, Thomas Pennello (1979), "Efficient Computation of LALR(1) Look-Ahead Sets", Sigplan Notices - SIGPLAN, vol. 14, no. 8, pp. 176–187
- ^ Parsing Techniques: A Practical Guide, by Dick Grune and Ceriel J. H. Jacobs, "9.7 LALR(1)", p. 302
- ^ "7.9 LR(1) but not LALR(1) Archived 4 August 2010 at the Wayback Machine", CSE 756: Compiler Design and Implementation Archived 23 July 2010 at the Wayback Machine, Eitan Gurari, Spring 2008
- ^ "Why is this LR(1) grammar not LALR(1)?"
- ^ (Beatty 1982)
- DeRemer, Franklin L. (1969). Practical Translators for LR(k) languages (PDF) (PhD). MIT. Archived from the original (PDF) on 19 August 2013. Retrieved 13 November 2012.
- Beatty, J. C. (1982). "On the relationship between LL(1) and LR(1) grammars" (PDF). Journal of the ACM. 29 (4 (Oct)): 1007–1022. doi:10.1145/322344.322350.
External links
- Parsing Simulator, a simulator for generating LALR parsing tables and solving the exercises of the accompanying book.
- JS/CC, a JavaScript-based implementation of an LALR(1) parser generator, which can be run in a web browser or from the command line.
- LALR(1) tutorial at the Wayback Machine (archived May 7, 2021), a flash card-like tutorial on LALR(1) parsing.
Background and History
Parsing Fundamentals
Syntax analysis, commonly known as parsing, is the phase in compiler design where a sequence of tokens produced by the lexical analyzer is examined to verify whether it adheres to the syntactic rules defined by a formal grammar, typically constructing a parse tree or abstract syntax tree if valid.[4] This process ensures that the input string represents a syntactically correct program in the source language, identifying structural errors such as mismatched parentheses or invalid statement sequences.[5] A context-free grammar (CFG), which forms the basis for most syntax analysis in programming languages, is formally defined as a four-tuple G = (V, Σ, P, S), where V is the finite set of variables (non-terminals, often uppercase letters representing syntactic categories like expressions or statements), Σ is the finite set of terminals (the actual symbols or tokens from the lexicon, such as keywords or operators), P is a finite set of production rules of the form A → α with A ∈ V and α ∈ (V ∪ Σ)*, and S ∈ V is the distinguished start symbol from which derivations begin.[6] These production rules specify how non-terminals can be replaced by sequences of terminals and non-terminals to generate strings in the language. Derivations in a CFG proceed by repeatedly applying productions: a leftmost derivation replaces the leftmost non-terminal at each step, while a rightmost derivation replaces the rightmost non-terminal, both ultimately yielding a terminal string if the grammar generates it.[7] This context-free nature allows the replacement of a non-terminal to depend only on its own production, independent of surrounding context, enabling efficient parsing for many practical grammars.[8] Parsers are broadly classified into top-down and bottom-up approaches based on how they construct the parse tree.
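As a small illustration of these definitions, a grammar and a rightmost derivation can be written down directly. The sketch below uses an assumed toy grammar for sums of x terms (not one from the text), represented as a dict from non-terminal to its alternatives:

```python
# A toy CFG as a dict from non-terminal to its alternatives; any symbol
# not appearing as a key is a terminal.
grammar = {
    "S": [["S", "+", "T"], ["T"]],
    "T": [["x"], ["(", "S", ")"]],
}

def rightmost_step(form, choice):
    """Replace the rightmost non-terminal in `form` with alternative
    number `choice` of its productions (one rightmost-derivation step)."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in grammar:
            return form[:i] + grammar[form[i]][choice] + form[i + 1:]
    return form  # already a terminal string

# Rightmost derivation of "x + x":
# S => S + T => S + x => T + x => x + x
form = ["S"]
for choice in [0, 0, 1, 0]:
    form = rightmost_step(form, choice)
print(form)  # ['x', '+', 'x']
```

A bottom-up parser such as an LALR parser recovers exactly this derivation in reverse, reducing x + x step by step back to S.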
Top-down parsers, including predictive parsers and recursive descent parsers, begin at the root (start symbol) and recursively expand non-terminals downward to match the input, predicting expansions based on the grammar and current input position.[9] In contrast, bottom-up parsers, such as shift-reduce parsers, start from the leaves (input tokens) and build upward toward the root by recognizing and reducing substrings that match the right-hand sides of productions. Shift-reduce parsing employs a stack to hold grammar symbols and an input buffer for the remaining tokens; the "shift" action moves the next input token onto the stack, while the "reduce" action identifies a "handle" (a right-hand side substring) on the stack's top, pops it, and pushes the corresponding left-hand side non-terminal.[10] LR parsers represent a prominent family of bottom-up parsers that leverage these mechanics for deterministic recognition of context-free languages.[11] A key mechanism in resolving parsing ambiguities, particularly in bottom-up parsers, is lookahead, which involves examining a limited number of upcoming input symbols to inform decisions like whether to shift or reduce. This k-symbol lookahead helps disambiguate choices when multiple actions are possible for the current stack configuration and input, ensuring the parser selects the correct path without backtracking, as in LL(k) or LR(k) grammars, where k bounds the lookahead needed to make parsing deterministic. For many programming language grammars, k = 1 suffices to handle operator precedences and associativities effectively.[12]
Historical Development
The development of LALR parsers traces its roots to the broader advancements in bottom-up parsing techniques during the 1960s, particularly shift-reduce methods for handling deterministic context-free languages.[13] In 1965, Donald Knuth introduced LR(k) parsers in his seminal paper "On the Translation of Languages from Left to Right," which formalized a powerful bottom-up approach capable of parsing any deterministic context-free language using a fixed amount of lookahead.[13] This work established the theoretical foundation for efficient parser generation but highlighted practical challenges, such as the large memory requirements for full LR(1) tables in real-world implementations.[13] To address these inefficiencies, Frank DeRemer developed the concepts of LALR(1) and SLR(1) parsers in his 1969 PhD dissertation at MIT, titled "Practical Translators for LR(k) Languages," which proposed merging LR(1) states to reduce table sizes while preserving much of the parsing power.[14] Significant optimizations followed in the late 1970s and early 1980s, with DeRemer and Thomas Pennello publishing "Efficient Computation of LALR(1) Look-Ahead Sets" in 1979 (presented at the SIGPLAN Symposium on Compiler Construction) and 1982 (in ACM Transactions on Programming Languages and Systems), introducing relational methods for lookahead propagation that drastically improved computation efficiency and made LALR parsers viable for production use.[15] These advancements directly influenced parser generator tools, beginning with Steve Johnson's Yacc (Yet Another Compiler-Compiler) in 1975 at Bell Laboratories, which implemented an LALR(1) generator to automate parser creation for Unix-based compilers. Yacc's success paved the way for open-source alternatives, including GNU Bison, initially developed by Robert Corbett in 1985 as a free Yacc-compatible tool under the GNU Project.
Core Concepts
LR Parsing Basics
LR parsers constitute a family of efficient bottom-up parsers designed for deterministic context-free grammars, scanning the input string from left to right while constructing the rightmost derivation in reverse order. The "LR(k)" designation reflects this process: "L" for left-to-right scan, "R" for rightmost derivation, and "k" indicating the number of lookahead input symbols used to determine parsing actions. This framework enables the recognition of a broad class of context-free languages with linear time complexity.[13] Fundamental to LR parsing are parser items, which encapsulate the parser's progress through the grammar's production rules. An LR(0) item consists of a production rule with a dot (•) marking the current position, such as A → α • β, where α represents the portion already parsed and β the remaining right-hand side to be matched. These items form the basis for tracking partial parses without lookahead information. In contrast, LR(1) items extend this by incorporating lookahead: [A → α • β, a], where a is a terminal symbol anticipating what may follow the completed production in the input stream. This augmentation allows precise decision-making at reduction points.[13] The canonical LR(1) parser is built as a deterministic finite automaton, where each state is a set of LR(1) items representing possible parser configurations. Construction starts with the initial state I_0, comprising the item [S' → • S, $] for the augmented start production S' → S (with $ denoting end-of-input). New states are derived iteratively via two operations: closure and goto. The closure of a set I adds items for non-terminals immediately after the dot in existing items. Formally, if [A → α • B β, a] ∈ I and B → γ is a production for non-terminal B, then for each terminal b ∈ FIRST(β a)—the first terminal set of β followed by lookahead a—add [B → • γ, b] to I; this repeats until no new items are added. 
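The closure and goto operations can be sketched directly. The following is a minimal sketch over an assumed toy grammar with no ε-productions or left recursion (which keeps the FIRST computation trivial); items are tuples (lhs, rhs, dot position, lookahead):

```python
# Toy grammar without epsilon productions or left recursion.
grammar = {
    "S'": [["S"]],
    "S": [["A", "B"]],
    "A": [["a"]],
    "B": [["b"]],
}

def first(seq):
    """FIRST of a symbol sequence; valid only because the toy grammar
    has no epsilon productions and no left recursion."""
    sym = seq[0]
    if sym not in grammar:                  # terminal (including '$')
        return {sym}
    return set().union(*(first(alt) for alt in grammar[sym]))

def closure(items):
    """LR(1) closure: for [A -> alpha . B beta, a], add [B -> . gamma, b]
    for every production of B and every b in FIRST(beta a)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, la in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for b in first(list(rhs[dot + 1:]) + [la]):
                    for gamma in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(gamma), 0, b)
                        if item not in items:
                            items.add(item)
                            changed = True
    return items

def goto(items, X):
    """Advance the dot over symbol X, then close the result."""
    moved = {(l, r, d + 1, la) for (l, r, d, la) in items
             if d < len(r) and r[d] == X}
    return closure(moved)

I0 = closure({("S'", ("S",), 0, "$")})
# I0 contains [S' -> .S, $], [S -> .AB, $], and [A -> .a, b],
# the last lookahead coming from FIRST(B $) = {b}.
```

Note how the lookahead b on [A → • a, b] is computed from FIRST of what follows A, exactly as described above.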
For example, given I containing [S → • A B, c], closure would include all [A → • δ, d] where d ∈ FIRST(B c). The goto operation for set I and symbol X (terminal or non-terminal) yields goto(I, X) = closure({ [A → α X • β, a] | [A → α • X β, a] ∈ I }), advancing the parser over X. These operations generate the full set of states, forming the LR(1) automaton.[13] From the automaton, the parsing table is derived to guide the shift-reduce process. The table has two components: the ACTION table, indexed by current state and next input terminal, and the GOTO table, indexed by state and non-terminal. In ACTION[s, t], entries include "shift s'" to push terminal t onto the parse stack and transition to state s'; "reduce A → α" to pop the |α| symbols of the right-hand side together with their states (2|α| stack entries when symbols and states are interleaved), push non-terminal A, and move to GOTO[s, A]; or "accept" when the start production completes with lookahead $. The GOTO[s, A] entry specifies the next state upon reducing to non-terminal A. This table-driven approach ensures efficient parsing via stack operations.[13] Conflicts in the parsing table indicate potential nondeterminism, which must be absent for the grammar to be LR(1). A shift-reduce conflict arises in state s for terminal t if ACTION[s, t] prescribes both a shift (to some s') and a reduce (by production A → α, where t ∈ FOLLOW(A)). A reduce-reduce conflict occurs if multiple reductions are possible for the same s and t. In canonical LR(1) construction, the precise lookaheads in items ensure that for LR(1) grammars, the FIRST and FOLLOW sets partition the terminals appropriately, eliminating conflicts; any detected conflict signals the grammar is not LR(1). Resolution relies on the lookahead to select the unique correct action, distinguishing LR parsers from weaker variants.[13]
LALR Parser Definition
An LALR(1) parser, or Look-Ahead LR(1) parser, is a type of bottom-up parser that combines the state structure of an LR(0) automaton with lookahead sets propagated from the more precise LR(1) automaton to resolve parsing actions deterministically.[15] It is equivalent to an LA(1)LR(0) parser, where LR(0) items form the core of each state, and the lookahead symbols are computed and propagated to handle shifts and reduces without introducing unnecessary nondeterminism.[2] This approach allows LALR(1) parsers to recognize a broader class of grammars than LR(0) or SLR(1) while using fewer states than full canonical LR(1) parsers. The states in an LALR(1) parser are constructed by first building the canonical LR(1) collection of sets of items and then merging those LR(1) states that share identical cores, defined as the sets of LR(0) items (productions with dot positions, excluding lookaheads).[15] Lookaheads are then propagated across these merged states using relations that capture spontaneous generation and inheritance from prior states, ensuring that the union of lookaheads from the original LR(1) states informs the parsing decisions in the compressed automaton.[15] A grammar is LALR(1) if the resulting parser table contains no shift-reduce or reduce-reduce conflicts after this merging and propagation; however, the process may introduce reduce/reduce conflicts not present in the original LR(1) parser due to the union of incompatible lookaheads.[15] In terms of expressive power, every SLR(1) grammar is also LALR(1), as the more refined lookahead propagation in LALR(1) can only resolve additional conflicts or maintain determinism where SLR(1) succeeds.[2] Conversely, every LALR(1) grammar is LR(1), but the converse does not hold, since some LR(1) grammars produce conflicts upon state merging.[2] Thus, LALR(1) parsers recognize a proper subset of the deterministic context-free languages, though they suffice for most practical programming language grammars. 
To illustrate the benefit of precise lookahead propagation in LALR(1) over SLR(1), consider a grammar with the productions S → a a b, S → B b b, and B → a, together with a production such as S → c B a that puts a into FOLLOW(B), so that FOLLOW(B) = {a, b}. In the LR(1) construction, the state reached after input "a" contains the items [S → a • a b, $] and [B → a •, b]. This configuration requires shifting on "a" to continue parsing "aab" and reducing only if the lookahead is "b" to parse "Bbb". In SLR(1), the lookahead for the reduce would be the entire FOLLOW(B) = {a, b}, causing a shift-reduce conflict on "a". However, LALR(1) propagates the precise lookahead {b} for the reduce, avoiding the conflict and correctly shifting on "a" while reducing on "b".[2]
Construction and Operation
Building the LALR Parsing Table
The construction of an LALR(1) parsing table for a given context-free grammar proceeds in several algorithmic steps, starting from an augmented grammar in which a new start production S' → S is added, with S being the original start symbol. This augmentation ensures the parser recognizes complete sentences ending with the endmarker $. The process leverages the LR(0) automaton as its foundation, enhancing it with precise one-symbol lookaheads to achieve LALR(1) power without the state explosion of full LR(1) parsing.[15] The first step is to compute the LR(0) automaton, which consists of states represented as sets of LR(0) items of the form A → α • β, where the dot indicates the current position within a production. Begin with the initial state I0, the closure of the item S' → • S, where closure adds items for all nonterminals reachable immediately after the dot in existing items. For each state I and grammar symbol X (terminal or nonterminal), compute the goto set goto(I, X) by moving the dot past X in all applicable items of I and taking the closure of the result; this defines the transitions of the automaton. The resulting deterministic finite automaton has states corresponding to viable prefixes of right-sentential forms, and completed items identify potential reduce actions.[16] To incorporate lookaheads, propagate them across the LR(0) states using the efficient algorithm developed by DeRemer and Pennello, which collects lookaheads for each LR(0) core from the corresponding LR(1) items. This involves distinguishing spontaneous lookaheads, those generated directly in a state from the grammar rules (such as the endmarker $ for the completed item [S' → S •]), from lookaheads propagated from predecessor states; these sets are then propagated iteratively via successor relations until no further additions occur, ensuring all valid lookaheads are captured for each core.[15][17] A simplified pseudocode outline for the lookahead propagation phase, based on the DeRemer-Pennello method, is as follows:

Initialize spontaneous lookahead sets SPON[I][p] for each state I and production p completed in I,
e.g., SPON[I][S' → S] = {$} and others from FIRST computations.
Build propagation graphs: For each transition from state J to K on symbol X,
add edges reflecting lookahead flow (direct for terminals, via epsilon-closure for nonterminals).
Worklist W = all states with non-empty spontaneous sets;
While W is not empty:
Remove state J from W;
For each successor state K reachable from J via a propagation edge:
For each production p completed in both J and K:
Old = PROP[K][p];
PROP[K][p] = PROP[K][p] ∪ SPON[J][p] ∪ PROP[J][p];
If PROP[K][p] ≠ Old:
Add K to W;
Final lookahead LA[I][p] = SPON[I][p] ∪ PROP[I][p] for each I and p.
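The worklist loop above can be made concrete on a small, hand-made propagation instance. The states, productions, and edges below are illustrative assumptions (not derived from a particular grammar); nodes are (state, production) pairs:

```python
# Illustrative propagation instance: SPON holds spontaneous lookaheads,
# EDGES holds propagation links between (state, production) pairs.
SPON = {
    ("I0", "S' -> S"): {"$"},
    ("I2", "A -> a"): {"b"},
    ("I5", "A -> a"): set(),
}
EDGES = {
    ("I0", "S' -> S"): [],
    ("I2", "A -> a"): [("I5", "A -> a")],
    ("I5", "A -> a"): [],
}

def propagate(spon, edges):
    """Worklist fixed-point: push lookaheads along propagation edges
    until nothing changes, then union with the spontaneous sets."""
    prop = {node: set() for node in spon}
    work = [n for n in spon if spon[n]]
    while work:
        j = work.pop()
        for k in edges[j]:
            new = (spon[j] | prop[j]) - prop[k]
            if new:                 # PROP[k] grew: revisit k's successors
                prop[k] |= new
                work.append(k)
    return {node: spon[node] | prop[node] for node in spon}

LA = propagate(SPON, EDGES)
# LA[("I5", "A -> a")] now includes "b", inherited from I2.
```

The loop terminates because lookahead sets only grow and are bounded by the terminal alphabet, the same fixed-point argument that applies to the pseudocode above.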
Finally, the parsing table is filled from the automaton and the computed lookahead sets LA[I][p]:
- In the action table, for each terminal a: if there is a transition from state I to state J on a, place "shift J" (written s j, where j is the state number of J); if a production A → α is completed in I and a ∈ LA[I][A → α], place "reduce by A → α" (written r i for production index i); place "accept" if I contains S' → S • and a = $; otherwise, place "error".
- In the goto table, for each nonterminal A: if there is a transition from state I to state J on A, place j (the state number of J).
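Putting the pieces together, the tables produced by this construction drive an ordinary shift-reduce loop. The sketch below hand-fills the ACTION and GOTO tables that these steps would yield for a small assumed grammar S → ( S ) | x, together with the table-driven driver:

```python
# Hand-constructed LALR(1) tables for the grammar
#   0: S' -> S    1: S -> ( S )    2: S -> x
# ACTION entries: ('s', j) shift to state j, ('r', i) reduce by
# production i, ('acc',) accept; missing entries are errors.
ACTION = {
    (0, "("): ("s", 2), (0, "x"): ("s", 3),
    (1, "$"): ("acc",),
    (2, "("): ("s", 2), (2, "x"): ("s", 3),
    (3, ")"): ("r", 2), (3, "$"): ("r", 2),
    (4, ")"): ("s", 5),
    (5, ")"): ("r", 1), (5, "$"): ("r", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
RULES = {1: ("S", 3), 2: ("S", 1)}  # production -> (lhs, rhs length)

def parse(tokens):
    """Table-driven shift-reduce driver; returns True on accept."""
    stack = [0]                       # stack of state numbers
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            return False              # error entry
        if act[0] == "s":             # shift: push state, consume token
            stack.append(act[1])
            pos += 1
        elif act[0] == "r":           # reduce: pop |rhs| states, then goto
            lhs, n = RULES[act[1]]
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True               # accept

print(parse("((x))"))  # True
print(parse("(x"))     # False
```

For simplicity this driver pushes only state numbers; pushing (symbol, state) pairs instead, as in the interleaved formulation above, changes nothing essential.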
