Recursive descent parser
In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.[1][2]
A predictive parser is a recursive descent parser that does not require backtracking.[3] Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar. A predictive parser runs in linear time.
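Left-recursion removal can be seen concretely in code: a rule like expr → expr "-" num | num would make a naive recursive procedure call itself before consuming any input, while the equivalent iterative (right-recursive) form parses the same strings in a loop and preserves left associativity. The sketch below is illustrative only; the toy grammar and the function name parse_expr are hypothetical, not from the source.

```python
# The left-recursive rule  expr -> expr "-" num | num  cannot be coded as a
# recursive procedure directly: expr() would call expr() before consuming
# any input. The standard fix turns the recursion into a loop, taking care
# to keep "-" left-associative.
def parse_expr(tokens):
    pos = 0

    def num():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return int(tok)

    value = num()                     # first operand
    while pos < len(tokens) and tokens[pos] == '-':
        pos += 1                      # consume "-"
        value = value - num()         # left-associative fold
    return value

print(parse_expr(['8', '-', '3', '-', '2']))  # 3, i.e. (8-3)-2, not 8-(3-2)
```

The loop plays the role of the eliminated left recursion; the same pattern appears in the expression/term procedures of the examples later in this article.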
Recursive descent with backtracking is a technique that determines which production to use by trying each production in turn. Recursive descent with backtracking is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backtracking may require exponential time.
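Backtracking can be illustrated with a deliberately non-LL(1) grammar. The sketch below uses the hypothetical grammar S → "a" "b" | "a" "c" (not from the source): both alternatives begin with "a", so one-token lookahead cannot choose between them, and the parser instead tries each production in turn, rewinding the input position when an alternative fails.

```python
# Recursive descent with backtracking for S -> "a" "b" | "a" "c".
# Each alternative is attempted in order; on failure the input position
# is rewound so the next alternative starts from the same place.
def parse_S(tokens, pos):
    for alt in (['a', 'b'], ['a', 'c']):    # try productions in turn
        p = pos                             # remember position for rewind
        ok = True
        for expected in alt:
            if p < len(tokens) and tokens[p] == expected:
                p += 1                      # terminal matched, advance
            else:
                ok = False                  # mismatch: abandon alternative
                break
        if ok:
            return p                        # success: position after S
    return None                             # all alternatives failed

print(parse_S(['a', 'c'], 0))  # 2 (matched via the second alternative)
print(parse_S(['a', 'd'], 0))  # None
```

Note that the rewind makes the worst case exponential when such choices nest, which is the cost described above.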
Although predictive parsers are widely used, and are frequently chosen if writing a parser by hand, programmers often prefer to use a table-based parser produced by a parser generator,[citation needed] either for an LL(k) language or using an alternative parser, such as LALR or LR. This is particularly the case if a grammar is not in LL(k) form, as transforming the grammar to LL to make it suitable for predictive parsing is an involved process. Predictive parsers can also be automatically generated, using tools like ANTLR.
Predictive parsers can be depicted using transition diagrams for each non-terminal symbol where the edges between the initial and the final states are labelled by the symbols (terminals and non-terminals) of the right side of the production rule.[4]
Example parser
The following EBNF-like grammar (for Niklaus Wirth's PL/0 programming language, from Algorithms + Data Structures = Programs) is in LL(1) form:
program = block "." .

block =
    ["const" ident "=" number {"," ident "=" number} ";"]
    ["var" ident {"," ident} ";"]
    {"procedure" ident ";" block ";"} statement .

statement =
    ident ":=" expression
    | "call" ident
    | "begin" statement {";" statement } "end"
    | "if" condition "then" statement
    | "while" condition "do" statement .

condition =
    "odd" expression
    | expression ("="|"#"|"<"|"<="|">"|">=") expression .

expression = ["+"|"-"] term {("+"|"-") term} .

term = factor {("*"|"/") factor} .

factor =
    ident
    | number
    | "(" expression ")" .
Terminals are expressed in quotes. Each nonterminal is defined by a rule in the grammar, except for ident and number, which are assumed to be implicitly defined.
C implementation
What follows is an implementation of a recursive descent parser for the above language in C. The parser reads source code and exits with an error message if the code fails to parse; it exits silently if the code parses correctly.
Notice how closely the predictive parser below mirrors the grammar above. There is a procedure for each nonterminal in the grammar. Parsing descends in a top-down manner until the final nonterminal has been processed. The program fragment depends on a global variable, sym, which contains the current symbol from the input, and the function nextsym, which updates sym when called.
The implementations of the functions nextsym and error are omitted for simplicity.
typedef enum {ident, number, lparen, rparen, times, slash, plus,
    minus, eql, neq, lss, leq, gtr, geq, callsym, beginsym, semicolon,
    endsym, ifsym, whilesym, becomes, thensym, dosym, constsym, comma,
    varsym, procsym, period, oddsym} Symbol;

Symbol sym;
void nextsym(void);
void error(const char msg[]);
void expression(void);  /* forward declaration: factor() calls expression() */

int accept(Symbol s) {
    if (sym == s) {
        nextsym();
        return 1;
    }
    return 0;
}

int expect(Symbol s) {
    if (accept(s))
        return 1;
    error("expect: unexpected symbol");
    return 0;
}

void factor(void) {
    if (accept(ident)) {
        ;
    } else if (accept(number)) {
        ;
    } else if (accept(lparen)) {
        expression();
        expect(rparen);
    } else {
        error("factor: syntax error");
        nextsym();
    }
}

void term(void) {
    factor();
    while (sym == times || sym == slash) {
        nextsym();
        factor();
    }
}

void expression(void) {
    if (sym == plus || sym == minus)
        nextsym();
    term();
    while (sym == plus || sym == minus) {
        nextsym();
        term();
    }
}

void condition(void) {
    if (accept(oddsym)) {
        expression();
    } else {
        expression();
        if (sym == eql || sym == neq || sym == lss || sym == leq || sym == gtr || sym == geq) {
            nextsym();
            expression();
        } else {
            error("condition: invalid operator");
            nextsym();
        }
    }
}

void statement(void) {
    if (accept(ident)) {
        expect(becomes);
        expression();
    } else if (accept(callsym)) {
        expect(ident);
    } else if (accept(beginsym)) {
        do {
            statement();
        } while (accept(semicolon));
        expect(endsym);
    } else if (accept(ifsym)) {
        condition();
        expect(thensym);
        statement();
    } else if (accept(whilesym)) {
        condition();
        expect(dosym);
        statement();
    } else {
        error("statement: syntax error");
        nextsym();
    }
}

void block(void) {
    if (accept(constsym)) {
        do {
            expect(ident);
            expect(eql);
            expect(number);
        } while (accept(comma));
        expect(semicolon);
    }
    if (accept(varsym)) {
        do {
            expect(ident);
        } while (accept(comma));
        expect(semicolon);
    }
    while (accept(procsym)) {
        expect(ident);
        expect(semicolon);
        block();
        expect(semicolon);
    }
    statement();
}

void program(void) {
    nextsym();
    block();
    expect(period);
}
Examples
Some recursive descent parser generators:
- TMG – an early compiler-compiler used in the 1960s and early 1970s
- JavaCC
- Coco/R
- ANTLR
- Spirit Parser Framework – a C++ recursive descent parser generator framework requiring no pre-compile step
- parboiled (Java) – a recursive descent PEG parsing library for Java
The C++ front-end of the Clang compiler contains a hand-written parser based on the recursive-descent parsing algorithm.[5]
See also
- Parser combinator – a higher-order function used in combinatory parsing, a method of factoring recursive descent parser designs
- Parsing expression grammar – another form representing recursive descent grammar
- Recursive ascent parser
- Tail recursive parser – a variant of the recursive descent parser
References
- ^ This article is based on material taken from Recursive+descent+parser at the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
- ^ Burge, W.H. (1975). Recursive Programming Techniques. Addison-Wesley Publishing Company. ISBN 0-201-14450-6.
- ^ Watson, Des (22 March 2017). A Practical Approach to Compiler Construction. Springer. ISBN 978-3-319-52789-5.
- ^ Aho, Alfred V.; Sethi, Ravi; Ullman, Jeffrey (1986). Compilers: Principles, Techniques and Tools (first ed.). Addison Wesley. p. 183.
- ^ Bendersky, Eli (5 July 2012). "How Clang handles the type / variable name ambiguity of C/C++". https://eli.thegreenplace.net/2012/07/05/how-clang-handles-the-type-variable-name-ambiguity-of-cc/
General references
- Compilers: Principles, Techniques, and Tools, first edition, Alfred V Aho, Ravi Sethi, and Jeffrey D Ullman, in particular Section 4.4.
- Modern Compiler Implementation in Java, Second Edition, Andrew Appel, 2002, ISBN 0-521-82060-X.
- Recursive Programming Techniques, W.H. Burge, 1975, ISBN 0-201-14450-6
- Crafting a Compiler with C, Charles N Fischer and Richard J LeBlanc, Jr, 1991, ISBN 0-8053-2166-7.
- Compiling with C# and Java, Pat Terry, 2005, ISBN 0-321-26360-X, 624 pages
- Algorithms + Data Structures = Programs, Niklaus Wirth, 1975, ISBN 0-13-022418-9
- Compiler Construction, Niklaus Wirth, 1996, ISBN 0-201-40353-6
External links
- Jack W. Crenshaw: Let's Build A Compiler (1988–1995), in Pascal, with assembly language output, using a "keep it simple" approach
Recursive descent parser
Fundamentals
Definition and Principles
A recursive descent parser is a type of top-down parser implemented as a set of mutually recursive procedures, with each procedure corresponding to one nonterminal symbol in the context-free grammar being parsed.[4] This approach directly translates the grammar's production rules into procedural code, where the parsing process begins at the start symbol and recursively expands nonterminals to match the input token stream.[5]

The foundational principles of recursive descent parsing emphasize predictive decision-making through limited lookahead, typically one token, to select the appropriate production rule without backtracking.[4] This predictive nature requires the grammar to be LL(1), meaning that for each nonterminal, the first terminal symbol of each alternative production must be unique, ensuring deterministic choices.[6] Grammars with left recursion pose challenges, as they can lead to infinite recursion; such cases must be refactored, often by converting left-recursive rules into right-recursive or iterative forms to maintain termination.[4] Consequently, recursive descent excels in parsing unambiguous, non-left-recursive grammars but may require grammar transformations for broader applicability.

At its core, the parser comprises one function per nonterminal, alongside a global input scanner that advances through tokens; the main parsing function invokes these procedures recursively based on the current lookahead token.[7] Recursion in this context mirrors the hierarchical structure of the parse tree: each procedure call descends into sub-productions for nested nonterminals, consuming matching terminals and returning control upon successful matching of the entire rule, thereby constructing the derivation tree incrementally from the top down.[4]

Historical Development
Recursive descent parsing emerged in the early 1960s as part of the broader development of compiler design techniques, influenced in particular by the ALGOL 60 report and the need for practical top-down parsing methods during the implementation of early programming language compilers. The technique was formally described in 1961, first by Ned Irons in January and then by Peter Lucas in December; both introduced top-down parsers for ALGOL-like languages built from mutually recursive procedures that mirror the grammar structure, marking a key advance in syntax analysis for context-free grammars.[8] This approach contrasted with emerging bottom-up methods and was motivated by the desire for straightforward, hand-implementable parsers in the resource-constrained environments of the time.

In 1965, Donald Knuth developed LR parsing as a more general and efficient bottom-up alternative, generalizing precedence parsing to handle a wider class of grammars with deterministic finite automata, which highlighted the trade-offs between top-down simplicity and bottom-up power in parser design. Despite this, recursive descent gained traction through Niklaus Wirth's work in the 1970s: he employed hand-written recursive descent parsers in the first Pascal compiler, completed in early 1970, and in his educational PL/0 compiler (1976), emphasizing readability and ease of integration with language design. Wirth's advocacy for the method in subsequent languages such as Modula and Oberon further solidified its role in producing fast, one-pass compilers.[9]

The technique evolved significantly in the 1980s and 1990s as alternatives to bottom-up tools like Yacc (introduced in 1975) proliferated, addressing the need for more maintainable top-down generators compatible with LL(1) grammars.
Key milestones included the development of Coco in 1983 at the University of Linz, Austria, which generated recursive descent scanners and parsers from attributed grammars, and ANTLR (originally PCCTS) in 1989 by Terence Parr, which supported adaptive LL(*) parsing for broader grammar classes.[10][11] These tools popularized recursive descent in practical compiler construction, with formal descriptions appearing in influential textbooks such as Wirth's Compilerbau (1976) and Aho and Ullman's Principles of Compiler Design (1977), with later editions including Sethi as co-author. By the 1990s, its adoption had extended to scripting language interpreters and IDEs, which valued its debuggability over automated bottom-up methods.

Operational Mechanics
Parsing Procedure
The parsing procedure of a recursive descent parser initiates with the setup of the input stream, where tokens are generated from the source code and placed into a lookahead buffer, typically holding one token for predictive decisions. The process begins by calling the parsing function associated with the grammar's start symbol, a nonterminal that represents the root of the language's structure. An end-of-input marker is often appended to the token stream to facilitate termination checks. This initialization ensures the parser can systematically consume the input while tracking progress.[12]

During the core recursive descent phase, the parser processes each nonterminal by examining its production rules in sequence or predictively. For a given nonterminal, terminals in the right-hand side of a production are matched by advancing the input pointer and verifying that the lookahead token equals the expected terminal symbol; a successful match consumes the token. Nonterminals trigger recursive calls to their dedicated parsing functions, mirroring the grammar's hierarchical structure and building depth through the function call stack. This top-down approach expands the start symbol into a tree of derivations, with each recursive invocation handling a subtree.[13]

To resolve ambiguity in productions with alternatives, the parser employs a one-token lookahead to select the appropriate branch without backtracking, relying on the grammar's predictive properties. If no alternative matches the lookahead, the procedure reports a syntax error, often halting or entering a recovery mode by skipping tokens until a synchronizing point.
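As a minimal illustration of this procedure, the sketch below recognizes the toy grammar S → "x" | "(" S ")" (a hypothetical grammar chosen for brevity): one procedure for the single nonterminal, a one-token lookahead, and an appended end-of-input marker for the termination check.

```python
# Minimal recognizer following the procedure described above, for the
# toy grammar S -> "x" | "(" S ")". The lookahead is tokens[pos]; an
# explicit "$" end marker lets the top level confirm all input was used.
def recognize(tokens):
    tokens = tokens + ['$']            # append end-of-input marker
    pos = 0

    def S():
        nonlocal pos
        if tokens[pos] == 'x':         # predict S -> "x"
            pos += 1
        elif tokens[pos] == '(':       # predict S -> "(" S ")"
            pos += 1
            S()                        # recursive descent into nested S
            if tokens[pos] != ')':
                raise SyntaxError("expected ')'")
            pos += 1
        else:
            raise SyntaxError("expected 'x' or '('")

    S()                                # call the start symbol's procedure
    if tokens[pos] != '$':             # termination check
        raise SyntaxError("extra input")
    return True

print(recognize(['(', '(', 'x', ')', ')']))  # True
```

The recursion depth here equals the nesting depth of parentheses, mirroring the implicit parse tree carried on the call stack.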
The parse tree emerges implicitly from the recursion stack, where function entries and exits delineate nodes; for applications requiring explicit representation, functions can return constructed tree nodes to enable semantic actions like type checking.[14] Termination occurs upon returning from the initial start symbol call, provided all input tokens have been consumed and the end-of-input marker is reached, confirming a complete and valid parse. Failure to consume the entire input or an unresolved error indicates an invalid structure, prompting error diagnostics. This procedure ensures efficient, direct translation of the grammar into executable code while maintaining traceability through recursion.[12]

Grammar Compatibility
Recursive descent parsers are most suitable for LL(1) context-free grammars, where the parsing can be performed predictively using a single token of lookahead, ensuring that the FIRST sets of alternative productions for any nonterminal are disjoint.[6] This compatibility arises because such grammars allow unambiguous selection of productions without backtracking, aligning with the top-down, recursive nature of the parser. Additionally, these grammars must be free from left recursion and ambiguity to prevent infinite recursion or nondeterministic choices during parsing.[15]

Incompatible grammars often feature left-recursive rules, such as A → A α | β, which lead to infinite recursion in the corresponding parsing function since the function calls itself before advancing the input.[6] Nondeterministic choices, where alternatives share common prefixes in their FIRST sets, require additional lookahead beyond one token, making the grammar non-LL(1) and unsuitable for standard recursive descent without modifications.[15] To adapt incompatible grammars, left factoring can resolve prefix conflicts by extracting common initial symbols into a new nonterminal; for example, transforming A → α β | α γ into A → α A′ with A′ → β | γ.[6] For LL(k) extensions with more lookahead, manual prediction tables can be constructed to guide production selection, though this increases complexity.[15] Left recursion can be eliminated by rewriting rules to right-recursive forms, such as converting A → A α | β to A → β A′ with A′ → α A′ | ε, potentially adjusting the abstract syntax tree to restore desired associativity.[6]

The formal conditions for predictability rely on FIRST and FOLLOW sets. The FIRST set of a nonterminal A, denoted FIRST(A), is the set of terminals that can appear as the first symbol in some string derived from A; if A can derive the empty string, ε is included.[6] The FOLLOW set of A, denoted FOLLOW(A), is the set of terminals that can appear immediately after A in some sentential form.[15] A grammar is LL(1) if, for each nonterminal A with alternatives α₁ | … | αₙ, the FIRST sets of the αᵢ are pairwise disjoint, and if any αᵢ can derive ε, then FIRST(αⱼ) and FOLLOW(A) are disjoint for the remaining alternatives.[6]

Consider the simple grammar with productions S → x | "(" L ")" and L → S L | ε:
- Compute FIRST(S): Since S → x starts with x, and S → "(" L ")" starts with "(", FIRST(S) = { x, ( }.
- Compute FIRST(L): L can derive ε, and through L → S L it includes FIRST(S), so FIRST(L) = { ε, x, ( }.
- Compute FOLLOW(S): As the start symbol, FOLLOW(S) includes $; in L → S L, S is followed by FIRST(L) \ {ε} and, since L can derive ε, also by FOLLOW(L), which contains ) from S → "(" L ")". Thus FOLLOW(S) = { $, (, x, ) }.
- Compute FOLLOW(L): L is followed by ")" in S → "(" L ")", so FOLLOW(L) = { ) }.
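These set computations can be reproduced mechanically. The sketch below is an illustrative fixed-point implementation (names such as compute_sets are hypothetical, not from the source) that derives FIRST and FOLLOW for the example grammar S → x | "(" L ")", L → S L | ε.

```python
# Fixed-point computation of FIRST and FOLLOW sets for the example grammar
# S -> x | "(" L ")",  L -> S L | ε.
# Nonterminals are the keys of `productions`; everything else is a terminal.
EPS = 'ε'

productions = {
    'S': [['x'], ['(', 'L', ')']],
    'L': [['S', 'L'], []],          # [] encodes the empty production
}

def first_of_seq(seq, first):
    """FIRST of a symbol sequence, given the current FIRST sets."""
    out = set()
    for sym in seq:
        syms = first[sym] if sym in productions else {sym}
        out |= syms - {EPS}
        if EPS not in syms:
            return out              # this symbol cannot vanish; stop here
    out.add(EPS)                    # every symbol in seq can derive ε
    return out

def compute_sets(productions, start):
    first = {nt: set() for nt in productions}
    follow = {nt: set() for nt in productions}
    follow[start].add('$')          # end-of-input marker follows the start
    changed = True
    while changed:                  # iterate until no set grows
        changed = False
        for nt, alts in productions.items():
            for alt in alts:
                f = first_of_seq(alt, first)        # grow FIRST(nt)
                if not f <= first[nt]:
                    first[nt] |= f; changed = True
                for i, sym in enumerate(alt):       # grow FOLLOW sets
                    if sym not in productions:
                        continue
                    rest = first_of_seq(alt[i + 1:], first)
                    add = rest - {EPS}
                    if EPS in rest:                 # tail can vanish
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add; changed = True
    return first, follow

first, follow = compute_sets(productions, 'S')
print(first['S'])   # {'x', '('}
print(follow['L'])  # {')'}
```

The iteration repeats until no set grows, which is the standard worklist formulation of the definitions above; the results match the hand computation.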
Implementation Details
Algorithmic Structure
A recursive descent parser employs a top-down approach, implementing the grammar's productions through a set of mutually recursive procedures, each corresponding to a nonterminal in the grammar.[16] This design translates the context-free grammar directly into code, where the procedure for the start symbol initiates parsing and recursively invokes subprocedures to handle derivations.[17] The modular design centers on dedicated functions for each nonterminal, promoting clarity and maintainability by isolating the logic for parsing specific syntactic constructs. These functions share access to a global input stream of tokens and a position tracker that maintains the current parsing location, enabling seamless navigation through the tokenized input without redundant state management across modules.[16] For instance, a function parsing an expression nonterminal would recursively call functions for terms and factors, advancing the shared position as matches occur.[17]

Lookahead in recursive descent parsing typically involves a single-token peek mechanism, where a function examines the next token in the stream without consuming it to select the appropriate production branch based on the grammar's FIRST sets.[16] This peek operation supports LL(1) grammars efficiently, with optional buffering to cache the lookahead token and minimize repeated accesses to the input stream for improved performance in repetitive checks.[17]

Error recovery follows a simple panic-mode strategy, where upon detecting a syntax error, such as a mismatch between the expected and actual token, the parser skips input tokens until it encounters a synchronizing symbol from the FOLLOW set of the current nonterminal, allowing parsing to resume without cascading failures.[18] This approach prioritizes continuation over precise diagnosis, discarding erroneous tokens to realign with valid syntax.[19]

Integration with the lexer occurs through a standardized interface, where the parser assumes a pre-tokenized input stream and consumes tokens via dedicated match functions that verify and advance the position upon successful recognition.[16] The lexer supplies categorized tokens (e.g., identifiers, operators), and the parser's consumption methods ensure type-safe progression, decoupling lexical analysis from syntactic processing.[17]

Extensibility is achieved by embedding semantic actions directly within the parsing functions, such as constructing abstract syntax tree (AST) nodes during successful matches or generating intermediate code on the fly.[16] These actions, often placed after production expansions, allow the parser to perform attribute evaluation or translation alongside syntax checking, facilitating extensions for compiler backends without altering the core structure.[17]

Code Examples
A recursive descent parser for arithmetic expressions can be implemented using mutually recursive procedures, each handling a non-terminal in the grammar. Consider the following grammar, which supports addition, subtraction, multiplication, division, numbers, identifiers, and parentheses:

expr   → term { ( "+" | "-" ) term }
term   → factor { ( "*" | "/" ) factor }
factor → number | identifier | "(" expr ")"
Pseudocode Example
The pseudocode below demonstrates the core recursive functions, using loops for repetitions (e.g., { ("+" | "-") term }) and conditional checks for alternatives (e.g., number, identifier, or parentheses). Recursion occurs when parsing nested factors, such as parenthesized expressions, with depth limited by expression complexity to avoid stack overflow in practice. Token consumption advances the input stream, and errors are raised on mismatches.
E() →                                   // Parse expr
    t = T()                             // Initial term
    while next token is "+" or "-"
        op = consume next token         // Match and advance
        t1 = T()                        // Recursive call for next term
        t = new BinaryNode(op, t, t1)   // Build AST node
    return t

T() →                                   // Parse term
    f = F()                             // Initial factor
    while next token is "*" or "/"
        op = consume next token
        f1 = F()                        // Recursive call for next factor
        f = new BinaryNode(op, f, f1)
    return f

F() →                                   // Parse factor
    if next token is number
        return new NumberNode(consume number)
    else if next token is identifier
        return new IdentifierNode(consume identifier)
    else if next token is "("
        consume "("
        e = E()                         // Recursive call for expr
        consume ")"
        return e
    else
        error "Expected number, identifier, or '('"
C Implementation
A basic C implementation requires a lexer to provide tokens (e.g., via a global current token and an advance function) and functions for matching terminals. The code below assumes a simple lexer with functions like next_token() (advances and returns the current token type), match(char* expected) (checks and consumes on a match, returning a boolean), get_lexeme() (returns the string value), and error() (prints a message and exits). It builds a simple AST using dynamic allocation. Recursion depth corresponds to nesting levels, such as in (1 + 2) * 3.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Simplified AST node types
typedef enum { NUMBER, ID, BINARY } NodeType;

typedef struct Node {
    NodeType type;
    char* value;        // For NUMBER or ID
    char* op;           // For BINARY
    struct Node* left;
    struct Node* right;
} Node;

// Assume the lexer provides these globals, token codes, and helpers:
extern char* current_lexeme;        // text of the current token
extern int current_type;            // token code, e.g. NUMBER_TYPE or ID_TYPE
extern int NUMBER_TYPE, ID_TYPE;
void next_token(void);
void error(const char* msg);

Node* parse_expr(void);             // Forward declaration: parse_factor() calls it

Node* create_node(NodeType type, char* val, char* op, Node* l, Node* r) {
    Node* n = malloc(sizeof(Node));
    n->type = type;
    n->value = val ? strdup(val) : NULL;
    n->op = op ? strdup(op) : NULL;
    n->left = l;
    n->right = r;
    return n;
}

int match(char* expected) {
    if (strcmp(current_lexeme, expected) == 0) {
        next_token();
        return 1;
    }
    return 0;
}

int is_number(void) { return current_type == NUMBER_TYPE; }
int is_id(void)     { return current_type == ID_TYPE; }

Node* parse_factor(void) {
    if (is_number()) {
        Node* n = create_node(NUMBER, current_lexeme, NULL, NULL, NULL);  // create_node strdups the lexeme
        next_token();
        return n;
    } else if (is_id()) {
        Node* n = create_node(ID, current_lexeme, NULL, NULL, NULL);
        next_token();
        return n;
    } else if (match("(")) {
        Node* e = parse_expr();
        if (!match(")")) error("Expected ')'");
        return e;
    }
    error("Expected number, identifier, or '('");
    return NULL;
}

Node* parse_term(void) {
    Node* f = parse_factor();       // Recursive call
    while (strcmp(current_lexeme, "*") == 0 || strcmp(current_lexeme, "/") == 0) {
        char* op = strdup(current_lexeme);
        next_token();
        Node* f2 = parse_factor();  // Recursive call
        f = create_node(BINARY, NULL, op, f, f2);
        free(op);                   // create_node made its own copy
    }
    return f;
}

Node* parse_expr(void) {
    Node* t = parse_term();         // Recursive call
    while (strcmp(current_lexeme, "+") == 0 || strcmp(current_lexeme, "-") == 0) {
        char* op = strdup(current_lexeme);
        next_token();
        Node* t2 = parse_term();    // Recursive call
        t = create_node(BINARY, NULL, op, t, t2);
        free(op);                   // create_node made its own copy
    }
    return t;
}

// Usage: Node* ast = parse_expr();  // After initializing lexer
Python Variant
In Python, a class-based approach can encapsulate the parser state, including the token stream, current position, and error handling via exceptions. The example below uses a Parser class with methods for each non-terminal, building an AST as a tree of objects. It includes basic error handling by raising ParseError on mismatches. Recursion is managed via method calls, with depth bounded by Python's recursion limit (typically 1000, sufficient for most expressions). The lexer is simplified to tokenize on the fly for brevity.
import re

class ParseError(Exception):
    pass

class Token:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

class ASTNode:
    pass

class NumberNode(ASTNode):
    def __init__(self, value):
        self.value = value

class IdNode(ASTNode):
    def __init__(self, value):
        self.value = value

class BinaryNode(ASTNode):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

class Parser:
    def __init__(self, text):
        self.tokens = self._tokenize(text)
        self.pos = 0

    def _tokenize(self, text):
        tokens = []
        # Simple regex for numbers, ids, ops, parens
        for match in re.finditer(r'\d+|[a-zA-Z]+|[+\-*/()]', text):
            val = match.group(0)
            if val.isdigit():
                tokens.append(Token('NUMBER', val))
            elif val.isalpha():
                tokens.append(Token('ID', val))
            else:
                tokens.append(Token(val, val))
        tokens.append(Token('EOF', None))
        return tokens

    def current(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return Token('EOF', None)

    def consume(self, expected_type):
        if self.current().type == expected_type:
            value = self.current().value  # capture before advancing
            self.pos += 1
            return value
        raise ParseError(f"Expected {expected_type}, got {self.current().type}")

    def parse_factor(self):
        tok = self.current()
        if tok.type == 'NUMBER':
            self.pos += 1
            return NumberNode(tok.value)
        elif tok.type == 'ID':
            self.pos += 1
            return IdNode(tok.value)
        elif tok.type == '(':
            self.pos += 1
            expr = self.parse_expr()  # Recursive call
            self.consume(')')
            return expr
        raise ParseError("Expected number, identifier, or '('")

    def parse_term(self):
        node = self.parse_factor()  # Recursive call
        while self.current().type in ('*', '/'):
            op = self.consume(self.current().type)
            right = self.parse_factor()  # Recursive call
            node = BinaryNode(op, node, right)
        return node

    def parse_expr(self):
        node = self.parse_term()  # Recursive call
        while self.current().type in ('+', '-'):
            op = self.consume(self.current().type)
            right = self.parse_term()  # Recursive call
            node = BinaryNode(op, node, right)
        return node

    def parse(self):
        ast = self.parse_expr()
        if self.current().type != 'EOF':
            raise ParseError("Extra input")
        return ast

# Usage example:
# parser = Parser("2 + 3 * (4 - 1)")
# ast = parser.parse()
# print(ast)  # Would print AST structure
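The AST-building parsers above can instead execute semantic actions during the descent itself. The following sketch (illustrative, not from the source; the Evaluator class name is hypothetical) evaluates an arithmetic expression on the fly rather than returning nodes. Identifiers are tokenized but rejected in factor, since they carry no value here.

```python
import re

# Same expr/term/factor grammar as above, but with semantic actions that
# compute the numeric value during the descent instead of building an AST.
class Evaluator:
    def __init__(self, text):
        self.tokens = re.findall(r'\d+|[a-zA-Z]+|[+\-*/()]', text) + ['$']
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):       # { ("+" | "-") term }
            op = self.next()
            rhs = self.term()
            value = value + rhs if op == '+' else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):       # { ("*" | "/") factor }
            op = self.next()
            rhs = self.factor()
            value = value * rhs if op == '*' else value / rhs
        return value

    def factor(self):
        if self.peek().isdigit():
            return int(self.next())            # number
        if self.peek() == '(':
            self.next()
            value = self.expr()                # "(" expr ")"
            if self.next() != ')':
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError("expected number or '('")

    def evaluate(self):
        value = self.expr()
        if self.peek() != '$':                 # termination check
            raise SyntaxError("extra input")
        return value

print(Evaluator("2 + 3 * (4 - 1)").evaluate())  # 11
```

Because the while-loops fold operands left to right, subtraction and division come out left-associative, matching the iterative rewrite of the left-recursive grammar discussed earlier.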
