ANTLR
| ANTLR | |
|---|---|
| Original authors | Terence Parr and others |
| Initial release | April 10, 1992 |
| Stable release | 4.13.2 / 3 August 2024 |
| Repository | |
| Written in | Java |
| Platform | Cross-platform |
| License | BSD License |
| Website | www.antlr.org |
In computer-based language recognition, ANTLR (pronounced antler), or ANother Tool for Language Recognition, is a parser generator that uses an LL(*) algorithm for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is Professor Terence Parr of the University of San Francisco.[citation needed]
PCCTS 1.00 was announced April 10, 1992.[1][2]
Usage
ANTLR takes as input a grammar that specifies a language and generates as output source code for a recognizer of that language. While Version 3 supported generating code in the programming languages Ada95, ActionScript, C, C#, Java, JavaScript, Objective-C, Perl, Python, Ruby, and Standard ML,[3] Version 4 at present targets C#, C++, Dart,[4][5] Java, JavaScript, Go, PHP, Python (2 and 3), and Swift.
A language is specified using a context-free grammar expressed in Extended Backus–Naur Form (EBNF).[citation needed][6] ANTLR can generate lexers, parsers, tree parsers, and combined lexer-parsers. Parsers can automatically generate parse trees or abstract syntax trees, which can be further processed with tree parsers. ANTLR provides a single consistent notation for specifying lexers, parsers, and tree parsers.
By default, ANTLR reads a grammar and generates a recognizer for the language defined by the grammar (i.e., a program that reads an input stream and generates an error if the input stream does not conform to the syntax specified by the grammar). If there are no syntax errors, the default action is to simply exit without printing any message. To do something useful with the language, actions can be attached to grammar elements. These actions are written in the programming language in which the recognizer is being generated. When the recognizer is generated, the actions are embedded in its source code at the appropriate points. Actions can be used to build and check symbol tables and, in the case of a compiler, to emit instructions in a target language.[citation needed][6]
In addition to lexers and parsers, ANTLR can be used to generate tree parsers: recognizers that process abstract syntax trees, which can be automatically generated by parsers. Tree parsers are unique to ANTLR and simplify the processing of abstract syntax trees.[citation needed][6]
Licensing
ANTLR 3[citation needed] and ANTLR 4 are free software, published under a three-clause BSD License.[7] Prior versions were released as public domain software.[8] Documentation, derived from Parr's book The Definitive ANTLR 4 Reference, is included with the BSD-licensed ANTLR 4 source.[7][9]
Various plugins have been developed for the Eclipse development environment to support the ANTLR grammar, including ANTLR Studio, a proprietary product, as well as the "ANTLR 2"[10] and "ANTLR 3"[11] plugins for Eclipse hosted on SourceForge.[citation needed]
ANTLR 4
ANTLR 4 deals with direct left recursion correctly, but not with left recursion in general, i.e., indirect left recursion, where a grammar rule x refers to a rule y that in turn refers back to x.[12]
Development
As reported on the tools[13] page of the ANTLR project, plug-ins that enable features like syntax highlighting, syntax error checking and code completion are freely available for the most common IDEs (IntelliJ IDEA, NetBeans, Eclipse, Visual Studio[14] and Visual Studio Code).
Projects
Software built using ANTLR includes:
- Groovy[15]
- Jython[16]
- Hibernate[17]
- OpenJDK Compiler Grammar project, an experimental version of the javac compiler based upon a grammar written in ANTLR[18]
- Apex, Salesforce.com's programming language[citation needed]
- The expression evaluator in Numbers, Apple's spreadsheet[citation needed]
- Twitter's search query language[19]
- WebLogic Server[citation needed]
- Apache Cassandra[citation needed]
- Processing[citation needed]
- JabRef[citation needed]
- Trino (SQL query engine)
- Presto (SQL query engine)
- MySQL Workbench
Over 200 grammars implemented in ANTLR 4 are available on GitHub.[20] They range from a grammar for URLs to grammars for entire languages like C, Java, and Go.
Example
In the following example, a parser specified in ANTLR describes sums of expressions of the form "1 + 2 + 3":
// Common options, for example, the target language
options
{
language = "CSharp";
}
// Followed by the parser
class SumParser extends Parser;
options
{
k = 1; // Parser Lookahead: 1 Token
}
// Definition of an expression
statement: INTEGER (PLUS^ INTEGER)*;
// Here is the Lexer
class SumLexer extends Lexer;
options
{
k = 1; // Lexer Lookahead: 1 character
}
PLUS: '+';
DIGIT: ('0'..'9');
INTEGER: (DIGIT)+;
The following listing demonstrates how the parser is invoked from a program:
TextReader reader;
// (...) Fill the TextReader with characters
SumLexer lexer = new SumLexer(reader);
SumParser parser = new SumParser(lexer);
parser.statement();
References
[edit]- ^ "Comp.compilers: Purdue Compiler-Construction Tool Set 1.00 available". compilers.iecc.com. 10 Apr 1992. Retrieved 2023-05-05.
- ^ "Comp.compilers: More on PCCTS". compilers.iecc.com. 30 Apr 1992. Retrieved 2023-05-05.
- ^ SML/NJ Language Processing Tools: User Guide
- ^ "Runtime Libraries and Code Generation Targets". github. 6 January 2022.
- ^ "The ANTLR4 C++ runtime reached home – Soft Gems". 16 November 2016.
- ^ a b c Parr, Terence (2013-01-15). The Definitive ANTLR 4 Reference. Pragmatic Bookshelf. ISBN 978-1-68050-500-9.
- ^ a b "antlr4/LICENSE.txt". GitHub. 2017-03-30.
- ^ Parr, Terence (2004-02-05). "licensing stuff". antlr-interest (Mailing list). Archived from the original on 2011-07-18. Retrieved 2009-12-15.
- ^ "ANTLR 4 Documentation". GitHub. 2017-03-30.
- ^ "ANTLR plugin for Eclipse".
- ^ "ANTLR IDE. An eclipse plugin for ANTLR grammars".
- ^ What is the difference between ANTLR 3 & 4
- ^ "ANTLR Development Tools".
- ^ "ANTLR Language Support - Visual Studio Marketplace".
- ^ "GroovyRecognizer (Groovy 2.4.0)".
- ^ "Jython: 31d97f0de5fe".
- ^ Ebersole, Steve (2018-12-06). "Hibernate ORM 6.0.0.Alpha1 released". In Relation To, The Hibernate team blog on everything data. Retrieved 2020-07-11.
- ^ "OpenJDK: Compiler Grammar".
- ^ "ANTLR Testimonials". Retrieved 2024-10-30.
- ^ antlr/grammars-v4: Grammars written for ANTLR v4 (with the expectation that the grammars are free of actions), ANTLR Project, 2019-09-25. Retrieved 2019-09-25.
Bibliography
- Parr, Terence (May 17, 2007), The Definitive Antlr Reference: Building Domain-Specific Languages (1st ed.), Pragmatic Bookshelf, p. 376, ISBN 978-0-9787392-5-6, archived from the original on 2021-11-18, retrieved 2008-06-16
- Parr, Terence (December 2009), Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages (1st ed.), Pragmatic Bookshelf, p. 374, ISBN 978-1-934356-45-6, archived from the original on 2021-10-29, retrieved 2010-07-06
- Parr, Terence (January 15, 2013), The Definitive ANTLR 4 Reference (1st ed.), Pragmatic Bookshelf, p. 328, ISBN 978-1-93435-699-9
Further reading
- Parr, T.J.; Quong, R.W. (July 1995). "ANTLR: A Predicated-LL(k) Parser Generator". Software: Practice and Experience. 25 (7): 789–810. CiteSeerX 10.1.1.54.6015. doi:10.1002/spe.4380250705. S2CID 13453016.
External links
ANTLR
Overview
Definition and Purpose
ANTLR (ANother Tool for Language Recognition) is an open-source parser generator that produces lexers, parsers, and tree walkers from declarative grammar files written in a domain-specific language.[2] It enables developers to define the structure of languages or data formats in a concise, human-readable manner, automatically generating the corresponding recognition and processing code in target programming languages such as Java, C#, Python, or JavaScript.[2]
The primary purpose of ANTLR is to facilitate the construction of language processors, including compilers, interpreters, query engines, and data translators, by allowing users to specify syntactic and lexical rules declaratively rather than implementing them imperatively.[2] This approach is particularly valuable for handling structured text or binary files, such as programming languages, domain-specific languages, configuration formats, or network protocols, where accurate and efficient recognition of input is essential.[2] For instance, it powers query parsing in systems like Twitter's search engine, processing billions of queries daily, and supports data processing tools like Apache Hive and Pig.[2]
ANTLR was developed to simplify the creation of recognizers for complex structured data, addressing the challenges of manually coding parsers that are prone to errors and difficult to maintain. Originating from efforts dating back to 1989 by its creator, Terence Parr, it has evolved into a widely adopted tool in both academia and industry.[2] A key benefit lies in its use of LL(*) parsing, an adaptive predictive parsing strategy that achieves efficiency by lookahead without requiring backtracking in the majority of cases, making it suitable for real-time and large-scale applications.[2]
Core Components
The core components of ANTLR form a modular pipeline for processing input text into structured representations, beginning with lexical analysis and extending to syntactic parsing and tree traversal. The lexer tokenizes the input character stream into discrete tokens based on predefined grammar rules, serving as the initial stage that breaks down raw text into meaningful units like keywords, identifiers, and literals.[5] Tokens are the fundamental vocabulary symbols produced by the lexer, each encapsulating attributes such as type, text content, line number, and position, which are managed within a token stream for subsequent processing.
The parser operates on this token stream to validate syntax and construct parse trees according to context-free grammar rules, employing an adaptive LL(*) parsing algorithm that dynamically determines lookahead needs for efficient prediction of alternatives without fixed k-value limitations.[6] ANTLR distinguishes lexer rules, named in uppercase and defining token patterns using regular expressions, from parser rules, named in lowercase and specifying higher-level syntactic structures. Rules in both lexer and parser grammars support alternatives, denoted by the | operator, allowing multiple production options within a single rule to model syntactic choices like expressions or statements. Semantic predicates, embedded as boolean expressions like {condition}?, integrate application-specific logic into the grammar to resolve ambiguities or constrain parsing decisions at runtime, such as disambiguating overloaded operators based on context.
Following parsing, tree walkers facilitate traversal of the resulting parse trees, which in ANTLR 4 double as abstract syntax trees (ASTs) without requiring separate tree grammars. The ParseTreeWalker class performs depth-first traversal, invoking methods on listener or visitor objects to process nodes; listeners use callback patterns for enter/exit events on rules, while visitors enable explicit recursive descent for custom computations like symbol resolution or code generation.
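A minimal Java sketch of this pipeline, assuming ExprLexer, ExprParser, and ExprBaseListener classes generated (with the default listener option) from the Expr grammar shown in the Examples section below:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class PipelineDemo {
    public static void main(String[] args) {
        // Lexing: convert the raw character stream into tokens
        CharStream input = CharStreams.fromString("1 + 2 + 3");
        ExprLexer lexer = new ExprLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // Parsing: build a parse tree from the token stream
        ExprParser parser = new ExprParser(tokens);
        ParseTree tree = parser.expr();
        // Tree walking: depth-first traversal that reports each rule entered
        ParseTreeWalker walker = new ParseTreeWalker();
        walker.walk(new ExprBaseListener() {
            @Override
            public void enterEveryRule(ParserRuleContext ctx) {
                System.out.println("entering " + ExprParser.ruleNames[ctx.getRuleIndex()]);
            }
        }, tree);
    }
}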
History
Origins and Early Versions
ANTLR was developed by Terence Parr during his graduate studies at Purdue University, beginning in 1988 as a student project under advisor Henry G. Dietz for a compiler construction course, and evolving into his master's thesis work.[3] Initially named YUCC, the tool was renamed ANTLR (ANother Tool for Language Recognition) and became a core component of the Purdue Compiler Construction Tool Set (PCCTS), a suite aimed at simplifying compiler development through integrated lexical and syntactic analysis.[7] Motivated by the limitations of existing parser generators like YACC, which struggled with efficient lookahead for k > 1 in LL(k) and LR(k) grammars, Parr's research focused on practical variants using techniques such as grammar lookahead automata and linear approximations to manage complexity.[7] This PhD project, defended in 1993, laid the foundation for ANTLR's emphasis on flexible, high-performance parsing.[7]
The first official release, ANTLR 1.00B, arrived in February 1990 and primarily generated C++ code for LL(1) parsers, merging lexical and syntactic analysis while supporting basic abstract syntax tree (AST) construction.[3] Subsequent updates, such as version 1.06 in December 1992, introduced semantic predicates for disambiguating grammar rules and enhanced error recovery mechanisms, including better diagnostic messages.[3] By version 1.10 in August 1993, ANTLR incorporated arbitrary lookahead operators, enabling more robust handling of complex grammars without exponential computational overhead.[3] These early iterations were distributed as part of PCCTS, with sample grammars for languages like ANSI C and Pascal, fostering initial adoption among compiler developers.[8]
In the mid-1990s, following Parr's PhD and a period of refinement, ANTLR underwent a major rewrite leading to version 2.0.0 in May 1997, which added Java code generation as a primary target alongside C++, reflecting the growing popularity of Java in software development.[3] This version improved overall performance through optimized lexer strategies and introduced tree parsers, allowing users to process ASTs with grammar-like specifications for tasks such as code transformation and analysis.[9] Error handling was further enhanced with exception-based reporting derived from ANTLRException, providing finer control over recovery from syntax errors.[10] Later releases in the 2.x series, up to 2.7.5 in 2005, expanded support to additional targets like Python and incorporated features such as grammar inheritance for modular design.[3]
ANTLR 3, released on May 17, 2007, marked a significant evolution under Parr's continued leadership, including his work through the jGuru training company he founded in 1997.[11] It shifted to an adaptive LL(*) parsing strategy, which dynamically determines lookahead needs using deterministic finite automata (DFA) for efficiency, outperforming fixed-k approaches in handling ambiguous grammars.[11] Performance gains were achieved through a redesigned recursive-descent lexer, and code generation became more flexible via the StringTemplate engine, supporting multiple targets including Java, C#, Python, and C.[11] Released under a clean BSD license with contributor agreements to encourage community involvement, ANTLR 3 transitioned fully to an open-source model, broadening its adoption beyond academic circles.[11]
Evolution to ANTLR 4
ANTLR 4 was developed as a complete rewrite of its predecessor to address key limitations in ANTLR 3, particularly the inability to directly handle left-recursive rules, which are common in grammars for expressions and other language constructs. This redesign enabled ANTLR 4 to automatically convert left-recursive productions into equivalent non-left-recursive forms, simplifying grammar authoring for complex structures without manual refactoring.[12][6]
The primary motivations for ANTLR 4 included enhancing support for ambiguous grammars and applications in natural language processing, where traditional LL parsers often struggle with nondeterminism and context sensitivity. By adopting an adaptive LL(*) parsing strategy, ANTLR 4 improved error recovery and prediction, making it more robust for real-world languages that exhibit ambiguity, while also easing maintenance through cleaner semantics and reduced boilerplate in generated code. These changes aimed to demystify parsing for developers building tools like query processors and configuration parsers.[6][12]
Development began in earnest around 2012, with Terence Parr leading the effort alongside contributors like Sam Harwell, through open-source collaboration on GitHub. Beta and release candidate versions, such as 4.0-rc-1, were made available in December 2012 to gather community feedback. The stable 4.0 release launched on January 21, 2013, coinciding with the publication of Parr's book, The Definitive ANTLR 4 Reference, which served as a comprehensive guide to the new version's features and usage.[13][14][15]
Since its stable debut, ANTLR 4 has seen continuous updates, with ongoing improvements to runtimes for targets including Java, Python, and JavaScript, focusing on performance optimizations, bug fixes, and expanded language support. Key milestones include the addition of new targets like TypeScript in version 4.12.0 and general stability enhancements up to the latest release, 4.13.2, on August 3, 2024, ensuring compatibility and efficiency across diverse environments.[16][4]
Technical Architecture
Grammar Syntax
ANTLR grammars are defined using a declarative syntax in files with a .g4 extension, where the grammar name must match the filename without the extension. The structure begins with a header declaration such as grammar Name;, followed by optional sections for options, imports, tokens, channels (for lexers), and named actions, before the main rules section containing parser and lexer rules.[17] This format allows for combined, parser-only (parser grammar Name;), or lexer-only (lexer grammar Name;) grammars, providing flexibility in defining lexical and syntactic structures.[17]
Parser rules, which define the syntactic structure, start with a lowercase letter (e.g., expr), while lexer rules, which handle tokenization, begin with an uppercase letter (e.g., ID). Each rule consists of one or more alternatives separated by the pipe operator |, terminated by a semicolon, enabling the specification of multiple matching patterns for a given rule.[17] Elements and alternatives within rules can be labeled (e.g., left=expr for an element, or # Add appended to an alternative) to facilitate access to parse tree nodes during traversal.
Semantic predicates, denoted by { predicate }?, allow for conditional matching based on arbitrary code evaluation during parsing, enhancing the expressiveness of rules for context-sensitive grammars.[18] Actions enclosed in curly braces {...} embed arbitrary code that executes at specific points, such as during rule entry or exit, to perform side effects like variable assignments. Lexer modes, introduced with mode ModeName;, enable switching between different lexical states, useful for handling embedded languages or varying tokenization contexts within the same grammar.[17]
For token definitions, ANTLR supports wildcards with the dot . to match any single character, character ranges like [a-z] for sets of characters, and string literals enclosed in single quotes ('literal') for exact matches. Grammar-level actions include directives like @header { ... } to inject code into generated files, @members { ... } for shared class members (target-specific, e.g., Java), and scoped actions such as @parser::init { ... } for rule-specific initialization.[17] Options, set via options { language=Java; }, configure aspects like the target language for code generation.
Parsing Process
The parsing process in ANTLR involves three primary phases: lexing, parsing, and tree walking. During lexing, the lexer scans the input character stream and applies lexical rules defined in the grammar to produce a sequential token stream, which serves as the input for subsequent phases. This phase handles tokenization, including context-sensitive rules in ANTLR 4, where the adaptive LL(*) strategy enables efficient processing of ambiguous lexemes like nested comments.[6]
In the parsing phase, the generated parser consumes the token stream using an adaptive LL(*) algorithm to construct a concrete parse tree representing the syntactic structure of the input. The LL(*) approach is a predictive top-down strategy that employs an Augmented Transition Network (ATN) to model grammar decisions, allowing arbitrary lookahead while minimizing backtracking through static construction of lookahead Deterministic Finite Automata (DFAs) for each decision point. In ANTLR 4, this evolves into the ALL(*) variant, which dynamically analyzes the grammar at runtime to resolve nondeterminism, supporting direct left-recursion via integrated subparsers and graph-structured stacks without exponential complexity in practice. The parser traverses the ATN, predicting alternatives based on input lookahead and semantic predicates, ensuring decisions are made with sufficient tokens (typically 1-2 on average) to disambiguate paths.[19][6]
Tree walking constitutes the final phase, where generated listeners or visitors recursively traverse the parse tree to perform semantic analysis, code generation, or other post-processing tasks. This separation enables multiple walks over the same tree for efficiency, avoiding re-parsing the input.[20]
ANTLR incorporates robust error handling during lexing and parsing to maintain progress through faulty input. The default error strategy implements automatic recovery by attempting single-token insertion for missing elements or deletion for extraneous tokens, followed by synchronization via consumption until a valid follow-set token is encountered. Custom error listeners can be implemented to override reporting behaviors, such as suppressing duplicates or providing tailored diagnostics, while the parser notifies listeners of events like input mismatches or failed predicates.[21][22]
For unambiguous grammars, the parsing process achieves O(n) time complexity, processing input linearly by consuming each token once, with empirical speeds reaching thousands of lines per second on large corpora like Java source code. This efficiency stems from the adaptive lookahead's ability to cache DFAs and throttle analysis to essential decisions, outperforming general parsing methods like GLR by orders of magnitude in practical scenarios.[19][6]
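As a sketch of the custom error reporting described above, a listener derived from the runtime's BaseErrorListener can replace the default console reporting; ExprParser here is assumed to be generated from the Expr grammar in the Examples section:
import org.antlr.v4.runtime.*;
import java.util.ArrayList;
import java.util.List;
// Collects syntax errors instead of printing them to the console.
public class CollectingErrorListener extends BaseErrorListener {
    private final List<String> errors = new ArrayList<>();
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        errors.add("line " + line + ":" + charPositionInLine + " " + msg);
    }
    public List<String> getErrors() { return errors; }
}
// Typical wiring (ExprParser assumed to be generated from Expr.g4):
//   ExprParser parser = new ExprParser(tokens);
//   parser.removeErrorListeners();                      // drop the default ConsoleErrorListener
//   parser.addErrorListener(new CollectingErrorListener());
//   parser.expr();                                      // errors are gathered rather than printed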
ANTLR 4 Features
Key Improvements
ANTLR 4 introduced direct support for left-recursive rules, allowing grammars to define productions like expr : expr '+' term | term; without risking infinite loops during parsing. This advancement eliminates the need for manual refactoring into right-recursive forms, which was required in ANTLR 3 to avoid recursion issues. The parser achieves this through an adaptive LL(*) strategy that rewrites left-recursive rules into equivalent non-recursive forms using precedence parameters and semantic predicates, ensuring unambiguous resolution while preserving the original parse tree structure.[6]
A key enhancement in ambiguity resolution comes from ANTLR 4's ordered alternatives mechanism, where the parser prioritizes alternatives based on their order in the grammar file for cases where multiple rules could match the input. This predictable behavior simplifies grammar design for ambiguous languages, as the first matching alternative (with the lowest production number) is selected, akin to how PEG parsers operate. For instance, in expressions with operators of varying precedence, ordered alternatives ensure left-associativity without additional annotations, improving both usability and performance over ANTLR 3's less deterministic handling.[6]
ANTLR 4 promotes a cleaner separation between lexing and parsing phases compared to ANTLR 3, where developers often relied on ad-hoc lexer hacks like island grammars or embedded actions to handle context-sensitive tokens. The introduction of lexical modes enables the lexer to switch between distinct rule sets dynamically—for example, recognizing different tokens inside strings versus code—without contaminating parser logic. This modular design reduces complexity and errors in grammar maintenance, as tokenization occurs independently while still integrating seamlessly with the parser.[23]
Tooling in ANTLR 4 has been significantly upgraded with integrated IDE support and visualization capabilities, enhancing developer productivity beyond ANTLR 3's basic features. Plugins for environments like IntelliJ IDEA, Eclipse, and Visual Studio Code provide syntax highlighting, real-time error detection, code completion for rules and tokens, and live grammar interpretation. Grammar visualization tools, such as parse tree viewers and ATN (Augmented Transition Network) diagrams, allow users to inspect generated structures interactively. Additionally, runtime APIs like BaseErrorListener offer customizable error reporting, including detailed messages for ambiguities and recovery suggestions, facilitating robust application integration.[24]
Listener and Visitor Patterns
In ANTLR 4, the listener and visitor patterns provide mechanisms for traversing and processing parse trees generated by the parser, enabling users to perform actions such as semantic analysis, code generation, or tree transformations without modifying the core parsing logic. These patterns are implemented through generated interfaces and base classes tailored to the grammar rules, allowing for modular extension of parser functionality.[25][26]
The listener pattern follows an event-driven approach, where a ParseTreeWalker traverses the parse tree in a depth-first manner and invokes callback methods on a listener object at key points during the traversal. ANTLR generates a ParseTreeListener interface specific to the grammar (e.g., MyGrammarListener), which declares enterRule and exitRule methods for each rule in the grammar, along with methods for terminal nodes, error nodes, and every rule context. Users typically extend the generated abstract base class MyGrammarBaseListener, which provides no-op implementations, and override only the desired methods to perform side-effect operations, such as printing node information or updating global state. For instance, to process a rule named expression, one might override enterExpression to initialize a computation and exitExpression to finalize it based on child results. The traversal is initiated by calling ParseTreeWalker.DEFAULT.walk(listener, parseTree), ensuring automatic, non-recursive invocation of methods without user control over the order or depth. This pattern is particularly suited for tasks involving side effects, like logging or building data structures incrementally, as it decouples the processing logic from the tree structure itself.[27][25]
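A sketch of the listener pattern, assuming ExprLexer, ExprParser, and ExprBaseListener classes generated from the Expr grammar in the Examples section:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
// Listener that prints every integer literal encountered during the walk.
public class IntPrinter extends ExprBaseListener {
    @Override
    public void exitExpr(ExprParser.ExprContext ctx) {
        if (ctx.INT() != null) {                       // the alternative that matched a bare INT
            System.out.println("saw integer: " + ctx.INT().getText());
        }
    }
    public static void main(String[] args) {
        ExprLexer lexer = new ExprLexer(CharStreams.fromString("3 + 4"));
        ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.expr();
        ParseTreeWalker.DEFAULT.walk(new IntPrinter(), tree);   // side effects only, no return values
    }
}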
In contrast, the visitor pattern employs a recursive traversal model inspired by the classic design pattern, where the visitor object explicitly calls visit methods on each node and aggregates results from children. ANTLR generates a ParseTreeVisitor<T> interface (e.g., MyGrammarVisitor<T>), where T is the return type for computations (using Void if no value is needed), including a generic visit method and specialized visitRule methods for each grammar rule, plus handlers for terminals and errors. Users implement or extend the abstract MyGrammarBaseVisitor<T>, overriding visitRule methods to recurse into children via visitChildren and combine results—for example, in an expression evaluator, visitAddExpr might return the sum of visit(left) and visit(right). Unlike listeners, visitors allow returning values up the call stack, enabling functional-style computations, and provide control to skip subtrees by returning early or overriding visitChildren to customize aggregation. Traversal starts with a top-level call like visitor.visit(parseTree), making it ideal for tasks requiring computed outputs, such as evaluating expressions or translating to another representation.[26]
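A sketch of the visitor pattern as a small expression evaluator, assuming ExprBaseVisitor and related classes generated with the -visitor option from the same Expr grammar:
import org.antlr.v4.runtime.*;
// Visitor that computes the value of an expression and returns results up the call stack.
public class EvalVisitor extends ExprBaseVisitor<Integer> {
    @Override
    public Integer visitExpr(ExprParser.ExprContext ctx) {
        if (ctx.INT() != null) {                       // leaf: a bare integer literal
            return Integer.valueOf(ctx.INT().getText());
        }
        int left = visit(ctx.expr(0));                 // recurse into the operand subtrees
        int right = visit(ctx.expr(1));
        String op = ctx.getChild(1).getText();         // '+' or '-'
        return op.equals("+") ? left + right : left - right;
    }
    public static void main(String[] args) {
        ExprLexer lexer = new ExprLexer(CharStreams.fromString("1 + 2 - 3"));
        ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
        System.out.println(new EvalVisitor().visit(parser.expr()));   // prints 0
    }
}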
The primary differences between the patterns lie in their traversal control and result handling: listeners rely on the walker's fixed depth-first order for side effects without returns, promoting simplicity for event-like processing, while visitors offer flexibility for recursive aggregation and early termination, better for value-oriented operations like symbol table construction. Both support customization by selectively overriding methods in the base classes, minimizing boilerplate; listener code is generated by default, while visitor code requires the -visitor command-line flag (and -no-listener suppresses listener generation). For listeners, exception handling during traversal can be managed by overriding the walker's triggerExitRuleEvent to avoid halting the parse. These patterns integrate seamlessly with ANTLR's parse tree generation, allowing post-parse processing without altering the parser itself.[27][26]
Usage and Implementation
Basic Workflow
The basic workflow for using ANTLR involves defining a grammar, generating parser code, integrating it into an application, and testing the implementation. This process enables developers to create parsers for custom languages or data formats without manually coding the parsing logic. ANTLR automates the generation of lexical analyzers, parsers, and tree walkers from a declarative grammar specification, streamlining the development of language processors.[28]
The first step is to write a grammar file with the .g4 extension, specifying lexer rules for tokenization and parser rules for syntax structure. For instance, a simple arithmetic expression grammar might define rules for numbers, operators, and expressions, but the focus remains on the overall structure rather than specific syntax details. Once the grammar is defined, the ANTLR tool is invoked to generate source code in the target language, such as Java or Python. The command-line interface (CLI) tool, antlr4, is run from the directory containing the .g4 file, for example: antlr4 MyGrammar.g4. This generates files including the lexer, parser, and base listener classes. Options like -Dlanguage=Python3 specify the target runtime, while -o outputdir directs the output location.[28][4]
After generation, the produced code must be compiled alongside the application's source files using the appropriate compiler for the target language. For Java, this involves javac on the generated .java files. The application then implements custom logic by extending the generated base listener or visitor classes to traverse the parse tree and perform actions, such as building an abstract syntax tree or evaluating expressions. Finally, input text is fed to the parser instance, which tokenizes it, builds the parse tree, and invokes the listener or visitor methods to process the output.[28]
For larger projects, ANTLR integrates with build tools like Maven and Gradle to automate grammar processing. In Maven, the antlr4-maven-plugin is configured in the pom.xml file under the build plugins section, specifying the grammar source directory (default: src/main/antlr4) and running during the generate-sources phase to produce code before compilation. Similarly, Gradle's built-in ANTLR plugin applies the antlr extension to tasks, processing grammars from src/antlr and generating sources integrated into the Java or other plugin tasks. These integrations ensure consistent code generation across builds.[29][30]
Testing forms a crucial part of the workflow, with ANTLR providing the TestRig tool (also known as grun in some distributions) for validating grammars. Invoked as grun MyGrammar startRule -gui, it parses input text, displays the parse tree visually, or outputs it in text form with -tree. Unit tests can validate specific inputs against expected parse results, helping identify issues early. For example, piping input like 10 + 20 to the tool confirms correct parsing of the start rule.[28]
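A unit test along these lines, sketched with JUnit 4 and lexer/parser classes assumed to be generated from the Expr grammar in the Examples section:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
import org.junit.Test;
import static org.junit.Assert.assertEquals;
public class ExprGrammarTest {
    // Parses a string and returns the LISP-style tree text,
    // the same form printed by the TestRig's -tree option.
    private String parseToTree(String input) {
        ExprLexer lexer = new ExprLexer(CharStreams.fromString(input));
        ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.expr();
        return tree.toStringTree(parser);
    }
    @Test
    public void additionProducesNestedExprNodes() {
        assertEquals("(expr (expr 10) + (expr 20))", parseToTree("10 + 20"));
    }
}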
Common pitfalls in this workflow include grammar ambiguities, where multiple parse paths exist for the same input, leading to nondeterministic behavior or performance issues. These are often resolved through refactoring the grammar structure, such as defining hierarchical rules for operator precedence (e.g., expr : addExpr; addExpr : multExpr (('+'|'-') multExpr)*;) or reordering rules to favor the intended interpretation during predictive parsing. Ensuring the grammar is LL(*) compatible by avoiding ambiguous lexer rules and leveraging ANTLR 4's support for left recursion prevents such problems during code generation.[28]
Code Generation Targets
ANTLR generates parser and lexer code in various programming languages, known as code generation targets, allowing developers to integrate the resulting parsers into applications written in those languages. The primary targets include Java, which serves as the default and most full-featured option with comprehensive support for features like XPath-based tree navigation; C#; JavaScript and TypeScript; Python; Go; and C++. These targets are selected using the -Dlanguage option during code generation with the ANTLR tool, for example, java -jar antlr-4.x-complete.jar -Dlanguage=Python3 MyGrammar.g4.[4][31]
Each target relies on a dedicated runtime library to handle core functionality such as token streams, parse trees, error reporting, and listener/visitor pattern implementations. For Java, the runtime is included in the org.antlr.v4.runtime package within the ANTLR complete JAR, providing classes like CommonTokenStream for token management and ParseTreeWalker for tree traversal. Similarly, C# uses the Antlr4.Runtime.Standard NuGet package for equivalent features; Python employs the antlr4-python3-runtime pip package with modules like InputStream and ParseTreeListener; JavaScript and TypeScript share the antlr4 npm package; Go utilizes the github.com/antlr/antlr4/runtime/Go/antlr module; and C++ draws from runtime sources built via tools like Conan. These libraries ensure consistent behavior across targets while adapting to language-specific idioms, such as async support in JavaScript.[32]
In addition to the primary targets, ANTLR provides support for Swift, PHP, and Dart. Developers should consult the official documentation for compatibility details, as some advanced features like ambiguous tree construction may not be fully available in these targets.[31]
Examples
Simple Grammar Example
A simple grammar in ANTLR defines rules for recognizing basic arithmetic expressions involving addition and subtraction. Consider the following grammar file, named Expr.g4, which specifies a left-recursive parser rule for expressions and lexer rules for integers and whitespace:[1]
grammar Expr;
expr: expr ('+' | '-') expr
| INT
;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Because ANTLR 4 handles the left-recursive expr rule automatically, an input such as 3+4 can be parsed unambiguously.[1]
When Expr.g4 is processed with the ANTLR tool (version 4), it generates Java source files including ExprLexer.java for tokenization and ExprParser.java for parsing the token stream into a concrete syntax tree.[1]
To use the generated parser in a Java application, load input from a stream, create the lexer and parser instances, invoke the entry rule, and visualize the parse tree. The following snippet demonstrates this for the input "3 + 4" (assuming necessary imports and ANTLR runtime on the classpath):[1]
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class ExprTest {
public static void main(String[] args) throws Exception {
String input = "3 + 4";
CharStream stream = CharStreams.fromString(input);
ExprLexer lexer = new ExprLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExprParser parser = new ExprParser(tokens);
ParseTree tree = parser.expr();
System.out.println(tree.toStringTree(parser));
}
}
Running this program tokenizes the input into INT tokens for 3 and 4 plus a '+' token, while skipping whitespace. The resulting parse tree output is (expr (expr 3) + (expr 4)), illustrating the hierarchical structure: the top-level expr rule matches the addition, with each operand as a subexpression matching INT.[1]
Real-World Application
ANTLR has been employed in production environments to develop sophisticated parsers for query languages, exemplified by its use in Twitter's search infrastructure where it parses complex user queries to enable efficient retrieval from vast datasets.[1] In a detailed case study, Bytebase integrated ANTLR4 to build a SQL autocomplete framework supporting multiple dialects, starting with a grammar that defines core structures like SELECT and FROM clauses to handle real-world query variations.[33]
A representative SQL-like query parser grammar, drawn from the official ANTLR grammars repository, illustrates this application. The grammar specifies rules for SELECT statements, including column selections and table sources, while addressing ambiguities such as keyword-identifier overlaps through mode switching in the lexer. For instance:
query: selectStatement EOF ;
selectStatement
: SELECT selectItem (',' selectItem)* FROM tableSource
;
selectItem: qualifiedName ;
tableSource: qualifiedName ;
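A visitor over the resulting parse tree can then perform semantic checks such as type validation. The sketch below extends a base visitor class, assumed to be generated from a PostgreSQL grammar, and delegates per-item checking to a private helper: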
public class QueryVisitor extends PostgreSQLParserBaseVisitor<Object> {
@Override
public Object visitSelectStatement(PostgreSQLParser.SelectStatementContext ctx) {
// Type check select items
for (var item : ctx.selectItem()) {
checkSelectItem(item); // Validate the type of each selected expression
}
return visitChildren(ctx);
}
private void checkSelectItem(PostgreSQLParser.SelectItemContext ctx) {
// Implement type inference and checking logic here
// e.g., resolve column types from the schema
}
}
A custom error listener can override syntaxError to provide user-friendly messages, such as "Expected column name after SELECT" with line and position details, improving usability in interactive environments.[33][36]
The benefits manifest in scalability, as seen in ANTLR's application to JSON validators—where grammars parse and validate nested structures efficiently—and domain-specific languages (DSLs), supporting high-throughput processing in tools like Twitter's query system without performance degradation for large inputs.[1]
Ecosystem
Supported Languages
ANTLR grammars, defined using the .g4 file format, are inherently language-agnostic, allowing the same grammar to generate parser code for various host programming languages without modification.[13] The ANTLR 4 runtime provides full support for over 10 target languages, enabling the execution of generated parsers in diverse environments; these include C++, C#, Dart, Java (version 8 and later), JavaScript, Go, PHP, Python 3, Swift, and TypeScript.[13] Experimental or third-party runtimes exist for additional languages such as Rust, which offers a community-maintained implementation via the antlr4rust crate but lacks official integration as of 2025.[37] All official runtimes maintain consistent versioning and API compatibility across targets to facilitate cross-language development.[13]
Runtime compatibility emphasizes robust handling of international text and concurrent operations tailored to each language's ecosystem. For Unicode support, ANTLR runtimes incorporate mechanisms like code-point buffering in Java and equivalent string handling in other targets, ensuring accurate parsing of UTF-8 and other encodings without locale-specific byte limitations. Threading models vary by language but generally require dedicated parser instances per thread for safety, as the core runtime structures like the DFA cache are shared and thread-safe, while mutable components such as lexers and parsers are not.[38][39]
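A short Java sketch of these recommendations, assuming generated ExprLexer and ExprParser classes and hypothetical input file names; each worker builds its own lexer and parser, while UTF-8 input is decoded through the runtime's CharStreams factory:
import org.antlr.v4.runtime.*;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
public class ParallelParsing {
    public static void main(String[] args) {
        List<String> files = List.of("a.expr", "b.expr");   // hypothetical input files
        files.parallelStream().forEach(ParallelParsing::parseFile);
    }
    static void parseFile(String fileName) {
        try {
            // Code-point based stream; the explicit charset decodes UTF-8 input.
            CharStream input = CharStreams.fromFileName(fileName, StandardCharsets.UTF_8);
            // Lexer and parser instances are not thread-safe, so each thread
            // constructs its own; the generated classes share a thread-safe DFA cache.
            ExprLexer lexer = new ExprLexer(input);
            ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
            parser.expr();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}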
To enhance development, ANTLR offers extensions in the form of IDE plugins for grammar editing and integration. Notable examples include the official ANTLR v4 plugin for IntelliJ IDEA and other JetBrains IDEs, which provides syntax highlighting, grammar visualization, and code generation, and the ANTLR Language Support extension for Visual Studio Code, supporting both ANTLR 3 and 4 grammars with features like error detection and preview.[24][40][41]
