Recent from talks
Contribute something
Nothing was collected or created yet.
Compiler
View on Wikipedia
| Program execution |
|---|
| General concepts |
| Types of code |
| Compilation strategies |
| Notable runtimes |
|
| Notable compilers & toolchains |
|
In computing, a compiler is software that translates computer code written in one programming language (the source language) into another language (the target language). The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language (e.g. assembly language, object code, or machine code) to create an executable program.[1][2]: p1 [3]
There are many different types of compilers which produce output in different useful forms. A cross-compiler produces code for a different CPU or operating system than the one on which the cross-compiler itself runs. A bootstrap compiler is often a temporary compiler, used for compiling a more permanent or better optimized compiler for a language.
Related software include decompilers, programs that translate from low-level languages to higher level ones; programs that translate between high-level languages, usually called source-to-source compilers or transpilers; language rewriters, usually programs that translate the form of expressions without a change of language; and compiler-compilers, compilers that produce compilers (or parts of them), often in a generic and reusable way so as to be able to produce many differing compilers.
A compiler is likely to perform some or all of the following operations, often called phases: preprocessing, lexical analysis, parsing, semantic analysis (syntax-directed translation), conversion of input programs to an intermediate representation, code optimization and machine specific code generation. Compilers generally implement these phases as modular components, promoting efficient design and correctness of transformations of source input to target output. Program faults caused by incorrect compiler behavior can be very difficult to track down and work around; therefore, compiler implementers invest significant effort to ensure compiler correctness.[4]
Comparison with interpreter
[edit]With respect to making source code runnable, an interpreter provides a similar function as a compiler, but via a different mechanism. An interpreter executes code without converting it to machine code.[2]: p2 Therefore, some interpreters execute source code while others execute an intermediate form such as bytecode.
Hence a program compiled to native code tends to run faster than when interpreted. Environments with a bytecode-intermediate-form tends toward intermediate-speed. While Just-in-time compilation allows for native execution speed with a one-time startup processing time cost.
For Low-level programming languages, such as assembly and C, it is typical that they are compiled, especially when speed is a significant concern, rather than being cross-platform supported. So that for such languages, there are more one-to-one correspondences between the source code and the resulting machine code, making it easier for programmers to control the use of hardware.
In theory; a programming language can be used via either a compiler or an interpreter, but in practice, each language tends to be used with only one or the other. Nonetheless, it is possible to write a compiler for a language that is commonly interpreted. For example, Common Lisp can be compiled to Java bytecode (and then interpreted by the Java virtual machine), as well as C code (then compiled to native machine code), or directly to native code.
History
[edit]
Theoretical computing concepts developed by scientists, mathematicians, and engineers formed the basis of digital modern computing development during World War II. Primitive binary languages evolved because digital devices only understand ones and zeros and the circuit patterns in the underlying machine architecture. In the late 1940s, assembly languages were created to offer a more workable abstraction of the computer architectures.[5] Limited memory capacity of early computers led to substantial technical challenges when the first compilers were designed. Therefore, the compilation process needed to be divided into several small programs. The front end programs produce the analysis products used by the back end programs to generate target code. As computer technology provided more resources, compiler designs could align better with the compilation process.
It is usually more productive for a programmer to use a high-level language, so the development of high-level languages followed naturally from the capabilities offered by digital computers. High-level languages are formal languages that are strictly defined by their syntax and semantics which form the high-level language architecture. Elements of these formal languages include:
- Alphabet, any finite set of symbols;
- String, a finite sequence of symbols;
- Language, any set of strings on an alphabet.
The sentences in a language may be defined by a set of rules called a grammar.[6]
Backus–Naur form (BNF) describes the syntax of "sentences" of a language. It was developed by John Backus and used for the syntax of Algol 60.[7] The ideas derive from the context-free grammar concepts by linguist Noam Chomsky.[8] "BNF and its extensions have become standard tools for describing the syntax of programming notations. In many cases, parts of compilers are generated automatically from a BNF description."[9]
Between 1942 and 1945, Konrad Zuse designed the first (algorithmic) programming language for computers called Plankalkül ("Plan Calculus"). Zuse also envisioned a Planfertigungsgerät ("Plan assembly device") to automatically translate the mathematical formulation of a program into machine-readable punched film stock.[10] While no actual implementation occurred until the 1970s, it presented concepts later seen in APL designed by Ken Iverson in the late 1950s.[11] APL is a language for mathematical computations.
Between 1949 and 1951, Heinz Rutishauser proposed Superplan, a high-level language and automatic translator.[12] His ideas were later refined by Friedrich L. Bauer and Klaus Samelson.[13]
High-level language design during the formative years of digital computing provided useful programming tools for a variety of applications:
- FORTRAN (Formula Translation) for engineering and science applications is considered to be one of the first actually implemented high-level languages and first optimizing compiler.[14][independent source needed]
- COBOL (Common Business-Oriented Language) evolved from A-0 and FLOW-MATIC to become the dominant high-level language for business applications.[15]
- LISP (List Processor) for symbolic computation.[16]
Compiler technology evolved from the need for a strictly defined transformation of the high-level source program into a low-level target program for the digital computer. The compiler could be viewed as a front end to deal with the analysis of the source code and a back end to synthesize the analysis into the target code. Optimization between the front end and back end could produce more efficient target code.[17]
Some early milestones in the development of compiler technology:
- May 1952: Grace Hopper's team at Remington Rand wrote the compiler for the A-0 programming language (and coined the term compiler to describe it),[18][19][20] although the A-0 compiler functioned more as a loader or linker than the modern notion of a full compiler.[21][22][23]
- 1952, before September: An Autocode compiler developed by Alick Glennie for the Manchester Mark I computer at the University of Manchester is considered by some to be the first compiled programming language.[24]
- 1954–1957: A team led by John Backus at IBM developed FORTRAN which is usually considered the first high-level language. In 1957, they completed a FORTRAN compiler that is generally credited as having introduced the first unambiguously complete compiler.[25]
- 1959: The Conference on Data Systems Language (CODASYL) initiated development of COBOL. The COBOL design drew on A-0 and FLOW-MATIC. By the early 1960s COBOL was compiled on multiple architectures.
- 1958–1960: Algol 58 was the precursor to ALGOL 60. It introduced code blocks, a key advance in the rise of structured programming. ALGOL 60 was the first language to implement nested function definitions with lexical scope. It included recursion. Its syntax was defined using BNF. ALGOL 60 inspired many languages that followed it. Tony Hoare remarked: "... it was not only an improvement on its predecessors but also on nearly all its successors."[26][27]
- 1958–1962: John McCarthy at MIT designed LISP.[28] The symbol processing capabilities provided useful features for artificial intelligence research. In 1962, LISP 1.5 release noted some tools: an interpreter written by Stephen Russell and Daniel J. Edwards, a compiler and assembler written by Tim Hart and Mike Levin.[29]
Early operating systems and software were written in assembly language. In the 1960s and early 1970s, the use of high-level languages for system programming was still controversial due to resource limitations. However, several research and industry efforts began the shift toward high-level systems programming languages, for example, BCPL, BLISS, B, and C.
BCPL (Basic Combined Programming Language) designed in 1966 by Martin Richards at the University of Cambridge was originally developed as a compiler writing tool.[30] Several compilers have been implemented, Richards' book provides insights to the language and its compiler.[31] BCPL was not only an influential systems programming language that is still used in research[32] but also provided a basis for the design of B and C languages.
BLISS (Basic Language for Implementation of System Software) was developed for a Digital Equipment Corporation (DEC) PDP-10 computer by W. A. Wulf's Carnegie Mellon University (CMU) research team. The CMU team went on to develop BLISS-11 compiler one year later in 1970.
Multics (Multiplexed Information and Computing Service), a time-sharing operating system project, involved MIT, Bell Labs, General Electric (later Honeywell) and was led by Fernando Corbató from MIT.[33] Multics was written in the PL/I language developed by IBM and IBM User Group.[34] IBM's goal was to satisfy business, scientific, and systems programming requirements. There were other languages that could have been considered but PL/I offered the most complete solution even though it had not been implemented.[35] For the first few years of the Multics project, a subset of the language could be compiled to assembly language with the Early PL/I (EPL) compiler by Doug McIlory and Bob Morris from Bell Labs.[36] EPL supported the project until a boot-strapping compiler for the full PL/I could be developed.[37]
Bell Labs left the Multics project in 1969, and developed a system programming language B based on BCPL concepts, written by Dennis Ritchie and Ken Thompson. Ritchie created a boot-strapping compiler for B and wrote Unics (Uniplexed Information and Computing Service) operating system for a PDP-7 in B. Unics eventually became spelled Unix.
Bell Labs started the development and expansion of C based on B and BCPL. The BCPL compiler had been transported to Multics by Bell Labs and BCPL was a preferred language at Bell Labs.[38] Initially, a front-end program to Bell Labs' B compiler was used while a C compiler was developed. In 1971, a new PDP-11 provided the resource to define extensions to B and rewrite the compiler. By 1973 the design of C language was essentially complete and the Unix kernel for a PDP-11 was rewritten in C. Steve Johnson started development of Portable C Compiler (PCC) to support retargeting of C compilers to new machines.[39][40]
Object-oriented programming (OOP) offered some interesting possibilities for application development and maintenance. OOP concepts go further back but were part of LISP and Simula language science.[41] Bell Labs became interested in OOP with the development of C++.[42] C++ was first used in 1980 for systems programming. The initial design leveraged C language systems programming capabilities with Simula concepts. Object-oriented facilities were added in 1983.[43] The Cfront program implemented a C++ front-end for C84 language compiler. In subsequent years several C++ compilers were developed as C++ popularity grew.
In many application domains, the idea of using a higher-level language quickly caught on. Because of the expanding functionality supported by newer programming languages and the increasing complexity of computer architectures, compilers became more complex.
DARPA (Defense Advanced Research Projects Agency) sponsored a compiler project with Wulf's CMU research team in 1970. The Production Quality Compiler-Compiler PQCC design would produce a Production Quality Compiler (PQC) from formal definitions of source language and the target.[44] PQCC tried to extend the term compiler-compiler beyond the traditional meaning as a parser generator (e.g., Yacc) without much success. PQCC might more properly be referred to as a compiler generator.
PQCC research into code generation process sought to build a truly automatic compiler-writing system. The effort discovered and designed the phase structure of the PQC. The BLISS-11 compiler provided the initial structure.[45] The phases included analyses (front end), intermediate translation to virtual machine (middle end), and translation to the target (back end). TCOL was developed for the PQCC research to handle language specific constructs in the intermediate representation.[46] Variations of TCOL supported various languages. The PQCC project investigated techniques of automated compiler construction. The design concepts proved useful in optimizing compilers and compilers for the (since 1995, object-oriented) programming language Ada.
The Ada STONEMAN document[a] formalized the program support environment (APSE) along with the kernel (KAPSE) and minimal (MAPSE). An Ada interpreter NYU/ED supported development and standardization efforts with the American National Standards Institute (ANSI) and the International Standards Organization (ISO). Initial Ada compiler development by the U.S. Military Services included the compilers in a complete integrated design environment along the lines of the STONEMAN document. Army and Navy worked on the Ada Language System (ALS) project targeted to DEC/VAX architecture while the Air Force started on the Ada Integrated Environment (AIE) targeted to IBM 370 series. While the projects did not provide the desired results, they did contribute to the overall effort on Ada development.[47]
Other Ada compiler efforts got underway in Britain at the University of York and in Germany at the University of Karlsruhe. In the U. S., Verdix (later acquired by Rational) delivered the Verdix Ada Development System (VADS) to the Army. VADS provided a set of development tools including a compiler. Unix/VADS could be hosted on a variety of Unix platforms such as DEC Ultrix and the Sun 3/60 Solaris targeted to Motorola 68020 in an Army CECOM evaluation.[48] There were soon many Ada compilers available that passed the Ada Validation tests. The Free Software Foundation GNU project developed the GNU Compiler Collection (GCC) which provides a core capability to support multiple languages and targets. The Ada version GNAT is one of the most widely used Ada compilers. GNAT is free but there is also commercial support, for example, AdaCore, was founded in 1994 to provide commercial software solutions for Ada. GNAT Pro includes the GNU GCC based GNAT with a tool suite to provide an integrated development environment.
High-level languages continued to drive compiler research and development. Focus areas included optimization and automatic code generation. Trends in programming languages and development environments influenced compiler technology. More compilers became included in language distributions (PERL, Java Development Kit) and as a component of an IDE (VADS, Eclipse, Ada Pro). The interrelationship and interdependence of technologies grew. The advent of web services promoted growth of web languages and scripting languages. Scripts trace back to the early days of Command Line Interfaces (CLI) where the user could enter commands to be executed by the system. User Shell concepts developed with languages to write shell programs. Early Windows designs offered a simple batch programming capability. The conventional transformation of these language used an interpreter. While not widely used, Bash and Batch compilers have been written. More recently sophisticated interpreted languages became part of the developers tool kit. Modern scripting languages include PHP, Python, Ruby and Lua. (Lua is widely used in game development.) All of these have interpreter and compiler support.[49]
"When the field of compiling began in the late 50s, its focus was limited to the translation of high-level language programs into machine code ... The compiler field is increasingly intertwined with other disciplines including computer architecture, programming languages, formal methods, software engineering, and computer security."[50] The "Compiler Research: The Next 50 Years" article noted the importance of object-oriented languages and Java. Security and parallel computing were cited among the future research targets.
Compiler construction
[edit]This section includes a list of general references, but it lacks sufficient corresponding inline citations. (December 2019) |
A compiler implements a formal transformation from a high-level source program to a low-level target program. Compiler design can define an end-to-end solution or tackle a defined subset that interfaces with other compilation tools e.g. preprocessors, assemblers, linkers. Design requirements include rigorously defined interfaces both internally between compiler components and externally between supporting toolsets.
In the early days, the approach taken to compiler design was directly affected by the complexity of the computer language to be processed, the experience of the person(s) designing it, and the resources available. Resource limitations led to the need to pass through the source code more than once.
A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. However, as the source language grows in complexity the design may be split into a number of interdependent phases. Separate phases provide design improvements that focus development on the functions in the compilation process.
One-pass vis-à-vis multi-pass compilers
[edit]Classifying compilers by number of passes has its background in the hardware resource limitations of computers. Compiling involves performing much work and early computers did not have enough memory to contain one program that did all of this work. As a result, compilers were split up into smaller programs which each made a pass over the source (or some representation of it) performing some of the required analysis and translations.
The ability to compile in a single pass has classically been seen as a benefit because it simplifies the job of writing a compiler and one-pass compilers generally perform compilations faster than multi-pass compilers. Thus, partly driven by the resource limitations of early systems, many early languages were specifically designed so that they could be compiled in a single pass (e.g., Pascal).
In some cases, the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.
The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high quality code. It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyse one expression many times but only analyse another expression once.
Splitting a compiler up into small programs is a technique used by researchers interested in producing provably correct compilers. Proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.
Three-stage compiler structure
[edit]
Regardless of the exact number of phases in the compiler design, the phases can be assigned to one of three stages. The stages include a front end, a middle end, and a back end.
- The front end scans the input and verifies syntax and semantics according to a specific source language. For statically typed languages it performs type checking by collecting type information. If the input program is syntactically incorrect or has a type error, it generates error and/or warning messages, usually identifying the location in the source code where the problem was detected; in some cases the actual error may be (much) earlier in the program. Aspects of the front end include lexical analysis, syntax analysis, and semantic analysis. The front end transforms the input program into an intermediate representation (IR) for further processing by the middle end. This IR is usually a lower-level representation of the program with respect to the source code.
- The middle end performs optimizations on the IR that are independent of the CPU architecture being targeted. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors. Examples of middle end optimizations are removal of useless (dead-code elimination) or unreachable code (reachability analysis), discovery and propagation of constant values (constant propagation), relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context, eventually producing the "optimized" IR that is used by the back end.
- The back end takes the optimized IR from the middle end. It may perform more analysis, transformations and optimizations that are specific for the target CPU architecture. The back end generates the target-dependent assembly code, performing register allocation in the process. The back end performs instruction scheduling, which re-orders instructions to keep parallel execution units busy by filling delay slots. Although most optimization problems are NP-hard, heuristic techniques for solving them are well-developed and implemented in production-quality compilers. Typically the output of a back end is machine code specialized for a particular processor and operating system.
This front/middle/back-end approach makes it possible to combine front ends for different languages with back ends for different CPUs while sharing the optimizations of the middle end.[51] Practical examples of this approach are the GNU Compiler Collection, Clang (LLVM-based C/C++ compiler),[52] and the Amsterdam Compiler Kit, which have multiple front-ends, shared optimizations and multiple back-ends.
Front end
[edit]
if(net>0.0)total+=net*(1.0+tax/100.0);", the scanner composes a sequence of tokens, and categorizes each of them, for example as identifier, reserved word, number literal, or operator. The latter sequence is transformed by the parser into a syntax tree, which is then treated by the remaining compiler phases. The scanner and parser handles the regular and properly context-free parts of the grammar for C, respectively.The front end analyzes the source code to build an internal representation of the program, called the intermediate representation (IR). It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope.
While the frontend can be a single monolithic function or program, as in a scannerless parser, it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns. Most commonly, the frontend is broken into three phases: lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases, these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases these require manual modification. The lexical grammar and phrase grammar are usually context-free grammars, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using attribute grammars. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree). In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare.
The main phases of the front end include the following:
- Line reconstruction converts the input character sequence to a canonical form ready for the parser. Languages which strop their keywords or allow arbitrary spaces within identifiers require this phase. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a Line Reconstruction phase.
- Preprocessing supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
- Lexical analysis (also known as lexing or tokenization) breaks the source code text into a sequence of small pieces called lexical tokens.[53] This phase can be divided into two stages: the scanning, which segments the input text into syntactic units called lexemes and assigns them a category; and the evaluating, which converts lexemes into a processed value. A token is a pair consisting of a token name and an optional token value.[54] Common token categories may include identifiers, keywords, separators, operators, literals and comments, although the set of token categories varies in different programming languages. The lexeme syntax is typically a regular language, so a finite-state automaton constructed from a regular expression can be used to recognize it. The software doing lexical analysis is called a lexical analyzer. This may not be a separate step—it can be combined with the parsing step in scannerless parsing, in which case parsing is done at the character level, not the token level.
- Syntax analysis (also known as parsing) involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.[55]
- Semantic analysis adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), or object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase, and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
Middle end
[edit]The middle end, also known as optimizer, performs optimizations on the intermediate representation in order to improve the performance and the quality of the produced machine code.[56] The middle end contains those optimizations that are independent of the CPU architecture being targeted.
The main phases of the middle end include the following:
- Analysis: This is the gathering of program information from the intermediate representation derived from the input; data-flow analysis is used to build use-define chains, together with dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The control-flow graph of every compiled function and the call graph of the program are usually also built during the analysis phase.
- Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead-code elimination, constant propagation, loop transformation and even automatic parallelization.
Compiler analysis is the prerequisite for any compiler optimization, and they tightly work together. For example, dependence analysis is crucial for loop transformation.
The scope of compiler analysis and optimizations vary greatly; their scope may range from operating within a basic block, to whole procedures, or even the whole program. There is a trade-off between the granularity of the optimizations and the cost of compilation. For example, peephole optimizations are fast to perform during compilation but only affect a small local fragment of the code, and can be performed independently of the context in which the code fragment appears. In contrast, interprocedural optimization requires more compilation time and memory space, but enable optimizations that are only possible by considering the behavior of multiple functions simultaneously.
Interprocedural analysis and optimizations are common in modern commercial compilers from HP, IBM, SGI, Intel, Microsoft, and Sun Microsystems. The free software GCC was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. Another open source compiler with full analysis and optimization infrastructure is Open64, which is used by many organizations for research and commercial purposes.
Due to the extra time and space needed for compiler analysis and optimizations, some compilers skip them by default. Users have to use compilation options to explicitly tell the compiler which optimizations should be enabled.
Back end
[edit]The back end is responsible for the CPU architecture specific optimizations and for code generation.[56]
The main phases of the back end include the following:
- Machine dependent optimizations: optimizations that depend on the details of the CPU architecture that the compiler targets.[57] A prominent example is peephole optimizations, which rewrites short sequences of assembler instructions into more efficient instructions.
- Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also Sethi–Ullman algorithm). Debug data may also need to be generated to facilitate debugging.
Compiler correctness
[edit]Compiler correctness is the branch of software engineering that deals with trying to show that a compiler behaves according to its language specification.[58] Techniques include developing the compiler using formal methods and using rigorous testing (often called compiler validation) on an existing compiler.
Compiled vis-à-vis interpreted languages
[edit]Higher-level programming languages usually appear with a type of translation in mind: either designed as compiled language or interpreted language. However, in practice there is rarely anything about a language that requires it to be exclusively compiled or exclusively interpreted, although it is possible to design languages that rely on re-interpretation at run time. The categorization usually reflects the most popular or widespread implementations of a language – for instance, BASIC is sometimes called an interpreted language, and C a compiled one, despite the existence of BASIC compilers and C interpreters.[59]
Interpretation does not replace compilation completely. It only hides it from the user and makes it gradual. Even though an interpreter can itself be interpreted, a set of directly executed machine instructions is needed somewhere at the bottom of the execution stack (see machine language).
Furthermore, for optimization compilers can contain interpreter functionality, and interpreters may include ahead of time compilation techniques. For example, where an expression can be executed during compilation and the results inserted into the output program, then it prevents it having to be recalculated each time the program runs, which can greatly speed up the final program. Modern trends toward just-in-time compilation and bytecode interpretation at times blur the traditional categorizations of compilers and interpreters even further. Meta-tracing is an automated compiler synthesis approach which takes this further and can be used to synthesize a compiler from a language interpreter.
Some language specifications spell out that implementations must include a compilation facility; for example, Common Lisp. However, there is nothing inherent in the definition of Common Lisp that stops it from being interpreted. Other languages have features that are very easy to implement in an interpreter, but make writing a compiler much harder; for example, APL, SNOBOL4,[60] and many scripting languages allow programs to construct arbitrary source code at runtime with regular string operations, and then execute that code by passing it to a special evaluation function. To implement these features in a compiled language, programs must usually be shipped with a runtime library that includes a version of the compiler itself.
Types
[edit]One classification of compilers is by the platform on which their generated code executes. This is known as the target platform.
A native or hosted compiler is one whose output is intended to directly run on the same type of computer and operating system that the compiler itself runs on. The output of a cross compiler is designed to run on a different platform. Cross compilers are often used when developing software for embedded systems that are not intended to support a software development environment.
The output of a compiler that produces code for a virtual machine (VM) may or may not be executed on the same platform as the compiler that produced it. For this reason, such compilers are not usually classified as native or cross compilers.
The lower level language that is the target of a compiler may itself be a high-level programming language. C, viewed by some as a sort of portable assembly language, is frequently the target language of such compilers. For example, Cfront, the original compiler for C++, used C as its target language. The C code generated by such a compiler is usually not intended to be readable and maintained by humans, so indent style and creating pretty C intermediate code are ignored. Some of the features of C that make it a good target language include the #line directive, which can be generated by the compiler to support debugging of the original source, and the wide platform support available with C compilers.
While a common compiler type outputs machine code, there are many other types:
- Source-to-source compilers are a type of compiler that takes a high-level language as its input and outputs a high-level language. For example, an automatic parallelizing compiler will frequently take in a high-level language program as an input and then transform the code and annotate it with parallel code annotations (e.g. OpenMP) or language constructs (e.g. Fortran's
DOALLstatements). Other terms for a source-to-source compiler are transcompiler or transpiler.[61] - Bytecode compilers compile to assembly language of a theoretical machine, like some Prolog implementations
- This Prolog machine is also known as the Warren Abstract Machine (or WAM).
- Bytecode compilers for Java, Python are also examples of this category.
- Just-in-time compilers (JIT compiler) defer compilation until runtime. JIT compilers exist for many modern languages including Python, JavaScript, Smalltalk, Java, Microsoft .NET's Common Intermediate Language (CIL) and others. A JIT compiler generally runs inside an interpreter. When the interpreter detects that a code path is "hot", meaning it is executed frequently, the JIT compiler will be invoked and compile the "hot" code for increased performance.
- For some languages, such as Java, applications are first compiled using a bytecode compiler and delivered in a machine-independent intermediate representation. A bytecode interpreter executes the bytecode, but the JIT compiler will translate the bytecode to machine code when increased performance is necessary.[62][non-primary source needed]
- Hardware compilers (also known as synthesis tools) are compilers whose input is a hardware description language and whose output is a description, in the form of a netlist or otherwise, of a hardware configuration.
- The output of these compilers target computer hardware at a very low level, for example a field-programmable gate array (FPGA) or structured application-specific integrated circuit (ASIC).[63][non-primary source needed] Such compilers are said to be hardware compilers, because the source code they compile effectively controls the final configuration of the hardware and how it operates. The output of the compilation is only an interconnection of transistors or lookup tables.
- An example of hardware compiler is XST, the Xilinx Synthesis Tool used for configuring FPGAs.[64][non-primary source needed] Similar tools are available from Altera,[65][non-primary source needed] Synplicity, Synopsys and other hardware vendors.[citation needed]
- Research systems compile subsets of high level serial languages, such as Python or C++, directly into parallelized digital logic. This is typically easier to do for functional languages or functional subsets of multi-paradigm languages.[66]
- A program that translates from a low-level language to a higher level one is a decompiler.[67]
- A program that translates into an object code format that is not supported on the compilation machine is called a cross compiler and is commonly used to prepare code for execution on embedded software applications.[68][better source needed]
- A program that rewrites object code back into the same type of object code while applying optimisations and transformations is a binary recompiler.
Assemblers, which translate human readable assembly language to the machine code instructions executed by hardware, are not considered compilers.[69][b] (The inverse program that translates machine code to assembly language is called a disassembler.)
See also
[edit]Notes and references
[edit]- ^ United States Department of Defense (18 February 1980) Stoneman requirements
- ^ "The many source-language features described in the preceding section result in a number of salient differences between compilers and assemblers. On any one item the distinction may not be clear-cut. Moreover, it may be difficult to distinguish a simple compiler from a powerful macro assembler. Nevertheless, the differences are usually substantial enough that there remains a qualitative distinction between assemblers and compilers."
- ^ "Encyclopedia: Definition of Compiler". PCMag.com. Retrieved 2 July 2022.
- ^ a b Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman - Second Edition, 2007
- ^ Sudarsanam, Ashok; Malik, Sharad; Fujita, Masahiro (2002). "A Retargetable Compilation Methodology for Embedded Digital Signal Processors Using a Machine-Dependent Code Optimization Library". Readings in Hardware/Software Co-Design. Elsevier. pp. 506–515. doi:10.1016/b978-155860702-6/50045-4. ISBN 9781558607026.
A compiler is a computer program that translates a program written in a high-level language (HLL), such as C, into an equivalent assembly language program [2].
- ^ Sun, Chengnian; Le, Vu; Zhang, Qirun; Su, Zhendong (2016). "Toward understanding compiler bugs in GCC and LLVM". Proceedings of the 25th International Symposium on Software Testing and Analysis. ISSTA 2016. ACM. pp. 294–305. doi:10.1145/2931037.2931074. ISBN 9781450343909. S2CID 8339241.
- ^ Baghai, Christian (4 April 2023). "The Evolution of Programming Languages: From Primitive Binary to High-Level Abstractions". Medium. Retrieved 10 July 2024.
- ^ Lecture notes. Compilers: Principles, Techniques, and Tools. Jing-Shin Chang. Department of Computer Science & Information Engineering. National Chi-Nan University
- ^ Naur, P. et al. "Report on ALGOL 60". Communications of the ACM 3 (May 1960), 299–314.
- ^ Chomsky, Noam; Lightfoot, David W. (2002). Syntactic Structures. Walter de Gruyter. ISBN 978-3-11-017279-9.
- ^ Gries, David (2012). "Appendix 1: Backus-Naur Form". The Science of Programming. Springer Science & Business Media. p. 304. ISBN 978-1461259831.
- ^ Hellige, Hans Dieter, ed. (2004) [November 2002]. Written at Bremen, Germany. Geschichten der Informatik - Visionen, Paradigmen, Leitmotive (in German) (1 ed.). Berlin / Heidelberg, Germany: Springer-Verlag. pp. 45, 104, 105. doi:10.1007/978-3-642-18631-8. ISBN 978-3-540-00217-8. ISBN 3-540-00217-0. (xii+514 pages)
- ^ Iverson, Kenneth E. (1962). A Programming Language. John Wiley & Sons. ISBN 978-0-471430-14-8.
{{cite book}}: ISBN / Date incompatibility (help) - ^ Rutishauser, Heinz (1951). "Über automatische Rechenplanfertigung bei programmgesteuerten Rechenanlagen". Zeitschrift für Angewandte Mathematik und Mechanik (in German). 31: 255. doi:10.1002/zamm.19510310820.
- ^ Fothe, Michael; Wilke, Thomas, eds. (2015) [2014-11-14]. Written at Jena, Germany. Keller, Stack und automatisches Gedächtnis – eine Struktur mit Potenzial [Cellar, stack and automatic memory - a structure with potential] (PDF) (Tagungsband zum Kolloquium 14. November 2014 in Jena). GI Series: Lecture Notes in Informatics (LNI) – Thematics (in German). Vol. T-7. Bonn, Germany: Gesellschaft für Informatik (GI) / Köllen Druck + Verlag GmbH. pp. 20–21. ISBN 978-3-88579-426-4. ISSN 1614-3213. Archived (PDF) from the original on 12 April 2020. Retrieved 12 April 2020. [1] (77 pages)
- ^ Backus, John. "The history of FORTRAN I, II and III" (PDF). History of Programming Languages. Archived (PDF) from the original on 10 October 2022.
{{cite book}}:|website=ignored (help) - ^ Porter Adams, Vicki (5 October 1981). "Captain Grace M. Hopper: the Mother of COBOL". InfoWorld. 3 (20): 33. ISSN 0199-6649.
- ^ McCarthy, J.; Brayton, R.; Edwards, D.; Fox, P.; Hodes, L.; Luckham, D.; Maling, K.; Park, D.; Russell, S. (March 1960). "LISP I Programmers Manual" (PDF). Boston, Massachusetts: Artificial Intelligence Group, M.I.T. Computation Center and Research Laboratory.
- ^ Compilers Principles, Techniques, & Tools 2nd edition by Aho, Lam, Sethi, Ullman ISBN 0-321-48681-1
- ^ Hopper, Grace Murray (1952). "The education of a computer". Proceedings of the 1952 ACM national meeting (Pittsburgh) on - ACM '52. pp. 243–249. doi:10.1145/609784.609818. S2CID 10081016.
- ^ Ridgway, Richard K. (1952). "Compiling routines". Proceedings of the 1952 ACM national meeting (Toronto) on - ACM '52. pp. 1–5. doi:10.1145/800259.808980. S2CID 14878552.
- ^ "List of early compilers and assemblers".
- ^ Hopper, Grace. "Keynote Address". Proceedings of the ACM SIGPLAN History of Programming Languages (HOPL) conference, June 1978. doi:10.1145/800025.1198341.
- ^ Bruderer, Herbert (21 December 2022). "Did Grace Hopper Create the First Compiler?".
- ^ Strawn, George; Strawn, Candace (2015). "Grace Hopper: Compilers and Cobol". IT Professional. 17 (Jan.-Feb. 2015): 62–64. doi:10.1109/MITP.2015.6.
- ^ Knuth, Donald E.; Pardo, Luis Trabb, "Early development of programming languages", Encyclopedia of Computer Science and Technology (Marcel Dekker) 7: 419–493
- ^ Backus, John (1 June 1978), "The history of Fortran I, II, and III", History of programming languages, New York, NY, USA: Association for Computing Machinery, pp. 25–74, doi:10.1145/800025.1198345, ISBN 978-0-12-745040-7, retrieved 9 October 2024
- ^ Hoare, C.A.R. (December 1973). "Hints on Programming Language Design" (PDF). p. 27. Archived (PDF) from the original on 10 October 2022. (This statement is sometimes erroneously attributed to Edsger W. Dijkstra, also involved in implementing the first ALGOL 60 compiler.)
- ^ Abelson, Hal; Dybvig, R. K.; et al. Rees, Jonathan; Clinger, William (eds.). "Revised(3) Report on the Algorithmic Language Scheme, (Dedicated to the Memory of ALGOL 60)". Retrieved 20 October 2009.
- ^ "Recursive Functions of Symbolic Expressions and Their Computation by Machine", Communications of the ACM, April 1960
- ^ McCarthy, John; Abrahams, Paul W.; Edwards, Daniel J.; Hart, Timothy P.; Levin, Michael I. (1965). Lisp 1.5 Programmers Manual. The MIT Press. ISBN 978-0-26213011-0.
- ^ "BCPL: A tool for compiler writing and system programming" M. Richards, University Mathematical Laboratory Cambridge, England 1969
- ^ BCPL: The Language and Its Compiler, M Richards, Cambridge University Press (first published 31 December 1981)
- ^ The BCPL Cintsys and Cintpos User Guide, M. Richards, 2017
- ^ Corbató, F. J.; Vyssotsky, V. A. "Introduction and Overview of the MULTICS System". 1965 Fall Joint Computer Conference. Multicians.org.
- ^ Report II of the SHARE Advanced Language Development Committee, 25 June 1964
- ^ Multicians.org "The Choice of PL/I" article, Editor /tom Van Vleck
- ^ "PL/I As a Tool for System Programming", F.J. Corbato, Datamation 6 May 1969 issue
- ^ "The Multics PL/1 Compiler", R. A. Freiburghouse, GE, Fall Joint Computer Conference 1969
- ^ Dennis M. Ritchie, "The Development of the C Language", ACM Second History of Programming Languages Conference, April 1993
- ^ S.C. Johnson, "a Portable C Compiler: Theory and Practice", 5th ACM POPL Symposium, January 1978
- ^ A. Snyder, A Portable Compiler for the Language C, MIT, 1974.
- ^ K. Nygaard, University of Oslo, Norway, "Basic Concepts in Object Oriented Programming", SIGPLAN Notices V21, 1986
- ^ B. Stroustrup: "What is Object-Oriented Programming?" Proceedings 14th ASU Conference, 1986.
- ^ Bjarne Stroustrup, "An Overview of the C++ Programming Language", Handbook of Object Technology (Editor: Saba Zamir, ISBN 0-8493-3135-8)
- ^ Leverett, Cattell, Hobbs, Newcomer, Reiner, Schatz, Wulf: "An Overview of the Production Quality Compiler-Compiler Project", CMU-CS-89-105, 1979
- ^ W. Wulf, K. Nori, "Delayed binding in PQCC generated compilers", CMU Research Showcase Report, CMU-CS-82-138, 1982
- ^ Joseph M. Newcomer, David Alex Lamb, Bruce W. Leverett, Michael Tighe, William A. Wulf - Carnegie-Mellon University and David Levine, Andrew H. Reinerit - Intermetrics: "TCOL Ada: Revised Report on An Intermediate Representation for the DOD Standard Programming Language", 1979
- ^ William A. Whitaker, "Ada - the project: the DoD High Order Working Group", ACM SIGPLAN Notices (Volume 28, No. 3, March 1991)
- ^ CECOM Center for Software Engineering Advanced Software Technology, "Final Report - Evaluation of the ACEC Benchmark Suite for Real-Time Applications", AD-A231 968, 1990
- ^ P.Biggar, E. de Vries, D. Gregg, "A Practical Solution for Scripting Language Compilers", submission to Science of Computer Programming, 2009
- ^ M.Hall, D. Padua, K. Pingali, "Compiler Research: The Next 50 Years", ACM Communications 2009 Vol 54 #2
- ^ Cooper and Torczon 2012, p. 8
- ^ Lattner, Chris (2017). "LLVM". In Brown, Amy; Wilson, Greg (eds.). The Architecture of Open Source Applications. Archived from the original on 2 December 2016. Retrieved 28 February 2017.
- ^ Aho, Lam, Sethi, Ullman 2007, p. 5-6, 109-189
- ^ Aho, Lam, Sethi, Ullman 2007, p. 111
- ^ Aho, Lam, Sethi, Ullman 2007, p. 8, 191-300
- ^ a b Blindell, Gabriel Hjort (3 June 2016). Instruction selection: Principles, methods, and applications. Switzerland: Springer. ISBN 978-3-31934019-7. OCLC 951745657.
- ^ Cooper and Toczon (2012), p. 540
- ^ "S1-A Simple Compiler", Compiler Construction Using Java, JavaCC, and Yacc, Hoboken, NJ, US: John Wiley & Sons, Inc., pp. 289–329, 28 February 2012, doi:10.1002/9781118112762.ch12, ISBN 978-1-118-11276-2, retrieved 17 May 2023
- ^ "Compiler vs. Interpreter in Programming". Built In. Retrieved 25 May 2025.
- ^ "SNOBOL4 Programming Language". SourceForge. 27 October 2023. Retrieved 25 May 2025.
- ^ Ilyushin, Evgeniy; Namiot, Dmitry (2016). "On source-to-source compilers". International Journal of Open Information Technologies. 4 (5): 48–51. Archived from the original on 13 September 2022. Retrieved 14 September 2022.
- ^ Aycock, John (2003). "A Brief History of Just-in-Time". ACM Comput. Surv. 35 (2): 93–113. doi:10.1145/857076.857077. S2CID 15345671.
- ^ Swartz, Jordan S.; Betz, Vaugh; Rose, Jonathan (22–25 February 1998). "A fast routability-driven router for FPGAs" (PDF). Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays - FPGA '98. Monterey, CA: ACM. pp. 140–149. doi:10.1145/275107.275134. ISBN 978-0897919784. S2CID 7128364. Archived (PDF) from the original on 9 August 2017.
- ^ Xilinx Staff (2009). "XST Synthesis Overview". Xilinx, Inc. Archived from the original on 2 November 2016. Retrieved 28 February 2017.
- ^ Altera Staff (2017). "Spectra-Q™ Engine". Altera.com. Archived from the original on 10 October 2016. Retrieved 28 February 2017.
- ^ Jurkans, K; Fox, C (2023). Python Subset to Digital Logic Dataflow Compiler for Robots and IoT. IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom-2023).
- ^ Metula, Erez (2011). "Tools of the Trade". Managed Code Rootkits. pp. 39–62. doi:10.1016/B978-1-59749-574-5.00003-9. ISBN 978-1-59749-574-5.
- ^ Chandrasekaran, Siddharth (26 January 2018). "Cross Compilation Demystified". embedjournal.com. Retrieved 5 March 2023.
- ^ Calingaert and Horowitz 1979, pp. 186-187
Further reading
[edit]- Aho, Alfred V.; Sethi, Ravi; Ullman, Jeffrey D. (1986). Compilers: Principles, Techniques, and Tools (1st ed.). Addison-Wesley. ISBN 9780201100884.
- Allen, Frances E. (September 1981). "A History of Language Processor Technology in IBM". IBM Journal of Research and Development. 25 (5). IBM: 535–548. doi:10.1147/rd.255.0535.
- Allen, Randy; Kennedy, Ken (2001). Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers. ISBN 978-1-55860-286-1.
- Appel, Andrew Wilson (2002). Modern Compiler Implementation in Java (2nd ed.). Cambridge University Press. ISBN 978-0-521-82060-8.
- Appel, Andrew Wilson (1998). Modern Compiler Implementation in ML. Cambridge University Press. ISBN 978-0-521-58274-2.
- Bornat, Richard (1979). Understanding and Writing Compilers: A Do It Yourself Guide (PDF). Macmillan Publishing. ISBN 978-0-333-21732-0. Archived from the original (PDF) on 15 June 2007. Retrieved 11 April 2007.
- Calingaert, Peter (1979). Horowitz, Ellis (ed.). Assemblers, Compilers, and Program Translation. Computer software engineering series (1st printing, 1st ed.). Potomac, Maryland: Computer Science Press, Inc. ISBN 0-914894-23-4. ISSN 0888-2088. LCCN 78-21905. Retrieved 20 March 2020. (2+xiv+270+6 pages)
- Cooper, Keith Daniel; Torczon, Linda (2012). Engineering a compiler (2nd ed.). Amsterdam, Netherlands: Elsevier/Morgan Kaufmann. p. 8. ISBN 978-0-12088478-0. OCLC 714113472.
- Gries, David (1971). Compiler Construction for Digital Computers (in English, Spanish, Japanese, Chinese, Italian, and Russian). New York: John Wiley and Sons. ISBN 0-471-32776-X.
The first text on compiler construction.
- McKeeman, William Marshall; Horning, James J.; Wortman, David B. (1970). A Compiler Generator. Englewood Cliffs, NJ: Prentice-Hall. ISBN 978-0-13-155077-3.
- Muchnick, Steven (1997). Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers. ISBN 978-1-55860-320-2.
- Scott, Michael Lee (2005). Programming Language Pragmatics (2nd ed.). Morgan Kaufmann. ISBN 978-0-12-633951-2.
- Srikant, Y. N.; Shankar, Priti (2003). The Compiler Design Handbook: Optimizations and Machine Code Generation. CRC Press. ISBN 978-0-8493-1240-3.
- Terry, Patrick D. (1997). Compilers and Compiler Generators: An Introduction with C++. International Thomson Computer Press. ISBN 978-1-85032-298-6.
- Wirth, Niklaus (1996). Compiler Construction (PDF). Addison-Wesley. ISBN 978-0-201-40353-4. Archived from the original (PDF) on 17 February 2017. Retrieved 24 April 2012.
- LLVM community. "The LLVM Target-Independent Code Generator". LLVM Documentation. Retrieved 17 June 2016.
- Compiler textbook references A collection of references to mainstream Compiler Construction Textbooks
External links
[edit]- Incremental Approach to Compiler Construction – a PDF tutorial
- Basics of Compiler Design at the Wayback Machine (archived 15 May 2018)
- Short animation on YouTube explaining the key conceptual difference between compilers and interpreters
- Syntax Analysis & LL1 Parsing on YouTube
- Let's Build a Compiler, by Jack Crenshaw
- Forum about compiler development at the Wayback Machine (archived 10 October 2014)
Compiler
View on GrokipediaOverview
Definition and Purpose
A compiler is a computer program that translates source code written in a high-level programming language into a lower-level target language, such as machine code or bytecode, suitable for execution on a specific computer architecture.[7][8] This translation process ensures that the resulting program maintains the intended functionality of the original source while adapting it to the hardware's instruction set.[9] The primary purpose of a compiler is to bridge the gap between human-readable, abstract programming languages and the low-level instructions required by processors, thereby enabling developers to focus on algorithmic logic rather than machine-specific details.[10] By abstracting hardware complexities, compilers promote code portability across different systems and improve overall software efficiency through optimizations that produce faster, more resource-efficient executables.[11] This abstraction not only accelerates software development but also enhances maintainability and scalability in large-scale applications.[12] At a high level, a compiler's workflow involves taking source code as input, performing an analysis phase to parse and validate the code's structure and semantics, followed by a synthesis phase to generate the target output, such as object code or an executable binary.[3][13] For instance, the early A-0 system developed by Grace Hopper in 1952 represented one of the first such translation tools, functioning as a rudimentary compiler for the UNIVAC computer by linking subroutines into executable form.[14][15] A practical example is compiling a simple C program, such as one printing "Hello, World!", using a tool like GCC with the-S flag to produce x86 assembly code; this generates instructions like mov and call that the assembler then converts to machine code for execution on an x86 processor.[16]
Key Components
A compiler's internal structure is typically modular, comprising several core components that process source code through distinct stages to produce executable target code. The primary components include the lexical analyzer (also known as the scanner), parser, semantic analyzer, code generator, and symbol table manager. These elements work in sequence to analyze and synthesize code, ensuring syntactic and semantic correctness while optimizing for the target machine.[3] The lexical analyzer, or scanner, is the first component in the pipeline, responsible for reading the source code as a stream of characters and converting it into a sequence of meaningful tokens, such as keywords, identifiers, operators, and literals. It performs tasks like removing comments and whitespace, expanding macros, and handling language-specific features such as case sensitivity or indentation-based scoping (e.g., in Python). By grouping characters into tokens, the scanner simplifies subsequent processing and detects basic lexical errors, such as invalid characters.[17][18] Following the scanner, the parser, or syntax analyzer, takes the token stream as input and constructs an abstract syntax tree (AST) or parse tree that represents the hierarchical structure of the program according to the language's context-free grammar. It verifies that the tokens conform to syntactic rules, resolving ambiguities through techniques like LL or LR parsing, and reports syntax errors if the structure is invalid. The resulting AST captures the program's logical organization, such as expressions, statements, and declarations, enabling higher-level analysis.[19][18] The semantic analyzer then processes the AST to enforce static semantic rules beyond syntax, such as type checking, scope resolution, and declaration consistency. It verifies that operations are semantically valid—for instance, ensuring no type mismatches in expressions or undeclared variable usage—and annotates the AST with additional information, like inferred types or resolved references, often producing a decorated AST or semantic graph. This phase detects errors that syntax alone cannot catch, such as assigning a string to an integer variable.[20][18] The code generator operates toward the end of the process, translating the semantically verified intermediate representation (IR)—typically generated after semantic analysis—into target-specific machine code or assembly. It maps high-level constructs to low-level instructions, incorporating register allocation and addressing modes tailored to the target architecture, such as x86-64 or ARM, to produce efficient executable output.[21] Throughout these stages, the symbol table manager maintains a dynamic data structure that stores and retrieves information about program symbols, including variables, functions, types, and scopes. It tracks attributes like declaration locations, types, and visibility, supporting lookups during parsing, semantic analysis, and code generation to resolve references and prevent errors like redeclarations. The symbol table facilitates interconnections by providing shared access to symbol data across components, often implemented as hash tables or trees for efficient operations.[22][18] These components interconnect in a linear data flow: the source code passes to the lexical analyzer, yielding tokens to the parser, which builds the AST for the semantic analyzer; the enriched IR then feeds the code generator, with the symbol table manager integrating data bidirectionally to support analysis and synthesis. This modular design allows for independent development and reuse, as seen in compiler toolkits like Lex for generating lexical analyzers from regular expression specifications and Yacc for producing parsers from context-free grammars, which accelerate prototyping by automating routine tasks.[23][24] A typical block diagram of these interactions can be visualized as follows:[Source Code](/page/Source_code)
|
v
Lexical Analyzer (Scanner)
|
v
[Tokens](/page/The_Tokens)
|
v
Parser
|
v
Abstract Syntax Tree (AST)
|
v
Semantic Analyzer
| ^
v | (Symbol lookups)
[Intermediate Representation](/page/Intermediate_representation) (IR) <--- [Symbol Table](/page/Symbol_table) Manager
|
v
Code Generator
|
v
Target Code
[Source Code](/page/Source_code)
|
v
Lexical Analyzer (Scanner)
|
v
[Tokens](/page/The_Tokens)
|
v
Parser
|
v
Abstract Syntax Tree (AST)
|
v
Semantic Analyzer
| ^
v | (Symbol lookups)
[Intermediate Representation](/page/Intermediate_representation) (IR) <--- [Symbol Table](/page/Symbol_table) Manager
|
v
Code Generator
|
v
Target Code
Comparison with Other Systems
Interpreters
An interpreter is a program that executes source code directly, typically line by line or statement by statement, without first translating the entire code into a separate executable file.[25] Unlike compilers, which produce machine code for later execution, interpreters read, analyze, and run the code on-the-fly during runtime.[26] The execution model of an interpreter involves processing the source code sequentially, often translating it into an intermediate form such as bytecode before immediate execution, sometimes within a virtual machine environment that abstracts the underlying hardware.[27] This approach allows for dynamic evaluation, where code can be modified or extended while running, providing greater flexibility for interactive and scripting applications.[28] Interpreters offer several advantages over compilers, including easier debugging due to line-by-line execution that pinpoints errors immediately, rapid prototyping with instant feedback without a separate compilation phase, and enhanced portability across platforms via the virtual machine layer.[29] For instance, Python's CPython implementation exemplifies this model, where source code is compiled to bytecode and then interpreted by the virtual machine, enabling seamless execution in interactive sessions or scripts.[30] However, interpreters generally incur performance penalties, as the repeated analysis and translation at each run eliminate opportunities for whole-program optimizations that compilers can apply, resulting in significantly slower execution speeds, with slowdown factors often ranging from 10x to thousands compared to compiled equivalents, depending on the workload.[31] This overhead stems from the on-the-fly processing, which avoids upfront compilation but limits efficiency for compute-intensive tasks.[32] Earlier examples include Short Code, proposed in 1949 by John Mauchly and implemented for computers like the BINAC and UNIVAC I, which used an interpreter to execute high-level instructions line by line. A notable early interpreter was developed for the Lisp programming language starting in fall 1958 by John McCarthy and his team at MIT, initially as a simpler alternative to building a full compiler, marking a pivotal innovation in symbolic computation and AI research.[33] Hybrid approaches, such as those combining interpretation with just-in-time compilation, bridge the gap by interpreting initially for quick startup while compiling hot code paths for improved performance, though these extend beyond pure interpretation.[34]Assemblers and Linkers
Assemblers are specialized programs that translate assembly language, a low-level symbolic representation of machine instructions, into relocatable machine code. Assembly language uses mnemonic instructions and symbolic addresses to represent binary operations, providing a one-to-one correspondence with the target processor's instruction set architecture (ISA). This translation process involves resolving symbols, such as labels for memory locations, into actual addresses while generating object files that include the machine code and relocation information for later use.[35] Unlike compilers, which process high-level source code through syntactic and semantic analysis, assemblers operate directly on this intermediate, ISA-specific input without performing complex optimizations or type checking, focusing instead on straightforward symbol resolution and opcode mapping. Typically, assemblers perform this in one or two passes: the first pass builds a symbol table and calculates addresses, while the second generates the object code by substituting machine opcodes for mnemonics.[35] Linkers, also known as link editors, take the output from compilers or assemblers—object files containing relocatable code and unresolved external references—and combine them into a single executable binary suitable for execution on the target system. Their primary functions include symbol resolution, where undefined references (e.g., calls to functions in other modules or libraries) are matched to their definitions; relocation, which adjusts addresses to fit the final memory layout; and library management, incorporating necessary routines from static or shared libraries.[36] Linkers support two main types of linking: static and dynamic. In static linking, all required library code is copied into the executable at link time, resulting in a self-contained binary that does not depend on external files at runtime, though this increases file size. Dynamic linking, by contrast, defers resolution until load or run time, referencing shared libraries (e.g., .so files on Unix-like systems) to reduce redundancy and enable updates to libraries without recompiling the program. A prominent example is the GNU linker (ld), part of the GNU Binutils suite, which supports multiple object formats like ELF and uses a command language for customizing output sections and memory placement.[37][38] In the typical compilation pipeline, a compiler generates object code from high-level source, which may already include assembly-like intermediates; if source is written in assembly, an assembler explicitly produces the object file. The linker then processes these object files, resolving dependencies and producing the final executable, ensuring the program can run independently or with dynamic libraries.[39][36] Early assemblers emerged in the 1940s to simplify programming of the first electronic computers, with Kathleen Booth developing the initial assembly language, known as "Contracted Notation" or autocode, in 1947 for the Automatic Relay Computer (ARC) at Birkbeck College, London. This innovation replaced tedious binary coding with symbolic instructions, paving the way for more efficient low-level programming before the widespread adoption of high-level languages.[40]Historical Development
Early Innovations (1950s–1970s)
The development of compilers began in the early 1950s amid significant hardware constraints, including limited memory and processing power on machines like the UNIVAC I, which compelled innovators to create tools for automating code translation from high-level descriptions to machine instructions.[41] Grace Hopper led the creation of the A-0 system in 1952 at Remington Rand, recognized as the first compiler precursor, which automatically linked and ordered subroutines based on specifications read from punched cards, marking an initial step toward automatic programming.[15] Concurrently, A.E. Glennie developed Autocode in 1952 for the Manchester Mark 1 computer, an early high-level programming system that translated mathematical expressions into machine code, addressing the tedium of manual assembly coding.[42] By 1954, R.A. Brooker at the University of Manchester advanced this with the Mark 1 Autocode, introducing the first compiled subroutines that generated efficient machine code for reusable code blocks, overcoming manual optimization challenges on resource-scarce hardware.[43] A pivotal milestone arrived in 1957 with the release of the FORTRAN compiler by John Backus and his team at IBM for the IBM 704, the first widely used high-level language compiler that translated scientific formulas into optimized assembly code, dramatically reducing programming time from weeks to hours despite the era's 36-bit word size and 4K-word memory limits.[44] This compiler incorporated early optimization techniques, such as register allocation and common subexpression elimination, to produce code rivaling hand-written assembly in efficiency, thus proving high-level languages viable on limited hardware.[42] Key figures like Hopper, Backus, and Brooker navigated these constraints by emphasizing syntax analysis and code generation, laying groundwork for structured translation methods. In the 1960s, ALGOL 60, formalized in 1960 by an international committee including Niklaus Wirth, profoundly influenced structured programming by introducing block structures, recursive procedures, and lexical scoping, which compilers could enforce to promote readable, modular code free from unstructured jumps.[45] This language spurred advancements in compiler design, including syntax-directed translation—a technique where parsing drives semantic actions for code generation—exemplified in early ALGOL implementations that used recursive descent parsers to handle nested blocks efficiently.[46] IBM's FORTRAN IV, released in 1962 for systems like the IBM 7090, built on prior versions with enhanced optimizations such as index register usage and loop invariant code motion, enabling better performance on evolving hardware while standardizing features like explicit type declarations to ease portability.[47] The 1970s saw further innovation with Dennis Ritchie developing the C language in 1972 at Bell Labs, designed as a portable systems programming tool that combined high-level abstractions with low-level access, initially compiled for the PDP-11 to rewrite Unix in C by 1973.[48] Ritchie's collaboration with Ken Thompson integrated C into the emerging Unix toolchain, where the compiler formed a core component alongside assemblers and linkers, facilitating modular software development on minicomputers.[49] To address machine dependency, Stephen C. Johnson's Portable C Compiler (PCC) in the mid-1970s employed an intermediate language representation, allowing front-end parsing to target multiple back-ends and enabling Unix porting across architectures like the VAX, thus overcoming hardware fragmentation through retargetable design.[50] These efforts by Ritchie, Thompson, and Johnson resolved persistent challenges like manual optimization and non-portability, establishing compilers as essential for scalable software ecosystems.Modern Advancements (1980s–Present)
The 1980s and 1990s saw the rise of open-source compilers that emphasized portability and modularity, addressing the limitations of proprietary systems tied to specific hardware. The GNU Compiler Collection (GCC), launched by Richard Stallman in 1987 as part of the GNU Project, became a cornerstone for free software development by supporting multiple languages and architectures, enabling developers to build portable applications without vendor lock-in. By the late 1990s, GCC had evolved to support several programming languages, including C, C++, Objective-C, Fortran, and Ada, and numerous processor architectures, fostering widespread adoption in Unix-like systems and beyond.[51] The Low Level Virtual Machine (LLVM) project, started in 2000 by Chris Lattner at the University of Illinois, introduced a modular compiler infrastructure that separated the front-end language parsing from back-end code generation, allowing reusable components across different languages and targets. This design facilitated easier experimentation with optimizations and supported the creation of compilers for new languages with minimal effort. In the 2000s, just-in-time (JIT) compilation gained prominence with the HotSpot Java Virtual Machine, released by Sun Microsystems in 1999, which dynamically compiled bytecode to native code at runtime for improved performance in enterprise applications. Similarly, the .NET Framework's Common Language Runtime, introduced by Microsoft in 2002, incorporated JIT compilation to enable seamless execution of managed code across platforms, boosting the adoption of languages like C#. The decade also witnessed hybrid approaches for scripting languages, such as V8 for JavaScript in 2008, blending interpretation with JIT to balance startup speed and runtime efficiency in web browsers. The 2010s brought compilers focused on safety and performance for modern paradigms. Mozilla's Rust compiler (rustc), with development beginning in 2009 and the first stable release in 2015, pioneered memory-safe systems programming through its borrow checker, preventing common errors like null pointer dereferences without garbage collection, and has since been used in critical infrastructure like the Linux kernel. Apple's Swift compiler, announced in 2014, emphasized type safety and high performance for iOS development, compiling to native code while integrating with Objective-C ecosystems to reduce boilerplate and errors. WebAssembly (Wasm) compilers emerged around 2015 as a portable binary format, with tools like Emscripten (extended in this era) enabling C/C++ code to run efficiently in browsers via ahead-of-time compilation to Wasm modules. From the late 2010s to the 2020s, AI-driven optimizations transformed compiler design. LLVM's MLGO framework, integrated in 2021, uses machine learning to guide phase ordering and inlining decisions, achieving up to 1.3% geomean speedup on SPEC benchmarks compared to traditional heuristics. Key trends include modular architectures for easier extension, as seen in LLVM's influence on projects like Rust and Swift; cross-compilation support for embedded and IoT devices, with GCC and Clang enabling builds for ARM and RISC-V without host modifications; and the dominance of open-source compilers, which by 2023 powered over 90% of Linux distributions and cloud infrastructure. In 2025, parallel compilation advancements in Clang/LLVM, such as distributed task execution, improved build times for large projects, enhancing developer productivity. Quantum compilation also advanced, with IBM's Qiskit framework incorporating circuit optimization passes in its 2025 release to map high-level quantum algorithms to noisy intermediate-scale quantum hardware more efficiently.Compiler Architecture
Phases and Passes
Compiler processing is organized into phases and passes to systematically transform source code into executable machine code. Phases represent distinct, sequential transformation steps that operate on the input, such as lexical analysis, which breaks the source code into tokens; syntax analysis, which verifies the structure against grammar rules; semantic analysis, which checks type compatibility and scope rules; and code generation, which produces target code from an intermediate representation.[52] These phases can be executed in a single traversal or repeated across multiple passes, depending on the compiler design. To illustrate the flow through these main phases, consider how a typical compiler processes the expressioni = i * 70 + j + 2 (assuming integer types and standard operator precedence where multiplication binds tighter than addition):
1. Lexical Analysis (Tokens):
i(identifier)=(operator)i(identifier)*(operator)70(integer literal)+(operator)j(identifier)+(operator)2(integer literal)
i = ((i * 70) + j) + 2
Abstract Syntax Tree outline:
- Assignment node
- Left: identifier "i"
- Right: + node
- Left: + node
- Left: * node
- Left: identifier "i"
- Right: constant 70
- Right: identifier "j"
- Left: * node
- Right: constant 2
- Left: + node
- Verifies type compatibility (all operands are integers → no errors)
- Builds symbol table entries for
iandj(assuming declared as int) - Annotates the AST with type information
t1 = i * 70
t2 = t1 + j
t3 = t2 + 2
i = t3
t1 = i * 70
t2 = t1 + j
t3 = t2 + 2
i = t3
( *, i, 70, t1 )( +, t1, j, t2 )( +, t2, 2, t3 )( =, t3, _, i ))
5. Code Optimization (basic, no major changes here):
Same as above (no common subexpressions or constant folding applicable without variable values)
6. Code Generation (example x86 assembly-like):
mov eax, [i] ; load i
imul eax, 70 ; i * 70
add eax, [j] ; + j
add eax, 2 ; + 2
mov [i], eax ; store back to i
mov eax, [i] ; load i
imul eax, 70 ; i * 70
add eax, [j] ; + j
add eax, 2 ; + 2
mov [i], eax ; store back to i
Source Code → Lexical → Syntax → Semantic → Code Gen → Machine Code
Source Code → Lexical → Syntax → Semantic → Code Gen → Machine Code
Front End
The front end of a compiler performs the initial analysis of source code, verifying its adherence to the programming language's syntax and semantics while generating a machine-independent intermediate representation. This phase is source-language dependent, focusing on transforming raw text into a structured form suitable for further processing. By isolating language-specific concerns, the front end enables reusability with multiple back ends for different target architectures. The front end comprises three primary components: lexical analysis, syntax analysis, and semantic analysis. Lexical analysis scans the input stream of characters and groups them into meaningful units called tokens, such as keywords, identifiers, operators, and literals. This process relies on regular expressions to define token patterns, which are converted into deterministic finite automata (DFAs) for efficient recognition. For instance, the regular expression for an identifier, , matches a starting letter followed by zero or more alphanumeric characters, ensuring tokens like variable names are correctly isolated. Tools like Lex automate DFA construction from these specifications, optimizing scanning performance through state minimization. Syntax analysis builds upon the token stream to validate the program's structure against the language's context-free grammar, producing a parse tree that represents hierarchical relationships. Parsers operate either top-down, as in LL algorithms that predict expansions from the start symbol, or bottom-up, as in LR algorithms that reduce tokens to non-terminals via shift-reduce mechanisms. LR parsers, introduced by Knuth, handle a broader class of grammars deterministically in linear time, making them suitable for most programming languages. Parser generators such as Yacc, which produces LALR(1) parsers from grammar rules augmented with semantic actions, and its GNU successor Bison, facilitate rapid development by automating table-driven parsing. Semantic analysis examines the parse tree for context-sensitive properties, ensuring the program's meaning is well-defined. Key tasks include type checking, which verifies operand compatibility in operations (e.g., disallowing addition of incompatible types like integer and string), and scope resolution, which binds identifiers to their declarations within nested scopes using mechanisms like symbol tables. Violations trigger errors, such as undeclared variables or type mismatches. This phase often integrates attribute grammars to propagate information across the tree. The front end culminates in an abstract syntax tree (AST), a condensed representation omitting extraneous details like parentheses, or a preliminary intermediate representation (IR) such as three-address code, where statements are simplified to forms like . For the declarationint x = 5;, lexical analysis yields tokens for "int", "x", "=", and "5"; syntax analysis constructs nodes for a variable declaration with assignment; and semantic analysis confirms integer type assignment, yielding an AST with a declaration node parenting an identifier "x" and integer literal 5.
To support robust compilation, the front end incorporates error handling and recovery strategies during lexical, syntactic, and semantic phases. Upon detecting issues like unexpected tokens, parsers may skip to synchronization points (e.g., semicolons) or insert/deleted symbols locally to resume parsing, minimizing cascading errors and enabling multi-error reporting in a single pass.
Middle End
The middle end of a compiler, often referred to as the optimizer, bridges the front end and back end by transforming the intermediate representation (IR) into a more efficient form through machine-independent optimizations. These transformations focus on improving program performance, such as reducing computational redundancy or eliminating unnecessary operations, without relying on details of the target architecture. Core techniques include constant folding, which evaluates expressions with known constant values at compile time, and dead code elimination, which removes code that has no effect on the program's observable behavior, such as unreachable statements or computations whose results are never used.[59][60] Central to middle-end optimizations are data-flow analysis and control-flow graphs, which provide the analytical framework for identifying optimization opportunities. A control-flow graph (CFG) models the program's execution paths, with nodes representing basic blocks—sequences of instructions without branches—and edges indicating possible control transfers, enabling precise modeling of program structure for analysis. Data-flow analysis operates on this CFG to track variable states across the program; for example, reaching definitions analysis computes the set of variable definitions that may reach each program point, while live variables analysis determines which variables retain value beyond a given point. These analyses support optimizations by revealing redundancies, such as unused definitions that can be eliminated.[61][62][63] The basic data-flow equations for reaching definitions illustrate this process formally. For a node in the CFG: Here, are the predecessors of , is the set of definitions generated within , and is the set of definitions invalidated by . To solve, initialize all and sets to the empty set (for a forward analysis starting conservatively), then propagate information iteratively in topological order across the CFG or using worklist algorithms until a fixed point is reached, where no further changes occur due to the monotonic nature of the union and set operations. This fixed-point computation ensures all possible reaching definitions are captured efficiently.[63][64] Middle-end processing involves multiple optimization passes, often organized in loops to apply transformations iteratively as new opportunities arise from prior changes. For instance, loop invariant code motion (LICM) identifies computations within a loop that yield the same result on every iteration—such as loop-independent expressions—and relocates them to a pre-loop position, minimizing redundant execution while preserving program semantics. These passes rely on prior analyses like dominators (nodes that precede all paths to a given node) to ensure safe movement. In practice, compilers execute sequences of such passes, potentially repeating them to handle interdependencies.[65][66] Frameworks like LLVM's pass manager facilitate modular implementation of these optimizations, allowing developers to define and schedule sequences of analysis and transformation passes on the IR. The new pass manager in LLVM supports nested pipelines, enabling bottom-up optimization of call graphs and efficient reuse of analysis results across passes, which enhances scalability for large codebases. For example, the optimization ofx = 2 + 3; to x = 5; via constant folding demonstrates a simple yet impactful transformation, replacing runtime addition with a direct assignment.[59][67][68]
Back End
The back end of a compiler translates the optimized intermediate representation (IR) into target-specific machine code, accounting for the hardware architecture's instruction set, registers, and memory model. This phase is highly dependent on the target platform, such as x86, ARM, or RISC-V, requiring architecture-specific rules to ensure efficient and correct execution. Unlike the machine-independent optimizations in prior phases, the back end focuses on low-level transformations to produce executable binaries or assembly code.[69] Key components include instruction selection, register allocation, and code generation. Instruction selection maps operations in the IR, often represented as a directed acyclic graph (DAG), to concrete machine instructions using techniques like tree pattern matching. In this approach, the IR tree or DAG is covered by predefined patterns (tiles) corresponding to instruction alternatives, selecting the optimal match based on cost functions for latency or code size; for instance, the Bottom-Up Rewrite System (BURS) automates this by generating efficient parsers from a cost-augmented grammar.[70] Register allocation assigns virtual registers from the IR to a limited set of physical machine registers, modeled as graph coloring on an interference graph where nodes represent live ranges and edges indicate overlapping lifetimes. If the graph is not colorable with the available registers, spilling to memory occurs. Code generation then emits assembly instructions, incorporating peephole optimizations to refine short sequences for better performance, such as combining multiple moves into a single instruction. Modern back ends, such as those in LLVM, support retargeting across multiple instruction set architectures (ISAs) through modular designs, allowing reuse of front and middle ends while customizing selection rules, scheduling, and emission for each target. For example, LLVM's TableGen tool defines instruction encodings and patterns, enabling back ends for over 20 ISAs including x86 and ARM. The overall process proceeds from optimized IR to a DAG representation for selection, followed by instruction scheduling to respect dependencies and hazards, register allocation, and final assembly emission into object code or binaries.[69] A representative example is compiling a simple IR assignment like%result = add i32 %a, 5 on x86: instruction selection might choose add eax, 5 after loading %a into eax, with register allocation ensuring no conflicts via coloring. For register allocation, the greedy graph coloring algorithm approximates the chromatic number by iteratively assigning the smallest available color (register) to nodes ordered by decreasing degree, building the interference graph first where edges connect variables live simultaneously; this heuristic guarantees at most colors, where is the maximum degree, often sufficient for typical register counts like 16 on x86.[71] Retargeting challenges arise from ISA differences, such as ARM's load-store architecture versus x86's register-memory operations, necessitating careful pattern matching to avoid inefficient code.[72]
Correctness and Verification
Compiler correctness refers to the guarantee that the compiled code preserves the semantics of the source program, ensuring equivalent observable behavior between the input and output across all possible executions.[73] This semantic preservation is fundamental, as violations can lead to subtle bugs where the generated code produces incorrect results or crashes unexpectedly. Optimizing compilers prioritize performance while aiming to maintain this equivalence, whereas verifying compilers incorporate proofs to mathematically confirm it.[74] Verification techniques for compilers span empirical testing and formal methods. Testing approaches include unit tests for individual components, integration tests for pipeline interactions, and random testing tools like Csmith, which generates complex C programs to expose discrepancies via differential testing against multiple compilers. Csmith has uncovered hundreds of bugs in production compilers such as GCC and LLVM by producing inputs that trigger miscompilations.[75] Formal methods, in contrast, use theorem provers like Coq to establish semantic preservation; the CompCert project exemplifies this by verifying a subset of C to assembly through a multi-pass pipeline, proving that optimizations do not alter program behavior.[73] Model checking can also validate properties in specific phases, though it scales poorly to full compilers. Challenges in verification arise from the undecidability of checking semantic equivalence between arbitrary source and target programs, stemming from the halting problem and requiring approximations or restrictions to subsets of languages. Phase errors, where transformations in one stage invalidate assumptions in another, further complicate guarantees. Examples include formal verification efforts in blockchain systems like Tezos, where the Mi-Cho-Coq framework uses Coq to certify the functional correctness of Michelson smart contracts against their compiled execution semantics.[76] Bug-finding tools like Csmith complement this by targeting undefined behaviors that evade formal proofs.[75] Metrics for assessing verification effectiveness include test coverage rates, often aiming for over 90% branch coverage in regression suites to catch regressions, and self-hosting, where the compiler recompiles itself across multiple generations to validate consistency in output binaries. Self-hosting provides a practical sanity check, as discrepancies indicate bugs, though it does not prove full correctness. In recent developments, AI techniques such as large language models have been applied to detect missed optimizations and bugs in LLVM; one approach identified 24 issues in production compilers by analyzing code for optimization opportunities.[77] These methods, including LLVM IR-based crossover testing in FLUX, enhance traditional fuzzing by generating targeted inputs that reveal subtle miscompilations. As of 2025, ongoing efforts include LLM-assisted fuzzing tools like WhiteFox, which have detected additional bugs in LLVM.[78]Types of Compilers
Traditional Batch Compilers
Traditional batch compilers are systems that translate an entire program's source code into machine-executable form ahead of time, typically in an offline batch process without user interaction during compilation. This approach involves static analysis of the full codebase to generate optimized object code or binaries before program execution begins.[79] Key characteristics include ahead-of-time (AOT) compilation, where the compiler performs all necessary analysis and optimization in a single or multi-pass run on the complete source, producing standalone executables or intermediate representations suitable for direct execution. Prominent examples are the GNU Compiler Collection (GCC), which compiles languages like C and C++ into native machine code, and the Java compiler (javac), which translates Java source files into platform-independent bytecode for the Java Virtual Machine. These compilers operate in batch mode, processing multiple input files as a unit to resolve dependencies and generate outputs like object files or class files.[51][80] The typical workflow begins with source code input, which undergoes preprocessing (e.g., macro expansion in C), lexical analysis, parsing, semantic analysis, optimization, and code generation to produce object files containing relocatable machine code. These object files are then linked with libraries and resolved for external references to create a final executable binary, ready for immediate execution without further compilation. This process is managed via command-line invocation or build tools, ensuring reproducibility in non-interactive environments.[80] Advantages of traditional batch compilers include the generation of highly optimized binaries that exhibit superior runtime performance and minimal startup latency, as all compilation overhead occurs upfront rather than during execution. They also eliminate runtime compilation costs, making them ideal for resource-constrained environments. However, disadvantages encompass extended compilation times for large codebases, which can prolong development iterations, and a lack of adaptability to runtime conditions since optimizations are fixed at compile time.[81][82] Common use cases for batch compilers include systems programming, where native code efficiency is paramount, such as developing operating system kernels or device drivers with GCC. They are also prevalent in embedded software development, where predictable performance and small footprint are essential, as AOT compilation avoids the overhead of just-in-time mechanisms in constrained hardware.[83][51] The evolution of batch compilers traces back to the original FORTRAN compiler developed for the IBM 704 in the mid-1950s by John Backus and his team at IBM, which introduced high-level language translation and optimization in a batch-oriented manner to simplify scientific computing. Over decades, this model advanced from single-language tools like early FORTRAN implementations to polyglot systems like GCC, initiated in 1987, which now supports over a dozen languages and targets diverse architectures while retaining the core batch compilation paradigm.[84][51]Just-In-Time (JIT) Compilers
Just-in-time (JIT) compilers translate bytecode or intermediate representation (IR) into native machine code during program execution, enabling dynamic optimization based on runtime behavior. This contrasts with ahead-of-time compilation by deferring code generation until necessary, often incorporating profiling to target frequently executed code paths for improved efficiency. The JIT process typically starts with interpretation or baseline compilation for rapid startup, then progresses to full optimization as usage patterns emerge.[85][86] The core mechanism of JIT compilers involves monitoring execution to identify "hot" code—regions executed often—and compiling them on-the-fly with aggressive optimizations unavailable at static compile time. For instance, tracing JITs record sequences of operations as linear traces, applying transformations like loop unrolling and dead code elimination to these traces for streamlined execution. In the V8 JavaScript engine, this pipeline includes an interpreter (Ignition) for bytecode generation, a baseline JIT (Sparkplug) for quick machine code, and advanced optimizing JITs (Maglev and TurboFan) that use static single assignment (SSA) form and runtime feedback on types and shapes to specialize code, with deoptimization handling failures in assumptions. Such approaches allow JITs to outperform pure interpretation while adapting to actual workloads.[87][88] Prominent examples of JIT compilers include the HotSpot JVM, released by Sun Microsystems in 1999, which uses a tiered system: the C1 client compiler for fast, lightweight optimization and the C2 server compiler for thorough analysis, significantly reducing interpretation overhead in Java applications. The .NET Common Language Runtime (CLR) applies JIT to convert Common Intermediate Language (CIL) into platform-specific native code at method invocation, supporting managed execution across environments. LuaJIT, a trace-based JIT for the Lua scripting language developed by Mike Pall since 2005, combines an assembler-based interpreter with advanced SSA optimizations to achieve near-native speeds in dynamic scenarios like games and embedded systems, while maintaining compatibility with Lua 5.1.[89][86][90] JIT compilers offer advantages in runtime adaptability, such as tailoring optimizations to observed branch probabilities and data types, which can yield higher peak performance than static methods for long-running or polymorphic code. This enables techniques like speculative inlining based on profile data, boosting throughput in applications with unpredictable paths. However, drawbacks include initial startup delays from on-the-fly compilation, potentially increasing launch times by seconds in large programs, and elevated memory consumption from caching multiple code versions and profiles.[88][91] Essential techniques in JIT compilers encompass adaptive optimization, which iteratively recompiles code using evolving runtime profiles to refine decisions like register allocation, and deoptimization, a safety mechanism to discard optimized code and resume from a baseline interpreter if assumptions (e.g., constant values or type stability) invalidate. Deoptimization relies on inserted anchors—synchronization points that capture stack state—and assume checks to trigger on-stack replacement, preserving correctness in speculative scenarios like constant propagation or inlining. These methods, formalized in verified JIT frameworks, balance aggressive speculation with reliable fallback, as seen in systems handling dynamic languages.[92] In the 2020s, JIT developments for WebAssembly in browsers have advanced runtime compilation support, integrating profiling, on-stack replacement, and recompilation to generate specialized native code efficiently while adhering to sandbox constraints. Profile-guided optimization (PGO) has also matured in JIT environments, with .NET 9 expanding runtime profiling with Dynamic PGO to optimize more code patterns, such as type checks and casts.[93]Specialized Compilers
Specialized compilers are tailored for particular domains, architectures, or development needs, extending beyond general-purpose tools to address unique constraints such as target platform differences or specialized hardware. Cross-compilers, for instance, run on a host platform but produce executable code for a different target platform, enabling development for embedded systems or mobile devices without requiring the target hardware. The GNU Compiler Collection (GCC) supports cross-compilation through target-specific configurations, allowing developers to specify the host and target architectures via command-line options like--target=arm-linux-gnueabihf. A prominent example is the Android Native Development Kit (NDK), which uses cross-compilers to build C/C++ code for ARM-based Android devices from x86 hosts.
Source-to-source compilers, also known as transpilers, translate code from one high-level language to another, often to ensure compatibility with existing environments or to leverage specific features. Babel transpiles modern JavaScript (ES6+) to older versions (ES5) for broader browser support, using plugins to handle syntax transformations without altering semantics. Similarly, the TypeScript compiler (tsc) transpiles TypeScript—a superset of JavaScript with static typing—to plain JavaScript, enabling type-checked development while producing runnable code for web and Node.js environments.
Domain-specific compilers optimize for niche hardware or computational paradigms, incorporating specialized optimizations and backends. For graphics processing units (GPUs), NVIDIA's NVCC (CUDA Compiler Driver) compiles CUDA C++ code to parallel executable kernels for NVIDIA GPUs, handling device-specific instructions and memory management through a phased compilation process that separates host and device code. In quantum computing, Microsoft's Q# compiler translates quantum algorithms written in Q#—a domain-specific language—to executable operations for quantum simulators or hardware like Azure Quantum, supporting operations like qubit entanglement and measurement. For machine learning, the XLA (Accelerated Linear Algebra) compiler optimizes TensorFlow and JAX models by fusing operations into efficient kernels for CPUs, GPUs, or TPUs, reducing execution time through graph-level transformations without requiring code changes.
Other specialized variants include incremental compilers, which recompile only modified program portions to accelerate iterative development, and bootstrapping compilers, which compile their own source code to enable self-hosting. Rust's compiler employs incremental compilation by tracking dependencies and caching intermediate results, allowing rapid rebuilds after small changes. Bootstrapping, as in GCC's build process, starts with a minimal compiler written in C to produce a full self-compiling version.
Modern examples highlight practical applications: Rust facilitates cross-compilation via rustup's target addition (e.g., rustup target add aarch64-unknown-linux-gnu), enabling seamless builds for multiple architectures including ARM for embedded systems. Zig serves as a drop-in C/C++ compiler with built-in C interop, using @cImport to seamlessly link C headers and libraries while supporting cross-compilation out-of-the-box for targets like Windows from Linux hosts.
Emerging trends as of 2025 focus on embedded AI compilers for edge devices, optimizing neural networks for low-power microcontrollers with techniques like quantization and operator fusion. Compiler-aware hardware designs integrate AI-specific instructions into edge processors, enabling efficient inference on resource-constrained devices. These advancements support autonomous edge computing in IoT without cloud reliance.
Advanced Topics
Optimization Techniques
Optimization techniques in compilers transform intermediate representations to improve program performance metrics such as execution time and code size while preserving semantics. These techniques are broadly classified into local optimizations, which operate on small code segments; global optimizations, which span basic blocks within a function; and interprocedural optimizations, which analyze across function boundaries to enable more aggressive transformations. Local optimizations, such as peephole optimization, examine short sequences of instructions—typically 1 to 4 machine words—and replace them with equivalent but more efficient code, such as eliminating redundant loads or jumps. This approach, first systematically explored in early compiler designs, can reduce instruction count in machine-specific code generation phases.[94] Global optimizations leverage data-flow analysis across an entire function to eliminate dead code, propagate constants, or reorder computations for better register utilization. Interprocedural analysis extends this by constructing call graphs and summary information, allowing optimizations like inlining or alias elimination that cross function boundaries, improving execution speed in large programs through reduced function call overhead. Loop optimizations form a critical subset, targeting iterative structures common in numerical and data-processing code. Loop unrolling replicates the loop body to reduce overhead from increment and test operations; for instance, the following C loop:for (int i = 0; i < 4; i++) {
a[i] = b[i] + c;
}
for (int i = 0; i < 4; i++) {
a[i] = b[i] + c;
}
a[0] = b[0] + c;
a[1] = b[1] + c;
a[2] = b[2] + c;
a[3] = b[3] + c;
a[0] = b[0] + c;
a[1] = b[1] + c;
a[2] = b[2] + c;
a[3] = b[3] + c;
