Hubbry Logo
CompilerCompilerMain
Open search
Compiler
Community hub
Compiler
logo
7 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
Contribute something
Compiler
Compiler
from Wikipedia

In computing, a compiler is software that translates computer code written in one programming language (the source language) into another language (the target language). The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language (e.g. assembly language, object code, or machine code) to create an executable program.[1][2]: p1 [3]

There are many different types of compilers which produce output in different useful forms. A cross-compiler produces code for a different CPU or operating system than the one on which the cross-compiler itself runs. A bootstrap compiler is often a temporary compiler, used for compiling a more permanent or better optimized compiler for a language.

Related software include decompilers, programs that translate from low-level languages to higher level ones; programs that translate between high-level languages, usually called source-to-source compilers or transpilers; language rewriters, usually programs that translate the form of expressions without a change of language; and compiler-compilers, compilers that produce compilers (or parts of them), often in a generic and reusable way so as to be able to produce many differing compilers.

A compiler is likely to perform some or all of the following operations, often called phases: preprocessing, lexical analysis, parsing, semantic analysis (syntax-directed translation), conversion of input programs to an intermediate representation, code optimization and machine specific code generation. Compilers generally implement these phases as modular components, promoting efficient design and correctness of transformations of source input to target output. Program faults caused by incorrect compiler behavior can be very difficult to track down and work around; therefore, compiler implementers invest significant effort to ensure compiler correctness.[4]

Comparison with interpreter

[edit]

With respect to making source code runnable, an interpreter provides a similar function as a compiler, but via a different mechanism. An interpreter executes code without converting it to machine code.[2]: p2  Therefore, some interpreters execute source code while others execute an intermediate form such as bytecode.

Hence a program compiled to native code tends to run faster than when interpreted. Environments with a bytecode-intermediate-form tends toward intermediate-speed. While Just-in-time compilation allows for native execution speed with a one-time startup processing time cost.

For Low-level programming languages, such as assembly and C, it is typical that they are compiled, especially when speed is a significant concern, rather than being cross-platform supported. So that for such languages, there are more one-to-one correspondences between the source code and the resulting machine code, making it easier for programmers to control the use of hardware.

In theory; a programming language can be used via either a compiler or an interpreter, but in practice, each language tends to be used with only one or the other. Nonetheless, it is possible to write a compiler for a language that is commonly interpreted. For example, Common Lisp can be compiled to Java bytecode (and then interpreted by the Java virtual machine), as well as C code (then compiled to native machine code), or directly to native code.

History

[edit]
A diagram of the operation of a typical multi-language, multi-target compiler

Theoretical computing concepts developed by scientists, mathematicians, and engineers formed the basis of digital modern computing development during World War II. Primitive binary languages evolved because digital devices only understand ones and zeros and the circuit patterns in the underlying machine architecture. In the late 1940s, assembly languages were created to offer a more workable abstraction of the computer architectures.[5] Limited memory capacity of early computers led to substantial technical challenges when the first compilers were designed. Therefore, the compilation process needed to be divided into several small programs. The front end programs produce the analysis products used by the back end programs to generate target code. As computer technology provided more resources, compiler designs could align better with the compilation process.

It is usually more productive for a programmer to use a high-level language, so the development of high-level languages followed naturally from the capabilities offered by digital computers. High-level languages are formal languages that are strictly defined by their syntax and semantics which form the high-level language architecture. Elements of these formal languages include:

  • Alphabet, any finite set of symbols;
  • String, a finite sequence of symbols;
  • Language, any set of strings on an alphabet.

The sentences in a language may be defined by a set of rules called a grammar.[6]

Backus–Naur form (BNF) describes the syntax of "sentences" of a language. It was developed by John Backus and used for the syntax of Algol 60.[7] The ideas derive from the context-free grammar concepts by linguist Noam Chomsky.[8] "BNF and its extensions have become standard tools for describing the syntax of programming notations. In many cases, parts of compilers are generated automatically from a BNF description."[9]

Between 1942 and 1945, Konrad Zuse designed the first (algorithmic) programming language for computers called Plankalkül ("Plan Calculus"). Zuse also envisioned a Planfertigungsgerät ("Plan assembly device") to automatically translate the mathematical formulation of a program into machine-readable punched film stock.[10] While no actual implementation occurred until the 1970s, it presented concepts later seen in APL designed by Ken Iverson in the late 1950s.[11] APL is a language for mathematical computations.

Between 1949 and 1951, Heinz Rutishauser proposed Superplan, a high-level language and automatic translator.[12] His ideas were later refined by Friedrich L. Bauer and Klaus Samelson.[13]

High-level language design during the formative years of digital computing provided useful programming tools for a variety of applications:

  • FORTRAN (Formula Translation) for engineering and science applications is considered to be one of the first actually implemented high-level languages and first optimizing compiler.[14][independent source needed]
  • COBOL (Common Business-Oriented Language) evolved from A-0 and FLOW-MATIC to become the dominant high-level language for business applications.[15]
  • LISP (List Processor) for symbolic computation.[16]

Compiler technology evolved from the need for a strictly defined transformation of the high-level source program into a low-level target program for the digital computer. The compiler could be viewed as a front end to deal with the analysis of the source code and a back end to synthesize the analysis into the target code. Optimization between the front end and back end could produce more efficient target code.[17]

Some early milestones in the development of compiler technology:

  • May 1952: Grace Hopper's team at Remington Rand wrote the compiler for the A-0 programming language (and coined the term compiler to describe it),[18][19][20] although the A-0 compiler functioned more as a loader or linker than the modern notion of a full compiler.[21][22][23]
  • 1952, before September: An Autocode compiler developed by Alick Glennie for the Manchester Mark I computer at the University of Manchester is considered by some to be the first compiled programming language.[24]
  • 1954–1957: A team led by John Backus at IBM developed FORTRAN which is usually considered the first high-level language. In 1957, they completed a FORTRAN compiler that is generally credited as having introduced the first unambiguously complete compiler.[25]
  • 1959: The Conference on Data Systems Language (CODASYL) initiated development of COBOL. The COBOL design drew on A-0 and FLOW-MATIC. By the early 1960s COBOL was compiled on multiple architectures.
  • 1958–1960: Algol 58 was the precursor to ALGOL 60. It introduced code blocks, a key advance in the rise of structured programming. ALGOL 60 was the first language to implement nested function definitions with lexical scope. It included recursion. Its syntax was defined using BNF. ALGOL 60 inspired many languages that followed it. Tony Hoare remarked: "... it was not only an improvement on its predecessors but also on nearly all its successors."[26][27]
  • 1958–1962: John McCarthy at MIT designed LISP.[28] The symbol processing capabilities provided useful features for artificial intelligence research. In 1962, LISP 1.5 release noted some tools: an interpreter written by Stephen Russell and Daniel J. Edwards, a compiler and assembler written by Tim Hart and Mike Levin.[29]

Early operating systems and software were written in assembly language. In the 1960s and early 1970s, the use of high-level languages for system programming was still controversial due to resource limitations. However, several research and industry efforts began the shift toward high-level systems programming languages, for example, BCPL, BLISS, B, and C.

BCPL (Basic Combined Programming Language) designed in 1966 by Martin Richards at the University of Cambridge was originally developed as a compiler writing tool.[30] Several compilers have been implemented, Richards' book provides insights to the language and its compiler.[31] BCPL was not only an influential systems programming language that is still used in research[32] but also provided a basis for the design of B and C languages.

BLISS (Basic Language for Implementation of System Software) was developed for a Digital Equipment Corporation (DEC) PDP-10 computer by W. A. Wulf's Carnegie Mellon University (CMU) research team. The CMU team went on to develop BLISS-11 compiler one year later in 1970.

Multics (Multiplexed Information and Computing Service), a time-sharing operating system project, involved MIT, Bell Labs, General Electric (later Honeywell) and was led by Fernando Corbató from MIT.[33] Multics was written in the PL/I language developed by IBM and IBM User Group.[34] IBM's goal was to satisfy business, scientific, and systems programming requirements. There were other languages that could have been considered but PL/I offered the most complete solution even though it had not been implemented.[35] For the first few years of the Multics project, a subset of the language could be compiled to assembly language with the Early PL/I (EPL) compiler by Doug McIlory and Bob Morris from Bell Labs.[36] EPL supported the project until a boot-strapping compiler for the full PL/I could be developed.[37]

Bell Labs left the Multics project in 1969, and developed a system programming language B based on BCPL concepts, written by Dennis Ritchie and Ken Thompson. Ritchie created a boot-strapping compiler for B and wrote Unics (Uniplexed Information and Computing Service) operating system for a PDP-7 in B. Unics eventually became spelled Unix.

Bell Labs started the development and expansion of C based on B and BCPL. The BCPL compiler had been transported to Multics by Bell Labs and BCPL was a preferred language at Bell Labs.[38] Initially, a front-end program to Bell Labs' B compiler was used while a C compiler was developed. In 1971, a new PDP-11 provided the resource to define extensions to B and rewrite the compiler. By 1973 the design of C language was essentially complete and the Unix kernel for a PDP-11 was rewritten in C. Steve Johnson started development of Portable C Compiler (PCC) to support retargeting of C compilers to new machines.[39][40]

Object-oriented programming (OOP) offered some interesting possibilities for application development and maintenance. OOP concepts go further back but were part of LISP and Simula language science.[41] Bell Labs became interested in OOP with the development of C++.[42] C++ was first used in 1980 for systems programming. The initial design leveraged C language systems programming capabilities with Simula concepts. Object-oriented facilities were added in 1983.[43] The Cfront program implemented a C++ front-end for C84 language compiler. In subsequent years several C++ compilers were developed as C++ popularity grew.

In many application domains, the idea of using a higher-level language quickly caught on. Because of the expanding functionality supported by newer programming languages and the increasing complexity of computer architectures, compilers became more complex.

DARPA (Defense Advanced Research Projects Agency) sponsored a compiler project with Wulf's CMU research team in 1970. The Production Quality Compiler-Compiler PQCC design would produce a Production Quality Compiler (PQC) from formal definitions of source language and the target.[44] PQCC tried to extend the term compiler-compiler beyond the traditional meaning as a parser generator (e.g., Yacc) without much success. PQCC might more properly be referred to as a compiler generator.

PQCC research into code generation process sought to build a truly automatic compiler-writing system. The effort discovered and designed the phase structure of the PQC. The BLISS-11 compiler provided the initial structure.[45] The phases included analyses (front end), intermediate translation to virtual machine (middle end), and translation to the target (back end). TCOL was developed for the PQCC research to handle language specific constructs in the intermediate representation.[46] Variations of TCOL supported various languages. The PQCC project investigated techniques of automated compiler construction. The design concepts proved useful in optimizing compilers and compilers for the (since 1995, object-oriented) programming language Ada.

The Ada STONEMAN document[a] formalized the program support environment (APSE) along with the kernel (KAPSE) and minimal (MAPSE). An Ada interpreter NYU/ED supported development and standardization efforts with the American National Standards Institute (ANSI) and the International Standards Organization (ISO). Initial Ada compiler development by the U.S. Military Services included the compilers in a complete integrated design environment along the lines of the STONEMAN document. Army and Navy worked on the Ada Language System (ALS) project targeted to DEC/VAX architecture while the Air Force started on the Ada Integrated Environment (AIE) targeted to IBM 370 series. While the projects did not provide the desired results, they did contribute to the overall effort on Ada development.[47]

Other Ada compiler efforts got underway in Britain at the University of York and in Germany at the University of Karlsruhe. In the U. S., Verdix (later acquired by Rational) delivered the Verdix Ada Development System (VADS) to the Army. VADS provided a set of development tools including a compiler. Unix/VADS could be hosted on a variety of Unix platforms such as DEC Ultrix and the Sun 3/60 Solaris targeted to Motorola 68020 in an Army CECOM evaluation.[48] There were soon many Ada compilers available that passed the Ada Validation tests. The Free Software Foundation GNU project developed the GNU Compiler Collection (GCC) which provides a core capability to support multiple languages and targets. The Ada version GNAT is one of the most widely used Ada compilers. GNAT is free but there is also commercial support, for example, AdaCore, was founded in 1994 to provide commercial software solutions for Ada. GNAT Pro includes the GNU GCC based GNAT with a tool suite to provide an integrated development environment.

High-level languages continued to drive compiler research and development. Focus areas included optimization and automatic code generation. Trends in programming languages and development environments influenced compiler technology. More compilers became included in language distributions (PERL, Java Development Kit) and as a component of an IDE (VADS, Eclipse, Ada Pro). The interrelationship and interdependence of technologies grew. The advent of web services promoted growth of web languages and scripting languages. Scripts trace back to the early days of Command Line Interfaces (CLI) where the user could enter commands to be executed by the system. User Shell concepts developed with languages to write shell programs. Early Windows designs offered a simple batch programming capability. The conventional transformation of these language used an interpreter. While not widely used, Bash and Batch compilers have been written. More recently sophisticated interpreted languages became part of the developers tool kit. Modern scripting languages include PHP, Python, Ruby and Lua. (Lua is widely used in game development.) All of these have interpreter and compiler support.[49]

"When the field of compiling began in the late 50s, its focus was limited to the translation of high-level language programs into machine code ... The compiler field is increasingly intertwined with other disciplines including computer architecture, programming languages, formal methods, software engineering, and computer security."[50] The "Compiler Research: The Next 50 Years" article noted the importance of object-oriented languages and Java. Security and parallel computing were cited among the future research targets.

Compiler construction

[edit]

A compiler implements a formal transformation from a high-level source program to a low-level target program. Compiler design can define an end-to-end solution or tackle a defined subset that interfaces with other compilation tools e.g. preprocessors, assemblers, linkers. Design requirements include rigorously defined interfaces both internally between compiler components and externally between supporting toolsets.

In the early days, the approach taken to compiler design was directly affected by the complexity of the computer language to be processed, the experience of the person(s) designing it, and the resources available. Resource limitations led to the need to pass through the source code more than once.

A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. However, as the source language grows in complexity the design may be split into a number of interdependent phases. Separate phases provide design improvements that focus development on the functions in the compilation process.

One-pass vis-à-vis multi-pass compilers

[edit]

Classifying compilers by number of passes has its background in the hardware resource limitations of computers. Compiling involves performing much work and early computers did not have enough memory to contain one program that did all of this work. As a result, compilers were split up into smaller programs which each made a pass over the source (or some representation of it) performing some of the required analysis and translations.

The ability to compile in a single pass has classically been seen as a benefit because it simplifies the job of writing a compiler and one-pass compilers generally perform compilations faster than multi-pass compilers. Thus, partly driven by the resource limitations of early systems, many early languages were specifically designed so that they could be compiled in a single pass (e.g., Pascal).

In some cases, the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.

The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high quality code. It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyse one expression many times but only analyse another expression once.

Splitting a compiler up into small programs is a technique used by researchers interested in producing provably correct compilers. Proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.

Three-stage compiler structure

[edit]
Compiler design

Regardless of the exact number of phases in the compiler design, the phases can be assigned to one of three stages. The stages include a front end, a middle end, and a back end.

  • The front end scans the input and verifies syntax and semantics according to a specific source language. For statically typed languages it performs type checking by collecting type information. If the input program is syntactically incorrect or has a type error, it generates error and/or warning messages, usually identifying the location in the source code where the problem was detected; in some cases the actual error may be (much) earlier in the program. Aspects of the front end include lexical analysis, syntax analysis, and semantic analysis. The front end transforms the input program into an intermediate representation (IR) for further processing by the middle end. This IR is usually a lower-level representation of the program with respect to the source code.
  • The middle end performs optimizations on the IR that are independent of the CPU architecture being targeted. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors. Examples of middle end optimizations are removal of useless (dead-code elimination) or unreachable code (reachability analysis), discovery and propagation of constant values (constant propagation), relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context, eventually producing the "optimized" IR that is used by the back end.
  • The back end takes the optimized IR from the middle end. It may perform more analysis, transformations and optimizations that are specific for the target CPU architecture. The back end generates the target-dependent assembly code, performing register allocation in the process. The back end performs instruction scheduling, which re-orders instructions to keep parallel execution units busy by filling delay slots. Although most optimization problems are NP-hard, heuristic techniques for solving them are well-developed and implemented in production-quality compilers. Typically the output of a back end is machine code specialized for a particular processor and operating system.

This front/middle/back-end approach makes it possible to combine front ends for different languages with back ends for different CPUs while sharing the optimizations of the middle end.[51] Practical examples of this approach are the GNU Compiler Collection, Clang (LLVM-based C/C++ compiler),[52] and the Amsterdam Compiler Kit, which have multiple front-ends, shared optimizations and multiple back-ends.

Front end

[edit]
Lexer and parser example for C. Starting from the sequence of characters "if(net>0.0)total+=net*(1.0+tax/100.0);", the scanner composes a sequence of tokens, and categorizes each of them, for example as identifier, reserved word, number literal, or operator. The latter sequence is transformed by the parser into a syntax tree, which is then treated by the remaining compiler phases. The scanner and parser handles the regular and properly context-free parts of the grammar for C, respectively.

The front end analyzes the source code to build an internal representation of the program, called the intermediate representation (IR). It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope.

While the frontend can be a single monolithic function or program, as in a scannerless parser, it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns. Most commonly, the frontend is broken into three phases: lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases, these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases these require manual modification. The lexical grammar and phrase grammar are usually context-free grammars, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using attribute grammars. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree). In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare.

The main phases of the front end include the following:

  • Line reconstruction converts the input character sequence to a canonical form ready for the parser. Languages which strop their keywords or allow arbitrary spaces within identifiers require this phase. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a Line Reconstruction phase.
  • Preprocessing supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
  • Lexical analysis (also known as lexing or tokenization) breaks the source code text into a sequence of small pieces called lexical tokens.[53] This phase can be divided into two stages: the scanning, which segments the input text into syntactic units called lexemes and assigns them a category; and the evaluating, which converts lexemes into a processed value. A token is a pair consisting of a token name and an optional token value.[54] Common token categories may include identifiers, keywords, separators, operators, literals and comments, although the set of token categories varies in different programming languages. The lexeme syntax is typically a regular language, so a finite-state automaton constructed from a regular expression can be used to recognize it. The software doing lexical analysis is called a lexical analyzer. This may not be a separate step—it can be combined with the parsing step in scannerless parsing, in which case parsing is done at the character level, not the token level.
  • Syntax analysis (also known as parsing) involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.[55]
  • Semantic analysis adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), or object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase, and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.

Middle end

[edit]

The middle end, also known as optimizer, performs optimizations on the intermediate representation in order to improve the performance and the quality of the produced machine code.[56] The middle end contains those optimizations that are independent of the CPU architecture being targeted.

The main phases of the middle end include the following:

Compiler analysis is the prerequisite for any compiler optimization, and they tightly work together. For example, dependence analysis is crucial for loop transformation.

The scope of compiler analysis and optimizations vary greatly; their scope may range from operating within a basic block, to whole procedures, or even the whole program. There is a trade-off between the granularity of the optimizations and the cost of compilation. For example, peephole optimizations are fast to perform during compilation but only affect a small local fragment of the code, and can be performed independently of the context in which the code fragment appears. In contrast, interprocedural optimization requires more compilation time and memory space, but enable optimizations that are only possible by considering the behavior of multiple functions simultaneously.

Interprocedural analysis and optimizations are common in modern commercial compilers from HP, IBM, SGI, Intel, Microsoft, and Sun Microsystems. The free software GCC was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. Another open source compiler with full analysis and optimization infrastructure is Open64, which is used by many organizations for research and commercial purposes.

Due to the extra time and space needed for compiler analysis and optimizations, some compilers skip them by default. Users have to use compilation options to explicitly tell the compiler which optimizations should be enabled.

Back end

[edit]

The back end is responsible for the CPU architecture specific optimizations and for code generation.[56]

The main phases of the back end include the following:

  • Machine dependent optimizations: optimizations that depend on the details of the CPU architecture that the compiler targets.[57] A prominent example is peephole optimizations, which rewrites short sequences of assembler instructions into more efficient instructions.
  • Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also Sethi–Ullman algorithm). Debug data may also need to be generated to facilitate debugging.

Compiler correctness

[edit]

Compiler correctness is the branch of software engineering that deals with trying to show that a compiler behaves according to its language specification.[58] Techniques include developing the compiler using formal methods and using rigorous testing (often called compiler validation) on an existing compiler.

Compiled vis-à-vis interpreted languages

[edit]

Higher-level programming languages usually appear with a type of translation in mind: either designed as compiled language or interpreted language. However, in practice there is rarely anything about a language that requires it to be exclusively compiled or exclusively interpreted, although it is possible to design languages that rely on re-interpretation at run time. The categorization usually reflects the most popular or widespread implementations of a language – for instance, BASIC is sometimes called an interpreted language, and C a compiled one, despite the existence of BASIC compilers and C interpreters.[59]

Interpretation does not replace compilation completely. It only hides it from the user and makes it gradual. Even though an interpreter can itself be interpreted, a set of directly executed machine instructions is needed somewhere at the bottom of the execution stack (see machine language).

Furthermore, for optimization compilers can contain interpreter functionality, and interpreters may include ahead of time compilation techniques. For example, where an expression can be executed during compilation and the results inserted into the output program, then it prevents it having to be recalculated each time the program runs, which can greatly speed up the final program. Modern trends toward just-in-time compilation and bytecode interpretation at times blur the traditional categorizations of compilers and interpreters even further. Meta-tracing is an automated compiler synthesis approach which takes this further and can be used to synthesize a compiler from a language interpreter.

Some language specifications spell out that implementations must include a compilation facility; for example, Common Lisp. However, there is nothing inherent in the definition of Common Lisp that stops it from being interpreted. Other languages have features that are very easy to implement in an interpreter, but make writing a compiler much harder; for example, APL, SNOBOL4,[60] and many scripting languages allow programs to construct arbitrary source code at runtime with regular string operations, and then execute that code by passing it to a special evaluation function. To implement these features in a compiled language, programs must usually be shipped with a runtime library that includes a version of the compiler itself.

Types

[edit]

One classification of compilers is by the platform on which their generated code executes. This is known as the target platform.

A native or hosted compiler is one whose output is intended to directly run on the same type of computer and operating system that the compiler itself runs on. The output of a cross compiler is designed to run on a different platform. Cross compilers are often used when developing software for embedded systems that are not intended to support a software development environment.

The output of a compiler that produces code for a virtual machine (VM) may or may not be executed on the same platform as the compiler that produced it. For this reason, such compilers are not usually classified as native or cross compilers.

The lower level language that is the target of a compiler may itself be a high-level programming language. C, viewed by some as a sort of portable assembly language, is frequently the target language of such compilers. For example, Cfront, the original compiler for C++, used C as its target language. The C code generated by such a compiler is usually not intended to be readable and maintained by humans, so indent style and creating pretty C intermediate code are ignored. Some of the features of C that make it a good target language include the #line directive, which can be generated by the compiler to support debugging of the original source, and the wide platform support available with C compilers.

While a common compiler type outputs machine code, there are many other types:

  • Source-to-source compilers are a type of compiler that takes a high-level language as its input and outputs a high-level language. For example, an automatic parallelizing compiler will frequently take in a high-level language program as an input and then transform the code and annotate it with parallel code annotations (e.g. OpenMP) or language constructs (e.g. Fortran's DOALL statements). Other terms for a source-to-source compiler are transcompiler or transpiler.[61]
  • Bytecode compilers compile to assembly language of a theoretical machine, like some Prolog implementations
  • Just-in-time compilers (JIT compiler) defer compilation until runtime. JIT compilers exist for many modern languages including Python, JavaScript, Smalltalk, Java, Microsoft .NET's Common Intermediate Language (CIL) and others. A JIT compiler generally runs inside an interpreter. When the interpreter detects that a code path is "hot", meaning it is executed frequently, the JIT compiler will be invoked and compile the "hot" code for increased performance.
    • For some languages, such as Java, applications are first compiled using a bytecode compiler and delivered in a machine-independent intermediate representation. A bytecode interpreter executes the bytecode, but the JIT compiler will translate the bytecode to machine code when increased performance is necessary.[62][non-primary source needed]
  • Hardware compilers (also known as synthesis tools) are compilers whose input is a hardware description language and whose output is a description, in the form of a netlist or otherwise, of a hardware configuration.
  • A program that translates from a low-level language to a higher level one is a decompiler.[67]
  • A program that translates into an object code format that is not supported on the compilation machine is called a cross compiler and is commonly used to prepare code for execution on embedded software applications.[68][better source needed]
  • A program that rewrites object code back into the same type of object code while applying optimisations and transformations is a binary recompiler.

Assemblers, which translate human readable assembly language to the machine code instructions executed by hardware, are not considered compilers.[69][b] (The inverse program that translates machine code to assembly language is called a disassembler.)

See also

[edit]

Notes and references

[edit]

Further reading

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
A compiler is a computer program that translates source code written in a high-level programming language, such as C or Java, into a lower-level target language, typically machine code or bytecode, to produce an executable program that a computer can run directly. This translation enables developers to write abstract, human-readable instructions without needing to manage low-level hardware details, fundamentally separating programming logic from machine-specific operations. The architecture of a compiler is divided into a front-end and a back-end, with the front-end handling language-specific tasks and the back-end focusing on machine-specific output. Key phases in the front-end include , which breaks the source code into tokens like keywords and identifiers while ignoring whitespace and comments; syntax analysis, which parses tokens to build an representing the program's structure; and semantic analysis, which checks for meaning and context, such as type compatibility and scope resolution. In the back-end, intermediate code generation creates a platform-independent representation, followed by optimization to improve efficiency—such as eliminating redundant computations—and final code generation to produce target machine instructions. The development of compilers traces back to the early , when the term "compiler" was coined by pioneering computer scientist Grace Murray Hopper to describe a system for automatically translating high-level instructions into . One of the first practical implementations was the compiler in the late 1950s, which demonstrated the feasibility of generating optimized from a problem-oriented source language, marking a shift from hand-written assembly to automated translation. This innovation laid the groundwork for modern by enabling portable, efficient code across diverse hardware. Compilers play a pivotal role in by enhancing software , energy efficiency, and through advanced optimizations and detection, while also facilitating the creation of domain-specific languages and tools. Unlike interpreters, which execute code line-by-line at runtime, compilers produce standalone executables that run faster and with lower resource overhead, making them indispensable for large-scale applications. Ongoing research continues to evolve compilers to support emerging paradigms like and AI-driven code generation.

Overview

Definition and Purpose

A compiler is a computer program that translates source code written in a high-level programming language into a lower-level target language, such as machine code or bytecode, suitable for execution on a specific computer architecture. This translation process ensures that the resulting program maintains the intended functionality of the original source while adapting it to the hardware's instruction set. The primary purpose of a compiler is to bridge the gap between human-readable, abstract programming languages and the low-level instructions required by processors, thereby enabling developers to focus on algorithmic logic rather than machine-specific details. By abstracting hardware complexities, compilers promote portability across different systems and improve overall software efficiency through optimizations that produce faster, more resource-efficient executables. This abstraction not only accelerates but also enhances and in large-scale applications. At a high level, a compiler's involves taking as input, performing an analysis phase to parse and validate the code's structure and semantics, followed by a synthesis phase to generate the target output, such as or an binary. For instance, the early developed by in 1952 represented one of the first such translation tools, functioning as a rudimentary compiler for the computer by linking subroutines into form. A practical example is compiling a simple C program, such as one printing "Hello, World!", using a tool like GCC with the -S flag to produce x86 assembly code; this generates instructions like mov and call that the assembler then converts to machine code for execution on an x86 processor.

Key Components

A compiler's internal structure is typically modular, comprising several core components that process source code through distinct stages to produce executable target code. The primary components include the lexical analyzer (also known as the scanner), parser, semantic analyzer, code generator, and symbol table manager. These elements work in sequence to analyze and synthesize code, ensuring syntactic and semantic correctness while optimizing for the target machine. The lexical analyzer, or scanner, is the first component in the , responsible for reading the source code as a stream of characters and converting it into a sequence of meaningful tokens, such as keywords, identifiers, operators, and literals. It performs tasks like removing comments and whitespace, expanding macros, and handling language-specific features such as or indentation-based scoping (e.g., in Python). By grouping characters into tokens, the scanner simplifies subsequent and detects basic lexical errors, such as invalid characters. Following the scanner, the parser, or syntax analyzer, takes the token stream as input and constructs an (AST) or that represents the hierarchical structure of the program according to the language's . It verifies that the tokens conform to syntactic rules, resolving ambiguities through techniques like or LR parsing, and reports syntax errors if the structure is invalid. The resulting AST captures the program's logical organization, such as expressions, statements, and declarations, enabling higher-level analysis. The semantic analyzer then processes the AST to enforce static semantic rules beyond syntax, such as type checking, scope resolution, and declaration consistency. It verifies that operations are semantically valid—for instance, ensuring no type mismatches in expressions or undeclared variable usage—and annotates the AST with additional information, like inferred types or resolved references, often producing a decorated AST or semantic graph. This phase detects errors that syntax alone cannot catch, such as assigning a to an integer variable. The code generator operates toward the end of the process, translating the semantically verified (IR)—typically generated after semantic analysis—into target-specific or assembly. It maps high-level constructs to low-level instructions, incorporating and addressing modes tailored to the target architecture, such as or , to produce efficient executable output. Throughout these stages, the manager maintains a dynamic that stores and retrieves information about program symbols, including variables, functions, types, and scopes. It tracks attributes like declaration locations, types, and , supporting lookups during , semantic , and code generation to resolve references and prevent errors like redeclarations. The facilitates interconnections by providing shared access to symbol data across components, often implemented as hash tables or trees for efficient operations. These components interconnect in a linear data flow: the source code passes to the lexical analyzer, yielding tokens to the parser, which builds the AST for the semantic analyzer; the enriched IR then feeds the code generator, with the manager integrating data bidirectionally to support analysis and synthesis. This allows for independent development and reuse, as seen in compiler toolkits like Lex for generating lexical analyzers from specifications and for producing parsers from context-free grammars, which accelerate prototyping by automating routine tasks. A typical block diagram of these interactions can be visualized as follows:

[Source Code](/page/Source_code) | v Lexical Analyzer (Scanner) | v [Tokens](/page/The_Tokens) | v Parser | v Abstract Syntax Tree (AST) | v Semantic Analyzer | ^ v | (Symbol lookups) [Intermediate Representation](/page/Intermediate_representation) (IR) <--- [Symbol Table](/page/Symbol_table) Manager | v Code Generator | v Target Code

[Source Code](/page/Source_code) | v Lexical Analyzer (Scanner) | v [Tokens](/page/The_Tokens) | v Parser | v Abstract Syntax Tree (AST) | v Semantic Analyzer | ^ v | (Symbol lookups) [Intermediate Representation](/page/Intermediate_representation) (IR) <--- [Symbol Table](/page/Symbol_table) Manager | v Code Generator | v Target Code

This flow highlights the progressive transformation from raw input to executable output, with the symbol table enabling cross-phase consistency.

Comparison with Other Systems

Interpreters

An interpreter is a program that executes directly, typically line by line or statement by statement, without first translating the entire code into a separate file. Unlike compilers, which produce for later execution, interpreters read, analyze, and run the code on-the-fly during runtime. The execution model of an interpreter involves processing the source sequentially, often translating it into an intermediate form such as before immediate execution, sometimes within a environment that abstracts the underlying hardware. This approach allows for dynamic evaluation, where can be modified or extended while running, providing greater flexibility for interactive and scripting applications. Interpreters offer several advantages over compilers, including easier due to line-by-line execution that pinpoints errors immediately, with instant feedback without a separate compilation phase, and enhanced portability across platforms via the layer. For instance, Python's implementation exemplifies this model, where source code is compiled to and then interpreted by the , enabling seamless execution in interactive sessions or scripts. However, interpreters generally incur performance penalties, as the repeated analysis and translation at each run eliminate opportunities for whole-program optimizations that compilers can apply, resulting in significantly slower execution speeds, with slowdown factors often ranging from 10x to thousands compared to compiled equivalents, depending on the workload. This overhead stems from the on-the-fly processing, which avoids upfront compilation but limits efficiency for compute-intensive tasks. Earlier examples include Short Code, proposed in 1949 by John Mauchly and implemented for computers like the BINAC and UNIVAC I, which used an interpreter to execute high-level instructions line by line. A notable early interpreter was developed for the Lisp programming language starting in fall 1958 by John McCarthy and his team at MIT, initially as a simpler alternative to building a full compiler, marking a pivotal innovation in symbolic computation and AI research. Hybrid approaches, such as those combining interpretation with just-in-time compilation, bridge the gap by interpreting initially for quick startup while compiling hot code paths for improved performance, though these extend beyond pure interpretation.

Assemblers and Linkers

Assemblers are specialized programs that translate assembly language, a low-level symbolic representation of machine instructions, into relocatable machine code. Assembly language uses mnemonic instructions and symbolic addresses to represent binary operations, providing a one-to-one correspondence with the target processor's instruction set architecture (ISA). This translation process involves resolving symbols, such as labels for memory locations, into actual addresses while generating object files that include the machine code and relocation information for later use. Unlike compilers, which process high-level source code through syntactic and semantic analysis, assemblers operate directly on this intermediate, ISA-specific input without performing complex optimizations or type checking, focusing instead on straightforward resolution and mapping. Typically, assemblers perform this in one or two passes: the first pass builds a and calculates addresses, while the second generates the by substituting machine for mnemonics. Linkers, also known as link editors, take the output from compilers or assemblers—object files containing relocatable and unresolved external references—and combine them into a single binary suitable for execution on the target system. Their primary functions include symbol resolution, where undefined references (e.g., calls to functions in other modules or libraries) are matched to their definitions; relocation, which adjusts addresses to fit the final memory layout; and , incorporating necessary routines from static or shared libraries. Linkers support two main types of linking: static and dynamic. In static linking, all required library code is copied into the executable at link time, resulting in a self-contained binary that does not depend on external files at runtime, though this increases file size. Dynamic linking, by contrast, defers resolution until load or run time, referencing shared libraries (e.g., .so files on Unix-like systems) to reduce redundancy and enable updates to libraries without recompiling the program. A prominent example is the GNU linker (ld), part of the GNU Binutils suite, which supports multiple object formats like ELF and uses a command language for customizing output sections and memory placement. In the typical compilation pipeline, a compiler generates from high-level source, which may already include assembly-like intermediates; if source is written in assembly, an assembler explicitly produces the . The linker then processes these s, resolving dependencies and producing the final , ensuring the program can run independently or with dynamic libraries. Early assemblers emerged in the to simplify programming of the first electronic computers, with developing the initial , known as "Contracted Notation" or autocode, in 1947 for the Automatic Relay Computer (ARC) at Birkbeck College, . This innovation replaced tedious binary coding with symbolic instructions, paving the way for more efficient low-level programming before the widespread adoption of high-level languages.

Historical Development

Early Innovations (1950s–1970s)

The development of compilers began in the early 1950s amid significant hardware constraints, including limited memory and processing power on machines like the UNIVAC I, which compelled innovators to create tools for automating code translation from high-level descriptions to machine instructions. Grace Hopper led the creation of the A-0 system in 1952 at Remington Rand, recognized as the first compiler precursor, which automatically linked and ordered subroutines based on specifications read from punched cards, marking an initial step toward automatic programming. Concurrently, A.E. Glennie developed Autocode in 1952 for the Manchester Mark 1 computer, an early high-level programming system that translated mathematical expressions into machine code, addressing the tedium of manual assembly coding. By 1954, R.A. Brooker at the University of Manchester advanced this with the Mark 1 Autocode, introducing the first compiled subroutines that generated efficient machine code for reusable code blocks, overcoming manual optimization challenges on resource-scarce hardware. A pivotal milestone arrived in 1957 with the release of the compiler by and his team at for the , the first widely used high-level language compiler that translated scientific formulas into optimized assembly code, dramatically reducing programming time from weeks to hours despite the era's 36-bit word size and 4K-word memory limits. This compiler incorporated early optimization techniques, such as and , to produce code rivaling hand-written assembly in efficiency, thus proving high-level languages viable on limited hardware. Key figures like Hopper, , and Brooker navigated these constraints by emphasizing syntax analysis and code generation, laying groundwork for structured translation methods. In the 1960s, , formalized in 1960 by an international committee including , profoundly influenced by introducing block structures, recursive procedures, and lexical scoping, which compilers could enforce to promote readable, modular code free from unstructured jumps. This language spurred advancements in compiler design, including syntax-directed —a technique where drives semantic actions for code generation—exemplified in early implementations that used recursive descent parsers to handle nested blocks efficiently. 's FORTRAN IV, released in 1962 for systems like the IBM 7090, built on prior versions with enhanced optimizations such as index register usage and , enabling better performance on evolving hardware while standardizing features like explicit type declarations to ease portability. The 1970s saw further innovation with developing the language in 1972 at , designed as a portable tool that combined high-level abstractions with low-level access, initially compiled for the PDP-11 to rewrite Unix in by 1973. Ritchie's collaboration with integrated into the emerging Unix , where the compiler formed a core component alongside assemblers and linkers, facilitating modular on minicomputers. To address machine dependency, Stephen C. Johnson's (PCC) in the mid-1970s employed an intermediate language representation, allowing front-end parsing to target multiple back-ends and enabling Unix porting across architectures like the VAX, thus overcoming hardware fragmentation through retargetable design. These efforts by Ritchie, Thompson, and Johnson resolved persistent challenges like manual optimization and non-portability, establishing compilers as essential for scalable software ecosystems.

Modern Advancements (1980s–Present)

The 1980s and 1990s saw the rise of open-source compilers that emphasized portability and modularity, addressing the limitations of proprietary systems tied to specific hardware. The GNU Compiler Collection (GCC), launched by in 1987 as part of the GNU Project, became a cornerstone for development by supporting multiple languages and architectures, enabling developers to build portable applications without . By the late 1990s, GCC had evolved to support several programming languages, including C, C++, , , and Ada, and numerous processor architectures, fostering widespread adoption in systems and beyond. The Low Level Virtual Machine () project, started in 2000 by at the University of Illinois, introduced a modular compiler infrastructure that separated the front-end language parsing from back-end code generation, allowing reusable components across different languages and targets. This design facilitated easier experimentation with optimizations and supported the creation of compilers for new languages with minimal effort. In the 2000s, just-in-time () compilation gained prominence with the HotSpot , released by in 1999, which dynamically compiled to native code at runtime for improved performance in enterprise applications. Similarly, the .NET Framework's , introduced by in 2002, incorporated compilation to enable seamless execution of managed code across platforms, boosting the adoption of languages like C#. The decade also witnessed hybrid approaches for scripting languages, such as V8 for in 2008, blending interpretation with to balance startup speed and runtime efficiency in web browsers. The brought compilers focused on safety and performance for modern paradigms. Mozilla's compiler (rustc), with development beginning in 2009 and the first stable release in 2015, pioneered memory-safe through its borrow checker, preventing common errors like dereferences without garbage collection, and has since been used in like the . Apple's Swift compiler, announced in 2014, emphasized and high performance for development, compiling to native code while integrating with ecosystems to reduce boilerplate and errors. (Wasm) compilers emerged around 2015 as a portable binary format, with tools like (extended in this era) enabling C/C++ code to run efficiently in browsers via to Wasm modules. From the late to the , AI-driven optimizations transformed compiler design. LLVM's MLGO framework, integrated in 2021, uses to guide phase ordering and inlining decisions, achieving up to 1.3% geomean speedup on SPEC benchmarks compared to traditional heuristics. Key trends include modular architectures for easier extension, as seen in LLVM's influence on projects like and Swift; cross-compilation support for embedded and IoT devices, with GCC and enabling builds for and without host modifications; and the dominance of open-source compilers, which by 2023 powered over 90% of distributions and cloud infrastructure. In 2025, parallel compilation advancements in /LLVM, such as distributed task execution, improved build times for large projects, enhancing developer productivity. Quantum compilation also advanced, with IBM's framework incorporating circuit optimization passes in its 2025 release to map high-level quantum algorithms to noisy intermediate-scale quantum hardware more efficiently.

Compiler Architecture

Phases and Passes

Compiler processing is organized into phases and passes to systematically transform into executable . Phases represent distinct, sequential transformation steps that operate on the input, such as , which breaks the into tokens; syntax analysis, which verifies the structure against grammar rules; semantic analysis, which checks type compatibility and scope rules; and code generation, which produces target code from an . These phases can be executed in a single traversal or repeated across multiple passes, depending on the compiler design. To illustrate the flow through these main phases, consider how a typical compiler processes the expression i = i * 70 + j + 2 (assuming integer types and standard operator precedence where multiplication binds tighter than addition): 1. Lexical Analysis (Tokens):
  • i (identifier)
  • = (operator)
  • i (identifier)
  • * (operator)
  • 70 (integer literal)
  • + (operator)
  • j (identifier)
  • + (operator)
  • 2 (integer literal)
2. Syntax Analysis (Parse Tree / AST structure): The expression is parsed as an assignment: i = ((i * 70) + j) + 2 Abstract Syntax Tree outline:
  • Assignment node
    • Left: identifier "i"
    • Right: + node
      • Left: + node
        • Left: * node
          • Left: identifier "i"
          • Right: constant 70
        • Right: identifier "j"
      • Right: constant 2
3. Semantic Analysis:
  • Verifies type compatibility (all operands are integers → no errors)
  • Builds symbol table entries for i and j (assuming declared as int)
  • Annotates the AST with type information
4. Intermediate Code Generation (Three-Address Code):

t1 = i * 70 t2 = t1 + j t3 = t2 + 2 i = t3

t1 = i * 70 t2 = t1 + j t3 = t2 + 2 i = t3

(Alternative quadruple form:
( *, i, 70, t1 )
( +, t1, j, t2 )
( +, t2, 2, t3 )
( =, t3, _, i ))
5. Code Optimization (basic, no major changes here): Same as above (no common subexpressions or constant folding applicable without variable values) 6. Code Generation (example x86 assembly-like):

mov eax, [i] ; load i imul eax, 70 ; i * 70 add eax, [j] ; + j add eax, 2 ; + 2 mov [i], eax ; store back to i

mov eax, [i] ; load i imul eax, 70 ; i * 70 add eax, [j] ; + j add eax, 2 ; + 2 mov [i], eax ; store back to i

Passes, in contrast, refer to complete traversals of the source code or intermediate representations, where each pass may encompass one or more phases to build or refine the output progressively. One-pass compilers perform all necessary analysis and generation in a single sweep through the source code, enabling faster compilation times but requiring languages with restricted features, such as no forward references to symbols or simple block structures to avoid deferred decisions. For instance, designed languages like Pascal to support single-pass compilation for simplicity and efficiency in resource-constrained environments. In contrast, multi-pass compilers make multiple traversals, allowing phases to be separated and repeated for deeper analysis, such as multiple optimization iterations on intermediate code; this approach supports more complex languages and aggressive optimizations but increases compilation time and usage due to repeated loading and processing. Early compilers like the I system from 1957 employed a multi-pass design with six passes to handle optimization despite limited hardware, marking a shift from manual assembly to automated code improvement. The trade-offs between one-pass and multi-pass designs balance compilation speed against code quality and language expressiveness. Single-pass approaches minimize overhead, making them suitable for interactive or embedded systems where quick feedback is essential, but they limit optimizations and error detection to what can be resolved on-the-fly. Multi-pass designs, while resource-intensive, enable comprehensive error checking and transformations, such as global , which are infeasible in a single traversal; however, they can lead to longer build times, especially in large projects. Over time, as hardware capabilities grew, compilers evolved toward multi-pass architectures to exploit opportunities for better performance, with modern systems like GCC using dozens of passes for modular optimization. A simple example of multi-pass processing is the two-pass assembler, commonly used in early computing systems. In the first pass, the assembler scans the source to build a by assigning addresses to labels and resolving forward references without generating code. The second pass then traverses the code again, substituting symbol addresses and emitting machine instructions based on the table. This design illustrates the core benefit of passes: deferring resolution until sufficient information is gathered. Bootstrapping compilers, which compile their own , often rely on multi-pass structures to iteratively improve the compiler itself, starting from an initial version written in another language. The NELIAC compiler from 1958 pioneered this by using a bootstrap to generate subsequent versions on the 7090. In terms of structure, one-pass compilers follow a linear flow:

Source Code → Lexical → Syntax → Semantic → Code Gen → Machine Code

Source Code → Lexical → Syntax → Semantic → Code Gen → Machine Code

While multi-pass designs involve iterative or pipelined traversals, such as separate passes for front-end analysis (covered in the Front End section) and back-end generation (covered in the Back End section), allowing reuse of intermediate forms.

Front End

The front end of a compiler performs the initial analysis of , verifying its adherence to the programming language's and semantics while generating a machine-independent . This phase is source-language dependent, focusing on transforming raw text into a structured form suitable for further processing. By isolating language-specific concerns, the front end enables reusability with multiple back ends for different target architectures. The front end comprises three primary components: lexical analysis, syntax analysis, and semantic analysis. Lexical analysis scans the input stream of characters and groups them into meaningful units called tokens, such as keywords, identifiers, operators, and literals. This process relies on regular expressions to define token patterns, which are converted into deterministic finite automata (DFAs) for efficient recognition. For instance, the regular expression for an identifier, [azAZ][azAZ09][a-zA-Z][a-zA-Z0-9]^*, matches a starting letter followed by zero or more alphanumeric characters, ensuring tokens like variable names are correctly isolated. Tools like Lex automate DFA construction from these specifications, optimizing scanning performance through state minimization. Syntax analysis builds upon the token stream to validate the program's structure against the language's , producing a that represents hierarchical relationships. Parsers operate either top-down, as in algorithms that predict expansions from the start symbol, or bottom-up, as in LR algorithms that reduce tokens to non-terminals via shift-reduce mechanisms. LR parsers, introduced by Knuth, handle a broader class of grammars deterministically in linear time, making them suitable for most programming languages. Parser generators such as , which produces LALR(1) parsers from grammar rules augmented with semantic actions, and its GNU successor , facilitate rapid development by automating table-driven parsing. Semantic analysis examines the for context-sensitive properties, ensuring the program's meaning is well-defined. Key tasks include type checking, which verifies operand compatibility in operations (e.g., disallowing of incompatible types like and ), and scope resolution, which binds identifiers to their declarations within nested scopes using mechanisms like symbol tables. Violations trigger errors, such as undeclared variables or type mismatches. This phase often integrates attribute grammars to propagate information across the tree. The front end culminates in an (AST), a condensed representation omitting extraneous details like parentheses, or a preliminary (IR) such as three-address code, where statements are simplified to forms like x=y+zx = y + z. For the declaration int x = 5;, yields tokens for "int", "x", "=", and "5"; syntax analysis constructs nodes for a variable declaration with assignment; and semantic analysis confirms integer type assignment, yielding an AST with a declaration node parenting an identifier "x" and integer literal 5. To support robust compilation, the front end incorporates error handling and recovery strategies during lexical, syntactic, and semantic phases. Upon detecting issues like unexpected tokens, parsers may skip to synchronization points (e.g., semicolons) or insert/deleted symbols locally to resume , minimizing cascading errors and enabling multi-error reporting in a single pass.

Middle End

The middle end of a compiler, often referred to as the optimizer, bridges the front end and back end by transforming the (IR) into a more efficient form through machine-independent optimizations. These transformations focus on improving program performance, such as reducing computational redundancy or eliminating unnecessary operations, without relying on details of the target . Core techniques include , which evaluates expressions with known constant values at , and dead code elimination, which removes code that has no effect on the program's observable behavior, such as unreachable statements or computations whose results are never used. Central to middle-end optimizations are and s, which provide the analytical framework for identifying optimization opportunities. A (CFG) models the program's execution paths, with nodes representing basic blocks—sequences of instructions without branches—and edges indicating possible control transfers, enabling precise modeling of program structure for analysis. operates on this CFG to track variable states across the program; for example, reaching definitions analysis computes the set of variable definitions that may reach each program point, while live variables analysis determines which variables retain value beyond a given point. These analyses support optimizations by revealing redundancies, such as unused definitions that can be eliminated. The basic data-flow equations for reaching definitions illustrate this process formally. For a node nn in the CFG: RDin(n)=ppred(n)RDout(p)\text{RD}_{\text{in}}(n) = \bigcup_{p \in \text{pred}(n)} \text{RD}_{\text{out}}(p) RDout(n)=gen(n)(RDin(n)kill(n))\text{RD}_{\text{out}}(n) = \text{gen}(n) \cup \left( \text{RD}_{\text{in}}(n) - \text{kill}(n) \right) Here, pred(n)\text{pred}(n) are the predecessors of nn, gen(n)\text{gen}(n) is the set of definitions generated within nn, and kill(n)\text{kill}(n) is the set of definitions invalidated by nn. To solve, initialize all RDin\text{RD}_{\text{in}} and RDout\text{RD}_{\text{out}} sets to the (for a forward starting conservatively), then propagate information iteratively in across the CFG or using worklist algorithms until a fixed point is reached, where no further changes occur due to the monotonic nature of the union and set operations. This fixed-point computation ensures all possible reaching definitions are captured efficiently. Middle-end processing involves multiple optimization passes, often organized in loops to apply transformations iteratively as new opportunities arise from prior changes. For instance, (LICM) identifies computations within a loop that yield the same result on every iteration—such as loop-independent expressions—and relocates them to a pre-loop position, minimizing redundant execution while preserving program semantics. These passes rely on prior analyses like (nodes that precede all paths to a given node) to ensure safe movement. In practice, compilers execute sequences of such passes, potentially repeating them to handle interdependencies. Frameworks like LLVM's pass manager facilitate modular implementation of these optimizations, allowing developers to define and schedule sequences of and transformation passes on the IR. The new pass manager in LLVM supports nested pipelines, enabling bottom-up optimization of call graphs and efficient reuse of results across passes, which enhances for large codebases. For example, the optimization of x = 2 + 3; to x = 5; via demonstrates a simple yet impactful transformation, replacing runtime with a direct assignment.

Back End

The back end of a compiler translates the optimized (IR) into target-specific , accounting for the hardware architecture's instruction set, registers, and memory model. This phase is highly dependent on the target platform, such as x86, , or , requiring architecture-specific rules to ensure efficient and correct execution. Unlike the machine-independent optimizations in prior phases, the back end focuses on low-level transformations to produce executable binaries or assembly code. Key components include instruction selection, register allocation, and code generation. Instruction selection maps operations in the IR, often represented as a directed acyclic graph (DAG), to concrete machine instructions using techniques like tree pattern matching. In this approach, the IR tree or DAG is covered by predefined patterns (tiles) corresponding to instruction alternatives, selecting the optimal match based on cost functions for latency or code size; for instance, the Bottom-Up Rewrite System (BURS) automates this by generating efficient parsers from a cost-augmented grammar. Register allocation assigns virtual registers from the IR to a limited set of physical machine registers, modeled as graph coloring on an interference graph where nodes represent live ranges and edges indicate overlapping lifetimes. If the graph is not colorable with the available registers, spilling to memory occurs. Code generation then emits assembly instructions, incorporating peephole optimizations to refine short sequences for better performance, such as combining multiple moves into a single instruction. Modern back ends, such as those in , support retargeting across multiple instruction set architectures (ISAs) through modular designs, allowing reuse of front and middle ends while customizing selection rules, scheduling, and emission for each target. For example, LLVM's TableGen tool defines instruction encodings and patterns, enabling back ends for over 20 ISAs including x86 and . The overall process proceeds from optimized IR to a DAG representation for selection, followed by to respect dependencies and hazards, , and final assembly emission into or binaries. A representative example is compiling a simple IR assignment like %result = add i32 %a, 5 on x86: instruction selection might choose add eax, 5 after loading %a into eax, with register allocation ensuring no conflicts via coloring. For register allocation, the greedy graph coloring algorithm approximates the chromatic number by iteratively assigning the smallest available color (register) to nodes ordered by decreasing degree, building the interference graph first where edges connect variables live simultaneously; this heuristic guarantees at most Δ(G)+1\Delta(G) + 1 colors, where Δ(G)\Delta(G) is the maximum degree, often sufficient for typical register counts like 16 on x86. Retargeting challenges arise from ISA differences, such as ARM's load-store architecture versus x86's register-memory operations, necessitating careful pattern matching to avoid inefficient code.

Correctness and Verification

Compiler correctness refers to the guarantee that the compiled preserves the semantics of the source program, ensuring equivalent observable behavior between the input and output across all possible executions. This semantic preservation is fundamental, as violations can lead to subtle bugs where the generated produces incorrect results or crashes unexpectedly. Optimizing compilers prioritize while aiming to maintain this equivalence, whereas verifying compilers incorporate proofs to mathematically confirm it. Verification techniques for compilers span empirical testing and . Testing approaches include unit tests for individual components, integration tests for interactions, and tools like Csmith, which generates complex programs to expose discrepancies via differential testing against multiple compilers. Csmith has uncovered hundreds of bugs in production compilers such as GCC and by producing inputs that trigger miscompilations. , in contrast, use theorem provers like Coq to establish semantic preservation; the project exemplifies this by verifying a subset of to assembly through a multi-pass , proving that optimizations do not alter program . can also validate properties in specific phases, though it scales poorly to full compilers. Challenges in verification arise from the undecidability of checking semantic equivalence between arbitrary source and target programs, stemming from the and requiring approximations or restrictions to subsets of languages. Phase errors, where transformations in one stage invalidate assumptions in another, further complicate guarantees. Examples include efforts in systems like , where the Mi-Cho-Coq framework uses Coq to certify the functional correctness of Michelson smart contracts against their compiled execution semantics. Bug-finding tools like Csmith complement this by targeting undefined behaviors that evade formal proofs. Metrics for assessing verification effectiveness include test coverage rates, often aiming for over 90% coverage in regression suites to catch regressions, and self-hosting, where the compiler recompiles itself across multiple generations to validate consistency in output binaries. Self-hosting provides a practical , as discrepancies indicate bugs, though it does not prove full correctness. In recent developments, AI techniques such as large language models have been applied to detect missed optimizations and bugs in ; one approach identified 24 issues in production compilers by analyzing code for optimization opportunities. These methods, including IR-based crossover testing in FLUX, enhance traditional by generating targeted inputs that reveal subtle miscompilations. As of 2025, ongoing efforts include LLM-assisted tools like WhiteFox, which have detected additional bugs in .

Types of Compilers

Traditional Batch Compilers

Traditional batch compilers are systems that translate an entire program's source code into machine-executable form ahead of time, typically in an offline batch process without user interaction during compilation. This approach involves static analysis of the full codebase to generate optimized object code or binaries before program execution begins. Key characteristics include ahead-of-time (AOT) compilation, where the compiler performs all necessary analysis and optimization in a single or multi-pass run on the complete source, producing standalone executables or intermediate representations suitable for direct execution. Prominent examples are the GNU Compiler Collection (GCC), which compiles languages like C and C++ into native machine code, and the Java compiler (javac), which translates Java source files into platform-independent bytecode for the Java Virtual Machine. These compilers operate in batch mode, processing multiple input files as a unit to resolve dependencies and generate outputs like object files or class files. The typical workflow begins with source code input, which undergoes preprocessing (e.g., macro expansion in C), lexical analysis, parsing, semantic analysis, optimization, and code generation to produce object files containing relocatable machine code. These object files are then linked with libraries and resolved for external references to create a final executable binary, ready for immediate execution without further compilation. This process is managed via command-line invocation or build tools, ensuring reproducibility in non-interactive environments. Advantages of traditional batch compilers include the generation of highly optimized binaries that exhibit superior runtime performance and minimal startup latency, as all compilation overhead occurs upfront rather than during execution. They also eliminate runtime compilation costs, making them ideal for resource-constrained environments. However, disadvantages encompass extended compilation times for large codebases, which can prolong development iterations, and a lack of adaptability to runtime conditions since optimizations are fixed at . Common use cases for batch compilers include , where native code efficiency is paramount, such as developing operating system kernels or device drivers with GCC. They are also prevalent in development, where predictable performance and small footprint are essential, as AOT compilation avoids the overhead of just-in-time mechanisms in constrained hardware. The evolution of batch compilers traces back to the original compiler developed for the in the mid-1950s by and his team at , which introduced high-level language translation and optimization in a batch-oriented manner to simplify scientific . Over decades, this model advanced from single-language tools like early implementations to polyglot systems like GCC, initiated in 1987, which now supports over a dozen languages and targets diverse architectures while retaining the core batch compilation paradigm.

Just-In-Time (JIT) Compilers

Just-in-time (JIT) compilers translate or (IR) into native during program execution, enabling dynamic optimization based on runtime behavior. This contrasts with by deferring code generation until necessary, often incorporating profiling to target frequently executed code paths for improved efficiency. The JIT process typically starts with interpretation or baseline compilation for rapid startup, then progresses to full optimization as usage patterns emerge. The core mechanism of JIT compilers involves monitoring execution to identify "hot" code—regions executed often—and compiling them on-the-fly with aggressive optimizations unavailable at static compile time. For instance, tracing JITs record sequences of operations as linear traces, applying transformations like loop unrolling and dead code elimination to these traces for streamlined execution. In the V8 JavaScript engine, this pipeline includes an interpreter (Ignition) for bytecode generation, a baseline JIT (Sparkplug) for quick machine code, and advanced optimizing JITs (Maglev and TurboFan) that use static single assignment (SSA) form and runtime feedback on types and shapes to specialize code, with deoptimization handling failures in assumptions. Such approaches allow JITs to outperform pure interpretation while adapting to actual workloads. Prominent examples of JIT compilers include the HotSpot JVM, released by in 1999, which uses a tiered system: the C1 client compiler for fast, lightweight optimization and the C2 server compiler for thorough analysis, significantly reducing interpretation overhead in Java applications. The .NET (CLR) applies JIT to convert (CIL) into platform-specific native code at method invocation, supporting managed execution across environments. , a trace-based JIT for the scripting language developed by Mike Pall since 2005, combines an assembler-based interpreter with advanced SSA optimizations to achieve near-native speeds in dynamic scenarios like games and embedded systems, while maintaining compatibility with Lua 5.1. JIT compilers offer advantages in runtime adaptability, such as tailoring optimizations to observed branch probabilities and data types, which can yield higher peak performance than static methods for long-running or polymorphic code. This enables techniques like speculative inlining based on profile data, boosting throughput in applications with unpredictable paths. However, drawbacks include initial startup delays from on-the-fly compilation, potentially increasing launch times by seconds in large programs, and elevated memory consumption from caching multiple code versions and profiles. Essential techniques in JIT compilers encompass adaptive optimization, which iteratively recompiles code using evolving runtime profiles to refine decisions like , and deoptimization, a mechanism to discard optimized code and resume from a baseline interpreter if assumptions (e.g., constant values or type stability) invalidate. Deoptimization relies on inserted anchors—synchronization points that capture stack state—and assume checks to trigger on-stack replacement, preserving correctness in speculative scenarios like constant propagation or inlining. These methods, formalized in verified JIT frameworks, balance aggressive with reliable fallback, as seen in systems handling dynamic languages. In the 2020s, developments for in browsers have advanced runtime compilation support, integrating profiling, on-stack replacement, and recompilation to generate specialized native code efficiently while adhering to sandbox constraints. (PGO) has also matured in JIT environments, with .NET 9 expanding runtime profiling with Dynamic PGO to optimize more code patterns, such as type checks and casts.

Specialized Compilers

Specialized compilers are tailored for particular domains, architectures, or development needs, extending beyond general-purpose tools to address unique constraints such as target platform differences or specialized hardware. Cross-compilers, for instance, run on a host platform but produce executable code for a different target platform, enabling development for embedded systems or mobile devices without requiring the target hardware. The GNU Compiler Collection (GCC) supports cross-compilation through target-specific configurations, allowing developers to specify the host and target architectures via command-line options like --target=arm-linux-gnueabihf. A prominent example is the Android Native Development Kit (NDK), which uses cross-compilers to build C/C++ code for ARM-based Android devices from x86 hosts. Source-to-source compilers, also known as transpilers, translate code from one high-level language to another, often to ensure compatibility with existing environments or to leverage specific features. Babel transpiles modern (ES6+) to older versions (ES5) for broader browser support, using plugins to handle syntax transformations without altering semantics. Similarly, the TypeScript compiler (tsc) transpiles TypeScript—a superset of with static typing—to plain , enabling type-checked development while producing runnable code for web and environments. Domain-specific compilers optimize for niche hardware or computational paradigms, incorporating specialized optimizations and backends. For graphics processing units (GPUs), NVIDIA's NVCC (CUDA Compiler Driver) compiles C++ code to parallel executable kernels for NVIDIA GPUs, handling device-specific instructions and through a phased compilation process that separates host and device code. In , Microsoft's Q# compiler translates quantum algorithms written in Q#—a —to executable operations for quantum simulators or hardware like Azure Quantum, supporting operations like entanglement and measurement. For , the XLA (Accelerated Linear Algebra) compiler optimizes and JAX models by fusing operations into efficient kernels for CPUs, GPUs, or TPUs, reducing execution time through graph-level transformations without requiring code changes. Other specialized variants include incremental compilers, which recompile only modified program portions to accelerate iterative development, and bootstrapping compilers, which compile their own source code to enable self-hosting. Rust's compiler employs incremental compilation by tracking dependencies and caching intermediate results, allowing rapid rebuilds after small changes. Bootstrapping, as in GCC's build process, starts with a minimal compiler written to produce a full self-compiling version. Modern examples highlight practical applications: facilitates cross-compilation via rustup's target addition (e.g., rustup target add aarch64-unknown-linux-gnu), enabling seamless builds for multiple architectures including for embedded systems. Zig serves as a drop-in C/C++ compiler with built-in C interop, using @cImport to seamlessly link C headers and libraries while supporting cross-compilation out-of-the-box for targets like Windows from hosts. Emerging trends as of 2025 focus on embedded AI compilers for edge devices, optimizing neural networks for low-power microcontrollers with techniques like quantization and operator fusion. Compiler-aware hardware designs integrate AI-specific instructions into edge processors, enabling efficient on resource-constrained devices. These advancements support autonomous in IoT without cloud reliance.

Advanced Topics

Optimization Techniques

Optimization techniques in compilers transform intermediate representations to improve program performance metrics such as execution time and code size while preserving semantics. These techniques are broadly classified into local optimizations, which operate on small code segments; global optimizations, which span basic blocks within a function; and interprocedural optimizations, which analyze across function boundaries to enable more aggressive transformations. Local optimizations, such as , examine short sequences of instructions—typically 1 to 4 machine words—and replace them with equivalent but more efficient code, such as eliminating redundant loads or jumps. This approach, first systematically explored in early compiler designs, can reduce instruction count in machine-specific code generation phases. Global optimizations leverage across an entire function to eliminate , propagate constants, or reorder computations for better register utilization. Interprocedural analysis extends this by constructing call graphs and summary information, allowing optimizations like inlining or alias elimination that cross function boundaries, improving execution speed in large programs through reduced function call overhead. Loop optimizations form a critical subset, targeting iterative structures common in numerical and data-processing code. replicates the loop body to reduce overhead from increment and test operations; for instance, the following C loop:

for (int i = 0; i < 4; i++) { a[i] = b[i] + c; }

for (int i = 0; i < 4; i++) { a[i] = b[i] + c; }

may be transformed into:

a[0] = b[0] + c; a[1] = b[1] + c; a[2] = b[2] + c; a[3] = b[3] + c;

a[0] = b[0] + c; a[1] = b[1] + c; a[2] = b[2] + c; a[3] = b[3] + c;

This eliminates branch instructions but increases code size, trading compactness for potential speedups on small fixed iterations by enabling better . Loop fusion merges adjacent loops with compatible iteration spaces to enhance data locality and cache efficiency, reducing accesses in array-based computations and improving performance in memory-bound workloads. Advanced techniques build on these foundations to exploit hardware features. Vectorization, particularly using SIMD instructions, automatically converts scalar loops into vector operations that process multiple elements in parallel, as implemented in compilers like Intel's ICC, which can achieve significant speedups on supported architectures for aligned array computations. Auto-parallelization identifies independent loop iterations and inserts threading directives, such as pragmas, to distribute work across cores without manual intervention, though effectiveness depends on loop dependence analysis and can deliver performance gains on multicore systems for data-parallel kernels. (PGO) uses runtime execution profiles from instrumented binaries to inform decisions like branch prediction or hot-path inlining, resulting in 5-15% average speedups across diverse workloads by aligning optimizations with actual usage patterns. Recent integrations of enhance traditional heuristics. In LLVM's MLGO framework, models, trained on optimization sequences from prior compilations, guide phase ordering to select pass orders that minimize execution time, demonstrating 1.3-2.5% geomean on the SPEC CPU2017 benchmark suite over default heuristics. Neural networks applied to , such as graph neural networks modeling interference graphs, learn spilling and coloring decisions to reduce live ranges and evictions. Performance is quantified using , defined as: Speedup=ToriginalToptimized\text{Speedup} = \frac{T_{\text{original}}}{T_{\text{optimized}}} where TT denotes execution time; values above 1 indicate improvement, with SPEC benchmarks like CPU2006 or providing standardized metrics for compiler optimizations. Amdahl's law bounds parallel optimization potential: S=1f+(1f)pS = \frac{1}{f + \frac{(1-f)}{p}} where ff is the fraction of sequential code (0 ≤ ff ≤ 1) and pp is the number of processors. This formula derives from total execution time T=f+1fpT = f + \frac{1-f}{p} (normalizing original time to 1), highlighting that speedup approaches 1/f1/f as pp \to \infty, emphasizing the need to minimize sequential portions for scalable gains; for example, with f=0.05f=0.05 and p=16p=16, S9.1S \approx 9.1. As of 2025, further advancements include large language models (LLMs) for compiler optimization, enabling self-improving systems that iteratively refine heuristics based on performance feedback, and techniques like Iterative BC-Max for binary size reduction. These approaches show promise in achieving additional performance gains, such as up to 1.75x speedups in some cases using models like CodeLlama.

Parallel and Distributed Compilation

Parallel compilation techniques enable compilers to leverage multi-core processors by executing independent phases or tasks concurrently, significantly reducing overall build times for large codebases. For instance, the compiler driver supports parallel job execution through a built-in scheduler that compiles independent translation units simultaneously, allowing developers to specify the number of jobs via build system flags like -j in Make or . This approach is particularly effective for phases such as , , and code generation, where dependencies between modules can be modeled to avoid conflicts. Build systems like employ dependency graphs to represent task interdependencies, enabling efficient parallel scheduling by default based on the system's CPU count, which ensures maximal utilization without manual intervention. Distributed compilation extends parallelism across networked machines, distributing workload to remote hosts for even greater scalability in team or CI environments. Tools like Bazel facilitate this through remote caching, where build actions—defined by inputs, outputs, and commands—are stored on a shared server; subsequent builds retrieve cached artifacts via HTTP, avoiding redundant computations and ensuring reproducible results across machines. Complementing this, ccache acts as a local or remote compiler cache for incremental builds, hashing source files and compiler arguments to reuse precompiled objects, which accelerates recompilations after minor changes by bypassing full preprocessing and compilation steps. In networked setups, DistCC distributes preprocessed C/C++ source code to volunteer machines running a daemon, compiling in parallel without requiring shared filesystems or identical environments, thus scaling builds nearly linearly for small clusters. Implementing parallel and distributed compilation introduces challenges, including race conditions in shared data structures like symbol tables, where concurrent accesses during semantic or linking can lead to inconsistent results if not synchronized via locks or atomic operations. Load balancing poses another issue in distributed systems, as uneven task distribution across heterogeneous nodes—due to varying CPU speeds or network latency—can cause idle resources and prolonged build times, necessitating dynamic scheduling algorithms to monitor and reassign workloads. Prominent tools include the Ninja build system, which optimizes parallel execution through its lightweight dependency resolution, achieving sub-second incremental builds for projects with tens of thousands of files like Chromium. In LLVM, parallel code generation during Link-Time Optimization (LTO) distributes backend tasks across threads, speeding up the transformation of intermediate representation (IR) to machine code for large modules. These techniques yield substantial benefits, such as reducing Chromium's full rebuild times from hours to under an hour on multi-core systems through jumbo builds that minimize inter-module dependencies, enabling higher parallelism.

References

Add your contribution
Related Hubs
Contribute something
User Avatar
No comments yet.