Optimizing compiler
An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory usage, storage size, and power consumption.[1] Optimization is generally implemented as a sequence of optimizing transformations, a.k.a. compiler optimizations – algorithms that transform code to produce semantically equivalent code optimized for some aspect.
Optimization is limited by a number of factors. Theoretical analysis indicates that some optimization problems are NP-complete, or even undecidable.[2] Also, producing perfectly optimal code is not possible since optimizing for one aspect often degrades performance for another (see: superoptimization). Optimization is a collection of heuristic methods for improving resource usage in typical programs.[3]: 585
Categorization
Local vs. global scope
Scope describes how much of the input code is considered when applying optimizations.
Local scope optimizations use information local to a basic block.[4] Since basic blocks contain no control flow statements, these optimizations require minimal analysis, reducing time and storage requirements. However, no information is retained across jumps.
Global scope optimizations, also known as intra-procedural optimizations, operate on individual functions.[4] This gives them more information to work with but often makes expensive computations necessary. Worst-case assumptions need to be made when function calls occur or global variables are accessed because little information about them is available.
Peephole optimization
Peephole optimizations are usually performed late in the compilation process after machine code has been generated. This optimization examines a few adjacent instructions (similar to "looking through a peephole" at the code) to see whether they can be replaced by a single instruction or a shorter sequence of instructions.[3]: 554 For instance, a multiplication of a value by two might be more efficiently executed by left-shifting the value or by adding the value to itself (this example is also an instance of strength reduction).
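A minimal sketch of the strength-reduction case just mentioned, written at the C source level for readability (an actual peephole pass rewrites the generated instructions rather than the source):

```c
#include <stdint.h>

/* Source form: a multiplication by two. */
uint32_t twice(uint32_t x) {
    return x * 2;
}

/* What a strength-reducing peephole rule may effectively produce:
   a left shift (or x + x), which is typically at least as fast. */
uint32_t twice_shifted(uint32_t x) {
    return x << 1;
}
```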
Inter-procedural optimization
Interprocedural optimizations analyze all of a program's source code. The more information available, the more effective the optimizations can be. The information can be used for various optimizations, including function inlining, where a call to a function is replaced by a copy of the function body.
Link-time optimization
Link-time optimization (LTO), or whole-program optimization, is a more general class of interprocedural optimization. During LTO, the compiler has visibility across translation units, which allows it to perform more aggressive optimizations such as cross-module inlining and devirtualization.
Machine and object code optimization
Machine code optimization involves using an object code optimizer to analyze the program after all machine code has been linked. Techniques such as macro compression, which conserves space by condensing common instruction sequences, become more effective when the entire executable task image is available for analysis.[5]
Language-independent vs. language-dependent
Most high-level programming languages share common programming constructs and abstractions, such as branching constructs (if, switch), looping constructs (for, while), and encapsulation constructs (structures, objects). Thus, similar optimization techniques can be used across languages. However, certain language features make some optimizations difficult. For instance, pointers in C and C++ make array optimization difficult; see alias analysis. Even so, compilers for languages such as PL/I, which also supports pointers, implement optimizations for arrays. Conversely, some language features make certain optimizations easier. For example, in some languages, functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can infer that the function's result only needs to be computed once. In languages where functions are allowed to have side effects, the compiler can restrict such optimization to functions that it can determine have no side effects.
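As a hedged sketch of the side-effect point above: GCC and Clang provide the non-standard __attribute__((const)) (and the weaker pure) so a programmer can assert that a function's result depends only on its arguments; the function dist below is hypothetical, not part of any real API.

```c
/* dist is a hypothetical pure function; the GCC/Clang 'const' attribute
   asserts its result depends only on its argument, so repeated calls with
   the same argument may be collapsed into a single evaluation. */
__attribute__((const)) static int dist(int x) {
    return x < 0 ? -x : x;
}

int sum_of_three(int x) {
    /* With the attribute (or in a language whose functions cannot have side
       effects), the compiler may evaluate dist(x) once and reuse the result. */
    return dist(x) + dist(x) + dist(x);
}
```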
Machine-independent vs. machine-dependent
Many optimizations that operate on abstract programming concepts (loops, objects, structures) are independent of the machine targeted by the compiler, but many of the most effective optimizations are those that best exploit special features of the target platform. Examples are instructions that do several things at once, such as decrement register and branch if not zero.
The following is an instance of a local machine-dependent optimization. To set a register to 0, the obvious way is to use the constant '0' in an instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself or subtract it from itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal "immediate operand register"; the same applies on IBM System/360 and successors for the subtract variant.[6] A potential problem with this is that XOR or subtract may introduce a data dependency on the previous value of the register, causing a pipeline stall, which occurs when the processor must delay execution of an instruction because it depends on the result of a previous instruction. However, processors often treat the XOR of a register with itself or the subtract of a register from itself as a special case that does not cause stalls.
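A hedged illustration of the register-zeroing example: the C below merely returns zero, and the instructions in the comment are what mainstream x86-64 compilers commonly emit, not something the source code dictates.

```c
/* Returning the integer constant 0: x86-64 compilers commonly materialize it as
       xor  eax, eax
   rather than
       mov  eax, 0
   because the XOR form is shorter and is special-cased by the processor so it
   does not depend on the register's previous value. */
int zero(void) {
    return 0;
}
```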
Factors affecting optimization
- Target machine
- Whether particular optimizations can and should be applied may depend on the characteristics of the target machine. Some compilers such as GCC and Clang parameterize machine-dependent factors so that they can be used to optimize for different machines.[7]
- Target CPU architecture
- Number of registers: Registers can be used to optimize for performance. Local variables can be stored in registers instead of the stack. Temporary/intermediate results can be accessed in registers instead of slower memory.
- RISC vs. CISC: CISC instruction sets often have variable instruction lengths,[8] often have a larger number of possible instructions that can be used, and each instruction could take differing amounts of time. RISC instruction sets attempt to limit the variability in each of these: instructions are usually of constant length (with few exceptions), there are usually fewer combinations of registers and memory operations, and the instruction issue rate (the number of instructions completed per time period, usually an integer multiple of the clock cycle) is usually constant in cases where memory latency is not a factor. There may be several ways of carrying out a certain task, with CISC usually offering more alternatives than RISC. Compilers have to know the relative costs among the various instructions and choose the best instruction sequence (see instruction selection).
- Pipelines: A pipeline is a CPU broken up into an assembly line. It allows the use of parts of the CPU for different instructions by breaking up the execution of instructions into various stages: instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One instruction could be in the register store stage, while another could be in the register fetch stage. Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to pipeline stalls: where the CPU wastes cycles waiting for a conflict to resolve. Compilers can schedule, or reorder, instructions so that pipeline stalls occur less frequently.
- Number of functional units: Some CPUs have several ALUs and FPUs that allow them to execute multiple instructions simultaneously. There may be restrictions on which instructions can pair with which other instructions ("pairing" is the simultaneous execution of two or more instructions), and which functional unit can execute which instruction. They also have issues similar to pipeline conflicts. Instructions can be scheduled so that the functional units are fully loaded.
- Machine architecture
- CPU cache size and type (direct mapped, 2-/4-/8-/16-way associative, fully associative): Techniques such as inline expansion and loop unrolling may increase the size of the generated code and reduce code locality. The program may slow down drastically if a highly used section of code (like inner loops in various algorithms) no longer fits in the cache as a result of optimizations that increase code size. Also, caches that are not fully associative have higher chances of cache collisions even in an unfilled cache.
- Cache/memory transfer rates: These give the compiler an indication of the penalty for cache misses. This is used mainly in specialized applications.
- Intended use
- Debugging: During development, optimizations are often disabled to speed compilation or to make the executable code easier to debug. Optimizing transformations, particularly those that reorder code, can make it difficult to relate the executable code to the source code.
- General-purpose use: Prepackaged software is often expected to run on a variety of machines that may share the same instruction set but have different performance characteristics. The code may not be optimized to any particular machine or may be tuned to work best on the most popular machine while working less optimally on others.
- Special-purpose use: If the software is compiled for machines with uniform characteristics, then the compiler can heavily optimize the generated code for those machines.
- Notable cases include code designed for parallel and vector processors, for which special parallelizing compilers are used.
- Firmware for an embedded system can be optimized for the target CPU and memory. System cost or reliability may be more important than the code speed. For example, compilers for embedded software usually offer options that reduce code size at the expense of speed. The code's timing may need to be predictable, rather than as fast as possible, so code caching might be disabled, along with compiler optimizations that require it.
Common themes
Optimization includes the following, sometimes conflicting themes.
- Optimize the common case
- The common case may have unique properties that allow a fast path at the expense of a slow path. If the fast path is taken more often, the result is better overall performance.
- Avoid redundancy
- Reuse results that are already computed and store them for later use, instead of recomputing them.
- Less code
- Remove unnecessary computations and intermediate values. Less work for the CPU, cache, and memory usually results in faster execution. Alternatively, in embedded systems, less code brings a lower product cost.
- Fewer jumps by using straight line code, also called branch-free code
- Less complicated code. Jumps (conditional or unconditional branches) interfere with the prefetching of instructions, thus slowing down code. Using inlining or loop unrolling can reduce branching, at the cost of increasing binary file size by the length of the repeated code. This tends to merge several basic blocks into one.
- Locality
- Code and data that are accessed closely together in time should be placed close together in memory to increase spatial locality of reference.
- Exploit the memory hierarchy
- Accesses to memory are increasingly more expensive for each level of the memory hierarchy, so place the most commonly used items in registers first, then caches, then main memory, before going to disk.
- Parallelize
- Reorder operations to allow multiple computations to happen in parallel, either at the instruction, memory, or thread level.
- More precise information is better
- The more precise the information the compiler has, the better it can employ any or all of these optimization techniques.
- Runtime metrics can help
- Information gathered during a test run can be used in profile-guided optimization. Information gathered at runtime, ideally with minimal overhead, can be used by a JIT compiler to dynamically improve optimization.
- Strength reduction
- Replace complex, difficult, or expensive operations with simpler ones. For example, replacing division by a constant with multiplication by its reciprocal, or using induction variable analysis to replace multiplication by a loop index with addition.
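A small sketch of the strength-reduction theme above, assuming a power-of-two divisor so that both rewrites are exact:

```c
#include <stdint.h>

/* Division by a power-of-two constant: for unsigned operands the compiler
   may emit a shift (x >> 3) instead of a divide instruction. */
uint32_t eighth(uint32_t x) {
    return x / 8;
}

/* Floating-point division by a constant: multiplying by the reciprocal
   (x * 0.125) is exactly equivalent here because 1/8 is representable. */
double eighth_f(double x) {
    return x / 8.0;
}
```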
Specific techniques
Loop optimizations
Loop optimization acts on the statements that make up a loop, such as a for loop; an example is loop-invariant code motion. Loop optimizations can have a significant impact because many programs spend a large percentage of their time inside loops.[3]: 596
Some optimization techniques primarily designed to operate on loops include:
- Induction variable analysis
- Roughly, if a variable in a loop is a simple linear function of the index variable, such as j := 4*i + 1, it can be updated appropriately each time the loop variable is changed. This is a strength reduction and also may allow the index variable's definitions to become dead code.[3]: 596–598 This information is also useful for bounds-checking elimination and dependence analysis, among other things.
- Loop fission or loop distribution
- Loop fission attempts to break a loop into multiple loops over the same index range with each new loop taking only a part of the original loop's body. This can improve locality of reference to both the data being accessed within the loop and the code in the loop's body.
- Loop fusion or loop combining or loop ramming or loop jamming
- Another technique that attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times regardless of whether that number is known at compile time, their bodies can be combined as long as they do not refer to each other's data.
- Loop inversion
- This technique changes a standard while loop into a do/while (also known as repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if guard can be skipped.
- Loop interchange
- These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve the locality of reference, depending on the array's layout.
- Loop-invariant code motion
- If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins.[3]: 596 This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted outside the loop. A C sketch appears after this list.
- Loop nest optimization
- Some pervasive algorithms such as matrix multiplication have very poor cache behavior and excessive memory accesses. Loop nest optimization increases the number of cache hits by operating over small blocks and by using a loop interchange.
- Loop reversal
- Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization that can help eliminate dependencies and thus enable other optimizations. Furthermore, on some architectures, loop reversal contributes to smaller code: when the loop index is being decremented, the condition that must be met for the program to exit the loop is a comparison with zero, which is often a special, parameter-less instruction, unlike a comparison with an arbitrary number, which needs that number as an operand. The bytes needed to store that operand are therefore saved by loop reversal. Additionally, if the comparison value exceeds the platform's word size, then in standard loop order multiple instructions would be needed to evaluate the comparison, which is not the case with loop reversal.
- Loop unrolling
- Unrolling duplicates the body of the loop multiple times, to decrease the number of times the loop condition is tested and the number of jumps; tests and jumps can hurt performance by impairing the instruction pipeline. A "fewer jumps" optimization. Completely unrolling a loop eliminates all overhead, but requires that the number of iterations be known at compile time.
- Loop splitting
- Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops that have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
- Loop unswitching
- Unswitching moves a conditional from inside a loop to outside the loop by duplicating the loop's body inside each of the if and else clauses of the conditional.
- Software pipelining
- The loop is restructured in such a way that work done in an iteration is split into several parts and done over several iterations. In a tight loop, this technique hides the latency between loading and using values.
- Automatic parallelization
- A loop is converted into multi-threaded or vectorized (or even both) code to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine, including multi-core machines.
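The following sketch illustrates loop-invariant code motion from the list above; the hoisted form is shown as C source for readability, approximating what a compiler produces internally:

```c
/* Before: m * n is recomputed on every iteration although neither operand
   changes inside the loop. */
void scale(int *a, int len, int m, int n) {
    for (int i = 0; i < len; i++)
        a[i] = a[i] * (m * n);
}

/* After loop-invariant code motion: the invariant product is hoisted and
   computed once before the loop. Hoisting is safe here because the
   expression has no side effects. */
void scale_hoisted(int *a, int len, int m, int n) {
    int factor = m * n;
    for (int i = 0; i < len; i++)
        a[i] = a[i] * factor;
}
```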
Prescient store optimizations
Prescient store optimizations allow store operations to occur earlier than would otherwise be permitted in the context of threads and locks. The process needs some way of knowing ahead of time what value will be stored by the assignment that it should have followed. The purpose of this relaxation is to allow compiler optimization to perform certain kinds of code rearrangements that preserve the semantics of properly synchronized programs.[9]
Data-flow optimizations
Data-flow optimizations, based on data-flow analysis, primarily depend on how certain properties of data are propagated by control edges in the control-flow graph. Some of these include:
- Common subexpression elimination
- In the expression (a + b) - (a + b)/4, "common subexpression" refers to the duplicated (a + b). Compilers implementing this technique realize that (a + b) will not change, and so calculate its value only once.[3]: 592–594 A C sketch combining this with constant folding appears after this list.
- Constant folding and propagation
- Replacing expressions consisting of constants (e.g., 3 + 5) with their final value (8) at compile time, rather than doing the calculation at run time.[10] Used in most modern languages.
- Induction variable recognition and elimination
- See discussion above about induction variable analysis.
- Alias classification and pointer analysis
- In the presence of pointers, it is difficult to make any optimizations at all, since potentially any variable can have been changed when a memory location is assigned to. By specifying which pointers can alias which variables, unrelated pointers can be ignored.
- Dead-store elimination
- Removal of assignments to variables that are not subsequently read, either because the lifetime of the variable ends or because of a subsequent assignment that will overwrite the first value.
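A sketch combining common subexpression elimination and constant folding as described in the items above; the second function approximates what the optimizer effectively produces:

```c
/* Before: (a + b) appears twice and 3 + 5 is a compile-time constant. */
int before(int a, int b) {
    int x = (a + b) - (a + b) / 4;
    int y = 3 + 5;
    return x + y;
}

/* After: the common subexpression is computed once and the constant folded. */
int after(int a, int b) {
    int t = a + b;
    int x = t - t / 4;
    return x + 8;
}
```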
SSA-based optimizations
These optimizations are intended to be done after transforming the program into a special form called static single assignment (SSA), in which every variable is assigned in only one place. Although some of these optimizations function without SSA, they are most effective with it. Many optimizations listed in other sections also benefit with no special changes, such as register allocation.
- Global value numbering
- GVN eliminates redundancy by constructing a value graph of the program, and then determining which values are computed by equivalent expressions. GVN can identify some redundancy that common subexpression elimination cannot, and vice versa.
- Sparse conditional constant propagation
- Combines constant propagation, constant folding, and dead-code elimination, and improves upon what is possible by running them separately.[11][12] This optimization symbolically executes the program, simultaneously propagating constant values and eliminating portions of the control-flow graph that this makes unreachable.
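A sketch of sparse conditional constant propagation: once the constant propagates into the comparison below, the condition is provably true, the else branch becomes unreachable, and the function collapses.

```c
/* Before: after SSA conversion, x has a single constant definition, so the
   condition is provably true and the else branch is unreachable. */
int f(void) {
    int x = 4;
    int y;
    if (x > 3)
        y = 1;
    else
        y = 2;   /* unreachable once x is known to be 4 */
    return y;
}

/* After sparse conditional constant propagation plus dead-code elimination,
   the function conceptually reduces to: */
int f_optimized(void) {
    return 1;
}
```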
Code generator optimizations
- Register allocation
- The most frequently used variables should be kept in processor registers for the fastest access. To find which variables to put in registers, an interference graph is created. Each variable is a vertex, and when two variables are used at the same time (have intersecting live ranges) there is an edge between them. This graph is colored using, for example, Chaitin's algorithm with the same number of colors as there are registers. If the coloring fails, one variable is "spilled" to memory and the coloring is retried.
- Instruction selection
- Most architectures, particularly CISC architectures and those with many addressing modes, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to choose, overall, which instructions implement the operators in the low-level intermediate representation. For example, on many processors in the 68000 family and the x86 architecture, complex addressing modes can be used in statements like lea 25(a1,d5*4), a0, allowing a single instruction to perform a significant amount of arithmetic with less storage.
- Instruction scheduling
- Instruction scheduling is an important optimization for modern pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.
- Rematerialization
- Rematerialization recalculates a value instead of loading it from memory, eliminating an access to memory. This is performed in tandem with register allocation to avoid spills.
- Code factoring
- If several sequences of code are identical, or can be parameterized or reordered to be identical, they can be replaced with calls to a shared subroutine. This can often share code for subroutine set-up and sometimes tail-recursion.[13]
- Trampolines
- Many[citation needed] CPUs have smaller subroutine call instructions to access low memory. A compiler can save space by using these small calls in the main body of code. Jump instructions in low memory can access the routines at any address. This multiplies space savings from code factoring.[13]
- Reordering computations
- Based on integer linear programming, restructuring compilers enhance data locality and expose more parallelism by reordering computations. Space-optimizing compilers may reorder code to lengthen sequences that can be factored into subroutines.
Functional language optimizations
Although many of these also apply to non-functional languages, they either originate in or are particularly critical in functional languages such as Lisp and ML.
- Tail-call optimization
- A function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. Tail-recursive algorithms can be converted to iteration through a process called tail-recursion elimination or tail-call optimization; a C sketch appears after this list.
- Deforestation (data structure fusion)
- In languages where it is common for a sequence of transformations to be applied to a list, deforestation attempts to remove the construction of intermediate data structures.
- Partial evaluation
- Computations that produce the same output regardless of the dynamic input at runtime can be evaluated at compile time.
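A sketch of tail-call optimization on a tail-recursive accumulator: the recursive form can be rewritten by the compiler into the loop form, reusing a single stack frame instead of growing the stack (the factorial here is only an illustration):

```c
/* Tail-recursive: the recursive call is the last action performed, so no
   work remains in the caller's frame after it returns. */
unsigned long fact_acc(unsigned long n, unsigned long acc) {
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, acc * n);
}

/* The iterative form a compiler can produce via tail-call elimination. */
unsigned long fact_iter(unsigned long n, unsigned long acc) {
    while (n > 1) {
        acc *= n;
        n -= 1;
    }
    return acc;
}
```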
Other optimizations
- Bounds-checking elimination
- Many languages, such as Java, enforce bounds checking of all array accesses. This is a severe performance bottleneck on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds checking in many situations where it can determine that the index must fall within valid bounds; for example, if it is a simple loop variable.
- Branch-offset optimization (machine dependent)
- Choose the shortest branch displacement that reaches the target.
- Code-block reordering
- Code-block reordering alters the order of the basic blocks in a program to reduce conditional branches and improve the locality of reference.
- Dead-code elimination
- Removes instructions that will not affect the behaviour of the program, for example, definitions that have no uses, called dead code. This reduces code size and eliminates unnecessary computation.
- Factoring out of invariants (loop invariants)
- If an expression is carried out both when a condition is met and is not met, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions (e.g., the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter if they're executed many times or just once. This is also known as total redundancy elimination. A similar but more powerful optimization is partial-redundancy elimination (PRE).
- Inline expansion or macro expansion
- When some code invokes a procedure, it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it. This saves the overhead related to procedure calls, as well as providing an opportunity for many different parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures. This is a "fewer jumps" optimization. The statements of imperative programming languages are also an example of such an optimization. Although statements could be implemented with function calls they are almost always implemented with code inlining.
- Jump threading
- In this optimization, consecutive conditional jumps predicated entirely or partially on the same condition are merged.
- E.g., if (c) { foo; } if (c) { bar; } to if (c) { foo; bar; },
- and if (c) { foo; } if (!c) { bar; } to if (c) { foo; } else { bar; }.
- Macro compression
- A space optimization that recognizes common sequences of code, creates subprograms ("code macros") that contain the common code, and replaces the occurrences of the common code sequences with calls to the corresponding subprograms.[5] This is most effectively done as a machine code optimization, when all the code is present. The technique was first used to conserve space in an interpretive byte stream used in an implementation of Macro Spitbol on microcomputers.[14] The problem of determining an optimal set of macros that minimizes the space required by a given code segment is known to be NP-complete,[5] but efficient heuristics attain near-optimal results.[15]
- Reduction of cache collisions
- (e.g., by disrupting alignment within a page)
- Stack-height reduction
- Rearrange an expression tree to minimize resources needed for expression evaluation.[clarification needed]
- Test reordering
- If we have two tests that are the condition for something, we can first deal with the simpler tests (e.g., comparing a variable to something) and only then with the complex tests (e.g., those that require a function call). This technique complements lazy evaluation, but can be used only when the tests are not dependent on one another. Short-circuiting semantics can make this difficult.
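A sketch of test reordering under the assumptions stated in the comments; the type and function names (record, WANTED, deep_validate) are invented for illustration:

```c
enum kind { WANTED, OTHER };
struct record { enum kind kind; int payload; };

int deep_validate(const struct record *r);   /* expensive: assume heavy work */

/* Before: the expensive validation runs on every record. */
int passes(const struct record *r) {
    if (deep_validate(r)) {
        if (r->kind == WANTED)
            return 1;
    }
    return 0;
}

/* After test reordering: the cheap field comparison filters most records
   before the costly call. Safe only when the tests are independent and
   deep_validate has no side effects the program relies on. */
int passes_reordered(const struct record *r) {
    return r->kind == WANTED && deep_validate(r);
}
```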
Interprocedural optimizations
Interprocedural optimization works on the entire program, across procedure and file boundaries. It works in tandem with its intraprocedural counterparts, carried out with the cooperation of a local part and a global part. Typical interprocedural optimizations are procedure inlining, interprocedural dead-code elimination, interprocedural constant propagation, and procedure reordering. As usual, the compiler needs to perform interprocedural analysis before its actual optimizations. Interprocedural analyses include alias analysis, array access analysis, and the construction of a call graph.
Interprocedural optimization is common in modern commercial compilers from SGI, Intel, Microsoft, and Sun Microsystems. For a long time, the open source GCC was criticized for a lack of powerful interprocedural analysis and optimizations, though this is now improving.[16] Another open-source compiler with full analysis and optimization infrastructure is Open64.
Due to the extra time and space required by interprocedural analysis, most compilers do not perform it by default. Users must use compiler options explicitly to tell the compiler to enable interprocedural analysis and other expensive optimizations.
Practical considerations
There can be a wide range of optimizations that a compiler can perform, ranging from simple and straightforward optimizations that take little compilation time to elaborate and complex optimizations that involve considerable amounts of compilation time.[3]: 15 Accordingly, compilers often provide options to their control command or procedure to allow the compiler user to choose how much optimization to request; for instance, the IBM FORTRAN H compiler allowed the user to specify no optimization, optimization at the registers level only, or full optimization.[3]: 737 By the 2000s, it was common for compilers, such as Clang, to have several compiler command options that could affect a variety of optimization choices, starting with the familiar -O2 switch.[17]
An approach to isolating optimization is the use of so-called post-pass optimizers (some commercial versions of which date back to mainframe software of the late 1970s).[18] These tools take the executable output by an optimizing compiler and optimize it even further. Post-pass optimizers usually work on the assembly language or machine code level (in contrast with compilers that optimize intermediate representations of programs). One such example is the Portable C Compiler (PCC) of the 1980s, which had an optional pass that would perform post-optimizations on the generated assembly code.[3]: 736
Another consideration is that optimization algorithms are complicated and, especially when being used to compile large, complex programming languages, can contain bugs that introduce errors in the generated code or cause internal errors during compilation. Compiler errors of any kind can be disconcerting to the user, but especially so in this case, since it may not be clear that the optimization logic is at fault.[19] In the case of internal errors, the problem can be partially ameliorated by a "fail-safe" programming technique in which the optimization logic in the compiler is coded such that a failure is trapped, a warning message issued, and the rest of the compilation proceeds to successful completion.[20]
History
Early compilers of the 1960s were often primarily concerned with simply compiling code correctly or efficiently, and compile times were a major concern. One notable early optimizing compiler was the IBM FORTRAN H compiler of the late 1960s.[3]: 737 Another early and important optimizing compiler, which pioneered several advanced techniques, was the one for BLISS (1970), described in The Design of an Optimizing Compiler (1975).[3]: 740, 779 By the late 1980s, optimizing compilers were sufficiently effective that programming in assembly language declined. This co-evolved with the development of RISC chips and advanced processor features such as superscalar processors, out-of-order execution, and speculative execution, which were designed to be targeted by optimizing compilers rather than by human-written assembly code.[citation needed]
See also
- List of static code analyses
References
- ^ Godbolt, Matt (November 12, 2019). "Optimizations in C++ Compilers". ACM Queue. Vol. 17, no. 5.
- ^ "Lecture 15: NP-completeness, Optimization and Separation" (PDF). IE 511: Integer Programming, Spring 2021.
- ^ a b c d e f g h i j k Aho, Alfred V.; Sethi, Ravi; Ullman, Jeffrey D. (1986). Compilers: Principles, Techniques, and Tools. Reading, Massachusetts: Addison-Wesley. ISBN 0-201-10088-6.
- ^ a b Cooper, Keith D.; Torczon, Linda (2003) [2002-01-01]. Engineering a Compiler. Morgan Kaufmann. pp. 404, 407. ISBN 978-1-55860-698-2.
- ^ a b c Goss, Clinton F. (August 2013) [First published June 1986]. Machine Code Optimization – Improving Executable Object Code (PDF) (Ph.D. dissertation). Vol. Computer Science Department Technical Report #246. Courant Institute, New York University. arXiv:1308.4815. Bibcode:2013arXiv1308.4815G. Archived (PDF) from the original on 2022-10-09. Retrieved 22 Aug 2013.
- ^ A Programmer's Introduction to IBM System/360 Assembly Language (PDF). IBM. p. 42. GC20-1645-5.
- ^ "GCC – Machine-Dependent Options". GNU Project.
- ^ "RISC vs. CISC". cs.stanford.edu. Retrieved 2024-10-15.
- ^ James Gosling; Bill Joy; Guy Steele. "17 Threads and Locks". The Java Language Specification (1.0 ed.). 17.8 Prescient Store Actions.
- ^ Muchnick, Steven; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. pp. 329–. ISBN 978-1-55860-320-2. Quote: "constant folding".
- ^ Wegman, Mark N.; Zadeck, F. Kenneth (April 1991). "Constant Propagation with Conditional Branches" (PDF). ACM Transactions on Programming Languages and Systems. 13 (2): 181–210. doi:10.1145/103135.103136.
- ^ Click, Clifford; Cooper, Keith. (March 1995). "Combining Analyses, Combining Optimizations" (PDF). ACM Transactions on Programming Languages and Systems. 17 (2): 181–196. doi:10.1145/201059.201061.
- ^ a b Cx51 Compiler Manual, version 09.2001, p. 155, Keil Software Incorporated.
- ^ Dewar, Robert B. K.; Golumbic, Martin Charles; Goss, Clinton F. (August 2013) [October 1979]. MICRO SPITBOL. Computer Science Department Technical Report. Vol. 11. Courant Institute of Mathematical Sciences. arXiv:1308.6096. Bibcode:2013arXiv1308.6096D.
- ^ Golumbic, Martin Charles; Dewar, Robert B. K.; Goss, Clinton F. (1980). "Macro Substitutions in MICRO SPITBOL – a Combinatorial Analysis". Proceedings of the 11th Southeastern Conference on Combinatorics, Graph Theory and Computing, Congressus Numerantium, Utilitas Math., Winnipeg, Canada. 11th Southeastern Conference on Combinatorics, Graph Theory and Computing. Vol. 29. pp. 485–495.
- ^ Glazunov, N. M. (November 25, 2012). "Foundations of Scientific Research". arXiv:1212.1651 [cs.OH].
- ^ Guelton, Serge (August 5, 2019). "Customize the compilation process with Clang: Optimization options". Red Hat.
- ^ Evans, Michael (December 1982). "Software engineering for the Cobol environment". Communications of the ACM. 25 (12): 874–882. doi:10.1145/358728.358732. Retrieved 2013-08-10.
- ^ Sun, Chengnian; et al. (July 18–20, 2016). "Toward understanding compiler bugs in GCC and LLVM". Proceedings of the 25th International Symposium on Software Testing and Analysis. Issta 2016. pp. 294–305. doi:10.1145/2931037.2931074. ISBN 9781450343909. S2CID 8339241.
- ^ Schilling, Jonathan L. (August 1993). "Fail-safe programming in compiler optimization". ACM SIGPLAN Notices. 28 (8): 39–42. doi:10.1145/163114.163118. S2CID 2224606.
External links
- Optimization manuals by Agner Fog – documentation about x86 processor architecture and low-level code optimization
Optimizing compiler (Grokipedia)
Introduction
Definition and Purpose
An optimizing compiler is a specialized compiler that systematically analyzes and transforms source code or its intermediate representation into functionally equivalent machine code that exhibits improved attributes, such as reduced execution time, smaller code size, lower memory footprint, or decreased energy consumption.[8] These transformations occur primarily after the front-end phases of compilation, which include lexical analysis (tokenization of source code into meaningful units), parsing (verification of syntactic structure to build an abstract syntax tree), and semantic analysis (checking contextual meaning and generating an initial intermediate representation), providing a platform-independent form suitable for optimization.[9] The process then proceeds to the middle-end optimizer before final code generation in the back end.[8]

The core purpose of an optimizing compiler is to bridge the gap between high-level, human-readable source code and the constraints of the target hardware, automatically generating more efficient executables without altering program behavior or observable outputs.[10] By applying these improvements, it minimizes runtime overheads inherent in source-language abstractions, such as unnecessary data movements or computations, while considering factors like processor architecture and resource availability to enhance overall system performance.[8] This enables developers to focus on algorithmic logic rather than low-level tuning, with benefits extending to portability across diverse platforms and reduced power usage in resource-constrained environments.[9]

Basic examples of such optimizations include eliminating redundant computations, where identical expressions evaluated multiple times are replaced by a single computation and reuse of the result, thereby reducing processor cycles.[11] Another common approach involves reordering instructions to align data accesses with the processor's cache hierarchy, minimizing cache misses and improving data locality for faster execution.[12] Optimizations are often categorized by scope, such as local (within a basic block) versus global (across procedures), to systematically target these improvements.[8]

Role in the Compilation Process
An optimizing compiler integrates into the overall compilation process by transforming source code into efficient machine code through a structured pipeline divided into front-end, middle-end, and back-end stages. The front-end handles language-specific tasks, including lexical analysis, syntax parsing, semantic analysis, and generation of an intermediate representation (IR), such as three-address code or LLVM IR, which abstracts the source program for further processing.[13][14] This stage ensures the code is syntactically and semantically valid but does not yet focus on performance enhancements.

The core role of the optimizing compiler unfolds in the middle-end, where multiple iterative passes apply transformations to the IR to improve runtime efficiency, such as reducing execution time or memory usage, while preserving program semantics. These passes operate on a platform-independent IR, enabling optimizations that are agnostic to the source language and target architecture, and they typically follow intermediate code generation after semantic analysis has confirmed correctness.[13][14] For instance, in frameworks like LLVM, the middle-end pipeline includes sequences of simplification and optimization passes, often applied per-module or across linked modules in link-time optimization (LTO) variants.[14]

Following the middle-end, the back-end performs target-specific code generation, including register allocation, instruction selection, and peephole optimizations tailored to the hardware architecture, culminating in assembly or machine code before final linking.[13][14] Throughout this pipeline, optimizations balance compile-time overhead (such as increased processing duration and memory demands) against runtime benefits, with decisions on pass ordering and repetition guided by heuristics to avoid excessive costs.[14] Optimizations may be intra-procedural, confined to individual functions, or inter-procedural, spanning multiple modules, though their detailed scopes vary by implementation.[13]

Classification of Optimizations
Optimizations by Scope
Optimizations in compilers are categorized by scope based on the extent of the code region they analyze and transform, ranging from narrow, isolated segments to broader program structures that incorporate control flow. This classification highlights how the size of the analysis domain affects the complexity and potential impact of transformations, with smaller scopes enabling simpler, faster passes and larger scopes yielding more substantial improvements but requiring sophisticated analysis techniques. Local optimizations focus on straight-line code within basic blocks, ignoring inter-block dependencies, while global optimizations examine entire functions or procedures, leveraging data-flow information to handle branches and loops.[15][16]

Local optimizations operate exclusively within a basic block, defined as a maximal sequence of consecutive instructions with a single entry point and a single exit point, containing no internal branches or jumps. These transformations simplify expressions or eliminate redundancies without considering control flow outside the block, making them computationally inexpensive and suitable for early compilation passes. For instance, constant folding evaluates constant expressions at compile time, replacing operations like 2 + 3 with 5 directly in the code. Similarly, dead code elimination removes instructions whose results are never used, such as an assignment to a variable that is not referenced later in the block. These techniques enhance code density and execution speed within isolated segments but cannot address redundancies that span multiple blocks.[16][15]
In contrast, global optimizations analyze and transform code across an entire function or procedure, taking into account the control-flow graph that connects multiple basic blocks through branches and loops. This broader scope requires data-flow analysis to track information such as reaching definitions—which variables are defined before a point—and available expressions—to identify reusable computations—enabling transformations that local methods overlook. A key example is common subexpression elimination across branches, where an expression computed in one path, such as a + b, is reused in another if data-flow confirms its availability, avoiding redundant calculations. For instance, in code where a constant value reaches all paths, global optimization might propagate this constant to subsequent uses like y = x + 3 across the if-statement, simplifying it to y = 8 if the value reaches all paths safely. This distinction underscores that local optimizations ignore control dependencies, limiting their reach, whereas global ones use iterative data-flow algorithms to propagate facts across the function, often yielding greater performance gains at the cost of higher analysis overhead.[17][16][15]
The following illustrates a local optimization example in pseudocode:

Before: x = 5;
        y = x + 3;
After:  x = 5;
        y = 8;

Here, the constant assigned to x is propagated into the computation of y, which folds to 8 and no longer depends on x. For a global case involving propagation across an if-statement where the value reaches unconditionally:

Before: if (cond) {
            x = 5;
        } else {
            x = 5;
        }
        y = x + 3;
After:  x = 5;
        y = 8;
Optimizations by Phase and Target
Optimizations in compilers are often categorized by the phase of the compilation process in which they occur or by the form of the code they target, such as intermediate representations (IR), assembly, or object code. This classification emphasizes the timing and visibility of the code during optimization, which influences the scope and effectiveness of transformations. Early phases, typically operating on high-level IR, focus on structural simplifications, while later phases, such as post-codegen, apply target-specific adjustments to low-level code. This approach allows compilers to progressively refine code as more details about the target architecture become available.[18]

Peephole optimization exemplifies a late-phase technique that performs local pattern-matching on short sequences of instructions, usually after code generation at the assembly or machine code level. It scans for inefficient patterns and replaces them with more efficient equivalents, such as substituting a multiplication by a power of two with a left shift operation to leverage hardware efficiency. This method is particularly effective for machine-dependent tweaks, as it operates on a "peephole" view of 1 to 10 instructions without requiring global analysis, making it computationally lightweight yet impactful for reducing instruction count and improving execution speed. Typically implemented as a post-codegen pass, peephole optimization can be retargetable by defining rules based on machine descriptions, as demonstrated in early systems that achieved significant code size reductions on certain architectures. As a local technique within the scope classification, it contrasts with broader analyses but integrates well into multi-pass compilers.[18][19]

Machine and object code optimizations target low-level representations after IR transformations, focusing on architecture-specific improvements like instruction scheduling to exploit pipelining and reduce stalls. Instruction scheduling reorders independent operations to fill pipeline delays, such as delaying a load instruction's dependent use to allow intervening computations, thereby increasing throughput on superscalar or pipelined processors. These optimizations occur post-codegen or during assembly emission, where target details like latency and resource constraints are fully known, enabling heuristics like list scheduling or more advanced methods for VLIW architectures. For instance, on pipelined machines, such scheduling has been shown to improve performance by 15-30% in benchmark suites by minimizing interlocks without altering program semantics. Object code tweaks may also include peephole variants or relaxation of alignments to fit cache lines better.[20]

Interprocedural optimizations operate across function boundaries during middle-to-late phases, using call graph analysis to propagate information like constants or eliminate redundant calls, such as inlining small functions to reduce overhead. These require visibility beyond single procedures but are applied before final code generation to enable transformations like dead code elimination across modules. Early implementations focused on constant propagation, modeling value flows through calls to improve subsequent local optimizations, achieving measurable speedups in programs with heavy inter-function interactions. Details of advanced interprocedural techniques, such as whole-program analysis, are covered elsewhere.
Link-time optimization (LTO) represents a distinct late-phase approach performed during the linking stage on whole-program object code, enabling inter-module transformations that separate compilation obscures. By retaining intermediate representations in object files, LTO allows the linker to invoke the optimizer for actions like cross-module inlining or dead function removal, providing global visibility without full recompilation. First introduced in production compilers such as GCC 4.5 in 2010, LTO typically yields 5-10% performance gains in large applications by optimizing across compilation units, though it increases link time significantly due to whole-program processing. This requires compiler support for embedding IR in objects, ensuring compatibility with standard build flows.[21][22]

Optimizations by Dependency
Optimizations in compilers can be classified based on their dependency on the source programming language or the target hardware architecture, which determines their portability and applicability across different environments. Language-dependent optimizations leverage specific semantics or rules of the programming language to enable transformations that would not be safe or possible otherwise, while language-independent optimizations operate on generic intermediate representations (IR) without relying on such details. Similarly, machine-dependent optimizations exploit characteristics of the target hardware to achieve higher performance, whereas machine-independent ones produce code that is portable across architectures.

Language-independent optimizations work on any IR abstracted from the source language, focusing on universal code patterns such as constant propagation, where constant values are substituted into expressions to simplify computations and enable further reductions. These optimizations are reusable across compilers for different languages because they do not assume language-specific behaviors. In contrast, language-dependent optimizations exploit unique language semantics to perform aggressive transformations; for instance, in Fortran, compilers assume no pointer aliasing between distinct variables, allowing for reorderings and vectorizations that improve numerical computations without violating program correctness. Another example is in Java, where optimizations can avoid explicit memory management by leveraging the language's garbage collection semantics, such as eliding unnecessary object allocations or promotions that would trigger collection pauses.

Machine-independent optimizations, like loop fusion, which combines adjacent loops to reduce overhead and improve data locality (see the sketch below), are portable and can be applied regardless of the target processor, making them suitable for early compilation phases. These contribute to consistent performance gains across hardware without customization. Machine-dependent optimizations, however, tailor code to specific hardware features; vectorization, for example, uses SIMD (Single Instruction, Multiple Data) instructions available on certain CPUs to process multiple data elements in parallel, significantly accelerating loops in compute-intensive applications. Additionally, machine-dependent techniques may reorder instructions to exploit CPU cache hierarchies, minimizing memory access latencies on architectures with multi-level caches.

A core advantage of independent optimizations, whether language- or machine-independent, is their reusability across diverse compilers and platforms, facilitating modular compiler design and broader applicability. Dependent optimizations, while offering peak performance tailored to specific contexts, reduce portability, as code optimized for one language feature or hardware trait may not transfer effectively to others, often requiring separate implementations in the compiler backend. This trade-off underscores the balance between generality and specialization in optimization strategies.
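A sketch of the machine-independent loop fusion mentioned above, assuming the arrays do not alias:

```c
/* Before: two adjacent loops traverse the same index range separately. */
void compute(int n, int *a, const int *b, int *c) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2;
}

/* After loop fusion: one traversal with better locality. Legal here because
   each a[i] is written before the fused body reads it and the arrays are
   assumed not to overlap. */
void compute_fused(int n, int *a, const int *b, int *c) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i] * 2;
    }
}
```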
Fundamental Concepts
Factors Influencing Optimization
The effectiveness and applicability of optimizations in a compiler are shaped by several key factors related to the input program, the compilation environment, and the target system. Code characteristics play a central role, as the quality and structure of the source code determine the potential for improvement. For instance, high-level languages like C++ or Fortran provide compilers with more opportunities for aggressive optimizations, such as automatic memory management and loop transformations, compared to low-level assembly code where much of the control is already optimized by the programmer.[16] Programs written in high-level languages often exhibit inefficiencies, like redundant computations or poor locality, that compilers can exploit to generate more efficient machine code, whereas assembly inputs limit such interventions to minimal adjustments.[16] Additionally, features like array accesses in nested loops or correlated variable values in expressions can either enable precise optimizations, such as constant propagation, or hinder them due to imprecision in analysis.[16]

Compiler constraints further influence the scope and depth of optimizations applied. A primary limitation is the compile-time budget, as many optimization passes, including iterative data-flow analyses, can incur exponential time complexity in the worst case, necessitating trade-offs between thoroughness and efficiency.[16] For example, one-pass compilation strategies prioritize speed over exhaustive analysis, while multi-pass approaches allow deeper optimizations but extend compilation duration.[16] Regardless of these constraints, all optimizations must rigorously preserve program semantics, ensuring that the transformed code produces identical observable behavior, including side effects from function calls and exception handling, to maintain correctness.[16] Violations, such as reordering operations that alter side effects, are avoided through conservative analyses or compensation code.[16]

Hardware factors tied to the target architecture significantly dictate optimization choices, often classifying them as machine-dependent in broader dependency schemes. Aspects like cache sizes (typically 16-64 KB for L1 caches with 1-4 ns access times) affect locality-enhancing transformations, such as loop blocking to minimize cache misses in matrix operations.[16][23] Similarly, the number of available registers influences register allocation strategies, while pipeline structures and instruction sets (e.g., RISC vs. CISC) guide instruction scheduling to exploit parallelism and reduce latency from high-latency operations like multiplication.[16] Optimizations must align with these hardware specifics to achieve performance gains; for example, SIMD instructions enable vectorization only on supported architectures.[16]

User directives provide explicit control over optimization aggressiveness, allowing developers to balance performance, size, and compilation time. In tools like GCC, flags such as -O0 disable most optimizations for fast compilation and debugging, while -O1 enables basic passes like constant propagation with minimal time overhead; -O2 activates aggressive techniques including inlining and vectorization; -O3 adds advanced loop interchanges for peak speed at higher cost; and -Os prioritizes code size over execution time.[24] These levels determine which optimizations are invoked, directly impacting the compiler's behavior on a given codebase.[24]

Common Optimization Themes
Optimizing compilers pursue several core performance metrics to enhance the efficiency of generated code. Primary among these is speed, achieved by reducing the number of executed instructions or CPU cycles through techniques that streamline computation without altering program semantics. For instance, optimizations target minimizing execution time, often yielding double-digit percentage improvements in benchmarks. Code size is another key metric, focusing on minimizing the footprint of the binary by eliminating unnecessary elements, such as unreferenced functions or redundant data. In resource-constrained environments like embedded systems, power consumption becomes critical, with optimizations designed to lower energy use during execution by reducing computational overhead and memory accesses.[10][24]

A foundational theme in compiler optimization is ensuring safety and correctness, where transformations must preserve the program's observable behavior to avoid introducing errors. This is particularly vital in languages like C++, where undefined behavior (UB), such as integer overflow or invalid pointer dereferences, allows compilers to assume non-occurrence for aggressive optimizations, but exploiting UB can risk program crashes or incorrect results if assumptions fail. For example, compilers may reorder operations or eliminate checks around UB, potentially improving performance but only if the code adheres strictly to defined behavior; violations can lead to regressions in safety. Thus, optimizations respect language standards to maintain functional equivalence, prioritizing verifiable correctness over unchecked speed gains.[25][26]

Optimizations inherently involve trade-offs, notably between compile-time and runtime costs. Aggressive strategies, such as extensive inlining or interprocedural analysis, can significantly boost runtime performance but increase compilation time and memory usage, sometimes by orders of magnitude for large codebases. Conversely, milder optimizations favor faster builds at the expense of suboptimal runtime efficiency. Hardware factors, like varying instruction latencies across architectures, further influence these trade-offs by dictating how themes such as speed versus size are balanced in practice.[24][27]

Recurring across optimization strategies is the theme of redundancy elimination and resource efficiency, serving as universal goals to prune wasteful computations and allocations. By identifying and removing duplicate operations, such as recomputed values or unused code, compilers achieve broader efficiency, applicable from local peephole tweaks to global analyses, ultimately conserving CPU cycles, memory, and energy. This principle underpins many transformations, ensuring resource use aligns closely with the program's intent. A specific enabler of such freedoms is the as-if optimization rule in standards like C and C++, which permits any transformation as long as the observable output matches execution without optimizations, including I/O order and volatile accesses. For C, this is codified in ISO/IEC 9899:2011, section 5.1.2.3, while C++ echoes it in ISO/IEC 14882:2014, [intro.execution]/5, allowing compilers to disregard strict sequentiality for efficiency when behavior remains equivalent.[28]

Core Techniques
Core Techniques
Loop and Control Flow Optimizations
Loop optimizations target repetitive structures in code to minimize execution overhead, while control flow optimizations simplify branching to enhance predictability and parallelism. These techniques operate primarily on intermediate representations during the middle-end of compilation and rely on data-flow analysis to identify opportunities for transformation. By reducing the frequency of loop control instructions and branch decisions, they decrease branch mispredictions and expose more instruction-level parallelism (ILP), improving runtime performance on modern processors.[29]
Loop-invariant code motion hoists computations that do not change across iterations out of the loop body, avoiding redundant evaluation. If an expression's operands are not modified inside the loop, the expression is computed once before the loop and its value reused, reducing the arithmetic performed in each iteration. This optimization, part of partial redundancy elimination frameworks, ensures safe movement by verifying invariance through data-flow analysis. Seminal work formalized this as lazy code motion, which computes optimal placement points that minimize the number of computations while preserving semantics.
Loop unrolling replicates the loop body multiple times, reducing the number of iterations and the associated overhead of increment, test, and branch instructions. For a loop iterating from 0 to 3, unrolling by a factor of 4 expands it into explicit statements without the loop construct, eliminating branch checks entirely for small fixed counts. Unrolling exposes more ILP by scheduling independent operations from different iterations together and amortizes loop control costs, though it increases code size and may require subsequent register allocation adjustments. Early implementations demonstrated speedups in retargetable compilers by unrolling aggressively based on dependence analysis.[30]
Strength reduction replaces computationally expensive operations within loops, such as multiplications, with cheaper equivalents like additions, particularly for induction variables. In a loop where a value is the product of the loop index and a constant (for example, an array offset computed as the index times the element size), the repeated multiplication is transformed into an incremental addition after initial setup, exploiting the loop's regular progression. The technique targets linear functions of loop indices, minimizing cycle costs on processors where multiplication latency exceeds addition. The original algorithm used an indexed temporary table to identify and eliminate strong operators in strongly connected regions, forming a foundation for modern implementations.[31]
Control flow optimizations streamline branching constructs to reduce decision overhead and misprediction penalties. For if-then-else chains, compilers may apply if-conversion, transforming conditional branches into predicated straight-line code using conditional moves or masks, which eliminates branches at the cost of executing both paths. Switch statements are optimized via jump tables or binary search trees for dense case sets, enabling constant-time dispatch and avoiding the linear scans of equivalent if-else chains. These transformations lower branch misprediction rates by flattening control flow, and feedback-directed variants promote frequent cases to shorten the average path.[32][33]
Loop trip count estimation informs these optimizations by approximating the number of iterations, often via a closed form such as (upper bound − lower bound) / step for simple arithmetic progressions. Accurate estimation guides decisions such as unrolling factors or peeling for non-constant counts, preventing over-optimization of short loops. Predictors use static analysis or profiles to refine the estimates, which is crucial for techniques like vectorization, where imprecise counts can lead to misalignment and degrade performance.[34]
Collectively, these optimizations mitigate branch mispredictions (common in loops, where back edges are predictable but the terminating branch is costly) and boost ILP by enlarging basic blocks and reducing control dependencies. Studies show that unrolling and invariant motion can reduce misprediction rates in loop-heavy workloads, enhancing superscalar execution.[29][35]
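The following C sketch illustrates two of the loop transformations described above, loop-invariant code motion and strength reduction. The names are illustrative, and the second function shows one plausible result a compiler might produce rather than the output of any specific compiler.
    /* Original: n * scale does not change across iterations, and i * 4 is a
       linear function of the induction variable i. */
    void fill(int *a, int n, int scale) {
        for (int i = 0; i < n; i++)
            a[i] = n * scale + i * 4;
    }

    /* Possible optimized form: the invariant product is hoisted out of the
       loop, and the multiplication by 4 becomes a running addition. */
    void fill_optimized(int *a, int n, int scale) {
        int base = n * scale;   /* loop-invariant code motion */
        int step = 0;           /* strength-reduced i * 4     */
        for (int i = 0; i < n; i++) {
            a[i] = base + step;
            step += 4;
        }
    }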
Data-Flow and SSA-Based Optimizations
Data-flow analysis forms the foundation for many compiler optimizations by computing how values propagate through a program's control-flow graph (CFG), enabling transformations that eliminate redundancy and improve efficiency. Introduced in seminal work on program optimization, the technique models data dependencies using lattice-based frameworks and solves sets of equations iteratively across basic blocks.[36] Forward analyses propagate information from entry to exit, as in reaching definitions, which identify every definition of a variable that may reach a given program point. The standard equations for reaching definitions over a block B are IN[B] = ⋃ OUT[P] over all predecessors P of B, and OUT[B] = GEN[B] ∪ (IN[B] − KILL[B]). Here GEN[B] is the set of definitions generated within block B and KILL[B] the set of definitions it overwrites; the equations are solved by fixed-point iteration until convergence.[36] Backward analyses propagate information from exits to entries, as exemplified by live-variable analysis, which determines the variables that may still be used along some path from a program point to the end. For liveness, the equations are OUT[B] = ⋃ IN[S] over all successors S of B, and IN[B] = USE[B] ∪ (OUT[B] − DEF[B]), where USE[B] is the set of variables used in B before any definition and DEF[B] is the set of variables defined in B.[36] These analyses are distributive and monotonic, which guarantees termination on finite lattices, and they are typically implemented with bit-vector representations for efficiency.[37]
Constant propagation uses data-flow information to replace variables with their known constant values, reducing computational overhead and creating opportunities for further simplifications such as constant folding. In its basic form it relies on reaching definitions to substitute a constant wherever a variable is invariably defined by a constant expression before its use, propagating forward through the CFG and treating conditional branches conservatively.[38] For instance, if the analysis shows that a variable x is always assigned 42 before every use, all references to x can be replaced with 42, which may in turn make branches unconditional and expose dead branches for elimination. Advanced variants, such as conditional constant propagation, track path-sensitive constants for greater precision at the cost of more complex lattices.[38] The optimization is particularly effective when interleaved with other passes, as it exposes redundancies in arithmetic operations.
Static single assignment (SSA) form enhances data-flow analyses by renaming variables so that each is assigned exactly once, yielding an intermediate representation in which dependencies are explicit and optimizations such as constant propagation become straightforward. Developed as a unified framework for data and control flow, SSA introduces φ-functions at merge points (the joins of control paths) to select a value according to the incoming edge, for example x3 = φ(x1, x2), where x1 and x2 are the versions of x arriving from the predecessor blocks.[39] This renaming removes the need to track multiple definitions per variable, simplifying reaching-definitions and liveness computations, since each use depends on a single, dominating assignment. Conversion to SSA typically uses dominance frontiers to place φ-functions precisely, enabling sparse representations that scale to large programs.[39] Although SSA requires inserting φ-nodes at control merges, it facilitates precise value numbering and copy propagation without recomputing full data-flow sets.
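The effect of SSA construction can be seen on a small C fragment; the SSA rendering in the comments uses illustrative version names (x1, x2, x3) and is one plausible translation, not the output of a specific compiler.
    int select_value(int c) {
        int x = 1;        /* x1 = 1                           */
        if (c)
            x = 2;        /* x2 = 2                           */
        return x + 3;     /* x3 = phi(x1, x2); return x3 + 3  */
    }
Because every use now refers to exactly one definition, passes such as constant propagation and dead code elimination can reason about each SSA name independently instead of recomputing per-variable data-flow sets.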
Dead code elimination removes computations whose results are never used, relying on data-flow analyses to identify unreachable or unused code. Using live-variable analysis, an assignment to a variable that is not live after the definition can be deleted, since it contributes no observable effect; for example, if a variable is defined but never used along any path, the defining statement is eliminated.[40] In SSA form this is accelerated by pruning assignments that have no uses and by removing φ-functions that become trivial (e.g., x2 = φ(x1, x1), which reduces to a copy of x1), propagating deletions backward through the dependence chains. Reaching definitions complement this by identifying partially dead code, such as definitions that are overwritten by later ones on all paths. These techniques preserve program semantics while reducing code size, and SSA-based variants achieve higher precision in modern compilers by avoiding conservative approximations in multi-path scenarios.[39]
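A small C example shows how constant propagation and dead code elimination reinforce each other; the foldings in the comments are one plausible outcome, and the function name is illustrative.
    int configured_size(void) {
        int n = 42;        /* the only definition reaching every use of n   */
        if (n < 0)         /* constant propagation folds the test to false  */
            return -1;     /* the branch becomes unreachable and is removed */
        return n * 2;      /* may fold to: return 84;                       */
    }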
Register and Code Generation Optimizations
Register allocation maps program variables, typically represented as virtual registers in intermediate code, onto a limited number of physical machine registers, reducing slow memory accesses and improving execution speed. The process constructs an interference graph in which nodes correspond to live ranges of variables and an edge connects two nodes whose live ranges overlap, indicating that those variables cannot share a physical register.[41] The resulting graph coloring problem, assigning colors (registers) so that no adjacent nodes share a color, is NP-hard, so heuristics are used to approximate optimal solutions efficiently.[41] A seminal heuristic introduced by Gregory Chaitin performs greedy coloring on the interference graph while estimating spill costs to decide which variables to store temporarily in memory if the graph cannot be colored with the available registers. Chaitin's algorithm builds the interference graph from liveness information, simplifies it by removing low-degree nodes, and colors nodes in an order guided by degree and cost, achieving near-optimal allocation in practice on many architectures.[41] Minimizing spills (inserted load and store instructions) improves execution speed in register-constrained environments, as demonstrated in early implementations on minicomputers.[41]
Instruction selection translates intermediate representations, such as expression trees, into sequences of target machine instructions, choosing opcodes that match the semantics while accounting for costs such as latency and code size. A foundational technique is bottom-up tree pattern matching, in which the compiler matches subtrees of the expression against predefined instruction patterns and selects the lowest-cost covering.[42] Alfred V. Aho and Stephen C. Johnson developed a dynamic programming algorithm that generates optimal code for expression trees on register machines in linear time for a broad class of architectures by computing minimum-cost labelings from the leaves up to the root.[42] This approach yields efficient mappings, such as choosing a single multiply-add instruction over separate operations when one is available.
Instruction scheduling reorders the selected instructions within basic blocks or traces to exploit hardware parallelism, such as pipelining or multiple execution units, while preserving program semantics and data dependencies. List scheduling, a priority-based heuristic, maintains a ready list of schedulable instructions and selects the highest-priority one (for example, the one on the longest latency path or with the highest resource utilization) for each issue slot, reducing stalls in superscalar processors.[4] The technique, adapted from critical-path methods, can improve instruction-level parallelism in loop-heavy code on in-order pipelines, as seen in early vectorizing compilers.[4]
Peephole optimizations apply local transformations to short sequences of generated code, identifying inefficient patterns and replacing them with more efficient equivalents without altering functionality. Introduced by William M. McKeeman, the method scans assembly-like code through a "peephole" window of fixed size (e.g., 3-5 instructions) to eliminate redundancies such as consecutive loads from the same address or unnecessary moves. Common patterns include replacing load r1, mem; load r1, mem with a single load, or simplifying branch sequences, which together reduce code size and improve performance through better cache locality and fewer executed instructions. These optimizations are machine-specific and are often applied iteratively during code generation for cumulative gains.
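To make the sliding-window mechanism concrete, the following self-contained C sketch implements a toy peephole pass over a symbolic instruction list, deleting the second of two identical consecutive loads as in the pattern above. The Insn type and the three-instruction program are invented for illustration and do not correspond to any real compiler's internal representation.
    #include <stdio.h>
    #include <string.h>

    typedef struct { char op[8]; char dst[8]; char src[16]; } Insn;

    /* Drop an instruction that reloads the same value into the same register
       as the immediately preceding instruction; keep everything else. */
    static int peephole(Insn *code, int n) {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out > 0 &&
                strcmp(code[i].op, "load") == 0 &&
                strcmp(code[out - 1].op, "load") == 0 &&
                strcmp(code[i].dst, code[out - 1].dst) == 0 &&
                strcmp(code[i].src, code[out - 1].src) == 0)
                continue;              /* redundant reload: delete it */
            code[out++] = code[i];
        }
        return out;                    /* new instruction count */
    }

    int main(void) {
        Insn prog[] = {
            {"load", "r1", "mem"}, {"load", "r1", "mem"}, {"add", "r2", "r1"}
        };
        int n = peephole(prog, 3);
        for (int i = 0; i < n; i++)
            printf("%s %s, %s\n", prog[i].op, prog[i].dst, prog[i].src);
        return 0;
    }
Production peephole passes run on the compiler's own machine-instruction representation and match many such patterns, but the windowed scan-and-rewrite structure is the same.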
Advanced and Specialized Optimizations
Interprocedural and Link-Time Optimizations
Interprocedural optimizations extend analysis and transformation beyond individual procedures, letting the compiler propagate information and apply transformations across function boundaries to uncover opportunities unavailable to intraprocedural analysis. Unlike local optimizations confined to a single function, interprocedural techniques require representations of program structure, such as call graphs, to model control flow between procedures and to support data-flow propagation. The broader scope allows more aggressive improvements, such as eliminating call overhead and refining assumptions about data usage, but brings challenges in scalability and precision because of the exponential growth in analysis complexity. Seminal work in this area, including data-flow frameworks for interprocedural constant propagation, demonstrated potential runtime speedups of up to 20% by integrating alias, modification, and use information across procedures.[43]
Function inlining is a foundational interprocedural technique that replaces a procedure call with the body of the called procedure, eliminating invocation overhead and exposing the inlined code to the caller's optimization context. The substitution enables subsequent transformations, such as constant propagation or dead code elimination, that the call boundary previously blocked, often yielding measurable performance gains. Empirical studies on C and C++ programs show that inlining improves execution time by 2-3% in C benchmarks and by up to 46% in C++, thanks to shallower call stacks and the language's typically small functions, though excessive inlining risks code bloat. Heuristics in modern compilers balance these trade-offs by prioritizing small, frequently called procedures while limiting total code growth.[44]
Whole-program analysis underpins interprocedural optimizations by constructing a call graph (a directed graph with procedures as nodes and calls as edges) over which facts such as constants or live variables can be propagated across the entire program. Scalable algorithms, such as propagation-based methods using class, type, or method-specific sets, construct precise call graphs for large object-oriented programs, reducing polymorphic call sites by up to 26% compared with simpler class-hierarchy approximations and supporting optimizations like inlining without runtime overhead. These techniques process the graph in topological order, summarizing effects at each procedure to avoid redundant computation, and have been implemented in tools such as the Jikes Bytecode Toolkit for Java applications exceeding 300,000 lines.[45]
Link-time optimization (LTO) extends interprocedural analysis to the linking phase, where object files are merged to give whole-program visibility, allowing transformations across compilation units that separate compilation would otherwise obscure. In LTO frameworks like GCC's, the compiler preserves its intermediate representation (e.g., GIMPLE) in the object files, enabling devirtualization, which resolves virtual calls through the merged type information, and inter-module dead code elimination, which removes unreferenced internal-linkage entities. The result is reduced code size and improved runtime performance, for example through cross-file constant folding, while reusing the existing optimizer infrastructure; LTO can also analyze a subset of the program's modules without requiring full recompilation.[21]
Clone-and-specialize, or procedure cloning, addresses limitations of inlining by duplicating a procedure into specialized versions tailored to specific call sites, propagating caller-specific facts such as constants without copying the entire body into each caller. The compiler identifies cloning opportunities through interprocedural analysis on the call graph, creating copies that refine parameters (for example, specializing a loop bound) while bounding code growth with heuristics that limit the number of instances per procedure. The technique enhances optimizations such as constant propagation, as demonstrated in early implementations where cloning exposed additional facts across boundaries and improved overall efficiency without the bloat of aggressive inlining. Seminal algorithms use a three-phase process of candidate identification, benefit estimation, and selective replication to ensure a net performance gain.[46]
Interprocedural optimizations often rely on escape analysis to handle pointers and objects safely across boundaries, determining whether data escapes its allocation scope so that transformations such as stack allocation or synchronization removal can be applied. In languages like Java, interprocedural escape analysis uses connection graphs and data-flow algorithms over the call graph to classify objects as non-escaping (local to a procedure or thread), supporting optimizations such as eliminating 51% of lock operations (median) and stack-allocating 19% of objects, with median execution-time reductions of 7%. This flow-sensitive, context-sensitive approach summarizes method effects for reuse and converges on precise summaries, avoiding over-conservative assumptions in pointer-heavy code.[47]
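The following C sketch shows the cross-boundary effect that inlining (and, for a single call site, cloning) aims to expose: once the callee body is visible at the call site, the constant argument can be folded through it. The names are illustrative and the folded result in the comment is one possible outcome.
    static int scaled(int x, int factor) {
        return x * factor;
    }

    int frame_bytes(void) {
        /* After inlining scaled(), constant propagation sees 640 * 4,
           so the call may reduce to a compile-time constant. */
        return scaled(640, 4);   /* may fold to: return 2560; */
    }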
Language-Specific Optimizations
Optimizing compilers leverage the unique semantic guarantees and paradigms of specific programming languages to perform targeted optimizations that would be unsafe or infeasible in more general settings. These language-specific techniques exploit properties such as immutability, aliasing restrictions, or runtime models like garbage collection to improve performance while preserving correctness. By tailoring optimizations to the language's design, compilers can eliminate overheads inherent to the paradigm, such as recursion in functional languages or dynamic dispatch in object-oriented ones.[48]
In functional languages, tail call optimization (TCO) transforms tail-recursive calls into iterative loops, preventing stack overflow in deeply recursive code and ensuring constant stack space usage. The technique, which compiles tail calls without pushing new stack frames, was foundational in early Lisp implementations and was formalized as proper tail recursion in Scheme to guarantee space efficiency. In languages like Haskell or Scheme, TCO makes recursive algorithms common in functional programming, such as tree traversals, efficient by converting them to loops at compile time.[49][50]
Object-oriented languages benefit from optimizations such as devirtualization, which replaces dynamic method dispatch with a direct call when the compiler can prove there is a single possible target, reducing invocation overhead. This is particularly effective in just-in-time compilers for Java, where class hierarchy analysis identifies monomorphic or bimorphic call sites, enabling inlining and further optimizations. Complementing this, escape analysis determines whether objects escape their allocation scope, allowing stack allocation instead of heap allocation and synchronization elimination for thread-local objects. Applied in Java virtual machines, these techniques can significantly reduce memory-management costs in object-heavy workloads by exploiting the language's object model.[51][52]
Languages with strict aliasing rules, such as Fortran, enable aggressive array optimizations by guaranteeing that distinct arrays or pointers do not overlap, allowing vectorization and loop unrolling without disambiguation checks. Fortran compilers exploit this guarantee, which is defined in the language standard, to reorder array accesses freely, yielding significant speedups in numerical computations. In contrast, Java's array accesses require bounds checks for safety, but optimizing compilers eliminate redundant checks using range analysis and induction-variable tracking, removing many of them in loop-heavy code.[53]
A core advantage in purely functional languages like Haskell arises from immutability guarantees, which enable automatic parallelism without race conditions, since data cannot be modified after creation. Compilers can therefore fuse immutable array operations and vectorize them for SIMD execution, or schedule independent expressions for parallel evaluation, improving throughput on multicore systems by factors of 2-4x for data-parallel tasks. This semantic purity contrasts with mutable languages, where aliasing complicates such transformations.[54]
For garbage-collected languages like Java or ML, optimizations shift focus from manual memory management to runtime-assisted allocation, incorporating write barriers and root-set analysis to minimize collection pauses. Compilers generate code that cooperates with the garbage collector, such as precise stack maps for thread roots, enabling moving collectors to work without conservative scanning, and they reduce heap pressure through scalar replacement of non-escaping objects. These adaptations ensure that language-level abstractions do not incur prohibitive costs, with optimizations like biased locking further eliding synchronization in single-threaded scenarios.[48]
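Tail call optimization can be illustrated in C syntax, although C compilers are not required to perform it. The tail-recursive factorial below does no work after its recursive call, so a compiler applying TCO can lower it to something equivalent to the loop version, using constant stack space; the loop is one plausible lowering and the names are illustrative.
    long factorial_tail(long n, long acc) {
        if (n <= 1)
            return acc;
        return factorial_tail(n - 1, acc * n);   /* tail call: nothing follows it */
    }

    /* One possible lowering after tail call optimization: */
    long factorial_loop(long n, long acc) {
        while (n > 1) {
            acc *= n;
            n -= 1;
        }
        return acc;
    }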
Prescient and Other Emerging Optimizations
Prescient stores are a speculative optimization technique for multi-threaded environments, particularly in languages like Java, in which store actions may be reordered to precede the corresponding assign actions under certain conditions. This allows compilers to perform early writes on the assumption of future reads, enabling reorderings that improve memory access patterns and prefetching without violating thread safety in properly synchronized programs. By relaxing ordering constraints in the memory model, prescient stores permit aggressive intermediate-code optimizations, such as eliminating redundant assignments, while semantic equivalence is established through formal verification with operational semantics. However, early formulations of prescient stores in the Java Memory Model led to anomalies such as coherence violations, prompting refinements that enforce dependence ordering and barriers for safer speculation.[55][56]
Profile-guided optimization (PGO), also known as feedback-directed optimization, uses runtime execution profiles collected from instrumented test runs to inform static compilation decisions, bridging the gap between compile-time analysis and dynamic behavior. The process typically has three phases: instrumenting the code to generate profile data (e.g., branch frequencies and call counts), running representative workloads to populate profile files, and recompiling with that data to apply targeted transformations. This enables precise inlining of hot functions, speculative resolution of virtual calls to their most frequent targets, improved register allocation for hot paths, and reordering of basic blocks or conditionals according to execution likelihood, often yielding measurable speedups in real-world applications. For instance, PGO can prioritize speed in regions with high execution time while minimizing code size elsewhere; implementations in compilers such as GCC and LLVM demonstrate consistent gains through better cache locality and fewer branch mispredictions.[57][58]
Auto-vectorization exploits SIMD parallelism in loops by automatically analyzing and transforming scalar code into vector instructions during compilation, without explicit programmer annotations. In frameworks like LLVM, the loop vectorizer identifies amenable loops (those with independent iterations, uniform operations, and predictable memory accesses) and widens instructions to process multiple data elements at once using SIMD units, applying cost models to balance vectorization factors against overheads such as runtime checks. Its capabilities include handling reductions (e.g., summing array elements), if-conversion for predicated execution, and partial unrolling to match hardware vector widths, while the SLP vectorizer complements it by packing independent scalar operations across basic blocks into vectors. The technique significantly boosts throughput on architectures with wide SIMD support, such as AVX, by increasing instruction-level parallelism in data-parallel regions.[59][60]
Speculative execution optimizations extend traditional analyses with control and data speculation, allowing the compiler to hoist or eliminate operations under assumptions about runtime conditions that are verified dynamically. A unified framework based on speculative static single assignment (SSA) form integrates alias profiling and heuristics to enable aggressive transformations, such as speculative partial redundancy elimination, register promotion, and strength reduction, even in the presence of uncertain dependencies. These optimizations are particularly effective on explicitly parallel instruction computing (EPIC) architectures, where runtime checks (e.g., via advanced load tables) recover from mis-speculation, improving performance on benchmarks such as SPEC2000 by exposing more parallelism without excessive overhead. Partial inlining builds on this by incorporating only the hot portions of callee functions into callers and outlining cold code into separate routines, reducing overhead while preserving optimization opportunities across boundaries. This region-based, profile-guided approach improves instruction-level parallelism in ILP-focused compilers by minimizing call costs and improving code locality, as demonstrated by demand-driven inliners that target frequently executed paths.[61][62]
Emerging optimizations increasingly use machine learning to autotune compiler parameters and phase orderings, moving beyond static heuristics to predict good configurations from program characteristics and hardware profiles. ML-driven autotuning trains models on diverse benchmarks to select transformation sequences, addressing the combinatorial explosion of the optimization space and achieving gains where traditional heuristics falter, such as on irregular workloads. Frameworks that bridge machine learning and compilers, including those that allow Python-based models to be integrated, enable rapid experimentation and deployment of these predictive optimizations, with recent work (as of 2024) emphasizing modularity and transparency in production settings.[63]
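The loop shape that auto-vectorizers look for can be shown in a short C function with independent iterations, unit-stride accesses, and restrict-qualified pointers asserting that the arrays do not overlap; whether a compiler actually widens it to SIMD instructions depends on the target and optimization level, and the function name is illustrative.
    void saxpy(int n, float a, const float * restrict x, float * restrict y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* iterations are independent, so the
                                         body may be widened to SIMD lanes */
    }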
Implementation and Practical Aspects
Challenges in Optimization
One of the primary challenges in optimizing compilers is ensuring correctness, as aggressive optimizations can introduce subtle bugs, particularly when handling undefined behavior in languages like C and C++. For instance, compilers may eliminate code or make assumptions based on undefined behavior, such as signed integer overflow, leading to unexpected program termination or security vulnerabilities in real-world software such as the Linux kernel.[64] A study of over 100 compiler-introduced bugs revealed that many stem from erroneous optimizations that misinterpret source code semantics, often requiring developers to modify code to evade the optimizer rather than fixing the compiler itself.[65] These risks are exacerbated by "optimization-unstable code", where seemingly benign constructs are unexpectedly discarded, affecting program reliability across benchmarks.[66]
Performance pitfalls arise when optimizations intended to improve speed inadvertently degrade overall execution through side effects such as increased code size or cache pressure. Excessive function inlining, for example, can cause code bloat by duplicating large function bodies at multiple call sites, producing larger binaries that suffer instruction cache misses and slower load times, particularly in resource-constrained environments.[10] Research on inlining decisions shows that suboptimal heuristics can inflate code size by up to 20% in some workloads, negating runtime gains and even harming performance on modern processors with limited cache hierarchies.[67]
Scalability poses significant hurdles for optimizations on large codebases, where whole-program analyses can exhibit exponential worst-case time complexity because of intricate dependencies such as call graphs or pointer aliasing. For object-oriented languages, representing class hierarchies and method invocations can lead to exponential growth in analysis states, making full interprocedural optimization impractical for programs exceeding millions of lines of code.[68] To mitigate this, compilers limit analysis scope or use approximations, but these trade-offs can miss opportunities in massive projects, as seen in efforts to scale static analysis tools to modern C++ codebases.[69]
Debugging optimized code presents additional difficulties, as transformations like instruction reordering, dead code elimination, and variable consolidation disrupt the expected correspondence between source lines and machine instructions. Variables may not hold their expected values at apparent points in the code, or entire loops may be unrolled in ways that obscure control flow, complicating breakpoint placement and stack-trace interpretation.[70] Compiler vendors address this partially through debug-friendly optimization flags, but in highly optimized builds developers often resort to non-optimized compilations for diagnosis, which delays production tuning.[71]
Many compiler optimizations rely on heuristics to approximate solutions to NP-complete problems, such as optimal instruction scheduling or register allocation, bridging the gap between theoretical intractability and practical feasibility. These heuristics, often tuned via machine learning or empirical profiling (and increasingly incorporating advanced AI techniques such as large language models for performance modeling as of 2024), provide near-optimal results in polynomial time but vary in effectiveness across workloads, highlighting the ongoing challenge of balancing precision and efficiency.[72][73][74] Compile-time overhead from such analyses further limits optimization aggressiveness, since excessive analysis time can make builds untenable for large-scale development.[75]
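The kind of "optimization-unstable" code described above can be shown in a short C sketch: because the pointer is dereferenced before it is tested, a compiler may infer that it cannot be null and delete the guard. The function name is illustrative, and whether the check is actually removed depends on the compiler and its options.
    #include <stddef.h>

    int read_flag(const int *p) {
        int v = *p;        /* dereference lets the compiler assume p != NULL */
        if (p == NULL)     /* so this check may be deleted as provably false */
            return -1;
        return v;
    }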
Tools and Evaluation Metrics
Major optimizing compilers include the GNU Compiler Collection (GCC) and the LLVM infrastructure with its Clang frontend. GCC, developed by the GNU Project, supports a range of optimization flags that progressively apply transformations to improve program performance and reduce code size: -O0 performs no optimization (the default, suited to debugging), -O1 applies basic optimizations such as constant propagation and dead code elimination, -O2 adds more aggressive passes including inlining and loop unrolling, -O3 targets maximum speed with optimizations such as vectorization, and -Os produces size-optimized code that favors smaller binaries over peak performance.[24] Clang, built on LLVM, mirrors these levels (-O0 to -O3, -Os, plus flags such as -Ofast for still more aggressive tuning) while leveraging LLVM's modular optimizer. The LLVM optimizer, invoked via the opt tool, applies sequences of passes such as instruction combining, loop optimizations, and dead code elimination, and lets developers run specific transformations independently or in pipelines corresponding to the optimization levels. Recent updates, such as Intel's oneAPI DPC++/C++ Compiler version 2025.0, add improved optimization reports that give developers better insight into the transformations applied.[76][77][78]
Profiling tools are essential for guiding and verifying optimizations by identifying performance bottlenecks. GNU gprof, a standard profiler integrated with GCC, instruments code during compilation (using the -pg flag) to generate execution profiles showing function call graphs, time spent in routines, and invocation counts, enabling targeted optimizations like hot-spot identification.[79] Other tools, such as LLVM's built-in profilers or hardware-based options like perf, complement this by providing sampling-based insights into runtime behavior without full instrumentation overhead.
Evaluation of optimizing compilers relies on metrics that quantify improvements in execution efficiency. The primary dynamic metric is runtime speedup, measured as the ratio of execution time without optimizations to execution time with them, with gains of 10-50% common on compute-intensive workloads depending on the optimization level applied. Code size reduction is another key metric: -Os builds typically shrink binaries by 10-20% through techniques such as inlining limits and alignment adjustments, which is critical for embedded systems. Energy consumption, increasingly important for mobile and data-center applications, is assessed with power-profiling tools that track joules per execution; optimizations can reduce energy use by up to 30% on some benchmarks by minimizing executed instructions and cache misses, though aggressive settings such as -O3 can sometimes increase it by keeping the processor at higher clock frequencies and utilization.[24][80]
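As a purely illustrative calculation of the speedup metric, using hypothetical timings rather than measured results:
    speedup = T(unoptimized) / T(optimized) = 12.0 s / 8.0 s = 1.5
so the optimized build is reported as 1.5 times faster, equivalent to a 33% reduction in execution time.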
Standard evaluation practices involve A/B testing, where programs are compiled with and without specific optimizations (e.g., -O0 vs. -O3) and compared on representative workloads to isolate impact. Static metrics, such as instruction count or basic block size in intermediate representations, provide pre-execution insights into optimization potential, while source-level measures like cyclomatic complexity help assess control flow simplification before compilation. Benchmark suites ensure reproducible assessments; the SPEC CPU 2017 suite, comprising integer (SPECint) and floating-point (SPECfp) workloads from real applications, is widely used to evaluate compiler performance across systems, reporting geometric means of speedups normalized to a reference machine. Other suites like PolyBench focus on kernel-level optimizations for research. These tools and metrics bridge theoretical optimizations with practical deployment, allowing developers to balance speed, size, and resource use.[81]
