Loop fission and fusion
Loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body.[1][2] The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most effective on multi-core processors, which can run the resulting smaller loops as separate tasks on different cores.
Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one.[3][2] Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop. One of the main benefits of loop fusion is that it allows temporary allocations to be avoided, which can lead to huge performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays (however, Julia's loop fusion is not technically a compiler optimization, but a syntactic guarantee of the language).[4]
Other benefits of loop fusion are that it avoids the overhead of the loop control structures, and also that it allows the loop body to be parallelized by the processor[5] by taking advantage of instruction-level parallelism. This is possible when there are no data dependencies between the bodies of the two loops (this is in stark contrast to the other main benefit of loop fusion described above, which only presents itself when there are data dependencies that require an intermediate allocation to store the results). If loop fusion is able to remove redundant allocations, performance increases can be large.[4] Otherwise, there is a more complex trade-off between data locality, instruction-level parallelism, and loop overhead (branching, incrementing, etc.) that may make loop fusion, loop fission, or neither, the preferable optimization.
Fission
Example in C
int a[100];
int b[100];
for (int i = 0; i < 100; i++) {
a[i] = 1;
b[i] = 2;
}
is equivalent to:
int a[100];
int b[100];
for (int i = 0; i < 100; i++) {
a[i] = 1;
}
for (int i = 0; i < 100; i++) {
b[i] = 2;
}
Fusion
Example in C++ and MATLAB
Consider the following MATLAB code:
x = 0:999; % Create an array of numbers from 0 to 999 (range is inclusive)
y = sin(x) + 4; % Take the sine of x (element-wise) and add 4 to each element
The same syntax can be achieved in C++ by using function and operator overloading:
import <cassert>;
import std;
using std::unique_ptr;
class FloatArray {
private:
size_t length;
std::unique_ptr<float[]> data;
// Internal constructor that produces an uninitialized array
explicit FloatArray(size_t n):
length{n}, data{new float[n]} { }
public:
// Factory method to produce an array over an integer range (the upper
// bound is exclusive, unlike MATLAB's ranges).
[[nodiscard]]
static FloatArray range(size_t start, size_t end) {
assert(end > start);
size_t length = end - start;
FloatArray a(length);
for (size_t i = 0; i < length; ++i) {
a[i] = start + i;
}
return a;
}
// Basic array operations
[[nodiscard]]
size_t size() const noexcept {
return length;
}
float& operator[](size_t i) noexcept {
return data[i];
}
const float& operator[](size_t i) const noexcept {
return data[i];
}
// Declare an overloaded addition operator as a free friend function (this
// syntax defines operator+ as a free function that is a friend of this
// class, despite it appearing as a member function declaration).
friend FloatArray operator+(const FloatArray& a, float b) {
FloatArray c(a.size());
for (size_t i = 0; i < a.size(); ++i) {
c[i] = a[i] + b;
}
return c;
}
// Similarly, we can define an overload for the sin() function. In practice,
// it would be unwieldy to define all possible overloaded math operations as
// friends inside the class like this, but this is just an example.
friend FloatArray sin(const FloatArray& a) {
FloatArray b(a.size());
for (size_t i = 0; i < a.size(); ++i) {
b[i] = std::sin(a[i]);
}
return b;
}
};
int main(int argc, char* argv[]) {
// Here, we perform the same computation as the MATLAB example
FloatArray x = FloatArray::range(0, 1000);
FloatArray y = sin(x) + 4;
// Print the result out - just to make sure the optimizer doesn't remove
// everything (if it's smart enough to do so).
std::println("The result is: ");
// FloatArray does not define begin()/end(), so iterate by index.
for (size_t i = 0; i < y.size(); ++i) {
std::println("{}", y[i]);
}
return 0;
}
However, the above example unnecessarily allocates a temporary array for the result of sin(x). A more efficient implementation would allocate a single array for y, and compute y in a single loop. To optimize this, a C++ compiler would need to:
- Inline the sin and operator+ function calls.
- Fuse the loops into a single loop.
- Remove the unused stores into the temporary arrays (can use a register or stack variable instead).
- Remove the unused allocation and free.
All of these steps are individually possible. Even step four is possible despite the fact that functions like malloc and free have global side effects, since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code.[6] However, as of clang 12.0.0 and gcc 11.1, this loop fusion and redundant allocation removal does not occur - even on the highest optimization level.[7][8]
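For comparison, the fused form that such an optimization would produce can be written by hand. The following is only a sketch reusing the FloatArray class above (it is not compiler output): the range factory supplies a correctly sized array, and each element is then overwritten in place, so no temporary array for the result of sin(x) is allocated.
// Hand-fused sketch: one loop, no temporary array for sin(x).
FloatArray y = FloatArray::range(0, 1000);
for (size_t i = 0; i < y.size(); ++i) {
    y[i] = std::sin(y[i]) + 4.0f;
}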
Some languages specifically targeted towards numerical computing, such as Julia, may have the concept of loop fusion built in at a high level, where the compiler notices adjacent elementwise operations and fuses them into a single loop.[9] Currently, to achieve the same syntax in general-purpose languages like C++, the sin and operator+ functions must pessimistically allocate arrays to store their results, since they do not know what context they will be called from. This issue can be avoided in C++ by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations (e.g., using functions and overloads for in-place operations, such as operator+= or std::transform).
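As a minimal sketch of the std::transform approach mentioned above (the function name sin_plus_4 and the use of std::vector are illustrative and not part of the earlier example), the result is written directly into its destination, so no intermediate array is needed:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<float> sin_plus_4(std::size_t n) {
    std::vector<float> x(n);
    std::iota(x.begin(), x.end(), 0.0f);   // x = 0, 1, ..., n-1
    std::vector<float> y(n);
    // One pass over the data: sin and the addition are applied together,
    // so no temporary array for sin(x) is allocated.
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::sin(v) + 4.0f; });
    return y;
}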
References
- ^ Y.N. Srikant; Priti Shankar (3 October 2018). The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press. ISBN 978-1-4200-4383-9.
- ^ a b Kennedy, Ken & Allen, Randy. (2001). Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann. ISBN 1-55860-286-0.
- ^ Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. ISBN 978-1-55860-320-2.
- ^ a b Johnson, Steven G. (21 January 2017). "More Dots: Syntactic Loop Fusion in Julia". julialang.org. Retrieved 25 June 2021.
- ^ "Loop Fusion". Intel. Retrieved 2021-06-25.
- ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
- ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
- ^ Godbolt, Matt. "Compiler Explorer - C++ (x86-64 gcc 11.1)". godbolt.org. Retrieved 2021-06-25.
- ^ "Functions · The Julia Language". docs.julialang.org. Retrieved 2021-06-25.
Loop fission and fusion
Loop fission (or loop distribution) is a compiler optimization that splits a single loop such as for (int i = 0; i < n; ++i) { a[i] = e1; b[i] = e2; } into two loops: for (int i = 0; i < n; ++i) { a[i] = e1; } followed by for (int i = 0; i < n; ++i) { b[i] = e2; }.[2][4] This technique is applicable when there are no loop-carried data dependencies that would violate independence, ensuring legality through dependence analysis.[4] Benefits include reduced register pressure in inner loops, better opportunities for vectorization or software pipelining, and simplified parallelization by isolating conditional or expensive computations into distinct loops.[3][4] In practice, fission can improve cache performance by minimizing working set size per loop and has been shown to yield performance gains, such as an 11% speedup in large-scale loop transformations through enhanced SIMD utilization.[4]
Loop fusion, sometimes called loop jamming, consolidates sequential loops with compatible iteration spaces to eliminate redundant control structures and promote data reuse. An example merges two loops like for (int i = 0; i < 100; ++i) { b[i] = f(a[i]); } and for (int i = 0; i < 100; ++i) { c[i] = g(b[i]); } into for (int i = 0; i < 100; ++i) { b[i] = f(a[i]); c[i] = g(b[i]); }, potentially simplifying further if intermediate variables like b are eliminable.[1][2] Fusion requires that the loops be adjacent, share the same iteration range, and lack interfering dependencies between their bodies, often verified via data dependence graphs.[1] It enhances instruction-level parallelism, reduces loop overhead, and improves spatial locality by accessing related data in a single pass, which is particularly valuable in domains like machine learning compilers where memory bandwidth is a bottleneck.[1][3]
These transformations are inverse operations and are often applied in tandem within optimizing compilers to balance trade-offs, such as fission followed by selective fusion to align loops for hardware-specific features like caching or GPU execution.[3] They form part of broader loop optimization strategies, including unrolling, interchange, and invariant code motion, contributing to faster runtime, lower memory usage, and better scalability in numerical and array-intensive applications.[1][4] Modern compilers like those in LLVM or Sun Studio implement these automatically, but manual application by programmers can yield further gains in performance-critical code.[3]
Overview
Definitions and Basic Concepts
Loop nests refer to structures in programming where one or more loops are embedded within another loop, with the innermost loops executing completely for each iteration of the enclosing loops.[5] The iteration space of a loop or loop nest is defined as the set of all possible iteration points, often represented as a lattice of integer tuples corresponding to the indices of the loops.[6] Data dependencies in loops arise from relationships between statements where one statement writes to a memory location that is later read by another, constraining possible reorderings or transformations to preserve program semantics; these can be loop-independent (within the same iteration) or loop-carried (across different iterations).[7]
Loop fission, also known as loop distribution, is a compiler optimization technique that splits the body of a single loop into multiple separate loops operating over the same index range, thereby isolating independent computations to enhance opportunities for further optimizations or improve data locality.[4] In contrast, loop fusion combines the bodies of two or more adjacent loops that access similar data structures into a single loop, aiming to minimize loop overhead and promote better reuse of data in caches.[8] These transformations emerged in the 1970s and 1980s as part of early research in parallel computing and compiler optimizations, with foundational work on loop transformations, including fusion (referred to as jamming), documented in seminal papers such as Allen and Cocke's catalog of optimizing transformations.[9] Understanding loop nests, iteration spaces, and data dependencies forms the prerequisite for applying fission and fusion effectively within compiler frameworks.
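As a small illustration of the distinction (the code and names here are illustrative only, and the arrays are assumed to have equal length), the first loop below has a loop-carried flow dependence, while the second has only a loop-independent dependence:
#include <vector>

void dependence_examples(std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c) {
    // Loop-carried flow dependence: iteration i reads a[i - 1], which the
    // previous iteration wrote, so iterations cannot be freely reordered or
    // split across this statement.
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
    // Loop-independent dependence only: t is written and read within the same
    // iteration, so every iteration is independent of the others.
    for (std::size_t i = 0; i < c.size(); ++i) {
        int t = b[i] * 2;
        c[i] = t + 1;
    }
}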
Role in Compiler Optimizations
Loop fission and fusion serve as foundational transformations within the loop optimization phase of compiler pipelines, positioned after initial analyses such as data dependence testing and scalar evolution, and prior to lower-level optimizations like vectorization, parallelization, and code generation. In this placement, they restructure loop nests to expose opportunities for subsequent passes, ensuring that the code is in a form amenable to exploiting hardware features such as SIMD instructions or multi-core execution.[10] This sequencing is evident in frameworks like LLVM's middle-end, where loop canonicalization precedes these transformations to normalize loop structures.[11]
These transformations synergize with other compiler optimizations by altering loop granularity and data access patterns, thereby facilitating techniques such as dead code elimination, strength reduction, and improved instruction scheduling. For instance, loop fission can isolate computationally intensive sections, reducing register pressure and enabling more aggressive scheduling of independent operations, while fusion promotes data reuse that aids in eliminating redundant computations. In pipelines like GCC's, such restructuring complements invariant code motion and induction variable simplification, enhancing overall instruction-level parallelism without altering program semantics.[12]
Performance impacts from fission and fusion generally include reductions in memory bandwidth demands and overhead from loop control, leading to measurable gains in execution time, often 10-20% in locality-sensitive workloads, through better cache utilization and parallelism exposure.[13] Their adoption traces back to the 1990s in optimizing compilers for high-performance computing, with integration into production systems like GCC around 2009 and LLVM in the 2020s.[9]
Loop Fission
Principles and Algorithms
Loop fission, also known as loop distribution, operates on the principle of decomposing the body of a single loop into multiple independent loops that iterate over the same index range, each executing a subset of the original statements. This transformation separates independent operations to enable targeted optimizations on smaller, more focused loops, improving aspects such as data locality and hardware utilization while preserving program semantics. It is particularly effective for reducing the working set size in memory-intensive loops, allowing data to fit better in caches, and serves as the inverse of loop fusion in compiler optimization pipelines.[4][2]
Safe loop fission requires verifying independence through dependence analysis to ensure no data dependencies are violated by the split. Key conditions include the absence of loop-carried dependencies between statements being separated; for instance, statements must not have flow, anti-, or output dependencies that span the fission boundary. Dependence analysis techniques, such as the Greatest Common Divisor (GCD) test for array subscript dependencies or the Banerjee test for linear dependencies, are used to classify dependencies as loop-independent (safe to split) or loop-carried (prohibiting fission). In cases of scalar dependencies, scalar expansion, which replaces scalars with arrays indexed by the loop variable, may be applied to enable fission (see the sketch after the steps below). The polyhedral model provides a more advanced framework, representing loops as iteration domains and using affine schedules to partition statements without introducing cycles in the dependence graph.[4][1]
The algorithmic process for loop fission typically follows these structured steps:
- Dependence Analysis: Construct a dependence graph for statements in the loop body; identify groups of independent statements using tools like the polyhedral representation or classic tests to ensure no carried dependencies exist between groups.[4]
- Statement Grouping and Splitting: Partition the loop body into independent subsets; create separate loops for each group, copying the original loop headers (initialization, condition, increment) to maintain identical iteration spaces.[2]
- Index and Variable Adjustment: Ensure all references to loop indices and variables are consistent across new loops; apply transformations if needed, such as renaming temporaries to avoid conflicts.[1]
- Post-Fission Optimization: Apply further passes like vectorization or unrolling to each new loop, followed by dead code elimination if any statements become redundant.[4]
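The scalar-expansion case mentioned above can be sketched as follows (illustrative C++ only, not the output of any particular compiler; the arrays are assumed to have equal length): the scalar temporary that ties the two statements together is expanded into an array, after which the loop can be distributed.
#include <vector>

// Before: for (i) { t = a[i] + 1; b[i] = t * t; }  -- the scalar t links the
// two statements in every iteration, so they cannot be split as written.
// After scalar expansion, t becomes an array indexed by i, and fission yields
// two independent loops.
void fission_with_scalar_expansion(const std::vector<int>& a, std::vector<int>& b) {
    std::vector<int> t(a.size());            // expanded scalar
    for (std::size_t i = 0; i < a.size(); ++i) {
        t[i] = a[i] + 1;
    }
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[i] = t[i] * t[i];
    }
}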
Applicability Conditions and Benefits
Loop fission is applicable when the statements within a loop are independent, meaning there are no data dependencies (flow, anti-, or output) between them that are carried by the loop, ensuring the split does not alter execution order or results. The loops must share identical iteration spaces (same bounds, step size, and nesting depth) without requiring bound adjustments, though prologue/epilogue code can handle minor differences. Dependence analysis confirms legality, flagging cases like backward loop-carried dependencies as invalid. It is especially useful in serial or parallel code where splitting isolates expensive or conditional operations.[4][2]
The primary benefits of loop fission include reduced register pressure by limiting live variables in each smaller loop, enabling more aggressive optimizations like vectorization or software pipelining on focused computations. It improves cache locality by minimizing the working set size per loop, reducing misses in data-intensive applications, and facilitates parallelization by creating independent tasks. In GPU or SIMD contexts, fission isolates vectorizable statements, avoiding predicated execution in mixed loops. Quantitative studies show performance gains, such as an 11% speedup in benchmarks through enhanced SIMD utilization and up to 30% execution time reduction in nested numerical loops via better data reuse.[4][1]
Trade-offs exist, including higher loop initiation overhead and increased code size from duplicated control structures, which may degrade performance in small loops or resource-constrained environments like embedded systems. Fission can also complicate debugging due to altered code structure.[2]
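As an illustration of isolating conditional work (a hypothetical sketch, not taken from the cited sources), fission can separate a branch-free computation, which vectorizes cleanly, from a conditional fix-up loop:
#include <cmath>
#include <vector>

// Original: for (i) { s[i] = p[i] * q[i]; if (s[i] < 0.0f) s[i] = std::sqrt(-s[i]); }
// After fission, the first loop has no branches and is easy to vectorize,
// while the rarely taken conditional runs in its own loop.
void split_conditional(std::vector<float>& s, const std::vector<float>& p, const std::vector<float>& q) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        s[i] = p[i] * q[i];
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < 0.0f) {
            s[i] = std::sqrt(-s[i]);
        }
    }
}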
Implementation Examples
A practical implementation of loop fission in C involves splitting a loop with independent array assignments to improve locality and enable vectorization. For instance, in numerical computations, separating initialization from computation reduces cache pressure in the inner loop.[14]
Unfused loop:
for (int i = 0; i < n; ++i) {
a[i] = e1;
b[i] = e2;
}
After fission:
for (int i = 0; i < n; ++i) {
a[i] = e1;
}
for (int i = 0; i < n; ++i) {
b[i] = e2;
}
Here, e1 and e2 are independent expressions; using std::vector for dynamic arrays preserves the structure while allowing runtime sizing. In practice, the first loop may fit entirely in L1 cache, speeding up access.[2][14]
In Java, loop fission can optimize bit operations by separating computation from aggregation, as seen in bitmap libraries. An example splits XOR and bit counting:
Fused loop:
int count = 0;
for (int i = 0; i < left.length && i < right.length; i++) {
left[i] ^= right[i];
count += Long.bitCount(left[i]);
}
After fission:
for (int i = 0; i < left.length && i < right.length; i++) {
left[i] ^= right[i];
}
int count = 0;
for (long l : left) {
count += Long.bitCount(l);
}
Loop Fusion
Principles and Algorithms
Loop fusion operates on the principle of merging the bodies of adjacent loops that operate over compatible iteration spaces, thereby reducing the overhead associated with multiple loop controls and promoting better data reuse across computations that access shared data structures. This transformation is particularly effective in serial programs for minimizing memory traffic by allowing data to remain in faster storage levels like registers or caches longer, as subsequent loop iterations can reuse values produced by prior ones without intermediate stores and loads. The core goal is to expose opportunities for locality improvements while preserving the program's semantics, often as the inverse of loop fission in optimization pipelines.[17]
Compatibility checks form the foundation for safe loop fusion, ensuring that the transformation does not alter program behavior. Primarily, loops must exhibit no anti-dependencies, where a write in one loop could overwrite a value read in a subsequent loop, as fusion would reverse the order and introduce errors; dependence analysis, typically via GCD tests or more advanced methods like the Banerjee test, verifies that all dependencies are either loop-independent or flow/output dependencies that remain valid post-fusion. Additionally, the loops must share identical iteration spaces, meaning they have the same nesting depth, and their bounds must align such that the number of iterations matches exactly; loops with differing bounds require prior transformations like padding or affine partitioning to enable fusion. In the polyhedral model, compatibility extends to affine mappings of iteration domains, where scattering functions ensure no dependency cycles are introduced.[18][17][19]
The algorithmic process for loop fusion typically proceeds in structured steps to systematically identify and apply the transformation:
- Dependence Analysis: Examine pairs of adjacent loops to compute data dependencies using tools like the dependence graph or polyhedral representations; flag incompatibilities such as anti-dependencies or loop-carried dependencies that would be violated by merging.[17]
- Iteration Space Alignment: Verify and adjust loop headers (initialization, condition, increment) to ensure identical iteration vectors; this may involve affine transformations to map non-identical bounds onto a common space, such as shifting indices or applying unimodular matrices in polyhedral frameworks.[18][19]
- Body Merging and Index Update: Insert the body of the second loop into the first, removing the second loop's control structure, and substitute any index variables with equivalents from the first loop to maintain consistency.[17]
- Expression Simplification: Post-fusion, apply algebraic simplifications to the merged loop body, such as common subexpression elimination, to reduce computational redundancy.[18]
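A minimal sketch of the alignment and body-merging steps (illustrative names; f and g are stand-ins for arbitrary element-wise operations, and the arrays are assumed to have equal length):
#include <vector>

// Before fusion:
//   for (int j = 0; j < n; ++j) b[j] = f(a[j]);
//   for (int k = 0; k < n; ++k) c[k] = g(b[k]);
// After aligning the index variables (k renamed to j) and merging the bodies,
// one loop remains; the dependence through b[j] is now loop-independent, so
// the transformation preserves the results.
void fuse_aligned(const std::vector<int>& a, std::vector<int>& b, std::vector<int>& c) {
    for (std::size_t j = 0; j < a.size(); ++j) {
        b[j] = a[j] + 1;      // stands in for f(a[j])
        c[j] = b[j] * 2;      // stands in for g(b[j])
    }
}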
Applicability Conditions and Benefits
Loop fusion is applicable when the loops in question are adjacent in the code or can be reordered without violating dependencies, ensuring that their iteration spaces are compatible, typically requiring identical trip counts or the ability to handle differences through additional prologue or epilogue code.[20][21] Furthermore, the loops must exhibit shared data accesses that do not introduce conflicts, such as antidependences, while preserving the original data dependences to maintain program semantics.[8] Compatibility checks, often performed via dependence analysis, confirm that fusion will not alter the program's output.[19]
The primary benefits of loop fusion include a reduction in loop initiation and control overhead, as combining multiple loops eliminates redundant increments, decrements, and branch tests, thereby streamlining execution.[8] It also enhances cache locality by consolidating data accesses into a single pass, promoting better reuse of data in the memory hierarchy and minimizing cache misses.[21] Additionally, fusion can facilitate automatic parallelization by increasing the granularity of parallelizable work units, allowing compilers to better exploit parallelism in subsequent optimization passes.[8] In GPU compilers such as those for CUDA, loop fusion, often termed kernel fusion, further reduces memory bandwidth demands by merging operations that share data, which is particularly advantageous for memory-bound applications.[22] Quantitative evaluations in the polyhedral model demonstrate that loop fusion can yield significant reductions in execution time for nested loops in numerical codes, as observed across benchmarks like PolyBench, where fusion enables improved data reuse without sacrificing parallelism.[19][23] As of 2025, machine learning techniques are being integrated to guide loop fusion and other polyhedral optimizations, achieving notable speedups on benchmarks like PolyBench.[24]
Despite these advantages, loop fusion introduces trade-offs, such as potential increases in register pressure due to the larger scope of live variables within the fused loop, which may lead to spills and degrade performance if register resources are limited.[21] It can also heighten code complexity, complicating further optimizations or debugging, particularly in GPU environments where fused kernels amplify instruction-level pressures.[25]
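Handling differing trip counts with epilogue code, as mentioned above, can be sketched as follows (an illustrative example, not from the cited sources): the common iterations are fused, and the remaining iterations of the longer loop run in a separate epilogue.
#include <algorithm>
#include <vector>

// Fuses the first min(a.size(), b.size()) iterations of two independent loops
// and finishes the longer array (assumed here to be b) in an epilogue loop.
void fuse_with_epilogue(std::vector<int>& a, std::vector<int>& b) {
    const std::size_t common = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < common; ++i) {
        a[i] = a[i] + 1;
        b[i] = b[i] * 2;
    }
    for (std::size_t i = common; i < b.size(); ++i) {   // epilogue
        b[i] = b[i] * 2;
    }
}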
Implementation Examples
One practical implementation of loop fusion in C++ involves merging two adjacent loops that operate on the same index range to reduce overhead and improve data locality. Consider an example from FPGA optimization where separate loops update arrays a and b based on a local memory array; the fused version interleaves the operations within a single inner loop.[26]
Unfused loops:
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 300; j++) {
a[j] = localMem[j] + 3;
}
for (int k = 0; k < 300; k++) {
b[k] = localMem[k] + 4;
}
}
Fused loops:
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 300; j++) {
int localMemVal = localMem[j];
a[j] = localMemVal + 3;
b[j] = localMemVal + 4;
}
}
std::vector can replace raw arrays for dynamic sizing while preserving the fusion structure.[26]
In MATLAB, loop fusion is often applied during code generation to merge successive loops with identical iteration counts, enhancing performance in vectorized environments. A representative example computes the sum and product of an array's elements in separate loops, which the MATLAB Coder fuses into one for the generated C/C++ output.[27]
Unfused MATLAB code:
function [y_f_sum, y_f_prod] = SumAndProduct(Arr) %#codegen
y_f_sum = 0;
y_f_prod = 1;
for i = 1:length(Arr)
y_f_sum = y_f_sum + Arr(i);
end
for i = 1:length(Arr)
y_f_prod = y_f_prod * Arr(i);
end
end
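A hand-written C++ sketch of the shape the fused generated code takes (this is illustrative only, not the actual output of MATLAB Coder) computes both reductions in a single loop:
// Fused form: one loop accumulates both the sum and the product.
void sum_and_product(const double* arr, int n, double* y_f_sum, double* y_f_prod) {
    double s = 0.0;
    double p = 1.0;
    for (int i = 0; i < n; ++i) {
        s += arr[i];
        p *= arr[i];
    }
    *y_f_sum = s;
    *y_f_prod = p;
}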
In the C++ FPGA example above, the outer loop (over i) remains unchanged, while the inner loops' identical bounds (0 to 300) allow direct merging; body interleaving then combines the statements by introducing a temporary localMemVal to avoid recomputation, preserving semantics. Verification via execution traces, such as profiling memory accesses before and after, confirms reduced loads from localMem (from two per iteration to one) and equivalent outputs, as traces show identical array values post-fusion.[26]
Compiler support for loop fusion spans multiple languages and toolchains. In LLVM, the fusion pass operates within the loop transformation pipeline, greedily merging adjacent loops with matching trip counts after checks for dependencies and adjacency, typically enabled under optimization levels like -O2 or -O3 without a dedicated flag; this applies to C++ compilation via Clang. For MATLAB, the Coder performs automatic fusion during code generation for targets like C/C++, focusing on successive loops without user flags. In GCC, fusion is not explicitly flagged but occurs as part of broader loop optimizations under -O3, particularly for adjacent independent loops.[11][27][28]
