Loop fission and fusion
from Wikipedia

Loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range, each taking only a part of the original loop's body.[1][2] The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. The optimization is most beneficial on multi-core processors, since the resulting loops can be distributed as separate tasks to different processors.

Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one.[3][2] Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop. One of the main benefits of loop fusion is that it allows temporary allocations to be avoided, which can lead to huge performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays (however, Julia's loop fusion is not technically a compiler optimization, but a syntactic guarantee of the language).[4]

Other benefits of loop fusion are that it avoids the overhead of the loop control structures, and also that it allows the loop body to be parallelized by the processor[5] by taking advantage of instruction-level parallelism. This is possible when there are no data dependencies between the bodies of the two loops (this is in stark contrast to the other main benefit of loop fusion described above, which only presents itself when there are data dependencies that require an intermediate allocation to store the results). If loop fusion is able to remove redundant allocations, performance increases can be large.[4] Otherwise, there is a more complex trade-off between data locality, instruction-level parallelism, and loop overhead (branching, incrementing, etc.) that may make loop fusion, loop fission, or neither, the preferable optimization.

Fission

Example in C

int a[100];
int b[100];
for (int i = 0; i < 100; i++) {
    a[i] = 1;
    b[i] = 2;
}

is equivalent to:

int a[100];
int b[100];
for (int i = 0; i < 100; i++) {
    a[i] = 1;
}
for (int i = 0; i < 100; i++) {
    b[i] = 2;
}

Fusion

Example in C++ and MATLAB

Consider the following MATLAB code:

x = 0:999; % Create an array of numbers from 0 to 999 (range is inclusive)
y = sin(x) + 4; % Take the sine of x (element-wise) and add 4 to each element

The same syntax can be achieved in C++ by using function and operator overloading:

import <cassert>;

import std;

using std::unique_ptr;

class FloatArray {
private:
    size_t length;
    std::unique_ptr<float[]> data;
    
    // Internal constructor that produces an uninitialized array
    explicit FloatArray(size_t n): 
        length{n}, data{new float[n]} { }
public:
    // Factory method to produce an array over an integer range (the upper
    // bound is exclusive, unlike MATLAB's ranges).
    [[nodiscard]]
    static FloatArray range(size_t start, size_t end) {
        assert(end > start);
        size_t length = end - start;
        FloatArray a(length);
        for (size_t i = 0; i < length; ++i) {
            a[i] = start + i;
        }
        return a;
    }
    
    // Basic array operations
    [[nodiscard]]
    size_t size() const noexcept { 
        return length; 
    }

    float& operator[](size_t i) noexcept { 
        return data[i]; 
    }

    const float& operator[](size_t i) const noexcept { 
        return data[i]; 
    }

    // Declare an overloaded addition operator as a free friend function (this
    // syntax defines operator+ as a free function that is a friend of this
    // class, despite it appearing as a member function declaration).
    friend FloatArray operator+(const FloatArray& a, float b) {
        FloatArray c(a.size());
        for (size_t i = 0; i < a.size(); ++i) {
            c[i] = a[i] + b;
        }
        return c;
    }
    
    // Similarly, we can define an overload for the sin() function. In practice,
    // it would be unwieldy to define all possible overloaded math operations as
    // friends inside the class like this, but this is just an example.
    friend FloatArray sin(const FloatArray& a) {
        FloatArray b(a.size());
        for (size_t i = 0; i < a.size(); ++i) {
            b[i] = std::sin(a[i]);
        }
        return b;
    }
};

int main(int argc, char* argv[]) {
    // Here, we perform the same computation as the MATLAB example
    FloatArray x = FloatArray::range(0, 1000);
    FloatArray y = sin(x) + 4;
    
    // Print the result out - just to make sure the optimizer doesn't remove
    // everything (if it's smart enough to do so).
    std::println("The result is: ");
    for (size_t i = 0; i < y.size(); ++i) {
        std::println("{}", y[i]);
    }
    
    return 0;
}

However, the above example unnecessarily allocates a temporary array for the result of sin(x). A more efficient implementation would allocate a single array for y, and compute y in a single loop. To optimize this, a C++ compiler would need to:

  1. Inline the sin and operator+ function calls.
  2. Fuse the loops into a single loop.
  3. Remove the unused stores into the temporary arrays (can use a register or stack variable instead).
  4. Remove the unused allocation and free.

All of these steps are individually possible. Even step four is possible despite the fact that functions like malloc and free have global side effects, since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code.[6] However, as of clang 12.0.0 and gcc 11.1, this loop fusion and redundant allocation removal does not occur - even on the highest optimization level.[7][8]
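
For illustration, the following is a rough sketch of the fused, allocation-free code that such a transformation (or a programmer doing it by hand) might produce for the example above; the helper function compute_y is hypothetical and simply reuses the range array as the result buffer.

// Hand-fused sketch (hypothetical): the sine and the addition happen in one
// loop, with a single allocation and no temporary array for sin(x).
FloatArray compute_y() {
    FloatArray y = FloatArray::range(0, 1000);
    for (size_t i = 0; i < y.size(); ++i) {
        y[i] = std::sin(y[i]) + 4.0f;
    }
    return y;
}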

Some languages specifically targeted towards numerical computing such as Julia might have the concept of loop fusion built into it at a high level, where the compiler will notice adjacent elementwise operations and fuse them into a single loop.[9] Currently, to achieve the same syntax in general purpose languages like C++, the sin and operator+ functions must pessimistically allocate arrays to store their results, since they do not know what context they will be called from. This issue can be avoided in C++ by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations (e.g., using functions and overloads for in-place operations, such as operator+= or std::transform).
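
As a minimal sketch of that alternative style (using std::vector and std::transform rather than the FloatArray class above, purely for illustration), the result can be produced with a single allocation and a single pass:

#include <algorithm>
#include <cmath>
#include <vector>

// Computes y[i] = sin(x[i]) + 4 in one pass; the only allocation is y itself.
std::vector<float> sin_plus_four(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::sin(v) + 4.0f; });
    return y;
}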

References

from Grokipedia
Loop fission and fusion are complementary compiler optimization techniques used to transform loops for improved execution efficiency. Loop fission involves splitting a single loop into multiple independent loops that iterate over the same index range but execute subsets of the original statements, while loop fusion merges two or more adjacent loops into a single loop to combine their iterations.

Loop fission, also known as loop distribution, decomposes a loop to separate independent operations, enabling targeted optimizations on each resulting loop. For instance, a loop that computes multiple assignments can be split into separate loops for each assignment, such as transforming for (int i = 0; i < n; ++i) { a[i] = e1; b[i] = e2; } into for (int i = 0; i < n; ++i) { a[i] = e1; } followed by for (int i = 0; i < n; ++i) { b[i] = e2; }. The technique is applicable when there are no loop-carried data dependencies that would violate independence, ensuring legality through dependence analysis. Benefits include reduced register pressure in inner loops, better opportunities for vectorization or software pipelining, and simplified parallelization by isolating conditional or expensive computations into distinct loops. In practice, fission can improve cache performance by minimizing the working-set size per loop and has been shown to yield performance gains, such as an 11% speedup in large-scale loop transformations through enhanced SIMD utilization.

Loop fusion, sometimes called loop jamming, consolidates sequential loops with compatible iteration spaces to eliminate redundant control structures and promote data reuse. An example merges two loops like for (let i = 0; i < 100; ++i) { b[i] = f(a[i]); } and for (let i = 0; i < 100; ++i) { c[i] = g(b[i]); } into for (let i = 0; i < 100; ++i) { b[i] = f(a[i]); c[i] = g(b[i]); }, with further simplification possible if intermediate variables like b can be eliminated. Fusion requires that the loops be adjacent, share the same iteration range, and lack interfering dependencies between their bodies, often verified via data dependence graphs. It reduces loop overhead and improves spatial locality by accessing related data in a single pass, which is particularly valuable in memory-bound domains.

These transformations are inverse operations and are often applied in tandem within optimizing compilers to balance trade-offs, such as fission followed by selective fusion to align loops for hardware-specific features like caching or GPU execution. They form part of broader loop optimization strategies, including unrolling, interchange, and invariant code motion, contributing to faster runtime and lower resource usage in numerical and array-intensive applications. Modern compilers such as GCC and LLVM implement these transformations automatically, but manual application by programmers can yield further gains in performance-critical code.

Overview

Definitions and Basic Concepts

Loop nests refer to structures in programming where one or more loops are embedded within another loop, with the innermost loops executing completely for each iteration of the enclosing loops. The iteration space of a loop or loop nest is defined as the set of all possible iteration points, often represented as a lattice of integer tuples corresponding to the indices of the loops. Data dependencies in loops arise from relationships between statements where one statement writes to a location that is later read by another, constraining possible reorderings or transformations to preserve program semantics; these can be loop-independent (within the same iteration) or loop-carried (across different iterations).

Loop fission, also known as loop distribution, is a compiler optimization technique that splits the body of a single loop into multiple separate loops operating over the same index range, thereby isolating independent computations to enhance opportunities for further optimizations or improve data locality. In contrast, loop fusion combines the bodies of two or more adjacent loops that access similar data structures into a single loop, aiming to minimize loop overhead and promote better reuse of data in caches. These transformations emerged from early research on compiler loop optimizations, with foundational work on loop transformations, including fusion (referred to as jamming), documented in seminal papers such as Allen and Cocke's catalog of optimizing transformations. Understanding loop nests, iteration spaces, and data dependencies is the prerequisite for applying fission and fusion effectively within compiler optimization frameworks.
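
As a small, self-contained illustration of these dependence categories (hypothetical C++ fragments, not taken from any particular compiler):

int n = 100;
float a[100], b[100], c[100];

// Loop-independent dependence: within a single iteration, the read of a[i]
// in the second statement depends on the write of a[i] in the first.
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + 1.0f;
    c[i] = a[i] * 2.0f;
}

// Loop-carried dependence: iteration i reads a[i - 1], which the previous
// iteration wrote, so the iterations cannot be freely reordered.
for (int i = 1; i < n; ++i) {
    a[i] = a[i - 1] + b[i];
}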

Role in Compiler Optimizations

Loop fission and fusion serve as foundational transformations within the middle-end optimization phase of compiler pipelines, positioned after initial analyses such as data dependence testing and scalar evolution, and prior to lower-level optimizations like vectorization, parallelization, and code generation. In this placement, they restructure loop nests to expose opportunities for subsequent passes, ensuring that the code is in a form amenable to exploiting hardware features such as SIMD instructions or multi-core execution. This sequencing is evident in frameworks like LLVM's middle-end, where loop canonicalization precedes these transformations to normalize loop structures.

These transformations synergize with other optimizations by altering loop granularity and access patterns, thereby enabling downstream techniques such as vectorization and software pipelining and improving data locality. For instance, loop fission can isolate computationally intensive sections, reducing register pressure and enabling more aggressive scheduling of independent operations, while fusion promotes data reuse that aids in eliminating redundant computations. In pipelines like GCC's, such restructuring complements invariant code motion and induction variable simplification, enhancing overall performance without altering program semantics.

Performance impacts from fission and fusion generally include reductions in memory demands and loop-control overhead, leading to measurable gains in execution time—often 10-20% in locality-sensitive workloads—through better cache utilization and parallelism exposure. Their adoption traces back to early optimizing compilers for vector and parallel machines, with integration into production systems like GCC around 2009 and LLVM in the 2020s.

Loop Fission

Principles and Algorithms

Loop fission, also known as loop distribution, operates on the principle of decomposing the body of a single loop into multiple independent loops that iterate over the same index range, each executing a subset of the original statements. This transformation separates independent operations to enable targeted optimizations on smaller, more focused loops, improving aspects such as data locality and hardware utilization while preserving program semantics. It is particularly effective for reducing the working-set size in memory-intensive loops, allowing data to fit better in caches, and serves as the inverse of loop fusion in optimization pipelines.

Safe loop fission requires verifying independence through dependence analysis to ensure no data dependencies are violated by the split. Key conditions include the absence of loop-carried dependencies between statements being separated; for instance, statements must not have flow, anti-, or output dependencies that span across the fission boundary. Dependence analysis techniques, such as the greatest common divisor (GCD) test for array subscript dependencies or the Banerjee test for linear dependencies, are used to classify dependencies as loop-independent (safe to split) or loop-carried (prohibiting fission). In cases of scalar dependencies, scalar expansion—replacing scalars with arrays indexed by the loop variable—may be applied to enable fission. The polyhedral model provides a more advanced framework, representing loops as iteration domains and using affine schedules to partition statements without introducing cycles in the dependence graph. The algorithmic process for loop fission typically follows these structured steps:
  1. Dependence Analysis: Construct a dependence graph for statements in the loop body; identify groups of independent statements using tools like the polyhedral representation or dependence tests to ensure no carried dependencies exist between groups.
  2. Statement Grouping and Splitting: Partition the loop body into independent subsets; create separate loops for each group, copying the original loop headers (initialization, condition, increment) to maintain identical iteration spaces.
  3. Index and Variable Adjustment: Ensure all references to loop indices and variables are consistent across new loops; apply transformations if needed, such as renaming temporaries to avoid conflicts.
  4. Post-Fission Optimization: Apply further passes like vectorization or unrolling to each new loop, followed by dead-code elimination if any statements become redundant.
For basic cases, fission is valid when statements S1 and S2 in a loop have no dependence relation S1 → S2 or S2 → S1 carried across iterations. In polyhedral terms, the domains must be partitionable such that scattering functions map subsets without overlap in dependencies. Challenges in loop fission include increased loop overhead from additional control structures and potential code growth, especially in deeply nested loops. Non-consecutive statements may require prior reordering, and handling irregular or imperfect loops (e.g., non-constant bounds) demands advanced techniques like affine partitioning, which can raise compilation time but enable broader applicability in modern compilers.
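
A brief sketch of the scalar-expansion case mentioned above (illustrative C++; the array t_arr is introduced only to replace the scalar t so that the two statements can be distributed):

const int n = 100;
float a[n], b[n], c[n];
float t;
float t_arr[n];  // expanded version of the scalar t

// Before: the scalar t couples the two statements, blocking fission.
for (int i = 0; i < n; ++i) {
    t = a[i] + b[i];
    c[i] = t * t;
}

// After scalar expansion: each iteration gets its own t_arr[i], so the loop
// can be split into two independent loops.
for (int i = 0; i < n; ++i) {
    t_arr[i] = a[i] + b[i];
}
for (int i = 0; i < n; ++i) {
    c[i] = t_arr[i] * t_arr[i];
}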

Applicability Conditions and Benefits

Loop fission is applicable when the statements within a loop are independent, meaning there are no dependencies (flow, anti-, or output) between them that are carried by the loop, ensuring the split does not alter execution order or results. The resulting loops must share identical iteration spaces—same bounds, step size, and nesting depth—without requiring bound adjustments, though prologue/epilogue code can handle minor differences. Dependence analysis confirms legality, flagging cases like backward loop-carried dependencies as invalid. It is especially useful in serial or parallel code where splitting isolates expensive or conditional operations.

The primary benefits of loop fission include reduced register pressure by limiting live variables in each smaller loop, enabling more aggressive optimizations like vectorization or software pipelining on focused computations. It improves cache locality by minimizing the working-set size per loop, reducing misses in data-intensive applications, and facilitates parallelization by creating independent tasks. In GPU or SIMD contexts, fission isolates vectorizable statements, avoiding predicated execution in mixed loops. Quantitative studies show performance gains, such as an 11% speedup in benchmarks through enhanced SIMD utilization and up to 30% execution time reduction in nested numerical loops via better data locality. Trade-offs exist, including higher loop initiation overhead and increased code size from duplicated control structures, which may degrade performance in small loops or resource-constrained environments like embedded systems. Fission can also complicate debugging due to altered code structure.
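
To make the legality check concrete, the following hypothetical C++ fragment shows a backward loop-carried dependence of the kind flagged as invalid above: the first statement reads a value that the second statement produced in the previous iteration, so running all copies of the first statement before any copy of the second would change the result.

const int n = 100;
float a[n], b[n], c[n];

// Fission of this loop is illegal: S1 reads a[i - 1], written by S2 in the
// previous iteration. After splitting, every S1 would instead read the old
// contents of a, so the outputs would differ.
for (int i = 1; i < n; ++i) {
    b[i] = a[i - 1] + 1.0f;  // S1
    a[i] = c[i] * 2.0f;      // S2
}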

Implementation Examples

A practical implementation of loop fission in C involves splitting a loop with independent array assignments to improve locality and enable vectorization. For instance, in numerical computations, separating initialization from computation reduces cache pressure in the inner loop. Unfused loop:

for (int i = 0; i < n; ++i) {
    a[i] = e1;
    b[i] = e2;
}

Fissioned loops:

for (int i = 0; i < n; ++i) {
    a[i] = e1;
}
for (int i = 0; i < n; ++i) {
    b[i] = e2;
}

This assumes e1 and e2 are independent expressions; using std::vector for dynamic arrays preserves the structure while allowing runtime sizing. In practice, the first loop may fit entirely in L1 cache, speeding up access.

In Java, loop fission can optimize bit operations by separating computation from aggregation, as seen in bitmap libraries. An example splits XOR and bit counting. Fused loop:

for (int i = 0; i < left.length && i < right.length; i++) {
    left[i] ^= right[i];
    count += Long.bitCount(left[i]);
}

Fissioned loops:

for (int i = 0; i < left.length && i < right.length; i++) {
    left[i] ^= right[i];
}
int count = 0;
for (long l : left) {
    count += Long.bitCount(l);
}

This enables autovectorization of the XOR loop (e.g., with 256-bit SIMD), yielding performance improvements such as 146 ns vs. 230 ns for arrays of size 256. To illustrate the process in the C example: dependence analysis confirms there are no dependencies between the assignments; splitting duplicates the loop header; and verification via profiling shows reduced cache misses (from shared to separate accesses) with identical outputs.

Compiler support for loop fission is integrated in major toolchains. In LLVM, the loop distribution pass (part of the Polly polyhedral optimizer or standalone) splits independent statements under -O2/-O3, using dependence graphs for analysis. GCC applies fission within its loop optimization framework under -O3, particularly for vectorization-enabling splits. These transformations are typically applied automatically to adjacent independent statements at higher optimization levels.

Loop Fusion

Principles and Algorithms

Loop fusion operates on the principle of merging the bodies of adjacent loops that operate over compatible iteration spaces, thereby reducing the overhead associated with multiple loop controls and promoting better reuse across computations that access shared data structures. This transformation is particularly effective in serial programs for minimizing memory traffic by allowing data to remain longer in faster storage levels such as registers or caches, as subsequent loop bodies can reuse values produced by prior ones without intermediate stores and loads. The core goal is to expose opportunities for locality improvements while preserving the program's semantics, often as the inverse of loop fission in optimization pipelines.

Compatibility checks form the foundation for safe loop fusion, ensuring that the transformation does not alter program behavior. Primarily, loops must exhibit no anti-dependencies, where a write in one loop could overwrite a value read in a subsequent loop, as fusion would reverse the order and introduce errors; dependence analysis, typically via GCD tests or more advanced methods like the Banerjee test, verifies that all dependencies are either loop-independent or flow/output dependencies that remain valid post-fusion. Additionally, the loops must share identical iteration spaces, meaning they have the same nesting depth, and their bounds must align such that the number of iterations matches exactly—loops with differing bounds require prior transformations such as index shifting or affine partitioning to enable fusion. In the polyhedral model, compatibility extends to affine mappings of iteration domains, where scattering functions ensure no dependency cycles are introduced. The algorithmic process for loop fusion typically proceeds in structured steps to systematically identify and apply the transformation:
  1. Dependence Analysis: Examine pairs of adjacent loops to compute data dependencies using tools like the dependence graph or polyhedral representations; flag incompatibilities such as anti-dependencies or loop-carried dependencies that would be violated by merging.
  2. Iteration Space Alignment: Verify and adjust loop headers (initialization, condition, increment) to ensure identical iteration vectors; this may involve affine transformations to map non-identical bounds onto a common iteration space, such as shifting indices or applying unimodular matrices in polyhedral frameworks.
  3. Body Merging and Index Update: Insert the body of the second loop into the first, removing the second loop's control structure, and substitute any index variables with equivalents from the first loop to maintain consistency.
  4. Expression Simplification: Post-fusion, apply algebraic simplifications to the merged loop body, such as common subexpression elimination, to reduce computational redundancy.
For bound alignment in basic cases, consider two loops with lower bounds L1 and L2, upper bounds U1 and U2, and step sizes S1 and S2. Fusion is valid if the iteration counts are equal, i.e., (U1 − L1) / S1 = (U2 − L2) / S2, and the loops can be affinely mapped such that L1 + k·S1 = L2 + m·S2 for corresponding iterations k and m. In polyhedral terms, this requires the iteration domains to be schedulable under a common affine schedule without conflicts. Challenges in loop fusion arise particularly with non-consecutive loops, where intermediate loops must be fused first to bring targets adjacent, potentially requiring a greedy search over fusion sequences to avoid suboptimal groupings. Scalar expansions—introducing temporary variables to break anti-dependencies—add complexity by increasing register pressure, while integrating recent polyhedral model advancements allows handling irregular bounds and non-affine accesses more robustly than traditional methods, though at higher computational cost for dependence computation.
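
A small illustrative C++ sketch of the bound-alignment step: the two loops below have equal trip counts but shifted bounds, so the second loop's index is remapped (j = i + 1) before the bodies are merged; the shift of one is purely illustrative.

const int n = 100;
float a[n], b[n], c[n + 1];

// Before fusion: same trip count (n iterations each), different bounds.
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + 1.0f;
}
for (int j = 1; j <= n; ++j) {
    c[j] = a[j - 1] * 2.0f;
}

// After aligning j = i + 1 and fusing: one loop over the common index i; the
// read of a[i] now reuses the value written earlier in the same iteration.
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + 1.0f;
    c[i + 1] = a[i] * 2.0f;
}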

Applicability Conditions and Benefits

Loop fusion is applicable when the loops in question are adjacent in the code or can be reordered without violating dependencies, ensuring that their iteration spaces are compatible—typically requiring identical trip counts or the ability to handle differences through additional prologue or epilogue code. Furthermore, the loops must exhibit shared data accesses that do not introduce conflicts, such as antidependences, while preserving the original data dependences to maintain program semantics. Compatibility checks, often performed via dependence analysis, confirm that fusion will not alter the program's output.

The primary benefits of loop fusion include a reduction in loop initiation and control overhead, as combining multiple loops eliminates redundant increments, decrements, and branch tests, thereby streamlining execution. It also enhances cache locality by consolidating accesses into a single pass, promoting better reuse of data in the cache and minimizing cache misses. Additionally, fusion can facilitate parallelization by increasing the granularity of parallelizable work units, allowing compilers to better exploit parallelism in subsequent optimization passes. In GPU compilers, loop fusion—often termed kernel fusion—further reduces memory-bandwidth demands by merging operations that share data, which is particularly advantageous for memory-bound applications. Quantitative evaluations in the polyhedral model demonstrate that loop fusion can yield significant reductions in execution time for nested loops in numerical codes, as observed across benchmarks like PolyBench, where fusion enables improved reuse without sacrificing parallelism. As of 2025, machine learning techniques are being integrated to guide loop fusion and other polyhedral optimizations, achieving notable speedups on benchmarks like PolyBench.

Despite these advantages, loop fusion introduces trade-offs, such as potential increases in register pressure due to the larger scope of live variables within the fused loop, which may lead to spills and degrade performance if register resources are limited. It can also heighten code complexity, complicating further optimizations or debugging, particularly in GPU environments where fused kernels amplify instruction-level pressures.
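
The conflict condition can be sketched as follows (illustrative C++): in the original pair of loops every a[i] is updated before the second loop reads a[i + 1], whereas a naively fused loop would read a[i + 1] before it has been updated, so fusion is illegal here without further transformation.

const int n = 100;
float a[n + 1], b[n], c[n];

for (int i = 0; i < n; ++i) {
    a[i] = c[i] * 2.0f;        // first loop: updates a[0..n-1]
}
for (int i = 0; i < n; ++i) {
    b[i] = a[i + 1] + 1.0f;    // second loop: expects the updated a[i + 1]
}

// Invalid naive fusion: at iteration i, a[i + 1] has not been updated yet,
// so b[i] would be computed from the stale value.
// for (int i = 0; i < n; ++i) {
//     a[i] = c[i] * 2.0f;
//     b[i] = a[i + 1] + 1.0f;
// }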

Implementation Examples

One practical implementation of loop fusion in C++ involves merging two adjacent loops that operate on the same index range to reduce overhead and improve data locality. Consider an example from FPGA optimization where separate loops update arrays a and b based on a local memory array; the fused version interleaves the operations within a single inner loop. Unfused loops:

for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 300; j++) {
        a[j] = localMem[j] + 3;
    }
    for (int k = 0; k < 300; k++) {
        b[k] = localMem[k] + 4;
    }
}

Fused loop:

for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 300; j++) {
        int localMemVal = localMem[j];
        a[j] = localMemVal + 3;
        b[j] = localMemVal + 4;
    }
}

This transformation uses plain array accesses rather than Standard Template Library (STL) vector operations to keep the indexing simple; in practice, STL containers like std::vector can replace the raw arrays for dynamic sizing while preserving the fusion structure.

In MATLAB, loop fusion is often applied during code generation to merge successive loops with identical iteration counts, enhancing performance in vectorized environments. A representative example computes the sum and product of an array's elements in separate loops, which MATLAB Coder fuses into one for the generated C/C++ output. Unfused MATLAB code:

function [y_f_sum, y_f_prod] = SumAndProduct(Arr) %#codegen
    y_f_sum = 0;
    y_f_prod = 1;
    for i = 1:length(Arr)
        y_f_sum = y_f_sum + Arr(i);
    end
    for i = 1:length(Arr)
        y_f_prod = y_f_prod * Arr(i);
    end
end

The fused version in the generated code combines the accumulations, leveraging MATLAB's implicit parallelism for matrix operations without explicit indexing changes. This highlights how fusion reduces loop overhead while maintaining vectorization potential. To illustrate the transformation process, consider the C++ example above: bound adjustment ensures the outer loop (over i) remains unchanged, while the inner loops' identical bounds (0 to 300) allow direct merging; body interleaving then combines the statements by introducing a temporary localMemVal to avoid recomputation, preserving semantics. Verification via execution traces—such as profiling memory accesses before and after—confirms reduced loads from localMem (from two per iteration to one) and equivalent outputs, as traces show identical array values post-fusion.

Compiler support for loop fusion spans multiple languages and toolchains. In LLVM, the fusion pass operates within the loop transformation pipeline, greedily merging adjacent loops with matching trip counts after checks for dependencies and adjacency, typically applied at higher optimization levels such as -O2 or -O3 without a dedicated user-facing flag; this applies to C++ compilation via Clang. For MATLAB, the Coder performs automatic fusion during code generation for targets like C/C++, focusing on successive loops without user flags. In GCC, fusion is not explicitly flagged but occurs as part of broader loop optimizations under -O3, particularly for adjacent independent loops.

Advanced Topics and Comparisons

Differences and Interactions Between Fission and Fusion

Loop fission and loop fusion represent opposing yet complementary strategies in loop optimization, each targeting distinct bottlenecks. Fission decomposes a single loop into multiple loops, increasing the overall loop count to expose finer-grained parallelism and isolate computations with divergent access patterns, thereby reducing cache pollution or enabling targeted vectorization. In opposition, fusion consolidates multiple adjacent loops into one, decreasing the loop count to cut initialization overhead, enhance data reuse across iterations, and bolster temporal locality by keeping related data in cache longer. These contrasting effects on loop structure yield inverse influences on execution: fission mitigates overhead from oversized loops prone to register spilling, while fusion amplifies efficiency in scenarios where separate loops incur redundant memory traffic.

The techniques interact synergistically within optimization pipelines, where fission often preconditions loops for fusion by segregating incompatible statements, allowing subsequent merging of aligned portions to balance parallelism and locality. In polyhedral compilers such as Pluto, these operations integrate into iterative passes via affine scheduling, where initial fission resolves dependence conflicts, enabling fusion to optimize the remaining structure without semantic violations. This combined application appears in frameworks built on the Integer Set Library (ISL), facilitating automated sequences that adapt to program semantics.

Selection between fission and fusion hinges on dependence analysis, which profiles iteration-level data flows to guide transformation legality and efficacy. Programs with sparse dependencies—characterized by high independence between iterations—benefit from fission, as it unlocks parallel execution with minimal synchronization cost. Conversely, those exhibiting strong alignment in accesses and uniform dependence directions suit fusion, maximizing data reuse while adhering to sequential constraints enforced by tools like the Polyhedral Dependence Graph. An illustrative example in the Pluto compiler demonstrates their application: for the 1-D Jacobi stencil computation, fusion and tiling isolate parallelizable sections, yielding 4× to 7× single-core speedups over baseline code.
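
A small illustrative C++ sketch of that interaction: fission first peels the reduction into s out of the first loop, and fusion then merges the remaining elementwise statements so that a[i] is reused while still hot; the arrays and the reduction here are hypothetical.

const int n = 100;
float a[n], b[n], x[n];
float s = 0.0f;

// Original structure (shown as comments):
//   for (int i = 0; i < n; ++i) { a[i] = x[i] * 2.0f; s += a[i]; }
//   for (int i = 0; i < n; ++i) { b[i] = a[i] + 1.0f; }

// After fission of the first loop and fusion of the elementwise parts:
for (int i = 0; i < n; ++i) {
    a[i] = x[i] * 2.0f;   // elementwise statement from the first loop...
    b[i] = a[i] + 1.0f;   // ...fused with the second loop for data reuse
}
for (int i = 0; i < n; ++i) {
    s += a[i];            // reduction isolated into its own loop by fission
}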

Applications in Parallel Computing and Modern Compilers

In parallel computing, loop fission facilitates the distribution of sub-loops across multiple threads or processes, enabling independent execution and improved load balancing in shared-memory environments like OpenMP. For instance, in stencil-based domain-specific languages, fission splits dependent computations into separate parallel regions, allowing each to be assigned to distinct threads while preserving data dependencies. This approach is particularly effective in hybrid MPI-OpenMP setups. Conversely, loop fusion reduces synchronization overhead by merging adjacent loops into a single parallel construct, minimizing barriers and communication in OpenMP directives; in atmospheric modeling codes, fusion combines multiple stencil updates, cutting synchronization points and enhancing throughput in distributed-memory systems.

Modern compilers integrate loop fission and fusion to enhance vectorization and parallelism. In LLVM, the LoopVectorize pass is preceded by fusion transformations that merge compatible loops—those with identical trip counts and no adverse dependencies—to create larger nests amenable to SIMD instructions and thread-level parallelism, as demonstrated in SPEC benchmarks where fusion enabled vectorization in 14 loops (8 in SPEC CPU 2006 and 6 in SPEC CPU 2017). Fission complements this by distributing loops post-vectorization, targeting profitability through alias and dependence analysis. GCC's Graphite framework, leveraging the polyhedral model, supports both transformations for static control regions, enabling fusion and fission alongside tiling and interchange to optimize cache locality and parallelism in loop nests. These integrations form part of the optimization pipeline, with Graphite requiring ISL configuration for polyhedral analyses.

Emerging trends extend these techniques to heterogeneous and AI-driven environments. In CPU-GPU systems, compilers like Hercules apply fission to split reduction loops for producer-consumer patterns on GPUs, achieving up to 9.77× speedup in edge detection benchmarks by parallelizing partial reductions, while fusion merges memory-bound loops to cut global memory traffic in image processing tasks.

Performance benchmarks in high-performance computing underscore these gains. On the Fugaku supercomputer, enhanced fission in the Clang-based compiler reduced register pressure in NICAM atmospheric kernels, yielding ~20% speedup in physical and dynamic processes, contributing to exascale scalability beyond traditional LINPACK metrics. In polyhedral-optimized HPC codes, fusion and fission in Graphite-enabled GCC improved stencil computations by 15-25% in memory-bound scenarios, aligning with broader HPC trends where such transformations boost weak scaling in distributed simulations.
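
As a sketch of the synchronization argument in an OpenMP setting (illustrative C++; the loop bodies are assumed independent), the unfused version below ends each parallel loop with an implicit barrier, while the fused version needs only one:

void update_unfused(float* a, float* b, const float* x, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) { a[i] = x[i] * 2.0f; }   // implicit barrier here
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) { b[i] = x[i] + 1.0f; }   // and another here
}

void update_fused(float* a, float* b, const float* x, int n) {
    // One parallel loop, one barrier; x[i] is also loaded once per iteration.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = x[i] * 2.0f;
        b[i] = x[i] + 1.0f;
    }
}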
