Loop nest optimization
In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations to a loop nest for the purpose of improving locality, enabling parallelization, or reducing loop overhead. (Nested loops occur when one loop is inside of another loop.) One classical use is to reduce memory access latency, or the cache bandwidth required, by improving cache reuse in common linear algebra algorithms.
The technique used to produce this optimization is called loop tiling,[1] also known as loop blocking[2] or strip mine and interchange.
Overview
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure that data used in a loop stays in the cache until it is reused. Partitioning the loop iteration space correspondingly partitions a large array into smaller blocks, fitting the accessed array elements into the cache, enhancing cache reuse and reducing the amount of cache the loop requires.
An ordinary loop
for (i=0; i<N; ++i) {
...
}
can be blocked with a block size B by replacing it with
for (j=0; j<N; j+=B) {
for (i=j; i<min(N, j+B); ++i) {
....
}
}
where min() is a function returning the minimum of its arguments.
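The examples below call min() on integer values; C has no such built-in for int, so a simple macro (one possible definition, not part of the original code) can be assumed:
/* one possible definition of min() for the integer examples in this article */
#define min(a, b) (((a) < (b)) ? (a) : (b))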
Example: matrix-vector multiplication
The following is an example of matrix-vector multiplication. There is a 100×100 matrix a and two 100-element vectors b and c. The code does not partition the arrays into smaller blocks.
int i, j, a[100][100], b[100], c[100];
int n = 100;
for (i = 0; i < n; i++) {
c[i] = 0;
for (j = 0; j < n; j++) {
c[i] = c[i] + a[i][j] * b[j];
}
}
After loop tiling is applied using 2 × 2 blocks, the code looks like:
int i, j, x, y, a[100][100], b[100], c[100];
int n = 100;
for (i = 0; i < n; i += 2) {
c[i] = 0;
c[i + 1] = 0;
for (j = 0; j < n; j += 2) {
for (x = i; x < min(i + 2, n); x++) {
for (y = j; y < min(j + 2, n); y++) {
c[x] = c[x] + a[x][y] * b[y];
}
}
}
}
The original loop iteration space is n by n. The accessed chunk of array a is also n by n. When n is too large relative to the machine's cache size, the array elements accessed in one iteration of the outer loop (for example, i = 1, j = 1 to n) can exceed the cache, so data is evicted before it can be reused, causing cache misses.
Tiling size
It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests (loop interchange) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, cache-oblivious algorithms are designed to make efficient use of cache without explicit blocking.
Example: matrix multiplication
Many large mathematical operations on computers end up spending much of their time doing matrix multiplication. The operation is:
- C = A×B
where A, B, and C are N×N arrays. Subscripts, for the following description, are in form C[row][column].
The basic loop is:
int i, j, k;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
C[i][j] = 0;
for (k = 0; k < N; ++k)
C[i][j] += A[i][k] * B[k][j];
}
}
There are three problems to solve:
- Floating point additions take some number of cycles to complete. In order to keep an adder with multiple cycle latency busy, the code must update multiple accumulators in parallel.
- Machines can typically do just one memory operation per multiply-add, so values loaded must be reused at least twice.
- Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.
The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)
for (i = 0; i < N; i += 2)
{
for (j = 0; j < N; j += 2)
{
acc00 = acc01 = acc10 = acc11 = 0;
for (k = 0; k < N; k++)
{
acc00 += B[k][j + 0] * A[i + 0][k];
acc01 += B[k][j + 1] * A[i + 0][k];
acc10 += B[k][j + 0] * A[i + 1][k];
acc11 += B[k][j + 1] * A[i + 1][k];
}
C[i + 0][j + 0] = acc00;
C[i + 0][j + 1] = acc01;
C[i + 1][j + 0] = acc10;
C[i + 1][j + 1] = acc11;
}
}
This code has had both the i and j iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers (four accumulators, the two reused values of A, and one register for the value of B currently in use). A 3x3 block requires 13 (nine accumulators, three values of A, and one of B), which will not work on a machine with just 8 floating point registers in the ISA. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This register pressure is why vendors of RISC CPUs, who intended to build machines more parallel than the general purpose x86 and 68000 CPUs, adopted 32-entry floating-point register files.
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded, and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A fits in the cache with a stripe of B), but B is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N³/ib doubleword loads from main memory. In the code above, ib is 2.
The next step to reduce the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by the Streams benchmark. In the case of one particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is blocked over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.)
for (ii = 0; ii < N; ii += ib)
{
for (j = 0; j < N; j += 2)
{
for (i = ii; i < ii + ib; i += 2)
{
acc00 = acc01 = acc10 = acc11 = 0;
for (k = 0; k < N; k++)
{
acc00 += B[k][j + 0] * A[i + 0][k];
acc01 += B[k][j + 1] * A[i + 0][k];
acc10 += B[k][j + 0] * A[i + 1][k];
acc11 += B[k][j + 1] * A[i + 1][k];
}
C[i + 0][j + 0] = acc00;
C[i + 0][j + 1] = acc01;
C[i + 1][j + 0] = acc10;
C[i + 1][j + 1] = acc11;
}
}
}
With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: an ib×N slice of the A matrix is now kept in the cache. As long as that fits, this code will not be limited by the memory system.
So what size matrix fits? The example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.
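A quick check of the arithmetic (using 8-byte doubles, as elsewhere in this example) shows why:
- ib × N × 8 bytes = 20 × 100 × 8 = 16,000 bytes, essentially the entire 16KB cache, leaving no room for the B and C traffic.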
That trick is reducing the size of the stripe of the A matrix by blocking the k loop so that the stripe is of size ib × kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of 2N³/kb memory transfers. B is still transferred N/ib times, for N³/ib transfers. So long as
- 2*N/kb + N/ib < N/balance
the machine's memory system will keep up with the floating point unit and the code will run at maximum performance. The 16KB cache of the Pentium 4 is not quite big enough for that: if ib=24 and kb=64 are chosen, only 12KB of the cache is used, deliberately not filling it completely so that the C and B arrays have some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.
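The same back-of-the-envelope arithmetic for these choices:
- ib × kb × 8 bytes = 24 × 64 × 8 = 12,288 bytes = 12KB held for the A block, leaving about 4KB of the 16KB cache for B and C to flow through.
- 2/kb + 1/ib = 2/64 + 1/24 ≈ 0.073, slightly more than 1/16.5 ≈ 0.061, which is why the result is about 80% of peak rather than 100%.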
Here is the code with loop k blocked.
for (ii = 0; ii < N; ii += ib)
{
for (kk = 0; kk < N; kk += kb)
{
for (j=0; j < N; j += 2)
{
for (i = ii; i < ii + ib; i += 2)
{
if (kk == 0)
acc00 = acc01 = acc10 = acc11 = 0;
else
{
acc00 = C[i + 0][j + 0];
acc01 = C[i + 0][j + 1];
acc10 = C[i + 1][j + 0];
acc11 = C[i + 1][j + 1];
}
for (k = kk; k < kk + kb; k++)
{
acc00 += B[k][j + 0] * A[i + 0][k];
acc01 += B[k][j + 1] * A[i + 0][k];
acc10 += B[k][j + 0] * A[i + 1][k];
acc11 += B[k][j + 1] * A[i + 1][k];
}
C[i + 0][j + 0] = acc00;
C[i + 0][j + 1] = acc01;
C[i + 1][j + 0] = acc10;
C[i + 1][j + 1] = acc11;
}
}
}
}
The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably split the kk == 0 iteration off from the rest of the kk iterations, to remove the if statement from the i loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.
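As one concrete illustration, here is a sketch (not actual compiler output) of peeling the kk == 0 block out of the loop above so that the if statement disappears from the i loop; variable names and declarations follow the example, and edge cleanup for N not a multiple of 2, ib, or kb is still omitted:
for (ii = 0; ii < N; ii += ib)
{
    for (j = 0; j < N; j += 2)
    {
        for (i = ii; i < ii + ib; i += 2)
        {
            /* kk == 0: accumulators start from zero */
            acc00 = acc01 = acc10 = acc11 = 0;
            for (k = 0; k < kb; k++)
            {
                acc00 += B[k][j + 0] * A[i + 0][k];
                acc01 += B[k][j + 1] * A[i + 0][k];
                acc10 += B[k][j + 0] * A[i + 1][k];
                acc11 += B[k][j + 1] * A[i + 1][k];
            }
            C[i + 0][j + 0] = acc00;
            C[i + 0][j + 1] = acc01;
            C[i + 1][j + 0] = acc10;
            C[i + 1][j + 1] = acc11;
        }
    }
    for (kk = kb; kk < N; kk += kb)
    {
        for (j = 0; j < N; j += 2)
        {
            for (i = ii; i < ii + ib; i += 2)
            {
                /* kk > 0: resume from the partial sums already stored in C */
                acc00 = C[i + 0][j + 0];
                acc01 = C[i + 0][j + 1];
                acc10 = C[i + 1][j + 0];
                acc11 = C[i + 1][j + 1];
                for (k = kk; k < kk + kb; k++)
                {
                    acc00 += B[k][j + 0] * A[i + 0][k];
                    acc01 += B[k][j + 1] * A[i + 0][k];
                    acc10 += B[k][j + 0] * A[i + 1][k];
                    acc11 += B[k][j + 1] * A[i + 1][k];
                }
                C[i + 0][j + 0] = acc00;
                C[i + 0][j + 1] = acc01;
                C[i + 1][j + 0] = acc10;
                C[i + 1][j + 1] = acc11;
            }
        }
    }
}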
The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the level-1 cache. There is a choice:
- Adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2 cache.
- Block the loops again, this time for the level-2 cache size. With a total of three levels of blocking (for the register file, for the L1 cache, and the L2 cache), the code will minimize the required bandwidth at each level of the memory hierarchy. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time-consuming than any shortcomings in the hardware's ability to stream data from the L2 cache.
Rather than specifically tune for one particular cache size, as in the first example, a cache-oblivious algorithm is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for matrix multiplication are known.
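A minimal sketch of such a cache-oblivious matrix multiplication, written as a recursive divide-and-conquer over quadrants. It assumes row-major n×n matrices with leading dimension ld, n a power of two, and C zeroed by the caller; the cutoff constant and function name are illustrative:
#define CUTOFF 32

static void rec_mm(int n, int ld, const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {
        /* small problem: plain triple loop, working set already cache resident */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;
    /* C11 += A11*B11 + A12*B21, and similarly for the other three quadrants */
    rec_mm(h, ld, A,              B,              C);
    rec_mm(h, ld, A + h,          B + h * ld,     C);
    rec_mm(h, ld, A,              B + h,          C + h);
    rec_mm(h, ld, A + h,          B + h * ld + h, C + h);
    rec_mm(h, ld, A + h * ld,     B,              C + h * ld);
    rec_mm(h, ld, A + h * ld + h, B + h * ld,     C + h * ld);
    rec_mm(h, ld, A + h * ld,     B + h,          C + h * ld + h);
    rec_mm(h, ld, A + h * ld + h, B + h * ld + h, C + h * ld + h);
}
Whatever the cache sizes, the recursion eventually produces subproblems whose working sets fit in each level of the hierarchy, which is what makes the algorithm cache-oblivious.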
References
- ^ Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. ISBN 978-1-55860-320-2.
- ^ João M.P. Cardoso; Pedro C. Diniz (2 April 2011). Compilation Techniques for Reconfigurable Architectures. Springer Science & Business Media. ISBN 978-0-387-09671-1.
Further reading
- Wolfe, M. More Iteration Space Tiling. Supercomputing'89, pages 655–664, 1989.
- Wolf, M. E. and Lam, M. A Data Locality Optimizing Algorithm. PLDI'91, pages 30–44, 1991.
- Irigoin, F. and Triolet, R. Supernode Partitioning. POPL'88, pages 319–329, 1988.
- Xue, J. Loop Tiling for Parallelism. Kluwer Academic Publishers. 2000.
- M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.
External links
- Streams benchmark results, showing the overall balance between floating point operations and memory operations for many different computers
- "CHiLL: Composable High-Level Loop Transformation Framework"
Loop nest optimization
Introduction
Definition and Scope
Loop nest optimization encompasses a suite of compiler transformations applied to nested loops—typically involving two or more levels of nesting—to enhance program execution speed by improving data locality, minimizing memory access overheads, and enabling opportunities for vectorization or parallel execution.[4] These transformations target the structure and order of iterations within the nest to better align computations with hardware characteristics, such as cache hierarchies and processor pipelines.[5] The scope of loop nest optimization is confined to dense, regular computational patterns, particularly in scientific and numerical kernels like linear algebra operations, where multi-dimensional arrays dominate memory traffic.[4] It differs markedly from single-loop optimizations, such as basic unrolling or strength reduction, which operate on isolated loops without considering inter-level interactions, and excludes transformations on non-loop constructs like conditional branches or recursive calls.[6] This optimization paradigm emerged in the 1980s with the advent of vectorizing compilers designed for supercomputers, such as those targeting the Cray-1 architecture, where efficient loop restructuring was essential to exploit vector hardware.[6] Foundational to these efforts was early work on data dependence analysis, including contributions by Allen and Kennedy in the late 1970s and 1980s, which provided the theoretical basis for safely permuting and parallelizing loop nests in FORTRAN programs.[7] Utpal Banerjee's work on dependence analysis in the 1980s, including his 1988 book Dependence Analysis for Supercomputing, further laid groundwork by formalizing methods for dependence testing to enable parallel execution.[8][9] Loop nest optimization presupposes a basic understanding of loop constructs and array-based data structures in imperative languages. Key terminology includes the iteration space, defined as the geometric representation of all possible loop iterations as lattice points in a multi-dimensional domain bounded by loop limits, and the schedule, a function mapping iterations from this space to a linear execution order or distributing them across processors while preserving dependences.[10]
Motivation and Benefits
Loop nest optimization is primarily motivated by the poor data locality inherent in nested loops, particularly in numerical and scientific computing applications, where non-sequential memory access patterns lead to cache thrashing and excessive memory bandwidth consumption.[11] Without optimization, inner loops often traverse arrays with large strides, evicting recently loaded data from limited cache before it can be reused, thereby increasing latency and reducing overall performance.[11] These techniques seek to reorder or restructure loops to enhance temporal and spatial locality, maximizing data reuse within cache levels and minimizing costly main memory accesses.[11] Contemporary CPU architectures employ a multi-level cache hierarchy to bridge the speed gap between processors and main memory, with typical L1 data caches sized at 32 KB per core for ultra-low latency access and L2 caches ranging from 512 KB to 2 MB per core (as of 2024) for slightly slower but larger storage.[12] In unoptimized loop nests, such as those iterating over multi-dimensional arrays in row-major order, the innermost loop may access elements contiguously while outer loops cause strided jumps that span beyond L1 capacity, resulting in frequent evictions and conflict misses that degrade performance.[11] For instance, a 1996 study found over 80% of intra-nest cache misses in numerical codes were conflict misses on then-contemporary hardware, highlighting the need for transformations that align access with cache line granularity and associativity.[11] The benefits of loop nest optimization manifest in substantial performance gains, including execution time reductions through improved cache hit rates and decreased memory traffic; quantitative studies show speedups of up to 2.15 times in benchmarks like ARC2D via techniques such as interchange and fusion that enhance reuse.[11] In matrix operations, better cache utilization can yield speedups of 2 to 3 times or more for large datasets by fitting working sets into L1/L2 levels, while loop fusion further reduces instruction count by merging redundant control structures, eliminating overhead from multiple loop iterations.[13] Some advanced applications, including stencil computations, report up to 10x speedups when combining loop tuning with cache-aware memory management.[14] Evaluation of these optimizations often relies on metrics like reuse distance, which measures the average number of memory references between consecutive uses of the same data item (typically analyzed in log scale to capture long-distance patterns exceeding cache sizes), and cache miss ratio, the proportion of accesses requiring fetches from lower memory levels.[11] A 1996 study of numerical loop nests found about 93% of cache hits stemmed from intra-nest reuse, with 40% spatial and 60% temporal, but long reuse distances (>1024 references) contributed to miss ratios of 1.1% to 9.9%, emphasizing how optimizations shorten these distances to boost efficiency.[11]
Core Concepts
Loop Nest Representation
In loop nest optimization, the structure of nested loops is modeled using the iteration space, which represents all possible executions of the loops as points in a multi-dimensional integer lattice. Each loop level corresponds to a dimension, with iteration points defined by integer vectors satisfying affine constraints derived from loop bounds and conditions. For instance, a simple nested loop with bounds expressed as linear functions of parameters (e.g., outer loop from 1 to N and inner loop from 1 to M) forms a rectangular polytope in the integer lattice Z^d, where d is the nesting depth. This polyhedral representation captures the computational domain precisely, allowing for geometric manipulations while preserving the program's semantics.[15][16] Graphical tools, such as iteration space diagrams, visualize this polytope as a convex region in the plane or higher dimensions, with individual iteration points as lattice points and data dependences illustrated as directed lines or vectors connecting them. These diagrams highlight the spatial relationships between iterations, facilitating intuitive understanding of potential optimizations like parallelism or locality improvements. For example, in a two-dimensional space, dependences appear as arrows from source to destination points, revealing constraints such as flow or anti-dependences within the polytope.[15][16] A key concept in this representation is the schedule, which maps each iteration point to a time step for execution, often via an affine function θ(x) = Tx + c, where T is a matrix and x is the iteration vector. Unimodular transformations, which are integer matrices with determinant ±1, preserve the lattice structure and the number of iteration points, enabling reversible reorderings of the loop nest without altering the iteration count. This scheduling approach ensures dependence preservation by enforcing that if a dependence exists from iteration i to iteration j, then θ(i) is executed before θ(j).[16] As an illustrative example, consider a simple two-dimensional loop nest computing a prefix sum over an array:
for i = 1 to N
for j = 1 to M
A[i][j] = A[i-1][j] + B[i][j]
Data Dependence Analysis
Data dependence analysis is a foundational step in loop nest optimization that identifies relationships between statements in a program, particularly those involving array accesses or scalar variables, to determine constraints on possible transformations. These dependences arise when multiple iterations of a loop nest access the same memory location, potentially leading to data races or incorrect results if not preserved during optimization. The analysis operates on the iteration space of the loop nest, where each point represents a specific execution instance of a statement. By classifying dependences, compilers can ensure that optimizations like reordering or parallelization maintain program semantics. Dependences are categorized into three primary types based on the order of read and write operations to the same memory location: flow (or true) dependences, anti-dependences, and output dependences. A flow dependence occurs when one statement writes to a location that a subsequent statement reads, representing the actual flow of data values between computations. An anti-dependence exists when a statement reads a location that a later statement overwrites, which does not carry data but imposes ordering to avoid using stale values. An output dependence arises when multiple statements write to the same location, requiring serialization to prevent overwriting intermediate results. These classifications stem from early work on program representations that explicitly model such relationships to guide optimizations.[17] Dependences are further distinguished by their scope relative to loop iterations: lexical (or loop-independent) dependences occur between statements in the same iteration or sequentially without crossing loop boundaries, while loop-carried dependences span different iterations, often parameterized by distance vectors that indicate the offset between dependent iterations. For instance, in a nested loop, a loop-carried flow dependence might exist if an inner-loop write in iteration (i,j) affects a read in iteration (i+1,j). Lexical dependences are typically easier to handle as they do not constrain loop-level parallelism, whereas loop-carried ones are critical for determining optimization legality. This distinction allows compilers to focus on preserving only the essential ordering imposed by loop-carried flow dependences.[18] To detect these dependences, compilers employ various testing algorithms tailored to the uniformity of array subscripts. For uniform dependences, where subscript expressions are linear with constant coefficients, the greatest common divisor (GCD) test determines potential existence by checking if the GCD of the coefficients divides the constant difference between references. Specifically, for two references f(i) and g(i) in a single loop, a dependence exists if there are integers p and q such that f(p) = g(q) and the GCD condition holds, indicating overlapping accesses across iterations. This test is efficient but conservative, as it ignores loop bounds and may report false positives. The GCD test forms the basis for more advanced analyses in parallelizing compilers.[18] For non-uniform dependences, involving symbolic coefficients or bounds-dependent expressions, the Banerjee test extends the GCD approach by incorporating loop limits and solving linear inequalities to check for feasible integer solutions within the iteration space. 
The test evaluates whether there exist iteration points satisfying the dependence equations under the given bounds, using techniques like Fourier-Motzkin elimination for multi-dimensional cases. It provides direction vectors, such as (=, >) indicating no change in the outer loop but forward in the inner, to classify the dependence level. While more precise than GCD, the Banerjee test remains approximate for complex cases and is often combined with other methods for exactness. This framework, developed for supercomputing applications, enables scalable analysis in optimizing compilers.[19] Dependences uncovered by these tests are modeled using a dependence graph, where nodes represent individual statements in the loop nest, and directed edges denote dependences labeled with distance or direction vectors. For example, an edge from statement S1 to S2 with vector (1,0) signifies a flow dependence carried by the outer loop, advancing one iteration while staying constant in the inner loop. The graph captures the partial order of execution required for correctness, with loop-carried edges highlighting constraints across iterations. Such graphs facilitate visualization and algorithmic manipulation during optimization passes.[17] The primary criterion for legal optimizations is that transformations must preserve the order of all flow dependences to avoid altering data flow, while anti- and output dependences can often be eliminated through techniques like register promotion if they do not affect semantics. Loop-carried flow dependences, in particular, define the serialization requirements; reversing their direction could change computation results. Compilers use the dependence graph to verify that proposed reorderings respect this transitive closure, ensuring equivalence to the original program. This preservation is essential for safe application of optimizations in production compilers.[18]
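A small sketch of the GCD test described above, for two references A[a1*i + c1] (write) and A[a2*i + c2] (read) inside a single loop. The function and parameter names are illustrative; because loop bounds are ignored, the test is conservative and may report a dependence that cannot actually occur:
static int gcd(int x, int y)
{
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x < 0 ? -x : x;
}

/* returns 1 if a dependence may exist, 0 if it is provably absent */
static int gcd_test(int a1, int c1, int a2, int c2)
{
    int g = gcd(a1, a2);
    if (g == 0)                     /* both subscripts are constants */
        return c1 == c2;
    return (c2 - c1) % g == 0;      /* gcd(a1, a2) must divide the constant difference */
}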
Optimization Techniques
Loop Interchange and Reversal
Loop interchange is a loop nest transformation that swaps the order of execution between two adjacent loops to enhance data locality or facilitate subsequent optimizations such as parallelism extraction, while preserving the semantics of the program.[20] This reordering is particularly useful when the original loop order leads to non-sequential memory accesses, such as traversing columns of a row-major array. The transformation is represented by a unimodular matrix that permutes the loop dimensions, ensuring the iteration space remains integer and the volume is preserved. The legality of loop interchange depends on data dependence analysis, where dependence vectors must satisfy specific conditions to avoid reversing execution order. Specifically, for swapping adjacent loops at levels k and k+1, the transformation is legal if every dependence vector remains lexicographically positive after its components at those two levels are swapped; in particular, a dependence whose direction is (<, >) in the two loops being interchanged prevents the swap. Algorithms for determining this involve constructing the dependence cone from the set of all dependence vectors and testing if the unimodular swap matrix maps the cone into the positive orthant; if so, the interchange is valid. A simple pseudocode for applying the transformation to a two-level nest, assuming legality is confirmed, is as follows:
Original:
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
// computation involving A[i][j]
}
}
Transformed (interchanged):
for (int j = 0; j < M; j++) {
for (int i = 0; i < N; i++) {
// computation involving A[i][j]
}
}
Loop reversal inverts the iteration order of a single loop, for example changing for i = 1 to N to for i = N downto 1. The transformation affects dependence directions: it maps true (flow) dependences to anti-dependences and vice versa by negating the relevant components of dependence vectors.[21]
Legality requires that the reversed dependence vectors remain lexicographically nonnegative to preserve program order, meaning no true dependences become reversed in a way that violates semantics.[21] For loops with anti-dependences, reversal can convert them to true dependences, which may be easier to handle in certain contexts, but it is illegal if the loop carries true dependences in the reversed direction. Algorithms test this by applying the reversal matrix (a diagonal with -1 in the reversed dimension) to all dependence vectors and verifying positivity. A pseudocode example for reversal is:
Original:
for (int i = 0; i < N; i++) {
// computation
}
Transformed (reversed):
for (int i = N-1; i >= 0; i--) {
// computation
}
Loop Tiling
Loop tiling, also known as loop blocking or strip-mining with blocking, is a compiler transformation that partitions the iteration space of a loop nest into smaller, fixed-size blocks or tiles to enhance data locality. By ensuring that the data elements accessed within a single tile fit into a target level of the memory hierarchy, such as the cache, tiling reduces cache misses and improves overall program performance, particularly for compute-intensive applications like linear algebra routines.[23] The tiling process for rectangular loop nests typically begins with determining suitable block sizes and then restructuring the loops to iterate over these blocks. For a one-dimensional loop such as for (i = 1; i <= N; i++) { body }, rectangular blocking with tile size B introduces an outer loop over block indices and an inner loop covering the iterations within each block: for (bi = 1; bi <= ceil(N/B); bi++) for (i = (bi-1)*B + 1; i <= min(bi*B, N); i++) { body }. This transformation groups consecutive iterations, promoting spatial locality for sequential array accesses and temporal locality for reused data within the tile. For multi-dimensional nests, the process extends to each dimension, creating a higher-dimensional blocking structure that aligns with the memory layout.[23]
Selecting the tile size is a critical step influenced by hardware parameters like cache capacity, associativity, and line size—often 64 bytes for modern processors—as well as software factors such as array element size and problem dimensions. Poor choices can lead to excessive cache evictions or underutilization of cache space. A widely used heuristic for square tiles in two-dimensional nests, such as matrix operations, sets the block size B to approximately the square root of the usable cache size divided by the element size; for instance, with a 64 KB L1 data cache and 4-byte elements, B ≈ √(64 × 1024 / 4) = 128, balancing intra-tile reuse against inter-tile overhead. More sophisticated models, including cache miss equations or empirical search, refine this for specific architectures.[24][25]
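A small sketch of that heuristic in code; the one-half fraction of the cache reserved for other data is an assumption for illustration, not a fixed rule:
#include <math.h>
#include <stddef.h>

/* largest square tile whose footprint fits in roughly half the given cache */
static int square_tile_size(size_t cache_bytes, size_t elem_bytes)
{
    double usable = (double)cache_bytes / 2.0;
    int b = (int)sqrt(usable / (double)elem_bytes);
    return b > 1 ? b : 1;
}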
For non-rectangular or imperfectly nested loops, where iteration spaces are bounded by affine inequalities rather than simple rectangular ranges, standard rectangular tiling may violate dependencies or leave gaps. The hyperplane method addresses this by leveraging affine schedules in the polyhedral model to define tiling directions as hyperplanes that partition the iteration polytope while preserving legality. These schedules compute a sequence of unimodular transformations to generate tiles of arbitrary shapes, ensuring complete coverage and overlap-free execution, often integrated into tools like PoCC or CLooG for code generation.
Although tiling introduces additional loop levels, which incur minor runtime overhead from extra control flow and potential branch instructions at tile boundaries, this cost is generally negligible compared to the locality gains—typically less than 1-2% in execution time for dense codes. Mitigation strategies include unrolling the innermost tile loops to reduce branch frequency and amortize setup costs.[26]
Loop Fusion and Fission
Loop fusion is a compiler optimization technique that merges multiple consecutive loops into a single loop, promoting temporal data reuse by allowing computations to access shared data without intermediate storage. This transformation reduces loop overhead and enhances cache locality, particularly when the loops operate on overlapping data sets. For instance, fusing separate loops over the same index range can eliminate redundant memory loads and stores, improving execution time by up to 34% in sequential codes on certain architectures. Legality of fusion requires no cycles in the dependence graph between the statements of the loops being merged, ensuring that data dependences are preserved without introducing ordering violations.[27] A key application of loop fusion arises in producer-consumer patterns, such as stencil computations where one loop produces intermediate arrays that another consumes. By fusing these loops, the compiler can eliminate temporary array allocations, reducing memory traffic and enabling better reuse of producer outputs directly in consumer iterations. Algorithms for fusion often employ greedy partitioning on a dependence graph to maximize parallelism while respecting fusion-preventing edges, with polynomial-time heuristics for data locality improvements based on reuse vectors that quantify temporal and spatial reuse distances. Profitability analysis typically weighs benefits like reduced memory accesses against costs, such as increased computational complexity in the fused body.[28][27][6] Loop fission, the inverse transformation, splits a single loop nest into multiple independent loops over the same index range, facilitating subsequent optimizations like parallelization or vectorization by isolating compatible statements. This is particularly useful when a loop mixes scalar reductions with array operations, as fission can separate them to expose parallelism without carry dependencies. To handle privatization issues, such as loop-carried scalar dependences, compilers apply scalar expansion, converting scalars into per-iteration arrays or private variables to remove false dependences. For example, in a reduction loop, fission allows the accumulation to be privatized across threads, enabling safe parallel execution.[27][6] Despite its benefits, loop fusion introduces trade-offs, notably increased register pressure from combining operations that compete for limited registers, potentially leading to spilling and higher memory traffic. Aggressive fusion may also exacerbate cache interference if unrelated data accesses are interleaved, though constrained variants limit fusion to innermost loops to mitigate this. Experimental studies show that while fusion yields energy savings of 2-29% in embedded systems by reducing memory operations, judicious application is needed to balance these pressures against reuse gains.[29]
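A minimal sketch of the producer-consumer case described above (array names are illustrative): the unfused form materializes a temporary array t, while the fused form consumes each value as soon as it is produced, letting the temporary shrink to a scalar. Fission is simply the reverse rewrite, splitting the fused body back into separate loops:
/* before fusion: producer then consumer, t[] written and re-read */
for (i = 0; i < N; i++)
    t[i] = a[i] + b[i];
for (i = 0; i < N; i++)
    c[i] = t[i] * t[i];

/* after fusion: one pass over the data, the temporary becomes a scalar */
for (i = 0; i < N; i++)
{
    double ti = a[i] + b[i];
    c[i] = ti * ti;
}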
Illustrative Examples
Matrix-Vector Multiplication Optimization
The baseline implementation of matrix-vector multiplication, expressed as a loop nest, is given by:
for i = 1 to N
for j = 1 to N
y[i] += A[i][j] * x[j]
Interchanging the two loops gives the j-outer form of the same computation:
for j = 1 to N
for i = 1 to N
y[i] += A[i][j] * x[j]
Applying loop tiling with block size B to both loops then partitions the iteration space into B × B blocks:
for jj = 1 to N step B
for ii = 1 to N step B
for j = jj to min(jj + B - 1, N)
for i = ii to min(ii + B - 1, N)
y[i] += A[i][j] * x[j]
Matrix Multiplication Optimization
Matrix multiplication serves as a canonical example of a compute-intensive kernel with a three-level loop nest, where optimizations like interchange and multi-level tiling dramatically improve cache utilization and performance. The baseline implementation follows the ijk loop order, iterating over rows of C (i), columns of C (j), and the contraction dimension (k):
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Interchanging the j and k loops (the ikj order) makes the innermost loop traverse B and C with unit stride:
for (i = 0; i < N; i++) {
for (k = 0; k < N; k++) {
for (j = 0; j < N; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Adding two levels of tiling to the interchanged nest (the outer block size is written bs here to avoid clashing with the matrix B; b is the inner block size for j) confines each block's working set to the cache:
for (ii = 0; ii < N; ii += bs) {
for (kk = 0; kk < N; kk += bs) {
for (i = ii; i < min(ii + bs, N); i++) {
for (k = kk; k < min(kk + bs, N); k++) {
// Pack A[i][k] panel if needed
for (jj = 0; jj < N; jj += b) {
for (j = jj; j < min(jj + b, N); j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
}
}
Advanced Considerations
Polyhedral Model Applications
The polyhedral model provides a mathematical framework for representing and optimizing loop nests in compilers, treating iteration spaces as convex polyhedra defined by systems of affine inequalities. In this representation, the iteration domain of a loop nest is modeled as the set of integer points within a polyhedron, where loop bounds and conditions are expressed as linear constraints of the form Ax + b ≥ 0, with x denoting the iteration vector. This affine structure captures the regular access patterns common in array-based computations, enabling systematic analysis and transformation of nested loops.[34] Scattering functions, which are affine mappings from the iteration space to a schedule space, determine the execution order of statement instances by assigning timestamps or coordinates to each point in the polyhedron, thus prescribing the code generation order while respecting data dependences.[35] Applications of the polyhedral model in loop nest optimization center on automated transformations such as tiling and fusion, often formulated as integer linear programming (ILP) problems to find optimal schedules. For tiling, the model partitions the iteration space into smaller blocks to improve data locality, solving for tile sizes and shapes that minimize memory access overhead subject to dependence constraints; fusion, conversely, combines adjacent loops to reduce intermediate data storage, with ILP ensuring legal merging without violating data flow. These optimizations are particularly effective for affine loop nests, where the model's algebraic tools allow enumerating transformation sequences efficiently. Tools like Polly, integrated into the LLVM compiler infrastructure, apply these techniques by detecting SCoPs (static control parts) in code, performing polyhedral analyses, and generating optimized IR through scheduling and tiling passes. Similarly, CLooG serves as a code generator that translates polyhedral schedules back into executable loop code, supporting unions of polyhedra for complex nests.[36][37][35] A representative example of schedule computation involves an affine loop nest for stencil computation, such as updating a 2D array where each element depends on its neighbors. The iteration domain is a polyhedron defined by bounds on the indices i and j, with dependences modeled as affine relations between points. An ILP formulation minimizes execution latency by finding a unimodular transformation matrix that reorders iterations, subject to constraints ensuring no dependence is violated (e.g., θ(t) − θ(s) ≥ 1 for each dependence from a source instance s to a target instance t), resulting in a tiled schedule that balances parallelism and locality. This approach can yield significant speedups on multicore systems for benchmarks like Jacobi iteration, with reported improvements of up to 4x in some cases.[38][39] The polyhedral model originated in the 1990s with foundational work by Paul Feautrier on affine scheduling problems, which introduced multidimensional time and dependence polyhedra to handle complex loop transformations systematically. Subsequent developments extended it to code generation and verification, as in formal proofs of semantic preservation for polyhedral compilers. In modern contexts, the model has evolved to support GPU code generation, where tools like PPCG apply scattering and tiling to produce CUDA kernels from affine nests, optimizing for thread block layouts and memory coalescing to achieve near-peak performance on irregular workloads.[39][34][40]
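As a minimal illustration of such a formulation (generic notation, not tied to Polly, CLooG, or any other tool), the iteration domain, a one-dimensional affine schedule, and the legality constraint fed to the ILP solver for a two-level nest can be written as:
D = { (i, j) ∈ Z² : 1 ≤ i ≤ N, 1 ≤ j ≤ M }   (iteration domain)
θ(i, j) = t₁·i + t₂·j + t₀   (affine schedule with unknown integer coefficients t₀, t₁, t₂)
θ(target) − θ(source) ≥ 1 for every dependence   (legality: the target of each dependence runs after its source)
The solver then searches for coefficients that satisfy the legality constraints while minimizing an objective such as latency or dependence distance.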
Parallelism and Vectorization
Loop nest optimization often extends to parallelism by distributing iterations of outer loops across multiple threads or processors, particularly after transformations like loop fission that eliminate data dependencies. In OpenMP, the #pragma omp parallel for directive can be applied to the outermost loop of a dependence-free nest, such as the row index in a matrix multiplication kernel (e.g., for (int i = 0; i < N; i++) for C = A * B), enabling workload distribution to achieve near-linear speedup on multi-core systems.[41] This approach leverages the fork-join model, where threads execute independent iterations concurrently, but requires prior fission to separate scalar reductions or dependent computations into independent parallelizable sections.[42]
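A minimal sketch of that usage, assuming a dependence-free outer loop and row-major arrays (function and variable names are illustrative); compile with an OpenMP-enabled compiler, e.g. -fopenmp:
void matmul_parallel(int N, const double *A, const double *B, double *C)
{
    /* the i iterations are independent, so OpenMP distributes them across threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}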
Vectorization targets inner loops for single instruction, multiple data (SIMD) execution, aligning memory accesses and computation to exploit wide vector registers like those in AVX-512, which support 512-bit operations for up to 16 single-precision floats per instruction. Compilers apply techniques such as loop peeling—executing a few iterations separately to handle non-unit strides or alignment issues—ensuring the remaining loop body processes contiguous data in vector-friendly patterns, as seen in inner product loops of matrix operations.[43] For instance, in a tiled matrix multiplication, the innermost loop over column indices can be vectorized after ensuring unit-stride access, yielding significant performance gains on AVX-512 hardware depending on data types.[44]
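A small sketch of the kind of inner loop such vectorization targets: unit-stride, alias-free accesses over contiguous data. The restrict qualifiers and the omp simd hint are illustrative choices, not requirements of any particular compiler:
/* with, e.g., -O3 -fopenmp-simd this loop maps naturally onto SIMD registers such as AVX-512 */
void scale_add_row(int n, double *restrict c, const double *restrict a, double b)
{
    #pragma omp simd
    for (int j = 0; j < n; j++)
        c[j] += a[j] * b;   /* contiguous, unit-stride accesses */
}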
Implementing parallelism in tiled loop nests introduces challenges like load balancing, where uneven tile sizes or iteration counts across threads lead to idle processors and suboptimal speedup. False sharing exacerbates this in shared-cache environments, occurring when threads update distinct variables mapped to the same cache line, triggering unnecessary coherence traffic and serialization.[45][46]
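One common mitigation for the false sharing just described is to pad per-thread results so that each occupies its own cache line; a minimal sketch, assuming a 64-byte line size (the struct layout and names are illustrative):
#define CACHE_LINE 64
#define MAX_THREADS 64

struct padded_sum {
    double value;                              /* per-thread partial result */
    char pad[CACHE_LINE - sizeof(double)];     /* keep neighbouring entries on separate cache lines */
};

struct padded_sum partial[MAX_THREADS];        /* no two threads share a cache line */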
Auto-parallelizers, such as Intel's oneAPI DPC++/C++ Compiler, automate these processes by analyzing dependence graphs and inserting OpenMP directives or SIMD intrinsics, often guided by flags like -parallel for loop distribution and -xCORE-AVX512 for vectorization.[47] To predict theoretical limits, the work-depth model evaluates parallelism potential: total work (T_1, sequential time) divided by depth (T_∞, critical path length) bounds speedup as min(P, T_1 / T_∞), where P is processor count, helping assess if a nest's granularity supports efficient scaling.[48]
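As an illustration with purely hypothetical numbers: a loop nest with total work T_1 = 10⁹ operations and critical path T_∞ = 10⁶ operations has parallelism T_1 / T_∞ = 1,000, so the predicted speedup is min(64, 1,000) = 64 on 64 processors (near-linear scaling) but saturates at 1,000 regardless of how many more processors are added.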