Systolic array
from Wikipedia

In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in Colossus, which was an early computer used to break German Lorenz ciphers during World War II.[1] Due to the classified nature of Colossus, they were independently invented or rediscovered by H. T. Kung and Charles Leiserson who described arrays for many dense linear algebra computations (matrix product, solving systems of linear equations, LU decomposition, etc.) for banded matrices. Early applications include computing greatest common divisors of integers and polynomials.[2] Nowadays, they can be found in NPUs and hardware accelerators based on spatial designs. They are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories: SISD, SIMD, MISD, MIMD, as discussed later in this article.

The parallel input data flows through a network of hard-wired processor nodes, which combine, process, merge or sort the input data into a derived result. Because the wave-like propagation of data through a systolic array resembles the pulse of the human circulatory system, the name systolic was coined from medical terminology. The name is derived from systole as an analogy to the regular pumping of blood by the heart.

Applications

Systolic arrays are often hard-wired for specific operations, such as multiply-and-accumulate, to perform massively parallel integration, convolution, correlation, matrix multiplication or data sorting tasks. They are also used for dynamic programming algorithms, as in DNA and protein sequence analysis.

Architecture

A systolic array typically consists of a large monolithic network of primitive computing nodes which can be hardwired or software-configured for a specific application. The nodes are usually fixed and identical, while the interconnect is programmable. The more general wavefront processors, by contrast, employ sophisticated and individually programmable nodes which may or may not be monolithic, depending on the array size and design parameters. Another distinction is that systolic arrays rely on synchronous data transfers, while wavefront arrays tend to work asynchronously.

Unlike the more common Von Neumann architecture, where program execution follows a script of instructions stored in common memory, addressed and sequenced under the control of the CPU's program counter (PC), the individual nodes within a systolic array are triggered by the arrival of new data and always process the data in exactly the same way. The actual processing within each node may be hard-wired or block microcoded, in which case the common node personality can be block-programmable.

The systolic array paradigm, with data streams driven by data counters, is the counterpart of the Von Neumann architecture, with an instruction stream driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.

Goals and benefits

A major benefit of systolic arrays is that all operand data and partial results are stored within (passing through) the processor array. There is no need to access external buses, main memory or internal caches during each operation as is the case with Von Neumann or Harvard sequential machines. The sequential limits on parallel performance dictated by Amdahl's Law also do not apply in the same way, because data dependencies are implicitly handled by the programmable node interconnect and there are no sequential steps in managing the highly parallel data flow.

Systolic arrays are therefore extremely good at artificial intelligence, image processing, pattern recognition, computer vision and other tasks that animal brains do particularly well. Wavefront processors in general can also be very good at machine learning by implementing self-configuring neural nets in hardware.

Classification controversy

While systolic arrays are officially classified as MISD, their classification is somewhat problematic. Because the input is typically a vector of independent values, the systolic array is definitely not SISD. Since these input values are merged and combined into the result(s) and do not maintain their independence as they would in a SIMD vector processing unit, the array cannot be classified as such. Consequently, the array cannot be classified as a MIMD either, because MIMD can be viewed as a mere collection of smaller SISD and SIMD machines.

Finally, because the data swarm is transformed as it passes through the array from node to node, the multiple nodes are not operating on the same data, which makes the MISD classification a misnomer. The other reason why a systolic array should not qualify as a MISD is the same as the one which disqualifies it from the SISD category: The input data is typically a vector not a single data value, although one could argue that any given input vector is a single item of data.

In spite of all of the above, systolic arrays are often offered as a classic example of MISD architecture in textbooks on parallel computing and in engineering classes. If the array is viewed from the outside as atomic it should perhaps be classified as SFMuDMeR = single function, multiple data, merged result(s).

Systolic arrays use a pre-defined computational flow graph that connects their nodes. Kahn process networks use a similar flow graph, but are distinguished by the nodes of a systolic array working in lock-step; in a Kahn network, there are FIFO queues between each node.

Detailed description

A systolic array is composed of matrix-like rows of data processing units called cells. Data processing units (DPUs) are similar to central processing units (CPUs), except for the usual lack of a program counter,[3] since operation is transport-triggered, i.e., triggered by the arrival of a data object. Each cell shares its information with its neighbours immediately after processing. The systolic array is often rectangular, with data flowing across the array between neighbouring DPUs, often with different data flowing in different directions. The data streams entering and leaving the ports of the array are generated by auto-sequencing memory units (ASMs). Each ASM includes a data counter. In embedded systems a data stream may also be input from and/or output to an external source.

Two example dataflows for 2×2 matrix multiplication on a systolic array are commonly illustrated: one in which the output values are accumulated inside the DPUs, and one in which one operand (the green matrix in the original animation) is pre-loaded and kept stationary inside the DPUs while computing, so that it can be reused for subsequent multiplications.

An example of a systolic algorithm might be designed for matrix multiplication. One matrix is fed in a row at a time from the top of the array and is passed down the array, the other matrix is fed in a column at a time from the left hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.[4]
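
The following Python sketch is a minimal, cycle-by-cycle simulation of this kind of dataflow under output-stationary assumptions: rows of one matrix enter from the left edge and move right, columns of the other enter from the top and move down, the inputs are skewed by one cycle per row or column, and each cell accumulates its own result. Function and variable names are illustrative, not taken from any particular implementation.

import numpy as np

def systolic_matmul(A, B):
    # Output-stationary systolic multiply: PE (i, j) accumulates C[i, j] as the
    # skewed streams of A (moving right) and B (moving down) pass through it.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    a_reg = np.zeros((n, m))   # A value currently latched in each PE
    b_reg = np.zeros((n, m))   # B value currently latched in each PE
    for t in range(k + n + m - 2):              # fill + steady state + drain
        # operands advance one PE per cycle (right for A, down for B)
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # skewed injection at the boundaries: row i of A is delayed i cycles,
        # column j of B is delayed j cycles, so matching indices meet in PE (i, j)
        for i in range(n):
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
        C += a_reg * b_reg                      # one multiply-accumulate per PE per cycle
    return C

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
assert np.allclose(systolic_matmul(A, B), A @ B)

After k + n + m - 2 cycles every cell holds one element of the product, matching the "result stored in the array" behaviour described above.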

Systolic arrays are arrays of DPUs which are connected to a small number of nearest-neighbour DPUs in a mesh-like topology. DPUs perform a sequence of operations on data that flows between them. Because traditional systolic array synthesis methods are based on algebraic transformations, only uniform arrays with linear pipes can be obtained, so that the architecture is the same in all DPUs. The consequence is that only applications with regular data dependencies can be implemented on classical systolic arrays. Like SIMD machines, clocked systolic arrays compute in "lock-step", with each processor undertaking alternate compute and communicate phases. Systolic arrays with asynchronous handshake between DPUs are instead called wavefront arrays. One well-known systolic array is Carnegie Mellon University's iWarp processor, which was manufactured by Intel. An iWarp system has a linear array processor connected by data buses going in both directions.

History

Systolic arrays (also known as wavefront processors) were first described by H. T. Kung and Charles E. Leiserson, who published the first paper describing systolic arrays in 1979. However, the first machine known to have used a similar technique was the Colossus Mark II in 1944.

Examples

Polynomial evaluation

Horner's rule for evaluating a polynomial $a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$ is:

$y = (\cdots(((a_n x + a_{n-1})x + a_{n-2})x + \cdots)x + a_1)x + a_0$

A linear systolic array in which the processors are arranged in pairs can evaluate it: one processor multiplies its input by $x$ and passes the result to the right, the next adds the next coefficient $a_j$ and passes the result to the right.
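
As a hedged illustration, the Python sketch below pipelines this paired multiply/add chain, assuming a polynomial of degree at least 1; one evaluation point enters per cycle, and once the pipeline is full one polynomial value emerges per cycle. The names are illustrative only.

def horner_pipeline(coeffs, xs):
    # coeffs = [a_n, ..., a_0]; each multiply/add pair multiplies by its x and
    # adds the next coefficient, then passes its state to the right-hand pair.
    # Assumes len(coeffs) >= 2 (degree >= 1).
    n_stages = len(coeffs) - 1
    pipeline = [None] * n_stages          # (x, partial value) held in each pair
    out = []
    for cycle in range(len(xs) + n_stages):
        if pipeline[-1] is not None:      # a finished value leaves the last pair
            out.append(pipeline[-1][1])
        for s in range(n_stages - 1, 0, -1):   # shift stages one step right
            pipeline[s] = pipeline[s - 1]
        if cycle < len(xs):               # a new x enters, seeded with a_n
            pipeline[0] = (xs[cycle], coeffs[0])
        else:
            pipeline[0] = None            # drain with bubbles
        for s in range(n_stages):         # each occupied pair does x*acc + a_j
            if pipeline[s] is not None:
                x, acc = pipeline[s]
                pipeline[s] = (x, acc * x + coeffs[s + 1])
    return out

coeffs = [2.0, -1.0, 5.0]                 # 2x^2 - x + 5
xs = [1.0, 2.0, 3.0]
assert horner_pipeline(coeffs, xs) == [2*x*x - x + 5 for x in xs]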

Convolution

Consider a chain of processing elements (PEs), each performing a multiply-accumulate operation. It processes input data ($x$) and weights ($w$) systolically, meaning data flows through the array in a regular, rhythmic manner. The weights remain stationary within each PE, while the input data and partial sums ($y$) move in opposite directions.

Each PE performs the following operation:

$x_{\text{out}} = x_{\text{in}}, \qquad y_{\text{out}} = y_{\text{in}} + w \cdot x_{\text{in}}$

where:

  • $x_{\text{in}}$ is the input data.
  • $y_{\text{in}}$ is the incoming partial sum.
  • $w$ is the weight stored in the PE.
  • $x_{\text{out}}$ is the output data (passed to the next PE).
  • $y_{\text{out}}$ is the updated partial sum.

The input samples enter from one end of the chain and the partial sums (initialized to zero) enter from the other, each advancing one PE per cycle with a one-cycle gap between consecutive values so that every partial sum meets every sample it needs. With PE $k$ holding weight $w_k$, the completed sums that emerge are $y_m = \sum_{k=0}^{K-1} w_k \, x_{m-k}$ (out-of-range samples taken as zero). This is the 1-dimensional convolution. Similarly, n-dimensional convolution can be computed by an n-dimensional array of PEs.

Many other implementations of the 1D convolutions are available, with different data flows.[5]
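
As one concrete illustration, the Python sketch below simulates, cycle by cycle, the counter-flowing variant described above: the weights stay in the PEs, the input samples stream left-to-right with one-cycle gaps, and the partial sums stream right-to-left. The dataflow details and all names are a plausible reconstruction for illustration, not a definitive model of any specific design.

import numpy as np

def systolic_conv_counterflow(x, w):
    # Weight-stationary 1D convolution with counter-flowing x and y streams.
    # PE k holds w[k]; x moves right, partial sums y move left, and gaps of one
    # cycle between samples keep the opposing streams correctly paired.
    K, N = len(w), len(x)
    x_stream = []
    for sample in x:
        x_stream.extend([sample, 0.0])        # interleave zeros between samples
    x_reg = [0.0] * K                         # x value latched in each PE
    y_reg = [0.0] * K                         # partial sum latched in each PE
    outputs = []
    for t in range(2 * N + 2 * K + 2):
        x_in = x_stream[t] if t < len(x_stream) else 0.0
        x_reg = [x_in] + x_reg[:-1]           # x shifts one PE to the right
        outputs.append(y_reg[0])              # finished sum leaves the left end
        y_reg = y_reg[1:] + [0.0]             # y shifts one PE to the left
        y_reg = [y + wk * xk for y, wk, xk in zip(y_reg, w, x_reg)]
    # meaningful results appear on odd cycles: y[m] leaves at cycle 2m + 1
    return [outputs[2 * m + 1] for m in range(N + K - 1)]

x, w = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
assert np.allclose(systolic_conv_counterflow(x, w), np.convolve(x, w))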

See [5] Figure 12 for an algorithm that performs on-the-fly least-squares using one- and two-dimensional systolic arrays.

Sorting

Bubble sort is also an example of 1D systolic computation,[6] although it applies N-1 passes for an array of size N. Each pass systolically moves the maximum element of a subsequence towards its final location in the sorted result.

If one is willing to use N/2 processing elements (PEs), each with a comparator and two registers and arranged in a stack-like fashion, an array (or stream) of size N can be sorted in 2N time: its elements are pushed in while, at every level of the systolic stack, the maximum of the pair of elements stored in each PE is pushed further down. After all the elements have been pushed in, the process is reversed, with the minimum element in each PE being popped out (or "pushed up"), so that the stream of elements comes out sorted in ascending order.[7]
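
A behavioural Python sketch of this push/pop scheme is given below; it models each PE as a pair of registers and runs each ripple to completion within a step (in hardware the ripples overlap, which is what gives the 2N-cycle figure). Names are illustrative.

def systolic_stack_sort(values):
    # ceil(N/2) PEs, each holding at most two values. A push ripples the largest
    # of {incoming, stored pair} one level down; a pop emits the top PE's minimum
    # and pulls minima upward, so values come out in ascending order.
    n_pe = (len(values) + 1) // 2
    pes = [[] for _ in range(n_pe)]

    def push(v):
        for pe in pes:
            if len(pe) < 2:              # room here, the ripple stops
                pe.append(v)
                return
            pe.append(v)
            pe.sort()
            v = pe.pop()                 # evict the largest downward

    def pop():
        out = min(pes[0])                # top PE always holds the global minimum
        pes[0].remove(out)
        for upper, lower in zip(pes, pes[1:]):   # pull minima upward to refill
            if lower:
                m = min(lower)
                lower.remove(m)
                upper.append(m)
        return out

    for v in values:                     # "push" phase: N steps
        push(v)
    return [pop() for _ in values]       # "pop" phase: N steps

assert systolic_stack_sort([5, 3, 8, 1, 9, 2]) == [1, 2, 3, 5, 8, 9]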

Sorting input arrays of larger size (N > P) than the number of processing elements (P) is somewhat complex to do efficiently with such a system, but can be realized (by adding an external serial processor) in O(N log N/log P) time. The serial processor needs to manage a "bucket B-tree", where each node in the B-tree has P "buckets" that are eventually each sorted in O(P) time using the PEs.[8]

Implementations

  • Inmos Transputer[9]
  • The Cisco PXF network processor is internally organized as a systolic array.[10]
  • Google’s TPU is also designed around a systolic array.
  • Paracel FDF4T TestFinder text search system[11]
  • Paracel FDF4G GeneMatcher Biological (DNA and Protein) search system
  • Inferentia chip at Amazon Web Services[12]
  • Gemmini systolic array-based accelerator developed at UC Berkeley[13]

from Grokipedia
A systolic array is a homogeneous network of tightly coupled processing elements that rhythmically produce and pass data through the system in a pipelined manner, enabling efficient parallel computation with minimal memory access, much like the synchronous pumping of blood in the human circulatory system. The architecture features simple, regular communication paths between identical processors, supporting high throughput for data-intensive operations such as matrix multiplication and linear algebra transformations. Introduced in 1978 by H. T. Kung and Charles E. Leiserson at Carnegie Mellon University, systolic arrays were conceived to leverage the scaling of very-large-scale integration (VLSI) technology, addressing the growing demand for specialized hardware in compute-bound applications by emphasizing modularity, scalability, and reduced data movement overhead. Historically, the design evolved from earlier parallel array processors, shifting focus toward pipelined structures that minimize global communication and exploit spatial locality to achieve predictable performance. Key principles include synchronous operation via a global clock, repeatability of processing elements for ease of fabrication, and data flow that enters from the boundaries, propagates inward for computation, and exits without buffering, which optimizes for repetitive algorithms in signal processing and beyond. Early implementations targeted problems such as dense matrix computations, demonstrating the potential for very high computational density in VLSI chips.

In modern computing, systolic arrays have seen a resurgence in application-specific accelerators, particularly for deep neural networks (DNNs) and convolutional neural networks (CNNs), where they excel at the matrix-heavy workloads of inference and training. Google's Tensor Processing Unit (TPU), introduced in 2016, exemplifies this with a 256×256 systolic array of multiply-accumulate units that performs 92 tera-operations per second at low power, achieving up to 83 times better performance-per-watt than contemporary CPUs for neural network tasks by reusing data across adjacent elements without frequent memory fetches. Other notable applications include image and audio analysis, cryptographic algorithms, and computer vision tasks, underscoring the architecture's enduring efficiency in domains requiring high parallelism and predictable latency. Despite advantages in throughput and energy efficiency, challenges such as limited flexibility for non-matrix operations and high design complexity persist, driving ongoing research into hybrid and scalable variants.

Fundamentals

Definition and Principles

A systolic array is a homogeneous network of tightly coupled processing elements (PEs), forming a parallel computing architecture in which data flows in a pipelined, rhythmic manner from external memories through the array, while computations occur synchronously in a pulse-like fashion. The term "systolic" derives from the physiological concept of systole, referring to the rhythmic contraction of the heart that pumps blood through the body, mirroring how data is rhythmically computed and passed through the network of processors. This design enables efficient handling of regular, data-intensive algorithms, such as matrix operations, by emphasizing concurrent data movement and computation without reliance on complex control mechanisms.

At its core, each PE in a systolic array performs simple, localized operations—typically multiply-accumulate (MAC) functions—on data received from adjacent neighbors, ensuring that processing remains decentralized. There is no global controller directing operations; instead, the architecture depends on nearest-neighbor communications for data exchange, with only the boundary PEs interfacing with external streams to inject data into the array and extract results. This pipelined flow allows multiple computations to proceed simultaneously across the array, akin to a wavefront propagating through the structure, promoting high throughput for repetitive numerical tasks.

Key characteristics of systolic arrays include their regularity, manifested in the uniform arrangement and identical functionality of PEs, which simplifies design and fabrication; modularity, enabling the array to be scaled by adding more PEs to meet varying performance needs; and local interconnects, which restrict communication to short, nearest-neighbor links to reduce latency and wiring complexity. These features collectively support a highly synchronized, pipelined environment optimized for VLSI implementation.

For illustration, consider a simple one-dimensional systolic array: a linear chain of identical PEs connected sequentially, ideal for vector operations such as convolutions or dot products. In this setup, one operand stream (e.g., a data vector) enters from the left boundary PE, while another (e.g., filter coefficients) enters from the top of each PE; each PE multiplies the incoming values, adds the product to an accumulated partial sum passed from the previous PE, and forwards the results to the next, with the final output emerging from the right end after a delay equal to the array length.
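
As a hedged illustration of such a linear chain, the following Python sketch assumes a weight-stationary layout: PE k holds a fixed coefficient, input elements arrive skewed from "above", and partial sums are forwarded to the right, so one completed dot product leaves the right end per cycle once the pipeline is full. All names are illustrative.

def linear_systolic_dot_products(W, X_rows):
    # PE k holds W[k]; each input row is skewed so that its k-th element reaches
    # PE k exactly when that row's partial sum arrives from PE k-1.
    n = len(W)
    sums = [0.0] * n                  # partial sum latched in each PE
    results = []
    n_rows = len(X_rows)
    for t in range(n_rows + n):       # extra n cycles drain the pipeline
        if t >= n:
            results.append(sums[-1])  # a completed dot product exits PE n-1
        # shift partial sums one PE to the right; a fresh sum (0) enters PE 0
        sums = [0.0] + sums[:-1]
        # skewed inputs: element k of row (t - k) is present at PE k this cycle
        for k in range(n):
            r = t - k
            if 0 <= r < n_rows:
                sums[k] += W[k] * X_rows[r][k]
    return results

W = [1.0, 2.0, 3.0]
rows = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
assert linear_systolic_dot_products(W, rows) == [7.0, 3.0]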

Historical Development

The concept of the systolic array was proposed in the late 1970s by H. T. Kung and Charles Leiserson at Carnegie Mellon University, drawing inspiration from the rhythmic pumping action of blood flow in the heart—termed "systole" by physiologists—and the emerging trends in very large-scale integration (VLSI) technology that favored modular, pipelined designs. This approach addressed the growing limitations of von Neumann architectures, which suffered from memory-computation bottlenecks in handling parallel, compute-intensive tasks such as signal processing and numerical computations.

A foundational milestone came in 1978 with Kung and Leiserson's paper, "Systolic Arrays (for VLSI)," which outlined the concept as a network of processors that rhythmically compute and pass data to exploit VLSI's potential for massive parallelism while minimizing global communication. Throughout the 1980s, theoretical models evolved into practical designs, with Kung's group at Carnegie Mellon pioneering hardware prototypes; a key example was the Warp machine, a linear programmable systolic array with a prototype completed in 1985 and full 10-cell systems in 1986, capable of 10 MFLOPS per cell and targeted at signal and image processing applications. These developments integrated systolic principles with VLSI design paradigms, enabling efficient implementations for digital signal processing (DSP) tasks like filtering and matrix operations.

By the 1990s, systolic arrays shifted from custom application-specific integrated circuits (ASICs) to more flexible programmable systems, exemplified by the iWarp project—a collaboration between Carnegie Mellon and Intel—which deployed its first 64-cell system in 1990, supporting both systolic communication and shared-memory models for scalable numerical computations. This evolution maintained the initial emphasis on DSP and linear algebra problems, laying the groundwork for broader adoption while highlighting the architecture's adaptability to advancing semiconductor technologies.

Architectural Design

Core Components

A systolic array is fundamentally composed of processing elements (PEs), which serve as the basic computational units. Each PE is an identical cell equipped with an arithmetic logic unit (ALU) for performing operations such as multiplication and accumulation, registers for temporary storage, and local control logic to manage autonomous operation. These elements operate in a rhythmic, pipelined manner, receiving inputs from neighbors, computing locally, and passing results onward without requiring global coordination.

The interconnects form the structural backbone, consisting of fixed, nearest-neighbor wiring patterns that facilitate data transfer between adjacent PEs. Common topologies include linear arrays for sequential operations, 2D mesh configurations for matrix computations, and hexagonal arrangements for more complex data flows, all avoiding global buses to minimize latency and wiring complexity. This regular, local interconnection ensures modularity and ease of fabrication in VLSI implementations.

Boundary conditions define the interfaces at the array's edges, where ports connect to external memory or host systems for data loading and result extraction. Unlike traditional architectures, systolic arrays lack an internal program counter, relying instead on these boundary ports to inject initial data streams and retrieve outputs, with unused boundary outputs often ignored to simplify the design.

Scalability is achieved by tiling multiple PEs into 1D, 2D, or even 3D configurations, allowing the array size to match computational demands. For instance, a 2D array of PEs can efficiently handle matrix multiplication by aligning data flows along rows and columns, enabling pipelined processing of larger problems through repeated boundary injections. This modular extension supports high throughput without redesigning the core PE logic.
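
To make these components concrete, the short Python sketch below models a single PE under assumed simplifications (a stationary weight register, a multiply-accumulate "ALU", and latched outputs for its neighbours); the class and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class ProcessingElement:
    # One PE: a MAC "ALU", local registers, no program counter (illustrative).
    weight: float = 0.0      # stationary operand held in a local register
    acc: float = 0.0         # partial-result register
    x_out: float = 0.0       # value latched for the next PE in the row
    y_out: float = 0.0       # partial sum latched for the downstream PE

    def clock(self, x_in: float, y_in: float) -> None:
        # One synchronous cycle: consume neighbour inputs, multiply-accumulate,
        # and latch the outputs that neighbours will read next cycle.
        self.acc = y_in + self.weight * x_in
        self.x_out = x_in            # the operand is forwarded unchanged
        self.y_out = self.acc        # the updated partial sum moves on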

Data Flow and Synchronization

In systolic arrays, data flows rhythmically through the network of processing elements (PEs), with inputs typically entering from the array boundaries and propagating unidirectionally or bidirectionally between adjacent PEs in a pipelined manner. This systolic rhythm mimics a heartbeat, where data pulses through the array without requiring global coordination, enabling efficient reuse as operands interact locally to perform computations like inner products. Outputs generally exit from the opposite boundaries, and partial pipelining—achieved through registers within PEs—minimizes idle cycles or bubbles by staging data transfers to overlap computation and movement.

Synchronization in systolic arrays is achieved via a global clock that drives simultaneous activations across all PEs, ensuring that data transfers and computations occur in lockstep without the need for complex handshaking, thanks to the fixed, regular interconnect paths. Registers, often referred to as shadow or latching registers in implementations, temporarily hold data during transfers between PEs, preventing race conditions and maintaining timing integrity as the clock edge triggers both computation and shifting. This clock-driven approach simplifies control, as each PE follows a predefined sequence of operations synchronized to the array's pulsations, avoiding the overhead of asynchronous handshaking.

Common flow patterns include forward flows for input operands entering from one side and backward flows for partial results propagating in the reverse direction, as seen in linear arrays for matrix-vector multiplication where the two streams move in opposite directions to align the multiplications. In two-dimensional arrays, skewing techniques stagger the input data—such as delaying each row by one clock cycle—to create diagonal wavefronts that align computations across the array, ensuring that dependent operations arrive precisely when needed for pipelined execution.

Data dependencies in iterative algorithms are handled through local buffering within PEs, where small registers store intermediate results to resolve hazards without stalling the global flow, preserving the regularity of the dependence graph while enabling wavefront propagation. This local resolution leverages the array's modular structure, allowing dependencies to be mapped spatially so that data arrives just in time, thus maintaining high throughput without global intervention.
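
The following small Python helper illustrates the skewing idea: delaying row i of an input matrix by i cycles turns its columns into diagonal wavefronts, so that the values which must meet inside the array arrive together. The function name and zero-padding convention are illustrative assumptions.

def skew_rows(matrix, pad=0):
    # schedule[t][i] = value fed to array row i at cycle t, with row i delayed i cycles.
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    schedule = []
    for t in range(n_rows + n_cols - 1):
        row_inputs = []
        for i in range(n_rows):
            j = t - i                      # element of row i due at cycle t
            row_inputs.append(matrix[i][j] if 0 <= j < n_cols else pad)
        schedule.append(row_inputs)
    return schedule

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
for t, feed in enumerate(skew_rows(A)):
    print(f"cycle {t}: {feed}")
# cycle 0: [1, 0, 0]
# cycle 1: [2, 4, 0]
# cycle 2: [3, 5, 7]
# cycle 3: [0, 6, 8]
# cycle 4: [0, 0, 9]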

Goals and Benefits

Performance Objectives

Systolic arrays are designed to maximize throughput by leveraging pipelined data flows that enable a constant rate of computation across the array, particularly for linear algebra operations such as matrix multiplication. This approach targets the use of O(n²) processors to achieve O(n) time for problems like solving dense systems of n linear equations, where data is rhythmically pumped through the network to overlap input, computation, and output phases. By ensuring a steady stream of data via simple, regular communication paths, systolic arrays sustain high data rates without bottlenecks from global broadcasts or reductions.

Latency reduction forms another core objective, achieved through local interconnects that minimize communication overhead between neighboring processing elements. The goal is constant-time execution per operation across the array, as data moves only to adjacent cells in a rectangular or hexagonal layout, avoiding long-distance wires that dominate delay in conventional designs. For instance, in banded matrix-vector multiplication, this pipelining reduces overall latency from O(wn) in sequential implementations to O(2n + w), where w is the band width.

Scalability targets emphasize linear speedup proportional to array size for regular tasks, such as banded matrix operations on fixed-size arrays that handle inputs of arbitrary length. Additionally, fault tolerance is incorporated via modular replacement strategies, where redundant processing elements and bypass links allow reconfiguration to isolate and substitute faulty units, thereby maintaining performance and extending operational lifetime in large-scale implementations. Techniques like virtual arrays with 25-100% redundancy achieve linear improvements in reliability metrics, independent of array dimensions.

Resource efficiency objectives prioritize optimization under VLSI area and time constraints, favoring uniform processor layouts and local connections to facilitate dense, low-cost fabrication. Regular grid structures minimize wiring complexity and support modular expansion, aligning with the systolic principle of simplicity for high-volume production. This design enables efficient silicon utilization, where the number of processors scales with computational demands without proportional increases in interconnection overhead.

Advantages Over Conventional Architectures

Systolic arrays address key limitations of conventional von Neumann architectures, which suffer from a central bottleneck where data must shuttle between a shared memory and a single processor, limiting scalability for compute-intensive tasks. By distributing both storage and computation across a network of processing elements (PEs) connected via local links, systolic arrays enable multiple computations per memory access, with data flowing rhythmically through the system to minimize global communication overhead. This distributed approach supports massive parallelism without relying on shared buses, allowing for efficient handling of data-parallel algorithms like matrix operations.

The simplicity and regularity of systolic array designs offer significant advantages over irregular multiple-instruction multiple-data (MIMD) architectures, which often require complex control logic and irregular interconnects that complicate implementation. Systolic arrays employ a grid of identical PEs with local, nearest-neighbor connections, facilitating straightforward very-large-scale integration (VLSI) fabrication and reducing design complexity. This regularity also leads to lower power consumption, as data movement is confined to short, predictable paths, avoiding the energy overhead of long-distance wiring in conventional designs.

Performance in systolic arrays is highly predictable due to their fixed data paths and deterministic synchronization, contrasting with cache-dependent systems where execution times vary based on unpredictable memory hits and misses. The rhythmic data flow ensures near-100% utilization of PEs for well-suited algorithms, as computations proceed in lockstep without idle cycles from resource contention.

Systolic arrays achieve cost-effectiveness through their modular structure, which allows scaling by simply adding more PEs without redesigning the core architecture, making per-operation costs low for high-volume numerical processing. This scalability aligns system cost proportionally with required performance, enabling efficient deployment in specialized hardware for tasks like signal processing.

Operational Mechanics

Processing Cycles

In a systolic array, each clock cycle follows a rhythmic pattern in which processing elements (PEs) synchronously receive input from neighboring PEs, perform a local computation—typically a multiply-accumulate operation such as updating a partial sum with $C \leftarrow C + A \times B$—temporarily store the results in local registers, and then forward the data to adjacent PEs for the next cycle. This structure ensures that data flows without global communication, mimicking a heartbeat-like pulsation across the array.

In two-dimensional systolic arrays, such as those used for matrix multiplication, the computations advance via wavefront propagation, where active processing spreads diagonally from the top-left corner toward the bottom-right, progressively activating diagonals of PEs as input operands align. This diagonal wavefront enables efficient overlap of data movement and computation, with each cycle advancing the frontier by one position along the anti-diagonals.

The operational phases include pipeline filling, steady-state execution, and draining: initial cycles load input data into the array, often injecting zeros or boundary values to initialize partial results; steady state follows, where the array reaches full utilization with every PE contributing to throughput proportional to the input bandwidth; and final cycles output completed results as the data exits the array boundaries. For an $N \times N$ array, filling and draining typically require on the order of $2N$ cycles, bounding the transient overhead.

For algorithms with iterative loops, such as successive approximation methods, the systolic array handles multiple iterations by pipelining computations across cycles, allowing new iterations to enter as prior ones drain, or through data recirculation where outputs are fed back as inputs for subsequent passes. This approach maintains rhythmic data flow without halting the array, enabling continuous processing of looped workloads.
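
A tiny Python sketch of the wavefront view: at cycle t of an n×n array fed from the top-left corner, the active PEs lie on the anti-diagonal i + j = t, which sweeps across the array in 2n - 1 cycles (the function name is illustrative).

def active_wavefront(n, t):
    # PEs on the active anti-diagonal of an n x n array at cycle t (0 <= t <= 2n - 2)
    return [(i, t - i) for i in range(n) if 0 <= t - i < n]

for t in range(2 * 3 - 1):          # 3x3 array: the wavefront sweeps corner to corner
    print(t, active_wavefront(3, t))
# 0 [(0, 0)]
# 1 [(0, 1), (1, 0)]
# 2 [(0, 2), (1, 1), (2, 0)]
# 3 [(1, 2), (2, 1)]
# 4 [(2, 2)]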

Mathematical Foundations

The fundamental operation within a processing element (PE) of a systolic array is the multiply-accumulate (MAC), expressed as $c \leftarrow c + a \cdot b$, where $a$ and $b$ are scalar inputs received from neighboring PEs or boundaries, and $c$ represents the partial result passed onward after the computation. This recursive form enables pipelined accumulation across the array, with each PE performing one MAC per cycle while data flows rhythmically through local interconnects.

Systolic arrays are synthesized by mapping algorithms modeled as dependence graphs onto hardware topologies via systolic projection. A dependence graph consists of nodes representing index-domain computations and directed edges denoting dependencies with associated delays. The mapping selects a scheduling vector $\mathbf{s}^T$ to assign execution times to nodes and a projection vector $\mathbf{p}^T$ to map multi-dimensional index spaces to the array's lower-dimensional geometry, ensuring no two dependent computations are scheduled simultaneously on the same processor. For instance, in matrix-vector multiplication $\mathbf{c} = A\mathbf{b}$, where $A$ is an $n \times m$ matrix and $\mathbf{b}$ is an $m$-vector, the dependence graph captures partial sums $c_i^{(k+1)} = c_i^{(k)} + a_{ik} b_k$ for $i = 1$ to $n$ and $k = 1$ to $m$. A linear projection, such as $\mathbf{p}^T = [1 \; 0]$, maps this to a 1D array of $n$ PEs, with dependencies resolved by propagating $b_k$ and the partial $c_i$ along the chain. The validity of the mapping requires $\mathbf{s}^T \mathbf{d} \geq 1$ for each dependency vector $\mathbf{d}$, minimizing latency while respecting the systolic constraint of no global communication.

Timing in systolic arrays is governed by node scheduling functions that dictate wavefront propagation. A standard scheduling function for 2D arrays is $\psi(i,j) = i + j$, which assigns time steps along diagonals, enabling simultaneous execution of independent computations in a wavefront that advances one step per cycle. This creates rhythmic data movement, where inputs propagate without stalling, provided the processor index is derived as $(i,j) \cdot \mathbf{p}$ (modulo the array bounds). To synchronize arrivals, data skewing introduces initial delays in the input vectors; for example, in the matrix-vector case, elements of $\mathbf{b}$ are skewed by offsets proportional to row indices, ensuring that $a_{ik}$ and $b_k$ meet at PE $i$ exactly when needed. The skewing offsets are computed as $\mathbf{t} = \mathbf{s}^T \mathbf{e}$ for edge vectors $\mathbf{e}$, aligning the flows to the schedule.

Complexity analysis reveals tradeoffs between processors and time under systolic constraints. For matrix-vector multiplication with an $n \times m$ matrix on a 1D linear systolic array of size $\min(n,m)$, the latency is $T = n + m - 1$ cycles: $n - 1$ cycles to fill the array with initial partial sums and $m$ cycles for accumulation (or vice versa if $m < n$), minus 1 to account for overlapping output. This derives from the wavefront model, where input propagation takes $\max(n,m)$ steps, but pipelining overlaps computation and data movement, yielding total latency $T = n + m - 1$. The processor-time product is $p \cdot T \approx nm$, matching the $O(nm)$ arithmetic operations, but systolic designs optimize $p$ to $O(\min(n,m))$ while bounding $T = O(n + m)$, outperforming sequential $O(nm)$ time with $p = 1$. For 2D extensions like matrix multiplication, scaling to $p = O(n^2)$ yields $T = O(n)$, with the constant factor (e.g., 3 for square matrices) coming from the schedule's dot product $|\mathbf{s}^T \mathbf{p}|$. These tradeoffs are formalized by minimizing $T = \max(\mathbf{s}^T \mathbf{I})$ over the index domain $\mathbf{I}$, subject to the processor count $p = |\mathbf{P}\mathbf{I}|$, where $\mathbf{P}$ is the projection matrix.
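
The Python snippet below illustrates these conditions for the matrix-vector example, using assumed dependence vectors for the running sum and the reused input; it simply checks the validity inequalities and the T = n + m - 1 latency rather than performing the multiplication.

import numpy as np

# Index space for c_i += a_{ik} * b_k is (i, k), with i < n and k < m.
deps = np.array([[0, 1],    # the partial sum c_i is carried along k
                 [1, 0]])   # b_k is reused along i
s = np.array([1, 1])        # scheduling vector: node (i, k) fires at time i + k
p = np.array([1, 0])        # processor map: node (i, k) runs on PE i

# 1) every dependency needs at least one cycle: s . d >= 1 for each edge d
assert all(deps @ s >= 1)
# 2) two nodes on the same PE (same i, different k) must get different times,
#    i.e. s must not be orthogonal to the direction collapsed by the projection
assert s @ np.array([0, 1]) != 0

n, m = 4, 3
times = {(i, k): i + k for i in range(n) for k in range(m)}
latency = max(times.values()) - min(times.values()) + 1
assert latency == n + m - 1            # matches the T = n + m - 1 derivation above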

Examples

Polynomial Multiplication

Systolic arrays provide an efficient mapping for multiplying two polynomials of degree $n$, denoted $A(x) = \sum_{i=0}^{n} a_i x^i$ and $B(x) = \sum_{i=0}^{n} b_i x^i$, to compute the product $C(x) = A(x)B(x) = \sum_{k=0}^{2n} c_k x^k$. This is achieved using a linear array consisting of $n+1$ processing elements (PEs) arranged in a one-dimensional chain. Each PE performs local multiply-accumulate operations on incoming coefficient streams, leveraging the rhythmic data flow inherent to systolic designs to compute the convolution of the coefficient sequences without global communication.

The mapping exploits the structure of polynomial multiplication as a convolution, where the coefficients of $A(x)$ are input from one end of the array (e.g., the left), flowing rightward, while the coefficients of $B(x)$ enter from the opposite end (e.g., the right), flowing leftward. Each PE, indexed from 0 to $n$, receives one coefficient from each polynomial at the appropriate times, multiplies them to form a partial product, and accumulates it into a local register that tracks contributions shifted by the PE's position in the array. This ensures that the partial products for each degree $k$ of the result are systematically gathered across the PEs, with boundary conditions handled by preloading zeros for leading and trailing coefficients beyond degree $n$ (i.e., $a_i = 0$ for $i > n$ and similarly for $b_i$). The design maintains constant bandwidth, as each PE communicates only with its immediate neighbors.

Execution proceeds in a pipelined manner, with the first coefficient $c_0$ emerging after $n+1$ cycles and subsequent coefficients output one per cycle over the remaining $2n$ output cycles. In each cycle after the initial latency, data propagation aligns the inputs such that the next result emerges from a designated output port, typically the rightmost PE or a dedicated accumulator register. The core computation for each coefficient follows the convolution $c_k = \sum_{i=0}^{k} a_i b_{k-i}$ for $k = 0$ to $2n$, with the systolic scheduling ensuring that all terms in the sum for a given $k$ are accumulated without conflicts. Boundary conditions are managed by injecting zero values at the array ends after the initial coefficients, preventing extraneous contributions and ensuring exact degree-$2n$ results.

The architecture can be visualized as a linear chain of $n+1$ PEs, with forward data paths for the $A$ coefficients (left to right) and backward data paths for the $B$ coefficients (right to left), connected via nearest-neighbor links. Each PE includes input buffers, a multiplier, an accumulator, and output ports, with control logic to synchronize the flows and handle zero padding at the boundaries. This setup exemplifies the systolic principle of localized computation and rhythmic pulsing, enabling high throughput for polynomial operations in VLSI implementations.
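
The sketch below illustrates one workable variant in Python: B's coefficients are held stationary in the PEs while A's coefficients stream through with one-cycle gaps and partial results flow the opposite way. This produces one coefficient every two cycles rather than the fully pipelined rate described above, and all names are illustrative assumptions.

def systolic_poly_mul(a, b):
    # PE j stores b[j]; A's coefficients move right with one-cycle gaps, partial
    # results move left, so the product coefficients c_k = sum_i a_i * b_{k-i}
    # emerge from the left end one every two cycles.
    K, N = len(b), len(a)
    a_stream = []
    for coeff in a:
        a_stream.extend([coeff, 0])           # gaps keep opposing streams paired
    a_reg = [0] * K
    c_reg = [0] * K
    collected = []
    for t in range(2 * N + 2 * K + 2):
        a_in = a_stream[t] if t < len(a_stream) else 0
        a_reg = [a_in] + a_reg[:-1]           # A coefficients shift right
        collected.append(c_reg[0])            # a finished coefficient leaves left
        c_reg = c_reg[1:] + [0]               # partial results shift left
        c_reg = [c + bj * aj for c, bj, aj in zip(c_reg, b, a_reg)]
    # a degree-(N-1) times degree-(K-1) product has N + K - 1 coefficients
    return [collected[2 * k + 1] for k in range(N + K - 1)]

# (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3
assert systolic_poly_mul([1, 2], [3, 4, 5]) == [3, 10, 13, 10]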

Convolution Operations

Systolic arrays provide an efficient architecture for performing convolution operations, particularly in signal and image processing tasks involving finite impulse response (FIR) filters. The core algorithm computes the output sequence $y[n] = \sum_{k=0}^{M} h[k] \cdot x[n-k]$, where $x$ represents the input signal and $h$ is the fixed kernel of length $M+1$. This computation is mapped onto a linear systolic array comprising $M+1$ processing elements (PEs), each responsible for a specific kernel coefficient.

In this configuration, the input signal $x$ flows unidirectionally through the array, while the kernel coefficients $h$ propagate in the opposite direction to align with the sliding window. Each PE executes a multiply-accumulate operation, multiplying the arriving $x$ and $h$ values and adding the product to an incoming partial sum, which is then forwarded to the adjacent PE. This pipelined data flow ensures that the dependencies of the convolution are resolved locally within the array. Once the pipeline is filled, typically after an initial latency of $O(M)$ cycles, the array generates one output sample per clock cycle, achieving constant-time processing per sample independent of kernel size. This design is optimized for FIR filters but extends to infinite impulse response (IIR) variants by incorporating feedback paths in the PE structure to handle recursive dependencies.

For two-dimensional convolution, common in image filtering, the architecture scales to a 2D mesh of PEs, leveraging separable kernels to decompose the operation into two sequential one-dimensional passes. In the first stage, horizontal linear arrays perform row-wise convolutions across the image, producing intermediate feature maps. These intermediates then undergo column-wise convolutions in a vertical pass using the same or reconfigured arrays, effectively computing the full 2D output $y[i,j] = \sum_{p=0}^{M} \sum_{q=0}^{M} h[p,q] \cdot x[i-p, j-q]$ for a kernel of size $(M+1) \times (M+1)$. This separable approach on a 2D mesh maintains high utilization, with data streaming through rows and columns in a systolic manner to minimize global communication. Efficiency mirrors the 1D case, yielding constant time per output pixel after pipeline fill, while the mesh parallelism scales throughput with the array dimensions.
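
The following Python sketch shows the separable two-pass scheme at the algorithmic level, with ordinary 1D convolutions standing in for the horizontal and vertical systolic passes; the kernel and function names are illustrative assumptions.

import numpy as np

def separable_conv2d(image, h_row, h_col):
    # Two-pass separable 2D convolution: row-wise 1D convolutions produce an
    # intermediate map, which is then convolved column-wise. On a 2D mesh each
    # pass would run as a set of parallel linear systolic arrays.
    rows_done = np.array([np.convolve(row, h_row, mode="same") for row in image])
    cols_done = np.array([np.convolve(col, h_col, mode="same")
                          for col in rows_done.T]).T
    return cols_done

img = np.arange(25, dtype=float).reshape(5, 5)
blur = np.array([1.0, 2.0, 1.0]) / 4.0          # separable binomial kernel
out = separable_conv2d(img, blur, blur)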

Sorting Algorithms

Systolic arrays implement parallel comparison-based sorting by mapping algorithms like Batcher's odd-even merge sort or the bitonic sort onto a two-dimensional mesh structure. These algorithms construct sorting networks composed of compare-exchange operations, which are ideally suited to the regular, pipelined data flow of systolic architectures, enabling simultaneous comparisons across multiple data paths.

In the mapping, input data elements enter the array boundaries and propagate through an arrangement of processing elements (PEs) organized into sequential stages that mirror the layers of the sorting network. Each PE functions as a compare-exchange unit, receiving two data values from neighboring PEs or boundary inputs, comparing them, and swapping them if they are out of order so that smaller elements move toward one output direction (e.g., upward or leftward) while larger ones move oppositely. This localized operation ensures no global communication is needed, maintaining the systolic principle of rhythmic data pulsing without buffering delays. Synchronization occurs via the array's clock, with data advancing one stage per cycle.

The execution proceeds in a series of logarithmic stages, where the odd-even merge sort recursively divides the input into halves, sorts them, and merges via alternating odd- and even-indexed comparisons. For an n × n mesh sorting up to n² elements, the process requires log₂ n merging levels, with each level comprising up to n/2 parallel compare-exchange operations per row or column; the overall sorting completes in O((log n)²) cycles, achieving high throughput as multiple data sets can be pipelined through the network. The bitonic variant follows a similar staged structure but builds bitonic sequences (increasing then decreasing) before merging, offering comparable complexity and adaptability to the mesh topology.

Variants of these implementations employ truncated sorting networks for selection tasks, where only a partial set of stages is activated to isolate the k-th smallest element without completing the full sort. This reduces computational overhead, as the network up to the desired position suffices, leveraging the same compare-exchange PEs but with fewer cycles and potentially reconfigurable connections in the systolic mesh.
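
For illustration, the Python sketch below runs a bitonic sorting network stage by stage; each inner pass is one layer of independent compare-exchange operations, which is the work a wavefront of PEs would perform in one systolic step. It assumes a power-of-two input length, and the names are illustrative.

def bitonic_sort_stages(values):
    # Classic iterative bitonic sorting network: O((log n)^2) compare-exchange layers.
    data = list(values)
    n = len(data)
    assert n & (n - 1) == 0, "power-of-two length assumed"
    k = 2
    while k <= n:                    # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:                 # one compare-exchange layer per value of j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data

assert bitonic_sort_stages([7, 3, 9, 1, 6, 2, 8, 5]) == [1, 2, 3, 5, 6, 7, 8, 9]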

Applications

Traditional Signal Processing

Systolic arrays found early and prominent applications in traditional digital signal processing (DSP), particularly for tasks requiring high-throughput real-time computation that exceeded the capabilities of general-purpose CPUs at the time. These architectures were especially suited to operations like real-time filtering, where linear systolic arrays could efficiently implement finite impulse response (FIR) filters by pipelining data through processing elements, each performing multiply-accumulate operations on incoming streams. This design allowed for sustained high sample rates, addressing the limitations of von Neumann architectures that suffered from memory-computation bottlenecks in handling continuous data flows from sensors or audio inputs.

In audio processing, linear systolic arrays were employed to realize FIR filters, enabling low-latency equalization and filtering in real-time systems. For instance, a linear array of processing cells could compute filtered outputs every few clock cycles, supporting telephony applications where voice signals demanded precise temporal processing without delays. Similarly, two-dimensional systolic arrays extended this capability to image enhancement in early video systems, performing convolutions across pixel grids to sharpen or denoise frames, which was critical for broadcast technologies in the 1980s. These configurations leveraged the rhythmic data flow inherent to systolic designs, ensuring modular scalability for varying filter lengths or image sizes.

Fast Fourier Transform (FFT) computation also benefited from systolic arrays, with linear or hexagonal configurations mapping the Cooley-Tukey algorithm to achieve pipelined, real-time spectral analysis. Such implementations were vital for frequency-domain processing in communications, where systolic structures reduced latency compared to sequential FFT methods, enabling efficient modulation and demodulation at high data rates. In adaptive beamforming for radar and sonar systems, systolic arrays facilitated the computation of optimal weights for multiple sidelobe cancellers, enhancing spatial filtering to suppress interference and focus on target signals. This was particularly impactful in 1980s defense applications, where real-time adaptability was essential for detecting submerged threats or airborne objects.

The adoption of systolic arrays surged in the 1980s within the defense and telecommunications sectors, driven by the need for specialized hardware to handle intensive DSP workloads. For example, systolic chips were developed for speech recognition tasks, using algorithms mapped to linear arrays for connected-word recognition. Projects like wafer-scale systolic integrations supported radar processing in defense initiatives, while telecom firms explored them for high-speed channel equalization. These early realizations underscored systolic arrays' role in overcoming CPU limitations at sample rates exceeding millions of samples per second, paving the way for dedicated DSP silicon.

Modern Machine Learning Accelerators

Systolic arrays play a central role in modern machine learning accelerators by efficiently handling the matrix multiplications and general matrix multiply (GEMM) operations that dominate deep learning inference and training. These architectures enable high-throughput computation with minimal data movement, addressing the bottlenecks in conventional processors. For instance, Google's Tensor Processing Units (TPUs), introduced in 2016, leverage 2D systolic arrays to perform tensor operations at scale in datacenters. The first-generation TPU (TPUv1) features a single 256×256 systolic array of multiply-accumulate (MAC) units optimized for 8-bit integer operations, delivering up to 92 tera-operations per second while consuming about 40 W, primarily for inference workloads. Subsequent versions expanded this design: TPUv2 and TPUv3 incorporate multiple 128×128 systolic arrays per chip (equivalent in compute capacity to 256×256), supporting bfloat16 precision for training and achieving higher energy efficiency through pipelined dataflow. TPUv4 (2021) builds on this with four 128×128 arrays per chip, adding support for int8 and structured sparsity in embeddings to reduce compute for pruned models, while maintaining compatibility with low-precision formats like FP16 and INT8 for broader acceleration. Later generations continued these advancements: TPUv5 (2023) nearly doubled v4 performance with enhanced systolic arrays; TPUv6e (Trillium, GA May 2025) offers improved efficiency for large-scale training and inference; and TPUv7 (Ironwood, announced April 2025, GA Q4 2025) provides over 4x the performance of Trillium, with advanced systolic designs aimed at generative AI. These optimizations enable TPUs to process large-scale models, such as those in Google's search and translation services, with high utilization on dense GEMM tasks.

Beyond Google, other vendors have integrated systolic-like units into their AI accelerators. Apple's Neural Engine, debuting in the A11 Bionic chip in 2017 and evolving through the M-series processors, employs units based on systolic arrays to accelerate on-device inference for tasks like image recognition and natural language processing. Performance has scaled significantly, from 15.8 TOPS on the A15 (2021) to 38 TOPS on the M4 (2024), with the M5 series (October 2025) featuring an enhanced 16-core Neural Engine for even greater efficiency in mobile AI deployment. Similarly, Huawei's Ascend series uses DaVinci cores with Cube units—systolic array-based matrix multiply accelerators—alongside vector and scalar processing for both training and inference, with recent models like the Ascend 910C (2024/2025) emphasizing scalability in datacenter environments and the upcoming Ascend 950 (2026) promising further petaflop-scale performance. In the 2020s, systolic arrays have also seen renewed adoption in FPGA and ASIC designs for edge AI, where reconfigurable fabrics like those in the Versal AI Edge Series Gen 2 (2023) implement sparse systolic slices to handle resource-constrained inference on devices such as IoT sensors, offering flexibility over fixed ASICs.

Recent advancements focus on hybrid systolic designs that combine arrays with other accelerator elements, such as vector units or sparse cores, to enhance versatility for diverse ML workloads. For example, hybrid systolic architectures integrate reconfigurable dataflows to support both dense and sparse operations, trimming overheads by up to 8% in edge inference while boosting throughput by 57% for models like transformers. These hybrids prioritize energy efficiency for mobile and datacenter applications, enabling systolic arrays to achieve 2.3× energy savings over prior accelerators in low-power scenarios, thus supporting sustainable scaling of AI deployment.

Implementations and Debates

Hardware Realizations

Early prototypes of systolic arrays emerged in the mid-1980s, with Carnegie Mellon University's Warp project delivering a notable example. A prototype of the Warp system, featuring a 2-cell linear systolic array, was completed in June 1985, with the full 10-cell version completed in early 1986; each cell was capable of 10 MFLOPS, and the machine served as a programmable attached processor for high-performance signal and image processing tasks. This prototype emphasized systolic data flow to achieve efficient pipelined computations, demonstrating feasibility for attached processors in research environments.

Building on Warp, the iWarp project in the 1990s advanced scalable multicomputer architectures based on systolic principles. Developed collaboratively by Carnegie Mellon and Intel, iWarp integrated a single-chip component combining a processor core, local memory, and communication support into a mesh-connected system; the first 64-cell production systems were delivered in 1990, enabling high-speed parallel computation for scientific applications. These systems supported systolic communication patterns across nodes, scaling to hundreds of cells while maintaining low-latency, rhythmic data flow.

Commercial realizations in the 1980s and 1990s included transputer-based systems and specialized ASICs. Transputers, developed by INMOS, were adapted into systolic arrays for parallel processing, such as in multilayered implementations where one- and two-dimensional transputer configurations achieved efficient data pipelining for model reduction and feedback control tasks. Additionally, systolic ASICs, like those for image processing, emerged, with designs leveraging orthogonal interconnects to handle convolutions and correlations in dedicated chips. Intel's iWarp, commercialized from the academic prototype, represented a key ASIC evolution, packaging systolic elements into scalable, off-the-shelf multicomputers.

In modern hardware, systolic arrays have seen widespread adoption in AI accelerators, exemplified by Google's Tensor Processing Unit (TPU) series since 2015. The first TPU featured a 256×256 systolic array of 8-bit multiply-accumulate units, optimized for matrix multiplications in neural networks, delivering up to 92 tera-operations per second while minimizing data movement. Subsequent iterations, such as TPU v3 with dual 128×128 arrays (2018), and later versions up to the seventh-generation TPU v7 (Ironwood, announced November 2025), have further scaled systolic array designs for cloud-scale and advanced AI workloads, integrating systolic computation with host interfaces for tensor operations. Recent research has explored scalable systolic arrays for sparse neural networks and energy-efficient edge devices, with prototypes demonstrating up to 68% power savings through optimized dataflows (as of 2024). For prototyping, open-source FPGA implementations have proliferated, such as parametric systolic array generators and hardened designs with error detection, enabling customizable prototypes for DNN acceleration.

Despite these advances, hardware realizations face challenges in fabrication and integration. Large systolic arrays suffer from reduced yields due to increased defect probabilities in dense interconnects and processing elements, necessitating fault-tolerant designs with redundant elements to maintain reliability. Integration with GPUs or CPUs introduces overheads from data transfer latencies and mismatched paradigms, often requiring hybrid architectures to balance systolic efficiency with general-purpose flexibility.

Classification Controversies

The classification of systolic arrays has sparked ongoing debates within the parallel computing literature, primarily revolving around strict versus loose interpretations of the core definition. In its original formulation, a systolic array is defined as a homogeneous network of processors that rhythmically compute and pass data through local interconnections, with no global control lines and synchronous operation to enable pipelined throughput without additional logic overhead. This strict view, emphasizing pure systolic data flow where data "pumps" through the array like blood through the heart, excludes designs incorporating hybrid control mechanisms, such as centralized scheduling or external intervention, which later extensions have introduced to enhance flexibility.

Controversies arise over the inclusion of reconfigurable arrays, which allow dynamic reconfiguration of processing elements (PEs) during operation, potentially violating the rhythmic, fixed-flow principle of pure systolic designs. Similarly, arrays employing global clocks for synchronization face criticism, as clock skew and distribution challenges in large-scale implementations can disrupt the intended local, self-timed rhythm, leading some researchers to question their systolic purity. Disputes extend to emerging contexts like neuromorphic computing, where systolic-inspired structures simulate neural dynamics but incorporate non-local feedback loops, blurring the boundaries of traditional definitions. In quantum computing, analogous array-based approaches for quantum circuit simulation have been proposed, yet they often hybridize with classical control, prompting debates on whether such adaptations retain the systolic essence.

Key arguments trace back to the early literature, where papers questioned the fit of non-homogeneous PEs—varying processor functions within the array—arguing that they compromise the simplicity and regularity central to systolic efficiency. For instance, Flynn's taxonomy controversially places systolic arrays under MISD (multiple instruction, single data), but this is widely critiqued as problematic, since the data is modified at each node, aligning more closely with SIMD-like behavior and rendering the MISD label "nonsensical" in refined taxonomies. Modern perspectives, such as those on Google's Tensor Processing Units (TPUs), describe them as "systolic-inspired" rather than purely systolic, due to integrated host interfaces and hybrid data management that deviate from unadulterated local flow.

These debates have significant implications for benchmarking and comparison in computer architecture, as ambiguous boundaries complicate comparisons across architectures and hinder standardized evaluation. The issues remain unresolved amid evolving hardware paradigms, with no consensus on delineating pure systolic designs from their extensions.
