Instruction pipelining

from Wikipedia

Basic five-stage pipeline

Instr. No. |          Clock cycle
           |  1    2    3    4    5    6    7
     1     |  IF   ID   EX   MEM  WB
     2     |       IF   ID   EX   MEM  WB
     3     |            IF   ID   EX   MEM  WB
     4     |                 IF   ID   EX   MEM
     5     |                      IF   ID   EX

(IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back)

In the fourth clock cycle, the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.
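The schedule in the table above can be sketched programmatically. This is an illustrative model, not from the article; `stage_at` is a hypothetical helper that returns the stage an instruction occupies in a given cycle of an ideal five-stage pipeline.

```python
# Illustrative sketch of the ideal five-stage schedule shown in the table.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(instr_no, cycle):
    """Stage that instruction `instr_no` (1-based) occupies in clock `cycle`
    (1-based), or None if it has not yet entered or has left the pipeline."""
    idx = cycle - instr_no  # instruction i enters IF in cycle i
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# Reproduce the table row by row: in cycle 4, instruction 1 is in MEM
# and instruction 4 has just been fetched.
for cycle in range(1, 8):
    print(f"cycle {cycle}:", [stage_at(i, cycle) or "--" for i in range(1, 6)])
```

Each column of the printed output matches one row of the table above.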

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units with different parts of instructions processed in parallel.

Concept and motivation

In a pipelined computer, instructions travel through the central processing unit (CPU) in stages. For example, it might have one stage for each step of the von Neumann cycle: Fetch the instruction, fetch the operands, do the instruction, write the results. A pipelined computer usually has "pipeline registers" after each stage. These store information from the instruction and calculations so that the logic gates of the next stage can do the next step.

This arrangement lets the CPU complete an instruction on each clock cycle. It is common for even-numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge. This allows more CPU throughput than a multicycle computer at a given clock rate, but may increase latency due to the added overhead of the pipelining process itself. Also, even though the electronic logic has a fixed maximum speed, a pipelined computer can be made faster or slower by varying the number of stages in the pipeline. With more stages, each stage does less work, and so the stage has fewer delays from the logic gates and could run at a higher clock rate.

A pipelined model of computer is often the most economical, when cost is measured as logic gates per instruction per second. At each instant, an instruction is in only one pipeline stage, and on average, a pipeline stage is less costly than a multicycle computer. Also, when made well, most of the pipelined computer's logic is in use most of the time. In contrast, out of order computers usually have large amounts of idle logic at any given instant. Similar calculations usually show that a pipelined computer uses less energy per instruction.

However, a pipelined computer is usually more complex and more costly than a comparable multicycle computer. It typically has more logic gates, registers and a more complex control unit. In a like way, it might use more total energy, while using less energy per instruction. Out of order CPUs can usually do more instructions per second because they can do several instructions at once.

In a pipelined computer, the control unit arranges for the flow to start, continue, and stop as a program commands. The instruction data is usually passed in pipeline registers from one stage to the next, with a somewhat separated piece of control logic for each stage. The control unit also assures that the instruction in each stage does not harm the operation of instructions in other stages. For example, if two stages must use the same piece of data, the control logic assures that the uses are done in the correct sequence.

When operating efficiently, a pipelined computer will have an instruction in each stage. It is then working on all of those instructions at the same time. It can finish about one instruction for each cycle of its clock. But when a program switches to a different sequence of instructions, the pipeline sometimes must discard the data in process and restart. This is called a "stall."

Much of the design of a pipelined computer prevents interference between the stages and reduces stalls.

Number of steps

The number of dependent steps varies with the machine architecture.

As the pipeline is made "deeper" (with a greater number of dependent steps), a given step can be implemented with simpler circuitry, which may let the processor clock run faster.[3] Such pipelines may be called superpipelines.[4]

A processor is said to be fully pipelined if it can fetch an instruction on every cycle. Thus, if some instructions or conditions require delays that inhibit fetching new instructions, the processor is not fully pipelined.

History

Seminal uses of pipelining were in the ILLIAC II project and the IBM Stretch project, though a simple version was used earlier in the Z1 in 1939 and the Z3 in 1941.[5]

Pipelining began in earnest in the late 1970s in supercomputers such as vector processors and array processors.[citation needed] One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect, Seymour Cray, later headed Cray Research. Cray developed the X-MP line of supercomputers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies added parallelism (several pipelined functions working in parallel), developed by Roger Chen. In 1984, Star Technologies added the pipelined divide circuit developed by James Bradley.

Pipelining was not limited to supercomputers. In 1976, the Amdahl Corporation's 470 series general purpose mainframe had a 7-step pipeline, and a patented branch prediction circuit.[citation needed] A central element in RISC design philosophy,[6] by the mid-1980s, pipelining was also being introduced into traditional CISC architectures by their developers.[7]

Hazards

The model of sequential execution assumes that each instruction completes before the next one begins; this assumption does not hold on a pipelined processor. A situation in which this mismatch could produce an incorrect result is known as a hazard. Imagine the following two register instructions to a hypothetical processor:

1: add 1 to R5
2: copy R5 to R6

If the processor has the 5 steps listed in the initial illustration (the 'Basic five-stage pipeline' at the start of the article), instruction 1 would be fetched at time t1 and its execution would be complete at t5. Instruction 2 would be fetched at t2 and would be complete at t6. The first instruction might deposit the incremented number into R5 as its fifth step (register write back) at t5. But the second instruction might get the number from R5 (to copy to R6) in its second step (instruction decode and register fetch) at time t3. It seems that the first instruction would not have incremented the value by then. The above code invokes a hazard.
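The timing argument above can be made concrete with a small sketch (illustrative, not from the article): in a five-stage pipeline without forwarding, the second instruction reads R5 in its decode step two cycles before the first instruction writes R5 back.

```python
# Sketch of the hazard timing: instruction i enters stage s (0-based)
# at cycle i + s, assuming one instruction issues per cycle.
def stage_time(instr_no, stage_index):
    return instr_no + stage_index

ID, WB = 1, 4                     # decode/register fetch, register write back
write_r5 = stage_time(1, WB)      # t5: "add 1 to R5" writes R5 back
read_r5 = stage_time(2, ID)       # t3: "copy R5 to R6" reads R5 in decode
print(read_r5 < write_r5)         # True: the read happens too early (RAW hazard)
```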

Writing computer programs in a compiled language might not raise these concerns, as the compiler could be designed to generate machine code that avoids hazards.

Workarounds

In some early DSP and RISC processors, the documentation advises programmers to avoid such dependencies in adjacent and nearly adjacent instructions (called delay slots), or declares that the second instruction uses an old value rather than the desired value (in the example above, the processor might counter-intuitively copy the unincremented value), or declares that the value it uses is undefined. The programmer may have unrelated work that the processor can do in the meantime; or, to ensure correct results, the programmer may insert NOPs into the code, partly negating the advantages of pipelining.

Solutions

Pipelined processors commonly use three techniques to work as expected when the programmer assumes that each instruction completes before the next one begins:

  • The pipeline could stall, or cease scheduling new instructions until the required values are available. This results in empty slots in the pipeline, or bubbles, in which no work is performed.
  • An additional data path can be added that routes a computed value to a future instruction elsewhere in the pipeline before the instruction that produced it has been fully retired, a process called operand forwarding.[8][9]
  • The processor can locate other instructions which are not dependent on the current ones and which can be immediately executed without hazards, an optimization known as out-of-order execution.
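The cycle cost of the first two options can be compared with a rough model (an illustrative sketch, not from the article; it assumes a classic five-stage pipeline in which each unresolved dependence costs two bubble cycles when stalling and none when forwarding handles it).

```python
# Toy cost model: deps[i] is True when instruction i+1 depends on instruction i.
def total_cycles(deps, stalls_per_dep):
    n = len(deps) + 1            # number of instructions
    base = n + 4                 # n instructions through 5 stages, ideal case
    return base + stalls_per_dep * sum(deps)

deps = [True, False, True]       # two dependences in a four-instruction sequence
print(total_cycles(deps, 2))     # stalling: two bubbles per dependence -> 12
print(total_cycles(deps, 0))     # forwarding removes the bubbles -> 8
```

The gap between the two results is exactly the bubble cycles that operand forwarding avoids.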

Branches

A branch out of the normal instruction sequence often involves a hazard. Unless the processor can give effect to the branch in a single time cycle, the pipeline will continue fetching instructions sequentially. Such instructions cannot be allowed to take effect because the programmer has diverted control to another part of the program.

A conditional branch is even more problematic. The processor may or may not branch, depending on a calculation that has not yet occurred. Various processors may stall, may attempt branch prediction, and may be able to begin to execute two different program sequences (eager execution), each assuming the branch is or is not taken, discarding all work that pertains to the incorrect guess.[a]

A processor with an implementation of branch prediction that usually makes correct predictions can minimize the performance penalty from branching. However, if branches are predicted poorly, it may create more work for the processor, such as flushing from the pipeline the incorrect code path that has begun execution before resuming execution at the correct location.

Programs written for a pipelined processor deliberately avoid branching to minimize possible loss of speed. For example, the programmer can handle the usual case with sequential execution and branch only on detecting unusual cases. Using programs such as gcov to analyze code coverage lets the programmer measure how often particular branches are actually executed and gain insight with which to optimize the code. In some cases, a programmer can handle both the usual case and unusual case with branch-free code.
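As a small illustration of branch-free code, a conditional sign test can be replaced by arithmetic so that no branch exists to be mispredicted. This is a sketch using a common bit-twiddling idiom; real compilers and instruction sets provide their own branchless forms (conditional moves, predication).

```python
# Branch-free absolute value: a sign mask replaces the `if x < 0` test.
def branchless_abs(x, bits=64):
    mask = x >> (bits - 1)       # all ones if x is negative, all zeros otherwise
    return (x ^ mask) - mask     # negates x exactly when mask is all ones

print(branchless_abs(-7))        # 7
print(branchless_abs(7))         # 7
```

Python's `>>` is an arithmetic shift on negative integers, so the mask trick behaves as it would on two's-complement hardware.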

Special situations

Self-modifying programs
The technique of self-modifying code can be problematic on a pipelined processor. In this technique, one of the effects of a program is to modify its own upcoming instructions. If the processor has an instruction cache, the original instruction may already have been copied into a prefetch input queue and the modification will not take effect. Some processors such as the Zilog Z280 can configure their on-chip cache memories for data-only fetches, or as part of their ordinary memory address space, and avoid such difficulties with self-modifying instructions.
Uninterruptible instructions
An instruction may be uninterruptible to ensure its atomicity, such as when it swaps two items. A sequential processor permits interrupts between instructions, but a pipelined processor overlaps instructions, so executing an uninterruptible instruction renders portions of ordinary instructions uninterruptible too. The Cyrix coma bug would hang a single-core system using an infinite loop in which an uninterruptible instruction was always in the pipeline.

Design considerations

Speed
Pipelining keeps all portions of the processor occupied and increases the amount of useful work the processor can do in a given time. Pipelining typically reduces the processor's cycle time and increases the throughput of instructions. The speed advantage is diminished to the extent that execution encounters hazards that require execution to slow below its ideal rate. A non-pipelined processor executes only a single instruction at a time, so the start of the next instruction is delayed unconditionally, not just when a hazard arises.
A pipelined processor's need to organize all its work into modular steps may require the duplication of registers, which increases the latency of some instructions.
Economy
By making each dependent step simpler, pipelining can enable complex operations more economically than adding complex circuitry, such as for numerical calculations. However, a processor that declines to pursue increased speed with pipelining may be simpler and cheaper to manufacture.
Predictability
Compared to environments where the programmer needs to avoid or work around hazards, use of a non-pipelined processor may make it easier to program and to train programmers. The non-pipelined processor also makes it easier to predict the exact timing of a given sequence of instructions.

Illustrated example

To the right is a generic pipeline with four stages: fetch, decode, execute and write-back. The top gray box is the list of instructions waiting to be executed, the bottom gray box is the list of instructions that have had their execution completed, and the middle white box is the pipeline.

The execution is as follows:

Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
Clock Execution
0
  • Four instructions are waiting to be executed
1
  • The green instruction is fetched from memory
2
  • The green instruction is decoded
  • The purple instruction is fetched from memory
3
  • The green instruction is executed (actual operation is performed)
  • The purple instruction is decoded
  • The blue instruction is fetched
4
  • The green instruction's results are written back to the register file or memory
  • The purple instruction is executed
  • The blue instruction is decoded
  • The red instruction is fetched
5
  • The execution of green instruction is completed
  • The purple instruction is written back
  • The blue instruction is executed
  • The red instruction is decoded
6
  • The execution of purple instruction is completed
  • The blue instruction is written back
  • The red instruction is executed
7
  • The execution of blue instruction is completed
  • The red instruction is written back
8
  • The execution of red instruction is completed
9
  • The execution of all four instructions is completed

Pipeline bubble

A bubble in cycle 3 delays execution.

A pipelined processor may deal with hazards by stalling and creating a bubble in the pipeline, resulting in one or more cycles in which nothing useful happens.

In the illustration at right, in cycle 3, the processor cannot decode the purple instruction, perhaps because the processor determines that decoding depends on results produced by the execution of the green instruction. The green instruction can proceed to the Execute stage and then to the Write-back stage as scheduled, but the purple instruction is stalled for one cycle at the Fetch stage. The blue instruction, which was due to be fetched during cycle 3, is stalled for one cycle, as is the red instruction after it.

Because of the bubble (the blue ovals in the illustration), the processor's Decode circuitry is idle during cycle 3. Its Execute circuitry is idle during cycle 4 and its Write-back circuitry is idle during cycle 5.

When the bubble moves out of the pipeline (at cycle 6), normal execution resumes. But everything now is one cycle late. It will take 8 cycles (cycle 1 through 8) rather than 7 to completely execute the four instructions shown in colors.[b]
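The cycle counts above follow a simple formula, sketched here for illustration: n instructions through a k-stage pipeline take n + k - 1 cycles, plus one cycle per bubble.

```python
# Cycle count for n instructions on a k-stage pipeline with some bubbles.
def cycles(n, k, bubbles=0):
    return n + k - 1 + bubbles

print(cycles(4, 4))              # 7: the ideal four-instruction schedule
print(cycles(4, 4, bubbles=1))   # 8: the schedule with the one-cycle bubble
```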

from Grokipedia
Instruction pipelining is a fundamental technique in computer architecture that enhances processor performance by dividing the execution of an instruction into sequential stages (typically fetch, decode, execute, memory access, and write-back), allowing multiple instructions to overlap in execution as each stage processes a different instruction simultaneously, akin to an assembly line. This approach aims to achieve an ideal cycles per instruction (CPI) of 1, improving throughput without necessarily increasing clock speed, though it requires uniform instruction formats for optimal efficiency, as seen in reduced instruction set computing (RISC) designs. Originating in the late 1950s, pipelining was first implemented in general-purpose computers like the IBM 7030 Stretch, which introduced overlapped execution to meet ambitious performance goals, marking a shift from single-cycle processors. The technique gained prominence in the 1980s with RISC architectures, such as the MIPS R3000 with its five-stage pipeline, which refined compiler-scheduled operations to minimize stalls from data dependencies, control hazards like branches, and structural conflicts. Key challenges include pipeline hazards that cause stalls or flushes, addressed through strategies like branch prediction (achieving up to 90% accuracy in dynamic schemes), instruction reordering, and delayed branching. Advancements evolved into deeper pipelines (superpipelining) for higher frequencies, multiple parallel pipelines (superscalar execution), and out-of-order processing with reservation stations, enabling modern CPUs to sustain throughput well beyond simple overlap. Despite these benefits, deeper pipelines increase latency for individual instructions and complicate exception handling, requiring precise interrupt mechanisms to maintain program correctness. Overall, instruction pipelining remains a cornerstone of processor design, underpinning the efficiency of processors from embedded systems to supercomputers.

Fundamentals

Concept and Motivation

Instruction pipelining is a technique for implementing instruction-level parallelism within a processor by decomposing the execution of each instruction into a series of sequential stages, such as fetch, decode, execute, and write-back, which can operate concurrently on different instructions. This approach allows overlapping of instruction processing, where the hardware resources dedicated to each stage are utilized more efficiently by handling portions of multiple instructions simultaneously. The concept draws an analogy to an industrial assembly line, where specialized units perform distinct tasks on successive items in an overlapped manner, preventing idle time and increasing overall production rate without requiring duplicate equipment for each item. In a processor, this means that while one instruction completes its execute stage, the next may be in decode, and another in fetch, thereby sustaining continuous operation across the pipeline stages. The primary motivation for instruction pipelining is to enhance processor throughput, enabling the completion of approximately one instruction per clock cycle in an ideal steady-state scenario, which corresponds to an instructions per cycle (IPC) value of about 1. Without pipelining, the time to execute an instruction is the sum of all stage times, $T = \sum_i t_i$; with pipelining, the effective time per instruction approaches the duration of the longest stage, $t_{\max}$, yielding a throughput improvement factor of up to $n$ for $n$ balanced stages. The speedup can be quantified as the ratio of non-pipelined execution time to pipelined execution time for a program, $S = \frac{\text{CPI}_{\text{non}} \times \text{cycle}_{\text{non}}}{\text{CPI}_{\text{pip}} \times \text{cycle}_{\text{pip}}}$, where $\text{CPI}_{\text{pip}} \approx 1$ in steady state and the cycle times adjust for stage division.
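The speedup bound above can be checked with a short worked example (illustrative numbers only): splitting a 10 ns instruction into five balanced 2 ns stages approaches a fivefold throughput gain, while an unbalanced stage drags the whole pipeline down to its pace.

```python
# Ideal pipeline speedup: sequential time over the slowest-stage time,
# since the clock period must accommodate the longest stage.
def ideal_speedup(stage_times_ns):
    return sum(stage_times_ns) / max(stage_times_ns)

print(ideal_speedup([2, 2, 2, 2, 2]))  # 5.0: balanced stages reach the full factor n
print(ideal_speedup([2, 2, 4, 2, 2]))  # 3.0: one slow stage limits the whole pipeline
```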

Pipeline Stages

The classic five-stage pipeline in reduced instruction set computer (RISC) architectures divides instruction execution into sequential phases to enable overlapping operations and improve throughput. These stages are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB), with each stage handling a specific portion of the instruction lifecycle while instructions advance synchronously through the pipeline. In the IF stage, the processor retrieves the instruction from instruction memory, using the program counter (PC) to generate the fetch address. The ID stage decodes the fetched instruction to interpret the opcode and operands, reading the necessary values from the register file while generating control signals for subsequent stages. During the EX stage, the arithmetic logic unit (ALU) performs computations such as addition, subtraction, or address calculation based on the decoded operation. The MEM stage facilitates data memory operations, allowing load instructions to read from memory or store instructions to write to it. Finally, the WB stage writes the computed results, either from the ALU or from memory, back to the register file for use by future instructions. Pipeline designs vary in stage count to balance simplicity, power efficiency, and performance targets. Shallow pipelines with 3 to 5 stages are common in simple or embedded processors, such as early RISC implementations, where reduced complexity supports lower power consumption and easier design. In contrast, deeper pipelines with 10 to 30 stages appear in high-performance processors, like modern superscalar CPUs, to enable higher clock frequencies by shortening individual stage delays, though this increases latency for single instructions and complexity in hazard management. Achieving balanced latencies across stages is crucial, as the clock cycle time is dictated by the slowest stage, limiting overall frequency if imbalances exist.
Uneven stage delays reduce efficiency by forcing the entire pipeline to operate at the pace of the bottleneck, potentially lowering achievable clock speeds and throughput. The MIPS R2000, released in 1986, exemplified this approach with a balanced five-stage pipeline that supported clock speeds up to 16.67 MHz, making it one of the fastest commercial microprocessors of its era and establishing a model for subsequent RISC designs.

Historical Development

Early Implementations

The conceptual roots of instruction pipelining trace back to the early mainframe era, when designers began exploring techniques for overlapping instruction execution to improve throughput despite hardware limitations. A pivotal milestone in this development was the IBM 7030 Stretch, delivered in 1961, which introduced overlapped instruction execution in a general-purpose computer to achieve high performance for scientific computing, though it fell short of initial speed goals due to technological challenges. Building on these advances, the CDC 6600, the first commercial supercomputer, released in 1964 and designed by Seymour Cray at Control Data Corporation, pioneered aggressive overlap of instruction execution across its parallel functional units, enabling scalar operations to proceed concurrently for enhanced performance in numerical workloads. The CDC 6600 featured 10 independent functional units for arithmetic and logic operations, with memory access managed by its peripheral processors, that operated concurrently, allowing multiple instructions to progress through computation simultaneously while using a scoreboard to manage dependencies. This design achieved a throughput of approximately 3 million instructions per second (MIPS), a remarkable feat for the time, by sustaining high utilization across the units despite a 10 MHz clock. Following these advances, the IBM System/360 Model 91, introduced in 1967, represented another early commercial adoption of instruction pipelining tailored for scientific applications. Optimized for floating-point intensive tasks in fields such as physics simulations, it employed a multi-stage pipeline that overlapped instruction fetch, decode, operand addressing, operand fetch, and execution, with a base cycle time of 60 nanoseconds. The design allowed the instruction unit to issue up to 0.8 instructions per cycle, surging ahead of execution to buffer operations and achieve concurrency, resulting in performance up to 100 times that of the earlier IBM 7090 for certain floating-point benchmarks.
This approach was particularly effective for workloads requiring rapid handling of complex arithmetic, though it introduced challenges like imprecise interrupts due to the depth of overlap. Early pipelined designs like these were inherently constrained by mid-20th-century technology, typically limited to 3-4 pipeline stages owing to slow core memory access times (around 1-2 microseconds) and the complexity of discrete transistor or early integrated-circuit logic. These limitations meant that pipelines could not be deepened without risking timing errors from propagation delays, and designers prioritized reliability over aggressive overlap, often resulting in simpler fetch-execute structures rather than the deeper pipelines of later decades. Despite these hurdles, such implementations demonstrated pipelining's potential to boost instruction throughput in high-performance computing environments.

Modern Evolution

The rise of Reduced Instruction Set Computing (RISC) architectures in the 1980s significantly advanced instruction pipelining by emphasizing simple, uniform instructions that facilitated efficient multi-stage designs. The MIPS R2000, introduced in 1985, and its successor the R3000 in 1988, popularized a clean five-stage pipeline consisting of instruction fetch, decode, execute, memory access, and writeback stages, which minimized interlocks and maximized throughput. This design influenced subsequent RISC implementations, including the ARM architecture, by demonstrating how streamlined pipelines could achieve high performance with low complexity and power consumption. In parallel, complex instruction set computing (CISC) processors like those in the x86 family pursued deeper pipelines to attain gigahertz clock speeds, though at the cost of increased complexity. The Pentium 4, launched in 2000, featured a 20-stage pipeline that expanded to 31 stages in the Prescott variant by 2004, enabling higher frequencies but amplifying branch misprediction penalties and power demands. By the mid-2000s, Intel shifted toward balanced designs in the Core microarchitecture series, reducing pipeline depth to 14 stages to improve energy efficiency and recovery from pipeline flushes. Contemporary trends through 2025 reflect a move away from extreme pipeline depths toward wider execution units and improved branch prediction, constrained by power walls that limit clock scaling. ARM's Cortex-A series, such as the A78 (2020) and subsequent A715 (2022), employs 13- to 15-stage pipelines, prioritizing balanced performance for mobile and edge devices over sheer depth. Similarly, Apple's M1 processor (2020) utilizes an out-of-order pipeline in its high-performance cores, relying on a large out-of-order window and wide issue widths to deliver superior single-threaded performance without excessive depth.
This evolution stems from power-performance models showing that deeper pipelines beyond 15-20 stages yield diminishing returns due to higher leakage and dynamic power, favoring greater superscalar width instead. Historically, pipeline depths grew from approximately four stages in early designs to over 20 in the Pentium 4 era, driven by the pursuit of clock frequency, but now stabilize at 10-15 stages to optimize recovery times and overall efficiency in multicore environments.

Pipeline Hazards

Types of Hazards

In pipelined processors, hazards represent situations where the pipeline's assumption of independent instruction execution is violated, potentially leading to incorrect results or stalls if not addressed. These disruptions arise because instructions in different stages may conflict in their resource usage or data dependencies, preventing the next instruction from proceeding in its scheduled clock cycle. Structural hazards occur when multiple instructions require the same hardware simultaneously, but the resource cannot support concurrent access. A classic example is a single memory unit shared between the instruction fetch (IF) stage and the memory access (MEM) stage, where one instruction is fetching while another is loading or storing data. This conflict forces the pipeline to stall until the resource becomes available. Data hazards stem from dependencies between instructions where the result of one instruction is needed by a subsequent one before it is fully available. They are classified into three types based on the order of read and write operations: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). RAW hazards, the most common, happen when an instruction reads a register or memory location after a previous instruction has written to it but before the write completes; for instance, in the sequence add $1, $2, $3 followed by sub $4, $1, $5, the execution (EX) stage of the subtraction requires the write-back (WB) result of the addition, which is still pending. WAR hazards arise when an instruction writes to a location before a prior instruction has read from it, potentially overwriting a value needed by the earlier instruction. WAW hazards occur when multiple instructions write to the same location, risking out-of-order updates that could alter the final value.
In in-order pipelines, WAR and WAW hazards are rare or impossible because the fixed sequential execution order ensures that reads and writes to registers occur predictably, without later instructions overtaking earlier ones. Control hazards emerge from instructions that alter the program counter (PC), such as branches or jumps, introducing uncertainty about the correct execution path. When a conditional branch is encountered, subsequent instructions are fetched assuming the branch is not taken, but if it is taken, those fetched instructions are incorrect, leading to a pipeline flush. This uncertainty delays resolution until the branch condition is evaluated, typically in later stages.
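The three data-hazard classes can be expressed as simple set tests on the registers each instruction reads and writes. This is an illustrative sketch; `classify` is a hypothetical helper, not part of any real pipeline.

```python
# Classify the data hazards between two instructions, each given as a
# (reads, writes) pair of register-name sets, with `first` issued earlier.
def classify(first, second):
    hazards = []
    if first[1] & second[0]:
        hazards.append("RAW")   # second reads what first writes
    if first[0] & second[1]:
        hazards.append("WAR")   # second writes what first reads
    if first[1] & second[1]:
        hazards.append("WAW")   # both write the same location
    return hazards

add = ({"$2", "$3"}, {"$1"})    # add $1, $2, $3
sub = ({"$1", "$5"}, {"$4"})    # sub $4, $1, $5
print(classify(add, sub))       # ['RAW']: sub reads $1 before add's write-back
```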

Resolution Techniques

Hazard detection in pipelined processors primarily occurs in the instruction decode (ID) stage through dedicated hardware units that identify potential hazards by comparing register specifiers. For read-after-write (RAW) data hazards, comparators check whether the destination register of a prior instruction (in the execute or memory stages) matches the source register of the current instruction in the ID stage, such as verifying whether EX/MEM.RegisterRd equals ID/EX.RegisterRs, with similar conditions for MEM/WB dependencies. This early detection allows the pipeline control logic to initiate resolution mechanisms before the hazard propagates, simplifying overall management since all checks are centralized in the ID stage. One common hardware resolution technique is stalling the pipeline, which inserts no-operation (NOP) instructions, or bubbles, to delay dependent instructions until the required data is available. When a hazard is detected, particularly a load-use hazard where a load instruction's result is needed immediately by the next instruction, the hazard detection unit prevents the instruction fetch (IF) and ID stages from advancing by holding the program counter (PC) and flushing control signals to zero in the ID/EX pipeline register, effectively propagating a bubble through subsequent stages. This approach ensures correctness but incurs a performance penalty, typically a one-cycle stall for load-use cases in a classic five-stage pipeline, as the dependent instruction remains in ID while the load completes memory access and write-back. Forwarding, also known as bypassing, addresses many hazards more efficiently by using multiplexers (muxes) to route intermediate results directly from later pipeline stages back to earlier ones, bypassing the register file.
In a typical implementation, results from the execute (EX) stage are forwarded via muxes from the EX/MEM pipeline register to the EX-stage inputs of dependent instructions, while memory (MEM) stage results are forwarded from the MEM/WB register; control logic selects these paths when register matches are detected, such as ForwardA = 10 for EX/MEM sourcing. This technique resolves the majority of data hazards without stalling, though it requires additional hardware paths and comparator logic, and it cannot handle all cases, such as immediate load-use dependencies. Software-based resolution through instruction scheduling complements hardware methods by reordering instructions at compile time to minimize hazard occurrences, particularly for non-critical dependencies. Compilers analyze data dependencies and rearrange code sequences, such as placing independent operations between a load and its use, to avoid stalls, assuming knowledge of the target pipeline structure; for instance, reordering can reduce clock cycles by eliminating multiple load-use stalls in a sequence. This approach is especially effective for straight-line code but is limited by true dependencies that cannot be reordered without altering program semantics, and it relies on static scheduling rather than runtime detection.
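The EX-stage forwarding conditions described above can be sketched as a small function (MIPS-style signal names following the classic textbook convention; a sketch under those assumptions, not a definitive implementation).

```python
# Forwarding-unit sketch for the first ALU operand (the ForwardA signal):
# prefer the newest matching result (EX/MEM) over the older one (MEM/WB),
# and never forward writes to register 0.
def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10   # forward the ALU result from the EX/MEM register
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01   # forward the value from the MEM/WB register
    return 0b00       # no hazard: read the register file normally

print(forward_a(True, 1, False, 0, 1))   # 2 (0b10): EX/MEM bypass selected
```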

Branch Handling

Control Hazards

Control hazards in pipelined processors arise from conditional branches and jumps that disrupt the sequential execution of instructions by altering the program counter (PC). When a branch instruction enters the pipeline, the fetch stage continues to load subsequent instructions assuming sequential flow, but the branch outcome (taken or not taken) and target address are not resolved until later stages, typically the execute stage in a classic five-stage pipeline. This leads to the pipeline fetching and partially processing incorrect instructions, requiring the flushing of one to three pipeline stages to redirect fetch to the correct path. The impact of unresolved control hazards is significant, as they introduce stalls or flushes that degrade performance. In programs with frequent branches, which make up roughly 14-20% of instructions, unmitigated control hazards can increase the cycles per instruction (CPI) from an ideal 1 to about 1.42, effectively reducing instructions per cycle (IPC) by approximately 30% in branch-heavy workloads. One early technique to mitigate control hazards is delayed branching, which defers the effect of the branch so that the instruction immediately following it executes unconditionally, irrespective of the branch outcome. This creates a branch delay slot that the compiler fills with independent instructions, such as those not dependent on the branch condition or target, thereby hiding the penalty without flushing. In the MIPS architecture, which features a single branch delay slot in its five-stage pipeline, compilers manage to populate this slot with useful operations roughly 48-60% of the time, avoiding the full branch penalty when successful. In contrast to data hazards, which stem from dependencies on operand values and primarily affect register reads and writes, control hazards fundamentally alter the instruction stream path, often necessitating pipeline-wide recovery rather than targeted forwarding or localized stalls.
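The CPI figure quoted above follows directly from the standard stall model. A quick check, assuming a 14% branch frequency and a 3-cycle flush penalty (the low end of the range given):

```python
# Back-of-envelope check of the control-hazard CPI figure in the text.
branch_freq = 0.14        # assumed fraction of instructions that branch
penalty_cycles = 3        # assumed cycles flushed per unresolved branch

cpi = 1 + branch_freq * penalty_cycles   # stall cycles added per instruction
ipc_loss = 1 - (1 / cpi)                 # throughput lost vs. the ideal CPI=1

print(round(cpi, 2))       # 1.42
print(round(ipc_loss, 2))  # 0.3, i.e. roughly 30% fewer instructions per cycle
```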

Prediction Methods

Prediction methods for branch outcomes aim to anticipate whether a conditional branch will be taken or not taken, thereby reducing the pipeline stalls associated with control hazards. Static prediction techniques, determined at compile time without runtime adaptation, include always-not-taken and always-taken strategies. The always-not-taken approach assumes all branches fall through to the sequential path, achieving typical accuracies of around 60-70% in benchmark workloads, but it performs poorly on loops, where backward branches are frequently taken. Conversely, always-taken prediction favors branch targets, yielding slightly higher accuracy in programs with more taken branches, yet it struggles with non-loop forward branches that are often not taken. These methods are simple to implement, requiring no hardware tables, but their fixed nature limits effectiveness in diverse code patterns. Dynamic prediction, in contrast, adapts based on runtime branch behavior using hardware structures such as the branch history table (BHT). A seminal dynamic scheme employs a 2-bit saturating counter per BHT entry, indexed by bits of the branch's program counter (PC), to track recent outcomes: the counter increments on taken branches and decrements on not-taken ones, with the top bit determining the prediction. This design, introduced in early studies of pipelined processors, mitigates the oscillation problems of 1-bit predictors and achieves accuracies of 82-99% depending on table size (e.g., 512 to 32K entries), significantly outperforming static methods by capturing local branch patterns. Aliasing from shared table entries can degrade performance, but the 2-bit mechanism provides hysteresis for stable predictions in repetitive code such as loops. To address target-address resolution for taken branches, especially indirect ones, the branch target buffer (BTB) serves as a cache-like structure that stores recent branch PCs and their targets, indexed by the current PC during fetch.
Proposed in foundational work on pipelined systems, the BTB enables early target fetching, reducing latency beyond direction prediction alone; hits provide the target immediately, while misses default to sequential execution. It complements BHT-based direction predictors, with set-associative designs minimizing conflicts, and is essential for high-performance pipelines in which branch resolution occurs late. Advanced dynamic predictors such as TAGE (TAgged GEometric history length) combine multiple history lengths with tagging to exploit both local and global correlations, using a geometric progression of history lengths across its tables to capture longer patterns. Developed as a high-accuracy solution, TAGE selects among its component tables via parallel lookup and an override mechanism, achieving over 95% accuracy on modern benchmarks with reasonable hardware budgets (e.g., 64 KB), and TAGE-like designs are widely adopted in modern x86 and ARM processors for their balance of precision and complexity. For instance, Intel's Core i7 employs a hybrid predictor incorporating TAGE-like elements in its 14-stage pipeline, limiting misprediction penalties to 10-15 cycles through accurate foresight and fast recovery.
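The hysteresis of the 2-bit saturating counter is easy to see in a short simulation. The following sketch models a single BHT entry (states 0-1 predict not-taken, 2-3 predict taken; the initial state is a design choice) against a loop branch that is taken nine times and then falls through:

```python
# Minimal model of one 2-bit saturating counter from a branch history table.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 1  # start in "weakly not-taken"

    def predict(self):
        """The top bit of the 2-bit counter gives the predicted direction."""
        return self.counter >= 2

    def update(self, taken):
        """Saturating increment on taken, saturating decrement on not taken."""
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

# A loop branch taken 9 times, then not taken once at loop exit.
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # 8: one warm-up miss, one miss at the loop exit
```

A 1-bit predictor would mispredict twice per loop visit (once at the exit and again on re-entry); the 2-bit counter's hysteresis halves that, which is the stability benefit described above.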

Advanced Topics

Special Situations

In pipelined processors, exceptions such as arithmetic overflows or page faults must be handled precisely to maintain the illusion of sequential execution, meaning the processor state reflects completion of all prior instructions without side effects from subsequent ones. Precise exceptions contrast with imprecise ones, where the state may reflect partially executed later instructions, complicating software recovery and debugging. To achieve precision in out-of-order pipelines, mechanisms such as history buffers store the original register values and memory state before speculative updates, allowing rollback to the exact faulting instruction when an exception is detected. Interrupts, which are asynchronous signals from hardware devices, are classified as maskable, meaning they can be temporarily disabled by setting an interrupt mask bit, or non-maskable, meaning they cannot be ignored and demand an immediate response to critical events such as power failure. In pipelined designs, handling an interrupt typically involves flushing the instructions after the current one from the pipeline to prevent interference, while saving processor state from pipeline registers at a boundary such as instruction decode (ID) to execute (EX), ensuring the restart address points to the interrupted instruction. Multi-cycle instructions, such as integer division operations that typically require 20-90 cycles or more depending on the implementation, introduce variable latency that can disrupt pipeline flow. These are managed either by stalling the pipeline, inserting no-op bubbles until completion to resolve structural hazards, or by dedicating separate functional units that operate in parallel without blocking the main pipeline, as seen in early RISC designs where a divide unit feeds its result back over a dedicated path.
In ARM pipelines, exception handling leverages banked registers, separate sets of registers for modes such as Fast Interrupt Request (FIQ), to minimize overhead; for instance, FIQ mode provides banked R8–R12 and its own SPSR, avoiding the need to push these registers to the stack and saving several clock cycles compared with standard IRQ handling. For recovery in out-of-order pipelines, checkpointing establishes restore points by snapshotting the register rename map and architectural state just before branches or other speculative decisions, enabling efficient rollback on mispredictions or exceptions without full re-execution of the instruction window. This approach, as in checkpoint processing and recovery designs, uses a small buffer of checkpoints (e.g., 8 for a 2048-instruction window) to limit overhead to under 8% while scaling instruction windows.
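The checkpoint-and-rollback idea can be reduced to a data-structure sketch. The following is purely illustrative (the map contents and naming are hypothetical, not any specific processor's design): a snapshot of the register rename map is taken at each speculative branch and restored on misprediction.

```python
# Illustrative checkpoint-based recovery: snapshot the rename map before a
# speculative branch, restore it if the branch turns out mispredicted.

rename_map = {"r1": "p5", "r2": "p7"}   # architectural -> physical registers
checkpoints = []                        # small LIFO buffer of snapshots

def take_checkpoint():
    checkpoints.append(dict(rename_map))  # store a copy, not an alias

def recover():
    """Roll the rename map back to the most recent checkpoint."""
    rename_map.clear()
    rename_map.update(checkpoints.pop())

take_checkpoint()               # at the speculative branch
rename_map["r1"] = "p9"         # speculative rename performed past the branch
recover()                       # branch mispredicted: discard speculative state
print(rename_map["r1"])         # p5 (the pre-speculation mapping)
```

Real designs bound the checkpoint buffer (the text's example uses 8 entries) and also restore free lists and other rename state, but the restore-on-pop mechanism is the same.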

Integration with Other ILP Techniques

Instruction pipelining integrates seamlessly with superscalar architectures, which employ multiple parallel pipelines to issue and execute several instructions simultaneously in each clock cycle, typically ranging from 2 to 8 instructions per cycle depending on the design. This approach exploits instruction-level parallelism (ILP) by dispatching independent instructions to distinct execution units, such as integer ALUs or floating-point units, thereby increasing throughput beyond the single-instruction-per-cycle limit of scalar pipelines. For instance, Intel Core processors, starting from the Core 2 generation, feature a 4-wide superscalar design, allowing up to four micro-operations to be issued per cycle to enhance overall performance in pipelined environments. Out-of-order execution further enhances pipelining by dynamically reordering instructions at runtime to bypass dependencies and hide latency in deep pipelines, using mechanisms such as reservation stations and a reorder buffer to maintain architectural correctness. Originating from Tomasulo's algorithm, this technique tracks instruction dependencies and dispatches ready operations to available execution units ahead of program order, mitigating stalls from data hazards in superscalar pipelines. The reorder buffer ensures results are committed in original program order, supporting precise exceptions while tolerating long latencies, such as those from memory accesses, in modern deep pipelines. Speculative execution complements pipelining by allowing the processor to fetch, decode, and execute instructions along a predicted path before branches are resolved, with mispredicted work discarded via squashing to minimize penalties. This ties closely to branch prediction methods, enabling the pipeline to continue processing without stalling on control hazards, since the speculated instructions can be rolled back if the prediction fails.
A practical integration is seen in AMD's Zen 4 architecture (introduced in 2022), which combines a 19-stage pipeline with out-of-order execution and speculative mechanisms to achieve over 5 instructions per cycle (IPC) in high-ILP workloads. However, widening superscalar issue rates beyond about 4 yields diminishing returns due to escalating hardware complexity, including larger dependency-check logic, increased power consumption, and the limited ILP available in typical programs. Designs exceeding 4-6 wide often face bottlenecks in the fetch, rename, and wakeup stages, making further scaling inefficient without proportional performance gains.

Design Considerations

Performance Metrics

The clock cycles per instruction (CPI) metric serves as a fundamental measure of pipeline efficiency, representing the average number of clock cycles required to execute one instruction. In an ideal scalar pipeline without hazards, CPI equals 1, as each instruction completes one stage per cycle in steady state. However, structural, data, and control hazards introduce stalls, increasing CPI above 1; in the presence of stalls, CPI = 1 + stall fraction, where the stall fraction is the average number of stall cycles per instruction. Instructions per cycle (IPC), the reciprocal of CPI, measures the pipeline's ability to sustain instruction execution rates, providing a direct indicator of throughput. In scalar pipelines, IPC ideally approaches 1, but superscalar processors, which issue multiple instructions per cycle, target IPC values greater than 1 to exploit instruction-level parallelism. For instance, modern superscalar designs aim for IPC in the range of 2-4, depending on workload and hardware capabilities, enabling higher overall throughput than non-superscalar pipelines. Pipeline throughput quantifies the steady-state rate at which instructions complete, typically expressed as instructions per cycle once the pipeline is filled. In a k-stage pipeline, throughput reaches one instruction per cycle, limited only by the slowest stage and assuming no bottlenecks from hazards. Real-world throughput is reduced by pipeline stalls and flushes, with deeper pipelines potentially increasing peak throughput but amplifying losses from disruptions. The branch misprediction penalty measures the performance cost of incorrect branch predictions, defined as the number of cycles lost to flushing incorrectly fetched instructions from the pipeline. This penalty is often approximated as the number of stages between fetch and the branch-resolution stage, multiplied by the flush cost, leading to losses of several cycles per misprediction in deep pipelines. For example, in a 5-stage pipeline resolving branches in the execute stage, the penalty can be 2-3 cycles, scaling with pipeline complexity.
Overall, pipelining contributes a 5-10x speedup over non-pipelined processors in modern CPUs by overlapping instruction execution, though actual gains depend on workload characteristics; this is evident in typical 5- to 20-stage designs achieving effective IPC of 2-4 under balanced workloads.
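The relationships between these metrics can be worked through numerically. A small example, using assumed workload numbers (0.2 stall cycles per instruction, a 2 GHz clock) rather than measurements:

```python
# Worked example of the pipeline metrics defined in the text.
stall_per_instr = 0.2            # assumed average stall cycles per instruction
clock_hz = 2e9                   # assumed 2 GHz clock

cpi = 1 + stall_per_instr        # CPI = 1 + stall fraction
ipc = 1 / cpi                    # IPC is the reciprocal of CPI
throughput = ipc * clock_hz      # instructions completed per second

print(round(cpi, 2), round(ipc, 3))  # 1.2 0.833
print(f"{throughput:.2e}")           # about 1.67e9 instructions/s
```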

Implementation Trade-offs

Implementing deeper instruction pipelines enables higher clock frequencies by reducing the logic delay per stage, allowing each stage to complete in less time. However, this comes at the cost of increased penalties for control hazards, such as branch mispredictions and exceptions, since recovery from errors requires flushing more stages, leading to greater performance degradation. Power consumption in pipelined processors rises with additional stages due to increased switching activity from the extra latches and wires, which elevates dynamic energy dissipation across the chip. Techniques such as dynamic voltage and frequency scaling can mitigate this by adjusting supply voltage and clock frequency to workload demands, reducing overall power without proportionally sacrificing performance. Forwarding logic and associated buffers, essential for resolving data hazards in deep pipelines, introduce significant area overhead on the die, often accounting for a notable portion of the processor's silicon resources. Post-2010, processor designs shifted toward shallower pipelines combined with wider issue widths to improve energy efficiency, exemplified by Intel's transition from the roughly 31-stage NetBurst pipeline of the Prescott core to approximately 14 stages in the Core microarchitecture and its successors such as Nehalem. The optimal pipeline depth represents a balancing act influenced by target application domains, with mobile processors favoring 10-15 stages to prioritize low power and quick recovery from hazards, while server processors often employ 20 or more stages to maximize throughput at higher frequencies.

Illustrative Examples

Basic Pipeline Execution

In a classic five-stage instruction pipeline, as described in foundational computer architecture texts, each instruction progresses through the Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB) stages, with each stage ideally completing in one clock cycle. This design enables multiple instructions to overlap in execution, increasing throughput without reducing the latency of a single instruction. To illustrate ideal pipeline behavior, consider the execution of four independent arithmetic-logic unit (ALU) instructions on a MIPS-style processor: ADD R1, R2, R3 (adds R2 and R3, stores the sum in R1); SUB R4, R5, R6 (subtracts R6 from R5, stores the difference in R4); AND R7, R8, R9 (bitwise AND of R8 and R9, stored in R7); and OR R10, R11, R12 (bitwise OR of R11 and R12, stored in R10). These instructions have no data dependencies or control transfers, allowing seamless overlap without disruptions. The first instruction enters the IF stage in clock cycle 1, the second in cycle 2, the third in cycle 3, and the fourth in cycle 4. The following table depicts the cycle-by-cycle progression of these instructions through the pipeline stages, assuming balanced stage timings and no hazards. Each row represents a clock cycle, with one column per instruction (I1 through I4). A dash (-) indicates that the instruction has not yet entered the pipeline in that cycle or has already completed all stages.
| Cycle | I1 (ADD) | I2 (SUB) | I3 (AND) | I4 (OR) |
|-------|----------|----------|----------|---------|
| 1     | IF       | -        | -        | -       |
| 2     | ID       | IF       | -        | -       |
| 3     | EX       | ID       | IF       | -       |
| 4     | MEM      | EX       | ID       | IF      |
| 5     | WB       | MEM      | EX       | ID      |
| 6     | -        | WB       | MEM      | EX      |
| 7     | -        | -        | WB       | MEM     |
| 8     | -        | -        | -        | WB      |
By cycle 5, the pipeline reaches steady state, with all five stages occupied by different instructions: I1 completing WB, I2 in MEM, I3 in EX, I4 in ID, and the IF stage free to fetch a hypothetical fifth instruction. In this configuration, the processor completes one instruction per cycle, achieving a throughput approaching one instruction per clock cycle after the initial pipeline fill. This ideal overlap demonstrates the core benefit of pipelining for sequential instruction streams under no-hazard conditions.
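The hazard-free schedule above follows a simple rule: instruction i enters IF in cycle i+1 and advances one stage per cycle. A short sketch that reproduces the table row by row:

```python
# Hazard-free five-stage schedule: instruction `instr` (0-based) enters IF
# at cycle instr+1 and moves one stage forward each cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, cycle):
    """Stage occupied in a given 1-based cycle, or '-' outside the pipeline."""
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else "-"

for cycle in range(1, 9):
    print(cycle, [stage_of(i, cycle) for i in range(4)])
# Cycle 5 shows the steady state: ['WB', 'MEM', 'EX', 'ID'] for I1..I4.
```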

Stalls and Bubbles

In instruction pipelining, stalls occur when a hazard prevents an instruction from proceeding to the next stage, forcing the pipeline to pause earlier instructions to maintain correctness. Bubbles refer to the no-operation (NOP) instructions or idle cycles inserted during these stalls, which propagate through the pipeline stages without performing useful work. Data hazards, particularly read-after-write (RAW) dependencies, are a primary cause of such stalls, as a subsequent instruction attempts to read a register before the producing instruction writes its result. A classic example of a RAW data hazard is a load instruction followed immediately by a dependent arithmetic operation, such as:

lw  $t0, 0($t1)    # Load word into $t0
add $t2, $t0, $t3  # Add $t0 to $t3, store in $t2

In a standard five-stage pipeline (Instruction Fetch [IF], Instruction Decode [ID], Execute [EX], Memory [MEM], Write-Back [WB]), the load instruction fetches its data during MEM but only writes it to the register file during WB. The add instruction's dependency is detected during ID, and the value is required in EX. Without mitigation, this creates a two-cycle stall, as the add is held in ID until the load completes WB. The pipeline control detects the hazard during the add's ID stage and inserts two NOP bubbles, delaying the fetch and decode of subsequent instructions. The following table illustrates the pipeline execution with stalls (assuming no other hazards), showing instructions held in their stages during the two stall cycles:
| Cycle | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9  |
|-------|----|----|----|-----|----|----|-----|-----|----|
| lw    | IF | ID | EX | MEM | WB | -  | -   | -   | -  |
| add   | -  | IF | ID | ID  | ID | EX | MEM | WB  | -  |
| next  | -  | -  | IF | IF  | IF | ID | EX  | MEM | WB |
Bubble insertion ensures correctness by effectively pushing back dependent instructions; the NOPs flow through EX, MEM, and WB without altering program state, allowing the load to complete before the add proceeds. Forwarding, or bypassing, mitigates this by routing the result directly from the MEM or WB stage to the EX-stage input via multiplexers (muxes), bypassing the register file read. In the load-add example, the load's MEM-stage output can be mux-selected as the add's EX input after one stall cycle, reducing the penalty to one cycle instead of two. The hardware adds muxes at the ALU inputs, controlled by hazard detection logic, which selects between the register-file value and the forwarded paths from prior stages. This simple resolution keeps pipeline throughput closer to ideal while preserving sequential execution semantics. In unoptimized pipelines lacking forwarding, bubbles from hazards can reduce throughput by 20-50% in dependent code sequences, where instructions frequently rely on their immediate predecessors, elevating cycles per instruction (CPI) significantly above the ideal value of 1.
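The throughput cost of these bubbles can be estimated with the same stall model used earlier. Assuming, for illustration, that 30% of instructions consume a value loaded by the immediately preceding instruction:

```python
# Rough CPI comparison for load-use sequences like the lw/add pair above.
load_use_frac = 0.30   # assumed fraction of instructions in a load-use pair

cpi_no_forwarding = 1 + load_use_frac * 2  # two bubbles per load-use pair
cpi_forwarding = 1 + load_use_frac * 1     # one unavoidable bubble remains

print(round(cpi_no_forwarding, 2), round(cpi_forwarding, 2))  # 1.6 1.3
# Without forwarding, throughput drops by 1 - 1/1.6 = 37.5% versus ideal,
# inside the 20-50% range cited for dependent code sequences.
```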
