Branch (computer science)
A branch, jump or transfer is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order.[a] Branch (or branching, branched) may also refer to the act of switching execution to a different instruction sequence as a result of executing a branch instruction. Branch instructions are used to implement control flow in program loops and conditionals (i.e., executing a particular sequence of instructions only if certain conditions are satisfied).
A branch instruction can be either an unconditional branch, which always results in branching, or a conditional branch, which may or may not cause branching depending on some condition. Also, depending on how it specifies the address of the new instruction sequence (the "target" address), a branch instruction is generally classified as direct, indirect or relative, meaning that the instruction contains the target address, or it specifies where the target address is to be found (e.g., a register or memory location), or it specifies the difference between the current and target addresses.
Implementation
Branch instructions can alter the contents of the CPU's program counter (PC) (or instruction pointer on Intel microprocessors). The program counter maintains the memory address of the next machine instruction to be fetched and executed. Therefore, a branch, if executed, causes the CPU to execute code from a new memory address, changing the program logic according to the algorithm planned by the programmer.
One type of machine level branch is the jump instruction. A jump may or may not load the PC with a new value different from the one it would ordinarily receive (the address of the next instruction in sequence). Jumps typically have unconditional and conditional forms, where the latter may be taken or not taken (the PC is modified or not) depending on some condition.
The second type of machine level branch is the call instruction, which is used to implement subroutines. Like jump instructions, calls may or may not modify the PC according to condition codes; additionally, however, a return address is saved in a secure place in memory (usually in a memory-resident data structure called a stack). Upon completion of the subroutine, this return address is restored to the PC, and so program execution resumes with the instruction following the call instruction.
The third type of machine level branch is the return instruction. This "pops" a return address off the stack and loads it into the PC register, thus returning control to the calling routine. Return instructions may also be conditionally executed. This description pertains to ordinary practice; however, the machine programmer has considerable powers to manipulate the return address on the stack, and so redirect program execution in any number of different ways.
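To make the call/return mechanism concrete, here is a minimal C sketch that models the PC and a return-address stack explicitly. The addresses, stack depth, and function names are invented for illustration; this is a toy model, not a description of any real ISA.

#include <stdio.h>

/* Toy model of CALL/RET: a program counter plus a return-address stack.
   All names and "addresses" here are illustrative. */
static unsigned pc = 100;        /* current instruction address */
static unsigned stack[16];       /* return-address stack */
static int sp = 0;               /* stack pointer */

static void call(unsigned target) {
    stack[sp++] = pc + 1;        /* save address of the instruction after the call */
    pc = target;                 /* branch to the subroutine */
}

static void ret(void) {
    pc = stack[--sp];            /* pop the return address back into the PC */
}

int main(void) {
    call(200);                                  /* enter subroutine at "address" 200 */
    printf("in subroutine, pc = %u\n", pc);     /* prints 200 */
    ret();                                      /* resume at the saved address */
    printf("after return, pc = %u\n", pc);      /* prints 101 */
    return 0;
}

Here call saves pc + 1 to model a one-word call instruction; real machines save the byte address of the following instruction, but the stack discipline is the same.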
Depending on the processor, jump and call instructions may alter the contents of the PC register in different ways. An absolute address may be loaded, or a value (displacement) may be added to or subtracted from the current PC, making the destination address relative to the current place in the program. The source of the displacement may vary: it can be an immediate value embedded within the instruction, the contents of a processor register or memory location, or the contents of some location added to an index value.
The term branch can also be used when referring to programs in high-level programming languages. In these, branches usually take the form of conditional statements of various forms that encapsulate the instruction sequence to be executed if the conditions are satisfied. Unconditional branch instructions such as GOTO are used to jump unconditionally to a different instruction sequence. If the algorithm requires a conditional branch, the GOTO (or GOSUB subroutine call) is preceded by an IF-THEN statement specifying the condition(s). All high-level languages support loops, control structures that repeat a sequence of instructions until some condition is satisfied and the loop terminates. Loops, too, depend on branch instructions: at the machine level, a loop is implemented as an ordinary conditional jump that redirects execution back to the repeated code, as the sketch below illustrates.
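The following C sketch shows a while loop next to the compare-and-goto form a compiler might emit for it; the exact generated code varies by compiler and target, so this is illustrative only.

/* A while loop... */
int sum_while(const int *a, int n) {
    int s = 0, i = 0;
    while (i < n) {              /* a conditional branch decides each iteration */
        s += a[i];
        i++;
    }
    return s;
}

/* ...and its compare-and-goto equivalent: a conditional forward jump out of
   the loop plus an unconditional backward jump to repeat it. */
int sum_goto(const int *a, int n) {
    int s = 0, i = 0;
top:
    if (!(i < n)) goto done;     /* conditional forward branch exits the loop */
    s += a[i];
    i++;
    goto top;                    /* unconditional backward branch repeats it */
done:
    return s;
}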
In CPUs with flag registers, an earlier instruction sets a condition in the flag register. The earlier instruction may be arithmetic, or a logic instruction. It is often close to the branch, though not necessarily the instruction immediately before the branch. The stored condition is then used in a branch such as jump if overflow-flag set. This temporary information is often stored in a flag register but may also be located elsewhere. A flag register design is simple in slower, simple computers. In fast computers a flag register can place a bottleneck on speed, because instructions that could otherwise operate in parallel (in several execution units) need to set the flag bits in a particular sequence.
There are also machines (or particular instructions) where the condition may be checked by the jump instruction itself, such as branch <label> if register X negative. In simple computer designs, comparison branches execute more arithmetic and can use more power than flag register branches. In fast computer designs comparison branches can run faster than flag register branches, because comparison branches can access the registers with more parallelism, using the same CPU mechanisms as a calculation.
Some early and simple CPU architectures, still found in microcontrollers, may not implement a conditional jump, but rather only a conditional "skip the next instruction" operation. A conditional jump or call is thus implemented as a conditional skip of an unconditional jump or call instruction.
Examples
Depending on the computer architecture, the assembly language mnemonic for a jump instruction is typically some shortened form of the word jump or the word branch, often along with other informative letters (or an extra parameter) representing the condition. Sometimes other details are included as well, such as the range of the jump (the offset size) or a special addressing mode that should be used to locate the actual effective offset.
This table lists the machine level branch or jump instructions found in several well-known architectures:
| condition or result | x86 | PDP-11, VAX | ARM (partly 6502) | equation |
|---|---|---|---|---|
| zero (implies equal for sub/cmp) | JZ; JNZ | BEQ; BNE | BEQ; BNE | zero; not zero |
| negative (N), sign (S), or minus (M) | JS; JNS | BMI; BPL | BMI; BPL | negative; not negative |
| arithmetic overflow (flag called O or V) | JO; JNO | BVS; BVC | BVS; BVC | overflow; not overflow |
| carry (from add, cmp, shift, etc.) | JC; JNC | BCS; BCC | BCS; BCC | carry; not carry |
| unsigned below (lower) | JB | BLO | BLO * | borrow |
| unsigned below or equal (lower or same) | JBE | BLOS | BLS * | borrow or zero |
| unsigned above or equal (higher or same) | JAE | BHIS | BHS * | not borrow |
| unsigned above (higher) | JA | BHI | BHI * | not borrow and not zero |
| signed less than | JL | BLT | BLT | sign≠overflow |
| signed less or equal | JLE | BLE | BLE | (sign≠overflow) or zero |
| signed greater or equal | JGE | BGE | BGE | sign=overflow |
| signed greater than | JG | BGT | BGT | (sign=overflow) and not zero |
* x86, the PDP-11, the VAX, and some others set the carry flag to signal borrow and clear it to signal no borrow. ARM, 6502, the PIC, and some others do the opposite for subtractive operations. This inverted use of the carry flag is marked by (*): in those rows borrow = not carry, but unless otherwise noted, borrow ≡ carry. Carry on additive operations, however, is handled the same way by most architectures.
Performance problems with branch instructions
To achieve high performance, modern processors are pipelined. They consist of multiple parts that each partially process an instruction, feed their results to the next stage in the pipeline, and start working on the next instruction in the program. This design expects instructions to execute in a particular unchanging sequence. Conditional branch instructions make it impossible to know this sequence. So conditional branches can cause "stalls" in which the pipeline has to be restarted on a different part of the program.
Improving performance by reducing stalls from branches
Several techniques improve speed by reducing stalls from conditional branches.
Branch prediction hints
Historically, branch prediction relied on profiling: a programmer would compile a test version of a program and run it with test data, while instrumentation counted how the branches were actually taken. The compiler then used these statistics to optimize the branches of released code, arranging that the fastest branch direction (taken or not) would always be the most frequently taken control flow path. To permit this, CPUs must be designed with (or at least have) predictable branch timing. Some CPUs have instruction sets (such as the Power ISA) that were designed with "branch hints" so that a compiler can tell a CPU how each branch is to be taken.
The problem with software branch prediction is that it requires a complex software development process.
Hardware branch predictors
Hardware branch predictors move the statistics gathering into the electronics, so that any software benefits without a profiling step. Branch predictors are parts of a processor that guess the outcome of a conditional branch. The processor's logic then gambles on the guess by beginning to execute the expected instruction flow. An example of a simple hardware branch prediction scheme is to assume that all backward branches (i.e. to a smaller program counter) are taken (because they are part of a loop), and all forward branches (to a larger program counter) are not taken (because they leave a loop). Better branch predictors are developed and validated statistically by running them in simulation on a variety of test programs. Good predictors usually count the outcomes of previous executions of a branch. Faster, more expensive computers can then run faster by investing in better branch prediction electronics. In a CPU with hardware branch prediction, branch hints let the compiler's presumably superior branch prediction override the hardware's more simplistic branch prediction.
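The backward-taken/forward-not-taken heuristic just described fits in a few lines of C. This is a sketch of the decision rule only; a real predictor applies it to instruction addresses inside the fetch unit, and the function name is invented here.

#include <stdbool.h>

/* Static heuristic: predict taken if the branch targets a smaller address
   (a backward branch, typically the bottom of a loop). */
bool predict_taken(unsigned pc, unsigned target) {
    return target < pc;
}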
Branch-free code
Some logic can be written without branches, or with fewer of them. It is often possible to use bitwise operations, conditional moves, or other forms of predication instead of branches.[1][2] Branch-free code is essential in cryptography, where data-dependent branches enable timing attacks.[3]
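As a sketch of this bitwise style, the following C routines replace an if/else with mask arithmetic. They assume two's-complement integers and an arithmetic right shift of signed values, which mainstream compilers provide but the C standard leaves implementation-defined.

#include <stdint.h>

/* r = cond ? a : b, for cond in {0, 1}, without a branch. */
uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)-(int32_t)cond;  /* all ones if cond, else zero */
    return (a & mask) | (b & ~mask);
}

/* Branch-free absolute value (two's complement). */
int32_t abs_i32(int32_t x) {
    int32_t m = x >> 31;   /* arithmetic shift assumed: -1 if negative, else 0 */
    return (x ^ m) - m;
}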
Delay slot
Another technique is a branch delay slot. In this approach, at least one instruction following a branch is always executed, with some exceptions such as the "branch likely" instructions of the legacy MIPS architecture. Therefore, the computer can use this instruction to do useful work whether or not its pipeline stalls. This approach was historically popular in RISC computers. In a family of compatible CPUs, it complicates multicycle CPUs (with no pipeline), faster CPUs with longer-than-expected pipelines, and superscalar CPUs (which can execute instructions out of order).
Notes
- ^ At least conceptually; see out-of-order execution.
References
- ^ Knuth, Donald (2008). The Art of Computer Programming. Vol. 4, Pre-fascicle 1A (Revision 6 ed.). pp. 48–49.
- ^ "Avoiding Branches". Chessprogramming wiki.
- ^ "Constant-Time Crypto". BearSSL.
External links
- Free IA-32 and x86-64 documentation, provided by Intel
- The PDP-11 FAQ
- The ARM instruction set
Branch (computer science)
Fundamentals
Definition and Purpose
In computer science, a branch is an instruction that alters the normal sequential flow of execution in a program by changing the program counter to point to a non-sequential memory address, thereby directing the processor to begin executing code at a different location.[7] This mechanism interrupts the default linear progression where instructions are fetched and executed one after another from consecutive memory addresses.[8] The primary purpose of branches is to enable conditional logic, allowing programs to make decisions, implement loops, and handle non-linear control flows based on runtime conditions, which promotes efficient resource utilization by avoiding the execution of irrelevant code paths.[1] For instance, branches facilitate structures like if-then-else statements or iterative loops, ensuring that computational resources are directed only toward necessary operations.[8] Historically, early computers such as the ENIAC (1945) relied on rudimentary conditional transfer instructions, often based on the sign of the accumulator, to implement basic loops and control transfers, as more sophisticated conditional mechanisms were not yet standardized.[9] These evolved into the diverse branch instructions of modern architectures, which support complex conditionals and integrate seamlessly with high-level programming constructs.[1] Branches form a fundamental subset of control flow statements, which govern the order in which instructions are processed, distinctly from the sequential execution model that assumes uninterrupted progression through code.[7] This distinction underscores branches' role in enabling dynamic, adaptive program behavior essential to virtually all software applications.[1]
Types of Branches
Branches in computer architecture are broadly classified into conditional and unconditional types based on their behavior in altering program control flow. Conditional branches depend on the evaluation of a runtime condition to determine whether the branch is taken or falls through to the next instruction. These conditions are typically assessed using status flags in a flag register, such as the zero flag (set when the result of an operation is zero), the carry flag (indicating an unsigned overflow or borrow), and the overflow flag (signaling a signed arithmetic overflow).[10] Unconditional branches always redirect the program counter to a new instruction address, altering the execution flow without any condition check. These are commonly implemented as jumps to a specified target.[11] Several subtypes of branches serve specific control transfer purposes. Procedure calls function as a type of branch that transfers control to a subroutine while saving the return address, often in a dedicated register or on the stack, to enable resumption of the calling routine. Returns reverse this process by loading the saved address from the stack or register to restore the prior execution context. Indirect branches compute the target address dynamically from a value stored in a register or memory location, allowing for flexible but unpredictable jumps.[12] Branch instructions employ distinct addressing modes to specify their target locations. Relative addressing calculates the target by adding a signed offset to the current program counter, facilitating position-independent code. Absolute addressing directly embeds the full target memory address in the instruction. Register-indirect addressing loads the target address from a general-purpose register, supporting computed jumps.[1] In early computer architectures, special cases like skip instructions provided a simple branching mechanism by incrementing the program counter by more than one, thereby bypassing one or more subsequent instructions; these could be conditional or unconditional depending on the design.
Implementation
Low-Level Mechanisms
Branch instructions at the hardware level fundamentally alter the flow of execution by modifying the program counter (PC), which holds the address of the next instruction to be fetched. In a typical CPU, the PC is incremented sequentially after each instruction fetch to point to the subsequent instruction in memory. However, a branch instruction computes a new target address and loads it into the PC if the branch condition is met, thereby redirecting the fetch unit to a non-sequential location. For relative branches, common in many architectures to support position-independent code, the target address is calculated as the current PC plus a signed offset encoded in the instruction:

target = PC + offset

where the offset is typically scaled (e.g., by the instruction length) to account for the PC's value at execution time. Absolute branches, less common due to their lack of relocatability, directly specify the full target address. This PC update occurs unconditionally for unconditional branches or conditionally based on prior computation results. Condition evaluation for conditional branches relies on status flags in the processor's flags register, which capture outcomes from arithmetic, logical, or comparison operations. Common flags include the zero flag (Z), set if the result is zero; the sign flag (S or N), indicating a negative result via the most significant bit; the overflow flag (V), signaling arithmetic overflow; and the carry flag (C), for carry-out in additions or borrow in subtractions. These flags are updated by instructions preceding the branch, such as comparisons (e.g., subtract without storing the result). The branch instruction then tests a combination of flags against a condition code. For instance, in pseudocode form, the evaluation might appear as:
if (Z == 1) {                      // equal or zero condition
    PC = target;
} else {
    PC = PC + instruction_length;  // sequential execution
}
High-Level Language Constructs
In high-level programming languages, conditional statements such as if-then-else and switch serve as abstractions over low-level conditional branches, allowing developers to specify alternative execution paths based on runtime conditions. These constructs are typically compiled into machine code that involves evaluating a boolean expression and performing a conditional jump if the condition is true or false, thereby altering the program counter to skip or execute specific code blocks. For instance, an if-then-else statement might generate a conditional branch instruction to jump over the "else" block if the condition holds, followed by an unconditional branch to skip the "then" block otherwise. Similarly, switch statements often compile to a series of conditional branches or, for dense cases, a jump table that uses computed addresses for direct branching to the matching case.[15][16] Loop constructs like for and while also rely on branches to manage iteration and termination. In a while loop, the condition is tested at the beginning of each iteration via a conditional branch that jumps to the loop body if true or exits to subsequent code if false; a branch back to the condition check closes the loop. For loops follow a similar pattern, incorporating initialization, condition testing, and increment steps, with branches handling the loop continuation or exit. These mappings enable modular control flow in source code while translating to efficient branch instructions at the assembly level.[16][17] The evolution of these constructs traces back to the transition from assembly language's explicit goto instructions in the early 1950s to structured alternatives in high-level languages during the late 1950s and 1960s. Early languages like Fortran (introduced in 1957) featured an arithmetic IF statement that functioned as a conditional goto based on the sign of an expression, while its DO loop used branches for iteration but lacked a full boolean if-then-else. Algol 58 and Algol 60 advanced this by introducing block-structured if statements with then and else clauses and general loop forms, promoting clearer control flow without unrestricted jumps.[18][19][20] Structured programming paradigms, formalized in the 1960s, emphasized these high-level constructs over unstructured goto statements to enhance readability and maintainability. Goto, which directly translates to an unconditional branch, allows arbitrary jumps but can lead to "spaghetti code" with tangled control paths; its use became controversial following Edsger Dijkstra's 1968 critique, which argued it complicates program verification and debugging. Modern languages thus favor structured branches in if, switch, and loops, relegating goto to rare cases like error handling.[21][20] Compilers play a crucial role in optimizing these high-level branches during translation to machine code. Techniques such as dead code elimination remove unreachable branch targets or conditional blocks that always evaluate to a fixed outcome, reducing unnecessary jumps. Loop unrolling expands loop bodies to minimize the frequency of condition-testing branches, trading code size for fewer iterations and overhead. These optimizations preserve the semantic intent of the source constructs while streamlining the underlying branch instructions for better execution efficiency.[22][23] To illustrate, consider this pseudocode for a simple if-then-else:
if (x > 0) then
y = x * 2 // branch to here if true
else
y = x - 1 // branch to here if false
end if
At the machine level, this compiles to a comparison of x > 0, branching to the else block or skipping it via a conditional jump, and an unconditional branch to rejoin after the if.[17]
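The same lowering can be written out in C with explicit gotos, mirroring the conditional jump to the else block and the unconditional rejoin. This is a sketch of a typical lowering, not any particular compiler's output, and the function and label names are invented.

int lowered(int x) {
    int y;
    if (!(x > 0)) goto else_part;  /* conditional jump if the test fails */
    y = x * 2;                     /* "then" block */
    goto end_if;                   /* unconditional branch past the else */
else_part:
    y = x - 1;                     /* "else" block */
end_if:
    return y;
}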
Assembly Language Examples
In x86 assembly, the JE (Jump if Equal) instruction performs a conditional branch if the zero flag (ZF) is set to 1, typically following a comparison that results in equality.[24] The JMP instruction provides an unconditional branch to a specified target address. For subroutine calls, CALL pushes the return address onto the stack and branches to the target, while RET pops the return address and resumes execution. In ARM assembly, the BEQ (Branch if Equal) instruction conditionally branches if the zero flag indicates equality after a comparison.[25] The B instruction executes an unconditional branch to the target address.[25] For subroutines, BL (Branch with Link) performs an unconditional branch while storing the return address in the link register.[25] In MIPS assembly, the BEQ (Branch on Equal) instruction conditionally branches if two registers hold equal values.[26] The J (Jump) instruction provides an unconditional branch using a 26-bit target address shifted left by two bits.[26] A simple example in x86-64 assembly implements the conditional logic if (a == b) { c = 1; } using a comparison, conditional branch, and labels:
mov rax, [a] ; Load a into RAX
mov rbx, [b] ; Load b into RBX
cmp rax, rbx ; Compare RAX and RBX, sets ZF if equal
jne not_equal ; Jump if not equal (ZF=0)
mov rcx, 1 ; Set c = 1
jmp end_if
not_equal:
mov rcx, 0 ; Implicit else: c = 0 (optional)
end_if:
Performance Challenges
Branch Hazards
Branch hazards arise in pipelined processors when the outcome of a branch instruction—whether taken or not taken—remains unknown during the instruction fetch stage, resulting in the pipeline speculatively fetching instructions from an incorrect execution path.[29] This uncertainty stems from the sequential nature of instruction fetching, which assumes linear program flow until the branch condition is resolved, potentially leading to wasted computational effort.[30] Branch hazards primarily manifest as three types, though structural variants are less prevalent. Control hazards occur specifically from conditional branches, where the target address or direction depends on runtime evaluation, disrupting the fetch-decode sequence.[29] Data hazards related to branches emerge when the branch condition itself depends on results from prior instructions that have not yet completed, such as a register value loaded or computed earlier in the pipeline.[31] Structural hazards, while uncommon for branches, can arise if branch-related operations conflict with other instructions over shared hardware resources, like memory access units during simultaneous fetch and condition computation.[29] These hazards are typically detected during the decode or execute stages of the pipeline, where the branch instruction is analyzed and its condition begins evaluation, but they remain unresolved until the full condition—often involving arithmetic or comparison—is computed later in the execute phase.[30] Until resolution, the pipeline cannot confirm the correct path, forcing a decision point that delays subsequent fetches.[29] The impact on execution involves potential wrong-path execution, where incorrectly fetched instructions advance through the pipeline stages, necessitating a flush of those speculative instructions to redirect to the proper path once the branch outcome is known.[30] This flushing discards partially processed work, reducing overall pipeline efficiency and throughput.[29] Branch hazards first became prominent in the 1980s with the advent of deep pipelined designs, notably the MIPS R2000 processor released in 1985, which featured a five-stage pipeline where branches were resolved in the decode stage, but still required a delay slot to mitigate control hazards in early RISC architectures.[32]
Pipeline Stalls and Penalties
In pipelined processors, a pipeline stall refers to the insertion of one or more no-op (NOP) cycles to halt instruction fetch and execution until a branch outcome is resolved, preventing the pipeline from processing incorrect instructions. This stall arises from control hazards, where the next instruction address depends on the branch decision, which is not known until later pipeline stages. Without mitigation, such stalls directly reduce instruction throughput by idling pipeline stages.[33] The branch penalty quantifies the performance cost of these stalls and is calculated as the number of cycles lost due to the delay between instruction fetch and branch resolution, typically expressed as the number of pipeline stages after fetch until resolution, minus one for the overlapping fetch of the wrong path. In a classic five-stage pipeline (fetch, decode, execute, memory, writeback), where the branch is resolved in the execute stage, the penalty is usually one cycle if a single stall is inserted after decode; however, in schemes without early resolution, it can reach up to three cycles for conditional branches due to flushing fetched instructions. For instance, with a 20% branch frequency and a one-cycle penalty per branch, the average stalls per instruction rise to 0.2, increasing the cycles per instruction (CPI) from an ideal 1.0 to 1.2.[33][34] In modern superscalar processors, the cost of a branch misprediction—where the pipeline must flush speculative instructions and refill from the correct path—is significantly higher, often 10-20 cycles due to deeper pipelines and out-of-order execution recovery. For example, in Intel Core processors from the 2020s, misprediction penalties range from 15-20 cycles, as measured in benchmarks flushing the reorder buffer and re-steering the front-end. This cost is exacerbated by factors such as branch frequency, which accounts for 15-20% of instructions in typical programs, and predictability: backward branches in loops are highly predictable (often taken repeatedly), while data-dependent branches (e.g., based on runtime conditions) are less so, leading to higher misprediction rates.[35][36][37] Overall, unresolved branch hazards can elevate CPI by 0.2-0.5 or more without mitigation, depending on workload; for example, in a processor with 20% branches and a three-cycle penalty, CPI could reach 1.6, halving potential throughput compared to a hazard-free ideal. These metrics underscore the throughput degradation in unoptimized pipelines, where even moderate branch densities amplify lost cycles across the instruction stream.[38]
Optimization Techniques
Branch Prediction Methods
Branch prediction methods aim to anticipate the outcome of conditional branches at runtime to minimize pipeline disruptions in processors. These techniques are broadly classified into static and dynamic approaches, with the latter further divided into software-based and hardware-based implementations. Static methods make predictions at compile time based on fixed heuristics, while dynamic methods adapt predictions using runtime information, achieving higher accuracy at the cost of additional hardware or software overhead.[39] Static branch prediction employs simple, compiler-determined rules without runtime adaptation. The most basic forms assume branches are always taken or always not taken, which were common in early processors and compilers due to their zero hardware cost. For instance, always-not-taken prediction favors sequential execution, mispredicting only backward branches like loops, while always-taken handles forward branches better but incurs penalties on straight-line code. These methods achieve accuracies around 60-70% on average benchmarks but perform poorly on mixed branch patterns.[40][39] Dynamic prediction leverages runtime history to improve accuracy, often exceeding 90% in modern systems. Software-based dynamic prediction includes compiler hints that guide hardware predictors; for example, GCC's __builtin_expect builtin allows developers to indicate likely branch outcomes, such as marking error paths as unlikely, which the compiler encodes into branch instructions for better static or dynamic handling. This approach integrates with hardware without extra runtime cost and is widely used in performance-critical code.[41][42]
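A hint of this kind might look like the following C sketch for GCC or Clang. The unlikely wrapper macro is a common convention (not part of the compiler), and the function is invented for illustration.

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hint that the error path is rarely taken, steering code layout
   and static prediction toward the common case. */
long sum(const int *buf, long n) {
    if (unlikely(buf == 0))
        return -1;                /* cold path: hinted as not taken */
    long total = 0;
    for (long i = 0; i < n; i++)
        total += buf[i];
    return total;
}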
Hardware predictors form the core of dynamic methods, using tables to store branch histories and outcomes. The simplest is the 1-bit predictor, which toggles its prediction based on the previous outcome—predicting taken if the last branch was taken, and vice versa—effectively assuming branches repeat their last behavior. This suffers from two mispredictions per direction change, such as entering and exiting a loop.[43]
To address this, the 2-bit saturating counter predictor uses a 2-bit up/down counter per branch (or shared via indexing), incrementing on taken outcomes and decrementing on not taken, with the most significant bit determining the prediction. States strongly favor one direction (e.g., 11 for strongly taken), providing hysteresis to avoid flipping on single anomalies, which is particularly effective for repetitive patterns like loops, reducing mispredictions by about 20% over 1-bit schemes on SPEC benchmarks.[39][44]
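A 2-bit saturating counter is easy to model in C. This minimal sketch keeps one counter, with states 0-1 predicting not taken and 2-3 predicting taken; the type and function names are invented.

#include <stdbool.h>

typedef struct { unsigned state; } counter2;   /* ranges over 0..3 */

bool predict(const counter2 *c) {
    return c->state >= 2;                      /* high bit of the counter */
}

void train(counter2 *c, bool taken) {
    if (taken && c->state < 3) c->state++;     /* saturate at strongly taken */
    if (!taken && c->state > 0) c->state--;    /* saturate at strongly not taken */
}

The hysteresis is visible in the code: a single anomalous outcome moves the counter one step but does not flip the predicted direction from a "strong" state.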
Advanced hardware predictors incorporate branch correlations. Two-level predictors, introduced by Yeh and Patt, use a global or local history register (a shift register of recent outcomes) to index a pattern history table (PHT) of 2-bit counters, capturing dependencies between branches—such as one branch influencing another—yielding up to 93% accuracy on integer workloads by exploiting program context. Global history variants (GShare) XOR the history with the branch address for indexing, mitigating aliasing but introducing interference.[45][46]
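A gshare-style two-level predictor can be sketched in C as follows, with the global history XORed into the table index. The table size, address hashing, and names are illustrative assumptions, not a description of any shipping design.

#include <stdbool.h>

#define HIST_BITS 12
#define PHT_SIZE (1u << HIST_BITS)

static unsigned char pht[PHT_SIZE];   /* 2-bit saturating counters, as above */
static unsigned ghist;                /* global branch-history shift register */

static unsigned index_of(unsigned pc) {
    return ((pc >> 2) ^ ghist) & (PHT_SIZE - 1);   /* gshare hash */
}

bool gshare_predict(unsigned pc) {
    return pht[index_of(pc)] >= 2;
}

void gshare_train(unsigned pc, bool taken) {
    unsigned i = index_of(pc);
    if (taken && pht[i] < 3) pht[i]++;
    if (!taken && pht[i] > 0) pht[i]--;
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}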
Tournament predictors, proposed by McFarling, combine local (per-branch history) and global predictors using a selector table of 2-bit counters to choose the better performer per branch, achieving higher accuracy than individual schemes by adapting to workload variations—often 1-2% improvement over two-level alone.[47]
In modern CPUs as of 2025, the TAGE (TAgged GEometric history length) predictor, developed by Seznec and Michaud, dominates implementations in Intel and AMD processors. TAGE uses multiple tables with varying history lengths and partial tagging to match long patterns precisely, providing fallback to shorter histories on mismatches; it achieves over 95% accuracy on SPEC CPU2017 traces with 32-64 KB storage, significantly reducing mispredictions from correlated branches compared to earlier tournament designs.[48][49]
Research into neural-inspired predictors, such as perceptron-based schemes, treats prediction as a weighted sum of recent outcomes, approximating a single-layer neural network to capture complex nonlinear patterns; these outperform TAGE by 1-3% in hard-to-predict workloads but require more computation, remaining in experimental stages for 2025 hardware.[43][50]
Prediction accuracy is measured as the ratio of correct predictions to total branches executed, with misprediction rates directly impacting instructions per cycle (IPC) by incurring pipeline flushes—typically 10-20 cycles per miss in deep pipelines.[51]
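As a rough illustrative model (not drawn from the cited sources), the combined effect can be written as CPI = CPI_base + f_branch × r_miss × P_miss, where f_branch is the fraction of instructions that are branches, r_miss the misprediction rate, and P_miss the flush penalty in cycles. For example, with 20% branches, a 5% miss rate, and a 15-cycle penalty, CPI grows from 1.0 to 1.0 + 0.2 × 0.05 × 15 = 1.15, a 15% slowdown from mispredictions alone.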
Delay Slots
Delay slots are a technique used in pipelined processor architectures to mitigate the performance impact of control hazards caused by branches. In this approach, one or more instructions immediately following a branch instruction are executed unconditionally, regardless of whether the branch is taken or not, thereby filling pipeline bubbles that would otherwise result from the delay in resolving the branch condition. If the branch is taken, these delay slot instructions are typically annulled (nullified) to prevent incorrect execution, while they are always executed if the branch is not taken. This method shifts the responsibility of utilizing the delay slots to the compiler or assembler, which must schedule independent instructions into these positions to avoid data dependencies or exceptions.[52] The concept of delay slots emerged in the 1980s as part of early Reduced Instruction Set Computer (RISC) designs, where simple pipelines made hardware branch prediction costly and complex. Pioneering RISC projects, such as those at the University of California, Berkeley, introduced delayed branches to expose pipeline latencies to software, allowing compilers to optimize for them without additional hardware interlocks. For instance, the Berkeley RISC I and II prototypes from the early 1980s featured delayed branches with up to two slots, demonstrating that software scheduling could achieve high utilization rates, often exceeding 90% for single slots in optimized code. By the mid-1980s, commercial implementations like the MIPS R2000 processor, released in 1985, standardized a single delay slot for branches and jumps, influencing subsequent RISC architectures.[53][54] In the MIPS architecture, branches and jumps feature a single mandatory delay slot, where the subsequent instruction is always fetched and executed before control transfers to the target address. This design assumes a classic five-stage pipeline (instruction fetch, decode, execute, memory access, write-back), where branch resolution occurs late, creating one bubble that the delay slot fills. MIPS compilers schedule non-dependent instructions, such as independent arithmetic operations or loads from unrelated addresses, into this slot to maximize utility; for example, if a branch depends on a prior addition but a subsequent load is independent, the load can be moved into the slot without hazards. To handle cases where no suitable instruction is available, a no-operation (NOP) can be inserted, though this incurs a penalty. MIPS also supports conditional annulment via "branch-likely" instructions (introduced in MIPS II), such as BEQL (branch if equal likely), which nullify the delay slot if the branch is not taken, reducing wasted execution for unpredictable or rarely taken branches.[52] Some early RISC designs experimented with multiple delay slots to accommodate deeper pipelines, as seen in variants of the Berkeley RISC with two slots, requiring compilers to find sequences of independent instructions. However, conditional annulment was less common in these prototypes and more refined in later architectures like MIPS. The compiler's role is critical, involving instruction reordering during optimization passes to identify fillable slots; studies on early RISC compilers showed they could fill single slots in about 90% of cases and achieve 80% usefulness for the executed instructions, though effectiveness drops for multiple slots due to scarcer independent code. 
For instance, moving a load instruction from before the branch into the slot is feasible if the load's base register is not modified by the branch condition, preserving data flow without introducing false dependencies.[53][55] Despite their advantages in simple pipelines, delay slots have limitations that restrict their applicability. Not all branch types support slots effectively; procedure returns (e.g., via JR in MIPS) often lack them because predicting post-return instructions is difficult, leading to frequent NOP insertions and overhead. If no independent instruction exists for a slot—common in tight loops or data-dependent code—the unfilled slot wastes cycles, sometimes equaling the branch penalty it aims to hide. The branch resolution timing assumed by delay slots, typically one cycle in basic RISC pipelines, makes them less viable in modern superscalar processors, where out-of-order execution and dynamic prediction dominate, rendering the technique largely obsolete since the 1990s.[52][53]
Branch-Free Code
Branch-free code encompasses programming techniques that replace conditional branches with arithmetic, bitwise, or hardware-supported conditional operations to implement logic without altering control flow. This method ensures deterministic execution timing by avoiding pipeline disruptions from branch resolution, making it particularly valuable in performance-critical applications.[56] One primary technique involves conditional move instructions, which select between values based on processor flags without branching. In the x86 architecture, the CMOVcc family of instructions, introduced with the Pentium Pro processor in 1995, conditionally transfers data from a source operand to a destination register if the specified EFLAGS condition (such as zero, carry, or overflow) holds. For instance, CMOVZ moves the source value only if the zero flag is set, loading the data into a temporary register unconditionally but committing it to the destination solely upon condition satisfaction; this supports 16-, 32-, or 64-bit operations between registers or from memory to registers. Such instructions enable straightforward replacement of if-then-else constructs, as in assembly code where a conditional jump and two moves are substituted by a single conditional move followed by an unconditional one.[57] Another approach leverages bitwise operations to simulate conditional selection through masking. For example, to assign r = cond ? a : b, where cond is treated as an integer 0 or 1 (derived from a comparison), the expression r = b ^ ((a ^ b) & -cond) uses XOR and AND to blend the values branchlessly; the mask -cond (two's complement negation) becomes all 1s if cond = 1 or all 0s if cond = 0, selecting a or b accordingly. This pattern, which requires three operations, generalizes to signed integers and avoids data dependencies on branch outcomes. Similar bitwise tricks extend to computing minima or maxima, such as min(a, b) = b ^ ((a ^ b) & -(a < b)), providing portable, branch-free alternatives to comparisons and selects.[58] Predication offers a hardware-level solution by marking instructions as conditional, executing them on all paths but nullifying results for unmet conditions. In ARM architectures, predication uses vector masks to control execution per lane in SIMD operations; for instance, in the Armv8-M extension with Helium technology, instructions can employ merging (retaining prior values in inactive lanes), zeroing (setting inactive lanes to zero), or don't-care predication (undefined behavior in inactive lanes), applied to 128-bit vectors with up to 16 8-bit lanes. This facilitates branch-free vector processing, such as handling loop tails without scalar fallbacks. In GPU SIMD architectures, predication similarly broadcasts instructions across threads but discards outputs based on per-element predicates, mitigating divergence costs in SIMT models. A foundational study on partial predication, which limits conditional instructions to reduce hardware complexity, demonstrated up to 40% reduction in branch instructions and improved instruction-level parallelism in ILP processors compared to full predication schemes.[59][60] These techniques yield significant advantages, including elimination of misprediction penalties—which can cost 10-20 cycles per branch—and enhanced predictability, crucial for embedded and real-time systems where timing jitter from branches disrupts scheduling. In such environments, branch-free code promotes bounded execution times and simpler verification, as all paths are exercised uniformly without control hazards.
However, drawbacks include expanded code size from additional operations or duplicated paths, increased instruction count leading to higher fetch/decode overhead, and potential energy waste in predicated execution where both branches compute unnecessarily. Quantitative evaluations show code size growth of 10-30% in predicated regions, though performance gains often outweigh this in misprediction-prone loops.[56][61] In modern contexts, branch-free methods integrate into optimizing compilers and vectorized code. Just-in-time (JIT) compilers, such as those in dynamic languages, map high-level conditionals like C's ternary operator (?:) to conditional moves for predictable latency. Similarly, in SIMD extensions like Intel's AVX-512, mask registers (k0-k7) enable predicated vector instructions, such as VPCMPB for comparisons or VPTERNLOG for masked blends, achieving multi-GB/s throughput in branch-heavy tasks like string processing without scalar branches. For example, case-insensitive matching can use a mask to uppercase only lowercase characters across 64-byte vectors, nullifying unaffected lanes. This predication supports scalable parallelism in vectorized loops, contrasting with earlier SSE/AVX where branches serialized execution.[62]
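The XOR/AND selection and minimum described above translate directly to C. This sketch assumes two's-complement integers and cond in {0, 1}; the function names are invented for illustration.

#include <stdint.h>

/* r = cond ? a : b via the XOR/AND blend: -cond is all ones or all zeros. */
uint32_t blend(uint32_t cond, uint32_t a, uint32_t b) {
    return b ^ ((a ^ b) & (uint32_t)-(int32_t)cond);
}

/* Branch-free minimum using the same mask idea. */
int32_t min_i32(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}

When cond is 1 the mask is all ones and b ^ (a ^ b) collapses to a; when cond is 0 the mask is zero and b passes through unchanged, so no branch is ever taken.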