Recent from talks
Nothing was collected or created yet.
Branch predictor
View on WikipediaIn computer architecture, a branch predictor[1][2][3][4][5] is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures.

Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "taken" and jump to a different place in program memory, or it can be "not taken" and continue execution immediately after the conditional jump. It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline (see fig. 1).
Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong, then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.[6]
The first time a conditional jump instruction is encountered, there is not much information to base a prediction on. However, the branch predictor keeps records of whether or not branches are taken, so when it encounters a conditional jump that has been seen several times before, it can base the prediction on the recorded history. The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
Branch prediction is not the same as branch target prediction. Branch prediction attempts to guess whether a conditional jump will be taken or not. Branch target prediction attempts to guess the target of a taken conditional or unconditional jump before it is computed by decoding and executing the instruction itself. Branch prediction and branch target prediction are often combined into the same circuitry.
Implementation
[edit]Static branch prediction
[edit]Static prediction is the simplest branch prediction technique because it does not rely on information about the dynamic history of code executing. Instead, it predicts the outcome of a branch based solely on the branch instruction.[7] Early implementations of SPARC and MIPS (two of the first commercial RISC architectures) used single-direction static branch prediction: they always predict that a conditional jump will not be taken, so they always fetch the next sequential instruction. Only when the branch or jump is evaluated and found to be taken, does the instruction pointer get set to a non-sequential address.
Both CPUs fetch instructions in one cycle and evaluate branches in the decode stage. As a result, the branch target recurrence is two cycles long, and the machine always fetches the instruction immediately after any taken branch. Both architectures define branch delay slots in order to utilize these fetched instructions.
A more-advanced form of static prediction assumes that backward branches will be taken and forward branches will not. (A backward branch is one that has a target address that is lower than its own address.) This rule increases prediction accuracy in loops, which can be constructed with a backward branch at the end that is mostly taken, or a forward branch at the beginning that is mostly not taken. With processors that use this prediction method, ordering the instructions can maximize branch-prediction accuracy. The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.)[8]
Some processors accept branch-prediction hints that override the static prediction. The Intel Pentium 4 and Pentium 4E accept branch-prediction hints as prefixes. In the presence of dynamic prediction, static prediction offers almost no benefit, and overriding the static prediction helps even less. Intel's subsequent Pentium M and Core2 processors ignore branch-prediction hint prefixes.[9] No x86 manufacturer has re-introduced prediction hints.
Dynamic branch predictors naturally fall back on static branch prediction when they have no cached information (such as the first time a given branch is encountered). Both the Motorola MPC7450 (G4e) and the Intel Pentium 4 fall back on static prediction.[10]
In static prediction, all decisions are made at compile time, before the execution of the program.[11]
Dynamic branch prediction
[edit]Dynamic branch prediction[2] uses information about taken or not-taken branches gathered at run-time to predict the outcome of a branch.[1]
Random branch prediction
[edit]Random branch prediction means taking a random guess at whether a branch will be taken, each time it is executed. The cost for a pseudorandom number generator is low compared to other techniques. Random prediction guarantees a 50% correct prediction rate that cannot be increased or reduced by any means. (And it makes timing even more nondeterministic or unpredictable than other methods.) Static branch prediction (assuming forward branches are taken and backward branches are not) predicts each branch with an accuracy ranging from 0% to 100%, with the overall success rate somewhere in-between. Unoptimized code is very likely to run with better than 50% prediction success; even 90% is likely. If the compiler does some instruction reordering, it makes the correct branch prediction rate higher than for unoptimized code. Static branch prediction long ago surpassed random branch prediction. Dynamic branch prediction has surpassed static branch prediction and is commonplace, despite the added complexity.
Next-line prediction
[edit]Some superscalar processors (MIPS R8000, Alpha 21264, and Alpha 21464 (EV8)) fetch each line of instructions with a pointer to the next line. This next-line predictor handles branch target prediction as well as branch direction prediction.
When a next-line predictor points to aligned groups of 2, 4, or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity, a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again, assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
One-level branch prediction
[edit]Saturating counter
[edit]A 1-bit saturating counter (essentially a flip-flop) records the last outcome of the branch. This is the most simple version of dynamic branch predictor possible, although it is not very accurate.
A 2-bit saturating counter[1] is a state machine with four states:

- Strongly not taken
- Weakly not taken
- Weakly taken
- Strongly taken
When a branch is evaluated, the corresponding state machine is updated. Branches evaluated as not taken change the state toward strongly not taken, and branches evaluated as taken change the state toward strongly taken. The advantage of the two-bit counter scheme over a one-bit scheme is that a conditional jump has to deviate twice from what it has done most in the past before the prediction changes. For example, a loop-closing conditional jump is mispredicted once rather than twice.
The original, non-MMX Intel Pentium processor uses a saturating counter, though with an imperfect implementation.[9]
On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.[12]: 3
The predictor table is indexed with the instruction address bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
Two-level predictor
[edit]The Two-Level Branch Predictor, also referred to as Correlation-Based Branch Predictor, uses a two-dimensional table of counters, also called "Pattern History Table". The table entries are two-bit counters.
Two-level adaptive predictor
[edit]
If an if statement is executed three times, the decision made on the third execution might depend upon whether the previous two were taken or not. In such scenarios, a two-level adaptive predictor works more efficiently than a saturation counter. Conditional jumps that are taken every second time or have some other regularly recurring pattern are not predicted well by the saturating counter. A two-level adaptive predictor remembers the history of the last n occurrences of the branch and uses one saturating counter for each of the possible 2n history patterns. This method is illustrated in figure 3.
Consider the example of n = 2. This means that the last two occurrences of the branch are stored in a two-bit shift register. This branch history register can have four different binary values, 00, 01, 10, and 11, where zero means "not taken" and one means "taken". A pattern history table contains four entries per branch, one for each of the 22 = 4 possible branch histories, and each entry in the table contains a two-bit saturating counter of the same type as in figure 2 for each branch. The branch history register is used for choosing which of the four saturating counters to use. If the history is 00, then the first counter is used; if the history is 11, then the last of the four counters is used.
Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a zero. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones.
The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences are different.[9]
The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern. This method was invented by T.-Y. Yeh and Yale Patt at the University of Michigan.[14] Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors.[citation needed]
Two-level neural predictor
[edit]A two-level branch predictor where the second level is replaced with a neural network has been proposed.[15]
Local branch prediction
[edit]A local branch predictor has a separate history buffer for each conditional jump instruction. It may use a two-level adaptive predictor. The history buffer is separate for each conditional jump instruction, while the pattern history table may be separate as well or it may be shared between all conditional jumps.
The Intel Pentium MMX, Pentium II, and Pentium III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump.
On the SPEC'89 benchmarks, very large local predictors saturate at 97.1% correct.[12]: 6
Global branch prediction
[edit]A global branch predictor does not keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. The advantage of a shared history is that any correlation between different conditional jumps is part of making the predictions. The disadvantage is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated, and that the history buffer may not include any bits from the same branch if there are many other branches in between. It may use a two-level adaptive predictor.
This scheme is better than the saturating counter scheme only for large table sizes, and it is rarely as good as local prediction. The history buffer must be longer in order to make a good prediction. The size of the pattern history table grows exponentially with the size of the history buffer. Hence, the big pattern history table must be shared among all conditional jumps.
A two-level adaptive predictor with globally shared history buffer and pattern history table is called a "gshare" predictor if it xors the global history and branch PC, and "gselect" if it concatenates them. Global branch prediction is used in AMD processors, and in Intel Pentium M, Core, Core 2, and Silvermont-based Atom processors.
Alloyed branch prediction
[edit]An alloyed branch predictor[16] combines the local and global prediction principles by concatenating local and global branch histories, possibly with some bits from the program counter as well. Tests indicate that the VIA Nano processor may be using this technique.[9]
Agree predictor
[edit]An agree predictor is a two-level adaptive predictor with globally shared history buffer and pattern history table, and an additional local saturating counter. The outputs of the local and the global predictors are XORed with each other to give the final prediction. The purpose is to reduce contentions in the pattern history table where two branches with opposite prediction happen to share the same entry in the pattern history table.[17]
Hybrid predictor
[edit]A hybrid predictor, also called combined predictor, implements more than one prediction mechanism. The final prediction is based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or a majority vote function based on an odd number of different predictors.
Scott McFarling proposed combined branch prediction in his 1993 paper.[12]
On the SPEC'89 benchmarks, such a predictor is about as good as the local predictor.[citation needed]
Predictors like gshare use multiple table entries to track the behavior of any particular branch. This multiplication of entries makes it much more likely that two branches will map to the same table entry (a situation called aliasing), which in turn makes it much more likely that prediction accuracy will suffer for those branches. Once you have multiple predictors, it is beneficial to arrange that each predictor will have different aliasing patterns, so that it is more likely that at least one predictor will have no aliasing. Combined predictors with different indexing functions for the different predictors are called gskew predictors, and are analogous to skewed associative caches used for data and instruction caching.
Loop predictor
[edit]A conditional jump that controls a loop is best predicted with a special loop predictor. A conditional jump in the bottom of a loop that repeats N times will be taken N-1 times and then not taken once. If the conditional jump is placed at the top of the loop, it will be not taken N-1 times and then taken once. A conditional jump that goes many times one way and then the other way once is detected as having loop behavior. Such a conditional jump can be predicted easily with a simple counter. A loop predictor is part of a hybrid predictor where a meta-predictor detects whether the conditional jump has loop behavior.
Indirect branch predictor
[edit]An indirect jump instruction can choose among more than two branches. Some processors have specialized indirect branch predictors.[18][19] Newer processors from Intel[20] and AMD[21] can predict indirect branches by using a two-level adaptive predictor. This kind of instruction contributes more than one bit to the history buffer. The zEC12 and later z/Architecture processors from IBM support a BRANCH PREDICTION PRELOAD instruction that can preload the branch predictor entry for a given instruction with a branch target address constructed by adding the contents of a general-purpose register to an immediate displacement value.[22][23]
Processors without this mechanism will simply predict an indirect jump to go to the same target as it did last time.[9]
Prediction of function returns
[edit]A function will normally return to where it is called from. The return instruction is an indirect jump that reads its target address from the call stack. Many microprocessors have a separate prediction mechanism for return instructions. This mechanism is based on a so-called return stack buffer, which is a local mirror of the call stack. The size of the return stack buffer is typically 4–16 entries.[9]
Overriding branch prediction
[edit]The trade-off between fast branch prediction and good branch prediction is sometimes dealt with by having two branch predictors. The first branch predictor is fast and simple. The second branch predictor, which is slower, more complex, and with bigger tables, will override a possibly wrong prediction made by the first predictor.
The Alpha 21264 and Alpha EV8 microprocessors used a fast single-cycle next-line predictor to handle the branch target recurrence and provide a simple and fast branch prediction. Because the next-line predictor is so inaccurate, and the branch resolution recurrence takes so long, both cores have two-cycle secondary branch predictors that can override the prediction of the next-line predictor at the cost of a single lost fetch cycle.
The Intel Core i7 has two branch target buffers and possibly two or more branch predictors.[24]
Neural branch prediction
[edit]Machine learning for branch prediction using LVQ and multilayer perceptrons, called "neural branch prediction", was proposed by Lucian Vintan (Lucian Blaga University of Sibiu).[25] One year later he developed the perceptron branch predictor.[26] The neural branch predictor research was developed much further by Daniel Jimenez.[27] In 2001,[27] the first perceptron predictor was presented that was feasible to implement in hardware. The first commercial implementation of a perceptron branch predictor was in AMD's Piledriver microarchitecture.[28]
The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor.[29] He also used a gshare/perceptron overriding hybrid predictors.[29]
The main disadvantage of the perceptron predictor is its high latency. Even after taking advantage of high-speed arithmetic tricks, the computation latency is relatively high compared to the clock period of many modern microarchitectures. In order to reduce the prediction latency, Jimenez proposed in 2003 the fast-path neural predictor, where the perceptron predictor chooses its weights according to the current branch's path, rather than according to the branch's PC. Many other researchers developed this concept (A. Seznec, M. Monchiero, D. Tarjan & K. Skadron, V. Desmet, Akkary et al., K. Aasaraai, Michael Black, etc.).[citation needed]
Most of the state-of-the-art branch predictors are using a perceptron predictor (see Intel's "Championship Branch Prediction Competition"[30]). Intel already implements this idea in one of the IA-64's simulators (2003).[31]
The AMD Ryzen[32][33][34] multi-core processor's Infinity Fabric and the Samsung Exynos processors include perceptron-based neural branch predictors.
History
[edit]The IBM 7030 Stretch, designed in the late 1950s, pre-executes all unconditional branches and any conditional branches that depended on the index registers. For other conditional branches, the first two production models implemented predict untaken; subsequent models were changed to implement predictions based on the current values of the indicator bits (corresponding to today's condition codes).[35] The Stretch designers had considered static hint bits in the branch instructions early in the project but decided against them. Misprediction recovery was provided by the lookahead unit on Stretch, and part of Stretch's reputation for less-than-stellar performance was blamed on the time required for misprediction recovery. Subsequent IBM large computer designs did not use branch prediction with speculative execution until the IBM 3090 in 1985.
Two-bit predictors were introduced by Tom McWilliams and Curt Widdoes in 1977 for the Lawrence Livermore National Lab S-1 supercomputer and independently by Jim Smith in 1979 at CDC.[36]
Microprogrammed processors, popular from the 1960s to the 1980s and beyond, took multiple cycles per instruction, and generally did not require branch prediction. However, in addition to the IBM 3090, there are several other examples of microprogrammed designs that incorporated branch prediction.
The Burroughs B4900, a microprogrammed COBOL machine released around 1982, was pipelined and used branch prediction. The B4900 branch prediction history state is stored back into the in-memory instructions during program execution. The B4900 implements 4-state branch prediction by using 4 semantically equivalent branch opcodes to represent each branch operator type. The opcode used indicated the history of that particular branch instruction. If the hardware determines that the branch prediction state of a particular branch needs to be updated, it rewrites the opcode with the semantically equivalent opcode that hinted the proper history. This scheme obtains a 93% hit rate. US patent 4,435,756 and others were granted on this scheme.
The DEC VAX 9000, announced in 1989, is both microprogrammed and pipelined, and performs branch prediction.[37]
The first commercial RISC processors, the MIPS R2000 and R3000 and the earlier SPARC processors, do only trivial "not-taken" branch prediction. Because they use branch delay slots, fetched just one instruction per cycle, and execute in-order, there is no performance loss. The later R4000 uses the same trivial "not-taken" branch prediction, and loses two cycles to each taken branch because the branch resolution recurrence is four cycles long.
Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064, the MIPS R8000, and the IBM POWER series. These processors all rely on one-bit or simple bimodal predictors.
The DEC Alpha 21264 (EV6) uses a next-line predictor overridden by a combined local predictor and global predictor, where the combining choice is made by a bimodal predictor.[38]
The AMD K8 has a combined bimodal and global predictor, where the combining choice is another bimodal predictor. This processor caches the base and choice bimodal predictor counters in bits of the L2 cache otherwise used for ECC. As a result, it has effectively very large base and choice predictor tables, and parity rather than ECC on instructions in the L2 cache. The parity design is sufficient, since any instruction suffering a parity error can be invalidated and refetched from memory.
The Alpha 21464[38] (EV8, cancelled late in design) had a minimum branch misprediction penalty of 14 cycles. It was to use a complex but fast next-line predictor overridden by a combined bimodal and majority-voting predictor. The majority vote was between the bimodal and two gskew predictors.
In 2018 a catastrophic security vulnerability called Spectre was made public by Google's Project Zero and other researchers. Affecting virtually all modern CPUs, the vulnerability involves priming the branch predictors so another process (or the kernel) will mispredict a branch and use secret data as an array index, evicting one of the attacker's cache lines. The attacker can time access to their own array to find out which one, turning this CPU internal (microarchitectural) state into a value the attacker can save which has information about values they could not read directly.[39]
See also
[edit]- Branch target predictor
- Branch prediction analysis attacks – on RSA cryptosystem public-key cryptography
- Instruction unit
- Cache prefetching
- Indirect branch control (IBC)
- Indirect branch prediction barrier (IBPB)
- Indirect branch restricted speculation (IBRS)
- Predication (computer architecture)
- Single thread indirect branch predictor (STIBP)
References
[edit]- ^ a b c Malishevsky, Alexey; Beck, Douglas; Schmid, Andreas; Landry, Eric. "Dynamic Branch Prediction". Archived from the original on 2019-07-17. Retrieved 2017-03-22.
- ^ a b Cheng, Chih-Cheng. "The Schemes and Performances of Dynamic Branch predictors" (PDF).
- ^ Parihar, Raj. "Branch Prediction Techniques and Optimizations" (PDF). Archived from the original (PDF) on 2017-05-16. Retrieved 2017-04-02.
- ^ Mutlu, Onur (2013-02-11). "18-447 Computer Architecture Lecture 11: Branch Prediction" (PDF). Archived from the original (PDF) on 2015-03-25.
- ^ Michaud, Pierre; Seznec, André; Uhlig, Richard (September 1996). Skewed branch predictors. HAL (report). S2CID 3712157.
- ^ Eyerman, S.; Smith, J.E.; Eeckhout, L. (2006). Characterizing the branch misprediction penalty. 2006 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE. pp. 48–58. doi:10.1109/ispass.2006.1620789. ISBN 1-4244-0186-0. S2CID 72217.
- ^ Shen, John P.; Lipasti, Mikko (2005). Modern processor design: fundamentals of superscalar processors. Boston: McGraw-Hill Higher Education. pp. 455. ISBN 0-07-057064-7.
- ^ "The RISC-V Instruction Set Manual Volume I Unprivileged Architecture". Google Docs.
- ^ a b c d e f Fog, Agner (2016-12-01). "The microarchitecture of Intel, AMD, and VIA CPUs" (PDF). pp. 26, 38. Retrieved 2017-03-22.
- ^ "The Pentium 4 and the G4e: an Architectural Comparison". Ars Technica. 12 May 2001.
- ^ Plusquellic, Jim. "CMSC 611: Advanced Computer Architecture, Chapter 4 (Part V)".
- ^ a b c McFarling, Scott (June 1993). "Combining Branch Predictors" (PDF). Digital Western Research Lab (WRL) Technical Report, TN-36.
- ^ "New Algorithm Improves Branch Prediction: 3/27/95" (PDF). Microprocessor Report. 9 (4). March 27, 1995. Archived (PDF) from the original on 2015-03-10. Retrieved 2016-02-02.
- ^ Yeh, T.-Y.; Patt, Y. N. (1991). "Two-Level Adaptive Training Branch Prediction". Proceedings of the 24th annual international symposium on Microarchitecture. Albuquerque, New Mexico, Puerto Rico: ACM. pp. 51–61. doi:10.1145/123465.123475.
- ^ Egan, Colin; Steven, Gordon; Quick, P.; Anguera, R.; Vintan, Lucian (December 2003). "Two-Level Branch Prediction using Neural Networks". Journal of Systems Architecture. 49 (12–15): 557–570. doi:10.1016/S1383-7621(03)00095-X.
- ^ Skadron, K.; Martonosi, M.; Clark, D. W. (October 2000). "A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions" (PDF). Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques. Philadelphia. pp. 199–206. doi:10.1109/PACT.2000.888344.
- ^ Sprangle, E.; Chappell, R.S.; Alsup, M.; Patt, Y.N. (June 1997). "The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference" (PDF). Proceedings of the 24th International Symposium on Computer Architecture. Denver. doi:10.1145/264107.264210.
- ^ "Cortex-A15 MPCore Technical Reference Manual, section 6.5.3 "Indirect predictor"". ARM Holdings.
- ^ Driesen, Karel; Hölzle, Urs (1997-06-25). "Limits of Indirect Branch Prediction" (PDF). Archived from the original (PDF) on 2016-05-06.
- ^ Stokes, Jon (2004-02-25). "A Look at Centrino's Core: The Pentium M". pp. 2–3.
- ^ Kanter, Aaron (2008-10-28). "Performance Analysis for Core 2 and K8: Part 1". p. 5.
- ^ z/Architecture Principles of Operation (PDF) (Fourteenth ed.). IBM. May 2022. pp. 7-42 – 7-45. SA22-7832-13.
- ^ "IBM zEnterprise BC12 Technical Guide" (PDF). IBM. February 2014. p. 78.
- ^ WO 2000/014628, Yeh, Tse-Yu & Sharangpani, H. P., "A method and apparatus for branch prediction using a second level branch prediction table", published 2000-03-16
- ^ Vintan, Lucian N. (1999). "Towards a High Performance Neural Branch Predictor" (PDF). Proceedings International Journal Conference on Neural Networks (IJCNN). doi:10.1109/IJCNN.1999.831066. Archived from the original (PDF) on 2019-07-13. Retrieved 2010-12-02.
- ^ Vintan, Lucian N. (2000). "Towards a Powerful Dynamic Branch Predictor" (PDF). Romanian Journal of Information Science and Technology. 3 (3). Bucharest: Romanian Academy: 287–301. ISSN 1453-8245.
- ^ a b Jimenez, D. A.; Lin, C. (2001). "Dynamic Branch Prediction with Perceptrons" (PDF). Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA-7). Monterrey, NL, Mexico. pp. 197–296. doi:10.1109/HPCA.2001.903263.
- ^ Walton, Jarred (2012-05-15). "The AMD Trinity Review (A10-4600M): A New Hope". AnandTech. Archived from the original on May 17, 2012.
- ^ a b Jimenez, Daniel A. (December 2003). Fast Path-Based Neural Branch Prediction (PDF). The 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36). San Diego, USA. pp. 243–252. doi:10.1109/MICRO.2003.1253199. Archived from the original (PDF) on 2016-03-31. Retrieved 2018-04-08.
- ^ "Championship Branch Prediction".
- ^ Brekelbaum, Edward; Rupley, Jeff; Wilkerson, Chris; Black, Bryan (December 2002). "Hierarchical scheduling windows". Proceedings of the 35th International Symposium on Microarchitecture. Istanbul, Turkey. doi:10.1109/MICRO.2002.1176236.
- ^ James, Dave (2017-12-06). "AMD Ryzen reviews, news, performance, pricing, and availability". PCGamesN.
- ^ "AMD Takes Computing to a New Horizon with Ryzen™ Processors" (Press release). AMD. Retrieved 2016-12-14.
- ^ "AMD's Zen CPU is now called Ryzen, and it might actually challenge Intel". Ars Technica UK. Retrieved 2016-12-14.
- ^ "IBM Stretch (7030) -- Aggressive Uniprocessor Parallelism".
- ^ "S-1 Supercomputer".
- ^ Murray, J.E.; Salett, R.M.; Hetherington, R.C.; McKeen, F.X. (1990). "Micro-architecture of the VAX 9000". Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage. pp. 44–53. doi:10.1109/CMPCON.1990.63652. ISBN 0-8186-2028-5. S2CID 24999559.
- ^ a b Seznec, A.; Felix, S.; Krishnan, V.; Sazeides, Y. "Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor". Proceedings 29th Annual International Symposium on Computer Architecture. doi:10.1109/ISCA.2002.1003587.
- ^ Gibbs, Samuel (2018-01-04). "Meltdown and Spectre: 'worst ever' CPU bugs affect virtually all computers". the Guardian. Retrieved 2018-05-18.
External links
[edit]- Seznec et al. (1996). "Multiple-Block Ahead Branch Predictors" – demonstrates prediction accuracy is not impaired by indexing with previous branch address.
- Seznec et al. (2002). "Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor" – describes the Alpha EV8 branch predictor. This paper does an excellent job discussing how they arrived at their design from various hardware constraints and simulation studies.
- Jimenez (2003). "Reconsidering Complex Branch Predictors" – describes the EV6 and K8 branch predictors, and pipelining considerations.
- Fog, Agner (2009). "The microarchitecture of Intel, AMD, and VIA CPUs". Retrieved 2009-10-01.
- Andrews, Jeff (2007-10-30). "Branch and Loop Reorganization to Prevent Mispredicts". Intel Software Network. Archived from the original on 2018-11-11. Retrieved 2018-11-10.
- Yee, Alexander (Jun 27, 2012). "What is Branch Prediction? (first answer, Answer 35214) Why is processing a sorted array faster than processing an unsorted array?". Stack Overflow: Java.
Branch predictor
View on GrokipediaFundamentals
Definition and Purpose
A conditional branch is a fundamental instruction in most processor instruction set architectures (ISA) that alters the sequential flow of program execution by changing the program counter to a target address if a specified condition—such as a register value equaling zero or a flag indicating carry—is met; otherwise, execution continues to the next sequential instruction.[8] These instructions enable critical control flow mechanisms in software, including decision-making structures like if-else statements and loops, which are compiled from high-level languages into assembly code; for instance, a simple conditional check might translate to a branch-if-equal (BEQ) in RISC architectures or a conditional jump (Jcc) in x86.[9] Without such branches, programs would be limited to linear execution, severely restricting algorithmic flexibility and efficiency.[1] In pipelined processors, conditional branches introduce control hazards because their outcome is unknown until the condition is evaluated, typically late in the pipeline stages like execute or memory; to mitigate stalls, the CPU speculatively fetches and executes instructions from the predicted path.[8] A branch misprediction occurs when the speculation is incorrect, necessitating the flush of all speculatively executed instructions from the pipeline and a restart from the correct path, which incurs a significant performance penalty—often 10-20 cycles in superscalar processors due to the depth of modern pipelines and the cost of recovering state.[9] This penalty scales with pipeline length and issue width, as more resources are wasted on incorrect speculation, directly reducing overall throughput.[10] Branch instructions typically comprise 15-20% of executed instructions in representative integer workloads like SPEC CPU benchmarks.[9] High prediction accuracy can significantly boost instructions per cycle (IPC), as even small improvements in reducing mispredictions prevent frequent pipeline disruptions and sustain higher effective utilization. Core components of a branch predictor include the Branch Target Buffer (BTB), a cache-like structure that stores recently encountered branch addresses along with their target addresses to enable rapid identification and prefetching of taken branches, and prediction tables that maintain direction predictions (taken or not taken) using counters or history-based entries.[11][8]Branch Hazards in Pipelined Processors
In pipelined processors, instructions are processed through multiple stages to improve throughput, typically including instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write-back (WB).[12] Sequential execution assumes that the next instruction follows the current one in memory, allowing the fetch stage to load the subsequent address immediately after the current IF completes. However, conditional branch instructions, which alter the program counter (PC) based on a condition, disrupt this flow by potentially redirecting execution to a non-sequential address, leading to incorrect instructions being fetched and processed in subsequent stages until the branch outcome is resolved, often in the EX or MEM stage.[12] Branch hazards primarily manifest as control hazards, which occur when the pipeline cannot determine the next instruction to fetch due to unresolved branch decisions, such as whether a conditional branch will be taken or not taken.[13] This contrasts with data hazards, which arise from dependencies between instructions where a later instruction requires the result of an earlier one that has not yet completed.[13] Control hazards specifically affect taken branches, where execution jumps to a target address, or not-taken branches, which continue sequentially; in both cases, the delay in resolution causes the pipeline to stall or fetch wrong-path instructions.[12] Upon detecting a misprediction, the processor recovers by flushing the incorrectly fetched instructions from the pipeline stages and refetching from the correct PC, which incurs a significant performance penalty.[14] In deep pipelines, this penalty typically ranges from 10 to 20 cycles, depending on pipeline depth and the number of misspeculated instructions, as the entire frontend must refill after draining the wrong-path work.[14] To mitigate these stalls, processors employ speculation by predicting branch outcomes early in the IF or ID stage and fetching ahead along the predicted path, thereby overlapping execution and reducing idle cycles if the prediction is correct.[12]Basic Prediction Techniques
Static Branch Prediction
Static branch prediction employs fixed rules determined at compile time or through simple hardware heuristics to forecast the outcome of conditional branches, without adapting to runtime execution patterns. These techniques rely on assumptions about typical branch behavior, such as the direction of the branch displacement, to decide whether a branch will be taken or not taken.[15] Common methods include the always-not-taken strategy, where all branches are predicted as not taken, and the always-taken strategy, where all branches are assumed to be taken. A more refined approach is the backward-taken/forward-not-taken (BTFNT) heuristic, which predicts backward branches (negative displacement, often loop closures) as taken and forward branches (positive displacement) as not taken. This BTFNT method leverages the observation that loops frequently execute their closing branches, making it a simple hardware implementation based on the sign bit of the branch offset.[16][15] The primary advantages of static branch prediction are its simplicity and minimal hardware overhead, requiring no storage for history or training phases, which reduces power consumption and design complexity. For instance, the always-not-taken method was used in the Intel i486 processor, avoiding any additional logic beyond default pipeline flow. There is also no misprediction recovery overhead during an initial "warm-up" period, unlike dynamic predictors.[15][17] However, static methods suffer from limited accuracy, typically achieving 60-70% correct predictions overall, as they cannot account for program-specific or runtime-varying branch behaviors. The always-not-taken approach, for example, performs poorly on looping code where backward branches are repeatedly taken, leading to frequent mispredictions and pipeline flushes; in benchmarks, it yields around 60% accuracy on integer workloads but fails on irregular or loop-heavy applications. BTFNT improves this to about 66% accuracy by favoring loops, yet still incurs a 34% misprediction rate on diverse programs.[18][19] Implementations often involve compiler-directed decisions, such as profile-guided optimization where branches are annotated based on execution traces from training runs. Some instruction set architectures support hardware hints for static prediction; for example, MIPS includes a "branch likely" bit in conditional branch instructions, set by the compiler to indicate high probability of being taken, allowing the hardware to delay instruction fetch until resolution and reduce penalties on mispredictions.[19][15]Dynamic Branch Prediction
Dynamic branch prediction adapts to program behavior at runtime by leveraging hardware mechanisms that learn from historical branch outcomes, in contrast to static methods that rely on fixed compiler or architectural decisions. This approach observes whether branches were taken or not taken in previous executions and uses that information to inform future predictions, enabling the predictor to adjust dynamically as the program's control flow evolves. Introduced in seminal work by James E. Smith, dynamic prediction employs a table-based structure where entries are updated based on observed outcomes, allowing for more accurate speculation in pipelined processors where branch mispredictions can incur significant penalties.[20] At its core, the hardware consists of a pattern history table (PHT), which is an array of saturating counters indexed by the branch's program counter (PC) or address. Each entry in the PHT typically uses a 2-bit counter to represent prediction states: strongly taken, weakly taken, weakly not taken, and strongly not taken; the counter increments on taken branches and decrements on not taken ones, saturating at the extremes to provide hysteresis against transient changes. When a branch is encountered, its PC selects a PHT entry, and the counter's state determines the prediction—taken if the value exceeds a threshold (e.g., greater than 1) and not taken otherwise—after which the actual outcome updates the counter for future use. This simple yet effective mechanism, as detailed in early studies, captures local patterns in branch behavior without requiring software intervention. Later evolutions incorporate branch history for indexing to capture correlations.[20] Basic per-branch dynamic predictors achieve substantial accuracy improvements over static techniques, reaching 85-93% correct predictions on standard benchmarks like SPEC, compared to approximately 60% for static always-taken or backward-taken strategies, which fail to adapt to varying workloads. These gains stem from the ability to learn program-specific patterns, reducing misprediction rates and thereby mitigating pipeline stalls in high-performance processors. For instance, in evaluations of integer benchmarks, basic dynamic schemes halved mispredictions relative to static baselines, yielding up to 13% overall performance uplift by minimizing branch-related delays.[21] The evolution of dynamic branch prediction has progressed from these foundational counter-based designs to more sophisticated methods incorporating longer histories and correlations across branches, though core principles of history-driven table updates remain central; advanced variants are explored in subsequent sections. Early implementations focused on per-branch counters for simplicity and low overhead, while later refinements addressed aliasing and correlation to push accuracies even higher in modern superscalar architectures.[21]Core Dynamic Predictor Architectures
One-Level Branch Predictors
One-level branch predictors, also referred to as bimodal predictors, represent a foundational dynamic branch prediction technique that relies solely on the branch's program counter (PC) address to index a pattern history table (PHT) populated with 2-bit saturating counters. This architecture enables the predictor to adapt to branch behavior over time without incorporating historical outcome patterns from previous branches. The PHT serves as a small, fully associative cache-like structure where each entry corresponds to a potential branch location, allowing quick lookups during instruction fetch to anticipate whether a conditional branch will be taken or not taken.[22] In operation, each 2-bit counter in the PHT maintains one of four states to encode prediction strength and direction: 00 for strongly not taken, 01 for weakly not taken, 10 for weakly taken, and 11 for strongly taken. When a branch is fetched, the predictor indexes the PHT using the branch PC and interprets the counter's most significant bit to make the prediction—predicting taken if the bit is 1 (states 10 or 11) and not taken if 0 (states 00 or 01). Upon resolution of the branch outcome, the counter is updated accordingly: incremented toward 11 if the branch is taken or decremented toward 00 if not taken, with saturation preventing overflow beyond these bounds. This mechanism provides hysteresis, allowing the predictor to tolerate occasional opposite outcomes without immediately flipping the prediction, thereby improving stability for branches with biased but imperfect patterns.[22][23] The indexing process typically hashes or directly uses a subset of the branch PC bits to generate the table index, such as the lower-order bits PC[11:2] for a 256-entry PHT, where the index is simply the value of those bits. This address-only approach avoids the complexity of history registers but introduces potential aliasing, where unrelated branches mapping to the same index interfere with each other's history, degrading performance in programs with many branches. Empirical evaluations on benchmarks like SPEC demonstrate that one-level predictors achieve accuracies up to approximately 93% with large tables, though practical sizes often yield around 85% due to aliasing effects.[22][23]| Counter State | Binary | Prediction | Update on Taken | Update on Not Taken |
|---|---|---|---|---|
| Strongly Not Taken | 00 | Not Taken | Increment to 01 | Saturate at 00 |
| Weakly Not Taken | 01 | Not Taken | Increment to 10 | Decrement to 00 |
| Weakly Taken | 10 | Taken | Increment to 11 | Decrement to 01 |
| Strongly Taken | 11 | Taken | Saturate at 11 | Decrement to 10 |