Branch Prediction
Predict branch outcomes to keep the pipeline full
Topics Covered
Why Branch Prediction Matters
Modern processors use deep pipelines (15-20+ stages). When a branch instruction is fetched, the outcome (taken/not-taken) and the target address are not known until several stages later. Without prediction, the processor must stall or flush incorrectly fetched instructions.
- Minimum branch penalty: ~7 cycles in modern processors
- Typical misprediction penalty: 11+ cycles (some architectures even more)
- Branches are ~20% of all instructions, so even small misprediction rates cause significant performance loss
The goal: predict direction (taken vs. not-taken) and target address as early as possible, ideally at the fetch stage.
Key Points
- •Branch penalty = number of pipeline stages wasted on misprediction
- •Min penalty ~7 cycles, typical 11+ cycles
- •Branches are ~20% of instructions — prediction accuracy is critical
- •Must predict both direction AND target address
Exam Tip
Know the difference between branch direction prediction and branch target prediction. Questions often ask you to calculate CPI impact of misprediction rate.
Branch Target Buffer (BTB)
The Branch Target Buffer (BTB) is a hardware cache that stores the target address of previously taken branches. It is indexed by the PC of the branch instruction.
- Lookup happens at fetch stage — before the instruction is even decoded
- If the PC is found in the BTB, the processor fetches from the predicted target
- If not found, the processor assumes not-taken (fetches PC+4)
- BTB stores:
Tag (PC) | Target Address | Valid bit
The BTB only tells you where to go if the branch is taken. A separate direction predictor determines whether the branch is taken or not-taken.
BTB miss scenarios:
- First time seeing this branch (compulsory miss)
- BTB capacity exceeded (capacity miss)
- Conflict in BTB indexing (conflict miss)
Key Points
- •BTB caches target addresses of previously taken branches
- •Indexed by PC, looked up at fetch stage
- •BTB miss → assume not-taken (fetch PC+4)
- •BTB provides target; direction predictor provides taken/not-taken
Exam Tip
BTB is about TARGET prediction, not direction prediction. They are separate mechanisms that work together.
Last-Time (1-Bit) Predictor
The simplest direction predictor: store 1 bit per branch indicating the last outcome.
- Predict the same direction as the last time this branch was executed
- State machine with 2 states: Taken (T) and Not-Taken (NT)
- On misprediction, flip the bit
Problem with loops: For a loop that executes N times, the 1-bit predictor mispredicts twice per loop invocation:
- On the last iteration (predicts Taken, but loop exits → Not-Taken)
- On the first iteration of the next invocation (predicts Not-Taken from last exit, but loop enters → Taken)
Loop accuracy: (N-2)/N for each full invocation (2 mispredictions per N iterations).
Key Points
- •1-bit predictor = "predict same as last time"
- •Two states: Taken, Not-Taken
- •Mispredicts twice per loop invocation (on exit and re-entry)
- •Loop accuracy = (N-2)/N per invocation
Exam Tip
Be careful: the question may ask per-iteration accuracy vs. per-invocation accuracy. For a loop running N iterations, you get 2 mispredictions → accuracy (N-2)/N.
2-Bit Saturating Counter Predictor
Uses a 2-bit saturating counter per branch. The counter has 4 states:
| State | Value | Prediction |
|---|---|---|
| Strongly Not-Taken | 00 | Not-Taken |
| Weakly Not-Taken | 01 | Not-Taken |
| Weakly Taken | 10 | Taken |
| Strongly Taken | 11 | Taken |
Transitions:
- On Taken outcome: increment (saturate at 11)
- On Not-Taken outcome: decrement (saturate at 00)
Key advantage over 1-bit: A single anomalous outcome does not flip the prediction. The predictor must see two consecutive mispredictions to change direction.
Loop accuracy: For a loop of N iterations, only 1 misprediction per invocation (on the exit). The re-entry is correctly predicted because the counter only drops to "weakly taken" (10), still predicting Taken. Accuracy = (N-1)/N.
Key Points
- •4 states: 00 (Strongly NT), 01 (Weakly NT), 10 (Weakly T), 11 (Strongly T)
- •Saturating: increment on Taken, decrement on Not-Taken, no wrap-around
- •Requires two consecutive mispredictions to change predicted direction
- •Loop accuracy = (N-1)/N — only mispredicts the loop exit
Exam Tip
This is extremely common on exams. Be able to trace through the state transitions for a given branch pattern (e.g., TTTNT TTTNT). Know the 4 states and their binary encodings.
Two-Level Global Prediction (GHR + PHT)
The idea: branch outcomes are often correlated with the outcomes of recent branches. A global predictor captures this inter-branch correlation.
Components:
- Global History Register (GHR): A single shift register (shared by all branches) that records the last k branch outcomes (1 = Taken, 0 = Not-Taken)
- Pattern History Table (PHT): An array of 2k entries, each a 2-bit saturating counter
Operation:
- Use the GHR value (k bits) to index into the PHT
- The selected 2-bit counter provides the prediction
- After the branch resolves, update the GHR (shift in the outcome) and update the PHT counter
Limitation: Different branches with the same global history map to the same PHT entry → aliasing/interference. Two branches that have nothing to do with each other can corrupt each other's predictions.
Key Points
- •GHR = single shift register recording last k branch outcomes
- •PHT = 2^k entries of 2-bit saturating counters
- •GHR indexes the PHT to get prediction
- •Captures inter-branch correlation (global patterns)
- •Suffers from aliasing: different branches, same GHR → same PHT entry
Exam Tip
Understand the difference between global correlation (captured by GHR) and per-branch patterns (captured by local predictors). Exam questions may ask which predictor is better for which pattern.
Gshare Predictor (GHR XOR PC)
Gshare reduces aliasing in global prediction by XOR-ing the GHR with the branch PC to index the PHT.
Index = GHR XOR PC[k:0]
- Same PHT structure as two-level global (2k entries of 2-bit counters)
- XOR spreads different branches across different PHT entries even when they share the same global history
- Simple hardware: just XOR gates, no extra storage
Why it works: Two branches at different PCs with the same GHR value now index different PHT entries, reducing destructive aliasing.
Limitation: Still a global predictor — does not capture per-branch (local) patterns well. Some branches have periodic patterns (e.g., TNTNTN) that are better captured by local history.
Key Points
- •Index = GHR XOR PC — reduces aliasing vs. pure GHR indexing
- •Same PHT of 2-bit counters
- •Simple to implement, better accuracy than pure global
- •Still a global predictor — misses per-branch local patterns
Exam Tip
Gshare is one of the most common predictors asked about. Know that XOR of GHR and PC is the key innovation over basic two-level global.
Two-Level Local Prediction
Instead of one global history, keep a separate history register per branch. This captures per-branch patterns (e.g., alternating T/NT).
Components:
- Branch History Table (BHT): Array of per-branch shift registers, indexed by PC. Each stores the last k outcomes of that specific branch.
- Pattern History Table (PHT): Shared array of 2-bit counters, indexed by the per-branch history
Operation:
- Use PC to look up the branch's local history register in the BHT
- Use the local history value to index the PHT
- Get prediction from the 2-bit counter
- After resolution, update BHT (shift in outcome) and PHT counter
Advantage: Captures branch-specific periodic patterns (loops, alternating, etc.).
Disadvantage: Cannot capture correlations between different branches.
Key Points
- •Per-branch history registers in BHT (indexed by PC)
- •Local history indexes shared PHT of 2-bit counters
- •Excellent for branch-specific patterns (TNTNTN, loops)
- •Cannot see inter-branch correlations like global predictors
Tournament/Hybrid Predictor (Alpha 21264)
A tournament predictor combines a local and a global predictor, using a choice predictor to select which one to trust for each branch.
Alpha 21264 Tournament Predictor:
- Local History Table (LHT): 1024 entries x 10-bit histories (indexed by PC)
- Local Prediction Table: 1024 entries x 3-bit counters (indexed by local history)
- Global Prediction Table: 4096 entries x 2-bit counters (indexed by 12-bit GHR)
- Choice Prediction Table: 4096 entries x 2-bit counters (indexed by 12-bit GHR)
How it works:
- Both the local and global predictors produce a prediction
- The choice predictor decides which to use
- Choice counter: ≥2 → use global, <2 → use local
- Choice counter is updated when the two predictors disagree: if global was correct, increment; if local was correct, decrement
Key insight: The choice predictor only trains when the two sub-predictors disagree. If both predict the same thing, the choice doesn't matter.
Result: Gets the best of both worlds — local patterns AND global correlations. The Alpha 21264 achieved >95% accuracy on most benchmarks.
Key Points
- •Combines local + global predictor with a choice predictor
- •Alpha 21264: LHT 1024x10, Local Pred 1024x3, Global Pred 4096x2, Choice Pred 4096x2
- •Choice updated only when local and global disagree
- •Achieves best of both: local patterns + global correlations
Exam Tip
Memorize the Alpha 21264 table sizes (1024x10, 1024x3, 4096x2, 4096x2). Know the choice predictor update rule: only updates when local and global disagree.