Notes/Branch Prediction

🎯

Branch Prediction

L10L11Post-Midterm

Predict branch outcomes to keep the pipeline full

Topics Covered

Branch Target Buffer (BTB)Last-time (1-bit) predictor2-bit saturating counterTwo-level global prediction (GHR + PHT)Gshare predictor (GHR XOR PC)Two-level local predictionTournament/hybrid predictorsAlpha 21264 predictor

Why Branch Prediction Matters

Modern processors use deep pipelines (15-20+ stages). When a branch instruction is fetched, the outcome (taken/not-taken) and the target address are not known until several stages later. Without prediction, the processor must stall or flush incorrectly fetched instructions.

Minimum branch penalty: ~7 cycles in modern processors
Typical misprediction penalty: 11+ cycles (some architectures even more)
Branches are ~20% of all instructions, so even small misprediction rates cause significant performance loss

The goal: predict direction (taken vs. not-taken) and target address as early as possible, ideally at the fetch stage.

Key Points

•Branch penalty = number of pipeline stages wasted on misprediction
•Min penalty ~7 cycles, typical 11+ cycles
•Branches are ~20% of instructions — prediction accuracy is critical
•Must predict both direction AND target address

⚠

Exam Tip

Know the difference between branch direction prediction and branch target prediction. Questions often ask you to calculate CPI impact of misprediction rate.

Branch Target Buffer (BTB)

The Branch Target Buffer (BTB) is a hardware cache that stores the target address of previously taken branches. It is indexed by the PC of the branch instruction.

Lookup happens at fetch stage — before the instruction is even decoded
If the PC is found in the BTB, the processor fetches from the predicted target
If not found, the processor assumes not-taken (fetches PC+4)
BTB stores: Tag (PC) | Target Address | Valid bit

The BTB only tells you where to go if the branch is taken. A separate direction predictor determines whether the branch is taken or not-taken.

BTB miss scenarios:

First time seeing this branch (compulsory miss)
BTB capacity exceeded (capacity miss)
Conflict in BTB indexing (conflict miss)

Key Points

•BTB caches target addresses of previously taken branches
•Indexed by PC, looked up at fetch stage
•BTB miss → assume not-taken (fetch PC+4)
•BTB provides target; direction predictor provides taken/not-taken

⚠

Exam Tip

BTB is about TARGET prediction, not direction prediction. They are separate mechanisms that work together.

Last-Time (1-Bit) Predictor

The simplest direction predictor: store 1 bit per branch indicating the last outcome.

Predict the same direction as the last time this branch was executed
State machine with 2 states: Taken (T) and Not-Taken (NT)
On misprediction, flip the bit

Problem with loops: For a loop that executes N times, the 1-bit predictor mispredicts twice per loop invocation:

On the last iteration (predicts Taken, but loop exits → Not-Taken)
On the first iteration of the next invocation (predicts Not-Taken from last exit, but loop enters → Taken)

Loop accuracy: (N-2)/N for each full invocation (2 mispredictions per N iterations).

Key Points

•1-bit predictor = "predict same as last time"
•Two states: Taken, Not-Taken
•Mispredicts twice per loop invocation (on exit and re-entry)
•Loop accuracy = (N-2)/N per invocation

⚠

Exam Tip

Be careful: the question may ask per-iteration accuracy vs. per-invocation accuracy. For a loop running N iterations, you get 2 mispredictions → accuracy (N-2)/N.

2-Bit Saturating Counter Predictor

Uses a 2-bit saturating counter per branch. The counter has 4 states:

State	Value	Prediction
Strongly Not-Taken	`00`	Not-Taken
Weakly Not-Taken	`01`	Not-Taken
Weakly Taken	`10`	Taken
Strongly Taken	`11`	Taken

Transitions:

On Taken outcome: increment (saturate at 11)
On Not-Taken outcome: decrement (saturate at 00)

Key advantage over 1-bit: A single anomalous outcome does not flip the prediction. The predictor must see two consecutive mispredictions to change direction.

Loop accuracy: For a loop of N iterations, only 1 misprediction per invocation (on the exit). The re-entry is correctly predicted because the counter only drops to "weakly taken" (10), still predicting Taken. Accuracy = (N-1)/N.

Key Points

•4 states: 00 (Strongly NT), 01 (Weakly NT), 10 (Weakly T), 11 (Strongly T)
•Saturating: increment on Taken, decrement on Not-Taken, no wrap-around
•Requires two consecutive mispredictions to change predicted direction
•Loop accuracy = (N-1)/N — only mispredicts the loop exit

⚠

Exam Tip

This is extremely common on exams. Be able to trace through the state transitions for a given branch pattern (e.g., TTTNT TTTNT). Know the 4 states and their binary encodings.

Two-Level Global Prediction (GHR + PHT)

The idea: branch outcomes are often correlated with the outcomes of recent branches. A global predictor captures this inter-branch correlation.

Components:

Global History Register (GHR): A single shift register (shared by all branches) that records the last k branch outcomes (1 = Taken, 0 = Not-Taken)
Pattern History Table (PHT): An array of 2^k entries, each a 2-bit saturating counter

Operation:

Use the GHR value (k bits) to index into the PHT
The selected 2-bit counter provides the prediction
After the branch resolves, update the GHR (shift in the outcome) and update the PHT counter

Limitation: Different branches with the same global history map to the same PHT entry → aliasing/interference. Two branches that have nothing to do with each other can corrupt each other's predictions.

Key Points

•GHR = single shift register recording last k branch outcomes
•PHT = 2^k entries of 2-bit saturating counters
•GHR indexes the PHT to get prediction
•Captures inter-branch correlation (global patterns)
•Suffers from aliasing: different branches, same GHR → same PHT entry

⚠

Exam Tip

Understand the difference between global correlation (captured by GHR) and per-branch patterns (captured by local predictors). Exam questions may ask which predictor is better for which pattern.

Gshare Predictor (GHR XOR PC)

Gshare reduces aliasing in global prediction by XOR-ing the GHR with the branch PC to index the PHT.

Index = GHR XOR PC[k:0]

Same PHT structure as two-level global (2^k entries of 2-bit counters)
XOR spreads different branches across different PHT entries even when they share the same global history
Simple hardware: just XOR gates, no extra storage

Why it works: Two branches at different PCs with the same GHR value now index different PHT entries, reducing destructive aliasing.

Limitation: Still a global predictor — does not capture per-branch (local) patterns well. Some branches have periodic patterns (e.g., TNTNTN) that are better captured by local history.

Key Points

•Index = GHR XOR PC — reduces aliasing vs. pure GHR indexing
•Same PHT of 2-bit counters
•Simple to implement, better accuracy than pure global
•Still a global predictor — misses per-branch local patterns

⚠

Exam Tip

Gshare is one of the most common predictors asked about. Know that XOR of GHR and PC is the key innovation over basic two-level global.

Two-Level Local Prediction

Instead of one global history, keep a separate history register per branch. This captures per-branch patterns (e.g., alternating T/NT).

Components:

Branch History Table (BHT): Array of per-branch shift registers, indexed by PC. Each stores the last k outcomes of that specific branch.
Pattern History Table (PHT): Shared array of 2-bit counters, indexed by the per-branch history

Operation:

Use PC to look up the branch's local history register in the BHT
Use the local history value to index the PHT
Get prediction from the 2-bit counter
After resolution, update BHT (shift in outcome) and PHT counter

Advantage: Captures branch-specific periodic patterns (loops, alternating, etc.).

Disadvantage: Cannot capture correlations between different branches.

Key Points

•Per-branch history registers in BHT (indexed by PC)
•Local history indexes shared PHT of 2-bit counters
•Excellent for branch-specific patterns (TNTNTN, loops)
•Cannot see inter-branch correlations like global predictors

Tournament/Hybrid Predictor (Alpha 21264)

A tournament predictor combines a local and a global predictor, using a choice predictor to select which one to trust for each branch.

Alpha 21264 Tournament Predictor:

Local History Table (LHT): 1024 entries x 10-bit histories (indexed by PC)
Local Prediction Table: 1024 entries x 3-bit counters (indexed by local history)
Global Prediction Table: 4096 entries x 2-bit counters (indexed by 12-bit GHR)
Choice Prediction Table: 4096 entries x 2-bit counters (indexed by 12-bit GHR)

How it works:

Both the local and global predictors produce a prediction
The choice predictor decides which to use
Choice counter: ≥2 → use global, <2 → use local
Choice counter is updated when the two predictors disagree: if global was correct, increment; if local was correct, decrement

Key insight: The choice predictor only trains when the two sub-predictors disagree. If both predict the same thing, the choice doesn't matter.

Result: Gets the best of both worlds — local patterns AND global correlations. The Alpha 21264 achieved >95% accuracy on most benchmarks.

Key Points

•Combines local + global predictor with a choice predictor
•Alpha 21264: LHT 1024x10, Local Pred 1024x3, Global Pred 4096x2, Choice Pred 4096x2
•Choice updated only when local and global disagree
•Achieves best of both: local patterns + global correlations

⚠

Exam Tip

Memorize the Alpha 21264 table sizes (1024x10, 1024x3, 4096x2, 4096x2). Know the choice predictor update rule: only updates when local and global disagree.

Back to all notes