Cache Design
Memory hierarchy, cache organization, and optimization
SRAM vs DRAM
Two fundamental memory technologies used in modern systems:
| Property | SRAM | DRAM |
|---|---|---|
| Cell structure | 6 transistors (6T cell) | 1 transistor + 1 capacitor (1T1C) |
| Speed | Fast (~1-2 ns) | Slower (~50-100 ns) |
| Density | Lower (6T per bit) | Higher (1T1C per bit) |
| Refresh needed? | No (static latch) | Yes (capacitor leaks charge) |
| Read type | Non-destructive | Destructive (must write-back) |
| Cost | Expensive | Cheap per bit |
| Use | Caches (L1, L2, L3) | Main memory |
DRAM destructive reads: Reading a DRAM cell discharges the capacitor. After every read, the sense amplifier must write back the data to restore the charge. This adds latency.
Key Points
- SRAM: 6T cell, fast, no refresh, lower density, used for caches
- DRAM: 1T1C cell, destructive reads, needs refresh every ~64 ms, higher density, used for main memory
- SRAM is ~10-100x faster than DRAM but ~10x more expensive per bit
- DRAM reads are destructive — must write-back after each read
Exam Tip
Know why DRAM needs refresh (capacitor leaks) and why reads are destructive (charge sharing). These are fundamental concepts that underpin memory hierarchy design.
Memory Hierarchy & Locality
The memory hierarchy exploits the trade-off between speed, size, and cost:
- Registers: sub-nanosecond, bytes, in the CPU core
- L1 Cache: ~1 ns, ~32 KB, per-core
- L2 Cache: ~3-10 ns, 256 KB - 1 MB, per-core or shared
- L3 Cache: ~10-30 ns, several MB, shared across cores
- Main Memory (DRAM): ~50-100 ns, GBs
- Disk/SSD: ~10 ms (HDD) / ~100 µs (SSD), TBs
Why it works — Locality:
- Temporal locality: Recently accessed data is likely to be accessed again soon (e.g., loop variables, hot data structures)
- Spatial locality: Data near recently accessed data is likely to be accessed soon (e.g., array traversals, sequential instruction fetch)
Caches exploit temporal locality by keeping recently-used data close to the processor. They exploit spatial locality by fetching entire cache blocks (typically 64 bytes) at a time.
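The payoff from block-based fetching can be made concrete with a small back-of-the-envelope sketch. The block and element sizes below are assumptions for illustration (64 B blocks, 8 B elements, as in the text above):

```python
# Sketch: why 64-byte blocks help a sequential scan (spatial locality).
# Assumption: 8-byte elements, so one 64 B block holds 8 consecutive
# elements; a sequential scan misses once per block, then hits 7 times.

BLOCK_BYTES = 64
ELEM_BYTES = 8
ELEMS_PER_BLOCK = BLOCK_BYTES // ELEM_BYTES  # 8 elements per block

def sequential_miss_rate(n_elems):
    """Fraction of accesses that miss when scanning n_elems in order,
    assuming each block is fetched once and then fully reused."""
    misses = (n_elems + ELEMS_PER_BLOCK - 1) // ELEMS_PER_BLOCK
    return misses / n_elems

print(sequential_miss_rate(1024))  # 128 misses / 1024 accesses = 0.125
```

With these assumptions, a pure sequential scan already achieves an 87.5% hit rate from spatial locality alone, before any temporal reuse.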
Key Points
- Hierarchy: Registers (sub-ns) → L1 (~1 ns, 32 KB) → L2 (~10 ns, 512 KB) → L3 → DRAM (~100 ns, GBs) → Disk (~10 ms)
- Temporal locality: recently used → likely used again soon
- Spatial locality: nearby data → likely accessed soon
- Cache blocks (64 B typical) exploit spatial locality
Cache Basics: Organization & Addressing
A cache consists of:
- Data store: holds the actual cache blocks (lines)
- Tag store: holds metadata for each block (tag bits, valid bit, dirty bit, replacement info)
Address decomposition (for a direct-mapped cache):
| Tag (upper bits) | Index (middle bits) | Block Offset (lower bits) |
- Block Offset: log2(block size) bits — selects byte within the block
- Index: log2(number of sets) bits — selects which set to look in
- Tag: remaining upper bits — identifies which block from memory is stored
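The field widths follow directly from those three rules. A minimal sketch of the calculation (the 32-bit address width and the example parameters are assumptions; an `assoc` parameter is included so the same helper covers the set-associative case discussed later):

```python
import math

def cache_fields(cache_bytes, block_bytes, assoc=1, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a cache.
    assoc=1 gives the direct-mapped case."""
    num_sets = cache_bytes // (block_bytes * assoc)
    offset_bits = int(math.log2(block_bytes))   # byte within block
    index_bits = int(math.log2(num_sets))       # which set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 32 KB direct-mapped cache, 64 B blocks, 32-bit addresses:
print(cache_fields(32 * 1024, 64, 1))  # (17, 9, 6): 512 sets
# Same cache made 4-way: fewer sets, so fewer index bits, more tag bits:
print(cache_fields(32 * 1024, 64, 4))  # (19, 7, 6): 128 sets
```

Note how increasing associativity shifts bits from the index field into the tag field while the offset stays fixed.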
Terminology:
- Hit: requested data found in cache → fast access
- Miss: data not in cache → must fetch from next level
- Hit rate: fraction of accesses that hit in cache
- Miss rate: 1 - hit rate
Key Points
- Cache has tag store + data store
- Address split: Tag | Index | Block Offset
- Offset bits = log2(block size), Index bits = log2(num sets)
- Tag uniquely identifies which memory block is cached in that set
Exam Tip
Given cache size, block size, and associativity, be able to calculate number of sets, and the bit widths of tag/index/offset fields. This is a very common calculation question.
Direct-Mapped Cache
In a direct-mapped cache, each memory block maps to exactly one cache set (1-way associative).
- Number of sets = Cache size / Block size
- Set index = (Block address) mod (Number of sets)
- On access: check the tag at the indexed set. If tag matches and valid bit is set → hit
- On miss: the existing block at that index is evicted (no choice — only one place)
Advantages: Simple, fast lookup (single comparator), low power.
Disadvantages: High conflict miss rate — two blocks that map to the same index will keep evicting each other (ping-pong / thrashing).
Example: If blocks A and B both map to set 5, accessing A, B, A, B causes 100% miss rate even if the cache has room elsewhere.
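The ping-pong example can be reproduced with a toy simulator that tracks only block addresses (the 8-set size is an assumption for illustration):

```python
def direct_mapped_trace(block_addrs, num_sets=8):
    """Tiny direct-mapped simulator over block addresses; returns hit count.
    Each set holds exactly one tag, so a mismatch always evicts."""
    resident = {}  # set index -> tag currently stored there
    hits = 0
    for addr in block_addrs:
        idx, tag = addr % num_sets, addr // num_sets
        if resident.get(idx) == tag:
            hits += 1
        else:
            resident[idx] = tag  # no choice of victim: evict and replace
    return hits

# A = block 5, B = block 13: both map to set 5 (13 mod 8 = 5)
print(direct_mapped_trace([5, 13, 5, 13, 5, 13]))  # 0 — every access misses
```

All seven other sets sit empty while A and B evict each other, which is exactly the conflict-miss pathology.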
Key Points
- Each block maps to exactly one set — 1 comparator needed
- Simple and fast, but susceptible to conflict misses
- Thrashing: two blocks mapping to same set evict each other repeatedly
- Number of sets = Cache size / Block size
Set-Associative Cache
An N-way set-associative cache has N blocks (ways) per set. A memory block can go in any of the N ways within its mapped set.
- Number of sets = Cache size / (Block size x Associativity)
- On access: index selects the set, then N parallel comparators check all tags
- On miss: one of the N blocks in the set must be evicted (replacement policy decides which)
Spectrum:
- 1-way = direct-mapped
- N-way = set-associative (typical: 2, 4, 8, 16 ways)
- If #sets = 1, it's fully associative (any block can go anywhere)
Trade-off: More associativity → fewer conflict misses, but more comparators, higher latency, more power.
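Extending the toy simulator to N ways (with LRU replacement, covered in a later section) shows how associativity fixes the ping-pong trace. The set counts are chosen so both configurations hold the same total number of blocks:

```python
from collections import OrderedDict

def set_assoc_hits(block_addrs, num_sets, ways):
    """N-way set-associative simulator with per-set LRU replacement.
    Each set is an OrderedDict of resident tags, oldest first."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in block_addrs:
        s, tag = sets[addr % num_sets], addr // num_sets
        if tag in s:
            hits += 1
            s.move_to_end(tag)         # mark as most recently used
        else:
            if len(s) == ways:
                s.popitem(last=False)  # evict the LRU way
            s[tag] = True
    return hits

trace = [5, 13, 5, 13, 5, 13]       # the ping-pong trace from above
print(set_assoc_hits(trace, 8, 1))  # 0 — direct-mapped (8 blocks) thrashes
print(set_assoc_hits(trace, 4, 2))  # 4 — 2-way (also 8 blocks) holds both
```

Same capacity, same trace: two extra comparators per set turn a 100% miss rate into hits on every access after the two compulsory misses.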
Key Points
- N ways per set — N parallel tag comparisons needed
- Reduces conflict misses compared to direct-mapped
- More associativity = fewer conflicts but slower/more expensive
- #sets = CacheSize / (BlockSize x Associativity)
AMAT and Performance Metrics
Average Memory Access Time (AMAT) is the key metric for cache performance:
AMAT = Hit_Rate × Hit_Latency + Miss_Rate × Miss_Latency
Or equivalently:
AMAT = Hit_Latency + Miss_Rate × Miss_Penalty
where Miss_Penalty = Miss_Latency - Hit_Latency (time to service the miss).
For multi-level caches:
AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)
Local vs. Global miss rate:
- Local miss rate: misses at this level / accesses to this level
- Global miss rate: misses at this level / total CPU memory accesses
- Global L2 miss rate = L1 miss rate × L2 local miss rate
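A worked two-level example helps here; all the latencies and miss rates below are assumed numbers for illustration, not figures from any particular machine:

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    """Two-level AMAT in ns. NOTE: l2_miss_rate is the LOCAL L2 miss rate
    (misses at L2 / accesses reaching L2)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_penalty)

# Assumed: 1 ns L1 hit, 10% L1 miss rate, 10 ns L2 hit,
# 20% local L2 miss rate, 100 ns penalty to reach DRAM.
print(round(amat(1, 0.10, 10, 0.20, 100), 2))  # 1 + 0.1*(10 + 0.2*100) = 4.0

# Global L2 miss rate = L1 miss rate x local L2 miss rate:
print(round(0.10 * 0.20, 3))  # 0.02 — only 2% of all accesses reach DRAM
```

The mental check: even with a 20% local L2 miss rate, the hierarchy hides DRAM latency well, because only the L1-miss fraction ever pays the L2 term at all.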
Key Points
- AMAT = Hit_Time + Miss_Rate × Miss_Penalty
- Multi-level: AMAT = L1_hit + L1_miss_rate × (L2_hit + L2_miss_rate × L2_penalty)
- Local miss rate = misses / accesses to that level
- Global miss rate = misses / total CPU accesses
Exam Tip
AMAT calculations are bread and butter for cache exam questions. Practice the multi-level version. Watch for whether the question gives local or global miss rates.
Three C's of Cache Misses
All cache misses can be classified into three categories:
- Compulsory (Cold) Misses: First access to a block that has never been in the cache. Unavoidable unless you prefetch. These occur even in an infinite cache.
- Capacity Misses: The working set is larger than the cache. Even a fully-associative cache of the same size would miss. Occurs because the cache simply cannot hold all needed data.
- Conflict Misses: Multiple blocks map to the same set, causing evictions even though the cache has unused space in other sets. Only occur in direct-mapped and set-associative caches (NOT in fully-associative).
How to identify miss type:
- Simulate with infinite cache → remaining misses are compulsory
- Simulate with same-size fully-associative cache → additional misses beyond compulsory are capacity
- Simulate with actual associativity → additional misses beyond capacity are conflict
Key Points
- Compulsory: first-ever access to block (infinite cache still misses)
- Capacity: working set > cache size (fully-assoc same-size cache still misses)
- Conflict: blocks mapping to same set evict each other (only in limited associativity)
- Reducing: compulsory (prefetching), capacity (bigger cache), conflict (more associativity)
Exam Tip
A common exam question: given an access trace, classify each miss as compulsory, capacity, or conflict. Remember: compare against infinite cache, then fully-associative same-size, then actual.
Replacement Policies
When a miss occurs in a set-associative cache and the set is full, a replacement policy decides which existing block to evict.
- LRU (Least Recently Used): Evict the block that hasn't been accessed for the longest time. Optimal for many workloads but expensive to implement (need to track access order of all ways).
- Approximate LRU — Not-MRU: Don't track full order; just track which block was Most Recently Used and never evict it. Randomly pick from the rest. Much cheaper hardware.
- Hierarchical LRU: Divide ways into groups, do LRU within groups, approximate LRU across groups.
- Random: Pick a random victim. Surprisingly competitive with LRU for high associativity. Very simple hardware.
Set thrashing: When the working set mapping to a particular set exceeds the associativity, even LRU performs poorly — it keeps evicting blocks that will be needed soon. The worst case is a cyclic access pattern over A+1 distinct blocks in an A-way set: LRU always evicts exactly the block that will be requested next, so every access misses. In this regime even a random policy can outperform LRU.
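The LRU worst case is easy to reproduce for a single set. A minimal sketch (the 4-way set and the 5-block cyclic trace are assumptions chosen to trigger the pathology; the random policy typically retains some blocks on this trace, though its hit count varies run to run):

```python
from collections import OrderedDict
import random

def run_set(trace, ways, policy):
    """Simulate one cache set under LRU or random replacement;
    return the number of hits."""
    s, hits = OrderedDict(), 0
    for tag in trace:
        if tag in s:
            hits += 1
            s.move_to_end(tag)                 # most recently used
        else:
            if len(s) == ways:
                if policy == "lru":
                    s.popitem(last=False)      # evict least recently used
                else:
                    del s[random.choice(list(s))]  # evict a random way
            s[tag] = True
    return hits

trace = [0, 1, 2, 3, 4] * 20                # 5 blocks cycling in a 4-way set
print(run_set(trace, 4, "lru"))             # 0 — LRU evicts the next block needed
print(run_set(trace, 4, "random"))          # usually well above 0
```

This is the intuition behind the "random is surprisingly competitive" observation: random breaks the adversarial alignment between the eviction choice and the cyclic access pattern.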
Key Points
- LRU: evict least recently used — best but most expensive (track full order)
- Not-MRU: protect most recently used, evict randomly from rest — cheap
- Random: surprisingly good at high associativity, trivial hardware
- Set thrashing: when working set in a set > associativity, performance degrades
Write Policies
Two orthogonal decisions for handling writes:
Write Hit Policy:
- Write-back: Write only to cache. Mark block as dirty. Write to memory only on eviction. Reduces memory traffic but requires dirty bits and write-back on eviction.
- Write-through: Write to both cache and memory simultaneously. Simpler, memory always up-to-date, but more memory traffic. Often uses a write buffer to avoid stalling.
Write Miss Policy:
- Write-allocate: On write miss, fetch the block into cache, then write to it. Makes sense with write-back (exploit temporal locality on future writes).
- No-write-allocate (write-around): On write miss, write directly to next level, don't bring block into cache. Makes sense with write-through.
Common pairings:
- Write-back + Write-allocate (most common in modern caches)
- Write-through + No-write-allocate
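The write-back + write-allocate bookkeeping can be sketched for a single block slot, just to show when the dirty bit actually forces traffic to the next level (the `Block` class and `write` helper are hypothetical names for illustration):

```python
class Block:
    """One cache block slot: valid bit, dirty bit, stored tag."""
    def __init__(self):
        self.valid, self.dirty, self.tag = False, False, None

def write(block, tag, writebacks):
    """Handle a write in a write-back, write-allocate cache (one-way set).
    `writebacks` records tags flushed to the next level."""
    if not (block.valid and block.tag == tag):   # write miss
        if block.valid and block.dirty:
            writebacks.append(block.tag)         # flush dirty victim first
        block.valid, block.tag = True, tag       # allocate: fetch block in
    block.dirty = True                           # the write touches cache only

wb = []
blk = Block()
write(blk, tag=1, writebacks=wb)  # miss: allocate block 1, mark dirty
write(blk, tag=1, writebacks=wb)  # hit: no memory traffic at all
write(blk, tag=2, writebacks=wb)  # miss: dirty block 1 is written back first
print(wb)  # [1]
```

Note that the second write generated zero memory traffic — that batching of repeated writes is exactly why write-back + write-allocate saves bandwidth.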
Key Points
- Write-back: write to cache only, mark dirty, write to memory on eviction
- Write-through: write to cache + memory, simpler but more traffic
- Write-allocate: fetch block into cache on write miss
- No-write-allocate: write directly to memory, skip cache
- Common pairing: write-back + write-allocate
Exam Tip
Know the common pairings and why they make sense. Write-back + write-allocate reduces memory bandwidth by batching writes.
Improving Cache Performance
Using the AMAT formula, we can improve performance by reducing miss rate, reducing miss latency, or reducing hit time.
Reducing Miss Rate:
- Increase associativity → reduces conflict misses
- Increase cache size → reduces capacity misses
- Victim cache: Small fully-associative cache between L1 and L2 that stores recently evicted blocks. Catches conflict misses cheaply.
- Software optimization: loop interchange, loop fusion, tiling/blocking
Reducing Miss Latency:
- Multi-level caches: L1 fast but small, L2 larger with moderate latency → catches many L1 misses without going to DRAM
- Critical word first: Request the needed word first from memory, send it to processor immediately while filling rest of block
- MSHRs (Miss Status Holding Registers): Track outstanding cache misses. Enable non-blocking caches — the cache can continue servicing hits (and even other misses) while a miss is being resolved. Without MSHRs, a miss stalls all subsequent accesses.
Software optimizations:
- Loop interchange: Change loop nesting order to access arrays in row-major order (better spatial locality)
- Loop fusion: Merge loops over the same array to improve temporal locality
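Loop interchange is easiest to see side by side. The sketch below only illustrates the access pattern — in pure Python the timing effect is largely hidden by interpreter overhead, but in C/Fortran-style row-major arrays the first version walks memory sequentially while the second strides by a whole row per step:

```python
N = 4
A = [[r * N + c for c in range(N)] for r in range(N)]  # row-major 2D array

def sum_row_major(A):
    """Good locality: the inner loop touches consecutive elements,
    so each fetched cache block is fully used before moving on."""
    total = 0
    for r in range(len(A)):
        for c in range(len(A[0])):
            total += A[r][c]
    return total

def sum_col_major(A):
    """Poor locality: the inner loop jumps a full row each step,
    touching one element per block before it may be evicted."""
    total = 0
    for c in range(len(A[0])):
        for r in range(len(A)):
            total += A[r][c]
    return total

print(sum_row_major(A), sum_col_major(A))  # same sum, different access order
```

Interchange swaps the loop nesting so the innermost loop matches the storage order — same result, drastically fewer misses on large arrays.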
Key Points
- Victim cache: small fully-assoc cache for recently evicted blocks
- Multi-level caches reduce effective miss latency
- Critical word first: send needed word to CPU immediately
- MSHRs enable non-blocking caches (continue on hit while miss outstanding)
- Software: loop interchange (spatial), loop fusion (temporal)
Exam Tip
Know what MSHRs do and why non-blocking caches matter. Also understand victim caches — they catch conflict misses with minimal hardware.