Cache Design
Memory hierarchy, cache organization, and optimization
SRAM vs DRAM
Two fundamental memory technologies used in modern systems:
| Property | SRAM | DRAM |
|---|---|---|
| Cell structure | 6 transistors (6T cell) | 1 transistor + 1 capacitor (1T1C) |
| Speed | Fast (~1-2 ns) | Slower (~50-100 ns) |
| Density | Lower (6T per bit) | Higher (1T1C per bit) |
| Refresh needed? | No (static latch) | Yes (capacitor leaks charge) |
| Read type | Non-destructive | Destructive (must write-back) |
| Cost | Expensive | Cheap per bit |
| Use | Caches (L1, L2, L3) | Main memory |
DRAM destructive reads: Reading a DRAM cell discharges the capacitor. After every read, the sense amplifier must write back the data to restore the charge. This adds latency.
Key Points
- SRAM: 6T cell, fast, no refresh, lower density, used for caches
- DRAM: 1T1C cell, destructive reads, needs refresh every ~64 ms, higher density, used for main memory
- SRAM is ~10-100x faster than DRAM but ~10x more expensive per bit
- DRAM reads are destructive — must write-back after each read
Exam Tip
Know why DRAM needs refresh (capacitor leaks) and why reads are destructive (charge sharing). These are fundamental concepts that underpin memory hierarchy design.
Memory Hierarchy & Locality
The memory hierarchy exploits the trade-off between speed, size, and cost:
- Registers: sub-nanosecond, bytes, in the CPU core
- L1 Cache: ~1 ns, ~32 KB, per-core
- L2 Cache: ~3-10 ns, 256 KB - 1 MB, per-core or shared
- L3 Cache: ~10-30 ns, several MB, shared across cores
- Main Memory (DRAM): ~50-100 ns, GBs
- Disk/SSD: ~10 ms (HDD) / ~100 µs (SSD), TBs
Why it works — Locality:
- Temporal locality: Recently accessed data is likely to be accessed again soon (e.g., loop variables, hot data structures)
- Spatial locality: Data near recently accessed data is likely to be accessed soon (e.g., array traversals, sequential instruction fetch)
Caches exploit temporal locality by keeping recently-used data close to the processor. They exploit spatial locality by fetching entire cache blocks (typically 64 bytes) at a time.
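The payoff from block-based fetching can be made concrete with a small back-of-the-envelope sketch. The block and element sizes below are assumptions for illustration (64 B blocks, 8 B elements, as in the text above):

```python
# Sketch: why 64-byte blocks help a sequential scan (spatial locality).
# Assumption: 8-byte elements, so one 64 B block holds 8 consecutive
# elements; a sequential scan misses once per block, then hits 7 times.

BLOCK_BYTES = 64
ELEM_BYTES = 8
ELEMS_PER_BLOCK = BLOCK_BYTES // ELEM_BYTES  # 8 elements per block

def sequential_miss_rate(n_elems):
    """Fraction of accesses that miss when scanning n_elems in order,
    assuming each block is fetched once and then fully reused."""
    misses = (n_elems + ELEMS_PER_BLOCK - 1) // ELEMS_PER_BLOCK
    return misses / n_elems

print(sequential_miss_rate(1024))  # 128 misses / 1024 accesses = 0.125
```

With these assumptions, a pure sequential scan already achieves an 87.5% hit rate from spatial locality alone, before any temporal reuse.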
Key Points
- Hierarchy: Registers (sub-ns) → L1 (~1 ns, 32 KB) → L2 (~10 ns, 512 KB) → L3 → DRAM (~100 ns, GBs) → Disk (~10 ms)
- Temporal locality: recently used → likely used again soon
- Spatial locality: nearby data → likely accessed soon
- Cache blocks (64 B typical) exploit spatial locality
Cache Basics: Organization & Addressing
A cache consists of:
- Data store: holds the actual cache blocks (lines)
- Tag store: holds metadata for each block (tag bits, valid bit, dirty bit, replacement info)
Address decomposition (for a direct-mapped cache):
| Tag (upper bits) | Index (middle bits) | Block Offset (lower bits) |
- Block Offset: log2(block size) bits — selects byte within the block
- Index: log2(number of sets) bits — selects which set to look in
- Tag: remaining upper bits — identifies which block from memory is stored
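The field widths follow directly from those three rules. A minimal sketch of the calculation (the 32-bit address width and the example parameters are assumptions; an `assoc` parameter is included so the same helper covers the set-associative case discussed later):

```python
import math

def cache_fields(cache_bytes, block_bytes, assoc=1, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a cache.
    assoc=1 gives the direct-mapped case."""
    num_sets = cache_bytes // (block_bytes * assoc)
    offset_bits = int(math.log2(block_bytes))   # byte within block
    index_bits = int(math.log2(num_sets))       # which set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 32 KB direct-mapped cache, 64 B blocks, 32-bit addresses:
print(cache_fields(32 * 1024, 64, 1))  # (17, 9, 6): 512 sets
# Same cache made 4-way: fewer sets, so fewer index bits, more tag bits:
print(cache_fields(32 * 1024, 64, 4))  # (19, 7, 6): 128 sets
```

Note how increasing associativity shifts bits from the index field into the tag field while the offset stays fixed.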
Terminology:
- Hit: requested data found in cache → fast access
- Miss: data not in cache → must fetch from next level
- Hit rate: fraction of accesses that hit in cache
- Miss rate: 1 - hit rate
Key Points
- Cache has tag store + data store
- Address split: Tag | Index | Block Offset
- Offset bits = log2(block size), Index bits = log2(num sets)
- Tag uniquely identifies which memory block is cached in that set
Exam Tip
Given cache size, block size, and associativity, be able to calculate number of sets, and the bit widths of tag/index/offset fields. This is a very common calculation question.
Direct-Mapped Cache
In a direct-mapped cache, each memory block maps to exactly one cache set (1-way associative).
- Number of sets = Cache size / Block size
- Set index = (Block address) mod (Number of sets)
- On access: check the tag at the indexed set. If tag matches and valid bit is set → hit
- On miss: the existing block at that index is evicted (no choice — only one place)
Advantages: Simple, fast lookup (single comparator), low power.
Disadvantages: High conflict miss rate — two blocks that map to the same index will keep evicting each other (ping-pong / thrashing).
Example: If blocks A and B both map to set 5, accessing A, B, A, B causes 100% miss rate even if the cache has room elsewhere.
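The ping-pong example can be reproduced with a toy simulator that tracks only block addresses (the 8-set size is an assumption for illustration):

```python
def direct_mapped_trace(block_addrs, num_sets=8):
    """Tiny direct-mapped simulator over block addresses; returns hit count.
    Each set holds exactly one tag, so a mismatch always evicts."""
    resident = {}  # set index -> tag currently stored there
    hits = 0
    for addr in block_addrs:
        idx, tag = addr % num_sets, addr // num_sets
        if resident.get(idx) == tag:
            hits += 1
        else:
            resident[idx] = tag  # no choice of victim: evict and replace
    return hits

# A = block 5, B = block 13: both map to set 5 (13 mod 8 = 5)
print(direct_mapped_trace([5, 13, 5, 13, 5, 13]))  # 0 — every access misses
```

All seven other sets sit empty while A and B evict each other, which is exactly the conflict-miss pathology.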
Key Points
- Each block maps to exactly one set — 1 comparator needed
- Simple and fast, but susceptible to conflict misses
- Thrashing: two blocks mapping to same set evict each other repeatedly
- Number of sets = Cache size / Block size
Set-Associative Cache
An N-way set-associative cache has N blocks (ways) per set. A memory block can go in any of the N ways within its mapped set.
- Number of sets = Cache size / (Block size x Associativity)
- On access: index selects the set, then N parallel comparators check all tags
- On miss: one of the N blocks in the set must be evicted (replacement policy decides which)
Spectrum:
- 1-way = direct-mapped
- N-way = set-associative (typical: 2, 4, 8, 16 ways)
- If #sets = 1, it's fully associative (any block can go anywhere)
Trade-off: More associativity → fewer conflict misses, but more comparators, higher latency, more power.
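Extending the toy simulator to N ways (with LRU replacement, covered in a later section) shows how associativity fixes the ping-pong trace. The set counts are chosen so both configurations hold the same total number of blocks:

```python
from collections import OrderedDict

def set_assoc_hits(block_addrs, num_sets, ways):
    """N-way set-associative simulator with per-set LRU replacement.
    Each set is an OrderedDict of resident tags, oldest first."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in block_addrs:
        s, tag = sets[addr % num_sets], addr // num_sets
        if tag in s:
            hits += 1
            s.move_to_end(tag)         # mark as most recently used
        else:
            if len(s) == ways:
                s.popitem(last=False)  # evict the LRU way
            s[tag] = True
    return hits

trace = [5, 13, 5, 13, 5, 13]       # the ping-pong trace from above
print(set_assoc_hits(trace, 8, 1))  # 0 — direct-mapped (8 blocks) thrashes
print(set_assoc_hits(trace, 4, 2))  # 4 — 2-way (also 8 blocks) holds both
```

Same capacity, same trace: two extra comparators per set turn a 100% miss rate into hits on every access after the two compulsory misses.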
Key Points
- N ways per set — N parallel tag comparisons needed
- Reduces conflict misses compared to direct-mapped
- More associativity = fewer conflicts but slower/more expensive
- #sets = CacheSize / (BlockSize x Associativity)
AMAT and Performance Metrics
Average Memory Access Time (AMAT) is the key metric for cache performance:
AMAT = Hit_Rate × Hit_Latency + Miss_Rate × Miss_Latency
Or equivalently:
AMAT = Hit_Latency + Miss_Rate × Miss_Penalty
where Miss_Penalty = Miss_Latency - Hit_Latency (time to service the miss).
For multi-level caches:
AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)
Local vs. Global miss rate:
- Local miss rate: misses at this level / accesses to this level
- Global miss rate: misses at this level / total CPU memory accesses
- Global L2 miss rate = L1 miss rate × L2 local miss rate
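A worked two-level example helps here; all the latencies and miss rates below are assumed numbers for illustration, not figures from any particular machine:

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    """Two-level AMAT in ns. NOTE: l2_miss_rate is the LOCAL L2 miss rate
    (misses at L2 / accesses reaching L2)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_penalty)

# Assumed: 1 ns L1 hit, 10% L1 miss rate, 10 ns L2 hit,
# 20% local L2 miss rate, 100 ns penalty to reach DRAM.
print(round(amat(1, 0.10, 10, 0.20, 100), 2))  # 1 + 0.1*(10 + 0.2*100) = 4.0

# Global L2 miss rate = L1 miss rate x local L2 miss rate:
print(round(0.10 * 0.20, 3))  # 0.02 — only 2% of all accesses reach DRAM
```

The mental check: even with a 20% local L2 miss rate, the hierarchy hides DRAM latency well, because only the L1-miss fraction ever pays the L2 term at all.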
Key Points
- AMAT = Hit_Time + Miss_Rate × Miss_Penalty
- Multi-level: AMAT = L1_hit + L1_miss_rate × (L2_hit + L2_miss_rate × L2_penalty)
- Local miss rate = misses / accesses to that level
- Global miss rate = misses / total CPU accesses
Exam Tip
AMAT calculations are bread and butter for cache exam questions. Practice the multi-level version. Watch for whether the question gives local or global miss rates.
Three C's of Cache Misses
All cache misses can be classified into three categories:
- Compulsory (Cold) Misses: First access to a block that has never been in the cache. Unavoidable unless you prefetch. These occur even in an infinite cache.
- Capacity Misses: The working set is larger than the cache. Even a fully-associative cache of the same size would miss. Occurs because the cache simply cannot hold all needed data.
- Conflict Misses: Multiple blocks map to the same set, causing evictions even though the cache has unused space in other sets. Only occur in direct-mapped and set-associative caches (NOT in fully-associative).
How to identify miss type:
- Simulate with infinite cache → remaining misses are compulsory
- Simulate with same-size fully-associative cache → additional misses beyond compulsory are capacity
- Simulate with actual associativity → additional misses beyond capacity are conflict
Key Points
- Compulsory: first-ever access to block (infinite cache still misses)
- Capacity: working set > cache size (fully-assoc same-size cache still misses)
- Conflict: blocks mapping to same set evict each other (only in limited associativity)
- Reducing: compulsory (prefetching), capacity (bigger cache), conflict (more associativity)
Exam Tip
A common exam question: given an access trace, classify each miss as compulsory, capacity, or conflict. Remember: compare against infinite cache, then fully-associative same-size, then actual.
Replacement Policies
When a miss occurs in a set-associative cache and the set is full, a replacement policy decides which existing block to evict.
- LRU (Least Recently Used): Evict the block that hasn't been accessed for the longest time. Optimal for many workloads but expensive to implement (need to track access order of all ways).
- Approximate LRU — Not-MRU: Don't track full order; just track which block was Most Recently Used and never evict it. Randomly pick from the rest. Much cheaper hardware.
- Hierarchical LRU: Divide ways into groups, do LRU within groups, approximate LRU across groups.
- Random: Pick a random victim. Surprisingly competitive with LRU for high associativity. Very simple hardware.
Set thrashing: When the working set mapping to a particular set exceeds the associativity, even LRU performs poorly — it keeps evicting blocks that will be needed soon. The worst case is a cyclic access pattern over A+1 distinct blocks in an A-way set: LRU always evicts exactly the block that will be requested next, so every access misses. In this regime even a random policy can outperform LRU.
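The LRU worst case is easy to reproduce for a single set. A minimal sketch (the 4-way set and the 5-block cyclic trace are assumptions chosen to trigger the pathology; the random policy typically retains some blocks on this trace, though its hit count varies run to run):

```python
from collections import OrderedDict
import random

def run_set(trace, ways, policy):
    """Simulate one cache set under LRU or random replacement;
    return the number of hits."""
    s, hits = OrderedDict(), 0
    for tag in trace:
        if tag in s:
            hits += 1
            s.move_to_end(tag)                 # most recently used
        else:
            if len(s) == ways:
                if policy == "lru":
                    s.popitem(last=False)      # evict least recently used
                else:
                    del s[random.choice(list(s))]  # evict a random way
            s[tag] = True
    return hits

trace = [0, 1, 2, 3, 4] * 20                # 5 blocks cycling in a 4-way set
print(run_set(trace, 4, "lru"))             # 0 — LRU evicts the next block needed
print(run_set(trace, 4, "random"))          # usually well above 0
```

This is the intuition behind the "random is surprisingly competitive" observation: random breaks the adversarial alignment between the eviction choice and the cyclic access pattern.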
Key Points
- LRU: evict least recently used — best but most expensive (track full order)
- Not-MRU: protect most recently used, evict randomly from rest — cheap
- Random: surprisingly good at high associativity, trivial hardware
- Set thrashing: when working set in a set > associativity, performance degrades
Write Policies
Two orthogonal decisions for handling writes:
Write Hit Policy:
- Write-back: Write only to cache. Mark block as dirty. Write to memory only on eviction. Reduces memory traffic but requires dirty bits and write-back on eviction.
- Write-through: Write to both cache and memory simultaneously. Simpler, memory always up-to-date, but more memory traffic. Often uses a write buffer to avoid stalling.
Write Miss Policy:
- Write-allocate: On write miss, fetch the block into cache, then write to it. Makes sense with write-back (exploit temporal locality on future writes).
- No-write-allocate (write-around): On write miss, write directly to next level, don't bring block into cache. Makes sense with write-through.
Common pairings:
- Write-back + Write-allocate (most common in modern caches)
- Write-through + No-write-allocate
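The write-back + write-allocate bookkeeping can be sketched for a single block slot, just to show when the dirty bit actually forces traffic to the next level (the `Block` class and `write` helper are hypothetical names for illustration):

```python
class Block:
    """One cache block slot: valid bit, dirty bit, stored tag."""
    def __init__(self):
        self.valid, self.dirty, self.tag = False, False, None

def write(block, tag, writebacks):
    """Handle a write in a write-back, write-allocate cache (one-way set).
    `writebacks` records tags flushed to the next level."""
    if not (block.valid and block.tag == tag):   # write miss
        if block.valid and block.dirty:
            writebacks.append(block.tag)         # flush dirty victim first
        block.valid, block.tag = True, tag       # allocate: fetch block in
    block.dirty = True                           # the write touches cache only

wb = []
blk = Block()
write(blk, tag=1, writebacks=wb)  # miss: allocate block 1, mark dirty
write(blk, tag=1, writebacks=wb)  # hit: no memory traffic at all
write(blk, tag=2, writebacks=wb)  # miss: dirty block 1 is written back first
print(wb)  # [1]
```

Note that the second write generated zero memory traffic — that batching of repeated writes is exactly why write-back + write-allocate saves bandwidth.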
Key Points
- Write-back: write to cache only, mark dirty, write to memory on eviction
- Write-through: write to cache + memory, simpler but more traffic
- Write-allocate: fetch block into cache on write miss
- No-write-allocate: write directly to memory, skip cache
- Common pairing: write-back + write-allocate
Exam Tip
Know the common pairings and why they make sense. Write-back + write-allocate reduces memory bandwidth by batching writes.
Improving Cache Performance
Using the AMAT formula, we can improve performance by reducing miss rate, reducing miss latency, or reducing hit time.
Reducing Miss Rate:
- Increase associativity → reduces conflict misses
- Increase cache size → reduces capacity misses
- Victim cache: Small fully-associative cache between L1 and L2 that stores recently evicted blocks. Catches conflict misses cheaply.
- Software optimization: loop interchange, loop fusion, tiling/blocking
Reducing Miss Latency:
- Multi-level caches: L1 fast but small, L2 larger with moderate latency → catches many L1 misses without going to DRAM
- Critical word first: Request the needed word first from memory, send it to processor immediately while filling rest of block
- MSHRs (Miss Status Holding Registers): Track outstanding cache misses. Enable non-blocking caches — the cache can continue servicing hits (and even other misses) while a miss is being resolved. Without MSHRs, a miss stalls all subsequent accesses.
Software optimizations:
- Loop interchange: Change loop nesting order to access arrays in row-major order (better spatial locality)
- Loop fusion: Merge loops over the same array to improve temporal locality
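Loop interchange is easiest to see side by side. The sketch below only illustrates the access pattern — in pure Python the timing effect is largely hidden by interpreter overhead, but in C/Fortran-style row-major arrays the first version walks memory sequentially while the second strides by a whole row per step:

```python
N = 4
A = [[r * N + c for c in range(N)] for r in range(N)]  # row-major 2D array

def sum_row_major(A):
    """Good locality: the inner loop touches consecutive elements,
    so each fetched cache block is fully used before moving on."""
    total = 0
    for r in range(len(A)):
        for c in range(len(A[0])):
            total += A[r][c]
    return total

def sum_col_major(A):
    """Poor locality: the inner loop jumps a full row each step,
    touching one element per block before it may be evicted."""
    total = 0
    for c in range(len(A[0])):
        for r in range(len(A)):
            total += A[r][c]
    return total

print(sum_row_major(A), sum_col_major(A))  # same sum, different access order
```

Interchange swaps the loop nesting so the innermost loop matches the storage order — same result, drastically fewer misses on large arrays.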
Key Points
- Victim cache: small fully-assoc cache for recently evicted blocks
- Multi-level caches reduce effective miss latency
- Critical word first: send needed word to CPU immediately
- MSHRs enable non-blocking caches (continue on hit while miss outstanding)
- Software: loop interchange (spatial), loop fusion (temporal)
Exam Tip
Know what MSHRs do and why non-blocking caches matter. Also understand victim caches — they catch conflict misses with minimal hardware.