Memory Systems
DRAM organization, timing, and scheduling
DRAM Subsystem Organization
The DRAM subsystem has a deep hierarchy:
- Channel: Independent memory controller connection with its own data bus (e.g., 64-bit wide). Multiple channels multiply bandwidth.
- DIMM (Dual Inline Memory Module): Physical stick of memory containing multiple chips. Multiple DIMMs can share a channel.
- Rank: A group of DRAM chips that operate in lockstep on the same DIMM. A 64-bit channel uses 8 chips per rank (each chip contributes 8 bits).
- Chip: Individual DRAM chip containing multiple banks.
- Bank: Independent array within a chip with its own row buffer (sense amplifiers). Banks can be accessed concurrently.
- Row/Column: Within a bank, data is organized as a 2D array. A row is activated into the row buffer; columns are read from the row buffer.
Key hierarchy: Channel > DIMM > Rank > Chip > Bank > Row/Column
Transferring a 64-byte cache block: The block is striped across the rank, so each of the 8 chips contributes 8 bytes. The 64-bit bus carries 8 bytes per transfer, so filling a 64B cache line takes a burst of 8 transfers.
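The striping arithmetic above can be checked directly. A minimal sketch, assuming the geometry used in these notes (64-bit channel, x8 chips, 64B cache lines):

```python
# How a 64B cache line is striped across one rank.
# Assumed geometry: 64-bit channel, 8-bit (x8) DRAM chips.
CHANNEL_WIDTH_BITS = 64
CHIP_WIDTH_BITS = 8            # x8 chips
CACHE_LINE_BYTES = 64

chips_per_rank = CHANNEL_WIDTH_BITS // CHIP_WIDTH_BITS       # 8 chips in lockstep
bytes_per_transfer = CHANNEL_WIDTH_BITS // 8                 # 8 bytes per bus beat
transfers_per_line = CACHE_LINE_BYTES // bytes_per_transfer  # burst of 8 transfers
bytes_per_chip = CACHE_LINE_BYTES // chips_per_rank          # 8 bytes from each chip

print(chips_per_rank, transfers_per_line, bytes_per_chip)    # 8 8 8
```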
Key Points
- Hierarchy: Channel > DIMM > Rank > Chip > Bank > Row/Column
- Channel = independent data bus (64-bit wide)
- Rank = group of chips operating in lockstep (8 chips x 8 bits = 64 bits)
- Bank = independent array with own row buffer, enables concurrency
- 64B cache line = 8 chips each contributing 8 bytes across the rank
Exam Tip
The hierarchy is crucial. Know how a cache block is spread across chips in a rank and why multiple banks/channels improve bandwidth.
Page Mode DRAM & Row Buffer
DRAM banks use a row buffer (implemented by sense amplifiers) to hold an entire activated row.
Three DRAM commands:
- ACTIVATE (ACT / RAS): Opens a row — copies the entire row from the DRAM array into the row buffer. Destructive to the array (must write back eventually). Takes t_RAS time.
- READ/WRITE (CAS): Accesses a specific column within the open row buffer. Much faster than a full row activation. Takes t_CAS time.
- PRECHARGE (PRE): Closes the current row — writes the row buffer contents back to the array and prepares the bank for a new row activation. Takes t_RP time.
Row buffer states:
- Row buffer hit: Requested row is already open in the row buffer → just CAS → fastest (t_CAS)
- Row buffer miss (closed bank): No row is open → ACT + CAS (t_RAS + t_CAS)
- Row buffer conflict: A different row is open → PRE + ACT + CAS → slowest (t_RP + t_RAS + t_CAS)
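The three states can be turned into a small latency calculator. This is a sketch using the t_CAS/t_RAS/t_RP names from these notes; the nanosecond values are illustrative, not from a real datasheet:

```python
# Per-access latency for one DRAM bank, by row buffer state.
# Timing values (ns) are illustrative placeholders.
t_CAS, t_RAS, t_RP = 15, 35, 15

def access_latency(open_row, requested_row):
    """Return (new_open_row, latency_ns) for one access."""
    if open_row == requested_row:        # row buffer hit: CAS only
        return requested_row, t_CAS
    if open_row is None:                 # closed bank: ACT + CAS
        return requested_row, t_RAS + t_CAS
    # conflict: PRE the wrong row, then ACT + CAS
    return requested_row, t_RP + t_RAS + t_CAS

open_row, total = None, 0
for row in [5, 5, 7, 5]:                 # miss, hit, conflict, conflict
    open_row, lat = access_latency(open_row, row)
    total += lat
print(total)                             # 50 + 15 + 65 + 65 = 195
```

Tracing a short access pattern like this is exactly the kind of calculation exam questions tend to ask for.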
Key Points
- Row buffer = sense amplifiers holding one activated row
- ACT opens a row (t_RAS), CAS reads/writes column (t_CAS), PRE closes row (t_RP)
- Row hit: CAS only (fast)
- Row miss (closed): ACT + CAS (medium)
- Row conflict (wrong row open): PRE + ACT + CAS (slow)
Exam Tip
Distinguish the three row buffer states clearly. Many exam questions give access patterns and ask for the total latency based on hit/miss/conflict.
DRAM Refresh
DRAM capacitors leak charge and must be refreshed periodically (typically every 64 ms for the entire memory).
Refresh strategies:
- Burst refresh: Refresh all rows back-to-back. Memory is unavailable for a long burst. Simpler but causes long stalls.
- Distributed refresh: Spread refresh commands throughout the 64ms window. Each refresh command handles a few rows. Shorter individual stalls but more frequent interruptions.
Impact: During refresh, the bank being refreshed cannot service memory requests → performance and energy overhead. As DRAM capacity grows, more rows need refreshing → refresh overhead increases.
Refresh penalty grows with capacity: More rows → more refresh commands → more time spent refreshing → less time available for actual accesses.
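A back-of-envelope estimate makes the overhead concrete. The command count and per-refresh blocking time below are assumed example values, not figures from the notes:

```python
# Rough refresh overhead under distributed refresh.
# Assumed: 8192 refresh commands per 64 ms window, each blocking
# the bank for t_RFC; both numbers are illustrative.
REFRESH_WINDOW_NS = 64_000_000   # 64 ms retention window
refresh_commands = 8192
t_RFC_ns = 350                   # time one refresh command occupies the bank

busy_ns = refresh_commands * t_RFC_ns
overhead = busy_ns / REFRESH_WINDOW_NS
print(f"{overhead:.2%}")         # ~4.48% of bank time spent refreshing
```

Doubling the row count (larger capacity) doubles either the command count or the per-command work, which is why the overhead grows with density.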
Key Points
- DRAM must be refreshed every ~64ms (capacitor charge leaks)
- Burst refresh: all at once (long stall), distributed: spread out (shorter stalls)
- Refresh blocks the bank — no accesses during refresh
- Refresh overhead grows with DRAM capacity (more rows to refresh)
Multiple Banks & Channels
Multiple banks within a chip/rank enable bank-level parallelism:
- Each bank has its own row buffer and operates independently
- While one bank is activating a row, another can be servicing a CAS
- Interleaving accesses across banks hides latency
Multiple channels provide additional bandwidth:
- Each channel has its own data bus, command bus, and memory controller port
- Channels operate completely independently
- Two channels → 2x peak bandwidth
Address mapping/interleaving: How physical addresses are mapped to channels, ranks, banks, rows, and columns significantly affects performance.
- Using lower address bits for bank/channel selection provides better interleaving because lower bits have more entropy (vary more across consecutive accesses)
- Row interleaving: consecutive cache blocks go to different banks → maximizes bank-level parallelism for sequential access patterns
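Address mapping can be sketched as slicing bit fields out of the physical address. The field widths below are made up for illustration (they are not any real controller's layout); the point is that the low-order bits just above the cache-line offset select channel and bank, so consecutive lines interleave:

```python
# Illustrative physical-address decode; field widths are assumptions.
FIELDS = [          # (name, width in bits), listed LSB first
    ("offset", 6),  # 64B cache line offset
    ("channel", 1), # 2 channels
    ("bank", 3),    # 8 banks
    ("column", 7),
    ("row", 15),
]

def decode(addr):
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (addr >> shift) & ((1 << width) - 1)
        shift += width
    return out

# Consecutive 64B lines alternate channels, then banks:
for line in range(4):
    d = decode(line * 64)
    print(line, d["channel"], d["bank"])
# line 0 -> ch 0 bank 0; line 1 -> ch 1 bank 0; line 2 -> ch 0 bank 1; ...
```

Mapping row bits (high, low-entropy) to channel/bank instead would send long sequential runs to one bank and serialize them.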
Key Points
- Multiple banks enable bank-level parallelism (concurrent operations)
- Multiple channels multiply bandwidth (independent data buses)
- Address interleaving: use lower bits for bank/channel selection (more entropy)
- Good interleaving maximizes parallelism across banks/channels
Exam Tip
Understand address interleaving — which bits map to which part of the hierarchy. Lower bits change more frequently, so mapping them to bank/channel spreads accesses evenly.
DRAM Scheduling Policies
The memory controller decides the order in which queued memory requests are serviced. This order significantly impacts performance.
FCFS (First-Come, First-Served):
- Service requests in arrival order
- Simple, fair, but ignores row buffer state
- May waste an open row: a pending row-hit request can sit behind an older row-conflict request that closes the row first
FR-FCFS (First-Ready, First-Come, First-Served):
- Priority 1: Row buffer hits first (they're fastest — CAS only)
- Priority 2: Among equal-priority requests, oldest first (FCFS)
- Maximizes row buffer hit rate → maximizes DRAM throughput
- But can be unfair — streaming accesses (many hits to same row) can starve other requests
Key insight: FR-FCFS prioritizes throughput over fairness. This can cause quality-of-service issues in multi-core/multi-application systems.
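The two policies can be compared on a toy single-bank trace. A minimal sketch, reusing the illustrative timing values from earlier (not a full controller model; no per-cycle arrival times):

```python
# FCFS vs FR-FCFS on one bank; timing values (ns) are illustrative.
t_CAS, t_RAS, t_RP = 15, 35, 15

def latency(open_row, row):
    if open_row == row:
        return t_CAS                  # row hit
    if open_row is None:
        return t_RAS + t_CAS          # closed bank
    return t_RP + t_RAS + t_CAS       # conflict

def run(queue, fr_fcfs):
    open_row, total = None, 0
    pending = list(queue)             # list order = arrival order
    while pending:
        if fr_fcfs:
            hits = [r for r in pending if r == open_row]
            req = hits[0] if hits else pending[0]  # oldest row hit, else oldest
        else:
            req = pending[0]          # strict arrival order
        pending.remove(req)           # removes the oldest matching entry
        total += latency(open_row, req)
        open_row = req
    return total

trace = [1, 2, 1, 2]                  # requests alternating between two rows
print(run(trace, fr_fcfs=False))      # FCFS:    50 + 65 + 65 + 65 = 245
print(run(trace, fr_fcfs=True))       # FR-FCFS: serves 1,1,2,2 -> 50+15+65+15 = 145
```

The reordering converts two conflicts into hits, which is the throughput win; a long run of hits to one row would keep winning priority, which is the starvation risk.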
Key Points
- FCFS: simple, fair, but misses row buffer optimization opportunities
- FR-FCFS: prioritize row hits → maximize throughput, then FCFS for ties
- FR-FCFS maximizes row buffer hit rate but can be unfair
- Memory scheduling directly impacts both throughput and fairness
Exam Tip
Know how to trace through a sequence of memory requests under both FCFS and FR-FCFS and calculate total latency. FR-FCFS reorders to exploit row buffer hits.
Memory Latency Components
The total memory access latency has several components:
- CPU to memory controller: On-chip interconnect latency
- Controller latency / queuing: Time spent waiting in the memory controller queue + scheduling decision time
- Controller to DRAM: Bus transfer of command (address/command bus)
- Bank access latency:
  - Row hit: t_CAS (Column Access Strobe)
  - Row miss (closed): t_RAS + t_CAS
  - Row conflict: t_RP + t_RAS + t_CAS (Precharge + Activate + CAS)
- DRAM to controller (data bus): Data transfer across the memory bus. For a 64B cache line at 64-bit bus width: 8 transfers (each 8 bytes)
- Controller to CPU: On-chip interconnect back to the processor
Total latency can easily be 100+ ns for a row conflict, vs. ~50 ns for a row buffer hit.
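Adding up the components for one access is straightforward. All values below are illustrative placeholders chosen to roughly match the figures quoted above, not measurements:

```python
# End-to-end latency for one 64B read that suffers a row conflict.
# Every value (ns) is an illustrative placeholder.
ns = dict(
    cpu_to_ctrl=5,    # on-chip interconnect to the memory controller
    queueing=10,      # controller queue wait + scheduling
    cmd_bus=2,        # command/address to DRAM
    bank=65,          # row conflict: t_RP + t_RAS + t_CAS = 15 + 35 + 15
    data_bus=8,       # 8 transfers of 8B on a 64-bit bus
    ctrl_to_cpu=5,    # interconnect back to the core
)
total = sum(ns.values())
print(total)          # 95 ns for this row-conflict example
```

Swapping the bank term for a row hit (t_CAS = 15) drops the same sum to 45 ns, which is why the hit/miss/conflict distinction dominates the total.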
Key Points
- Latency path: CPU→controller→DRAM command→bank access→data bus→controller→CPU
- Bank latency dominates: t_CAS (hit), t_RAS+t_CAS (miss), t_RP+t_RAS+t_CAS (conflict)
- Data transfer: 64B line = 8 transfers on 64-bit bus
- Total memory latency is typically 50-100+ ns