
SIMD, Multicore & Accelerators

Lectures 9–10 · Pre-Midterm

Flynn's taxonomy, vectorization, systolic arrays, TPU

Topics Covered

  • Flynn's taxonomy (SISD/SIMD/MISD/MIMD)
  • SIMD/vector processing
  • Vector chaining
  • Memory banking
  • Amdahl's law
  • Multi-core processors
  • Systolic arrays
  • TPU architecture
01

Flynn's Taxonomy

Flynn's taxonomy classifies architectures by the number of instruction and data streams:

                        Single Data                       Multiple Data
  Single Instruction    SISD: traditional uniprocessor    SIMD: array/vector processors;
                        (von Neumann)                     same op on multiple data elements
  Multiple Instruction  MISD: rare (systolic arrays       MIMD: multi-core processors; each
                        sometimes classified here)        core runs its own program on its own data
  • SISD: One instruction stream, one data stream. Classic single-core processor.
  • SIMD: One instruction controls many processing elements, each operating on different data. GPUs, vector extensions (AVX, SSE).
  • MISD: Multiple instructions on same data. Very rare; sometimes fault-tolerant systems.
  • MIMD: Multiple independent processors, each with own instruction and data. Multi-core CPUs, clusters.

Key Points

  • SISD: single core, single data stream (classic Von Neumann)
  • SIMD: one instruction, multiple data (vector/array processors, GPUs)
  • MISD: multiple instructions, single data (rare)
  • MIMD: multiple cores, each independent (multi-core, clusters)
02

SIMD / Vector Processing

Vector processing applies the same operation to an entire vector (array) of data elements with a single instruction.

Key components:

  • Vector registers: Each holds N data elements (e.g., 64 doubles)
  • Vector functional units: Pipelined units that process one element per cycle
  • Vector instructions: VLD (vector load), VST (vector store), VADD (vector add), VMUL, VSHFR (vector shift right), etc.
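These semantics can be sketched with a toy model in Python (a hypothetical illustration, not any real vector ISA): one VADD applies its operation to every element of the vector registers.

```python
# Toy model of a vector register machine (illustrative, not a real ISA).
VLEN = 8  # elements per vector register (assumption)

def vld(memory, base):
    """VLD: load VLEN consecutive elements into a vector register."""
    return memory[base:base + VLEN]

def vadd(v1, v2):
    """VADD: one instruction performs all VLEN element-wise additions."""
    return [a + b for a, b in zip(v1, v2)]

memory = list(range(16))   # 16 words of "memory"
v1 = vld(memory, 0)        # elements 0..7
v2 = vld(memory, 8)        # elements 8..15
v3 = vadd(v1, v2)          # single instruction, 8 additions
print(v3)                  # [8, 10, 12, 14, 16, 18, 20, 22]
```

The scalar equivalent would be a loop of 8 load/add/store iterations, each with its own fetch and decode.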

Advantages over SISD:

  • One instruction replaces an entire loop → fewer instruction fetches and decodes
  • Known regular access pattern → better memory/cache behavior
  • No control flow within vector op → no branch mispredictions
  • Parallelism at the data level, amortizing instruction overhead

Vector chaining: The output of one vector functional unit is forwarded directly as input to the next vector functional unit, without waiting for the entire vector to complete. Analogous to data forwarding in a scalar pipeline.

Example: VMUL V3, V1, V2 followed by VADD V5, V3, V4 — as each element of V3 is produced by VMUL, it's immediately consumed by VADD.
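The benefit of chaining can be quantified with a simple timing model (illustrative numbers assumed here: vector length 64, a 4-cycle pipeline-fill latency per functional unit, one element per cycle thereafter):

```python
# Timing model for two dependent vector ops (illustrative latencies, assumed).
VLEN = 64      # elements per vector register
STARTUP = 4    # pipeline-fill latency of each vector functional unit

# Without chaining: VADD waits for the entire VMUL result vector.
no_chain = (STARTUP + VLEN) + (STARTUP + VLEN)

# With chaining: VADD consumes each VMUL element as it is produced, so
# only the two pipeline fills plus one pass over the vector remain.
chain = STARTUP + STARTUP + VLEN

print(no_chain, chain)   # 136 72
```

Under this model chaining nearly halves the latency of the dependent pair.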

Key Points

  • Vector register holds N elements; one instruction operates on all
  • VLD, VST, VADD, VMUL, VSHFR — vector ISA instructions
  • Advantages: fewer fetches/decodes, regular memory access, no branches
  • Vector chaining: forward elements between vector FUs as they're produced
03

Memory Banking for Vector Access

Vector loads/stores need to supply one element per cycle to the vector unit. A single memory bank cannot sustain this bandwidth.

Solution: Multiple independent memory banks

  • Distribute consecutive addresses across different banks
  • If there are B banks, consecutive elements go to banks 0, 1, 2, ..., B-1, 0, 1, ...
  • Access to different banks can proceed in parallel
  • Need enough banks to hide the access latency of each bank

Bank conflict: If two accesses go to the same bank simultaneously, one must wait. Stride-N accesses with N being a multiple of B cause systematic bank conflicts.

Rule of thumb: Need at least as many banks as the memory access latency (in cycles) to sustain one element per cycle.
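A quick sketch of the interleaving (assuming the usual bank = address mod B mapping) shows why stride matters:

```python
# Which banks serve a strided vector access, assuming bank = address % B.
B = 8  # number of banks (assumption)

def banks_touched(base, stride, n):
    """Return the set of banks hit by n accesses at the given stride."""
    return {(base + i * stride) % B for i in range(n)}

# Unit stride: all 8 banks used -> accesses proceed in parallel.
print(sorted(banks_touched(0, 1, 8)))   # [0, 1, 2, 3, 4, 5, 6, 7]

# Stride 8 (a multiple of B): every access hits bank 0 -> full serialization.
print(sorted(banks_touched(0, 8, 8)))   # [0]
```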

Key Points

  • Multiple banks enable parallel memory accesses for vector loads
  • Consecutive elements interleaved across banks
  • Bank conflict: two accesses to same bank must serialize
  • Need #banks ≥ memory latency to sustain 1 element/cycle
04

Amdahl's Law

Amdahl's Law gives the theoretical maximum speedup of a program when only a fraction of it can be parallelized:

Speedup = 1 / ((1 - f) + f/S)

where:

  • f = fraction of execution time that can be sped up (parallelizable portion)
  • S = speedup factor for the parallelizable portion (e.g., number of cores)
  • (1-f) = serial (non-parallelizable) portion

Key implications:

  • Even with infinite speedup of the parallel portion (S → infinity), the maximum speedup is 1/(1-f)
  • If 10% of the program is serial (f=0.9), maximum speedup is 10x, no matter how many cores
  • Focus on the common case: To maximize overall speedup, the fraction f should be as large as possible AND S should be as large as possible
  • Diminishing returns: doubling cores from 100 to 200 helps far less than doubling from 1 to 2
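The formula and its inverse (solving for f given a target speedup, by rearranging Speedup = 1/((1-f) + f/S)) can be checked numerically:

```python
def amdahl_speedup(f, S):
    """Overall speedup when fraction f is sped up by factor S."""
    return 1.0 / ((1.0 - f) + f / S)

def required_f(target, S):
    """Solve Speedup = 1/((1-f) + f/S) for f, given a target speedup."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / S)

# f = 0.9 with an effectively infinite S: capped at 1/(1-f) = 10x.
print(round(amdahl_speedup(0.9, 1e12), 3))   # 10.0

# f = 0.9 on 10 cores: well below the cap.
print(round(amdahl_speedup(0.9, 10), 3))     # 5.263

# What fraction must be parallelizable to reach 5x on 8 cores?
print(round(required_f(5, 8), 3))            # 0.914
```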

Key Points

  • Speedup = 1/((1-f) + f/S) where f = parallelizable fraction, S = speedup of that fraction
  • Max speedup with infinite S: 1/(1-f) — limited by serial portion
  • Make the common case fast: maximize both f and S
  • Diminishing returns with more parallelism if serial portion exists
⚠

Exam Tip

Be able to calculate speedup given f and S. Also know how to solve for f given a desired speedup. Very commonly tested formula.

05

Multi-Core Processors

Multi-core = multiple independent processor cores on the same die.

Why multi-core? The power wall: increasing single-core frequency hit diminishing returns because dynamic power scales as P ∝ V²f, and V must rise with f, so power grows roughly cubically with frequency. Instead, use multiple simpler cores at lower frequency.

Advantages:

  • More power-efficient than a single fast core (2 cores at half freq ≈ same perf at 1/4 power)
  • Simpler individual cores → easier to design and verify
  • Higher aggregate throughput for parallel workloads
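The "two cores at half frequency" claim follows directly from the cubic model (assuming voltage scales linearly with frequency, so P ∝ f³):

```python
# Dynamic power P ∝ V^2 * f; assuming V scales linearly with f, P ∝ f^3.
def relative_power(freq_ratio, num_cores=1):
    """Power relative to one core at full frequency (cubic model, assumed)."""
    return num_cores * freq_ratio ** 3

one_fast = relative_power(1.0, 1)   # baseline: 1.0
two_half = relative_power(0.5, 2)   # two cores, each at half frequency
print(two_half / one_fast)          # 0.25 -> same aggregate throughput, 1/4 power
```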

Disadvantages:

  • Requires parallel software — single-threaded programs don't benefit
  • Shared resources (caches, memory, interconnect) cause interference between cores
  • Single-thread performance may decrease due to shared resource contention
  • Programming complexity: synchronization, deadlocks, data races

Key Points

  • Multiple cores on one die — motivated by power wall
  • More power-efficient: 2 cores at half freq ≈ 1/4 power of one fast core
  • Requires parallel software — Amdahl's law limits speedup
  • Shared resources cause interference between applications
06

Systolic Arrays & TPU

A systolic array is a grid of processing elements (PEs) where data flows rhythmically through the array (like blood through the heart — hence "systolic").

Key properties:

  • Each PE performs a simple operation (typically multiply-accumulate)
  • Data flows between neighboring PEs in a regular pattern
  • High throughput with minimal memory bandwidth — data is reused as it flows
  • Very regular structure — easy to layout in silicon

Convolution example (W1 design):

  • Weights stay fixed in the PEs (weight-stationary)
  • Input data (x values) flow from left to right
  • Output data (y values) flow from right to left (opposite direction)
  • Each PE computes: y_out = y_in + weight × x_in and passes x to the right
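A functional model of this design in Python (the cycle-by-cycle timing of the bidirectional flow is omitted; only the per-PE arithmetic is shown): each PE holds one stationary weight and applies y_out = y_in + weight × x_in as the partial sum passes through the chain.

```python
# Functional model of the W1 weight-stationary convolution design.
# PE j holds weights[j]; each PE computes y_out = y_in + weight * x_in.
# (Timing of the actual left/right data flow is not modeled here.)
def systolic_conv(weights, xs):
    K = len(weights)
    ys = []
    for i in range(len(xs) - K + 1):
        y = 0                               # partial sum entering the chain
        for j, w in enumerate(weights):     # one step per PE
            y = y + w * xs[i + j]           # PE: y_out = y_in + weight * x_in
        ys.append(y)
    return ys

print(systolic_conv([1, 2, 3], [1, 2, 0, 3]))   # [5, 11]
```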

Matrix Multiplication: Systolic arrays naturally compute matrix multiplication — one matrix flows through rows, the other through columns, results accumulate in each PE.

Google TPU (Tensor Processing Unit):

  • Contains a 256 × 256 systolic array of 8-bit multiply-accumulate (MAC) units
  • Total: 256 × 256 = 65,536 MACs
  • Designed for neural network inference (matrix multiplications are the core operation)
  • Weight-stationary dataflow: weights loaded into PEs, activations flow through
  • Achieves very high throughput for matrix operations with minimal memory bandwidth
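The weight-stationary matmul dataflow can also be sketched functionally (a tiny illustrative model; the skewed timing of a real systolic array is omitted): weights sit in the PE grid, activations stream across rows, and partial sums accumulate down each column.

```python
# Functional model of a weight-stationary systolic matmul (TPU-style;
# the skewed cycle-by-cycle timing of the real array is not modeled).
def systolic_matmul(A, W):
    n, k, m = len(A), len(W), len(W[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):            # each activation row streams through
        for j in range(m):        # each column of PEs
            acc = 0               # partial sum entering the column
            for p in range(k):    # PE (p, j) holds stationary weight W[p][j]
                acc += A[i][p] * W[p][j]   # MAC: acc += activation * weight
            out[i][j] = acc       # result exits the bottom of column j
    return out

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
print(systolic_matmul(A, W))   # [[19, 22], [43, 50]]
```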

Key Points

  • Systolic array: grid of PEs, data flows rhythmically through the array
  • W1 convolution: weights stay, x flows right, y flows left
  • Naturally computes matrix multiplication and convolution
  • Google TPU: 256×256 systolic array, 65,536 8-bit MACs
  • Weight-stationary: weights preloaded, activations flow through
⚠

Exam Tip

Know the W1 systolic array design for convolution: which data stays, which moves, and in what direction. Also know TPU specs: 256x256, 8-bit MACs, 65536 units. Be able to trace through a small systolic array computation.
