
SIMD, Multicore & Accelerators

Lectures 9–10 · Pre-Midterm

Flynn's taxonomy, vectorization, systolic arrays, TPU

Topics Covered

  • Flynn's taxonomy (SISD/SIMD/MISD/MIMD)
  • SIMD/vector processing
  • Vector chaining
  • Memory banking
  • Amdahl's law
  • Multi-core processors
  • Systolic arrays
  • TPU architecture
01

Flynn's Taxonomy

Flynn's taxonomy classifies architectures by the number of instruction and data streams:

                        Single Data                       Multiple Data
  Single Instruction    SISD: traditional uniprocessor    SIMD: array/vector processors;
                        (von Neumann)                     same op on multiple data elements
  Multiple Instruction  MISD: rare (systolic arrays       MIMD: multi-core processors; each
                        sometimes classified here)        core runs its own program on its own data
  • SISD: One instruction stream, one data stream. Classic single-core processor.
  • SIMD: One instruction controls many processing elements, each operating on different data. GPUs, vector extensions (AVX, SSE).
  • MISD: Multiple instructions on same data. Very rare; sometimes fault-tolerant systems.
  • MIMD: Multiple independent processors, each with own instruction and data. Multi-core CPUs, clusters.

Key Points

  • SISD: single core, single data stream (classic Von Neumann)
  • SIMD: one instruction, multiple data (vector/array processors, GPUs)
  • MISD: multiple instructions, single data (rare)
  • MIMD: multiple cores, each independent (multi-core, clusters)
02

SIMD / Vector Processing

Vector processing applies the same operation to an entire vector (array) of data elements with a single instruction.

Key components:

  • Vector registers: Each holds N data elements (e.g., 64 doubles)
  • Vector functional units: Pipelined units that process one element per cycle
  • Vector instructions: VLD (vector load), VST (vector store), VADD (vector add), VMUL, VSHFR (vector shift right), etc.
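These semantics can be sketched with a toy model in Python (a hypothetical illustration, not any real vector ISA): one VADD applies its operation to every element of the vector registers.

```python
# Toy model of a vector register machine (illustrative, not a real ISA).
VLEN = 8  # elements per vector register (assumption)

def vld(memory, base):
    """VLD: load VLEN consecutive elements into a vector register."""
    return memory[base:base + VLEN]

def vadd(v1, v2):
    """VADD: one instruction performs all VLEN element-wise additions."""
    return [a + b for a, b in zip(v1, v2)]

memory = list(range(16))   # 16 words of "memory"
v1 = vld(memory, 0)        # elements 0..7
v2 = vld(memory, 8)        # elements 8..15
v3 = vadd(v1, v2)          # single instruction, 8 additions
print(v3)                  # [8, 10, 12, 14, 16, 18, 20, 22]
```

The scalar equivalent would be a loop of 8 load/add/store iterations, each with its own fetch and decode.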

Advantages over SISD:

  • One instruction replaces an entire loop → fewer instruction fetches and decodes
  • Known regular access pattern → better memory/cache behavior
  • No control flow within vector op → no branch mispredictions
  • Parallelism at the data level, amortizing instruction overhead

Vector chaining: The output of one vector functional unit is forwarded directly as input to the next vector functional unit, without waiting for the entire vector to complete. Analogous to data forwarding in a scalar pipeline.

Example: VMUL V3, V1, V2 followed by VADD V5, V3, V4 — as each element of V3 is produced by VMUL, it's immediately consumed by VADD.
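The benefit of chaining can be quantified with a simple timing model (illustrative numbers assumed here: vector length 64, a 4-cycle pipeline-fill latency per functional unit, one element per cycle thereafter):

```python
# Timing model for two dependent vector ops (illustrative latencies, assumed).
VLEN = 64      # elements per vector register
STARTUP = 4    # pipeline-fill latency of each vector functional unit

# Without chaining: VADD waits for the entire VMUL result vector.
no_chain = (STARTUP + VLEN) + (STARTUP + VLEN)

# With chaining: VADD consumes each VMUL element as it is produced, so
# only the two pipeline fills plus one pass over the vector remain.
chain = STARTUP + STARTUP + VLEN

print(no_chain, chain)   # 136 72
```

Under this model chaining nearly halves the latency of the dependent pair.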

Key Points

  • Vector register holds N elements; one instruction operates on all
  • VLD, VST, VADD, VMUL, VSHFR — vector ISA instructions
  • Advantages: fewer fetches/decodes, regular memory access, no branches
  • Vector chaining: forward elements between vector FUs as they're produced
03

Memory Banking for Vector Access

Vector loads/stores need to supply one element per cycle to the vector unit. A single memory bank cannot sustain this bandwidth.

Solution: Multiple independent memory banks

  • Distribute consecutive addresses across different banks
  • If there are B banks, consecutive elements go to banks 0, 1, 2, ..., B-1, 0, 1, ...
  • Access to different banks can proceed in parallel
  • Need enough banks to hide the access latency of each bank

Bank conflict: If two accesses go to the same bank simultaneously, one must wait. Stride-N accesses with N being a multiple of B cause systematic bank conflicts.

Rule of thumb: Need at least as many banks as the memory access latency (in cycles) to sustain one element per cycle.
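A quick sketch of the interleaving (assuming the usual bank = address mod B mapping) shows why stride matters:

```python
# Which banks serve a strided vector access, assuming bank = address % B.
B = 8  # number of banks (assumption)

def banks_touched(base, stride, n):
    """Return the set of banks hit by n accesses at the given stride."""
    return {(base + i * stride) % B for i in range(n)}

# Unit stride: all 8 banks used -> accesses proceed in parallel.
print(sorted(banks_touched(0, 1, 8)))   # [0, 1, 2, 3, 4, 5, 6, 7]

# Stride 8 (a multiple of B): every access hits bank 0 -> full serialization.
print(sorted(banks_touched(0, 8, 8)))   # [0]
```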

Key Points

  • Multiple banks enable parallel memory accesses for vector loads
  • Consecutive elements interleaved across banks
  • Bank conflict: two accesses to same bank must serialize
  • Need #banks ≥ memory latency to sustain 1 element/cycle
04

Amdahl's Law

Amdahl's Law gives the theoretical maximum speedup of a program when only a fraction of it can be parallelized:

Speedup = 1 / ((1 - f) + f/S)

where:

  • f = fraction of execution time that can be sped up (parallelizable portion)
  • S = speedup factor for the parallelizable portion (e.g., number of cores)
  • (1-f) = serial (non-parallelizable) portion

Key implications:

  • Even with infinite speedup of the parallel portion (S → infinity), the maximum speedup is 1/(1-f)
  • If 10% of the program is serial (f=0.9), maximum speedup is 10x, no matter how many cores
  • Focus on the common case: To maximize overall speedup, the fraction f should be as large as possible AND S should be as large as possible
  • Diminishing returns: doubling cores from 100 to 200 helps far less than doubling from 1 to 2
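The formula and its inverse (solving for f given a target speedup, by rearranging Speedup = 1/((1-f) + f/S)) can be checked numerically:

```python
def amdahl_speedup(f, S):
    """Overall speedup when fraction f is sped up by factor S."""
    return 1.0 / ((1.0 - f) + f / S)

def required_f(target, S):
    """Solve Speedup = 1/((1-f) + f/S) for f, given a target speedup."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / S)

# f = 0.9 with an effectively infinite S: capped at 1/(1-f) = 10x.
print(round(amdahl_speedup(0.9, 1e12), 3))   # 10.0

# f = 0.9 on 10 cores: well below the cap.
print(round(amdahl_speedup(0.9, 10), 3))     # 5.263

# What fraction must be parallelizable to reach 5x on 8 cores?
print(round(required_f(5, 8), 3))            # 0.914
```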

Key Points

  • Speedup = 1/((1-f) + f/S) where f = parallelizable fraction, S = speedup of that fraction
  • Max speedup with infinite S: 1/(1-f) — limited by serial portion
  • Make the common case fast: maximize both f and S
  • Diminishing returns with more parallelism if serial portion exists
⚠

Exam Tip

Be able to calculate speedup given f and S. Also know how to solve for f given a desired speedup. Very commonly tested formula.

05

Multi-Core Processors

Multi-core = multiple independent processor cores on the same die.

Why multi-core? The power wall: increasing single-core frequency hit diminishing returns because dynamic power scales as P ∝ V²f, and V must rise with f, so power grows roughly cubically with frequency. Instead, use multiple simpler cores at lower frequency.

Advantages:

  • More power-efficient than a single fast core (2 cores at half freq ≈ same perf at 1/4 power)
  • Simpler individual cores → easier to design and verify
  • Higher aggregate throughput for parallel workloads
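The "two cores at half frequency" claim follows directly from the cubic model (assuming voltage scales linearly with frequency, so P ∝ f³):

```python
# Dynamic power P ∝ V^2 * f; assuming V scales linearly with f, P ∝ f^3.
def relative_power(freq_ratio, num_cores=1):
    """Power relative to one core at full frequency (cubic model, assumed)."""
    return num_cores * freq_ratio ** 3

one_fast = relative_power(1.0, 1)   # baseline: 1.0
two_half = relative_power(0.5, 2)   # two cores, each at half frequency
print(two_half / one_fast)          # 0.25 -> same aggregate throughput, 1/4 power
```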

Disadvantages:

  • Requires parallel software — single-threaded programs don't benefit
  • Shared resources (caches, memory, interconnect) cause interference between cores
  • Single-thread performance may decrease due to shared resource contention
  • Programming complexity: synchronization, deadlocks, data races

Key Points

  • Multiple cores on one die — motivated by power wall
  • More power-efficient: 2 cores at half freq ≈ 1/4 power of one fast core
  • Requires parallel software — Amdahl's law limits speedup
  • Shared resources cause interference between applications
06

Systolic Arrays & TPU

A systolic array is a grid of processing elements (PEs) where data flows rhythmically through the array (like blood through the heart — hence "systolic").

Key properties:

  • Each PE performs a simple operation (typically multiply-accumulate)
  • Data flows between neighboring PEs in a regular pattern
  • High throughput with minimal memory bandwidth — data is reused as it flows
  • Very regular structure — easy to layout in silicon

Convolution example (W1 design):

  • Weights stay fixed in the PEs (weight-stationary)
  • Input data (x values) flow from left to right
  • Output data (y values) flow from right to left (opposite direction)
  • Each PE computes: y_out = y_in + weight × x_in and passes x to the right
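A functional model of this design in Python (the cycle-by-cycle timing of the bidirectional flow is omitted; only the per-PE arithmetic is shown): each PE holds one stationary weight and applies y_out = y_in + weight × x_in as the partial sum passes through the chain.

```python
# Functional model of the W1 weight-stationary convolution design.
# PE j holds weights[j]; each PE computes y_out = y_in + weight * x_in.
# (Timing of the actual left/right data flow is not modeled here.)
def systolic_conv(weights, xs):
    K = len(weights)
    ys = []
    for i in range(len(xs) - K + 1):
        y = 0                               # partial sum entering the chain
        for j, w in enumerate(weights):     # one step per PE
            y = y + w * xs[i + j]           # PE: y_out = y_in + weight * x_in
        ys.append(y)
    return ys

print(systolic_conv([1, 2, 3], [1, 2, 0, 3]))   # [5, 11]
```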

Matrix Multiplication: Systolic arrays naturally compute matrix multiplication — one matrix flows through rows, the other through columns, results accumulate in each PE.

Google TPU (Tensor Processing Unit):

  • Contains a 256 × 256 systolic array of 8-bit multiply-accumulate (MAC) units
  • Total: 256 × 256 = 65,536 MACs
  • Designed for neural network inference (matrix multiplications are the core operation)
  • Weight-stationary dataflow: weights loaded into PEs, activations flow through
  • Achieves very high throughput for matrix operations with minimal memory bandwidth
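The weight-stationary matmul dataflow can also be sketched functionally (a tiny illustrative model; the skewed timing of a real systolic array is omitted): weights sit in the PE grid, activations stream across rows, and partial sums accumulate down each column.

```python
# Functional model of a weight-stationary systolic matmul (TPU-style;
# the skewed cycle-by-cycle timing of the real array is not modeled).
def systolic_matmul(A, W):
    n, k, m = len(A), len(W), len(W[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):            # each activation row streams through
        for j in range(m):        # each column of PEs
            acc = 0               # partial sum entering the column
            for p in range(k):    # PE (p, j) holds stationary weight W[p][j]
                acc += A[i][p] * W[p][j]   # MAC: acc += activation * weight
            out[i][j] = acc       # result exits the bottom of column j
    return out

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
print(systolic_matmul(A, W))   # [[19, 22], [43, 50]]
```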

Key Points

  • Systolic array: grid of PEs, data flows rhythmically through the array
  • W1 convolution: weights stay, x flows right, y flows left
  • Naturally computes matrix multiplication and convolution
  • Google TPU: 256×256 systolic array, 65,536 8-bit MACs
  • Weight-stationary: weights preloaded, activations flow through
⚠

Exam Tip

Know the W1 systolic array design for convolution: which data stays, which moves, and in what direction. Also know TPU specs: 256x256, 8-bit MACs, 65536 units. Be able to trace through a small systolic array computation.
