The memory hierarchy — registers, SMEM, L2, HBM

Every cycle a thread doesn't have data is a wasted cycle. The kernel-optimisation game is moving the right bytes into the right place at the right time. This lesson is the pyramid of "places."

The numbers

H100 SXM5 numbers, rounded for keep-in-head purposes:

Tier	Where	Size (per SM unless noted)	Bandwidth	Latency
Registers	per warp partition	65,536 × 32-bit = 256 KB	~10s of TB/s aggregate	1 cycle
Shared memory (SMEM)	per SM, on-chip	up to ~228 KB per SM (default per-block cap is 48 KB; opt in to larger via `cudaFuncSetAttribute`)	~19 TB/s aggregate across SMs	~20–30 cycles
L1 cache	per SM (same physical SRAM as SMEM, partitioned)	share of 228 KB	same	~30 cycles
L2 cache	shared by all SMs	50 MB total	~5–6 TB/s	~200 cycles
HBM3	off-chip device memory	80 GB	3.35 TB/s	~400–800 cycles
System memory (via PCIe)	host RAM	TB-scale	~64 GB/s gen5	~10,000 cycles

Read the two columns "bandwidth" and "latency" together. SMEM is roughly 6× the bandwidth of HBM and ~20× the latency advantage. That's the gain you capture every time you load a value once into SMEM and reuse it across many threads or many ops.

3D · the memory pyramid, isometric

The pyramid stacked physically. Each tier is a 3D cube; the footprint scales with capacity (log) and the colour with latency (green = fast, red = slow). Click a tier for its full numbers. The vertical stack reflects on-chip → off-chip → off-package distance.

Animated · the journey of a single load

Click "load" to send a request from a thread (top) down through the hierarchy. Each tier introduces a latency penalty proportional to real cycle counts. Toggle the cache state to model L1-hit, L2-hit, or HBM-miss scenarios.

Load latency · one request, one path

A dot drops from "thread" (top) through tiers. It stops at the first tier that "has" the data (toggleable). Total cycles displayed.

scenario: speed: 4×

cycles elapsed

—

stopped at

—

cost vs reg

—

verdict

—

2D · roofline with bandwidth roofs per tier

The classic roofline plot, augmented: each memory tier contributes its own slope (bandwidth ceiling). Plot a kernel's measured arithmetic intensity to see which tier's bandwidth binds it. Move the AI slider to walk a kernel from "register-resident" past SMEM, L2, and HBM ceilings.

Registers — the fastest, scarcest tier

Each thread has its own private set of 32-bit registers. The compiler allocates them — you don't write register float r; in CUDA. But you do choose indirectly: the number of registers per thread is set per kernel, and it bounds how many threads can be resident on an SM.

The math:

threads_resident_per_SM ≤ 65,536 / registers_per_thread

If your kernel uses 64 registers per thread, you can have up to 1024 threads = 32 warps resident. If it uses 128 registers, only 512 threads = 16 warps. Fewer resident warps = less latency hiding = potentially slower (lesson 27 on occupancy).

You can inspect register usage with nvcc -Xptxas -v or cuobjdump. Limit it with __launch_bounds__ or the -maxrregcount flag.

Shared memory — programmer-managed cache

SMEM is on-chip SRAM physically shared with the L1 cache. The split is configurable (per kernel): you ask for some amount of SMEM via __shared__ declarations, and the rest of that 228 KB becomes L1.

SMEM has properties that make it the workhorse of CUDA optimisation:

Shared by all threads in a block. Thread 5 can write a value; thread 17 can read it (after __syncthreads()).
~6× the bandwidth of HBM. Loading once from HBM and reusing from SMEM across many threads is the source of nearly every meaningful CUDA speedup.
Banked. SMEM is divided into 32 banks (one per warp lane). If 32 threads simultaneously access addresses that fall on different banks, all 32 happen at once. If two threads hit the same bank, the accesses serialise. This is "bank conflict" — a separate, more subtle optimization problem.

The canonical usage pattern (which lesson 24 builds out fully):

__shared__ float tile[16][16];
// each thread loads one element
tile[threadIdx.y][threadIdx.x] = global_array[idx];
__syncthreads();
// now everyone in the block can read any tile element

L1 / L2 cache — usually invisible

Unlike SMEM, the caches are hardware-managed. When a thread reads a global address, the GPU first checks L1, then L2, then HBM. If the L2 was warm because another SM read nearby bytes, you get an L2 hit at ~5 TB/s instead of an HBM trip at 3.35 TB/s.

This is normally good — you don't have to think about it. Two cases when it matters:

Reads that thrash L2. If your access pattern brings 100 MB into L2 each block but L2 is 50 MB, you waste the cache. Tiling (lesson 24) keeps a working set inside L2.
Writes that bypass L1. Stores to global memory go straight to L2 by default, not L1 (because L1 is L2-coherent only for reads on most architectures). Compiler does this for you.

HBM — the bandwidth wall

HBM is the off-chip device memory. For decode (lesson 11), HBM bandwidth is the speed of light. Every weight matrix gets streamed through it once per token; the per-token decode time is exactly weight_bytes / HBM_bw.

Two facts about HBM that drive lesson-23 and lesson-24:

HBM has a "transaction size." When you read 1 byte, the memory controller actually reads 32 (or 128) bytes — the minimum cache line. If 32 threads read scattered single bytes, you waste 31/32 of the bandwidth. Lesson 23 dwells on this.
Reads/writes happen on a stream. The controller can sustain peak bandwidth only when many in-flight transactions are queued — which means enough warps must be issuing loads concurrently. Sparse access patterns starve the controller.

What "fits where" — a real example

Consider a 16×16 tiled matmul (the lesson-24 case). For each output tile of 16×16:

One tile of A (16 × K columns) — sits in SMEM. 16 × K × 4 bytes for fp32.
One tile of B (K × 16 rows) — sits in SMEM. Same size.
One tile of C (16 × 16 partial sums) — sits in registers across the 256 threads, ~1 register per output element per thread. The accumulator never touches SMEM.

Memory accounting: at K=16 and fp32, A + B tiles are 16·16·4 + 16·16·4 = 2 KB. Trivially fits. The win: 16 × 16 = 256 output elements, each updated K times = 4096 multiply-adds, using 2 KB of input from SMEM (loaded once from HBM). HBM traffic per FLOP is dramatically lower than the naïve "load both operands from HBM per multiply" approach.

Constant memory and texture memory (legacy)

Two more memory spaces exist in CUDA, mostly historical:

Constant memory (64 KB per kernel). Read-only, cached, broadcast-optimised when all threads in a warp read the same address. Modern code uses it sparingly — kernel parameters, lookup tables.
Texture memory — wraps HBM with hardware filtering/interpolation. Graphics-era. Tensor-core kernels don't touch it.

The accounting habit, revisited

For any kernel, ask:

What do I need to load from HBM? (= the inputs)
What do I write back to HBM? (= the outputs)
What lives in SMEM during the kernel? (= the reused / shared intermediate)
What lives in registers? (= the per-thread accumulators)
How many flops per byte of HBM I/O? (= arithmetic intensity)

If your kernel does many flops per HBM byte → compute-bound, optimise FLOP throughput. If your kernel does few flops per HBM byte → memory-bound, the only knob is reducing HBM traffic. The number is your kernel's roofline coordinate (lesson 01).

Interactive · move bytes through the pyramid

Pick a kernel's working-set size and reuse pattern. The widget shows where the bytes naturally land (which tier is sufficient), the achievable bandwidth, and the resulting throughput. Try a small footprint with high reuse — SMEM dominates. Try a large footprint — HBM is your speed.

Takeaway

Memory is a pyramid. Each tier down is ~10× larger, ~5× lower bandwidth, and ~10× higher latency. The kernel-writer's job is to put each piece of data in the smallest tier that fits its reuse — load once from HBM into SMEM, reuse many times from SMEM, accumulate in registers. Lesson 23 makes sure your HBM reads are coalesced; lesson 24 builds the canonical SMEM-tile kernel.