gpu_kernel_serving / 02 · memory hierarchy lesson 2 / 17

The memory hierarchy — registers, SMEM, L2, HBM

Every cycle a thread doesn't have data is a wasted cycle. The kernel-optimisation game is moving the right bytes into the right place at the right time. This lesson is the pyramid of "places."

The numbers

H100 SXM5 numbers, rounded for keep-in-head purposes:

TierWhereSize (per SM unless noted)BandwidthLatency
Registersper warp partition65,536 × 32-bit = 256 KB~10s of TB/s aggregate1 cycle
Shared memory (SMEM)per SM, on-chipup to ~228 KB per SM (default per-block cap is 48 KB; opt in to larger via cudaFuncSetAttribute)~19 TB/s aggregate across SMs~20–30 cycles
L1 cacheper SM (same physical SRAM as SMEM, partitioned)share of 228 KBsame~30 cycles
L2 cacheshared by all SMs50 MB total~5–6 TB/s~200 cycles
HBM3off-chip device memory80 GB3.35 TB/s~400–800 cycles
System memory (via PCIe)host RAMTB-scale~64 GB/s gen5~10,000 cycles

Read the two columns "bandwidth" and "latency" together. SMEM is roughly 6× the bandwidth of HBM and ~20× the latency advantage. That's the gain you capture every time you load a value once into SMEM and reuse it across many threads or many ops.

Registers · 256 KB/SM · ~tens of TB/s · 1 cycle private to each thread; allocated at compile time Shared memory (SMEM) / L1 · ~228 KB/SM · ~19 TB/s shared by all threads of one block; manual placement (48 KB default per-block cap) L2 cache · 50 MB total · ~5–6 TB/s global across SMs; hardware-managed HBM3 · 80 GB · 3.35 TB/s the roofline's denominator (lesson 01) PCIe / Host ~64 GB/s, microseconds

3D · the memory pyramid, isometric

The pyramid stacked physically. Each tier is a 3D cube; the footprint scales with capacity (log) and the colour with latency (green = fast, red = slow). Click a tier for its full numbers. The vertical stack reflects on-chip → off-chip → off-package distance.

Memory pyramid · click a tier
Cube footprint ~ log(capacity); colour ~ latency. Top = closest to ALU (1 cycle); bottom = furthest (10k cycles for host).
selected tier
— click —
capacity
bandwidth
latency

Animated · the journey of a single load

Click "load" to send a request from a thread (top) down through the hierarchy. Each tier introduces a latency penalty proportional to real cycle counts. Toggle the cache state to model L1-hit, L2-hit, or HBM-miss scenarios.

Load latency · one request, one path
A dot drops from "thread" (top) through tiers. It stops at the first tier that "has" the data (toggleable). Total cycles displayed.
cycles elapsed
stopped at
cost vs reg
verdict

2D · roofline with bandwidth roofs per tier

The classic roofline plot, augmented: each memory tier contributes its own slope (bandwidth ceiling). Plot a kernel's measured arithmetic intensity to see which tier's bandwidth binds it. Move the AI slider to walk a kernel from "register-resident" past SMEM, L2, and HBM ceilings.

Roofline · per-tier bandwidth ceilings
Compute ceiling (horizontal orange) = peak FLOPS. Sloped lines = BW × AI per tier. Your kernel's marker sits on the lowest binding roof.
binding tier
achievable
% of peak
verdict

Registers — the fastest, scarcest tier

Each thread has its own private set of 32-bit registers. The compiler allocates them — you don't write register float r; in CUDA. But you do choose indirectly: the number of registers per thread is set per kernel, and it bounds how many threads can be resident on an SM.

The math:

threads_resident_per_SM ≤ 65,536 / registers_per_thread

If your kernel uses 64 registers per thread, you can have up to 1024 threads = 32 warps resident. If it uses 128 registers, only 512 threads = 16 warps. Fewer resident warps = less latency hiding = potentially slower (lesson 27 on occupancy).

You can inspect register usage with nvcc -Xptxas -v or cuobjdump. Limit it with __launch_bounds__ or the -maxrregcount flag.

Shared memory — programmer-managed cache

SMEM is on-chip SRAM physically shared with the L1 cache. The split is configurable (per kernel): you ask for some amount of SMEM via __shared__ declarations, and the rest of that 228 KB becomes L1.

SMEM has properties that make it the workhorse of CUDA optimisation:

The canonical usage pattern (which lesson 24 builds out fully):

__shared__ float tile[16][16];
// each thread loads one element
tile[threadIdx.y][threadIdx.x] = global_array[idx];
__syncthreads();
// now everyone in the block can read any tile element

L1 / L2 cache — usually invisible

Unlike SMEM, the caches are hardware-managed. When a thread reads a global address, the GPU first checks L1, then L2, then HBM. If the L2 was warm because another SM read nearby bytes, you get an L2 hit at ~5 TB/s instead of an HBM trip at 3.35 TB/s.

This is normally good — you don't have to think about it. Two cases when it matters:

HBM — the bandwidth wall

HBM is the off-chip device memory. For decode (lesson 11), HBM bandwidth is the speed of light. Every weight matrix gets streamed through it once per token; the per-token decode time is exactly weight_bytes / HBM_bw.

Two facts about HBM that drive lesson-23 and lesson-24:

What "fits where" — a real example

Consider a 16×16 tiled matmul (the lesson-24 case). For each output tile of 16×16:

Memory accounting: at K=16 and fp32, A + B tiles are 16·16·4 + 16·16·4 = 2 KB. Trivially fits. The win: 16 × 16 = 256 output elements, each updated K times = 4096 multiply-adds, using 2 KB of input from SMEM (loaded once from HBM). HBM traffic per FLOP is dramatically lower than the naïve "load both operands from HBM per multiply" approach.

Constant memory and texture memory (legacy)

Two more memory spaces exist in CUDA, mostly historical:

The accounting habit, revisited

For any kernel, ask:

  1. What do I need to load from HBM? (= the inputs)
  2. What do I write back to HBM? (= the outputs)
  3. What lives in SMEM during the kernel? (= the reused / shared intermediate)
  4. What lives in registers? (= the per-thread accumulators)
  5. How many flops per byte of HBM I/O? (= arithmetic intensity)

If your kernel does many flops per HBM byte → compute-bound, optimise FLOP throughput. If your kernel does few flops per HBM byte → memory-bound, the only knob is reducing HBM traffic. The number is your kernel's roofline coordinate (lesson 01).

Interactive · move bytes through the pyramid

Pick a kernel's working-set size and reuse pattern. The widget shows where the bytes naturally land (which tier is sufficient), the achievable bandwidth, and the resulting throughput. Try a small footprint with high reuse — SMEM dominates. Try a large footprint — HBM is your speed.

Working-set placement · which tier feeds your loop?
Each access reads access_bytes; we do reuse accesses per loaded element. The smallest tier that fits the working set decides the effective bandwidth.
tier that fits
effective BW
throughput (GB/s)
verdict
Takeaway
Memory is a pyramid. Each tier down is ~10× larger, ~5× lower bandwidth, and ~10× higher latency. The kernel-writer's job is to put each piece of data in the smallest tier that fits its reuse — load once from HBM into SMEM, reuse many times from SMEM, accumulate in registers. Lesson 23 makes sure your HBM reads are coalesced; lesson 24 builds the canonical SMEM-tile kernel.