all_lessons/gpu_kernel_serving/10 · bandwidth machinelesson 10 / 17

GPU as a bandwidth machine

Modern accelerators have more arithmetic than they can feed. Almost every serving question reduces to "where are the bytes, and how many trips do they take through HBM?" This lesson installs that habit before any model word appears.

The question this lesson answers

Why is a GPU so different to program than a CPU, and why does LLM serving end up obsessing over memory rather than math? Because on current GPUs the ratio of peak arithmetic to peak HBM bandwidth has grown past 200 FLOPs per byte. Any operation that does fewer FLOPs per byte read is bottlenecked by HBM, regardless of how clever the math is. Most serving operations sit on the wrong side of that ratio.

So the first habit to install is this: before you talk about kernels, attention, paged caches, or radix trees, count the bytes. Every later trick in this track exists to either (a) move bytes fewer times or (b) make better use of bytes already nearby.

The memory hierarchy with real numbers

Treat HBM, L2, shared memory (SRAM), and registers as four layers. Each layer is smaller, faster, and closer to the compute units than the one before it. The numbers below are order-of-magnitude figures for an NVIDIA H100 SXM; newer parts (B200) move them by a constant factor but the ratios are similar.

HBM380 GB~3.35 TB/soff-chip, shared L2 cache~50 MB~6 TB/son-chip, shared SRAM / SMEM228 KB/SM~33 TB/s per SMper streaming multiprocessor Registers256 KB/SMessentially freeper thread peak FP16 tensor-core throughput on H100: ~989 TFLOP/s (FP16), ~1979 TFLOP/s (FP8) arithmetic intensity break-even: 989e12 / 3.35e12 ≈ 295 FLOPs per byte read from HBM memory-bound region (most serving work) compute-bound region x-axis: arithmetic intensity (FLOPs/byte). Decode attention sits at ~1; FP16 GEMMs at batch 32 sit near 100; large prefill GEMMs cross the line.

Three numbers anchor everything else in the track:

The roofline equation

Every kernel takes time bounded below by two terms — the time to move its bytes and the time to do its math — plus a fixed cost per launch. Whichever term is largest is the bottleneck:

time ≈ max( bytes / bandwidth , flops / peak_flops ) + launch_overhead

Read this carefully. The "+ launch" matters at small problem sizes. Decode in serving is exactly that regime: per step, each layer's attention reads ~MB of KV, does ~MFLOP, finishes in microseconds — so a 5–10 µs kernel launch is no longer negligible. Lesson 06 returns to this.

What a kernel actually is

A kernel is a small program launched over a grid of thread blocks. Each block is scheduled onto one streaming multiprocessor (SM). Threads in a block execute in warps of 32, run the same instruction in lockstep when possible, and share fast on-chip memory.

Grid (one kernel launch) many independent blocks scheduled onto → SM (one of ~132) runs warps from 1+ blocks warp 0 · 32 threads warp 1 · 32 threads warp 2 · 32 threads tensor cores Shared mem 228 KB Reg file 256 KB L1 ~unified "Occupancy" = how many warps fit at once. More warps → better latency hiding when one warp stalls on HBM.

The programmer's job is to choose three things: how big the grid is (problem decomposition), how big a block is (resource sharing), and what each thread does (memory pattern and arithmetic). The hardware then decides which blocks run where and when.

The three habits: tile, fuse, coalesce

Every kernel optimization in this track is one of these three habits — a disciplined attempt to keep bytes near compute and avoid scattered HBM traffic.

HabitWhat it meansWhat it buys
TileOperate on a small sub-block that fits in SMEM or registers; loop over tiles.Reuse loaded bytes for many FLOPs. Raises arithmetic intensity.
FuseRun several ops back-to-back on the same data before writing back.Skips intermediate HBM round trips. Saves bandwidth.
CoalesceHave adjacent threads in a warp read adjacent addresses.One wide memory transaction instead of 32 scattered ones.

Fusion as memory accounting

Consider z = relu((x + b) * s) with x of shape [N, H] in bf16. Three unfused kernels each read the tensor and write the tensor — six HBM trips. A fused kernel reads once and writes once — two trips. The math is unchanged. For N·H = 4M elements that is the difference between 48 MB of HBM traffic and 16 MB. At 3.35 TB/s that's saving ~10 µs of pure bandwidth time, which on a per-decode-step basis adds up.

# Triton-style sketch of the fused version
@kernel
def add_bias_relu_scale(x_ptr, b_ptr, z_ptr, s, N, H):
    pid = program_id(0)
    offs = pid * BLOCK + arange(0, BLOCK)
    x = load(x_ptr + offs)             # one HBM read
    b = load(b_ptr + (offs % H))       # tiny, cached
    z = relu((x + b) * s)              # stays in registers
    store(z_ptr + offs, z)             # one HBM write

Interactive · roofline accountant

Set the bytes and FLOPs your operation does and the GPU's peak rates. The widget tells you which side dominates and the operation's arithmetic intensity. This is the calculator to run before requesting "a faster kernel."

Memory-bound or compute-bound?

Decode attention reads ~10 MB and does ~5 MFLOP per layer per request → intensity ≈ 0.5. A 4K×4K bf16 GEMM at batch 128 reads ~40 MB and does 4 GFLOP → intensity ≈ 100. Try both.

Launch overhead is a real third axis

The "+ launch_overhead" term looks small but compounds. A typical kernel launch is 5–10 µs of CPU-side work plus driver and queue latency. A transformer decode step launches 80–200 kernels per token (attention + GEMMs + small bookkeeping). At 5 µs each that is 0.4–1.0 ms of pure dispatch, often comparable to the actual math. This is why CUDA graphs (lesson 06) exist.

What this gives you for the rest of the track

Two reflexes:

  1. Count bytes before naming a kernel. Before asking for FlashAttention, paged KV, or radix reuse, compute how much HBM the naive version moves. Each later technique is justified by which bytes it eliminates.
  2. Place the operation on the roofline. Prefill GEMMs end up compute-bound; decode attention ends up bandwidth-bound; small bookkeeping ops end up launch-bound. Different bottlenecks, different fixes.
The next step
Lesson 02 takes a transformer forward pass and lays out every kernel it launches, with byte/FLOP labels. That is where "KV cache" stops being a phrase and becomes a number you can compute.