GPU as a bandwidth machine
Modern accelerators have more arithmetic than they can feed. Almost every serving question reduces to "where are the bytes, and how many trips do they take through HBM?" This lesson installs that habit before any model word appears.
The question this lesson answers
Why is a GPU so different to program than a CPU, and why does LLM serving end up obsessing over memory rather than math? Because on current GPUs the ratio of peak arithmetic to peak HBM bandwidth has grown past 200 FLOPs per byte. Any operation that does fewer FLOPs per byte read is bottlenecked by HBM, regardless of how clever the math is. Most serving operations sit on the wrong side of that ratio.
So the first habit to install is this: before you talk about kernels, attention, paged caches, or radix trees, count the bytes. Every later trick in this track exists to either (a) move bytes fewer times or (b) make better use of bytes already nearby.
The memory hierarchy with real numbers
Treat HBM, L2, shared memory (SRAM), and registers as four layers. Each layer is smaller, faster, and closer to the compute units than the one before it. The numbers below are order-of-magnitude figures for an NVIDIA H100 SXM; newer parts (B200) move them by a constant factor but the ratios are similar.
Three numbers anchor everything else in the track:
- HBM bandwidth (~3.35 TB/s on H100): the rate at which the GPU can pull any new bytes onto the chip.
- Peak compute (~989 TFLOP/s FP16): the rate at which tensor cores can do math on bytes already on the chip.
- Arithmetic intensity break-even (~295 FLOPs/byte): below this you waste tensor cores; above this you waste bandwidth.
The roofline equation
Every kernel takes time bounded below by two terms — the time to move its bytes and the time to do its math — plus a fixed cost per launch. Whichever term is largest is the bottleneck:
Read this carefully. The "+ launch" matters at small problem sizes. Decode in serving is exactly that regime: per step, each layer's attention reads ~MB of KV, does ~MFLOP, finishes in microseconds — so a 5–10 µs kernel launch is no longer negligible. Lesson 06 returns to this.
What a kernel actually is
A kernel is a small program launched over a grid of thread blocks. Each block is scheduled onto one streaming multiprocessor (SM). Threads in a block execute in warps of 32, run the same instruction in lockstep when possible, and share fast on-chip memory.
The programmer's job is to choose three things: how big the grid is (problem decomposition), how big a block is (resource sharing), and what each thread does (memory pattern and arithmetic). The hardware then decides which blocks run where and when.
The three habits: tile, fuse, coalesce
Every kernel optimization in this track is one of these three habits — a disciplined attempt to keep bytes near compute and avoid scattered HBM traffic.
| Habit | What it means | What it buys |
|---|---|---|
| Tile | Operate on a small sub-block that fits in SMEM or registers; loop over tiles. | Reuse loaded bytes for many FLOPs. Raises arithmetic intensity. |
| Fuse | Run several ops back-to-back on the same data before writing back. | Skips intermediate HBM round trips. Saves bandwidth. |
| Coalesce | Have adjacent threads in a warp read adjacent addresses. | One wide memory transaction instead of 32 scattered ones. |
Fusion as memory accounting
Consider z = relu((x + b) * s) with x of shape [N, H] in bf16. Three unfused kernels each read the tensor and write the tensor — six HBM trips. A fused kernel reads once and writes once — two trips. The math is unchanged. For N·H = 4M elements that is the difference between 48 MB of HBM traffic and 16 MB. At 3.35 TB/s that's saving ~10 µs of pure bandwidth time, which on a per-decode-step basis adds up.
# Triton-style sketch of the fused version
@kernel
def add_bias_relu_scale(x_ptr, b_ptr, z_ptr, s, N, H):
pid = program_id(0)
offs = pid * BLOCK + arange(0, BLOCK)
x = load(x_ptr + offs) # one HBM read
b = load(b_ptr + (offs % H)) # tiny, cached
z = relu((x + b) * s) # stays in registers
store(z_ptr + offs, z) # one HBM write
Interactive · roofline accountant
Set the bytes and FLOPs your operation does and the GPU's peak rates. The widget tells you which side dominates and the operation's arithmetic intensity. This is the calculator to run before requesting "a faster kernel."
Launch overhead is a real third axis
The "+ launch_overhead" term looks small but compounds. A typical kernel launch is 5–10 µs of CPU-side work plus driver and queue latency. A transformer decode step launches 80–200 kernels per token (attention + GEMMs + small bookkeeping). At 5 µs each that is 0.4–1.0 ms of pure dispatch, often comparable to the actual math. This is why CUDA graphs (lesson 06) exist.
What this gives you for the rest of the track
Two reflexes:
- Count bytes before naming a kernel. Before asking for FlashAttention, paged KV, or radix reuse, compute how much HBM the naive version moves. Each later technique is justified by which bytes it eliminates.
- Place the operation on the roofline. Prefill GEMMs end up compute-bound; decode attention ends up bandwidth-bound; small bookkeeping ops end up launch-bound. Different bottlenecks, different fixes.