24 lessons · CUDA → serving → optimization

GPU Kernels for LLM Serving

One linear path from "thread, block, warp" to "tokens streamed to a user", then a third part on how to make the kernels fast: profile, hypothesize, write Triton or compile, re-profile. Twenty-four lessons, each building on the one before.

How to read this track
Part I (01–09): CUDA primitives. Part II (10–17): serving — what vLLM/SGLang actually do. Part III (18–24): the optimization loop — PyTorch dispatch, profiling, Triton, torch.compile, performant patterns. Each lesson assumes only the lessons before it.
Mental stance
Treat a serving engine as a controller for kernels. The framework decides which work exists; the scheduler decides which work runs together; the kernel decides how bytes move through HBM, SRAM, registers, and tensor cores. Optimizing the wrong layer is the most common mistake in this domain.

The dependency chain

The track is a directed line, not a wheel. CUDA primitives generate the kernel ideas; the kernel ideas generate the memory-layout ideas; the memory-layout ideas generate the scheduling ideas; the scheduling ideas generate the framework. If you skip a step, the next step looks like a folk trick instead of a forced move.

Part I · CUDA from first principles: 01 execution (SIMT, SMs); 02 memory (HBM→SMEM→reg); 03 vector add (first kernel); 04 coalesce (warp loads); 05 tiling (SMEM matmul); 06 warps (divergence); 07 reduce (tree pattern); 08 occupancy (latency hide); 09 tensor cores (mma)
Part II · serving kernels on top of those primitives: 10 bandwidth (roofline synthesis); 11 forward (kernel chain, KV); 12 attention (prefill/decode, flash); 13 paged KV (block tables); 14 prefix reuse (hash & radix); 15 schedule (batch, chunk, graphs); 16 other kernels (GEMM, quant, MoE); 17 framework (HTTP → kernel → stream)
Part III · from measurement to optimization: 18 dispatch (PyTorch → kernel); 19 fused kernel (anatomy); 20 profilers (three tools); 21 read profile (roofline + stalls); 22 Triton (write fused kernels); 23 compile (Dynamo+Inductor); 24 patterns & synthesis (PyTorch-level habits)

Part III is the optimization loop. Each lesson assumes Parts I–II plus everything to its left. "My code is slow" → profile (20) → identify class (21) → fix at the right layer (22, 23, 24).

Part I · CUDA from first principles

Hardware execution model, memory pyramid, your first kernel, the coalesced-access trap, tile-and-share, warp divergence, the reduction template, occupancy as a budget, and the tensor-core instruction. By the end you can read any kernel in Part II.

01
GPU execution model
A GPU is a hardware lattice with a specific shape. Threads, warps, blocks, SMs — and how every CUDA concept maps onto a piece of the lattice.
02
Memory hierarchy
HBM, L2, shared memory, registers — the pyramid of "places," with bandwidths. Why the kernel game is about moving the right bytes to the right place at the right time.
03
First kernel — vector add
The smallest useful CUDA program. __global__, launch syntax, host/device split, the boundary check. Walk through it and you have the skeleton that every later kernel shares.
04
Coalesced memory access
A warp's 32 threads can land in one memory transaction or up to 32 — depending on stride. The single biggest beginner trap.
05
Shared memory & tiled matmul
Tile A and B into SMEM, reuse each loaded byte about T times for a T×T tile, cut HBM traffic by roughly T, and move toward compute-bound. FlashAttention builds on the same idea.
06
Warps, divergence, sync
32 threads run the same instruction. When they disagree, the warp serialises. Branch warp-aligned or use predication.
07
Reductions
Warp shuffles, shared memory, atomics — the parallel-tree template behind softmax, normalization, and every "produce one number from many."
08
Occupancy
More warps in flight → better latency hiding, but each costs registers and SMEM. Maximum occupancy isn't always fastest.
09
Tensor cores
Tensor cores multiply small matrix tiles in hardware, for example 16×16×16 fragments through the WMMA API. Every meaningful matmul on a modern GPU lands here.
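
Lesson 04's claim that a warp's 32 loads land in one memory transaction or up to 32 can be checked with a toy model. The sketch below is a simplification, not a hardware simulator: it assumes 4-byte elements and 128-byte aligned segments, and counts how many distinct segments one warp touches for a given stride. The function name, the segment size, and the default arguments are illustrative assumptions; real GPUs add caches and finer-grained accounting that this ignores.

```python
WARP_SIZE = 32  # threads per warp


def segments_touched(stride_elems, elem_bytes=4, segment_bytes=128, base_addr=0):
    """Count distinct aligned segments hit when each lane loads one element.

    Lane i reads the element at base_addr + i * stride_elems * elem_bytes.
    The segment size is an assumption chosen for illustration.
    """
    segments = set()
    for lane in range(WARP_SIZE):
        addr = base_addr + lane * stride_elems * elem_bytes
        segments.add(addr // segment_bytes)
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:>2}: {segments_touched(stride)} segment(s)")
```

Under these assumptions a unit stride touches one segment, while a stride of 32 elements puts every lane in its own segment, which is the 1-versus-32 gap the lesson describes. A misaligned base address (for example `base_addr=64`) already doubles the count at unit stride.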

Part II · LLM serving on top of Part I

Where the primitives become FlashAttention, paged KV, radix prefix reuse, continuous batching, CUDA graphs, and the framework that turns HTTP requests into batches kernels can execute.

10
GPU as a bandwidth machine
Roofline synthesis of Part I, with H100 numbers. The accounting you do before asking for a custom kernel.
11
Transformer forward as a kernel chain
A forward pass is a sequence of GEMMs and an attention. Compute KV cache size from first principles and see exactly where bytes go each token.
12
Attention asymmetry & FlashAttention
Prefill compute-bound, decode bandwidth-bound, same math. FlashAttention's online softmax in seven lines, and what it does not fix.
13
Paged KV: virtual memory for caches
Why contiguous allocation wastes HBM, how block tables fix it, what indirection costs the kernel.
14
Prefix reuse: hash caches and radix trees
When prompts share prefixes, share the KV. Hash caches, radix tries, eviction, cache-aware routing.
15
Scheduling and graph capture
Continuous batching, chunked prefill, CUDA graphs — making the scheduler's traffic look regular and cheap to launch.
16
Beyond attention: GEMM, quant, MoE, sampling
The rest of the decode chain. Quantization as packing-plus-kernel, MoE dispatch, GPU sampling.
17
The serving framework: synthesis
HTTP → tokenizer → queue → scheduler → worker → kernel → sampler → stream. Where time goes, how to debug, vLLM and SGLang as compositions.
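
The online softmax that lesson 12 attributes to FlashAttention can be sketched in plain Python. The idea shown here is that a running maximum and a running denominator are maintained in one pass over the scores, with the denominator rescaled whenever the maximum grows. This is a scalar illustration under that assumption, not the lesson's own seven-line version and not a FlashAttention kernel; both function names are illustrative.

```python
import math


def online_softmax(scores):
    """Softmax over finite scores; max and denominator come from one pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the old sum to the new max before adding the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    # Normalise once the final max and denominator are known.
    return [math.exp(s - running_max) / running_sum for s in scores]


def reference_softmax(scores):
    """Conventional softmax: one pass for the max, another for the sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [1.0, 3.0, -2.0, 0.5]
    print(online_softmax(scores))
    print(reference_softmax(scores))
```

The rescaling step is what lets a tiled kernel process the scores block by block without first materialising the whole row, since no earlier block needs to be revisited when a larger maximum appears.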

Part III · From measurement to optimization

Now that you know what kernels exist (Parts I–II), this part is the loop: trace PyTorch → kernel, profile, read the profile, write Triton or compile, tune the PyTorch-level patterns that surround everything. The toolkit for making real code faster.

18
PyTorch op → kernel launch
The dispatch chain (Python → ATen → CUDA → SM). Where the 10–30 µs per op goes, and why eager mode is sometimes 5× slower. Sync points.
19
Anatomy of a fused kernel
A working Triton RMSNorm read line by line: program model, masked loads, on-chip reduction, autotune. The template that recurs in every fused kernel.
20
Profiling toolkit: three views
torch.profiler asks "which line." nsys asks "where are the gaps." ncu asks "why is this kernel slow." Three tools, three questions, three overhead profiles.
21
Reading a profile: roofline in practice
Three numbers (DRAM%, SM%, occupancy) place a kernel on the four-quadrant grid. Three worked examples; the named stall reasons and their cures.
22
Writing Triton kernels
The DSL in one table; a fused GELU+linear with autotune; the design loop; when Triton wins and when libraries beat it. All the production gotchas.
23
torch.compile: compiler-driven fusion
Dynamo → AOTAutograd → Inductor. Graph breaks, dynamic shapes, cache, modes. When compile beats hand-Triton and how to verify the speedup.
24
Performant PyTorch patterns & synthesis
Seven habits that decide whether your kernels run healthy: async H2D, allocator, contiguous, mixed precision, autograd, dataloader, sync discipline. Plus the closing synthesis.
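
Lesson 18's 10–30 µs per-op dispatch cost suggests a back-of-envelope check for whether a step is limited by CPU-side launches rather than GPU work. The sketch below assumes CPU dispatch and GPU execution overlap perfectly, so the step takes the larger of the two; that overlap model, the function names, and every number in the example are hypothetical, chosen only to show the arithmetic.

```python
def step_time_us(num_ops, dispatch_us_per_op, gpu_busy_us):
    """Estimate step time, assuming dispatch fully overlaps GPU execution."""
    cpu_us = num_ops * dispatch_us_per_op
    return max(cpu_us, gpu_busy_us)


def is_dispatch_bound(num_ops, dispatch_us_per_op, gpu_busy_us):
    """True when CPU-side dispatch, not GPU work, sets the step time."""
    return num_ops * dispatch_us_per_op > gpu_busy_us


if __name__ == "__main__":
    # Hypothetical step: 500 eager ops at 20 us each, 4000 us of GPU work.
    eager = step_time_us(500, 20, 4000)
    # Same GPU work after fusing down to 100 launches.
    fused = step_time_us(100, 20, 4000)
    print(eager, fused, eager / fused)  # 10000 4000 2.5
```

In this made-up case, fusion helps only until the step becomes GPU-bound; past that point, cutting more launches changes nothing and the next fix belongs in the kernels themselves.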

What each lesson adds

Lesson | New first-principles idea | New trade-off you can quantify
01 | Threads run as warps inside blocks inside SMs. | SM-residency vs warps in flight.
02 | Memory closer to compute is smaller and faster. | Bandwidth at each tier.
03 | Launch syntax + host/device split. | Launch overhead vs work.
04 | Coalescing: one transaction vs many. | Stride vs transaction count.
05 | Tile and reuse: turns memory-bound into compute-bound. | Tile size vs SMEM cap.
06 | Warp execution and divergence. | Branch alignment vs predication cost.
07 | Parallel reduction tree. | Warp shuffle vs SMEM vs atomic.
08 | Occupancy is a per-SM budget. | Registers/thread vs warps/SM.
09 | Tensor cores do the matmul; everything feeds them. | Tile shape vs mma fragment shape.
10 | Time = max(bytes/BW, FLOPs/peak) + launch. | Fusion saves a round trip; how many?
11 | KV cache size = 2·L·H·d·b per token. | Per-step bytes vs FLOPs at a given context length.
12 | Prefill and decode have different arithmetic intensities. | FlashAttention HBM savings as f(T, d).
13 | KV memory is virtual memory with page tables. | Block size: tail vs table vs indirection.
14 | Prefix sharing is workload structure, not a kernel feature. | Prefill compute saved at a given shared fraction.
15 | Scheduler makes irregular traffic look regular. | Token budget; CUDA graph win at small batch.
16 | Quantization is a layout decision. | Weight bandwidth saved vs dequant overhead.
17 | End-to-end = API + queue + prefill + decode + stream. | Where 1 ms of latency improvement pays back.
18 | Each PyTorch op pays a 10–30 µs CPU-side dispatch tax. | Step-level CPU vs GPU time as f(ops, fusion).
19 | Every fused kernel is prologue → load → math → epilogue. | Fused HBM bytes vs unfused, with launch overhead.
20 | Three tools answer three different questions. | Tool choice vs symptom; overhead vs coverage.
21 | DRAM% × SM% × occupancy place a kernel on a 2×2 grid. | Bottleneck class → first fix to try.
22 | Triton is tile-level; CUDA is thread-level. | Triton vs library vs no-op decision.
23 | Dynamo + AOTAutograd + Inductor fuse for you. | Predicted speedup as f(launch share, fusion potential, graph breaks).
24 | Seven habits keep the GPU fed. | Where the next hour of work goes given a current profile.
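
Two formulas from the table can be turned into a small calculator: the lesson 10 estimate Time = max(bytes/BW, FLOPs/peak) + launch, and the lesson 11 KV cache size 2·L·H·d·b per token. The sketch reads L as layer count, H as number of KV heads, d as head dimension, and b as bytes per element, which is an interpretation the table does not spell out. Every numeric value in the example is a hypothetical placeholder, not a measured hardware or model figure.

```python
def kernel_time_s(bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s, launch_s=0.0):
    """Roofline-style estimate: the slower of memory and compute, plus launch."""
    memory_s = bytes_moved / bandwidth_bytes_per_s
    compute_s = flops / peak_flops_per_s
    return max(memory_s, compute_s) + launch_s


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes per token; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dim 128, 2-byte elements.
    per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
    print(per_token)          # 131072 bytes, i.e. 128 KiB per token
    print(per_token * 4096)   # 536870912 bytes, i.e. 512 MiB at 4096 tokens

    # Hypothetical kernel: 2 GB moved, 4 GFLOP, 1 TB/s, 100 TFLOP/s, 5 us launch.
    print(kernel_time_s(2e9, 4e9, 1e12, 1e14, launch_s=5e-6))
```

With these placeholder numbers the memory term (2 ms) dwarfs the compute term (0.04 ms), so the kernel would be classed as bandwidth-bound and a faster math path would not help.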

Primary references

Topic | Reference
vLLM paged kernel | vLLM paged attention kernel note
vLLM backend routing | vLLM attention backend feature support
vLLM serving architecture | vLLM architecture overview
SGLang attention backends | SGLang attention backend docs
SGLang prefix reuse | SGLang RadixAttention concept doc
SGLang gateway/router | SGLang Model Gateway docs
Kernel library layer | NVIDIA FlashInfer overview
FlashAttention original | Dao et al., FlashAttention