GPU Kernels for LLM Serving
One linear path from "thread, block, warp" to "tokens streamed to a user", then a third part on how to make the kernels fast: profile, hypothesize, write Triton or compile, re-profile. Twenty-four lessons, strictly linear.
The dependency chain
The track is a directed line, not a wheel. CUDA primitives generate the kernel ideas; the kernel ideas generate the memory-layout ideas; the memory-layout ideas generate the scheduling ideas; the scheduling ideas generate the framework. If you skip a step, the next one looks like a folk trick instead of a forced move.
Part I · CUDA from first principles
Hardware execution model, memory pyramid, your first kernel, the coalesced-access trap, tile-and-share, warp divergence, the reduction template, occupancy as a budget, and the tensor-core instruction. By the end you can read any kernel in Part II.
__global__, launch syntax, host/device split, the boundary check. Walk this and you can read any kernel.
Part II · LLM serving on top of Part I
Where the primitives become FlashAttention, paged KV, radix prefix reuse, continuous batching, CUDA graphs, and the framework that turns HTTP requests into batches kernels can execute.
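The paged-KV idea named above can be sketched as a toy allocator in plain Python. This is an illustration only, not how vLLM or any other framework implements it: the class name `PagedKVAllocator`, its methods, and the block size of 16 are all hypothetical. It shows only the bookkeeping: each sequence holds a table of physical block ids, a new block is taken from a free list when the current one fills, and the unused slots in the last block are the "tail" cost.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVAllocator:
    """Toy block allocator (hypothetical): logical token positions -> physical blocks."""
    num_blocks: int
    block_size: int  # tokens per block; the value is the caller's choice
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [physical block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored so far

    def __post_init__(self):
        # Every physical block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset_in_block)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # No block yet, or the last block is full: take one from the free list.
            if not self.free_blocks:
                raise MemoryError("out of KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def wasted_slots(self, seq_id):
        """Allocated but unused slots in the sequence's last block (the tail cost)."""
        allocated = len(self.block_tables[seq_id]) * self.block_size
        return allocated - self.seq_lens[seq_id]

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    block, offset = alloc.append_token("seq-a")
print(alloc.block_tables["seq-a"], alloc.wasted_slots("seq-a"))
```

A larger `block_size` shortens the table and cuts indirection but wastes more tail slots per sequence; a smaller one does the reverse. A real kernel would read the block table on the GPU to locate each K/V tile.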
Part III · From measurement to optimization
Now that you know what kernels exist (Parts I–II), this part is the loop: trace PyTorch → kernel, profile, read the profile, write Triton or compile, tune the PyTorch-level patterns that surround everything. The toolkit for making real code faster.
What each lesson adds
| Lesson | New first-principles idea | New trade-off you can quantify |
|---|---|---|
| 01 | Threads run as warps inside blocks inside SMs. | SM-residency vs warps in flight. |
| 02 | Memory closer to compute is smaller and faster. | Bandwidth at each tier. |
| 03 | Launch syntax + host/device split. | Launch overhead vs work. |
| 04 | Coalescing — one transaction vs many. | Stride vs transaction count. |
| 05 | Tile and reuse — turns memory-bound into compute-bound. | Tile size vs SMEM cap. |
| 06 | Warp execution and divergence. | Branch alignment vs predication cost. |
| 07 | Parallel reduction tree. | Warp shuffle vs SMEM vs atomic. |
| 08 | Occupancy is a per-SM budget. | Registers/thread vs warps/SM. |
| 09 | Tensor cores do the matmul; everything feeds them. | Tile shape vs mma fragment shape. |
| 10 | Time = max(bytes/BW, FLOPs/peak) + launch. | Fusion saves a round trip; how many? |
| 11 | KV cache size = 2·L·H·d·b per token. | Per-step bytes vs FLOPs at a given context length. |
| 12 | Prefill and decode have different arithmetic intensities. | FlashAttention HBM savings as f(T, d). |
| 13 | KV memory is virtual memory with page tables. | Block size: tail vs table vs indirection. |
| 14 | Prefix sharing is workload structure, not a kernel feature. | Prefill compute saved at a given shared fraction. |
| 15 | Scheduler makes irregular traffic look regular. | Token budget; CUDA graph win at small batch. |
| 16 | Quantization is a layout decision. | Weight bandwidth saved vs dequant overhead. |
| 17 | End-to-end = API + queue + prefill + decode + stream. | Where 1 ms of latency improvement pays back. |
| 18 | Each PyTorch op pays a 10–30 µs CPU-side dispatch tax. | Step-level CPU vs GPU time as f(ops, fusion). |
| 19 | Every fused kernel is prologue → load → math → epilogue. | Fused HBM bytes vs unfused, with launch overhead. |
| 20 | Three tools answer three different questions. | Tool choice vs symptom; overhead vs coverage. |
| 21 | DRAM% × SM% × occupancy place a kernel on a 2×2 grid. | Bottleneck class → first fix to try. |
| 22 | Triton is tile-level; CUDA is thread-level. | Triton vs library vs no-op decision. |
| 23 | Dynamo + AOTAutograd + Inductor fuse for you. | Predicted speedup as f(launch share, fusion potential, graph breaks). |
| 24 | Seven habits keep the GPU fed. | Where the next hour of work goes given a current profile. |
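
The two formulas in rows 10 and 11 can be tried out with a short sketch. It assumes a reading of the symbols that the table does not spell out: L as layer count, H as number of KV heads, d as head dimension, and b as bytes per element. The function names and every number below (model shape, bandwidth, peak FLOP/s, launch time) are hypothetical values for illustration, not measurements of any real model or GPU.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes per token: 2 * L * H * d * b (the 2 is one K plus one V)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kernel_time_s(bytes_moved, flops, bw_bytes_per_s, peak_flops_per_s, launch_s=0.0):
    """Estimate: time = max(bytes / BW, FLOPs / peak) + launch."""
    memory_time = bytes_moved / bw_bytes_per_s
    compute_time = flops / peak_flops_per_s
    return max(memory_time, compute_time) + launch_s


def bound(bytes_moved, flops, bw_bytes_per_s, peak_flops_per_s):
    """Report which term of the max() dominates."""
    if bytes_moved / bw_bytes_per_s >= flops / peak_flops_per_s:
        return "memory"
    return "compute"


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, 2-byte elements.
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)  # 524288 bytes, i.e. 512 KiB per token

# Hypothetical decode step that reads the KV cache of a 4096-token context once.
context_len = 4096
bytes_moved = per_token * context_len
# Rough attention FLOPs for one new token, assuming 2 FLOPs per multiply-add
# for each of QK^T and PV: 2 * 2 * L * H * d * T.
flops = 2 * 2 * 32 * 32 * 128 * context_len
# Hypothetical hardware numbers: 1e12 B/s bandwidth, 1e14 FLOP/s, 10 us launch.
print(kernel_time_s(bytes_moved, flops, 1e12, 1e14, launch_s=1e-5))
print(bound(bytes_moved, flops, 1e12, 1e14))
```

Under these assumed numbers the byte term dominates, which is the sense in which row 11 compares per-step bytes with FLOPs at a given context length.
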
Primary references
| Topic | Reference |
|---|---|
| vLLM paged kernel | vLLM paged attention kernel note |
| vLLM backend routing | vLLM attention backend feature support |
| vLLM serving architecture | vLLM architecture overview |
| SGLang attention backends | SGLang attention backend docs |
| SGLang prefix reuse | SGLang RadixAttention concept doc |
| SGLang gateway/router | SGLang Model Gateway docs |
| Kernel library layer | NVIDIA FlashInfer overview |
| FlashAttention original | Dao et al., FlashAttention |