GPU Kernels for LLM Serving

One linear path from "thread, block, warp" to "tokens streamed to a user", then a third part on how to make those kernels fast: profile, hypothesize, write Triton or compile, re-profile. Twenty-four lessons, strictly linear.

How to read this track

Part I (01–09): CUDA primitives. Part II (10–17): serving — what vLLM/SGLang actually do. Part III (18–24): the optimization loop — PyTorch dispatch, profiling, Triton, torch.compile, performant patterns. Each lesson assumes only the lessons before it.

Mental stance

Treat a serving engine as a controller for kernels. The framework decides which work exists; the scheduler decides which work runs together; the kernel decides how bytes move through HBM, SRAM, registers, and tensor cores. Optimizing the wrong layer is the most common mistake in this domain.

The dependency chain

The track is a directed line, not a wheel. CUDA primitives generate the kernel ideas; the kernel ideas generate the memory-layout ideas; the memory-layout ideas generate the scheduling ideas; the scheduling ideas generate the framework. If you skip a step the next step looks like a folk trick instead of a forced move.

Part I · CUDA from first principles

Hardware execution model, memory pyramid, your first kernel, the coalesced-access trap, tile-and-share, warp divergence, the reduction template, occupancy as a budget, and the tensor-core instruction. By the end you can read any kernel in Part II.

GPU execution model

A GPU is a hardware lattice with a specific shape. Threads, warps, blocks, SMs — and how every CUDA concept maps onto a piece of the lattice.

Memory hierarchy

HBM, L2, shared memory, registers — the pyramid of "places," with bandwidths. Why the kernel game is about moving the right bytes to the right place at the right time.

First kernel — vector add

The smallest useful CUDA program: __global__, launch syntax, host/device split, the boundary check. Walk through it once and the skeleton of every later kernel is familiar.

Coalesced memory access

A warp's 32 threads can land in one memory transaction or up to 32 — depending on stride. The single biggest beginner trap.
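
A rough way to see the one-versus-32 spread is to count how many aligned memory segments a warp's addresses touch. The sketch below assumes 4-byte elements and 128-byte segments; both numbers and the function name are illustrative assumptions, and real hardware counts smaller sectors with caches on top.

```python
def warp_transactions(stride, elem_bytes=4, segment_bytes=128, warp_size=32):
    """Count distinct aligned segments touched when lane i reads element i * stride.

    Simplified model: one transaction per distinct segment, base address aligned.
    """
    segments = {
        (lane * stride * elem_bytes) // segment_bytes
        for lane in range(warp_size)
    }
    return len(segments)


for stride in (1, 2, 8, 32):
    print(f"stride {stride:2d}: {warp_transactions(stride)} transaction(s)")
```

Under this model, stride 1 packs the whole warp into one segment, and any stride of 32 elements or more gives every lane its own.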

Shared memory & tiled matmul

Tile A and B into SMEM, reuse each staged byte about T times for a T×T tile, cut HBM traffic by roughly T×, and move toward compute-bound. The same idea is the reason FlashAttention works.
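
The traffic saving can be checked with a host-side model. The sketch below is plain Python, not a kernel: it multiplies square matrices tile by tile and counts the elements it stages, standing in for HBM-to-SMEM loads. The tile width T, the function name, and the load counter are all assumptions made for illustration.

```python
def tiled_matmul(A, B, T):
    """Multiply square matrices A and B in T x T tiles, counting staged elements.

    `loads` models elements copied from HBM into shared memory.
    T must divide the matrix size in this sketch.
    """
    n = len(A)
    assert n % T == 0
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                # Stage one tile of A and one tile of B ("shared memory").
                a = [row[k0:k0 + T] for row in A[i0:i0 + T]]
                b = [row[j0:j0 + T] for row in B[k0:k0 + T]]
                loads += 2 * T * T
                # Each staged element is reused T times inside the tile.
                for i in range(T):
                    for j in range(T):
                        C[i0 + i][j0 + j] += sum(a[i][k] * b[k][j] for k in range(T))
    return C, loads
```

With T = 1 the count equals the naive 2·n³ loads; with tile width T it drops to 2·n³/T, which is the reuse factor the lesson builds on.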

Warps, divergence, sync

The 32 threads of a warp issue the same instruction together. When they take different branches, the warp runs each path in turn with the other lanes masked off. Keep branches warp-aligned or use predication.

Reductions

Warp shuffles, shared memory, atomics — the parallel-tree template behind softmax, normalization, and every "produce one number from many."
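
The tree shape of a warp-shuffle reduction can be simulated on the host. The sketch below mimics a shuffle-down pattern across 32 lanes in plain Python; it is an illustration of the data flow under that assumption, not device code, and the function name is made up for this example.

```python
def warp_reduce_sum(vals, warp_size=32):
    """Simulate a shuffle-down tree reduction across one warp.

    Each step, lane i adds the value held by lane i + offset. Lanes whose
    partner is out of range read their own value, so only lane 0 holds a
    meaningful result at the end.
    """
    lanes = list(vals)
    assert len(lanes) == warp_size
    offset = warp_size // 2
    while offset > 0:
        lanes = [
            lanes[i] + (lanes[i + offset] if i + offset < warp_size else lanes[i])
            for i in range(warp_size)
        ]
        offset //= 2
    return lanes[0]


print(warp_reduce_sum(range(32)))
```

Five steps (offsets 16, 8, 4, 2, 1) reduce 32 values, which is the log2 depth that makes the pattern attractive compared with a serial loop.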

Occupancy

More warps in flight → better latency hiding, but each costs registers and SMEM. Maximum occupancy isn't always fastest.
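
The budget view can be written down as a small calculator. The sketch below takes the minimum over the register, shared-memory, and warp-slot limits; the default per-SM limits are assumed Hopper-class values, and the model ignores allocation granularity, so treat the output as an estimate rather than what a vendor tool would report.

```python
def resident_warps(threads_per_block, regs_per_thread, smem_per_block,
                   regs_per_sm=65536, smem_per_sm=228 * 1024,
                   max_warps_per_sm=64, max_blocks_per_sm=32, warp_size=32):
    """Estimate resident warps per SM and the occupancy fraction.

    All per-SM limits are assumptions passed as defaults; override them
    for the GPU you actually target.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    by_regs = regs_per_sm // regs_per_block
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    by_warps = max_warps_per_sm // warps_per_block
    blocks = min(by_regs, by_smem, by_warps, max_blocks_per_sm)
    warps = blocks * warps_per_block
    return warps, warps / max_warps_per_sm


print(resident_warps(256, 64, 48 * 1024))   # registers and SMEM both cap at 4 blocks
print(resident_warps(256, 128, 48 * 1024))  # doubling registers halves residency
```

Whichever limit yields the fewest blocks is the binding resource, and it tells you which one to trim first.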

Tensor cores

One warp-level instruction multiplies small matrix tiles, for example 16×16×16 fragments in the WMMA API. Every meaningful matmul on a modern GPU lands here.

Part II · LLM serving on top of Part I

Where the primitives become FlashAttention, paged KV, radix prefix reuse, continuous batching, CUDA graphs, and the framework that turns HTTP requests into batches kernels can execute.

GPU as a bandwidth machine

Roofline synthesis of Part I, with H100 numbers. The accountant you run before asking for a custom kernel.
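
The accountant fits in a few lines. The sketch below encodes time = max(bytes/BW, FLOPs/peak) + launch; the H100-like constants (about 3.35 TB/s of HBM bandwidth and about 989 TFLOP/s dense BF16) are approximate assumptions for illustration, and the function names are made up here.

```python
H100_BW = 3.35e12    # bytes/s, assumed approximate HBM bandwidth
H100_PEAK = 9.89e14  # FLOP/s, assumed approximate dense BF16 peak


def kernel_time_s(bytes_moved, flops, bandwidth=H100_BW, peak=H100_PEAK, launch_s=0.0):
    """Roofline estimate: the slower of the memory and compute terms, plus launch."""
    return max(bytes_moved / bandwidth, flops / peak) + launch_s


def bound(bytes_moved, flops, bandwidth=H100_BW, peak=H100_PEAK):
    """Name the term that dominates the estimate."""
    return "bandwidth" if bytes_moved / bandwidth >= flops / peak else "compute"


# Vector add on n fp32 elements: 12 bytes moved and 1 FLOP per element.
n = 1_000_000
print(bound(12 * n, n), kernel_time_s(12 * n, n))
```

A kernel that lands on the bandwidth side gains nothing from faster math; the only fix is moving fewer bytes.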

Transformer forward as a kernel chain

A forward pass is a sequence of GEMMs and an attention. Compute KV cache size from first principles and see exactly where bytes go each token.
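
The KV cache arithmetic is short enough to script. The sketch below reads the size as 2 (K and V) times layers, KV heads, head dimension, and bytes per element; the example configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical model chosen for round numbers.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, tokens):
    """Total KV bytes for a given number of cached tokens."""
    return kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem) * tokens


per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token / 1024, "KiB per token")
print(kv_cache_bytes(32, 8, 128, 2, tokens=4096) / 2**20, "MiB at 4096 tokens")
```

Every decode step has to read that whole cache once, which is why context length shows up directly in per-token bytes.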

Attention asymmetry & FlashAttention

Prefill compute-bound, decode bandwidth-bound, same math. FlashAttention's online softmax in seven lines, and what it does not fix.
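
The online-softmax idea can be shown on scalars before it is tiled. The sketch below keeps a running max, a running denominator, and a running weighted sum, rescaling the old state whenever the max grows. It is a single-query illustration in plain Python with scalar values, not the blocked kernel, and the names are chosen for this example.

```python
import math


def online_softmax_weighted_sum(scores, values):
    """One-pass softmax(scores) . values without storing the score vector."""
    m = float("-inf")  # running max
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescale old state; 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l
```

Because the state is three numbers per query, scores can be consumed tile by tile and never written back to HBM. The sketch says nothing about decode, which stays limited by reading the KV cache.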

Paged KV: virtual memory for caches

Why contiguous allocation wastes HBM, how block tables fix it, what indirection costs the kernel.
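
A block table is small enough to model directly. The sketch below is a toy allocator: each request owns a list of physical block ids, and a logical token position resolves to a (block, offset) pair through that list. The class, its method names, and the block size are hypothetical; they do not reproduce any engine's actual API.

```python
class BlockAllocator:
    """Toy paged-KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens written so far

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # no block yet, or the last one is full
            table.append(self.free.pop())
        self.lengths[req] = n + 1

    def locate(self, req, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.tables[req][pos // self.block_size], pos % self.block_size

    def release(self, req):
        """Return all of a request's blocks to the free list."""
        self.free.extend(self.tables.pop(req))
        del self.lengths[req]
```

Waste is bounded by the unfilled tail of the last block per request, and the price is the extra lookup in locate on every KV read.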

Prefix reuse: hash caches and radix trees

When prompts share prefixes, share the KV. Hash caches, radix tries, eviction, cache-aware routing.
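
The hash-cache variant can be sketched with chained block hashes. In the sketch below, each full block of token ids is hashed together with the previous block's hash, so a hit on block k implies the whole prefix up to k matches. The block size, the use of SHA-256, and the function names are assumptions for illustration; token_ids is expected to be a list.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Chain-hash every full block; partial tail blocks are not cached."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes


def cached_prefix_blocks(token_ids, cache, block_size=4):
    """Count leading blocks whose KV is already in the cache (a set of hashes)."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits
```

A radix tree stores the same prefix structure explicitly, which makes eviction by subtree and partial-match lookup more natural than with a flat hash set.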

Scheduling and graph capture

Continuous batching, chunked prefill, CUDA graphs — making the scheduler's traffic look regular and cheap to launch.
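
One step of a token-budget scheduler can be sketched as a pure function. The sketch below gives every decoding request one token first, then spends what remains of the budget on prefill chunks in queue order. The decode-first policy, the function name, and the data shapes are assumptions; real engines apply more rules (preemption, memory checks, priorities).

```python
def plan_step(decoding, prefilling, token_budget):
    """Plan one engine step under a fixed token budget.

    decoding:   list of request ids, each needing 1 token this step
    prefilling: dict of request id -> remaining prompt tokens, in queue order
    Returns a dict of request id -> tokens scheduled this step.
    """
    plan = {}
    budget = token_budget
    for req in decoding:
        if budget == 0:
            break
        plan[req] = 1
        budget -= 1
    for req, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)  # chunked prefill: take only what fits
        plan[req] = chunk
        budget -= chunk
    return plan


print(plan_step(["d1", "d2"], {"p1": 10, "p2": 5}, token_budget=8))
```

Every step then carries the same number of tokens, which is the regularity that makes a captured CUDA graph reusable.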

Beyond attention: GEMM, quant, MoE, sampling

The rest of the decode chain. Quantization as packing-plus-kernel, MoE dispatch, GPU sampling.

The serving framework: synthesis

HTTP → tokenizer → queue → scheduler → worker → kernel → sampler → stream. Where time goes, how to debug, vLLM and SGLang as compositions.

Part III · From measurement to optimization

Now that you know what kernels exist (Parts I–II), this part is the loop: trace PyTorch → kernel, profile, read the profile, write Triton or compile, tune the PyTorch-level patterns that surround everything. The toolkit for making real code faster.

PyTorch op → kernel launch

The dispatch chain (Python → ATen → CUDA → SM). Where the 10–30 µs per op goes, why eager mode is sometimes 5× slower than a fused equivalent, and where sync points hide.
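
The dispatch tax turns into a step-time estimate with one assumption: launches are asynchronous, so the slower of the CPU and GPU sides dominates. The sketch below is a crude model; the 20 µs default is the midpoint of the 10–30 µs range, and the op counts and kernel times are invented for illustration.

```python
def step_time_s(n_ops, gpu_time_per_op_s, dispatch_s=20e-6):
    """Estimate one step: CPU dispatch and GPU execution overlap, slower side wins."""
    cpu = n_ops * dispatch_s
    gpu = n_ops * gpu_time_per_op_s
    return max(cpu, gpu)


# 1000 tiny ops: the GPU needs 5 ms in total but the CPU needs 20 ms to launch them.
eager = step_time_s(1000, 5e-6)
# Same GPU work fused into 200 larger kernels: now the GPU is the limit.
fused = step_time_s(200, 25e-6)
print(eager, fused, eager / fused)
```

When the CPU term wins, faster kernels change nothing; only fewer launches (fusion, CUDA graphs) shorten the step.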

Anatomy of a fused kernel

A working Triton RMSNorm read line by line: program model, masked loads, on-chip reduction, autotune. The template that recurs in every fused kernel.

Profiling toolkit: three views

torch.profiler asks "which line." nsys asks "where are the gaps." ncu asks "why is this kernel slow." Three tools, three questions, three overhead profiles.

Reading a profile: roofline in practice

Three numbers (DRAM%, SM%, occupancy) place a kernel on the four-quadrant grid. Three worked examples; the named stall reasons and their cures.

Writing Triton kernels

The DSL in one table; a fused GELU+linear with autotune; the design loop; when Triton wins and when libraries beat it. All the production gotchas.

torch.compile: compiler-driven fusion

Dynamo → AOTAutograd → Inductor. Graph breaks, dynamic shapes, cache, modes. When compile beats hand-Triton and how to verify the speedup.

Performant PyTorch patterns & synthesis

Seven habits that decide whether your kernels run well: async H2D, allocator, contiguous, mixed precision, autograd, dataloader, sync discipline. Plus the closing synthesis.

What each lesson adds

Lesson	New first-principles idea	New trade-off you can quantify
01	Threads run as warps inside blocks inside SMs.	SM-residency vs warps in flight.
02	Memory closer to compute is smaller and faster.	Bandwidth at each tier.
03	Launch syntax + host/device split.	Launch overhead vs work.
04	Coalescing — one transaction vs many.	Stride vs transaction count.
05	Tile and reuse — turns memory-bound into compute-bound.	Tile size vs SMEM cap.
06	Warp execution and divergence.	Branch alignment vs predication cost.
07	Parallel reduction tree.	Warp shuffle vs SMEM vs atomic.
08	Occupancy is a per-SM budget.	Registers/thread vs warps/SM.
09	Tensor cores do the matmul; everything feeds them.	Tile shape vs mma fragment shape.
10	Time = max(bytes/BW, FLOPs/peak) + launch.	Fusion saves a round trip; how many?
11	KV cache size = 2·L·H·d·b bytes per token (L layers, H KV heads, d head dimension, b bytes per element; the 2 covers K and V).	Per-step bytes vs FLOPs at a given context length.
12	Prefill and decode have different arithmetic intensities.	FlashAttention HBM savings as f(T, d).
13	KV memory is virtual memory with page tables.	Block size: tail vs table vs indirection.
14	Prefix sharing is workload structure, not a kernel feature.	Prefill compute saved at a given shared fraction.
15	Scheduler makes irregular traffic look regular.	Token budget; CUDA graph win at small batch.
16	Quantization is a layout decision.	Weight bandwidth saved vs dequant overhead.
17	End-to-end = API + queue + prefill + decode + stream.	Where 1 ms of latency improvement pays back.
18	Each PyTorch op pays a 10–30 µs CPU-side dispatch tax.	Step-level CPU vs GPU time as f(ops, fusion).
19	Every fused kernel is prologue → load → math → epilogue.	Fused HBM bytes vs unfused, with launch overhead.
20	Three tools answer three different questions.	Tool choice vs symptom; overhead vs coverage.
21	DRAM%, SM%, and occupancy place a kernel on a 2×2 grid.	Bottleneck class → first fix to try.
22	Triton is tile-level; CUDA is thread-level.	Triton vs library vs no-op decision.
23	Dynamo + AOTAutograd + Inductor fuse for you.	Predicted speedup as f(launch share, fusion potential, graph breaks).
24	Seven habits keep the GPU fed.	Where the next hour of work goes given a current profile.

Primary references

Topic	Reference
vLLM paged kernel	vLLM paged attention kernel note
vLLM backend routing	vLLM attention backend feature support
vLLM serving architecture	vLLM architecture overview
SGLang attention backends	SGLang attention backend docs
SGLang prefix reuse	SGLang RadixAttention concept doc
SGLang gateway/router	SGLang Model Gateway docs
Kernel library layer	NVIDIA FlashInfer overview
FlashAttention original	Dao et al., FlashAttention