vLLM, from first principles

A linearized tour of why vLLM is fast. Each lesson isolates one mechanism — read top to bottom, or skip to whichever bottleneck you're currently fighting.

Who this is for

You know what a transformer is and you've served some kind of LLM. The numbers (bytes per token, HBM bandwidth, batch utilization) are about to start meaning something to you. By the end you should be able to read vLLM's scheduler source and predict its behavior on your workload.

The system you're learning

Every vLLM optimization aims at one object: the KV cache, the per-sequence state that grows by one row every decode step. Throughput is gated on how many concurrent sequences fit in HBM; latency is gated on how many bytes leave HBM per token. The three core mechanisms covered below (paged KV, continuous batching, Flash-style kernels) each attack one of those two axes.

The four questions this series answers

Are PagedAttention and FlashAttention different? They are: different groups, different problems, and vLLM uses both. Lessons 02 + 03.
Why does everyone use vLLM? Paged KV + continuous batching + Flash kernels pushes throughput 2–24× over HF generate, with an OpenAI-compatible URL. Lessons 02 + 04 + 03.
What's the core piece? The block manager and the scheduler that talks to it. Everything else either depends on paged KV or becomes dramatically easier because of it. Lessons 02 + 04.
What optimizations are layered on top? Thirteen of them, ranked by throughput impact below.

Part I · Fundamentals (lessons 01–05 · the core curriculum)

The KV cache — where every optimization is aimed

Bytes-per-token, the one formula. Why a 2048-slot pre-allocation wastes 95% of HBM. The two consequences (throughput ↔ HBM occupancy; decode is memory-bound) that drive every later lesson.
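The bytes-per-token formula can be sketched in a few lines. The model shape below (32 layers, 32 KV heads, head dimension 128, fp16) is an assumed 7B-class configuration chosen for illustration, and the 100-token request is likewise hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2: one K vector and one V vector are stored per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def preallocation_waste(reserved_slots, used_tokens):
    # Fraction of a contiguous reservation that never holds a token.
    return 1 - used_tokens / reserved_slots


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128)

# A request that reserves 2048 slots but only ever uses 100 of them.
waste = preallocation_waste(2048, 100)

print(per_token, round(waste, 3))
```

Under these assumptions each token costs 512 KiB of KV cache, and a 2048-slot reservation that serves a 100-token request leaves about 95% of its memory idle.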

PagedAttention — vLLM's central innovation

The OS virtual-memory analogy: fixed-size blocks + per-sequence block tables + refcounts + copy-on-write. Beam search and prefix sharing fall out for free. The CUDA cost: one pointer indirection.
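A toy version of the block table, refcount, and copy-on-write bookkeeping might look like the following. The class names `BlockPool` and `Sequence` are hypothetical, not vLLM's own, and the sketch tracks only block ids; a real system would also copy the block's contents on a copy-on-write.

```python
class BlockPool:
    """Fixed pool of physical block ids with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Logical sequence: a block table mapping logical to physical blocks."""

    def __init__(self, pool, block_size=16):
        self.pool = pool
        self.block_size = block_size
        self.block_table = []
        self.num_tokens = 0

    def fork(self):
        # Share every physical block with the child (beam search, prefix sharing).
        child = Sequence(self.pool, self.block_size)
        child.block_table = [self.pool.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or there is none yet): take a fresh block.
            self.block_table.append(self.pool.alloc())
        elif self.pool.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared,
            # so this sequence gets a private copy before writing to it.
            self.pool.release(self.block_table[-1])
            self.block_table[-1] = self.pool.alloc()
        self.num_tokens += 1


pool = BlockPool(8)
parent = Sequence(pool, block_size=4)
for _ in range(6):
    parent.append_token()
child = parent.fork()
child.append_token()
```

After the fork, the full first block stays shared while the partially filled second block is duplicated only when the child writes to it.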

FlashAttention — tiled online softmax

Orthogonal to paging. Rewrite attention as a streaming op over (Q, K, V) tiles, keep (m, ℓ, O) running statistics in SRAM, never materialize the T×T matrix. 75× HBM reduction → 2–4× wall-clock.
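The running-statistics idea can be checked on a single query vector in plain Python. This is a sketch of the online-softmax recurrence only: it omits the 1/sqrt(d) scaling, masking, and everything GPU-specific, and the function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    # Materializes every score before normalizing.
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    # Running statistics: m = max score so far, l = sum of exp(score - m),
    # o = unnormalized weighted sum of values.
    m, l = -math.inf, 0.0
    o = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old statistics to the new max
        l *= scale
        o = [x * scale for x in o]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - m_new)
            l += w
            o = [x + w * vd for x, vd in zip(o, v)]
        m = m_new
    return [x / l for x in o]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
streamed = tiled_attention(q, keys, values, tile=2)
```

Both functions produce the same output up to floating-point error, but the tiled version only ever holds one tile of scores at a time.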

Continuous batching — the scheduling win

Static batching wastes roughly half of its cycles waiting for stragglers. Iteration-level batching admits new requests every step. Why this is impractical without paged KV.
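The straggler effect is easy to see in a toy simulation that counts decode steps. The request lengths and batch size below are made up for illustration, and the model ignores prefill and memory limits entirely.

```python
def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    # A finished request frees its slot for a waiting one on the next step.
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; drop those that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 8]  # tokens each request still has to generate
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
total_tokens = sum(lengths)
static_util = total_tokens / (2 * static_steps)
continuous_util = total_tokens / (2 * continuous_steps)
```

In this example static batching needs 16 steps and continuous batching needs 12, so slot utilization rises from 62.5% to about 83% on the same work.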

Optimization catalog — APC, quant, CUDA graphs, TP

Automatic Prefix Caching (with the runnable demo), KV quantization, tensor parallelism, CUDA graph capture, and a first look at speculative decoding (deep-dived in lesson 07).
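One way to picture prefix caching is as a chain of block hashes, where each full block's hash also covers everything before it. The sketch below assumes that scheme; the helper names and the use of SHA-256 over `repr` are illustrative choices, not vLLM's actual hashing code.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    # Hash each full block together with its parent's hash, so a block's
    # hash identifies the entire prefix up to and including that block.
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(token_ids, cache, block_size=16):
    # Count leading tokens whose blocks are already in the cache.
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


system_prompt = list(range(40))           # hypothetical shared system prompt
first = system_prompt + [100, 101]
second = system_prompt + [200]

cache = set(block_hashes(first))          # populated by the first request
reused = cached_prefix_tokens(second, cache)
```

Only whole blocks can be reused, so with a block size of 16 the second request skips prefill for 32 of its 40 shared tokens.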

Part II · System layer and deep dives (lessons 06–12 · stand alone after Part I)

Each of these is a self-contained deep dive on one mechanism. Pick whichever your current workload is bottlenecked on.

Serving architecture — HTTP to tokens at scale

The async engine loop, FastAPI plumbing, Ray workers, the TP/PP/DP knobs. Why request handling uses asyncio (not threads) and why the engine itself still runs on a dedicated thread.

Speculative decoding — draft, verify, accept

The asymmetry that makes it work, the rejection-sampling rule that keeps it exact, the speedup formula, and the Medusa / EAGLE / Lookahead family.
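The rejection-sampling rule and the speedup formula can both be written down directly. The sketch below uses the standard formulation: accept a draft token with probability min(1, p/q), resample from the normalized positive part of p − q on rejection, and expect (1 − α^(K+1)) / (1 − α) tokens per target step when each draft token is accepted independently with rate α. The function names and example distributions are illustrative.

```python
def accept_probability(p_target, q_draft):
    # Probability of keeping a token the draft model proposed.
    return min(1.0, p_target / q_draft)


def residual_distribution(p, q):
    # Distribution to resample from after a rejection: normalize max(0, p - q).
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:  # p == q: rejection never happens, any fallback works
        return list(p)
    return [r / total for r in raw]


def expected_tokens_per_target_step(alpha, k):
    # Assumes each of the k draft tokens is accepted independently with rate alpha.
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


p = [0.5, 0.3, 0.2]  # target model distribution (made-up numbers)
q = [0.2, 0.5, 0.3]  # draft model distribution (made-up numbers)

accepted = [qi * accept_probability(pi, qi) for pi, qi in zip(p, q)]
reject_mass = 1 - sum(accepted)
residual = residual_distribution(p, q)
# Overall probability of emitting each token: accepted or resampled.
emitted = [a + reject_mass * r for a, r in zip(accepted, residual)]
```

The `emitted` distribution equals `p`, which is the sense in which the scheme is exact: the draft model changes the speed, not the output distribution.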

Prefill / decode disaggregation

Prefill is compute-bound; decode is memory-bound. Separate them onto different hardware pools. The KV-transfer cost and when it's worth it.
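A back-of-the-envelope estimate of the KV-transfer cost is just bytes moved divided by link bandwidth. The numbers below (512 KiB of KV per token, a 4096-token prompt, a 50 GB/s link) are assumptions for illustration, and the estimate ignores latency, overlap, and protocol overhead.

```python
def kv_transfer_seconds(prompt_tokens, kv_bytes_per_token, link_bytes_per_second):
    # Time to ship one request's prefilled KV cache to the decode pool.
    return prompt_tokens * kv_bytes_per_token / link_bytes_per_second


# Assumed: 4096-token prompt, 512 KiB of KV per token, 50 GB/s interconnect.
seconds = kv_transfer_seconds(4096, 512 * 1024, 50e9)
print(f"{seconds * 1000:.1f} ms")
```

Whether roughly 43 ms per request is acceptable depends on how it compares with the prefill time saved and the decode latency target, which is the tradeoff the lesson examines.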

GQA / MQA — the architecture change that shrinks the cache

Multi-head → grouped-query → multi-query. Why 8× KV reduction with no measurable quality loss. If you don't know GQA, your serving math is wrong by an order of magnitude.
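The size of the reduction follows from the ratio of query heads to KV heads. The sketch below uses a Llama-2-70B-like shape (80 layers, 64 query heads, 8 KV heads, head dimension 128, fp16) as an assumed example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Only KV heads are cached; query heads do not contribute.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


layers, query_heads, head_dim = 80, 64, 128

mha = kv_bytes_per_token(layers, num_kv_heads=query_heads, head_dim=head_dim)
gqa = kv_bytes_per_token(layers, num_kv_heads=8, head_dim=head_dim)
mqa = kv_bytes_per_token(layers, num_kv_heads=1, head_dim=head_dim)

print(mha, gqa, mqa, mha // gqa)
```

With 64 query heads sharing 8 KV heads, the cache drops from 2.5 MiB to 320 KiB per token, an 8× reduction; plugging the query-head count into a serving estimate instead of the KV-head count overstates memory by that same factor.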

Chunked prefill — saturate both bottlenecks

Break long prefills into chunks, pack with decode steps in one fused forward. FLOPs and HBM fire simultaneously. The max_num_batched_tokens tradeoff.
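The packing step can be sketched as a token-budget loop. This version assumes decodes are scheduled first at one token each and that leftover budget goes to prefill chunks; `schedule_step` and its arguments are hypothetical names, and only `max_num_batched_tokens` comes from the text above.

```python
def schedule_step(decoding_seqs, prefill_remaining, max_num_batched_tokens):
    """Return a list of (seq_id, num_tokens) for one fused forward pass."""
    budget = max_num_batched_tokens
    plan = []
    # Decodes first: each running sequence needs exactly one token slot.
    for seq_id in decoding_seqs:
        if budget == 0:
            break
        plan.append((seq_id, 1))
        budget -= 1
    # Spend what is left on prefill, splitting prompts that do not fit.
    for seq_id, remaining in prefill_remaining.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)
        if chunk > 0:
            plan.append((seq_id, chunk))
            budget -= chunk
    return plan


plan = schedule_step(["a", "b"], {"p": 1000}, max_num_batched_tokens=512)
```

Here the 1000-token prompt contributes a 510-token chunk alongside the two decode tokens. A larger budget finishes prefills in fewer steps but makes each step, and therefore each decode token, slower.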

Preemption and swap — when the cache fills up

Recompute vs swap, how vLLM chooses between them, who gets evicted, and why your p99 latency cliff lines up with the block pool's saturation point.

Multi-LoRA serving — one base, N adapters

Why we never merge in the multi-tenant case. Unified adapter memory + grouped GEMM (BGMV / SGMV). 1000 adapters in one process.
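The unmerged computation is y = xW + (xA)B, with W shared by every request and the low-rank pair (A, B) selected per request. The plain-Python sketch below shows only that arithmetic; real kernels such as BGMV / SGMV batch the adapter part on the GPU, and the adapter names and tiny matrices here are made up.

```python
def matvec(x, w):
    # x: length-n list, w: n x m nested list, result: length-m list.
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def lora_batch_forward(xs, adapter_ids, base_w, adapters):
    outputs = []
    for x, adapter_id in zip(xs, adapter_ids):
        y = matvec(x, base_w)            # shared base weights, never modified
        a, b = adapters[adapter_id]      # per-request low-rank pair
        delta = matvec(matvec(x, a), b)  # (x A) B, rank-r correction
        outputs.append([yi + di for yi, di in zip(y, delta)])
    return outputs


base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "tenant1": ([[1.0], [0.0]], [[0.0, 2.0]]),  # rank-1 A (2x1) and B (1x2)
    "tenant2": ([[0.0], [1.0]], [[1.0, 0.0]]),
}
outputs = lora_batch_forward(
    [[3.0, 4.0], [3.0, 4.0]], ["tenant1", "tenant2"], base_w, adapters
)
```

The same input produces different outputs for the two tenants while the base weights stay untouched, which is why one copy of the base model can serve many adapters in a single batch.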

Optimization ranking, by throughput impact

Rank	Technique	Lesson	Mechanism
1	PagedAttention	02	60–80% more effective KV per GB of HBM
2	Continuous batching	04	Decode-step utilization 45% → 90%
3	FlashAttention	03	HBM traffic O(T²) → O(T) per attention op
4	GQA / MQA	09	Model-architecture: 4–8× smaller KV
5	Chunked prefill	10	Both bottlenecks saturated in one step
6	Automatic Prefix Caching	05	Shared system prompts → ~95% prefill skip
7	Quantization (weights + KV)	05	fp16 → fp8: half the bytes per step, 2× cache capacity
8	Speculative decoding	07	K draft tokens / 1 target step, ~2× decode
9	Preemption + swap	11	Stability under load, p99 protection
10	Tensor parallelism	06	Spread weights + activations across N GPUs
11	CUDA graph capture	05	Eliminate Python overhead per decode step
12	Disaggregated prefill/decode	08	Opposite workloads → opposite hardware
13	Multi-LoRA (S-LoRA / Punica)	12	Grouped GEMM, unified adapter pool

Common misconceptions

PagedAttention ≠ FlashAttention. Different papers, different groups. vLLM ships a kernel that does Flash-style tiling over a Paged-style block layout.
Continuous batching is not a vLLM invention. Orca (OSDI 2022) had it first. vLLM made it practical by pairing it with paged KV.
Decode is memory-bound; prefill is compute-bound. Same model, same kernels — opposite bottlenecks. Many design choices flow from this.
Block size is a tradeoff, not a constant. 16 is empirically sweet on H100/A100. Smaller → less tail waste, more launch overhead.
Speculative decoding is exact, not approximate. The rejection rule preserves the target model's sampling distribution exactly.

How to use this

Sequential. Lessons 01–05 build the mental model. After that the order can be your own.
Touch the widgets. Each lesson has one interactive piece — the surprise it surfaces is the lesson. Find the configuration that breaks the system; the bug is the point.
Open the code. Every lesson links to a runnable Python file under vllm/. The lessons explain the why; the code shows the what.

Companion code

Each lesson corresponds to one Python file under vllm/. Run any of them with uv run python vllm/0X_*.py from the repo root.