vLLM, from first principles
A linearized tour of why vLLM is fast. Each lesson isolates one mechanism — read top to bottom, or skip to whichever bottleneck you're currently fighting.
Who this is for
You know what a transformer is and you've served some kind of LLM. The numbers — bytes per token, HBM bandwidth, batch utilization — are about to start meaning something to you. By the end you can read vLLM's scheduler source and predict its behavior on your workload.
The system you're learning
Every vLLM optimization aims at one object: the KV cache, the per-sequence state that grows by one row every decode step. Throughput is gated on how many concurrent sequences fit in HBM; latency is gated on how many bytes leave HBM per token. Each lesson below attacks one of those two axes.
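To make the two axes concrete, here is a minimal sketch of the bytes-per-token arithmetic. The model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 40 GiB cache budget, and the 100-token average sequence length are illustrative assumptions, not measurements from this series.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache added per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def preallocation_waste(reserved_slots, used_tokens):
    """Fraction of a fixed per-sequence reservation that is never written."""
    return 1 - used_tokens / reserved_slots


def max_sequences(cache_bytes, per_token_bytes, tokens_per_sequence):
    """How many sequences fit if each one holds tokens_per_sequence slots."""
    return cache_bytes // (per_token_bytes * tokens_per_sequence)


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128)
print(per_token)                        # 524288 bytes, i.e. 0.5 MiB per token

# Assumed workload: 2048 slots reserved, 100 tokens actually generated.
print(preallocation_waste(2048, 100))   # about 0.95

# Assumed 40 GiB of HBM left for the cache after weights.
budget = 40 * 2**30
print(max_sequences(budget, per_token, 2048))  # 40 with full pre-allocation
print(max_sequences(budget, per_token, 100))   # 819 if only used tokens cost memory
```

Under these assumptions the same budget holds roughly 20 times more sequences once memory is charged per token actually used rather than per reserved slot, which is the gap the paging lesson closes.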
The four questions this series answers
- Are PagedAttention and FlashAttention different? They are. Different groups, different problems, both used. Lessons 02 + 03.
- Why does everyone use vLLM? Paged KV + continuous batching + Flash kernels pushes throughput 2–24× over HF generate, with an OpenAI-compatible URL. Lessons 02 + 04 + 03.
- What's the core piece? The block manager and the scheduler that talks to it. Everything else either depends on paged KV or becomes dramatically easier because of it. Lessons 02 + 04.
- What optimizations are layered on top? Thirteen of them, ranked by throughput impact below.
Part I · Fundamentals (lessons 01–05 · the core curriculum)
01
The KV cache — where every optimization is aimed
Bytes-per-token, the one formula. Why a 2048-slot pre-allocation wastes 95% of HBM. The two consequences (throughput ↔ HBM occupancy; decode is memory-bound) that drive every later lesson.
02
PagedAttention — vLLM's central innovation
The OS virtual-memory analogy: fixed-size blocks + per-sequence block tables + refcounts + copy-on-write. Beam search and prefix sharing fall out for free. The CUDA cost: one pointer indirection.
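The bookkeeping behind that analogy can be sketched in a few lines. This is a toy model under stated assumptions, not vLLM's block manager: `BlockAllocator`, `fork`, and `writable` are hypothetical names, and the copy of block contents that a real copy-on-write would perform on the GPU is left out.

```python
class BlockAllocator:
    """Toy pool of fixed-size KV blocks with reference counts (hypothetical)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A second sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

    def writable(self, block):
        """Copy-on-write: return a block this sequence may safely write to."""
        if self.refcount[block] == 1:
            return block              # sole owner, write in place
        self.refcount[block] -= 1     # drop our reference to the shared block
        return self.allocate()        # a real system would also copy the data


def fork(allocator, block_table):
    """A child sequence shares every block of its parent, copying nothing."""
    return [allocator.share(b) for b in block_table]


alloc = BlockAllocator(num_blocks=8)
parent = [alloc.allocate(), alloc.allocate()]   # per-sequence block table
child = fork(alloc, parent)                     # e.g. a second beam

# The child appends a token into its last block, which triggers copy-on-write.
child[-1] = alloc.writable(child[-1])

print(child[0] == parent[0])    # True: the prefix block is still shared
print(child[-1] != parent[-1])  # True: the written block diverged
```

Forking costs one refcount increment per block, which is why beam search and shared prefixes add almost no memory until the sequences diverge.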
03
FlashAttention — tiled online softmax
Orthogonal to paging. Rewrite attention as a streaming op over (Q, K, V) tiles, keep (m, ℓ, O) running statistics in SRAM, never materialize the T×T matrix. 75× HBM reduction → 2–4× wall-clock.
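The running-statistics idea can be checked on the CPU. The sketch below is a pure-Python illustration for a single query vector, assuming no 1/sqrt(d) scaling and no masking. It mirrors the (m, ℓ, O) update and is in no way the CUDA kernel; the function names and the tile size are chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Plain softmax attention that materializes every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, tile=2):
    """Same result, visiting K/V one tile at a time with running (m, l, o)."""
    m = -math.inf                  # running max of the scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    o = [0.0] * len(values[0])     # running unnormalized output
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)    # rescale old statistics; 0.0 on first tile
        l *= scale
        o = [x * scale for x in o]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - m_new)
            l += p
            o = [x + p * vd for x, vd in zip(o, v)]
        m = m_new
    return [x / l for x in o]


q = [0.1, -0.3, 0.7]
keys = [[0.2, 0.1, -0.5], [1.0, 0.0, 0.3], [-0.4, 0.9, 0.2],
        [0.6, -0.7, 0.8], [0.0, 0.5, -0.1]]
values = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5], [0.25, 0.75]]

ref = attention_reference(q, keys, values)
out = attention_streaming(q, keys, values, tile=2)
print(max(abs(a - b) for a, b in zip(ref, out)) < 1e-12)  # True
```

Only one tile of scores exists at any moment, so the full score row, and for many queries the full T×T matrix, is never stored.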
04
Continuous batching — the scheduling win
Static batching wastes 50% of cycles waiting for stragglers. Iteration-level batching admits new requests every step. Why this is impossible without paged KV.
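A toy simulation shows where the wasted cycles go. It assumes one decode step costs the same regardless of how many slots are occupied and ignores prefill entirely. The output lengths are made up for illustration, so the utilization figures will not match the ones quoted in this series.

```python
from collections import deque


def static_utilization(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    busy = total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        busy += sum(batch)                   # slot-steps doing useful work
        total += batch_size * max(batch)     # slot-steps the batch occupied
    return busy / total


def continuous_utilization(lengths, batch_size):
    """Free slots are refilled from the queue at every decode step."""
    queue = deque(lengths)
    running = []                             # remaining tokens per active request
    busy = total = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())  # admit new work at this iteration
        busy += len(running)
        total += batch_size
        running = [r - 1 for r in running if r > 1]
    return busy / total


# Hypothetical output lengths, in decode steps.
lengths = [10, 100, 20, 200, 30, 50, 40, 150]

print(round(static_utilization(lengths, 4), 2))      # 0.43
print(round(continuous_utilization(lengths, 4), 2))  # 0.68
```

In this model the remaining idle time under continuous batching comes only from the queue running dry at the end; under sustained load the slots stay full.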
05
Optimization catalog — APC, quant, CUDA graphs, TP
Automatic Prefix Caching (with the runnable demo), KV quantization, tensor parallelism, CUDA graph capture, and a first look at speculative decoding (deep-dived in lesson 07).
Part II · System layer and deep dives (lessons 06–12 · stand alone after Part I)
Each of these is a self-contained deep dive on one mechanism. Pick whichever your current workload is bottlenecked on.
06
Serving architecture — HTTP to tokens at scale
The async engine loop, FastAPI plumbing, Ray workers, the TP/PP/DP knobs. Why asyncio (not threads) and why the engine runs on a dedicated thread.
07
Speculative decoding — draft, verify, accept
The asymmetry that makes it work, the rejection-sampling rule that keeps it exact, the speedup formula, and the Medusa / EAGLE / Lookahead family.
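The exactness claim can be verified without sampling. In the sketch below, `p` is the target model's next-token distribution and `q` is the draft model's. The acceptance rule used is the standard one (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p − q). The expected-tokens function assumes every drafted token is accepted independently with the same probability, a simplification that lesson 07 may refine. All names and numbers are illustrative.

```python
def speculative_output_distribution(p, q):
    """Distribution of the emitted token under accept-or-resample."""
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:
        return accepted            # p == q: nothing is ever rejected
    return [a + reject_mass * r / norm for a, r in zip(accepted, residual)]


def expected_tokens_per_step(accept_rate, num_draft):
    """Expected tokens emitted per target forward pass (i.i.d. acceptance assumed)."""
    if accept_rate == 1.0:
        return num_draft + 1
    return (1 - accept_rate ** (num_draft + 1)) / (1 - accept_rate)


p = [0.5, 0.3, 0.2]   # hypothetical target distribution
q = [0.2, 0.6, 0.2]   # hypothetical draft distribution

out = speculative_output_distribution(p, q)
print(all(abs(a - b) < 1e-12 for a, b in zip(out, p)))  # True: output matches p

print(round(expected_tokens_per_step(0.8, 4), 4))       # 3.3616
```

The draft model's quality changes only how often tokens are accepted, and so the speed; it never changes the distribution of what is emitted.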
08
Prefill / decode disaggregation
Prefill is compute-bound; decode is memory-bound. Separate them onto different hardware pools. The KV-transfer cost and when it's worth it.
09
GQA / MQA — the architecture change that shrinks the cache
Multi-head → grouped-query → multi-query. Why 8× KV reduction with no measurable quality loss. If you don't know GQA, your serving math is wrong by an order of magnitude.
10
Chunked prefill — saturate both bottlenecks
Break long prefills into chunks, pack with decode steps in one fused forward. FLOPs and HBM fire simultaneously. The max_num_batched_tokens tradeoff.
11
Preemption and swap — when the cache fills up
Recompute vs swap, how vLLM chooses between them, who gets evicted, and why your p99 latency cliff lines up with the block pool's saturation point.
12
Multi-LoRA serving — one base, N adapters
Why we never merge in the multi-tenant case. Unified adapter memory + grouped GEMM (BGMV / SGMV). 1000 adapters in one process.
Optimization ranking, by throughput impact
| Rank | Technique | Where | Mechanism |
|---|---|---|---|
| 1 | PagedAttention | 02 | 60–80% more effective KV per GB of HBM |
| 2 | Continuous batching | 04 | Decode-step utilization 45% → 90% |
| 3 | FlashAttention | 03 | HBM traffic O(T²) → O(T) per attention op |
| 4 | GQA / MQA | 09 | Model-architecture: 4–8× smaller KV |
| 5 | Chunked prefill | 10 | Both bottlenecks saturated in one step |
| 6 | Automatic Prefix Caching | 05 | Shared system prompts → ~95% prefill skip |
| 7 | Quantization (weights + KV) | 05 | fp16 → fp8: 2× bytes/step, 2× cache capacity |
| 8 | Speculative decoding | 07 | K draft tokens / 1 target step, ~2× decode |
| 9 | Preemption + swap | 11 | Stability under load, p99 protection |
| 10 | Tensor parallelism | 06 | Spread weights + activations across N GPUs |
| 11 | CUDA graph capture | 05 | Eliminate Python overhead per decode step |
| 12 | Disaggregated prefill/decode | 08 | Opposite workloads → opposite hardware |
| 13 | Multi-LoRA (S-LoRA / Punica) | 12 | Grouped GEMM, unified adapter pool |
Common misconceptions
- PagedAttention ≠ FlashAttention. Different papers, different groups. vLLM ships a kernel that does Flash-style tiling over a Paged-style block layout.
- Continuous batching is not a vLLM invention. Orca (OSDI 2022) had it first. vLLM made it practical by pairing it with paged KV.
- Decode is memory-bound; prefill is compute-bound. Same model, same kernels — opposite bottlenecks. Many design choices flow from this.
- Block size is a tradeoff, not a constant. 16 is empirically sweet on H100/A100. Smaller → less tail waste, more launch overhead.
- Speculative decoding is exact, not approximate. The rejection rule preserves the target model's sampling distribution exactly.
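The block-size tradeoff in the list above comes down to two counts per sequence: unused slots in the last block, and block-table entries the kernel has to walk. A minimal sketch with a hypothetical 1000-token sequence follows; the helper names are illustrative, and treating the entry count as a proxy for overhead is an assumption.

```python
def blocks_needed(seq_len, block_size):
    """Number of blocks (block-table entries) for a sequence: ceil division."""
    return -(-seq_len // block_size)


def tail_waste(seq_len, block_size):
    """Allocated but unused token slots in the last block."""
    return (-seq_len) % block_size


seq_len = 1000   # hypothetical sequence length
for block_size in (1, 16, 128):
    print(block_size, blocks_needed(seq_len, block_size), tail_waste(seq_len, block_size))
# 1 1000 0
# 16 63 8
# 128 8 24
```

Smaller blocks shrink the tail waste toward zero while the number of table entries grows in proportion, so the best value depends on the hardware and on the sequence-length mix.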
How to use this
- Sequential. Lessons 01–05 build the mental model. After that the order can be your own.
- Touch the widgets. Each lesson has one interactive piece — the surprise it surfaces is the lesson. Find the configuration that breaks the system; the bug is the point.
- Open the code. Every lesson links to a runnable Python file under vllm/. The lessons explain why; the code is what.
Companion code
Each lesson corresponds to one Python file under vllm/. Run any of them with uv run python vllm/0X_*.py from the repo root.