Disaggregated prefill / decode — opposite phases, opposite pools
Prefill is compute-bound. Decode is memory-bound. Mixing them on the same GPU forces every step to be the worse of the two. Split them onto separate pools and each pool saturates its own bottleneck — at the cost of one KV-cache transfer per request.
Recap — why these phases want different things
From lesson 01's roofline and lesson 11's decode latency math:
| Phase | Per-token work | Bottleneck | What it wants |
|---|---|---|---|
| Prefill | FLOPs on the full prompt all at once · large matmul | Tensor cores | Lots of compute, large batch dimensions, peak FLOPs/s |
| Decode | One forward, one new token. Read all weights, do tiny matmul | HBM bandwidth | High aggregate HBM BW, KV cache that fits, low latency |
Same model. Same hardware. Same kernels. Opposite bottlenecks. Classical co-located serving (one pool runs both phases) suffers from a few specific pathologies:
- A long prefill stalls decodes behind it. A request arrives with a 16k-token prompt. To prefill it, the GPU spends ~150 ms on a single huge forward pass. During those 150 ms, every other request in the batch waits, and its next decode token is delayed.
- Scheduling is multi-dimensional. The scheduler has to decide how much prefill to do per step, how many decodes to batch, when to admit a new request. This is what chunked prefill (vLLM lesson 10) addresses — chop long prefills into pieces and pack each piece into a decode-heavy step. Helps, but doesn't eliminate the tension.
- The two phases want different hardware. An H100 with maximal FLOPs is wasted on decode (you only use ~1% of its compute peak). A node with maximal HBM bandwidth and lots of memory is wasted on prefill (you barely touch HBM beyond the input).
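The compute-bound versus memory-bound split can be checked with back-of-envelope roofline arithmetic. The sketch below is illustrative only: the hardware constants are assumed, approximate H100 SXM figures, the function names are hypothetical, and the model counts weight reads only (KV-cache and activation traffic are ignored).

```python
# Rough roofline check. Hardware numbers are ASSUMED, approximate H100 SXM figures.
PEAK_FLOPS = 989e12      # dense bf16 FLOPs/s (assumption)
HBM_BW = 3.35e12         # HBM bytes/s (assumption)
BYTES_PER_PARAM = 2      # bf16 weights

# FLOPs per byte at which compute time equals memory time.
RIDGE = PEAK_FLOPS / HBM_BW


def arithmetic_intensity(tokens_per_step: int) -> float:
    """FLOPs per byte of weights read for a dense layer.

    Each parameter costs 2 FLOPs (multiply + add) per token and is read
    once per step, so intensity grows linearly with tokens per step.
    """
    return 2 * tokens_per_step / BYTES_PER_PARAM


def bottleneck(tokens_per_step: int) -> str:
    """Return which resource limits a step of this size."""
    return "compute" if arithmetic_intensity(tokens_per_step) >= RIDGE else "memory"


def compute_utilisation(tokens_per_step: int) -> float:
    """Upper bound on the fraction of peak FLOPs reachable (capped at 1)."""
    return min(1.0, arithmetic_intensity(tokens_per_step) / RIDGE)


print(bottleneck(2048))  # a 2k-token prefill step
print(bottleneck(1))     # a single-sequence decode step
```

Under these assumptions a step needs a few hundred tokens before it leaves the memory-bound regime, which is why a decode step with a handful of sequences uses only about a percent of peak compute.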
Disaggregated serving (DistServe, Splitwise, Mooncake, 2024) takes the obvious step: two pools.
The architecture
A request flows:
- Router picks a prefill GPU and a decode GPU (or pair) based on load and the request's expected length.
- Prefill GPU runs the full forward pass on the prompt. Result: the KV cache for that prompt's tokens.
- KV cache is transferred from the prefill GPU to the decode GPU over the network.
- Decode GPU receives the KV, begins generating tokens one at a time, streams to the user.
Each pool's scheduler is simpler (one workload class instead of two). Each pool's hardware can be specialised (different GPU SKUs, different TP shapes). Each pool's GPU utilisation rises closer to its physical limit.
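The request flow above can be sketched as a toy router. This is a minimal illustration, not how any particular serving system implements routing: the `Gpu` and `Router` classes, the least-loaded policy, and the queued-token load signal are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    queued_tokens: int = 0  # crude load signal (assumption)


@dataclass
class Router:
    prefill_pool: list[Gpu]
    decode_pool: list[Gpu]
    log: list[str] = field(default_factory=list)

    def route(self, request_id: str, prompt_tokens: int,
              expected_output_tokens: int) -> tuple[str, str]:
        # Step 1: pick the least-loaded GPU in each pool.
        p = min(self.prefill_pool, key=lambda g: g.queued_tokens)
        d = min(self.decode_pool, key=lambda g: g.queued_tokens)
        p.queued_tokens += prompt_tokens
        d.queued_tokens += expected_output_tokens
        # Steps 2-4: record the stages the request passes through.
        self.log += [
            f"{request_id}: prefill on {p.name}",
            f"{request_id}: KV transfer {p.name} -> {d.name}",
            f"{request_id}: decode on {d.name}",
        ]
        return p.name, d.name


router = Router([Gpu("p0"), Gpu("p1")], [Gpu("d0"), Gpu("d1")])
router.route("r1", prompt_tokens=2048, expected_output_tokens=200)
router.route("r2", prompt_tokens=512, expected_output_tokens=100)
print("\n".join(router.log))
```

Each pool keeps its own load signal, which mirrors the point that each scheduler only has to reason about one workload class.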
Animated · disaggregated vs co-located, same request
The same request travels through both architectures side by side. Co-located (top lane) sequentially does prefill then decode on one pool, stalling any other request behind it. Disaggregated (bottom lane) runs prefill on the small specialised pool, transfers the KV cache over IB, and lets the decode pool start while the prefill pool is already free for the next request. Play to watch a sequence of requests stream through.
The new cost — KV transfer
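Nothing is free. For a prompt of T tokens, with h_kv KV heads of dimension d_k, L layers, and b bytes per element, the KV cache (keys plus values, hence the leading 2) is:
KV bytes = 2 · h_kv · d_k · T · L · b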
For a Llama-70B model (h_kv = 8, d_k = 128, L = 80) at T = 2048 in bf16: 2 · 8 · 128 · 2048 · 80 · 2 = 671 MB. At a fleet-wide 1000 req/s (across all replicas — not a single deployment), that's 671 GB/s of KV transfer load to schedule across the rack.
At inter-node 50 GB/s per NIC, you'd need ~14 NICs of aggregate KV bandwidth — feasible but not free. The whole point of disaggregation is contingent on the KV transfer being cheap enough that the per-pool efficiency gains outweigh the transfer cost.
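The arithmetic above is easy to reproduce for other model shapes. The following is a small sketch using the same numbers as the text; the function name is hypothetical, and the 50 GB/s per-NIC figure is taken from the paragraph above.

```python
import math


def kv_cache_bytes(tokens: int, n_kv_heads: int, head_dim: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one request: keys + values across all layers."""
    return 2 * n_kv_heads * head_dim * tokens * n_layers * bytes_per_elem


# Llama-70B-style shape from the text: h_kv = 8, d_k = 128, L = 80, bf16.
per_request = kv_cache_bytes(tokens=2048, n_kv_heads=8, head_dim=128, n_layers=80)
print(per_request)  # 671088640 bytes, about 671 MB

# Fleet-wide load at 1000 req/s, and NICs needed at 50 GB/s each.
fleet_bytes_per_s = per_request * 1000
nics = math.ceil(fleet_bytes_per_s / 50e9)
print(nics)  # 14
```

Changing `bytes_per_elem` to 1 models an FP8 KV cache and halves both numbers.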
Optimisations:
- Layer-wise streaming. Don't wait for the entire prefill to finish before starting the transfer. As soon as prefill finishes layer l, that layer's KV starts transferring, so the transfer overlaps with the rest of prefill. When prefill completes, only the last layers' KV is still in flight, and because decode also consumes KV layer by layer, its first step can start before that tail has fully arrived.
- Co-located racks. Put prefill and decode GPUs on the same NVLink island when possible — then KV transfer runs at ~900 GB/s instead of ~50 GB/s.
- KV compression. An FP8 or even INT4 KV cache halves or quarters the transfer cost relative to bf16. The same approach also reduces decode-time HBM reads.
- Larger KV blocks. Per-message overhead matters. Sending one 670 MB chunk is faster than 16 × 42 MB chunks at the same total bytes, because of fewer NCCL launches.
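To get a feel for how much layer-wise streaming hides, here is a simplified timing model. It assumes every layer takes the same prefill time, the link carries one layer's KV at a time, and per-message overhead is zero (so it does not capture the block-size effect in the last bullet). The function name and parameters are hypothetical.

```python
def kv_ready_time(prefill_s: float, kv_bytes: float, n_layers: int,
                  link_bytes_per_s: float, layerwise: bool = True) -> float:
    """Seconds from prefill start until the full KV cache is on the decode side."""
    if not layerwise:
        # Bulk transfer: finish prefill first, then send everything.
        return prefill_s + kv_bytes / link_bytes_per_s

    per_layer_compute = prefill_s / n_layers
    per_layer_xfer = kv_bytes / n_layers / link_bytes_per_s
    link_free = 0.0  # time at which the link finishes its current layer
    for layer in range(n_layers):
        computed_at = (layer + 1) * per_layer_compute
        # A layer's KV ships once it is computed AND the link is idle.
        start = max(computed_at, link_free)
        link_free = start + per_layer_xfer
    return link_free


kv = 671_088_640  # bytes, the 2k-token example from the text
bulk = kv_ready_time(0.150, kv, 80, 50e9, layerwise=False)
stream = kv_ready_time(0.150, kv, 80, 50e9)
print(f"bulk: {bulk * 1e3:.2f} ms, layer-wise: {stream * 1e3:.2f} ms")
```

In this model, when the link is faster than prefill produces KV, only the last layer's transfer remains exposed; when the link is slower, streaming still saves almost the whole prefill time.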
3D · KV bytes flowing over IB, prefill rack → decode rack
Isometric view of two racks. Each rack holds n GPUs (stacked cubes). When a prefill finishes a layer, the bytes pop out and stream across the IB link to the decode rack. Slide the bandwidth knob: at IB 50 GB/s the stream is sparse; at NVLink-island 900 GB/s the link saturates instantly.
Why pool sizes don't match
One prefill GPU can feed several decode GPUs. The reason: a prefill GPU produces one prompt's worth of KV every ~100-150 ms (for typical prompts), while a decode GPU then spends seconds to minutes generating tokens against that KV. Numerical example (very rough):
| Quantity | Value |
|---|---|
| Prefill latency for 2k prompt on H100 | ~150 ms |
| Decode latency per token (TP=8) | ~6 ms |
| Avg generated length per request | 200 tokens (= 1200 ms) |
| Decodes one prefill "fills" | ~1200 / 150 ≈ 8 decode batches |
| Pool ratio (prefill : decode), GPUs | roughly 1 : 4 to 1 : 8 |
This ratio depends on the workload mix (avg prompt vs avg completion length) and is a tunable in production. Many providers auto-scale the two pools independently based on observed queue length.
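The table's arithmetic can be parameterised by workload mix. The helper below is a rough sketch with a hypothetical name; it only reproduces the "decode time per prefill time" figure from the table and ignores batching, queueing, and transfer time, so treat the result as the upper end of the pool ratio rather than a sizing rule.

```python
def decode_time_per_prefill(prefill_ms: float, decode_ms_per_token: float,
                            avg_output_tokens: int) -> float:
    """How many prefill-durations one request's decode phase lasts."""
    decode_ms = decode_ms_per_token * avg_output_tokens
    return decode_ms / prefill_ms


# Numbers from the table: 150 ms prefill, 6 ms/token, 200 output tokens.
print(decode_time_per_prefill(150, 6, 200))   # 8.0

# A summarisation-style mix (long prompt, short answer) shifts the balance.
print(decode_time_per_prefill(600, 6, 50))    # 0.5
```

Short completions push the ratio toward more prefill capacity, long completions toward more decode capacity, which is why the two pools are scaled independently.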
2D · utilisation heatmap, co-located vs disaggregated
Each row is one GPU, each column a 10 ms time slice. The cell colour is utilisation: darker = idle, bright = saturated on its bottleneck (FLOPs for prefill, HBM bandwidth for decode). The co-located heatmap shows the classic stripey pattern — prefills crowd out decodes, decode-only steps underuse FLOPs. The disaggregated heatmap shows two flat, mostly-bright bands.
When NOT to disaggregate
- Very short prompts. If prompts are 50 tokens, prefill is ~10 ms (at our 5000 tok/s/GPU rate) — the KV-transfer cost is on the same order as the saving. Co-located prefill+decode is fine.
- Single-machine deployments. The complexity is not worth it for one node.
- Latency-extreme workloads. Adding a network hop adds tens of ms; a chat workload with a sub-200 ms TTFT budget may not tolerate it.
- When chunked prefill suffices. If the in-pool scheduling problem is mild, the simpler vLLM/SGLang-style chunked prefill might be enough.
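The short-prompt case can be made concrete by comparing the transfer overhead with the prefill time it is meant to offload. This is a hedged sketch: the 5000 tok/s prefill rate and the 50 GB/s link come from the text, the 10 ms hop latency is an assumed stand-in for "tens of ms", and the function names are hypothetical.

```python
def kv_cache_bytes(tokens: int, n_kv_heads: int = 8, head_dim: int = 128,
                   n_layers: int = 80, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one request (Llama-70B-style defaults from the text)."""
    return 2 * n_kv_heads * head_dim * tokens * n_layers * bytes_per_elem


def transfer_to_prefill_ratio(prompt_tokens: int,
                              prefill_tok_per_s: float = 5000.0,
                              link_bytes_per_s: float = 50e9,
                              hop_latency_s: float = 0.010) -> float:
    """Transfer overhead divided by prefill time; near or above 1 is a bad sign."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    transfer_s = hop_latency_s + kv_cache_bytes(prompt_tokens) / link_bytes_per_s
    return transfer_s / prefill_s


for tokens in (50, 512, 2048):
    print(tokens, round(transfer_to_prefill_ratio(tokens), 3))
```

With these assumptions a 50-token prompt pays about as much for the hop as it spends on prefill, while a 2k-token prompt pays only a few percent.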
Disaggregation as a 3D parallelism analogue
Notice the parallel: training disaggregates the work by parallelism axis (TP, PP, DP), with the rule "fastest fabric for chattiest comm". Inference disaggregates by phase, with the analogous rule "match the phase to the hardware best at its bottleneck". The two share a deep idea: when one workload class doesn't suit one hardware shape, find the heterogeneity and let it be load-bearing.
Interactive · is disaggregation worth it for your workload?
Adjust avg prompt length, avg completion length, and KV bandwidth. The widget compares co-located scheduling (with chunked prefill) against disaggregated serving for the same number of GPUs. Outputs: effective tokens/sec, average TTFT, p99 TTFT.
The pieces in vllm/lessons/ (KV cache, PagedAttention, continuous batching, prefill chunking, speculative decoding, GQA/MQA) are all decode-pool-internal optimisations that compose with everything in this series. The pieces in RL/lessons/ (rollout, reference, algorithm, trainer, weight-sync) compose with training-side parallelism to build a real RL post-training stack. All three lesson series are the same system, sliced different ways.