Disaggregated prefill / decode — opposite phases, opposite pools

Prefill is compute-bound. Decode is memory-bound. Mixing them on the same GPU forces every step to be the worse of the two. Split them onto separate pools and each pool saturates its own bottleneck — at the cost of one KV-cache transfer per request.

Recap — why these phases want different things

From lesson 01's roofline and lesson 11's decode latency math:

Phase	Per-token work	Bottleneck	What it wants
Prefill	FLOPs on the full prompt all at once · large matmul	Tensor cores	Lots of compute, large batch dimensions, peak FLOPs/s
Decode	One forward, one new token. Read all weights, do tiny matmul	HBM bandwidth	High aggregate HBM BW, KV cache that fits, low latency

Same model. Same hardware. Same kernels. Opposite bottlenecks. Classical co-located serving (one pool runs both phases) suffers from a few specific pathologies:

A long prefill stalls decodes behind it. A request arrives with a 16k-token prompt. To prefill that, the GPU spends well over 150 ms on a single huge forward pass. During that time, every other request in the batch waits and its decode tokens are delayed.
Scheduling is multi-dimensional. The scheduler has to decide how much prefill to do per step, how many decodes to batch, when to admit a new request. This is what chunked prefill (vLLM lesson 10) addresses — chop long prefills into pieces and pack each piece into a decode-heavy step. Helps, but doesn't eliminate the tension.
The two phases want different hardware. An H100 with maximal FLOPs is wasted on decode (you only use ~1% of its compute peak). A node with maximal HBM bandwidth and lots of memory is wasted on prefill (you barely touch HBM beyond the input).
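The ~1% figure in the last item can be sanity-checked with a back-of-envelope roofline estimate. The sketch below is an illustration, not a measurement: the function name is hypothetical, the peak numbers are approximate H100 SXM datasheet values (about 989 TFLOP/s dense bf16, about 3.35 TB/s HBM), and it counts only weight reads, ignoring KV-cache traffic and kernel overheads.

```python
def decode_compute_utilisation(batch, peak_flops=989e12, hbm_bytes_per_s=3.35e12,
                               weight_bytes=2):
    """Upper bound on the fraction of peak FLOPs a weight-streaming decode step can use.

    Assumption: every weight is read once per step (weight_bytes each) and does
    one multiply-add (2 FLOPs) for each sequence in the batch.
    """
    intensity = 2 * batch / weight_bytes      # FLOPs per byte read from HBM
    ridge = peak_flops / hbm_bytes_per_s      # intensity needed to become compute-bound
    return min(1.0, intensity / ridge)


for batch in (1, 4, 64, 512):
    print(f"batch {batch:>3}: {decode_compute_utilisation(batch):.1%} of peak FLOPs")
```

Under these assumptions a decode step only approaches the compute roof once the batch reaches the hundreds, which is why small-batch decode leaves the tensor cores nearly idle.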

Disaggregated serving (DistServe, SplitWise, Mooncake, 2024) takes the obvious step: two pools.

The architecture

[Figure: a request enters the Prefill Pool (few GPUs · compute-bound · large batched prefills); its KV cache is then transferred to a decode GPU in the Decode Pool (many GPUs · memory-bound · continuous batched decodes). Network: NVLink / IB / RDMA, ~50–900 GB/s, governs KV transfer time.]

A request flows:

Router picks a prefill GPU and a decode GPU (or pair) based on load and the request's expected length.
Prefill GPU runs the full forward pass on the prompt. Result: the KV cache for that prompt's tokens.
KV cache is transferred from the prefill GPU to the decode GPU over the network.
Decode GPU receives the KV, begins generating tokens one at a time, streams to the user.
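The four steps above can be sketched as a toy control flow. This is a minimal illustration, not any real serving framework's API: `Worker`, `Request`, `least_loaded`, and `serve` are hypothetical names, and the router here looks only at load, whereas the text also mentions expected request length.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active: int = 0  # requests currently assigned (a stand-in for real load)


@dataclass
class Request:
    prompt_tokens: int
    trace: list = field(default_factory=list)  # ordered record of the steps taken


def least_loaded(pool):
    """Assumed router policy: pick the worker with the fewest active requests."""
    return min(pool, key=lambda w: w.active)


def serve(req, prefill_pool, decode_pool):
    # 1. Router picks one GPU from each pool.
    p, d = least_loaded(prefill_pool), least_loaded(decode_pool)
    p.active += 1
    d.active += 1
    # 2. Prefill GPU runs the full forward pass and produces the KV cache.
    req.trace.append(f"prefill:{p.name}")
    # 3. KV cache moves over the network; the prefill GPU is then free again.
    req.trace.append(f"kv_transfer:{p.name}->{d.name}")
    p.active -= 1
    # 4. Decode GPU generates tokens; in this toy it stays marked busy.
    req.trace.append(f"decode:{d.name}")
    return req.trace


prefill_pool = [Worker("P0")]
decode_pool = [Worker("D0"), Worker("D1")]
for n in (2048, 512, 4096):
    print(serve(Request(n), prefill_pool, decode_pool))
```

The point of the sketch is the asymmetry: the prefill worker's load returns to zero after every hand-off, while decode workers accumulate long-lived requests.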

Each pool's scheduler is simpler (one workload class instead of two). Each pool's hardware can be specialised (different GPU SKUs, different TP shapes). Each pool's GPU utilisation rises closer to its physical limit.

Animated · disaggregated vs co-located, same request

The same request travels through both architectures side by side. Co-located (top lane) sequentially does prefill then decode on one pool, stalling any other request behind it. Disaggregated (bottom lane) runs prefill on the small specialised pool, transfers the KV cache over IB, and lets the decode pool start while the prefill pool is already free for the next request. Play to watch a sequence of requests stream through.

The new cost — KV transfer

Nothing is free. For a prompt of T tokens, with h_kv KV heads of dimension d_k across L layers, the KV cache (keys plus values, hence the factor 2) is:

KV_bytes = 2 · h_kv · d_k · T · L · dtype_bytes

For a Llama-70B model (h_kv = 8, d_k = 128, L = 80) at T = 2048 in bf16: 2 · 8 · 128 · 2048 · 80 · 2 = 671 MB. At a fleet-wide 1000 req/s (across all replicas — not a single deployment), that's 671 GB/s of KV transfer load to schedule across the rack.

At inter-node 50 GB/s per NIC, you'd need ~14 NICs of aggregate KV bandwidth: feasible, but not free. Disaggregation only pays off if the KV transfer is cheap enough that the per-pool efficiency gains outweigh the transfer cost.
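The arithmetic above is easy to reproduce. A minimal sketch, using the same Llama-70B shape and the same 1000 req/s and 50 GB/s figures from the text; the function and variable names are illustrative only.

```python
import math


def kv_bytes(h_kv, d_k, tokens, layers, dtype_bytes):
    """KV_bytes = 2 * h_kv * d_k * T * L * dtype_bytes (the 2 covers keys and values)."""
    return 2 * h_kv * d_k * tokens * layers * dtype_bytes


per_request = kv_bytes(h_kv=8, d_k=128, tokens=2048, layers=80, dtype_bytes=2)
fleet_load = per_request * 1000        # bytes/s at a fleet-wide 1000 req/s
nics = math.ceil(fleet_load / 50e9)    # NICs needed at 50 GB/s each

print(f"{per_request / 1e6:.0f} MB per request")
print(f"{fleet_load / 1e9:.0f} GB/s fleet-wide")
print(f"{nics} NICs")
```

Changing `tokens` or `dtype_bytes` shows how quickly the transfer budget moves with context length and KV precision.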

Optimisations:

Layer-wise streaming. Don't wait for the entire prefill to finish before starting transfer. As soon as prefill finishes layer l, that layer's KV starts transferring. Decode can begin as soon as the first few layers' KV has arrived. Overlap transfer with the rest of prefill.
Co-located racks. Put prefill and decode GPUs on the same NVLink island when possible — then KV transfer runs at ~900 GB/s instead of ~50 GB/s.
KV compression. FP8 or even INT4 KV cache halves or quarters transfer cost. Same approach also reduces decode-time HBM read.
Larger KV blocks. Per-message overhead matters. Sending one 670 MB chunk is faster than 16 × 42 MB chunks at the same total bytes, because of fewer NCCL launches.
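To see why layer-wise streaming hides most of the transfer, it helps to model prefill and transfer as a two-stage pipeline over layers. The sketch below is a simplified timing model under stated assumptions: every layer takes the same time to prefill and to transfer, there is no per-message overhead (which the last item says does matter in practice), and `kv_ready_ms` is a hypothetical helper name.

```python
def kv_ready_ms(prefill_ms, kv_bytes, bw_bytes_per_s, layers, streamed):
    """Time until the decode GPU holds the full KV cache, in milliseconds."""
    transfer_ms = kv_bytes / bw_bytes_per_s * 1e3
    if not streamed:
        # Transfer starts only after the whole prefill has finished.
        return prefill_ms + transfer_ms
    p = prefill_ms / layers      # prefill time per layer
    t = transfer_ms / layers     # transfer time per layer
    # Two-stage pipeline: first layer computes, then the slower stage sets
    # the pace for the remaining layers, then the last layer transfers.
    return p + (layers - 1) * max(p, t) + t


KV_BYTES = 2 * 8 * 128 * 2048 * 80 * 2   # the 671 MB Llama-70B example
for bw in (50e9, 900e9):
    whole = kv_ready_ms(150.0, KV_BYTES, bw, 80, streamed=False)
    piped = kv_ready_ms(150.0, KV_BYTES, bw, 80, streamed=True)
    print(f"{bw / 1e9:.0f} GB/s: after-prefill {whole:.1f} ms, layer-wise {piped:.1f} ms")
```

In this model, as long as a layer transfers faster than it prefills, the only exposed transfer cost is the final layer's share; compression shrinks `kv_bytes` and helps in both modes.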

3D · KV bytes flowing over IB, prefill rack → decode rack

Isometric view of two racks. Each rack holds n GPUs (stacked cubes). When a prefill finishes a layer, the bytes pop out and stream across the IB link to the decode rack. Slide the bandwidth knob: at IB 50 GB/s the stream is sparse; at NVLink-island 900 GB/s the link saturates instantly.

Why pool sizes don't match

One prefill GPU can feed several decode GPUs. The reason: a prefill GPU produces one prompt's worth of KV every ~100–150 ms (for typical prompts), while a decode GPU then spends seconds to minutes generating tokens against that KV. Numerical example (very rough):

Quantity	Value
Prefill latency for 2k prompt on H100	~150 ms
Decode latency per token (TP=8)	~6 ms
Avg generated length per request	200 tokens (= 1200 ms)
Prefills one GPU completes during one request's decode	~1200 / 150 ≈ 8
Pool ratio (prefill : decode), GPUs	roughly 1 : 4 to 1 : 8

This ratio depends on the workload mix (avg prompt vs avg completion length) and is a tunable in production. Many providers auto-scale the two pools independently based on observed queue length.
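The table's arithmetic can be wrapped in a small helper to see how the ratio moves with the workload mix. This is a rough sketch only: the function name is hypothetical, and it compares per-request time in each phase without modelling batch sizes, which is one reason a real pool ratio can differ from the raw number.

```python
def decode_to_prefill_ratio(prefill_ms, decode_ms_per_token, avg_completion_tokens):
    """How many prefills fit into the time one request spends decoding."""
    decode_ms = decode_ms_per_token * avg_completion_tokens
    return decode_ms / prefill_ms


# Values from the table: 150 ms prefill, 6 ms/token, 200 generated tokens.
print(decode_to_prefill_ratio(150, 6, 200))

# Shorter completions shrink the ratio, so fewer decode GPUs are needed per prefill GPU.
print(decode_to_prefill_ratio(150, 6, 100))
```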

2D · utilisation heatmap, co-located vs disaggregated

Each row is one GPU, each column a 10 ms time slice. The cell colour is utilisation: darker = idle, bright = saturated on its bottleneck (FLOPs for prefill, HBM bandwidth for decode). The co-located heatmap shows the classic stripey pattern — prefills crowd out decodes, decode-only steps underuse FLOPs. The disaggregated heatmap shows two flat, mostly-bright bands.

When NOT to disaggregate

Very short prompts. If prompts are 50 tokens, prefill is ~10 ms (at our 5000 tok/s/GPU rate) — the KV-transfer cost is on the same order as the saving. Co-located prefill+decode is fine.
Single-machine deployments. The complexity is not worth it for one node.
Extremely latency-sensitive workloads. The extra network hop adds tens of ms; a chat product targeting sub-200 ms TTFT may not tolerate it.
When chunked prefill suffices. If the in-pool scheduling problem is mild, the simpler vLLM/SGLang-style chunked prefill might be enough.
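The short-prompt case can be made concrete by comparing the prefill time being offloaded with the hand-off cost it incurs. The sketch below uses the 5000 tok/s/GPU rate and the Llama-70B KV shape from this lesson; the 10 ms fixed hop overhead is an assumed placeholder for the "tens of ms" mentioned above, and the function name is hypothetical.

```python
def prefill_vs_handoff_ms(prompt_tokens, prefill_tok_per_s=5000.0,
                          bw_bytes_per_s=50e9, hop_ms=10.0):
    """Return (prefill time, KV hand-off time) in ms for one request."""
    prefill_ms = prompt_tokens / prefill_tok_per_s * 1e3
    # Llama-70B shape: 2 (K and V) * 8 KV heads * 128 dims * 80 layers * 2 bytes (bf16).
    kv_bytes = 2 * 8 * 128 * prompt_tokens * 80 * 2
    handoff_ms = hop_ms + kv_bytes / bw_bytes_per_s * 1e3  # assumed fixed hop cost + wire time
    return prefill_ms, handoff_ms


for tokens in (50, 2048):
    prefill_ms, handoff_ms = prefill_vs_handoff_ms(tokens)
    print(f"{tokens:>5} tok: prefill {prefill_ms:.1f} ms, hand-off {handoff_ms:.1f} ms")
```

With these assumed numbers the hand-off for a 50-token prompt costs about as much as the prefill itself, while for a 2k prompt it is a small fraction of it.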

Disaggregation as a 3D parallelism analogue

Notice the parallel: training disaggregates the work by parallelism axis (TP, PP, DP), with the rule "fastest fabric for chattiest comm". Inference disaggregates by phase, with the analogous rule "match the phase to the hardware best at its bottleneck". The two share a deep idea: when one workload class doesn't suit one hardware shape, find the heterogeneity and let it be load-bearing.

Interactive · is disaggregation worth it for your workload?

Adjust avg prompt length, avg completion length, and KV bandwidth. The widget compares co-located scheduling (with chunked prefill) against disaggregated serving for the same number of GPUs. Outputs: effective tokens/sec, average TTFT, p99 TTFT.

Co-located vs disaggregated serving · throughput & latency

A toy queue simulator. Co-located has one pool that does both phases (with chunked prefill packing). Disaggregated has two pools split by the slider. Watch how disaggregation wins on TTFT for long prompts and loses on cost when prompts are short.

[Interactive widget. Default settings: total GPUs 32, prefill share 25, avg prompt 1024 tok, avg completion 200 tok, KV xfer BW 200 GB/s. Outputs: co-located TTFT, disagg TTFT, co-located TPS, disagg TPS.]

Takeaway

Disaggregation buys per-pool utilisation by splitting one mixed workload into two specialised ones, paying the cost of one KV transfer per request. The break-even depends on prompt length, network bandwidth, and the value of per-phase utilisation. It composes with continuous batching, paged KV, and chunked prefill rather than replacing them, and at extreme scale (large model, long context, high QPS) it is one of the highest-impact serving levers after PagedAttention itself.

Where this goes next

We've walked the bandwidth pyramid down and back up. Training disperses one model across many GPUs; inference reassembles it for users. The pieces in vllm/lessons/ — KV cache, PagedAttention, continuous batching, prefill chunking, speculative decoding, GQA/MQA — are all decode-pool-internal optimizations that compose with everything in this series. The pieces in RL/lessons/ — rollout, reference, algorithm, trainer, weight-sync — compose with training-side parallelism to build a real RL post-training stack. All three lesson series are the same system, sliced different ways.