
Disaggregated prefill / decode — opposite phases, opposite pools

Prefill is compute-bound. Decode is memory-bound. Mixing them on the same GPU forces every step to be the worse of the two. Split them onto separate pools and each pool saturates its own bottleneck — at the cost of one KV-cache transfer per request.

Recap — why these phases want different things

From lesson 01's roofline and lesson 11's decode latency math:

| Phase | Per-token work | Bottleneck | What it wants |
|---|---|---|---|
| Prefill | FLOPs on the full prompt all at once · large matmul | Tensor cores | Lots of compute, large batch dimensions, peak FLOPs/s |
| Decode | One forward, one new token. Read all weights, do tiny matmul | HBM bandwidth | High aggregate HBM BW, KV cache that fits, low latency |
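
The contrast in the table can be sketched as a roofline check. This is a rough illustration, not the lesson's own model: it assumes about 2 FLOPs per parameter per token, counts only weight traffic (KV-cache and activation reads are ignored), and uses round hardware numbers (`PEAK_FLOPS`, `HBM_BW`) that are placeholders rather than measurements. The function names are hypothetical.

```python
def arithmetic_intensity(tokens_per_step: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weight traffic for one forward step.

    Assumption: ~2 FLOPs per parameter per token, and each weight is read
    from HBM once per step. KV-cache and activation traffic are ignored.
    """
    flops_per_param = 2 * tokens_per_step
    bytes_per_param = dtype_bytes
    return flops_per_param / bytes_per_param


def bottleneck(tokens_per_step: int, peak_flops: float, hbm_bw: float,
               dtype_bytes: int = 2) -> str:
    """Classify a step as compute- or memory-bound against the roofline ridge."""
    ridge = peak_flops / hbm_bw  # FLOPs per byte where the two limits meet
    ai = arithmetic_intensity(tokens_per_step, dtype_bytes)
    return "compute" if ai >= ridge else "memory"


# Assumed, illustrative hardware numbers (not taken from the lesson).
PEAK_FLOPS = 1.0e15  # FLOPs/s
HBM_BW = 3.0e12      # bytes/s

print(bottleneck(2048, PEAK_FLOPS, HBM_BW))  # prefill of a 2k-token prompt
print(bottleneck(1, PEAK_FLOPS, HBM_BW))     # decode, one token for one request
```

Under these assumptions the 2k-token prefill lands on the compute side of the ridge and the single-token decode on the memory side, which is the split the table describes.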

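Same model. Same hardware. Same kernels. Opposite bottlenecks. Classical co-located serving (one pool runs both phases) suffers from a few specific pathologies, notably: prefills crowd out decodes and stall the requests queued behind them, and decode-only steps underuse FLOPs.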

Disaggregated serving (DistServe, Splitwise, Mooncake, 2024) takes the obvious step: two pools.

The architecture

[Figure: disaggregated serving · two specialised pools. A request enters the router, which picks a prefill GPU and then a decode GPU. Prefill pool: few GPUs · compute-bound · large batched prefills. Decode pool: many GPUs · memory-bound · continuous batched decodes. The KV cache moves between the pools over the transfer network (NVLink / IB / RDMA, ~50–900 GB/s), which governs KV transfer time.]

A request flows:

  1. Router picks a prefill GPU and a decode GPU (or pair) based on load and the request's expected length.
  2. Prefill GPU runs the full forward pass on the prompt. Result: the KV cache for that prompt's tokens.
  3. KV cache is transferred from the prefill GPU to the decode GPU over the network.
  4. Decode GPU receives the KV, begins generating tokens one at a time, streams to the user.
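
The four steps above can be sketched as a toy control flow. Everything here is hypothetical and for illustration only: the `Worker` and `Router` classes, the `queued_tokens` load signal, and the least-loaded selection rule are assumptions, not the API of DistServe, Splitwise, Mooncake, or any other real serving system. No model is run; each step just records an event.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0  # crude load signal (an assumption of this sketch)


@dataclass
class Router:
    prefill_pool: list
    decode_pool: list

    def route(self, prompt_len: int, expected_output_len: int):
        """Step 1: pick the least-loaded GPU in each pool."""
        p = min(self.prefill_pool, key=lambda w: w.queued_tokens)
        d = min(self.decode_pool, key=lambda w: w.queued_tokens)
        p.queued_tokens += prompt_len
        d.queued_tokens += expected_output_len
        return p, d


def serve(router: Router, prompt_len: int, expected_output_len: int) -> list:
    """Walk one request through the four steps and return the event log."""
    p, d = router.route(prompt_len, expected_output_len)
    events = [
        f"route: prefill={p.name} decode={d.name}",           # step 1
        f"prefill: {prompt_len} prompt tokens on {p.name}",   # step 2
        f"transfer: KV cache {p.name} -> {d.name}",           # step 3
        f"decode: stream tokens from {d.name}",               # step 4
    ]
    # Once the KV has left, the prefill GPU is free for the next prompt,
    # while the decode GPU stays busy for the whole generation.
    p.queued_tokens -= prompt_len
    return events


router = Router(
    prefill_pool=[Worker("prefill-0"), Worker("prefill-1")],
    decode_pool=[Worker("decode-0"), Worker("decode-1"), Worker("decode-2")],
)
for line in serve(router, prompt_len=2048, expected_output_len=200):
    print(line)
```

A second request routed immediately afterwards would reuse `prefill-0` (already free) but land on a different decode worker, since `decode-0` still holds the first request's generation.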

Each pool's scheduler is simpler (one workload class instead of two). Each pool's hardware can be specialised (different GPU SKUs, different TP shapes). Each pool's GPU utilisation rises closer to its physical limit.

Animated · disaggregated vs co-located, same request

The same request travels through both architectures side by side. Co-located (top lane) sequentially does prefill then decode on one pool, stalling any other request behind it. Disaggregated (bottom lane) runs prefill on the small specialised pool, transfers the KV cache over IB, and lets the decode pool start while the prefill pool is already free for the next request. Play to watch a sequence of requests stream through.

[Interactive figure: Two architectures · same workload · per-GPU timeline. Time runs left to right. Blue = prefill compute, green = decode compute, orange = KV cache transfer. Idle = empty. Each row is one GPU. Readouts: TTFT for this request and requests completed, for co-located and disaggregated.]

The new cost — KV transfer

Nothing is free. For a prompt of T tokens, the KV cache is:

KV_bytes  =  2 · h_kv · d_k · T · L · dtype_bytes

For a Llama-70B model (h_kv = 8, d_k = 128, L = 80) at T = 2048 in bf16: 2 · 8 · 128 · 2048 · 80 · 2 = 671 MB. At a fleet-wide 1000 req/s (across all replicas — not a single deployment), that's 671 GB/s of KV transfer load to schedule across the rack.

At inter-node 50 GB/s per NIC, you'd need ~14 NICs of aggregate KV bandwidth — feasible but not free. The whole point of disaggregation is contingent on the KV transfer being cheap enough that the per-pool efficiency gains outweigh the transfer cost.
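
The arithmetic above can be reproduced with a few lines of Python. The function name `kv_cache_bytes` is a label chosen for this sketch; the model shape, request rate, and link bandwidths are the figures quoted in the text.

```python
import math


def kv_cache_bytes(h_kv: int, d_k: int, tokens: int, layers: int,
                   dtype_bytes: int) -> int:
    """KV_bytes = 2 * h_kv * d_k * T * L * dtype_bytes (the 2 covers K and V)."""
    return 2 * h_kv * d_k * tokens * layers * dtype_bytes


# Llama-70B shape from the text, T = 2048, bf16 (2 bytes per element).
kv = kv_cache_bytes(h_kv=8, d_k=128, tokens=2048, layers=80, dtype_bytes=2)

fleet_rate = 1000        # requests per second, fleet-wide
nic_bw = 50e9            # bytes/s per inter-node NIC
nvlink_bw = 900e9        # bytes/s inside an NVLink island

fleet_load = kv * fleet_rate                 # bytes/s of KV to move
nics_needed = math.ceil(fleet_load / nic_bw)

print(f"KV per request:     {kv / 1e6:.0f} MB")
print(f"Fleet KV load:      {fleet_load / 1e9:.0f} GB/s")
print(f"NICs at 50 GB/s:    {nics_needed}")
print(f"Per-request transfer, one NIC: {kv / nic_bw * 1e3:.1f} ms")
print(f"Per-request transfer, NVLink:  {kv / nvlink_bw * 1e3:.2f} ms")
```

The per-request times assume the request has a whole link to itself and ignore per-message overhead, so they are lower bounds rather than predictions.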

Optimisations:

  • Layer-wise streaming. Don't wait for the entire prefill to finish before starting transfer. As soon as prefill finishes layer l, that layer's KV starts transferring. Decode can begin as soon as the first few layers' KV has arrived. Overlap transfer with the rest of prefill.
  • Co-located racks. Put prefill and decode GPUs on the same NVLink island when possible — then KV transfer runs at ~900 GB/s instead of ~50 GB/s.
  • KV compression. FP8 or even INT4 KV cache halves or quarters transfer cost. Same approach also reduces decode-time HBM read.
  • Larger KV blocks. Per-message overhead matters. Sending one 670 MB chunk is faster than 16 × 42 MB chunks at the same total bytes, because of fewer NCCL launches.
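
The effect of layer-wise streaming can be sketched with a simple pipeline model. This is an illustration under stated assumptions: prefill compute is spread evenly across layers, one layer's KV is sent as one message, the link carries one message at a time, and per-message overhead is ignored. The function name and parameters are hypothetical. It reports when the full KV cache has arrived, not the earlier point at which decode could start on the first layers.

```python
def kv_ready_time_ms(prefill_ms: float, layers: int, kv_bytes: int,
                     bw_bytes_per_s: float, streamed: bool) -> float:
    """Time from prefill start until the whole KV cache is on the decode GPU."""
    per_layer_compute = prefill_ms / layers
    per_layer_xfer = kv_bytes / layers / bw_bytes_per_s * 1e3  # ms

    if not streamed:
        # Wait for the whole prefill, then send everything.
        return prefill_ms + layers * per_layer_xfer

    link_free = 0.0  # time at which the link finishes its current message
    for layer in range(1, layers + 1):
        computed_at = layer * per_layer_compute
        # A layer's KV can leave once it is computed and the link is free.
        link_free = max(link_free, computed_at) + per_layer_xfer
    return link_free


KV_BYTES = 2 * 8 * 128 * 2048 * 80 * 2  # Llama-70B, T=2048, bf16 (from the text)

blocking = kv_ready_time_ms(150.0, 80, KV_BYTES, 50e9, streamed=False)
streaming = kv_ready_time_ms(150.0, 80, KV_BYTES, 50e9, streamed=True)
print(f"send after prefill: {blocking:.2f} ms")
print(f"layer-wise stream:  {streaming:.2f} ms")
```

With a ~150 ms prefill and a 50 GB/s link, each layer's transfer finishes before the next layer is computed, so in this model only the last layer's transfer remains exposed. On a slower link, or with a faster prefill, the link becomes the pipeline's bottleneck and the gain shrinks.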

3D · KV bytes flowing over IB, prefill rack → decode rack

Isometric view of two racks. Each rack holds n GPUs (stacked cubes). When a prefill finishes a layer, the bytes pop out and stream across the IB link to the decode rack. Slide the bandwidth knob: at IB 50 GB/s the stream is sparse; at NVLink-island 900 GB/s the link saturates instantly.

[Interactive figure: KV transfer · prefill rack → decode rack (isometric). Layer-wise streaming: each "byte particle" represents one layer's KV (~8 MB per layer at T=2k). Travel time = bytes / bandwidth. Readouts: total KV (per req), transfer time, link utilisation, overlap w/ prefill.]

Why pool sizes don't match

One prefill GPU can feed several decode GPUs. The reason: a prefill GPU produces one prompt's worth of KV every ~100–150 ms (for typical prompts), while the decode side then spends seconds to minutes generating tokens against that KV. A very rough numerical example:

| Quantity | Value |
|---|---|
| Prefill latency for 2k prompt on H100 | ~150 ms |
| Decode latency per token (TP=8) | ~6 ms |
| Avg generated length per request | 200 tokens (= 1200 ms) |
| Decodes one prefill "fills" | ~1200 / 150 ≈ 8 decode batches |
| Pool ratio (prefill : decode), GPUs | roughly 1 : 4 to 1 : 8 |

This ratio depends on the workload mix (avg prompt vs avg completion length) and is a tunable in production. Many providers auto-scale the two pools independently based on observed queue length.
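
The table's estimate, and its sensitivity to the workload mix, can be sketched as follows. The function name is hypothetical, and the model is deliberately crude: it counts how many requests are still decoding when the prefill GPU finishes its next prompt, and says nothing about how many of those a single decode GPU can batch together.

```python
def decodes_in_flight_per_prefill_gpu(prefill_ms: float, ms_per_token: float,
                                      avg_gen_tokens: int) -> float:
    """Requests still decoding per prefill GPU at steady state.

    One request leaves prefill every `prefill_ms` and then occupies the
    decode side for `ms_per_token * avg_gen_tokens` milliseconds.
    """
    decode_ms = ms_per_token * avg_gen_tokens
    return decode_ms / prefill_ms


# Numbers from the table: 150 ms prefill, 6 ms/token, 200 generated tokens.
print(decodes_in_flight_per_prefill_gpu(150, 6, 200))

# Longer completions push the ratio toward more decode capacity.
for gen_tokens in (50, 200, 1000):
    ratio = decodes_in_flight_per_prefill_gpu(150, 6, gen_tokens)
    print(f"{gen_tokens:>5} generated tokens -> {ratio:.1f} decodes in flight")
```

Turning this count into a GPU ratio still needs the decode batch size per GPU, which is one reason the table gives a range rather than a single number.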

2D · utilisation heatmap, co-located vs disaggregated

Each row is one GPU, each column a 10 ms time slice. The cell colour is utilisation: darker = idle, bright = saturated on its bottleneck (FLOPs for prefill, HBM bandwidth for decode). The co-located heatmap shows the classic stripey pattern — prefills crowd out decodes, decode-only steps underuse FLOPs. The disaggregated heatmap shows two flat, mostly-bright bands.

[Interactive figure: Per-GPU utilisation over time · two architectures. The same total request stream is fed to both. Co-located mixes both phases; disaggregated splits them. Try long prompts to see the gap widen. Readouts: average utilisation and variance, for co-located and disaggregated.]

When NOT to disaggregate

  • Very short prompts. If prompts are 50 tokens, prefill is ~10 ms (at our 5000 tok/s/GPU rate) — the KV-transfer cost is on the same order as the saving. Co-located prefill+decode is fine.
  • Single-machine deployments. The complexity is not worth it for one node.
  • Latency-extreme workloads. Adding a network hop adds tens of ms; chat with sub-200 ms TTFT may not tolerate it.
  • When chunked prefill suffices. If the in-pool scheduling problem is mild, the simpler vLLM/SGLang-style chunked prefill might be enough.

Disaggregation as a 3D parallelism analogue

Notice the parallel: training disaggregates the work by parallelism axis (TP, PP, DP), with the rule "fastest fabric for chattiest comm". Inference disaggregates by phase, with the analogous rule "match the phase to the hardware best at its bottleneck". The two share a deep idea: when one workload class doesn't suit one hardware shape, find the heterogeneity and let it be load-bearing.

Interactive · is disaggregation worth it for your workload?

Adjust avg prompt length, avg completion length, and KV bandwidth. The widget compares co-located scheduling (with chunked prefill) against disaggregated serving for the same number of GPUs. Outputs: effective tokens/sec, average TTFT, p99 TTFT.

[Interactive figure: Co-located vs disaggregated serving · throughput & latency. A toy queue simulator. Co-located has one pool that does both phases (with chunked prefill packing). Disaggregated has two pools split by the slider. Watch how disaggregation wins on TTFT for long prompts and loses on cost when prompts are short. Readouts: TTFT and TPS, for co-located and disaggregated.]

Takeaway

Disaggregation buys per-pool utilisation by splitting one mixed workload into two specialised ones, paying the cost of one KV transfer per request. The break-even depends on prompt length, network bandwidth, and the value of per-phase utilisation. It pairs naturally with continuous batching, paged KV, and chunked prefill, but at extreme scale (large model, long context, high QPS), disaggregation is the highest-impact serving lever after PagedAttention itself.

Where this goes next

We've walked the bandwidth pyramid down and back up. Training disperses one model across many GPUs; inference reassembles it for users. The pieces in vllm/lessons/ (KV cache, PagedAttention, continuous batching, prefill chunking, speculative decoding, GQA/MQA) are all decode-pool-internal optimisations that compose with everything in this series. The pieces in RL/lessons/ (rollout, reference, algorithm, trainer, weight-sync) compose with training-side parallelism to build a real RL post-training stack. All three lesson series are the same system, sliced different ways.