Prefill/decode disaggregation
Two phases with opposite bottlenecks share one GPU. The fix: don't. Run them on different hardware pools and pay one KV transfer for the privilege.
The asymmetry, one more time
Lesson 01 established it; here it drives the architecture. The same model, the same kernels, two phases:
| | prefill | decode |
|---|---|---|
| shape | T tokens in parallel, one forward | 1 token per forward, T forwards over the response |
| attention | O(T²) — full QKᵀ matrix | O(T) — one query against T keys/values |
| matmul shape | matrix × matrix (high arithmetic intensity) | matrix × vector (low arithmetic intensity) |
| bottleneck | compute-bound: latency ≈ FLOPs / GPU FLOP rate | memory-bound: latency ≈ bytes read per step (KV, plus weights) / HBM bandwidth |
| what scales it | better TFLOPs (H100, MI300X) | more HBM, faster HBM |
One run, opposite bottlenecks. That's the architectural smell. The two phases want different hardware, and forcing them to share a stream costs you on both ends.
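The two latency formulas in the table can be turned into a rough back-of-envelope estimate. The sketch below assumes roughly 2 FLOPs per parameter per token for prefill (attention FLOPs ignored) and uses made-up hardware numbers (`FLOPS`, `HBM_BW`); none of these values come from this lesson, so read the outputs as orders of magnitude only.

```python
def prefill_latency_s(n_params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token (assumption)."""
    return 2.0 * n_params * prompt_tokens / flops_per_s


def decode_step_latency_s(bytes_read: float, hbm_bytes_per_s: float) -> float:
    """Memory-bound estimate: bytes streamed from HBM in one step / bandwidth."""
    return bytes_read / hbm_bytes_per_s


# Hypothetical accelerator, not a real spec sheet.
FLOPS = 400e12   # sustained fp16 FLOP/s (assumed)
HBM_BW = 2e12    # HBM bytes/s (assumed)

# 7B-parameter model, 4096-token prompt.
prefill_s = prefill_latency_s(7e9, 4096, FLOPS)

# One decode step reading a 2 GiB KV cache; weight reads would add to this.
decode_s = decode_step_latency_s(2 * 2**30, HBM_BW)

print(f"prefill ≈ {prefill_s * 1e3:.0f} ms, decode step ≈ {decode_s * 1e3:.2f} ms")
```

Changing `FLOPS` moves only the prefill number and changing `HBM_BW` moves only the decode number, which is the asymmetry the table describes.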
The co-located failure mode
Default vLLM does both phases on the same GPU, same stream. A long prefill blocks the queue for as long as it takes — hundreds of milliseconds for a 4k-token prompt — and every decode in flight waits.
CO-LOCATED, ONE GPU, ONE STREAM
t=0 t=80 t=120 t=320 t=400
├─decode─┤├arr.─┤├──── prefill ────────┤├─decode─
r0,r1 r2 r2 (4000 tok) r0,r1,r2
↑ ↑
r0, r1 stalled here TTFT(r2) = ~280 ms
TTFT(r1) ← punished TTFT(r0,r1) ← also punished
The cost lands on tail latency. p50 looks fine; p99 is awful because every request unlucky enough to arrive behind a long prefill picks up the prefill's full duration as queuing delay.
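A toy FIFO model makes the queuing effect concrete. It is a sketch under simplifying assumptions: one stream, prefills only (decode steps are left out), and the hypothetical function name `colocated_ttft_ms` is not part of any library.

```python
def colocated_ttft_ms(arrivals_ms, prefill_ms):
    """FIFO on one stream: each prefill starts once the GPU is free.

    Returns time-to-first-token per request, measured from its arrival.
    """
    gpu_free = 0.0
    ttfts = []
    for arrive, cost in zip(arrivals_ms, prefill_ms):
        start = max(arrive, gpu_free)   # wait behind whatever is running
        gpu_free = start + cost
        ttfts.append(gpu_free - arrive)
    return ttfts


# One long prefill (200 ms) at t=0, then three short ones (10 ms each).
ttfts = colocated_ttft_ms([0, 20, 40, 60], [200, 10, 10, 10])
print(ttfts)  # [200.0, 190.0, 180.0, 170.0]
```

Each 10 ms request ends up with a TTFT close to the long prefill's full duration, which is the tail-latency penalty described above.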
Chunked prefill (lesson 10) softens this — chunk the prompt into 512-token slices and pack each slice with one decode step in the same forward. But each chunk still steals compute from decode. The interference is reduced, not removed.
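The slicing itself is simple. The sketch below only computes the chunk boundaries for a prompt; how a scheduler packs each slice with a decode step is engine-specific and not shown. `chunk_prompt` is a hypothetical helper name.

```python
def chunk_prompt(prompt_len: int, chunk_size: int = 512):
    """Split a prompt of prompt_len tokens into [start, end) slices."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


chunks = chunk_prompt(4000)
print(len(chunks), chunks[-1])  # 8 (3584, 4000)
```

A 4000-token prompt becomes 8 forwards, so an in-flight decode waits for at most one slice at a time rather than for the whole prompt, but it still shares every one of those forwards with prefill work.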
The disaggregated architecture
Run the two phases on two distinct GPU pools. Route based on phase, not on request.
             ┌───────────────┐
             │  router / LB  │  (short-prompts can skip pool A)
             └───────┬───────┘
                     │
        ┌────────────┴────────────┐
        ▼                         ▼
  ┌───────────┐             ┌───────────┐
  │  PREFILL  │             │  DECODE   │
  │  pool A   │             │  pool B   │
  │           │             │           │
  │ few GPUs  │             │ many GPUs │
  │ big FLOPs │   KV xfer   │ lots HBM  │
  │           │ ──────────► │           │
  │ batched   │             │ small per │
  │ long-ctx  │             │ step batch│
  │ prefill   │             │ many seqs │
  └───────────┘             └───────────┘
        ↑                         │
        └──── request arrives     └──► streams tokens to client
Key property: a long prefill on pool A does not block any decode on pool B. The two queues are independent. Tail latency on the decode side decouples from the prompt-length distribution on the prefill side.
The flow per request:
- Router admits to prefill pool A.
- Pool A runs the prompt through the model, producing one full KV cache.
- KV cache transfers to a GPU on pool B.
- Pool B owns the request; runs decode steps; streams tokens to the client.
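The four steps above can be sketched as a tiny state machine. Everything here is illustrative: `Request`, `DisaggRouter`, and the string placeholders for KV location are hypothetical names, and the "transfer" is a single assignment standing in for a real fabric copy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    prompt_tokens: int
    kv_location: str = "none"          # where the KV cache currently lives
    generated: list = field(default_factory=list)


class DisaggRouter:
    """Two independent queues: one per pool."""

    def __init__(self):
        self.prefill_queue = deque()   # pool A
        self.decode_queue = deque()    # pool B

    def admit(self, req: Request) -> None:
        # Step 1: router admits to prefill pool A.
        self.prefill_queue.append(req)

    def run_prefill_once(self) -> Request:
        req = self.prefill_queue.popleft()
        req.kv_location = "pool_a"     # Step 2: prefill builds the full KV cache
        req.kv_location = "pool_b"     # Step 3: KV transfer (stand-in)
        self.decode_queue.append(req)  # pool B now owns the request
        return req

    def run_decode_step(self) -> None:
        # Step 4: one decode step for every sequence on pool B.
        for req in self.decode_queue:
            req.generated.append(f"tok{len(req.generated)}")


router = DisaggRouter()
router.admit(Request("r0", prompt_tokens=4000))
r0 = router.run_prefill_once()
router.run_decode_step()
router.run_decode_step()
print(r0.kv_location, r0.generated)  # pool_b ['tok0', 'tok1']
```

Because `run_decode_step` never touches `prefill_queue`, a long prompt waiting on pool A cannot delay a decode step on pool B.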
The cost: KV transfer
You've now added a network hop. The KV cache for a T-token prompt is exactly what lesson 01 quantified: per token, 2 (K and V) × layers × KV heads × d_head × bytes per element.
For a 7B fp16 model (32 layers, 32 KV heads, d_head 128) that is ~500 KB/tok. A GQA model with fp8 KV comes to ~60 KB/tok (see lesson 09); this 8× reduction is what makes cross-node disagg practical at scale. At ~500 KB/tok, a 4096-token prompt is 2 GB. Transfer time depends entirely on the fabric:
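These figures can be checked in a few lines, using the standard KV accounting (one K and one V vector per layer per KV head). The 8 KV heads used for the GQA case are an assumption chosen to reproduce the 8× reduction; the text does not state the head count.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int) -> int:
    """KV cache bytes per token: K and V, for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem


mha_fp16 = kv_bytes_per_token(32, 32, 128, 2)  # 7B-style config, fp16
gqa_fp8 = kv_bytes_per_token(32, 8, 128, 1)    # assumed 8 KV heads, fp8
prompt_bytes = 4096 * mha_fp16                 # full prompt, fp16 MHA

print(mha_fp16 // 1024, "KiB/tok")   # 512 KiB/tok
print(gqa_fp8 // 1024, "KiB/tok")    # 64 KiB/tok
print(prompt_bytes / 2**30, "GiB")   # 2.0 GiB
```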
| fabric | peak BW | time for 2 GB | verdict |
|---|---|---|---|
| NVLink (intra-node) | ~800 GB/s | ~2.5 ms | negligible |
| PCIe 4 (x16) | ~30 GB/s | ~65 ms | tolerable |
| RDMA / InfiniBand | ~25–50 GB/s | ~40–80 ms | non-trivial |
| regular ethernet (10 GbE) | ~1 GB/s | ~2000 ms | do not attempt |
This is the central tradeoff. On NVLink the transfer is effectively free, so disagg almost always wins. Over RDMA the transfer cost is comparable to the interference it is meant to remove, so disagg only wins on workloads with enough long prefills.
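The table's time column is just payload divided by peak bandwidth. A small sketch reproduces it from the bandwidth figures listed above; real transfers will be slower than peak, so these are lower bounds.

```python
# Peak bandwidths in GB/s, copied from the table above.
FABRIC_GB_PER_S = {
    "NVLink": 800,
    "PCIe 4 x16": 30,
    "InfiniBand (low)": 25,
    "InfiniBand (high)": 50,
    "10 GbE": 1,
}


def transfer_ms(kv_gb: float, gb_per_s: float) -> float:
    """Best-case transfer time in milliseconds at peak bandwidth."""
    return 1000.0 * kv_gb / gb_per_s


times = {name: transfer_ms(2.0, bw) for name, bw in FABRIC_GB_PER_S.items()}
for name, ms in times.items():
    print(f"{name:18s} {ms:8.1f} ms")
```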
Optimizations that make disagg practical
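One optimization this lesson names, in the takeaways and in the vLLM row of the systems table, is layer-wise streaming: send each layer's KV as soon as that layer's prefill finishes, so the transfer overlaps with the remaining compute. The sketch below is a simple pipeline model of that idea, assuming equal compute and transfer time per layer; it is not taken from any particular engine, and `exposed_transfer_ms` is a hypothetical name.

```python
def exposed_transfer_ms(prefill_ms: float, transfer_ms: float,
                        n_layers: int) -> float:
    """Transfer time NOT hidden behind prefill when KV is streamed per layer.

    Two-stage pipeline: layer i's KV can ship once layer i is computed
    and layer i-1's KV has finished shipping.
    """
    compute = prefill_ms / n_layers   # per-layer prefill time
    xfer = transfer_ms / n_layers     # per-layer KV transfer time
    finish = compute + (n_layers - 1) * max(compute, xfer) + xfer
    return finish - prefill_ms


# 200 ms prefill, 65 ms total transfer, 32 layers: only the last layer shows.
hidden_case = exposed_transfer_ms(200.0, 65.0, 32)
# Transfer slower than compute: the fabric becomes the bottleneck.
slow_case = exposed_transfer_ms(200.0, 400.0, 32)
print(hidden_case, slow_case)  # 2.03125 206.25
```

In this model the transfer is almost fully hidden as long as the per-layer transfer is no slower than the per-layer compute; once the fabric is slower, the excess shows up in full.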
The systems
| system | contribution |
|---|---|
| DistServe (OSDI 2024) | First formal analysis. Defines goodput-under-SLO: requests served per second that meet both TTFT and per-token-latency targets. Shows when disagg beats co-location analytically — a function of prompt-length distribution and fabric BW. |
| Mooncake (Moonshot AI) | Production. Tiered KV cache: HBM → DRAM → SSD. Decode pool fetches KV from whichever tier holds it. Massive prefix-cache hit rates from chat-style workloads. |
| vLLM disagg (2026) | Experimental as of this writing. Inherits the DistServe + Mooncake ideas; ships with NCCL-based KV transport and the layer-wise streaming optimization. |
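Goodput-under-SLO, as described in the DistServe row, is easy to compute from per-request measurements. This is a minimal sketch of the metric as stated in the table, not DistServe's own code; the SLO values and the sample measurements are invented for illustration.

```python
def goodput_rps(requests, ttft_slo_ms: float, tpot_slo_ms: float,
                window_s: float) -> float:
    """Requests per second that meet BOTH the TTFT and per-token-latency targets.

    requests: iterable of (ttft_ms, per_token_latency_ms) pairs.
    """
    ok = sum(
        1 for ttft, tpot in requests
        if ttft <= ttft_slo_ms and tpot <= tpot_slo_ms
    )
    return ok / window_s


# Four requests observed over a 2-second window (made-up numbers).
observed = [(150, 30), (900, 25), (180, 80), (120, 20)]
gp = goodput_rps(observed, ttft_slo_ms=200, tpot_slo_ms=40, window_s=2.0)
print(gp)  # 1.0
```

Raw throughput here would be 2 requests per second, but only two of the four requests meet both targets, so goodput is half of that.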
Interactive · bimodal-workload timeline
Below: a sequence of requests arrives on a Poisson-ish schedule, with prompts drawn from a bimodal distribution — most short (50-200 tokens), some long (2000-4000 tokens). Compare two architectures on the same workload:
- Top timeline. Co-located: one GPU, serial. Prefills (orange) block decodes (blue).
- Bottom timeline. Disaggregated: prefill pool runs to the left of the slash, decode pool to the right. The two pools are independent.
Things to try:
- Default settings, NVLink. Disagg wins on p99 TTFT cleanly — the long prompts don't poison the queue for everyone else.
- Same settings on PCIe. p99 still improves, mean E2E gets slightly worse — the transfer fee shows up.
- Drop long-prompt % to 0. Now disagg adds the transfer cost without saving any interference. Co-location is faster.
- Crank long-prompt % to 60%. p99 TTFT on the co-located timeline becomes unusable. On the disaggregated timeline, p99 TTFT stays within a factor of ~2 of the mean.
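For readers without the interactive widget, a rough offline version of the same experiment can be sketched as follows. It is a deliberately crude model, not the widget's implementation: the co-located GPU serves each request's prefill and decode serially, the prefill pool is a single FIFO server, and every parameter default (`mean_gap_ms`, `prefill_ms_per_tok`, `decode_ms`, `transfer_ms_4k`) is an assumed value.

```python
import random


def simulate(n=500, long_frac=0.2, transfer_ms_4k=2.5, seed=0,
             mean_gap_ms=150.0, prefill_ms_per_tok=0.05, decode_ms=100.0):
    """Return (co-located TTFTs, disaggregated TTFTs) for the same workload."""
    rng = random.Random(seed)
    now = 0.0
    coloc_free = 0.0     # when the shared GPU is next idle
    prefill_free = 0.0   # when the prefill pool is next idle
    coloc, disagg = [], []
    for _ in range(n):
        now += rng.expovariate(1.0 / mean_gap_ms)   # Poisson-ish arrivals
        if rng.random() < long_frac:
            tokens = rng.randint(2000, 4000)        # long prompt
        else:
            tokens = rng.randint(50, 200)           # short prompt
        prefill = tokens * prefill_ms_per_tok
        transfer = transfer_ms_4k * tokens / 4096   # KV size scales with tokens

        # Co-located: the GPU stays busy through this request's decode too.
        start = max(now, coloc_free)
        coloc.append((start + prefill) - now)
        coloc_free = start + prefill + decode_ms

        # Disaggregated: the prefill pool is released right after prefill.
        start = max(now, prefill_free)
        prefill_free = start + prefill
        disagg.append((prefill_free + transfer) - now)
    return coloc, disagg


def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]


coloc, disagg = simulate()
print(f"p99 TTFT co-located: {p99(coloc):.1f} ms, disagg: {p99(disagg):.1f} ms")
```

Sweeping `long_frac` and `transfer_ms_4k` mirrors the "things to try" list. Note that this model charges the co-located GPU for decode time, so it will not reproduce the "co-location is faster at 0% long prompts" result unless `decode_ms` is set to 0.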
Takeaways
- Prefill and decode have opposite bottlenecks. Co-location forces them to share a stream they don't want to share.
- Long prefills under co-location punish every queued request's TTFT. p99 is where it shows.
- Disagg trades that interference for a KV transfer. NVLink makes it nearly free; PCIe is tolerable; RDMA depends on workload mix.
- Layer-wise streaming hides most of the transfer in the prefill itself.
- The architecture only wins when long-prompt mass × interference cost > transfer overhead. Below that, just co-locate.
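The last takeaway can be written down as a one-line check. This is a literal transcription of the inequality, read here in mean-latency terms with both costs in milliseconds per request; that reading, and the sample numbers, are assumptions for illustration.

```python
def disagg_wins(long_prompt_frac: float, interference_ms: float,
                transfer_ms: float) -> bool:
    """True when long-prompt mass x interference cost exceeds transfer overhead."""
    return long_prompt_frac * interference_ms > transfer_ms


print(disagg_wins(0.2, 200.0, 2.5))   # True:  NVLink-like transfer cost
print(disagg_wins(0.2, 200.0, 65.0))  # False: PCIe-like cost, same workload
print(disagg_wins(0.0, 200.0, 2.5))   # False: no long prompts to avoid
```

The rule only speaks to the average. As the PCIe experiment above notes, p99 can still improve even when the mean gets slightly worse.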