Prefill/decode disaggregation

Two phases with opposite bottlenecks share one GPU. The fix: don't. Run them on different hardware pools and pay one KV transfer for the privilege.

The asymmetry, one more time

Lesson 01 established it; here it drives the architecture. The same model, the same kernels, two phases:

                prefill                                         decode
shape           T tokens in parallel, one forward               1 token per forward, one forward per output token
attention       O(T²): full QKᵀ matrix                          O(T) per step: one query against T keys/values
matmul shape    matrix × matrix (high arithmetic intensity)     matrix × vector (low arithmetic intensity)
bottleneck      compute-bound: latency ≈ FLOPs / GPU FLOP rate  memory-bound: latency ≈ KV bytes / HBM bandwidth
what scales it  more TFLOPs (H100, MI300X)                      more HBM, faster HBM

One run, opposite bottlenecks. That's the architectural smell. The two phases want different hardware, and forcing them to share a stream costs you on both ends.
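
A rough way to see the asymmetry in numbers is to compare arithmetic intensity (FLOPs per byte of weights read) against a GPU's ridge point. The sketch below is a back-of-envelope estimate, not a measurement: the hardware figures are assumed round numbers, it uses the common ~2 FLOPs per parameter per token approximation, and it ignores KV and activation traffic.

```javascript
// All constants are illustrative assumptions, not vendor specifications.
const PARAMS = 7e9;          // 7B-parameter model
const BYTES_PER_PARAM = 2;   // fp16 weights
const PEAK_FLOPS = 1e15;     // assumed ~1 PFLOP/s dense fp16
const HBM_BW = 3e12;         // assumed ~3 TB/s HBM bandwidth

// FLOPs performed per byte of weights streamed from HBM in one forward pass.
function intensity(tokensPerForward) {
  const flops = 2 * PARAMS * tokensPerForward; // ~2 FLOPs per param per token
  const bytes = PARAMS * BYTES_PER_PARAM;      // weights are read once per forward
  return flops / bytes;
}

// Above the ridge point the GPU runs out of FLOPs first; below it, bandwidth.
const RIDGE = PEAK_FLOPS / HBM_BW;

function boundFor(tokensPerForward) {
  return intensity(tokensPerForward) >= RIDGE ? "compute" : "memory";
}

console.log(boundFor(4096)); // prefill of a 4096-token prompt -> "compute"
console.log(boundFor(1));    // single-sequence decode step    -> "memory"
```

Under these assumptions a prefill forward does thousands of FLOPs per weight byte, while a single-sequence decode step does about one, which is why the two phases sit on opposite sides of the ridge.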

The co-located failure mode

Default vLLM does both phases on the same GPU, same stream. A long prefill blocks the queue for as long as it takes — hundreds of milliseconds for a 4k-token prompt — and every decode in flight waits.

CO-LOCATED, ONE GPU, ONE STREAM

t=0     t=80    t=120    t=320         t=400
├─decode─┤├arr.─┤├──── prefill ────────┤├─decode─
   r0,r1     r2       r2 (4000 tok)        r0,r1,r2
                ↑                       ↑
                r0, r1 stalled here     TTFT(r2) = ~280 ms
                no tokens for r0, r1    inter-token latency(r0,r1) ← punished

The cost lands on tail latency. p50 looks fine; p99 is awful because every request unlucky enough to arrive behind a long prefill picks up the prefill's full duration as queuing delay.

Chunked prefill (lesson 10) softens this — chunk the prompt into 512-token slices and pack each slice with one decode step in the same forward. But each chunk still steals compute from decode. The interference is reduced, not removed.
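
As a minimal sketch of the slicing described here (lesson 10 covers the real scheduler), the helper below cuts a prompt into 512-token chunks and pairs each chunk with one decode token per running sequence. The function names and the step shape are hypothetical, chosen only for illustration.

```javascript
// Split a prompt of T tokens into fixed-size chunks; the last one may be short.
function chunkPrompt(T, chunkSize = 512) {
  const chunks = [];
  for (let start = 0; start < T; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize, T) });
  }
  return chunks;
}

// Each forward pass carries one prefill chunk plus one decode token
// for every sequence that is already generating.
function buildSteps(T, runningSeqs, chunkSize = 512) {
  return chunkPrompt(T, chunkSize).map((c) => ({
    prefillTokens: c.end - c.start,
    decodeTokens: runningSeqs,
  }));
}

const steps = buildSteps(4000, 2);
console.log(steps.length);           // number of forward passes the prompt is spread over
console.log(steps[steps.length - 1]); // the final, shorter chunk
```

Every step still spends most of its tokens on prefill work, which is the residual interference the paragraph above points at.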

The disaggregated architecture

Run the two phases on two distinct GPU pools. Route based on phase, not on request.

                   ┌───────────────┐
                   │  router / LB  │  (short-prompts can skip pool A)
                   └───────┬───────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
        ┌───────────┐             ┌───────────┐
        │ PREFILL   │             │  DECODE   │
        │ pool A    │             │  pool B   │
        │           │             │           │
        │  few GPUs │             │ many GPUs │
        │ big FLOPs │  KV xfer    │ lots HBM  │
        │           │ ──────────► │           │
        │ batched   │             │ small per │
        │ long-ctx  │             │ step batch│
        │ prefill   │             │ many seqs │
        └───────────┘             └───────────┘
            ↑                          │
            └──── request arrives      └──► streams tokens to client

Key property: a long prefill on pool A does not block any decode on pool B. The two queues are independent. Tail latency on the decode side decouples from prompt-length distribution on the prefill side.

The flow per request:

  1. Router admits to prefill pool A.
  2. Pool A runs the prompt through the model, producing one full KV cache.
  3. KV cache transfers to a GPU on pool B.
  4. Pool B owns the request; runs decode steps; streams tokens to the client.
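
The four steps can be modeled as a handoff of KV ownership between two pools. This is a toy in-memory sketch: `makePool`, `prefill`, `transferKv`, `decode`, and `serve` are hypothetical names, not vLLM APIs, and the "KV cache" is just a token counter.

```javascript
// A pool tracks which requests' KV caches it currently holds.
function makePool(name) {
  return { name, kv: new Map() }; // request id -> { tokens }
}

// Step 2: pool A runs the prompt and ends up holding the full KV cache.
function prefill(pool, req) {
  pool.kv.set(req.id, { tokens: req.promptTokens });
}

// Step 3: the KV cache moves to pool B; pool A frees its copy.
function transferKv(src, dst, reqId) {
  const kv = src.kv.get(reqId);
  src.kv.delete(reqId);
  dst.kv.set(reqId, kv);
}

// Step 4: pool B owns the request; each decode step extends the KV by one token.
function decode(pool, req, maxNewTokens) {
  const kv = pool.kv.get(req.id);
  const out = [];
  for (let i = 0; i < maxNewTokens; i++) {
    kv.tokens += 1;
    out.push(`tok${i}`); // placeholder for a sampled token
  }
  return out;
}

function serve(req, poolA, poolB, maxNewTokens) {
  prefill(poolA, req);                     // steps 1-2: admit and prefill on pool A
  transferKv(poolA, poolB, req.id);        // step 3: KV handoff
  return decode(poolB, req, maxNewTokens); // step 4: decode and stream from pool B
}

const poolA = makePool("prefill");
const poolB = makePool("decode");
console.log(serve({ id: "r1", promptTokens: 10 }, poolA, poolB, 3));
```

After `serve` returns, pool A holds nothing for the request, which is the property that lets its queue move on to the next prompt.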

The cost: KV transfer

You've now added a network hop. The KV cache for a T-token prompt is exactly what lesson 01 quantified:

KV bytes = T · 2 · L · h_kv · d_head · bytes_per_dtype

For a 7B fp16 model (32 layers, 32 KV heads, d_head 128) that is ~500 KB per token. A GQA model with fp8 KV brings it down to ~60 KB per token (see lesson 09); this 8× reduction is what makes cross-node disagg practical at scale. At ~500 KB per token, a 4096-token prompt is 2 GB. Transfer time depends entirely on the fabric:
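
The arithmetic is easy to check directly. In the sketch below the fp16 configuration is the one quoted above; the GQA configuration (8 KV heads, 1 byte per element) is an assumption, picked because it reproduces the 8× reduction.

```javascript
// KV bytes per token: 2 (K and V) · layers · KV heads · head dim · bytes per element.
function kvBytesPerToken({ layers, kvHeads, headDim, bytesPerElem }) {
  return 2 * layers * kvHeads * headDim * bytesPerElem;
}

const mhaFp16 = kvBytesPerToken({ layers: 32, kvHeads: 32, headDim: 128, bytesPerElem: 2 });
// Assumed GQA shape: 8 KV heads with fp8 KV (1 byte per element).
const gqaFp8 = kvBytesPerToken({ layers: 32, kvHeads: 8, headDim: 128, bytesPerElem: 1 });

console.log(mhaFp16);          // 524288 bytes (512 KiB) per token
console.log(gqaFp8);           // 65536 bytes (64 KiB) per token
console.log(mhaFp16 / gqaFp8); // 8
console.log(mhaFp16 * 4096);   // 2147483648 bytes (2 GiB) for a 4096-token prompt
```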

fabric                      peak BW        time for 2 GB   verdict
NVLink (intra-node)         ~800 GB/s      ~2.5 ms         negligible
PCIe 4 (x16)                ~30 GB/s       ~65 ms          tolerable
RDMA / InfiniBand           ~25–50 GB/s    ~40–80 ms       non-trivial
regular ethernet (10 GbE)   ~1 GB/s        ~2000 ms        do not attempt

This is the central tradeoff. On NVLink the transfer is effectively free, so disagg almost always wins. Over RDMA the transfer cost is comparable to the interference it is meant to remove, so disagg only wins on workloads with enough long prefills.
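
The fabric table is just bytes divided by bandwidth. A small sketch, using the peak figures quoted above (real links sustain less than peak, so treat the results as lower bounds; the RDMA entry uses the low end of the quoted range):

```javascript
// Peak bandwidths in GB/s, taken from the fabric table.
const FABRIC_GB_PER_S = { nvlink: 800, pcie4: 30, rdma: 25, ethernet10g: 1 };

// Time in ms to move `bytes` over a link with the given bandwidth in GB/s.
function transferMs(bytes, gbPerSec) {
  return (bytes / (gbPerSec * 1e9)) * 1e3;
}

const KV_BYTES = 2e9; // the 2 GB KV cache of a 4096-token prompt
for (const [fabric, bw] of Object.entries(FABRIC_GB_PER_S)) {
  console.log(fabric, transferMs(KV_BYTES, bw).toFixed(1), "ms");
}
```

Swapping in the ~60 KB/tok GQA fp8 figure shrinks every row by the same 8×, which is why that reduction matters most on the slower fabrics.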

Optimizations that make disagg practical

1. Layer-wise streaming
Don't wait for the whole prefill to finish before transferring KV. As layer L's KV is written, start sending it while layer L+1 is computing. If each layer's transfer time ≤ its compute time, the transfer hides almost entirely inside the prefill: added TTFT cost ≈ the last layer's transfer, not the full transfer.
2. Block-aligned transfer
Paged KV blocks (lesson 02) are fixed-size, contiguous in physical memory. They're the natural transfer unit. The DMA engine moves one block at a time; the receiving end registers the same block in its own block manager. No serialization, no copy on either end.
3. Shared prefix cache
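If the prompt's prefix hits the prefill pool's automatic prefix cache (APC, lesson 05), skip the prefill for the cached tokens and just transfer their KV; only the uncached suffix still needs compute. Chat system prompts see hit rates of 50-90% on production traffic, so most "prefills" become largely network-only operations.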
4. Mixed (1-phase) routing
For prompts under ~128 tokens, the prefill is so cheap that the two-phase path (prefill on pool A plus KV transfer) costs more than just running the prefill on the decode pool. The router has a length threshold; short prompts skip the two-phase path entirely.
5. SLO-aware priority
Once the two phases are decoupled, you can reorder either queue independently. High-priority requests jump the prefill queue; bulk requests fill the gaps. Co-location made this nearly impossible — reordering interferes with both phases at once.
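
The overlap claim in optimization 1 can be checked with a two-stage pipeline model. This is a simplified sketch under stated assumptions: every layer has the same compute and transfer time, and one link sends layers strictly in order. The function names are illustrative only.

```javascript
// Time at which the last layer's KV lands on the decode side when each layer's
// transfer starts as soon as that layer is computed and the link is free.
function streamedFinishMs(layers, computeMs, xferMs) {
  let computeDone = 0; // when the current layer's KV has been written
  let xferDone = 0;    // when the link finishes sending the previous layer
  for (let i = 0; i < layers; i++) {
    computeDone += computeMs;
    xferDone = Math.max(xferDone, computeDone) + xferMs;
  }
  return xferDone;
}

// Baseline: finish the whole prefill, then send everything.
function serialFinishMs(layers, computeMs, xferMs) {
  return layers * computeMs + layers * xferMs;
}

// Extra TTFT relative to a prefill with no transfer at all.
function addedTtftMs(finishMs, layers, computeMs) {
  return finishMs - layers * computeMs;
}

const L = 32;
console.log(addedTtftMs(streamedFinishMs(L, 2, 1), L, 2)); // transfer faster than compute
console.log(addedTtftMs(serialFinishMs(L, 2, 1), L, 2));   // no overlap
console.log(addedTtftMs(streamedFinishMs(L, 1, 2), L, 1)); // transfer slower than compute
```

When the per-layer transfer fits inside the per-layer compute, only the final layer's send is exposed; once the link is the slower stage, the pipeline is transfer-bound and streaming hides only the compute.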

The systems

system                   contribution
DistServe (OSDI 2024)    First formal analysis. Defines goodput-under-SLO: requests served per second that meet both TTFT and per-token-latency targets. Shows analytically when disagg beats co-location, as a function of prompt-length distribution and fabric BW.
Mooncake (Moonshot AI)   Production. Tiered KV cache: HBM → DRAM → SSD. Decode pool fetches KV from whichever tier holds it. Massive prefix-cache hit rates from chat-style workloads.
vLLM disagg (2026)       Experimental as of this writing. Inherits the DistServe + Mooncake ideas; ships with NCCL-based KV transport and the layer-wise streaming optimization.
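
Goodput-under-SLO, as worded in the table, is straightforward to compute from per-request measurements. The sketch below follows this lesson's wording rather than any particular paper's exact formulation; the field names (`ttftMs`, `tpotMs`) and the sample numbers are hypothetical.

```javascript
// Requests per second that meet BOTH the TTFT and the per-token-latency target.
function goodput(requests, slo, windowSec) {
  const ok = requests.filter(
    (r) => r.ttftMs <= slo.ttftMs && r.tpotMs <= slo.tpotMs
  );
  return ok.length / windowSec;
}

// Made-up measurements over a 2-second window.
const measured = [
  { ttftMs: 200, tpotMs: 20 }, // meets both targets
  { ttftMs: 900, tpotMs: 20 }, // misses TTFT
  { ttftMs: 200, tpotMs: 80 }, // misses per-token latency
  { ttftMs: 100, tpotMs: 10 }, // meets both targets
];

console.log(goodput(measured, { ttftMs: 500, tpotMs: 50 }, 2)); // 1 request/s
```

Raw throughput for the same window would be 2 requests/s; goodput counts only the half that stayed inside both targets.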

Interactive · bimodal-workload timeline

Below: a sequence of requests arrives on a Poisson-ish schedule, with prompts drawn from a bimodal distribution: most short (50-200 tokens), some long (2000-4000 tokens). The timeline compares the two architectures on the same workload.

Co-located vs disaggregated, bimodal prompt mix
Each request is one colored band. Orange = prefill, blue = decode. Total request count and long-prompt % control the workload; the fabric controls the per-token transfer time.

Reported for both co-located and disaggregated: mean TTFT, p99 TTFT, mean E2E, and requests with TTFT > 500 ms.

The timing model:
// Per-token costs (ms):
const PREFILL_PER_TOK = 0.5;            // compute-bound, batch-of-1
const DECODE_STEP_MS  = 10;             // one decode step over the active set
const XFER_PER_TOK = {                  // KV transfer bandwidth → per-token cost
  nvlink: 0.005,
  pcie:   0.10,
  rdma:   0.12,
};
// Co-located: prefill and decode share one stream, serialized.
// Disagg: prefill pool FIFO + decode pool independent loop; KV transfer adds latency
//         before the request is eligible for decode.
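
One way to turn these constants into a runnable comparison is sketched below. It is not the page's simulator. It assumes a single co-located stream that runs any waiting prefill before the next decode step, a single FIFO prefill worker on the disagg side, a decode pool that never runs out of batch capacity, and a fixed three-request workload instead of the Poisson-ish one.

```javascript
// Constants copied from the timing model (ms).
const PREFILL_PER_TOK = 0.5;
const DECODE_STEP_MS = 10;
const XFER_PER_TOK = { nvlink: 0.005, pcie: 0.10, rdma: 0.12 };

// Co-located: one stream; a waiting prefill stalls every active decode.
function simulateColocated(reqs) {
  const pending = [...reqs].sort((a, b) => a.arrive - b.arrive);
  const active = [];
  const done = [];
  let t = 0;
  while (pending.length || active.length) {
    if (pending.length && pending[0].arrive <= t) {
      const r = pending.shift();
      t += r.promptTokens * PREFILL_PER_TOK;
      // Prefill yields the first token (assumes outTokens >= 2).
      active.push({ ...r, ttft: t - r.arrive, left: r.outTokens - 1 });
    } else if (active.length) {
      t += DECODE_STEP_MS; // one step emits one token per active request
      for (let i = active.length - 1; i >= 0; i--) {
        active[i].left -= 1;
        if (active[i].left <= 0) {
          const { left, ...r } = active[i];
          done.push({ ...r, e2e: t - r.arrive });
          active.splice(i, 1);
        }
      }
    } else {
      t = pending[0].arrive; // idle until the next arrival
    }
  }
  return done;
}

// Disagg: FIFO prefill worker, then KV transfer, then uninterrupted decode.
function simulateDisagg(reqs, fabric) {
  let prefillFree = 0;
  return [...reqs].sort((a, b) => a.arrive - b.arrive).map((r) => {
    const start = Math.max(r.arrive, prefillFree);
    prefillFree = start + r.promptTokens * PREFILL_PER_TOK;
    const ready = prefillFree + r.promptTokens * XFER_PER_TOK[fabric];
    const ttft = ready - r.arrive; // assumption: first token only after transfer
    return { ...r, ttft, e2e: ttft + (r.outTokens - 1) * DECODE_STEP_MS };
  });
}

const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;

const workload = [
  { id: 0, arrive: 0, promptTokens: 100, outTokens: 50 },
  { id: 1, arrive: 10, promptTokens: 4000, outTokens: 50 },
  { id: 2, arrive: 20, promptTokens: 100, outTokens: 50 },
];
const co = simulateColocated(workload);
const dis = simulateDisagg(workload, "nvlink");
console.log("mean E2E (co):", mean(co.map((r) => r.e2e)));
console.log("mean E2E (disagg):", mean(dis.map((r) => r.e2e)));
```

In this sketch the gain shows up in E2E for the request that was already admitted when the long prompt arrived: co-located, its decode waits out the whole 4000-token prefill; disaggregated, it keeps stepping. A short request queued behind the long prompt on the single FIFO prefill worker still waits for it, so adding prefill workers or priority (optimization 5) is what moves TTFT.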

Takeaways

What to keep