Prefill/decode disaggregation

Two phases with opposite bottlenecks share one GPU. The fix: don't. Run them on different hardware pools and pay one KV transfer for the privilege.

The asymmetry, one more time

Lesson 01 established it; here it drives the architecture. The same model, the same kernels, two phases:

	prefill	decode
shape	T tokens in parallel, one forward	1 token per forward, one forward per generated token
attention	O(T²) — full QK^T matrix	O(T) — one query against T keys/values
matmul shape	matrix × matrix (high arithmetic intensity)	matrix × vector (low arithmetic intensity)
bottleneck	compute-bound: latency ≈ FLOPs / GPU FLOP rate	memory-bound: latency ≈ (weight + KV bytes) / HBM bandwidth
what scales it	better TFLOPs (H100, MI300X)	more HBM, faster HBM

One run, opposite bottlenecks. That's the architectural smell. The two phases want different hardware, and forcing them to share a stream costs you on both ends.
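A back-of-envelope roofline sketch makes the two columns concrete. The hardware figures below are assumed round numbers, not vendor specs, and the "2 FLOPs per parameter per token" rule ignores the attention term, so treat the output as an order-of-magnitude estimate only.

```javascript
// Roofline-style estimate. PEAK_FLOPS and HBM_BW are assumed round numbers.
const PARAMS = 7e9;          // 7B-parameter model
const BYTES_PER_PARAM = 2;   // fp16 weights
const PEAK_FLOPS = 1e15;     // assumed ~1 PFLOP/s dense fp16
const HBM_BW = 3e12;         // assumed ~3 TB/s HBM bandwidth

// Prefill is compute-bound: ~2 FLOPs per parameter per prompt token
// (matmuls only; the O(T²) attention term is ignored here).
function prefillMs(promptTokens) {
  return ((2 * PARAMS * promptTokens) / PEAK_FLOPS) * 1e3;
}

// Decode is memory-bound: each step re-reads the weights plus the KV cache.
function decodeStepMs(kvBytes) {
  return ((PARAMS * BYTES_PER_PARAM + kvBytes) / HBM_BW) * 1e3;
}

console.log(prefillMs(4096).toFixed(1), "ms for a 4096-token prefill at peak FLOPs");
console.log(decodeStepMs(2 * 2 ** 30).toFixed(1), "ms per decode step with a 2 GiB KV cache");
```

Doubling `PEAK_FLOPS` halves the first number and leaves the second untouched; doubling `HBM_BW` does the reverse. That is the asymmetry in two lines.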

The co-located failure mode

Default vLLM does both phases on the same GPU, same stream. A long prefill blocks the queue for as long as it takes — hundreds of milliseconds for a 4k-token prompt — and every decode in flight waits.

CO-LOCATED, ONE GPU, ONE STREAM

t=0     t=80    t=120    t=320         t=400
├─decode─┤├arr.─┤├──── prefill ────────┤├─decode─
   r0,r1     r2       r2 (4000 tok)        r0,r1,r2
                ↑                       ↑
                r0, r1 stalled here     TTFT(r2) = ~280 ms
                token gap ← punished    TTFT(later arrivals) ← also punished

The cost lands on tail latency. p50 looks fine; p99 is awful because every request unlucky enough to arrive behind a long prefill picks up the prefill's full duration as queuing delay.

Chunked prefill (lesson 10) softens this — chunk the prompt into 512-token slices and pack each slice with one decode step in the same forward. But each chunk still steals compute from decode. The interference is reduced, not removed.
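As a minimal sketch of the chunking step (the function name is hypothetical; the 512-token slice size is the one quoted above), a 4000-token prompt turns into eight forwards, each of which can carry one decode step for the in-flight requests:

```javascript
// Split a prompt into fixed-size chunks. In chunked prefill, each chunk
// shares a forward pass with one decode step of the in-flight requests.
function chunkPrompt(promptTokens, chunkSize = 512) {
  const chunks = [];
  for (let start = 0; start < promptTokens; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize, promptTokens) });
  }
  return chunks;
}

const chunks = chunkPrompt(4000);
console.log(chunks.length);                    // 8
console.log(chunks[7].end - chunks[7].start);  // 416 (the last, partial chunk)
```

Decodes now wait for one chunk at a time instead of the whole prompt, but they still share every one of those eight forwards with prefill work.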

The disaggregated architecture

Run the two phases on two distinct GPU pools. Route based on phase, not on request.

                   ┌───────────────┐
                   │  router / LB  │  (short-prompts can skip pool A)
                   └───────┬───────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
        ┌───────────┐             ┌───────────┐
        │ PREFILL   │             │  DECODE   │
        │ pool A    │             │  pool B   │
        │           │             │           │
        │  few GPUs │             │ many GPUs │
        │ big FLOPs │  KV xfer    │ lots HBM  │
        │           │ ──────────► │           │
        │ batched   │             │ small per │
        │ long-ctx  │             │ step batch│
        │ prefill   │             │ many seqs │
        └───────────┘             └───────────┘
            ↑                          │
            └──── request arrives      └──► streams tokens to client

Key property: a long prefill on pool A does not block any decode on pool B. The two queues are independent. Tail latency on the decode side decouples from prompt-length distribution on the prefill side.

The flow per request:

1. Router admits to prefill pool A.
2. Pool A runs the prompt through the model, producing one full KV cache.
3. KV cache transfers to a GPU on pool B.
4. Pool B owns the request; runs decode steps; streams tokens to the client.
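The four steps can be sketched with in-memory stand-ins for the two pools. Everything here is hypothetical scaffolding (function names, the `kvByRequest` map, the fake "KV cache" that only counts tokens); no real serving API is implied.

```javascript
// Hypothetical stand-ins for the two pools; the "KV cache" is just a token count.
function prefillOnPoolA(request) {
  return { requestId: request.id, tokens: request.promptTokens };
}

// Hand the KV cache to a decode worker on pool B.
function transferKv(kv, decodeWorker) {
  decodeWorker.kvByRequest.set(kv.requestId, kv);
}

// Pool B owns the request from here on: one new token (and one KV entry) per step.
function decodeOnPoolB(requestId, decodeWorker, maxNewTokens) {
  const kv = decodeWorker.kvByRequest.get(requestId);
  const out = [];
  for (let i = 0; i < maxNewTokens; i++) {
    kv.tokens += 1;        // each decode step appends one token's KV
    out.push(`tok${i}`);   // placeholder for a sampled token
  }
  return out;
}

function serve(request, decodeWorker) {
  const kv = prefillOnPoolA(request);   // steps 1-2: admit and prefill
  transferKv(kv, decodeWorker);         // step 3: KV transfer
  return decodeOnPoolB(request.id, decodeWorker, request.maxNewTokens); // step 4
}

const worker = { kvByRequest: new Map() };
const tokens = serve({ id: "r2", promptTokens: 4000, maxNewTokens: 3 }, worker);
console.log(tokens, worker.kvByRequest.get("r2").tokens);
```

The point of the shape: after `transferKv` returns, pool A holds no state for the request and is free to start the next prefill.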

The cost: KV transfer

You've now added a network hop. The KV cache for a T-token prompt is exactly what lesson 01 quantified:

KV bytes = T · 2 · L · h_kv · d_head · bytes_per_dtype

For a 7B fp16 (32 layers, 32 kv heads, d_head 128): ~500 KB/tok (or ~60 KB/tok on a GQA model with fp8 KV — see lesson 09; this 8× reduction is what makes cross-node disagg practical at scale). A 4096-token prompt is 2 GB. Transfer time depends entirely on the fabric:

fabric	peak BW	time for 2 GB	verdict
NVLink (intra-node)	~800 GB/s	~2.5 ms	negligible
PCIe 4 (x16)	~30 GB/s	~65 ms	tolerable
RDMA / InfiniBand	~25–50 GB/s	~40–80 ms	non-trivial
regular ethernet (10 GbE)	~1 GB/s	~2000 ms	do not attempt

This is the central tradeoff. On NVLink the transfer is free; disagg almost always wins. Over RDMA the transfer competes with the interference it's meant to save — so disagg only wins on workloads with enough long prefills.
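The arithmetic behind those figures, as a small sketch. The 8-kv-head GQA configuration is an assumption chosen to reproduce the 8× reduction quoted above, and the bandwidths are the peak values from the table, so real transfers will be slower.

```javascript
// KV bytes per token = 2 (K and V) x layers x kv heads x head dim x bytes per element.
function kvBytesPerToken({ layers, kvHeads, headDim, bytesPerElem }) {
  return 2 * layers * kvHeads * headDim * bytesPerElem;
}

// Transfer time at peak bandwidth (GB/s); real links deliver less.
function transferMs(bytes, gbPerSec) {
  return (bytes / (gbPerSec * 1e9)) * 1e3;
}

const mhaFp16 = kvBytesPerToken({ layers: 32, kvHeads: 32, headDim: 128, bytesPerElem: 2 });
// Assumed GQA config: 8 kv heads with 1-byte (fp8) KV entries.
const gqaFp8 = kvBytesPerToken({ layers: 32, kvHeads: 8, headDim: 128, bytesPerElem: 1 });

console.log(mhaFp16);          // 524288 bytes per token (512 KiB)
console.log(mhaFp16 / gqaFp8); // 8

const promptBytes = mhaFp16 * 4096; // 2 GiB for a 4096-token prompt
const PEAK_GB_PER_S = { nvlink: 800, pcie4: 30, rdma: 25, tenGbE: 1 };
for (const [fabric, bw] of Object.entries(PEAK_GB_PER_S)) {
  console.log(fabric, transferMs(promptBytes, bw).toFixed(1), "ms");
}
```

Swapping `mhaFp16` for `gqaFp8` in `promptBytes` divides every row by eight, which is why the smaller cache format matters most on the slower fabrics.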

Optimizations that make disagg practical

1. Layer-wise streaming

Don't wait for the whole prefill to finish before transferring KV. As soon as layer i's KV is written, start sending it while layer i+1 computes. If the per-layer transfer time ≤ the per-layer compute time, the transfer hides entirely inside the prefill: the added TTFT cost is roughly one layer's transfer, not the full transfer.
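A tiny pipeline model shows both regimes. This is a sketch under simplifying assumptions (uniform per-layer times, one link, no setup latency); the function name and the example timings are made up for illustration.

```javascript
// How much transfer time is still outstanding when the last layer finishes
// computing, if each layer's KV is sent as soon as it is ready.
function exposedTransferMs(numLayers, computeMsPerLayer, xferMsPerLayer) {
  let computeDone = 0; // time at which the current layer's KV is written
  let linkFree = 0;    // time at which the link finishes the previous layer's KV
  for (let i = 0; i < numLayers; i++) {
    computeDone += computeMsPerLayer;
    const start = Math.max(computeDone, linkFree); // need both the data and the link
    linkFree = start + xferMsPerLayer;
  }
  return linkFree - computeDone;
}

// Transfer faster than compute: only the last layer's transfer is exposed.
console.log(exposedTransferMs(32, 6, 2)); // 2
// Transfer slower than compute: the link is the bottleneck and a backlog builds.
console.log(exposedTransferMs(32, 2, 6)); // 130
```

For comparison, sending everything after the prefill would expose 32 × 2 = 64 ms in the first case and 32 × 6 = 192 ms in the second.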

2. Block-aligned transfer

Paged KV blocks (lesson 02) are fixed-size, contiguous in physical memory. They're the natural transfer unit. The DMA engine moves one block at a time; the receiving end registers the same block in its own block manager. No serialization, no copy on either end.
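A sketch of the receiving side, assuming 16-token blocks and a plain free list (both are assumptions; the names are hypothetical and not taken from any particular block manager):

```javascript
const BLOCK_TOKENS = 16; // assumed block size in tokens

function blocksNeeded(tokens) {
  return Math.ceil(tokens / BLOCK_TOKENS);
}

// Receiver side: reserve one free physical block per incoming logical block.
// The returned table maps logical block index -> physical block id, which is
// where the incoming block's bytes are written.
function receiveBlocks(numBlocks, freeList) {
  if (freeList.length < numBlocks) {
    throw new Error("decode worker is out of KV blocks");
  }
  const blockTable = [];
  for (let logical = 0; logical < numBlocks; logical++) {
    blockTable.push(freeList.pop());
  }
  return blockTable;
}

const free = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0];
const table = receiveBlocks(blocksNeeded(100), free);
console.log(table); // [0, 1, 2, 3, 4, 5, 6]
console.log(free);  // [9, 8, 7]
```

The sender's physical block ids never cross the wire in this sketch; only the logical order matters, and each side keeps its own mapping.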

3. Shared prefix cache

If the prompt's prefix hits the prefill pool's APC (lesson 05), don't recompute it: prefill only the uncached suffix and transfer the cached KV for the rest. Chat system prompts hit 50-90% on production traffic, so much of the "prefill" work becomes a network-only operation.
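A sketch of the split between cached and uncached work, assuming reuse happens at whole-block granularity with 16-token blocks (an assumption; the function name is hypothetical):

```javascript
// Decide how much of a prompt can be served from the prefix cache.
// Only whole blocks are reusable, so the hit is rounded down to a block boundary.
function planPrefill(promptTokens, cachedPrefixTokens, blockTokens = 16) {
  const hit = Math.min(cachedPrefixTokens, promptTokens);
  const reusable = Math.floor(hit / blockTokens) * blockTokens;
  return { transferOnly: reusable, toPrefill: promptTokens - reusable };
}

console.log(planPrefill(1200, 1000)); // { transferOnly: 992, toPrefill: 208 }
console.log(planPrefill(1200, 0));    // { transferOnly: 0, toPrefill: 1200 }
```

In the first call the prefill pool computes 208 tokens instead of 1200, while the KV transfer still covers the full prompt.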

4. Mixed (1-phase) routing

For prompts under ~128 tokens, prefill is so cheap that the two-phase path (prefill on pool A plus KV transfer) costs more than running the prefill directly on the decode pool. The router applies a length threshold; short prompts skip the two-phase path entirely.
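The routing rule is a one-liner. A minimal sketch, with the ~128-token figure from the text as a default and hypothetical path names:

```javascript
const SHORT_PROMPT_TOKENS = 128; // threshold from the text; tune per deployment

// Short prompts run both phases on the decode pool; everything else goes
// through the prefill pool and pays the KV transfer.
function choosePath(promptTokens, threshold = SHORT_PROMPT_TOKENS) {
  return promptTokens < threshold ? "decode-pool-only" : "prefill-then-decode";
}

console.log(choosePath(60));   // decode-pool-only
console.log(choosePath(4000)); // prefill-then-decode
```

A real router would likely derive the threshold from measured prefill and transfer costs rather than hard-coding it.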

5. SLO-aware priority

Once the two phases are decoupled, you can reorder either queue independently. High-priority requests jump the prefill queue; bulk requests fill the gaps. Co-location made this nearly impossible — reordering interferes with both phases at once.
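One way to picture the prefill-side reordering is a priority queue with FIFO tie-breaking. This is an illustrative sketch, not any system's scheduler; the class name and the numeric `priority` field (lower runs first) are assumptions.

```javascript
// Prefill queue ordered by priority, then by arrival order.
class PrefillQueue {
  constructor() {
    this.items = [];
    this.seq = 0; // arrival counter used to keep equal priorities FIFO
  }
  push(request) {
    this.items.push({ ...request, seq: this.seq++ });
    // Lower priority number runs first; ties fall back to arrival order.
    this.items.sort((a, b) => a.priority - b.priority || a.seq - b.seq);
  }
  pop() {
    return this.items.shift();
  }
}

const q = new PrefillQueue();
q.push({ id: "bulk-1", priority: 2 });
q.push({ id: "bulk-2", priority: 2 });
q.push({ id: "chat-1", priority: 0 });
const order = [q.pop().id, q.pop().id, q.pop().id];
console.log(order); // [ 'chat-1', 'bulk-1', 'bulk-2' ]
```

Because the decode pool has its own queue, reordering here changes who gets prefilled next without touching any request that is already streaming tokens.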

The systems

system	contribution
DistServe (OSDI 2024)	First formal analysis. Defines goodput-under-SLO: requests served per second that meet both TTFT and per-token-latency targets. Shows when disagg beats co-location analytically — a function of prompt-length distribution and fabric BW.
Mooncake (Moonshot AI)	Production. Tiered KV cache: HBM → DRAM → SSD. Decode pool fetches KV from whichever tier holds it. Massive prefix-cache hit rates from chat-style workloads.
vLLM disagg (2026)	Experimental as of this writing. Inherits the DistServe + Mooncake ideas; ships with NCCL-based KV transport and the layer-wise streaming optimization.
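Goodput-under-SLO as described in the DistServe row can be computed from per-request measurements. A minimal sketch; the field names and the SLO values in the example are hypothetical.

```javascript
// Requests per second that meet BOTH the TTFT and the per-token latency target.
function goodput(requests, slo, windowSec) {
  const ok = requests.filter(
    (r) => r.ttftMs <= slo.ttftMs && r.tpotMs <= slo.tpotMs
  );
  return ok.length / windowSec;
}

const sample = [
  { ttftMs: 120, tpotMs: 20 },
  { ttftMs: 900, tpotMs: 18 }, // misses the TTFT target
  { ttftMs: 150, tpotMs: 65 }, // misses the per-token target
  { ttftMs: 200, tpotMs: 30 },
];
console.log(goodput(sample, { ttftMs: 500, tpotMs: 50 }, 2)); // 1 (request per second)
```

Raw throughput for the same window would be 2 requests per second; goodput counts only the half that met both targets.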

Interactive · bimodal-workload timeline

Below: a sequence of requests arrives on a Poisson-ish schedule, with prompts drawn from a bimodal distribution — most short (50-200 tokens), some long (2000-4000 tokens). Compare two architectures on the same workload:

Top timeline. Co-located: one GPU, serial. Prefills (orange) block decodes (blue).
Bottom timeline. Disaggregated: prefill pool runs to the left of the slash, decode pool to the right. The two pools are independent.

Things to try:

Default settings, NVLink. Disagg wins on p99 TTFT cleanly — the long prompts don't poison the queue for everyone else.
Same settings on PCIe. p99 still improves, mean E2E gets slightly worse — the transfer fee shows up.
Drop long-prompt % to 0. Now disagg adds the transfer cost without saving any interference. Co-location is faster.
Crank long-prompt % to 60%. p99 TTFT on the co-located timeline becomes unusable. Disagg flattens to within a factor of ~2 of the mean.

Co-located vs disaggregated, bimodal prompt mix

Each request is one colored band. Orange = prefill, blue = decode. Total request count and long-prompt % control the workload; the fabric controls the per-token transfer time.

Controls: requests (default 40), long-prompt % (default 20%), fabric.

Metrics reported for each architecture (co-located and disaggregated): mean TTFT, p99 TTFT, mean E2E, and requests with TTFT > 500 ms.

The timing model:

// Simplified timing model (all values in ms):
const PREFILL_PER_TOK = 0.5;            // per prompt token; compute-bound, batch of 1
const DECODE_STEP_MS  = 10;             // one decode step over the active set
const XFER_PER_TOK = {                  // KV transfer cost per prompt token, by fabric
  nvlink: 0.005,
  pcie:   0.10,
  rdma:   0.12,
};
// Co-located: prefill and decode share one stream, serialized.
// Disagg: prefill pool FIFO + decode pool independent loop; KV transfer adds latency
//         before the request is eligible for decode.
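One possible reading of those comments as runnable code, heavily simplified. The constants are repeated so the block stands alone. The `outputTokens` field, both function names, and the choice to count the KV transfer as part of disaggregated TTFT are assumptions; the co-located path is modelled as fully serial (a request holds the GPU for its prefill and all its decode steps), and decode batching is ignored.

```javascript
const PREFILL_PER_TOK = 0.5;
const DECODE_STEP_MS = 10;
const XFER_PER_TOK = { nvlink: 0.005, pcie: 0.10, rdma: 0.12 };

// Co-located, serial: each request holds the GPU for prefill plus all decode steps.
function ttftCoLocated(requests) {
  let gpuFree = 0;
  return requests.map((r) => {
    const start = Math.max(r.arrival, gpuFree);
    const firstToken = start + r.promptTokens * PREFILL_PER_TOK;
    gpuFree = firstToken + r.outputTokens * DECODE_STEP_MS;
    return firstToken - r.arrival;
  });
}

// Disaggregated: the prefill pool is a FIFO over prefills only; decode happens
// elsewhere, so it never delays the next prefill. The KV transfer is added
// before the request becomes eligible for decode.
function ttftDisagg(requests, fabric) {
  let prefillFree = 0;
  return requests.map((r) => {
    const start = Math.max(r.arrival, prefillFree);
    prefillFree = start + r.promptTokens * PREFILL_PER_TOK;
    return prefillFree + r.promptTokens * XFER_PER_TOK[fabric] - r.arrival;
  });
}

const reqs = [
  { arrival: 0, promptTokens: 4000, outputTokens: 50 },  // long prompt
  { arrival: 100, promptTokens: 100, outputTokens: 50 }, // short prompt behind it
];
console.log(ttftCoLocated(reqs));        // [ 2000, 2450 ]
console.log(ttftDisagg(reqs, "nvlink")); // ≈ [ 2020, 1950.5 ]
```

In this two-request example the long prompt pays a 20 ms transfer fee under disaggregation, while the short prompt behind it no longer waits for the long request's 500 ms of decode.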

Takeaways

What to keep

Prefill and decode have opposite bottlenecks. Co-location forces them to share a stream they don't want to share.
Long prefills under co-location punish every queued request's TTFT. p99 is where it shows.
Disagg trades that interference for a KV transfer. NVLink makes it free; PCIe is tolerable; RDMA depends on workload mix.
Layer-wise streaming hides most of the transfer in the prefill itself.
The architecture only wins when long-prompt mass × interference cost > transfer overhead. Below that, just co-locate.