all_lessons / ml_system_design / 05 · at scale lesson 5 / 20

Scaling inference — replication, TP, and disaggregation

Lesson 04 sized one replica and found its KV batch ceiling. But a real service has two problems that one replica can't solve: Little's Law (lesson 03) hands you a concurrency target far bigger than one GPU's batch ceiling, and some models don't even fit on one GPU (70B = 140 GB > 80 GB). This lesson is the three orthogonal axes that fix those — and the recipe for choosing between them, all in bytes, FLOPs, and dollars.

Three axes, three different binding constraints
Each scaling move answers a different question. Replication answers "not enough QPS." Tensor parallel answers "model doesn't fit" or "decode too slow at batch 1." Prefill/decode disaggregation answers "long prompts are stalling my decodes." Reaching for the wrong axis wastes GPUs. The decision tree at the end picks for you.

1 · Replication — buy throughput linearly

The simplest axis: run N identical replicas behind a router, each holding the full model (2N bytes from lesson 02) plus its own KV cache. A replica is exactly the single-replica design of lesson 04, copied. Replicas share nothing, so throughput scales linearly and tokens/s/GPU stays flat — this is the cheap axis.

From lesson 03, Little's Law gave a concurrency target L = λ·W. Lesson 04 gave one replica's batch ceiling (KV-limited concurrent requests). The replica count is just the ratio:

replicas = ceil( target_concurrency / per_replica_capacity )
Worked: the chat service from lesson 03
Lesson 03 sized a chat fleet at L ≈ 15,300 concurrent requests, and lesson 04's KV math fit ~50 such requests per H100. Replicas = ceil(15300/50) = 306. Llama-3-70B fits one H100? No — 140 GB > 80 GB. So each "replica" is itself a TP group (axis 2), and we replicate that group 306-ish times. Replication handles the QPS; TP handles the fit. They compose.

Use replication alone when the model fits one GPU (7B = 14 GB) and you only need more QPS. It is the first thing to reach for and the last thing to over-think.

2 · Tensor parallel — shard one model across GPUs

When the model doesn't fit one GPU, you split each weight matrix across k GPUs that hold one model collaboratively — tensor parallelism (TP), detailed in system_ml 06. TP is needed for two distinct reasons:

(a) Weights don't fit. At 2N bytes: 70B = 140 GB needs TP ≥ 2; 405B = 810 GB needs TP ≥ 11 — which exceeds the 8 GPUs in one node, so it spills across nodes (a problem we flag below). Minimum TP to fit:

TP_fit = ceil( 2N_bytes / (HBM − overhead) )

(b) Decode is too slow even at batch 1. From lesson 01, decode is bandwidth-bound: per-token latency ≈ model_bytes / BW. TP splits the weights across k GPUs, so each GPU streams only 2N/k bytes per token — decode latency drops ~linearly with TP degree. A 70B at fp16 on one H100 streams 140 GB / 3.35 TB·s ≈ 42 ms/token; at TP=4 each GPU streams 35 GB ≈ 10 ms/token. That is how you hit a tight TPOT SLO.

TP is NVLink-bound — keep it inside a node
TP does an all-reduce every layer (lesson 02's collectives), so it lives or dies on interconnect bandwidth. Intra-node NVLink is ~900 GB/s; inter-node InfiniBand is ~50 GB/s — the 18× gap from lesson 02. Cross-node TP makes the per-layer all-reduce the bottleneck and wrecks both latency and throughput. Rule: TP ≤ 8 (one node). Models needing TP > 8 (like 405B) combine TP-within-node with other parallelism — see system_ml 11 for the full TP-vs-replication trade.

The cost of TP: the per-layer all-reduce is pure overhead, so tokens/s/GPU drops slightly with TP degree (typically a few % to ~15% at TP=8). That's why TP is for fit and latency, not for cheap throughput — for throughput you replicate.

3 · Prefill/decode disaggregation — stop the two phases fighting

From lesson 01, the two phases have opposite roofline character: prefill is compute-bound (big matmuls over the whole prompt), decode is bandwidth-bound (one token, weights re-read). Co-locate them on the same GPU and a long prefill monopolizes the SMs while in-flight decodes stall — exactly the TTFT-vs-TPOT contention lesson 04 saw inside one replica.

Disaggregation runs separate GPU pools: a prefill pool and a decode pool. A request prefills on the prefill pool, then its KV cache is transferred over the network to a decode pool that streams tokens. Now a 4K-token prefill never stalls another user's decode — they're on different hardware.

co-located: prefill blocks decode on the same GPU long prefill (compute-bound) decodes wait → TPOT spikes disaggregated: separate pools, KV transferred between prefill pool KV decode pool (steady TPOT) cost: one KV-cache transfer per request over the network (lesson 02's bytes/token × prompt length).

Disaggregation wins when the workload has both long prompts and tight TTFT and TPOT SLOs — the regime where co-location forces you to choose one or the other. Its cost is the KV transfer: a 4K-token Llama-70B prompt at 320 KB/token (lesson 02) is ~1.3 GB to ship per request, so disaggregation needs fast interconnect and pays off only when the contention it removes is worse than the transfer it adds. Mechanics in vLLM 08 and system_ml 12.

The decision recipe

The three axes compose, but you apply them in order, each gated by a specific constraint:

model fits one GPU (2N < HBM)? yes replicate to meet concurrency no TP within node until it fits, then replicate the TP group TPOT SLO still missed at batch 1? raise TP degree long prompts + strict TTFT and TPOT? disaggregate
  1. Fits one GPU? Yes → replicate. No → TP within a node until the weights fit (TP ≤ 8), then replicate that TP group for QPS.
  2. Latency SLO still missed at batch 1? → raise TP degree (more bandwidth per token, lower decode latency).
  3. Long prompts and strict TTFT and TPOT? → disaggregate prefill from decode.

Routing — which replica gets the request

With many replicas you need a router. Three policies:

PolicyHowBest when
Round-robinnext replica in rotationuniform, stateless requests
Least-loadedfewest in-flight / lowest queuevariable request sizes
Cache-awareroute by prefix → same replicashared system prompts / agents

Cache-aware routing sends requests that share a prefix to the same replica so they hit its already-resident KV cache — turning a redundant prefill into a free cache read (lesson 03's prefix-sharing structure). For a fleet serving one big system prompt this can cut prefill cost dramatically; mechanics in SGLang 05.

Autoscaling — scale on the right signal

Do NOT autoscale on GPU utilization
A serving GPU that is memory-bound and fully batched (lesson 01: decode is bandwidth-bound) can show modest SM utilization while being completely full — every KV block taken, queue building. Naive util-based autoscaling reads "50% utilized, don't scale" and under-provisions exactly when you're saturated. Scale on queue depth, goodput, or KV-cache utilization — signals that track the real binding constraint.

Cold-start lag. Spinning up a new replica means loading weights into HBM: 140 GB for a 70B at, say, a few GB/s from storage is minutes. Demand can spike faster than you can scale. The fix is a warm pool (pre-loaded idle replicas) or over-provisioning for the autoscale-lag window — you can't scale reactively when the reaction takes minutes.

Multi-tenancy — many models on one fleet

A platform serves many models. Two structures: per-model min replicas (each model guaranteed capacity — predictable latency, wasteful for cold models) versus a shared pool (models swap in on demand — efficient, but pays cold-start lag on a miss). The cheap middle path for fine-tunes is LoRA multiplexing: one base model in HBM, many small low-rank adapters swapped per request — serve dozens of fine-tunes off one base model's weight bytes instead of dozens of full copies (see vLLM 12).

Fleet cost — why replication is cheap and TP is not

From lesson 02, the serving unit cost:

$/1M tok = ($/gpu-hr) / (achieved_tok_per_s_per_gpu · 3600) · 1e6

The denominator is tokens/s/GPU. Replication copies a whole replica, so per-GPU throughput is unchanged — cost/token stays flat as you add QPS. TP splits one model and adds a per-layer all-reduce, so tokens/s/GPU drops a little — cost/token rises slightly with TP degree. The cost lesson: TP is the tax you pay for fit or latency; replication is the free axis for throughput. Never use TP to buy QPS that replication could buy more cheaply.

Interactive · Replicate vs TP planner

Pick a model and GPU, set the concurrency target from Little's Law (lesson 03), and slide the TP degree. Watch the headline flip between "model fits — replication is cheapest" and "model needs TP ≥ k to fit," and see how total GPUs and relative decode latency move.

Replicate vs TP planner

Assumptions: weights = 2N bytes (fp16); ~8 GB/GPU reserved for activations + framework overhead; the rest is KV room; per-request KV footprint fixed at ~2.6 GB (lesson 02's 8K-context Llama-70B); decode latency ∝ 1/TP; $3/GPU-hr. Order-of-magnitude only.

min TP to fit
replicas
total GPUs
rel. decode latency

What carries forward