Scaling inference — replication, TP, and disaggregation
Lesson 04 sized one replica and found its KV batch ceiling. But a real service has two problems that one replica can't solve: Little's Law (lesson 03) hands you a concurrency target far bigger than one GPU's batch ceiling, and some models don't even fit on one GPU (70B = 140 GB > 80 GB). This lesson covers the three orthogonal axes that fix those, and the recipe for choosing between them, all in bytes, FLOPs, and dollars.
1 · Replication — buy throughput linearly
The simplest axis: run R identical replicas behind a router, each holding the full model (2N bytes for N parameters at fp16, from lesson 02) plus its own KV cache. A replica is exactly the single-replica design of lesson 04, copied. Replicas share nothing, so throughput scales linearly and tokens/s/GPU stays flat. This is the cheap axis.
From lesson 03, Little's Law gave a concurrency target L = λ·W. Lesson 04 gave one replica's batch ceiling (KV-limited concurrent requests). The replica count is just the ratio, rounded up: replicas = ⌈L / batch_ceiling⌉.
Use replication alone when the model fits one GPU (7B = 14 GB) and you only need more QPS. It is the first thing to reach for and the last thing to over-think.
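The ratio of concurrency target to batch ceiling can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function name and the example numbers (50 req/s, 8 s per request, 64 concurrent requests per replica) are hypothetical.

```python
import math

def replicas_needed(arrival_rate_qps: float, latency_s: float, batch_ceiling: int) -> int:
    """Replicas needed to hold the Little's Law concurrency L = lambda * W."""
    if batch_ceiling <= 0:
        raise ValueError("batch_ceiling must be positive")
    concurrency_target = arrival_rate_qps * latency_s  # L = lambda * W
    # Round up: a fractional replica still costs a whole replica.
    return max(1, math.ceil(concurrency_target / batch_ceiling))

# Hypothetical workload: L = 50 * 8 = 400 concurrent requests.
print(replicas_needed(50, 8.0, 64))  # ceil(400 / 64) = 7
```

Rounding up matters: 400 / 64 = 6.25, and six replicas would leave 16 requests queued at steady state.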
2 · Tensor parallel — shard one model across GPUs
When the model doesn't fit one GPU, you split each weight matrix across k GPUs that hold one model collaboratively — tensor parallelism (TP), detailed in system_ml 06. TP is needed for two distinct reasons:
(a) Weights don't fit. At 2N bytes: 70B = 140 GB needs TP ≥ 2; 405B = 810 GB needs TP ≥ 11, which exceeds the 8 GPUs in one node, so it spills across nodes (a problem we flag below). Minimum TP to fit, counting weights only on 80 GB GPUs: TP_min = ⌈2N / HBM_per_GPU⌉.
(b) Decode is too slow even at batch 1. From lesson 01, decode is bandwidth-bound: per-token latency ≈ model_bytes / BW. TP splits the weights across k GPUs, so each GPU streams only 2N/k bytes per token, and decode latency drops ~linearly with TP degree. A 70B at fp16 read at one H100's 3.35 TB/s would take 140 GB / 3.35 TB/s ≈ 42 ms/token (a hypothetical baseline, since 140 GB does not fit in 80 GB); at TP=4 each GPU streams 35 GB ≈ 10 ms/token. That is how you hit a tight TPOT SLO.
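Both reasons for TP reduce to one-line arithmetic. The sketch below reproduces the numbers in (a) and (b) under the same assumptions: 2 bytes per parameter, 80 GB of HBM, 3.35 TB/s of bandwidth. It counts weights only and ignores KV cache, activations, and all-reduce time; the function names are illustrative.

```python
import math

def min_tp_to_fit(params_billion: float, hbm_gb: float, bytes_per_param: int = 2) -> int:
    """Smallest TP degree whose combined HBM holds the weights (weights only)."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return math.ceil(weight_gb / hbm_gb)

def decode_ms_per_token(params_billion: float, tp: int, hbm_bw_tb_s: float,
                        bytes_per_param: int = 2) -> float:
    """Bandwidth-bound estimate: each GPU streams its weight shard once per token."""
    shard_gb = params_billion * bytes_per_param / tp
    return shard_gb / hbm_bw_tb_s  # GB / (TB/s) = milliseconds

print(min_tp_to_fit(70, 80))                       # 2
print(min_tp_to_fit(405, 80))                      # 11
print(f"{decode_ms_per_token(70, 1, 3.35):.1f}")   # 41.8
print(f"{decode_ms_per_token(70, 4, 3.35):.1f}")   # 10.4
```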
The cost of TP: the per-layer all-reduce is pure overhead, so tokens/s/GPU drops slightly with TP degree (typically a few % to ~15% at TP=8). That's why TP is for fit and latency, not for cheap throughput — for throughput you replicate.
3 · Prefill/decode disaggregation — stop the two phases fighting
From lesson 01, the two phases have opposite roofline character: prefill is compute-bound (big matmuls over the whole prompt), decode is bandwidth-bound (one token, weights re-read). Co-locate them on the same GPU and a long prefill monopolizes the SMs while in-flight decodes stall — exactly the TTFT-vs-TPOT contention lesson 04 saw inside one replica.
Disaggregation runs separate GPU pools: a prefill pool and a decode pool. A request prefills on the prefill pool, then its KV cache is transferred over the network to a decode pool that streams tokens. Now a 4K-token prefill never stalls another user's decode — they're on different hardware.
Disaggregation wins when the workload has both long prompts and tight TTFT and TPOT SLOs — the regime where co-location forces you to choose one or the other. Its cost is the KV transfer: a 4K-token Llama-70B prompt at 320 KB/token (lesson 02) is ~1.3 GB to ship per request, so disaggregation needs fast interconnect and pays off only when the contention it removes is worse than the transfer it adds. Mechanics in vLLM 08 and system_ml 12.
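The transfer cost is easy to estimate. The sketch below uses the 320 KB/token figure from the text; the 50 GB/s link bandwidth is a hypothetical value chosen for illustration, not a measurement of any particular interconnect.

```python
def kv_transfer_gb(prompt_tokens: int, kv_kb_per_token: float = 320.0) -> float:
    """KV cache size to ship after prefill, in GB (1 KB = 1e3 bytes)."""
    return prompt_tokens * kv_kb_per_token * 1e3 / 1e9

def kv_transfer_ms(prompt_tokens: int, link_gb_per_s: float,
                   kv_kb_per_token: float = 320.0) -> float:
    """Time to move that KV cache over a link of the given bandwidth."""
    return kv_transfer_gb(prompt_tokens, kv_kb_per_token) / link_gb_per_s * 1e3

print(f"{kv_transfer_gb(4096):.2f} GB")        # 1.31 GB
# Assumed link bandwidth of 50 GB/s (hypothetical).
print(f"{kv_transfer_ms(4096, 50.0):.1f} ms")  # 26.2 ms
```

Compare that transfer time against the decode stall a co-located long prefill would cause: disaggregation pays off only when the stall is the larger of the two.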
The decision recipe
The three axes compose, but you apply them in order, each gated by a specific constraint:
- Fits one GPU? Yes → replicate. No → TP within a node until the weights fit (TP ≤ 8), then replicate that TP group for QPS.
- Latency SLO still missed at batch 1? → raise TP degree (each GPU streams fewer weight bytes per token, so decode latency drops).
- Long prompts and strict TTFT and TPOT? → disaggregate prefill from decode.
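The three gates above can be written as one small function. This is a sketch under stated assumptions: TP candidates are limited to 1, 2, 4, 8 within one node, decode latency uses the bandwidth-bound estimate with no all-reduce overhead, and the batch ceiling per TP group is passed in as a fixed number. `Plan` and `choose_axes` are hypothetical names.

```python
import math
from dataclasses import dataclass

@dataclass
class Plan:
    tp: int             # GPUs per model instance
    replicas: int       # copies of that TP group
    disaggregate: bool  # separate prefill and decode pools

def choose_axes(weight_gb, hbm_gb, bw_tb_s, tpot_slo_ms,
                concurrency_target, batch_ceiling,
                long_prompts, strict_ttft_and_tpot,
                tp_candidates=(1, 2, 4, 8)) -> Plan:
    # Gates 1 and 2: smallest in-node TP that both fits and meets the TPOT SLO.
    for tp in tp_candidates:
        fits = weight_gb <= tp * hbm_gb
        decode_ms = weight_gb / tp / bw_tb_s  # GB / (TB/s) = ms
        if fits and decode_ms <= tpot_slo_ms:
            break
    else:
        raise ValueError("no in-node TP degree satisfies fit and TPOT")
    # Replicate the TP group until the concurrency target is covered.
    replicas = max(1, math.ceil(concurrency_target / batch_ceiling))
    # Gate 3: disaggregate only when both conditions hold.
    return Plan(tp, replicas, long_prompts and strict_ttft_and_tpot)

# 7B (14 GB): fits one GPU, so replicate only.
print(choose_axes(14, 80, 3.35, 50, 400, 64, False, False))
# 70B (140 GB) with a 15 ms TPOT SLO: TP=2 fits but is too slow, so TP=4.
print(choose_axes(140, 80, 3.35, 15, 400, 64, True, True))
```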
Routing — which replica gets the request
With many replicas you need a router. Three policies:
| Policy | How | Best when |
|---|---|---|
| Round-robin | next replica in rotation | uniform, stateless requests |
| Least-loaded | fewest in-flight / lowest queue | variable request sizes |
| Cache-aware | route by prefix → same replica | shared system prompts / agents |
Cache-aware routing sends requests that share a prefix to the same replica so they hit its already-resident KV cache — turning a redundant prefill into a free cache read (lesson 03's prefix-sharing structure). For a fleet serving one big system prompt this can cut prefill cost dramatically; mechanics in SGLang 05.
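The three policies differ only in how they pick an index. In the sketch below the cache-aware policy is a deliberately simplified stand-in: it hashes a fixed-length prompt prefix so that identical prefixes land on the same replica. It does not track what each replica actually has cached and is not the mechanism of any particular router; the `Router` class and the 256-character prefix length are assumptions.

```python
import hashlib
import itertools

class Router:
    def __init__(self, n_replicas: int):
        self.in_flight = [0] * n_replicas            # requests currently on each replica
        self._rotation = itertools.cycle(range(n_replicas))

    def round_robin(self) -> int:
        return next(self._rotation)

    def least_loaded(self) -> int:
        # Lowest in-flight count wins; ties go to the lowest index.
        return min(range(len(self.in_flight)), key=self.in_flight.__getitem__)

    def cache_aware(self, prompt: str, prefix_chars: int = 256) -> int:
        # Same prefix -> same hash -> same replica, so its KV cache can be reused.
        digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.in_flight)

router = Router(4)
system_prompt = "You are a helpful assistant. " * 20   # shared prefix, > 256 chars
a = router.cache_aware(system_prompt + "Question A")
b = router.cache_aware(system_prompt + "Question B")
print(a == b)  # True: both requests share the first 256 characters
```

A pure prefix hash can overload one replica when a single system prompt dominates traffic, so a practical router would likely combine it with a load check.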
Autoscaling — scale on the right signal
Cold-start lag. Spinning up a new replica means loading weights into HBM: 140 GB for a 70B at, say, a few GB/s from storage is minutes. Demand can spike faster than you can scale. The fix is a warm pool (pre-loaded idle replicas) or over-provisioning for the autoscale-lag window — you can't scale reactively when the reaction takes minutes.
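A rough way to size the warm pool is to ask how much concurrency arrives while a cold replica is still loading. The sketch below assumes a linear demand ramp, which is a simplification introduced here, and uses hypothetical numbers: 2 GB/s load bandwidth, 2 additional concurrent requests per second during a spike, and a batch ceiling of 64.

```python
import math

def cold_start_seconds(weight_gb: float, load_gb_per_s: float) -> float:
    """Time to load the weights into HBM at the given storage bandwidth."""
    return weight_gb / load_gb_per_s

def warm_pool_size(demand_growth_per_s: float, lag_s: float, batch_ceiling: int) -> int:
    """Pre-loaded replicas needed to absorb demand that arrives during the lag.

    Assumes concurrency grows linearly at demand_growth_per_s for lag_s seconds.
    """
    extra_concurrency = demand_growth_per_s * lag_s
    return math.ceil(extra_concurrency / batch_ceiling)

lag = cold_start_seconds(140, 2.0)   # 70B at fp16, assumed 2 GB/s
print(lag)                           # 70.0
print(warm_pool_size(2.0, lag, 64))  # ceil(140 / 64) = 3
```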
Multi-tenancy — many models on one fleet
A platform serves many models. Two structures: per-model min replicas (each model guaranteed capacity — predictable latency, wasteful for cold models) versus a shared pool (models swap in on demand — efficient, but pays cold-start lag on a miss). The cheap middle path for fine-tunes is LoRA multiplexing: one base model in HBM, many small low-rank adapters swapped per request — serve dozens of fine-tunes off one base model's weight bytes instead of dozens of full copies (see vLLM 12).
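The weight-byte saving from LoRA multiplexing is simple arithmetic. In the sketch below the 0.2 GB adapter size is a hypothetical figure for illustration; real adapter sizes depend on rank and on which layers are adapted.

```python
def full_copies_gb(base_weight_gb: float, n_finetunes: int) -> float:
    """Weight bytes if every fine-tune is served as a full model copy."""
    return base_weight_gb * n_finetunes

def multiplexed_gb(base_weight_gb: float, n_finetunes: int, adapter_gb: float) -> float:
    """Weight bytes with one shared base model plus one adapter per fine-tune."""
    return base_weight_gb + n_finetunes * adapter_gb

# 24 fine-tunes of a 7B base (14 GB at fp16), assumed 0.2 GB per adapter.
print(full_copies_gb(14, 24))                # 336
print(f"{multiplexed_gb(14, 24, 0.2):.1f}")  # 18.8
```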
Fleet cost — why replication is cheap and TP is not
From lesson 02, the serving unit cost: $/1M tok = (GPU $/hour × 10^6) / (3600 × tok/s/GPU).
The denominator is tokens/s/GPU. Replication copies a whole replica, so per-GPU throughput is unchanged — cost/token stays flat as you add QPS. TP splits one model and adds a per-layer all-reduce, so tokens/s/GPU drops a little — cost/token rises slightly with TP degree. The cost lesson: TP is the tax you pay for fit or latency; replication is the free axis for throughput. Never use TP to buy QPS that replication could buy more cheaply.
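The effect of the TP tax on unit cost can be checked numerically. The sketch below converts a GPU hourly price and a per-GPU throughput into dollars per million tokens; the $2/hour price and 1000 tok/s/GPU baseline are hypothetical, and the 15% throughput loss is the upper end of the range quoted earlier for TP=8.

```python
def usd_per_million_tokens(gpu_usd_per_hour: float, tok_per_s_per_gpu: float) -> float:
    """Serving cost per 1M tokens for one GPU running at steady throughput."""
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return gpu_usd_per_hour / tokens_per_gpu_hour * 1e6

# Replication: every GPU has the same tok/s/GPU, so the unit cost does not
# depend on how many replicas you run.
print(f"{usd_per_million_tokens(2.0, 1000):.2f}")  # 0.56

# TP with an assumed 15% per-GPU throughput loss: unit cost rises.
print(f"{usd_per_million_tokens(2.0, 850):.2f}")   # 0.65
```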
Interactive · Replicate vs TP planner
Pick a model and GPU, set the concurrency target from Little's Law (lesson 03), and slide the TP degree. Watch the headline flip between "model fits — replication is cheapest" and "model needs TP ≥ k to fit," and see how total GPUs and relative decode latency move.
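For readers without the interactive widget, the same calculation can be approximated in code. This is a sketch of what such a planner might compute, not the widget's actual logic: it holds the batch ceiling per TP group fixed (in practice a higher TP degree frees HBM for more KV cache) and treats relative decode latency as 1/TP with no all-reduce overhead.

```python
import math

def plan(weight_gb: float, hbm_gb: float, concurrency_target: float,
         batch_ceiling_per_group: int, tp: int) -> dict:
    min_tp = math.ceil(weight_gb / hbm_gb)
    if min_tp == 1:
        headline = "model fits: replication is cheapest"
    else:
        headline = f"model needs TP >= {min_tp} to fit"
    if tp < min_tp:
        # The chosen TP degree cannot hold the weights, so no fleet size applies.
        return {"headline": headline, "total_gpus": None,
                "relative_decode_latency": None}
    replicas = max(1, math.ceil(concurrency_target / batch_ceiling_per_group))
    return {"headline": headline,
            "total_gpus": tp * replicas,            # each replica is one TP group
            "relative_decode_latency": 1.0 / tp}    # versus TP=1, bandwidth-bound

for tp in (1, 2, 4, 8):
    print(tp, plan(140, 80, 400, 64, tp))
```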
What carries forward
- Three orthogonal axes: replication (throughput, cheap, flat tok/s/GPU), tensor parallel (fit + latency, NVLink-bound, slight tok/s/GPU tax), disaggregation (decouples prefill/decode contention, costs a KV transfer).
- The recipe: fits? replicate. Doesn't? TP within a node then replicate the group. Latency still bad? raise TP. Long prompts + tight TTFT and TPOT? disaggregate.
- Route cache-aware when prefixes are shared; autoscale on queue depth / goodput / KV-util, never GPU util; keep a warm pool for cold-start lag.
- Cost: $/1M tok ∝ 1/(tok/s/GPU). Replication keeps that flat; TP raises it slightly by lowering tok/s/GPU. So size for throughput with replication, and reach for TP only for fit or latency.