
Inference — TP for latency, replicas for throughput

Same hardware, opposite goals. Training cares about the gradient AllReduce; inference cares about per-token decode time. The math behind "TP=8 or 8× replicas?" comes down to whether the workload is memory-bound or compute-bound.

What changes at inference time

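Training is dominated by big batches doing big matmuls, which keeps it comfortably compute-bound (above the roofline ridge, lesson 01). Inference splits into two regimes: prefill, which processes the whole prompt in one pass and behaves like training, and decode, which emits one token per step and re-reads the weights each time.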

This is the core asymmetry. Training and prefill scale on compute; decode scales on HBM bandwidth. Most of inference's tokens-per-second is set by decode. So the question becomes: how fast can we read weights from HBM, per output token, per request?

The decode latency calculation

Per token, you read every weight once. For a 70B bf16 model, that's 140 GB. On one H100 at 3.35 TB/s HBM bandwidth, the absolute minimum per-token time is:

t_token_min  =  |weights_in_bytes| / HBM_BW  =  140 GB / 3350 GB/s ≈ 42 ms

That's about 24 tokens/sec for a single GPU serving a single request at peak bandwidth utilization. In practice you achieve roughly 60–80% of that.
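The bandwidth floor above is simple enough to check in a few lines. This is a minimal sketch, not a benchmark: the function name `min_decode_ms` is made up for illustration, and the 60% and 80% efficiency values are just the range quoted in the text.

```python
def min_decode_ms(params_billion: float, bytes_per_param: float, hbm_tb_per_s: float) -> float:
    """Lower bound on per-token decode time in ms: every weight byte is read once."""
    weight_gb = params_billion * bytes_per_param   # 70 B params * 2 bytes = 140 GB
    hbm_gb_per_s = hbm_tb_per_s * 1000.0           # 3.35 TB/s = 3350 GB/s
    return weight_gb / hbm_gb_per_s * 1000.0       # seconds -> milliseconds


t_min = min_decode_ms(params_billion=70, bytes_per_param=2, hbm_tb_per_s=3.35)
peak_tok_per_s = 1000.0 / t_min
print(f"bandwidth floor: {t_min:.1f} ms/token, {peak_tok_per_s:.1f} tok/s")

# Assumed range of achieved bandwidth, taken from the 60-80% figure in the text.
for efficiency in (0.6, 0.8):
    print(f"at {efficiency:.0%} of peak: {t_min / efficiency:.1f} ms/token, "
          f"{peak_tok_per_s * efficiency:.1f} tok/s")
```

Swapping in a different parameter count, dtype width, or bandwidth shows how the floor moves with each input.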

Now use TP=8. The weight read is spread across 8 GPUs' HBM bandwidths: each one reads only 1/8 of the weights, so the read time drops by 8×. The cost is a per-layer AllReduce, which on NVLink is tiny relative to the saving. New per-token time:

t_token_TP=8  =  140 GB / (8 · 3350 GB/s) + small_AR ≈ 5–6 ms

Roughly an 8× latency improvement. But token throughput per GPU is unchanged: TP=8 uses 8× the GPUs to serve one request 8× faster. From the cluster's point of view, it's the same total work.
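A small sketch makes the "faster per request, same per GPU" point concrete. The AllReduce cost here (`ALLREDUCE_MS = 0.5`) is an assumed placeholder for the "small_AR" term, not a measured value, and the helper names are hypothetical.

```python
WEIGHT_GB = 140.0          # 70B parameters in bf16
HBM_GB_PER_S = 3350.0      # H100 HBM bandwidth from the text
ALLREDUCE_MS = 0.5         # ASSUMPTION: placeholder for total AllReduce time per token


def tp_decode_ms(tp: int, allreduce_ms: float = ALLREDUCE_MS) -> float:
    """Per-token decode time in ms when the weights are sharded over `tp` GPUs."""
    read_ms = WEIGHT_GB / (tp * HBM_GB_PER_S) * 1000.0
    comm_ms = allreduce_ms if tp > 1 else 0.0   # no AllReduce without sharding
    return read_ms + comm_ms


def per_gpu_tok_per_s(tp: int) -> float:
    """Tokens per second of one request, divided by the GPUs serving it."""
    return 1000.0 / tp_decode_ms(tp) / tp


for tp in (1, 8):
    print(f"TP={tp}: {tp_decode_ms(tp):.1f} ms/token, "
          f"{per_gpu_tok_per_s(tp):.1f} tok/s per GPU")
```

With a nonzero AllReduce term the per-GPU figure at TP=8 comes out slightly below TP=1, which is the communication cost showing up as lost throughput.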

Animated · one decode token, bytes flowing from HBM

Pick a TP degree and watch where the bytes physically come from. At TP=1, a single GPU streams the entire 140 GB of weights through its HBM bus, layer by layer. At TP=8, each of the 8 GPUs reads only its 1/8 shard — in parallel, eight times faster. Each blue particle is a chunk of weights moving from HBM into a tensor-core compute unit.

[Widget: HBM → compute · weight bytes per decode step. The horizontal lane is one GPU's HBM bus. Particles flow from HBM (left) into the SM (right) at a rate set by HBM bandwidth. TP=N puts N lanes in parallel, each carrying 1/N of the bytes. Readouts: bytes per GPU · elapsed (modeled) · token complete? · speedup vs TP=1]

Throughput vs latency — the choice

Now consider a serving fleet with 8 GPUs and two options:

Option | Latency / request | Per-GPU throughput, batch=1 (tok/s)
1 replica TP=8 | ~5 ms/tok (8× faster) | ~24 tok/s (unchanged)
8 replicas TP=1 | ~42 ms/tok | ~24 tok/s (unchanged)

(Both shapes batch many concurrent requests in production — we'll bring that in below. At batch=1, the table compares the single-stream extreme.)

The two options give the same total tokens-per-second. What differs is how they're allocated: TP=8 lets each request run faster; replication lets more requests run simultaneously. Pick by what matters. Latency-sensitive workloads (chat, autocomplete, coding agents) want TP high. Bulk-throughput workloads (batched embedding, document processing) want replication high.
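One way to turn "pick by what matters" into a rule: since total throughput is the same in the ideal model, take the lowest TP degree that still meets the per-token latency budget, which leaves the most replicas for concurrent requests. The sketch below assumes the ideal bandwidth model (no AllReduce cost) and power-of-two TP degrees; `shapes` and `pick_shape` are illustrative names, not part of any serving stack.

```python
WEIGHT_GB = 140.0        # 70B parameters in bf16
HBM_GB_PER_S = 3350.0    # H100 HBM bandwidth from the text


def shapes(n_gpus: int) -> list[tuple[int, int]]:
    """All (tp, replicas) pairs with tp a power of two and tp * replicas == n_gpus."""
    out = []
    tp = 1
    while tp <= n_gpus:
        if n_gpus % tp == 0:
            out.append((tp, n_gpus // tp))
        tp *= 2
    return out


def pick_shape(n_gpus: int, budget_ms_per_token: float):
    """Lowest TP (most replicas) whose ideal decode time meets the latency budget."""
    for tp, replicas in shapes(n_gpus):            # ascending TP
        ms = WEIGHT_GB / (tp * HBM_GB_PER_S) * 1000.0
        if ms <= budget_ms_per_token:
            return tp, replicas
    return None                                    # no shape is fast enough


for budget in (50.0, 15.0, 6.0, 2.0):
    print(f"budget {budget} ms/token -> {pick_shape(8, budget)}")
```

A loose budget selects full replication, a tight one pushes toward TP=8, and a budget below the TP=8 floor returns `None`.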

The catch — batching changes the math

The single-stream comparison above assumes batch = 1. As you batch more requests onto the same forward pass, the time per decode step goes up only slightly (you do more arithmetic but read the same weights), so the cost per request drops. With batch B = 32 sharing the same forward pass:

cost_per_request  ≈  t_token_single / B  +  small_per_request_overhead

Decode batching, plus continuous batching (lesson 04 in the vLLM series), is what makes shared GPU serving economical. The TP-vs-replicas question becomes: with batching, is one TP=8 replica running batch 32 cheaper than 8 TP=1 replicas each running batch 32? In wall-clock per request: the TP=8 replica is still ~8× faster (memory-bound dynamics survive batching for a while). In total cluster throughput: roughly equal again — the GPU work is the same, just spread.
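The batched comparison can be checked with the same bandwidth model. This sketch assumes the decode step time does not grow with batch size (the memory-bound limit) and ignores AllReduce and per-request overhead unless passed in; the function names are hypothetical.

```python
WEIGHT_GB = 140.0        # 70B parameters in bf16
HBM_GB_PER_S = 3350.0    # H100 HBM bandwidth from the text


def step_ms(tp: int) -> float:
    """Ideal time of one decode step in ms; ASSUMED independent of batch size."""
    return WEIGHT_GB / (tp * HBM_GB_PER_S) * 1000.0


def cost_per_request_ms(tp: int, batch: int, overhead_ms: float = 0.0) -> float:
    """Step time attributed to one request: t_token_single / B + overhead."""
    return step_ms(tp) / batch + overhead_ms


def cluster_tok_per_s(tp: int, replicas: int, batch: int) -> float:
    """Aggregate throughput: each replica emits `batch` tokens per step."""
    return replicas * batch * 1000.0 / step_ms(tp)


B = 32
for tp, replicas in ((8, 1), (1, 8)):
    print(f"TP={tp} x {replicas} replica(s), batch {B}: "
          f"{step_ms(tp):.1f} ms between tokens per request, "
          f"{cluster_tok_per_s(tp, replicas, B):.0f} tok/s cluster-wide")
```

Both shapes land on the same cluster throughput while the TP=8 replica delivers each request's tokens 8× sooner, matching the claim in the paragraph above.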

So the right framing: at any batch size, TP and replication trade latency for parallelism, not for total throughput. The exception is when batching itself becomes a constraint:

PP is rare at inference time: here's why

PP across the layers of a model adds a per-stage bubble at inference: a single request must walk all N stages sequentially, paying N - 1 stage-time of bubble before its first token comes out. For chat use cases (TTFT < 500 ms), this is too high a price. PP shows up at inference only in two cases:

The decision table

Workload | Right answer
Chat, code complete (latency king, batch small) | TP high (4–8), replicas to fit QPS
Bulk inference (no per-request latency budget) | TP low (1–2), many replicas, big batch
Long-context RAG (prefill heavy) | TP high for prefill, replicas for QPS
MoE serving (huge total weights) | EP across GPUs, TP=1, replicas for QPS
Single-GPU-fits dense model | TP=1, just replicate

TP at serving differs from TP at training

Two subtle differences:

  1. No backward pass, so no gradient AllReduce. Per-step communication is roughly half that of training, which also pays a symmetric AllReduce in backward.
  2. The per-step compute is smaller (batch is smaller, sequence dim is smaller during decode). So the comm/compute ratio at decode is worse than at training — TP's AllReduces become a larger fraction of the time. Modern serving stacks (vLLM, TRT-LLM) work hard to fuse the AllReduce into the matmul kernel where possible to hide it.

This is why some serving stacks use TP=4 even when TP=8 would be available — the marginal latency improvement saturates because the AllReduce cost is no longer dwarfed by compute.
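The saturation argument can be sketched by adding a fixed communication term to the bandwidth model. Everything about the communication cost here is an assumption for illustration: 80 layers for a 70B-class model, two AllReduces per layer (one after attention, one after the MLP, as in Megatron-style TP), and a placeholder 5 µs per AllReduce that should be replaced with a measured number.

```python
WEIGHT_GB = 140.0           # 70B parameters in bf16
HBM_GB_PER_S = 3350.0       # H100 HBM bandwidth from the text
N_LAYERS = 80               # ASSUMPTION: layer count of a 70B-class dense model
ALLREDUCES_PER_LAYER = 2    # one after attention, one after the MLP
ALLREDUCE_US = 5.0          # ASSUMPTION: placeholder latency per AllReduce, not measured


def comm_ms(tp: int) -> float:
    """Total AllReduce time per decode step; zero when nothing is sharded."""
    if tp == 1:
        return 0.0
    return N_LAYERS * ALLREDUCES_PER_LAYER * ALLREDUCE_US / 1000.0


def latency_ms(tp: int) -> float:
    """Weight-read time (shrinks with TP) plus communication (does not)."""
    return WEIGHT_GB / (tp * HBM_GB_PER_S) * 1000.0 + comm_ms(tp)


def comm_fraction(tp: int) -> float:
    return comm_ms(tp) / latency_ms(tp)


prev = None
for tp in (1, 2, 4, 8, 16):
    t = latency_ms(tp)
    gain = "" if prev is None else f", {prev / t:.2f}x vs previous TP"
    print(f"TP={tp:2d}: {t:6.2f} ms/token, comm {comm_fraction(tp):.0%}{gain}")
    prev = t
```

Because the read term halves at each doubling while the communication term stays fixed, each step up in TP buys less than the one before, and the communication share of the step keeps growing.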

Animated · Amdahl ladder · TP scaling, sublinear because of comm

Pure HBM-bandwidth math says decode latency falls linearly with TP. In practice the curve flattens because (a) every TP step costs an AllReduce on NVLink, (b) at higher batch sizes some of the work becomes compute-bound and stops benefiting from more HBM lanes. The widget below traces latency as you sweep TP from 1 to 16; move the batch-size slider to watch the curve bend.

[Widget: Decode latency vs TP · batched and unbatched. Dashed: ideal Amdahl curve (latency / TP). Solid: realistic curve with AllReduce overhead + a per-token compute floor that grows with batch size. Readouts: latency @ TP=1 · latency @ TP=8 · marginal gain TP=4→8 · comm fraction @ TP=8]

2D · choose your (TP, replicas) point on the frontier

Below is the same throughput-vs-latency frontier as the original widget, but interactive: pick a TP degree and a number of replicas with the sliders. The chosen point is highlighted; nearby alternatives are dim. Configs whose per-rank KV-cache exceeds HBM are greyed out as infeasible. The Pareto front is the chain in green.

[Widget: Pareto explorer · pick TP and replicas. Per-rank KV-cache budget = HBM − weight shard. Configs whose batch × seq pushes past this are infeasible (X marker). Move sliders to position yourself on the frontier. Readouts: chosen latency · chosen throughput · total GPUs · on Pareto front?]
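The feasibility rule (per-rank KV-cache budget = HBM − weight shard) can be sketched as a check. The model shape below is hypothetical: 80 layers, 8 KV heads with grouped-query attention, head dimension 128, bf16 KV cache, and 80 GB of HBM per GPU. The sketch also ignores activations and runtime overhead, so real budgets are tighter.

```python
GB = 1e9

HBM_GB = 80.0        # ASSUMPTION: 80 GB of HBM per GPU
WEIGHT_GB = 140.0    # 70B parameters in bf16
N_LAYERS = 80        # ASSUMPTION: hypothetical 70B-class model shape
N_KV_HEADS = 8       # ASSUMPTION: grouped-query attention with 8 KV heads
HEAD_DIM = 128       # ASSUMPTION
KV_BYTES = 2         # bf16 KV cache


def kv_budget_gb(tp: int) -> float:
    """Per-rank KV-cache budget = HBM - weight shard (negative if weights don't fit)."""
    return HBM_GB - WEIGHT_GB / tp


def kv_needed_gb(tp: int, batch: int, seq_len: int) -> float:
    """Per-rank KV cache for `batch` sequences of `seq_len` tokens.

    KV heads are split evenly across ranks, so this holds only for tp <= N_KV_HEADS.
    """
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES / tp  # K and V
    return batch * seq_len * per_token_bytes / GB


def feasible(tp: int, batch: int, seq_len: int) -> bool:
    return kv_needed_gb(tp, batch, seq_len) <= kv_budget_gb(tp)


for tp in (1, 2, 4, 8):
    print(f"TP={tp}: budget {kv_budget_gb(tp):6.1f} GB, "
          f"need {kv_needed_gb(tp, 32, 4096):5.1f} GB, "
          f"feasible={feasible(tp, 32, 4096)}")
```

Under these assumed numbers the bf16 weights alone exceed a single GPU's HBM, so TP=1 is ruled out before any KV cache is allocated; the TP=1 figures earlier are best read as a bandwidth thought experiment unless the model is smaller or quantized.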

Interactive · which shape gives you the QPS×latency point you want?

Set the cluster size, the model, and the per-request constraints. The widget lays out feasible (TP, replicas) shapes and overlays them on a "throughput × latency" plot. The Pareto front is the family of shapes that aren't dominated, meaning no other shape is at least as good on both axes and better on one. A shape off the front gives up latency or throughput for nothing in return.

[Widget: Pareto: throughput vs single-request latency. Each dot is one (TP, replicas) shape using all cluster GPUs. X = single-request decode latency (ms/tok). Y = aggregate cluster throughput (tok/s). The "knee" is usually the best buy if you care about both. Legend: latency-best shape · throughput-best shape · balanced "knee" · infeasible (can't fit)]
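The frontier described above can be sketched with a generic dominance filter over (latency, throughput) points. The latency model and the 0.8 ms AllReduce cost are assumptions for illustration, batch size is fixed at 1, and `shape_points` and `pareto_front` are hypothetical names rather than the widget's actual code.

```python
WEIGHT_GB = 140.0        # 70B parameters in bf16
HBM_GB_PER_S = 3350.0    # H100 HBM bandwidth from the text
ALLREDUCE_MS = 0.8       # ASSUMPTION: placeholder AllReduce time per decode step


def shape_points(n_gpus: int):
    """(name, latency_ms, cluster_tok_per_s) for each power-of-two TP using all GPUs."""
    points = []
    tp = 1
    while tp <= n_gpus:
        if n_gpus % tp == 0:
            replicas = n_gpus // tp
            lat = WEIGHT_GB / (tp * HBM_GB_PER_S) * 1000.0
            lat += ALLREDUCE_MS if tp > 1 else 0.0
            points.append((f"TP={tp} x{replicas}", lat, replicas * 1000.0 / lat))
        tp *= 2
    return points


def pareto_front(points):
    """Keep points that no other point beats on latency (lower) and throughput (higher)."""
    front = []
    for name, lat, thr in points:
        dominated = any(
            l2 <= lat and t2 >= thr and (l2 < lat or t2 > thr)
            for _, l2, t2 in points
        )
        if not dominated:
            front.append((name, lat, thr))
    return sorted(front, key=lambda p: p[1])   # fastest first


for name, lat, thr in pareto_front(shape_points(8)):
    print(f"{name}: {lat:.1f} ms/token, {thr:.0f} tok/s")
```

In this simplified model every shape stays on the front, because higher TP buys latency and pays for it in AllReduce overhead. Shapes would fall off the front only once effects like KV-cache feasibility or a batch-dependent compute floor are added.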

Takeaway

Decode is memory-bound: per-token time is set by HBM bandwidth, not FLOPs. TP=N spreads the read across N GPUs, dividing latency. Replication multiplies parallel requests. Per-cluster throughput is the same either way, until KV-cache or batching effects break the symmetry. The right choice is whichever knob your bottleneck is on.