Inference — TP for latency, replicas for throughput

Same hardware, opposite goals. Training cares about the gradient AllReduce; inference cares about per-token decode time. The math behind "TP=8 or 8× replicas?" comes down to whether the workload is memory-bound or compute-bound.

What changes at inference time

Training is dominated by big batches doing big matmuls — comfortably compute-bound (above the roofline ridge, lesson 01). Inference splits into two regimes:

Prefill. One forward pass on the full prompt. The matmul operates on (B, T_prefill, d); arithmetic intensity is high; compute-bound, exactly like training.
Decode. One forward pass per generated token. The input is (B, 1, d). The KV cache from previous tokens grows but the new query is just one token. Arithmetic intensity is low: per parameter read, only one multiply-add per sample in the batch. Memory-bound: you spend nearly all your time reading weights (and, at long contexts, the KV cache) from HBM.

This is the core asymmetry. Training and prefill scale on compute; decode scales on HBM bandwidth. Most of inference's tokens-per-second is set by decode. So the question becomes: how fast can we read weights from HBM, per output token, per request?
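To make the asymmetry concrete, here is a rough sketch that compares arithmetic intensity against the roofline ridge. It counts only weight traffic (activation and KV-cache reads are ignored), and the peak-FLOPs figure is an assumed H100 dense bf16 number that does not come from this lesson.

```python
# Rough roofline check for a weight matmul in bf16.
PEAK_FLOPS = 989e12      # ASSUMPTION: H100 dense bf16 peak, FLOP/s
HBM_BW = 3.35e12         # bytes/s, the bandwidth figure used in this lesson
BYTES_PER_PARAM = 2      # bf16


def arithmetic_intensity(batch: int, tokens: int) -> float:
    """FLOPs per weight byte read.

    Each parameter does one multiply-add (2 FLOPs) per token per sample
    and costs BYTES_PER_PARAM bytes to read once.
    """
    return 2 * batch * tokens / BYTES_PER_PARAM


# FLOP/byte needed before the GPU becomes compute-bound.
ridge = PEAK_FLOPS / HBM_BW


def regime(batch: int, tokens: int) -> str:
    if arithmetic_intensity(batch, tokens) >= ridge:
        return "compute-bound"
    return "memory-bound"


print(f"ridge: {ridge:.0f} FLOP/byte")
print("prefill B=1, T=2048:", regime(1, 2048))
print("decode  B=1, T=1:   ", regime(1, 1))
print("decode  B=32, T=1:  ", regime(32, 1))
```

Under these assumptions a decode step stays memory-bound even at batch 32, while a prompt of a few hundred tokens already crosses the ridge.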

The decode latency calculation

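Per token, you read every weight once. For a 70B bf16 model, that's 140 GB. (That is more than the 80 GB of HBM on a single H100, so treat the one-GPU figure as a bandwidth baseline rather than a deployable shape.) On one H100 at 3.35 TB/s HBM bandwidth, the absolute minimum per-token time is: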

t_token_min = |weights_in_bytes| / HBM_BW = 140 GB / 3350 GB/s ≈ 42 ms

That's 24 tokens/sec, single GPU, single request, peak bandwidth utilization. In practice you achieve ~60–80% of that.
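The same arithmetic as a small sketch, using only the numbers quoted above. The function name and the efficiency loop are illustrative choices.

```python
def min_token_time_s(n_params: float, bytes_per_param: int, hbm_bw: float) -> float:
    """Lower bound on decode time per token: every weight byte is read once."""
    return n_params * bytes_per_param / hbm_bw


# 70B parameters, bf16 (2 bytes each), 3.35 TB/s of HBM bandwidth.
t_min = min_token_time_s(70e9, 2, 3.35e12)
print(f"lower bound: {t_min * 1e3:.1f} ms/token, {1 / t_min:.1f} tok/s")

# Apply the 60-80% achieved-bandwidth range mentioned in the text.
for efficiency in (0.6, 0.8):
    t = t_min / efficiency
    print(f"at {efficiency:.0%} of peak: {t * 1e3:.1f} ms/token, {1 / t:.1f} tok/s")
```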

Now use TP=8. The weight read is spread across 8 GPUs' HBM bandwidths: each one reads only 1/8 of the weights, so the read time drops by 8×. The cost is a per-layer AllReduce, which on NVLink is tiny relative to the saving. New per-token time:

t_token_TP=8 = 140 GB / (8 · 3350 GB/s) + small_AR ≈ 5–6 ms

Roughly an 8× latency improvement. But the tokens-per-second per GPU is unchanged: TP=8 uses 8× the GPUs to serve one request 8× faster. From the cluster's point of view, it's the same total work.
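A minimal sketch of that trade, assuming a flat AllReduce cost per token. The 0.5 ms value is a placeholder, not a measurement.

```python
WEIGHT_BYTES = 140e9     # 70B params in bf16
HBM_BW = 3.35e12         # bytes/s per GPU
ALLREDUCE_S = 0.5e-3     # HYPOTHETICAL total AllReduce time per token when TP > 1


def tp_token_time_s(tp: int) -> float:
    """Per-token decode time when the weight read is split across `tp` GPUs."""
    comm = ALLREDUCE_S if tp > 1 else 0.0
    return WEIGHT_BYTES / (tp * HBM_BW) + comm


def per_gpu_tok_per_s(tp: int) -> float:
    """Single-request tokens/sec divided by the number of GPUs serving it."""
    return 1.0 / (tp_token_time_s(tp) * tp)


for tp in (1, 8):
    t = tp_token_time_s(tp)
    print(f"TP={tp}: {t * 1e3:.1f} ms/token, "
          f"{1 / t:.0f} tok/s per request, {per_gpu_tok_per_s(tp):.1f} tok/s per GPU")
```

With any nonzero communication cost, the per-GPU figure at TP=8 lands slightly below the TP=1 figure, which is the "same total work" point in numbers.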

Animated · one decode token, bytes flowing from HBM

Pick a TP degree and watch where the bytes physically come from. At TP=1, a single GPU streams the entire 140 GB of weights through its HBM bus, layer by layer. At TP=8, each of the 8 GPUs reads only its 1/8 shard — in parallel, eight times faster. Each blue particle is a chunk of weights moving from HBM into a tensor-core compute unit.

Throughput vs latency — the choice

Now consider a serving fleet with 8 GPUs and two options:

Option	Latency / request	Per-GPU throughput, batch=1 (tok/s)
1 replica TP=8	~5 ms/tok (8× faster)	~24 tok/s (unchanged)
8 replicas TP=1	~42 ms/tok	~24 tok/s (unchanged)

(Both shapes batch many concurrent requests in production — we'll bring that in below. At batch=1, the table compares the single-stream extreme.)

The two options give the same total tokens-per-second. What differs is how they're allocated: TP=8 lets each request run faster; replication lets more requests run simultaneously. Pick by what matters. Latency-sensitive workloads (chat, autocomplete, coding agents) want TP high. Bulk-throughput workloads (batched embedding, document processing) want replication high.

The catch — batching changes the math

The single-stream comparison above assumes batch = 1. As you batch more requests onto the same forward pass, the time per decode step goes up only slightly (you do more arithmetic but read the same weights), so the cost per request drops. With B requests (say B = 32) sharing the same forward pass:

cost_per_request ≈ t_token_single / B + small_per_request_overhead

Decode batching, plus continuous batching (lesson 04 in the vLLM series), is what makes shared GPU serving economical. The TP-vs-replicas question becomes: with batching, is one TP=8 replica running batch 32 cheaper than 8 TP=1 replicas each running batch 32? In wall-clock per request: the TP=8 replica is still ~8× faster (memory-bound dynamics survive batching for a while). In total cluster throughput: roughly equal again — the GPU work is the same, just spread.
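The batched comparison can be sketched as follows, under the simplifying assumptions that step time does not grow with batch size while decode stays memory-bound and that AllReduce cost is ignored. Function names are illustrative.

```python
def cost_per_request_s(t_step_s: float, batch: int, overhead_s: float = 0.0) -> float:
    """GPU-side time attributable to one request for one decode step."""
    return t_step_s / batch + overhead_s


def cluster_tok_per_s(t_step_s: float, batch: int, replicas: int) -> float:
    """Each replica emits `batch` tokens every step."""
    return replicas * batch / t_step_s


T_TP1 = 140e9 / 3.35e12      # step time of one TP=1 replica (weight-read bound)
T_TP8 = T_TP1 / 8            # step time of one TP=8 replica, AllReduce ignored
B = 32

one_tp8 = cluster_tok_per_s(T_TP8, B, replicas=1)
eight_tp1 = cluster_tok_per_s(T_TP1, B, replicas=8)

print(f"1 x TP=8: {T_TP8 * 1e3:.1f} ms/token per request, {one_tp8:.0f} tok/s total")
print(f"8 x TP=1: {T_TP1 * 1e3:.1f} ms/token per request, {eight_tp1:.0f} tok/s total")
print(f"cost per request at TP=1, B={B}: {cost_per_request_s(T_TP1, B) * 1e3:.2f} ms")
```

Both shapes produce the same cluster throughput in this idealized model; only the per-request latency differs.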

So the right framing: at any batch size, TP and replication trade latency for parallelism, not for total throughput. The exception is when batching itself becomes a constraint:

HBM budget for KV cache. Bigger batches need more KV-cache slots. With TP, each rank only holds 1/N of the KV cache (sharded across heads, as in lesson 09 of the vLLM series). So TP increases effective KV capacity per replica — useful for long-context workloads. The GQA wrinkle from lesson 06 applies: TP can't exceed the KV-head count, so a model with 8 KV heads (Llama-3-70B) caps at TP=8 for honest KV sharding; further TP would replicate KV state and undo the capacity win.
Time-to-first-token (TTFT). Prefill is compute-bound and benefits from TP linearly (in compute, not memory). Workloads with very long prompts (RAG, code repos) get TP gains on both prefill and decode.
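To put numbers on the KV-capacity point, here is a sketch that estimates how many cached tokens one replica can hold at each TP degree. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-3-70B-like configuration, the 80 GB HBM size is an assumption, and activations and workspace memory are ignored.

```python
N_LAYERS = 80            # ASSUMPTION: Llama-3-70B-like shape
N_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES = 2             # bf16 cache entries
HBM_BYTES = 80e9         # ASSUMPTION: 80 GB of HBM per GPU
WEIGHT_BYTES = 140e9     # 70B params in bf16


def kv_bytes_per_token_per_rank(tp: int) -> int:
    """K and V for one token on one rank, sharded across KV heads.

    Once tp exceeds the KV-head count, each rank still holds one full head,
    so the per-rank footprint stops shrinking.
    """
    heads_per_rank = max(1, N_KV_HEADS // tp)
    return 2 * N_LAYERS * heads_per_rank * HEAD_DIM * KV_BYTES


def max_cached_tokens(tp: int) -> int:
    """Tokens of KV cache one replica can hold after loading its weight shard."""
    free = HBM_BYTES - WEIGHT_BYTES / tp
    if free <= 0:
        return 0             # the weight shard alone does not fit
    return int(free // kv_bytes_per_token_per_rank(tp))


for tp in (1, 2, 4, 8, 16):
    print(f"TP={tp:2d}: {kv_bytes_per_token_per_rank(tp) / 1024:.0f} KiB/token/rank, "
          f"{max_cached_tokens(tp):,} tokens per replica")
```

The per-rank footprint halves with each doubling of TP up to 8 and then stays flat at TP=16, which is the replication effect described above.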

PP is rare at inference time, here's why

Pipeline parallelism (PP) splits the model's layers into N stages, and at inference that adds a bubble: a single request must walk all N stages sequentially, so while one stage works the other N - 1 sit idle, and the first token comes out only after every stage has run in turn. Unlike TP, PP does not shorten that path. For chat use cases (TTFT < 500 ms), this is too high a price. PP shows up at inference only in two cases:

The model is so big a TP-maxed-out node still can't fit it. You add PP across nodes as a last resort.
Throughput-only workloads with no TTFT constraint, where you can fill the pipeline with many concurrent requests.
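The two cases above can be read off the usual fill-and-drain pipeline model. This sketch assumes equal stage times and no communication cost, and the function name is illustrative.

```python
def pipeline_utilization(stages: int, in_flight: int) -> float:
    """Fraction of stage-time slots doing useful work.

    `in_flight` independent micro-batches stream through `stages` stages.
    The pipeline needs in_flight + stages - 1 time slots to finish them,
    and each stage is busy for in_flight of those slots.
    """
    return in_flight / (in_flight + stages - 1)


for stages in (2, 4, 8):
    single = pipeline_utilization(stages, in_flight=1)
    busy = pipeline_utilization(stages, in_flight=32)
    print(f"PP={stages}: one request keeps stages {single:.0%} busy, "
          f"32 concurrent micro-batches keep them {busy:.0%} busy")
```

A lone request leaves each stage idle most of the time, while a deep queue of concurrent requests amortizes the bubble, which is why PP only pays off in throughput-only serving.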

The decision table

Workload	Right answer
Chat, code completion (latency king, batch small)	TP high (4–8), replicas to fit QPS
Bulk inference (no per-request latency budget)	TP low (1–2), many replicas, big batch
Long-context RAG (prefill heavy)	TP high for prefill, replicas for QPS
MoE serving (huge total weights)	EP (expert parallelism) across GPUs, TP=1, replicas for QPS
Single-GPU-fits dense model	TP=1, just replicate

TP at serving differs from TP at training

Two subtle differences:

No backward. No gradient AllReduce. The per-step comm cost is roughly half that of training, since training also pays the symmetric AllReduce in backward.
The per-step compute is smaller (batch is smaller, sequence dim is smaller during decode). So the comm/compute ratio at decode is worse than at training — TP's AllReduces become a larger fraction of the time. Modern serving stacks (vLLM, TRT-LLM) work hard to fuse the AllReduce into the matmul kernel where possible to hide it.

This is why some serving stacks use TP=4 even when TP=8 would be available: the marginal latency improvement saturates because the AllReduce cost is no longer dwarfed by the shrinking per-GPU weight-read time.
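A toy model of that saturation, assuming two AllReduces per transformer layer (the usual Megatron-style split) and a fixed latency per AllReduce. Both latency values below are hypothetical; measure on your own interconnect before relying on them.

```python
WEIGHT_BYTES = 140e9     # 70B params in bf16
HBM_BW = 3.35e12         # bytes/s per GPU
N_LAYERS = 80            # ASSUMPTION: Llama-3-70B-like depth
AR_PER_LAYER = 2         # ASSUMPTION: one after attention, one after the MLP


def decode_time_s(tp: int, ar_latency_s: float) -> float:
    """Weight-read time shrinks with TP; AllReduce latency does not."""
    read = WEIGHT_BYTES / (tp * HBM_BW)
    comm = 0.0 if tp == 1 else N_LAYERS * AR_PER_LAYER * ar_latency_s
    return read + comm


def speedup(tp: int, ar_latency_s: float) -> float:
    return decode_time_s(1, ar_latency_s) / decode_time_s(tp, ar_latency_s)


for ar_us in (5, 20):    # HYPOTHETICAL per-AllReduce latencies in microseconds
    ar = ar_us * 1e-6
    row = ", ".join(f"TP={tp}: {speedup(tp, ar):.2f}x" for tp in (1, 2, 4, 8))
    print(f"AllReduce {ar_us} us -> {row}")
```

The higher the per-AllReduce latency, the further the TP=8 speedup falls below 8×, and the smaller the step from TP=4 to TP=8 becomes.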

Animated · Amdahl ladder · TP scaling, sublinear because of comm

Pure HBM-bandwidth math says decode latency falls linearly with TP. In practice the curve flattens because (a) every TP step costs an AllReduce on NVLink, (b) at higher batch sizes some of the work becomes compute-bound and stops benefiting from more HBM lanes. The widget below traces latency as you sweep TP from 1 to 16; move the batch-size slider to watch the curve bend.

2D · choose your (TP, replicas) point on the frontier

Below is the same throughput-vs-latency frontier as the original widget, but interactive: pick a TP degree and a number of replicas with the sliders. The chosen point is highlighted; nearby alternatives are dim. Configs whose per-rank KV-cache exceeds HBM are greyed out as infeasible. The Pareto front is the chain in green.

Interactive · which shape gives you the QPS×latency point you want?

Set the cluster size, the model, and the per-request constraints. The widget lays out feasible (TP, replicas) shapes and overlays them on a "throughput × latency" plot. The Pareto front is the family of shapes that aren't dominated. Anything not on the front is a waste.
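The "not dominated" test behind a Pareto front can be sketched in a few lines. This is not the widget's actual code: the latency model, the 0.5 ms AllReduce cost, and the batch size are assumptions, and KV-cache feasibility is not checked.

```python
from dataclasses import dataclass

WEIGHT_BYTES = 140e9     # 70B params in bf16
HBM_BW = 3.35e12         # bytes/s per GPU
ALLREDUCE_S = 0.5e-3     # HYPOTHETICAL per-token AllReduce cost when TP > 1
BATCH = 32               # requests per replica, assumed to stay memory-bound
GPUS = 8


@dataclass(frozen=True)
class Shape:
    tp: int
    replicas: int
    latency_ms: float    # per-token latency, lower is better
    tok_per_s: float     # cluster throughput, higher is better


def make_shape(tp: int, replicas: int) -> Shape:
    t = WEIGHT_BYTES / (tp * HBM_BW) + (ALLREDUCE_S if tp > 1 else 0.0)
    return Shape(tp, replicas, t * 1e3, replicas * BATCH / t)


def dominates(a: Shape, b: Shape) -> bool:
    """a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.tok_per_s >= b.tok_per_s
    better = a.latency_ms < b.latency_ms or a.tok_per_s > b.tok_per_s
    return no_worse and better


def pareto_front(shapes: list[Shape]) -> list[Shape]:
    front = [s for s in shapes if not any(dominates(o, s) for o in shapes)]
    return sorted(front, key=lambda s: s.latency_ms)


# Every (TP, replicas) shape that fits in the cluster, including partial use.
candidates = [make_shape(tp, r)
              for tp in (1, 2, 4, 8)
              for r in range(1, GPUS // tp + 1)]
front = pareto_front(candidates)

for s in front:
    print(f"TP={s.tp}, replicas={s.replicas}: "
          f"{s.latency_ms:.1f} ms/token, {s.tok_per_s:.0f} tok/s")
```

In this model the front consists of the shapes that use all 8 GPUs; shapes that leave GPUs idle match some front shape's latency at lower throughput and are dominated.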

Takeaway

Decode is memory-bound: per-token time is set by HBM bandwidth, not FLOPs. TP=N spreads the read across N GPUs, dividing latency. Replication multiplies parallel requests. Per-cluster throughput is the same either way — until KV-cache or batching effects break the symmetry. The right choice is whichever knob your bottleneck is on.