
Estimating throughput — tokens/s, requests/s, and the frontier

Latency (03a) is what one request feels; throughput is what the cluster bills for. The two are the same arithmetic read two ways: throughput is just batch ÷ TPOT. This lesson turns that into tokens/sec, requests/sec, and dollars-per-token — and draws the latency–throughput frontier that lesson 03 sketched by hand, now as a curve you compute.

The one idea
In 03a the weight read (2N bytes) was a fixed tax paid every decode step regardless of batch. Throughput is the payoff: pack B requests behind that one weight read and you serve B tokens for the price of one read. So throughput rises with batch — until the KV term catches the weight term and the free lunch ends. That ceiling, in tokens/sec and in dollars, is this lesson.

1 · Decode throughput = batch ÷ TPOT

Every decode step advances all B requests by one token and takes TPOT seconds (03a). So the replica emits B tokens every TPOT seconds:

decode_throughput = B / TPOT = (B · BW · MBU) / (2N + B · L · kv_per_tok) [tok/s]

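Read the fraction's behavior; it is the whole story. While 2N dominates the denominator (small B), throughput grows almost linearly with B. Once B · L · kv_per_tok dominates (large B), B cancels from top and bottom and throughput flattens at a ceiling: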

decode_throughput_max = (BW · MBU) / (L · kv_per_tok) [tok/s]
Worked: the decode ceiling for Llama-3-8B, 2K context, one H100
BW·MBU = 3.35e12 · 0.7 ≈ 2.35e12 bytes/s; L·kv_per_tok = 2000 · 131072 ≈ 2.62e8 bytes of KV read per generated token.
Ceiling = 2.35e12 / 2.62e8 ≈ 8,950 tok/s per GPU — no matter how big the batch. Compare single-stream (03a): ~147 tok/s. So batching buys up to ~60× here. That multiple is the economic case for batched serving — and it shrinks as context L grows, because long contexts spend the bandwidth on KV instead of on new tokens.
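The worked numbers above can be reproduced with a minimal sketch. The constants are the ones quoted in this section; the function names are illustrative and not part of any library. Note that the B = 1 result includes the small KV term, so it lands slightly below the ~147 tok/s single-stream figure.

```python
# Assumed inputs, taken from the worked example (Llama-3-8B, fp16, one H100).
N_PARAMS = 8e9        # parameters; fp16 weights occupy 2 * N bytes
BW = 3.35e12          # HBM bandwidth, bytes/s
MBU = 0.7             # memory-bandwidth utilization
KV_PER_TOK = 131072   # KV-cache bytes per token of context
CONTEXT = 2000        # L, tokens of context per request


def tpot(batch: int) -> float:
    """Seconds per decode step: bytes read per step / effective bandwidth."""
    bytes_per_step = 2 * N_PARAMS + batch * CONTEXT * KV_PER_TOK
    return bytes_per_step / (BW * MBU)


def decode_throughput(batch: int) -> float:
    """Tokens/s for the replica: B tokens emitted every TPOT seconds."""
    return batch / tpot(batch)


def decode_ceiling() -> float:
    """Limit of decode_throughput as the batch grows without bound."""
    return (BW * MBU) / (CONTEXT * KV_PER_TOK)


if __name__ == "__main__":
    for b in (1, 8, 64, 256, 1024):
        print(f"B={b:5d}  TPOT={tpot(b) * 1e3:7.2f} ms  "
              f"{decode_throughput(b):8.0f} tok/s")
    print(f"ceiling = {decode_ceiling():.0f} tok/s")
```

Throughput climbs quickly at small batch and then creeps toward the ceiling without reaching it, which is the saturation the formula predicts.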

2 · Prefill throughput — the other half

Prefill is compute-bound (03a), so its throughput is set by the FLOP roof, not bandwidth. One GPU ingests prompt tokens at:

prefill_throughput = (peak_FLOPs · MFU) / (2N) [prompt tok/s]

Here 2N is FLOPs per token (two per parameter), not the 2N bytes of fp16 weights from §1. For 8B on an H100 at 50% MFU (peak ≈ 9.89e14 FLOP/s, so 4.95e14 achieved): 4.95e14 / 16e9 ≈ 30,900 prompt tok/s. That is far above the decode ceiling, because prefill does the arithmetic-dense work the GPU is built for. This asymmetry is why a replica's two phases have wildly different appetites, and why disaggregating prefill from decode onto separate machines (lesson 05) can let each run at its own optimum instead of one starving the other.
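As a quick check of the prefill figure, here is a small sketch. The peak of 9.89e14 FLOP/s is an assumption (it is the value that makes 50% MFU equal the 4.95e14 used above), and the helper names are illustrative.

```python
PEAK_FLOPS = 9.89e14  # assumed H100 peak, FLOP/s (50% of this is 4.95e14)
MFU = 0.5             # model FLOPs utilization
N_PARAMS = 8e9        # parameters; a forward pass costs ~2 * N FLOPs per token


def prefill_throughput(peak_flops: float, mfu: float, n_params: float) -> float:
    """Prompt tokens/s one GPU can ingest when compute-bound."""
    return peak_flops * mfu / (2 * n_params)


def prefill_seconds(prompt_tokens: int) -> float:
    """Time to ingest one prompt at the compute roof (no queueing, no batching effects)."""
    return prompt_tokens / prefill_throughput(PEAK_FLOPS, MFU, N_PARAMS)


if __name__ == "__main__":
    print(f"{prefill_throughput(PEAK_FLOPS, MFU, N_PARAMS):,.0f} prompt tok/s")
    print(f"2,000-token prompt: {prefill_seconds(2000) * 1e3:.0f} ms")
```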

Don't add the two throughputs
Prefill tok/s and decode tok/s measure different work (prompt ingestion vs generation) and, on a shared replica, compete for the same GPU. A replica doesn't deliver "30,900 + 8,950"; it time-slices between them, and the scheduler's split (chunked prefill, lesson 06) sets the real blend. Quote them separately and let the workload's prompt:output ratio decide which one is your binding throughput.
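One way to see why the two numbers do not add is an idealized time-slicing model: assume the GPU serves prefill and decode strictly one after the other, each at its own quoted rate. This is an assumption made for illustration, not how a real scheduler with chunked prefill behaves, and the prompt and output lengths below are hypothetical.

```python
PREFILL_TOK_S = 30_900   # prompt tok/s, from the prefill estimate
DECODE_TOK_S = 8_950     # generated tok/s, the decode ceiling


def blended_requests_per_s(prompt_len: int, output_len: int) -> float:
    """Requests/s under idealized serial time-slicing (assumption).

    Each request consumes prompt_len / PREFILL_TOK_S seconds of prefill
    capacity plus output_len / DECODE_TOK_S seconds of decode capacity.
    """
    gpu_seconds = prompt_len / PREFILL_TOK_S + output_len / DECODE_TOK_S
    return 1.0 / gpu_seconds


def decode_time_share(prompt_len: int, output_len: int) -> float:
    """Fraction of GPU time spent decoding under the same assumption."""
    prefill_s = prompt_len / PREFILL_TOK_S
    decode_s = output_len / DECODE_TOK_S
    return decode_s / (prefill_s + decode_s)


if __name__ == "__main__":
    # Hypothetical workload: 1,500 prompt tokens, 500 generated tokens.
    print(f"{blended_requests_per_s(1500, 500):.2f} req/s")
    print(f"decode share of GPU time: {decode_time_share(1500, 500):.0%}")
```

Under this model the phase that takes the larger share of GPU time is the binding throughput for that prompt:output ratio.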

3 · The frontier, computed

Lesson 03 drew the latency–throughput trade-off freehand and called the sweet spot "the knee." You can now plot it. Sweep batch B from 1 upward and compute both curves from the 03a formula: TPOT(B) = (2N + B · L · kv_per_tok) / (BW · MBU), and decode_throughput(B) = B / TPOT(B).

The SLO is a horizontal line on the TPOT axis. The knee is the largest batch whose TPOT still clears the SLO: the cheapest tokens you can serve without breaking the latency promise. The binding cap is that batch or lesson 04's KV-memory ceiling, whichever is smaller, and it is exactly the requests/GPU number that lesson 03's Little's-Law calculator and lesson 04's replica both consume. The widget below draws all of it live.
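The sweep and the knee can be sketched in a few lines. The constants are this lesson's 8B / 2K / H100 assumptions; `kv_memory_cap` is a hypothetical parameter standing in for lesson 04's KV-memory ceiling, whose value is not given here. With a 25 ms SLO the sketch lands close to the ~160 batch quoted later in this lesson, well inside the ±30% band.

```python
from typing import Optional

N_PARAMS = 8e9        # parameters (fp16 weights = 2 * N bytes)
BW, MBU = 3.35e12, 0.7
KV_PER_TOK = 131072   # KV bytes per token of context
CONTEXT = 2000        # L


def tpot(batch: int) -> float:
    """Seconds per decode step at a given batch size."""
    return (2 * N_PARAMS + batch * CONTEXT * KV_PER_TOK) / (BW * MBU)


def knee_batch(slo_s: float, kv_memory_cap: Optional[int] = None,
               max_batch: int = 4096) -> int:
    """Largest batch whose TPOT is within the SLO (0 if even B=1 misses it).

    If kv_memory_cap is given, the binding cap is the lower of the two.
    """
    best = 0
    for b in range(1, max_batch + 1):
        if tpot(b) > slo_s:
            break  # TPOT only grows with B, so no larger batch can pass
        best = b
    return best if kv_memory_cap is None else min(best, kv_memory_cap)


if __name__ == "__main__":
    b = knee_batch(0.025)
    print(f"knee batch = {b}, TPOT = {tpot(b) * 1e3:.1f} ms, "
          f"throughput = {b / tpot(b):,.0f} tok/s")
```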

Throughput & the latency–throughput frontier

Sweeps batch B. Blue = decode throughput (tok/s, left axis); amber = TPOT (ms, right axis); red dashed = your TPOT SLO; the marker is the knee — the max batch under the SLO. Switch mode to batch to drop the SLO and operate at the ceiling instead. Same assumptions as 03a (fp16 weights; KV/token interpolated between lesson 02's 128 KB @ 8B and 320 KB @ 70B; L = context; ±30%).

Widget readouts: operating batch · throughput · decode ceiling · $ / 1M tokens

4 · From tokens/s to dollars — closing lesson 02's loop

Lesson 02 gave the serving-cost formula and left achieved tokens/sec/GPU as the unknown. You just computed it. Plug the knee throughput in:

$ / 1M tokens = ($/gpu-hr) / (throughput_tok_per_s · 3600) · 1e6

At the 8B knee (say ~6,000 tok/s within a 25 ms SLO) and $2/GPU-hr: 2 / (6000·3600) · 1e6 ≈ $0.093 / 1M tokens. Now every lever in this track has a price tag. A bigger model raises 2N → lower throughput → higher $/token. fp8 KV halves kv_per_tok → lifts the ceiling → cheaper. Longer context raises L → lower ceiling → pricier. The chain bytes → batch → throughput → dollars is the spine of every serving design, and you can now walk it end to end without a profiler.
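The cost formula is a one-liner. The sketch below evaluates it at the knee figure used above and at the decode ceiling from §1, with the $2/GPU-hr price taken from the example; the function name is illustrative.

```python
def dollars_per_million_tokens(gpu_hourly_usd: float, tok_per_s: float) -> float:
    """Serving cost per 1M tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tok_per_s * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6


if __name__ == "__main__":
    for label, tok_s in (("knee (~6,000 tok/s)", 6000),
                         ("decode ceiling (~8,950 tok/s)", 8950)):
        cost = dollars_per_million_tokens(2.0, tok_s)
        print(f"{label:32s} ${cost:.3f} / 1M tokens")
```

Because cost is inversely proportional to throughput, any lever that moves tok/s moves $/token by the same factor in the opposite direction.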

5 · From tokens/s to requests/s — handing back to Little's Law

Capacity planning (lesson 03) speaks in requests, not tokens. Convert with the average output length G (tokens generated per request). A replica emitting decode_throughput tok/s completes

requests/s ≈ decode_throughput / G

and that is the per-replica service rate you multiply into a fleet. The loop closes cleanly: 03a gave you TPOT, 03b turned it into throughput and then into requests/s, lesson 03's Little's Law turns requests/s + latency into the concurrency a fleet must hold, and lesson 04 confirms the per-replica batch the memory budget actually permits. Four lessons, one unbroken chain from a model spec to a GPU count and a monthly bill.
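A short sketch of the hand-off to Little's Law follows. The output length G = 500 and the 100 req/s target are hypothetical, and per-request latency is approximated as G · TPOT, which ignores prefill time. Under those assumptions, the concurrency Little's Law returns is the operating batch itself.

```python
import math


def requests_per_s(decode_tok_s: float, avg_output_len: float) -> float:
    """Per-replica service rate: tokens/s divided by tokens per request."""
    return decode_tok_s / avg_output_len


def replicas_needed(target_req_s: float, per_replica_req_s: float) -> int:
    """Replicas required to carry a target arrival rate."""
    return math.ceil(target_req_s / per_replica_req_s)


def little_concurrency(arrival_req_s: float, latency_s: float) -> float:
    """Little's Law: requests in flight = arrival rate x time in system."""
    return arrival_req_s * latency_s


if __name__ == "__main__":
    G = 500          # hypothetical average output length, tokens
    TPOT = 0.025     # seconds, the SLO from the knee example
    rate = requests_per_s(6000, G)    # knee throughput from the cost example
    latency = G * TPOT                # decode-only latency (assumption)
    print(f"{rate:.1f} req/s per replica")
    print(f"{replicas_needed(100, rate)} replicas for 100 req/s")
    print(f"{little_concurrency(rate, latency):.0f} requests in flight per replica")
```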

6 · Batch / offline throughput — operate at the ceiling, not the knee

§3's knee exists only because an SLO caps the batch. Remove the human — an offline job (embeddings, bulk classification, dataset generation) carries no TPOT SLO — and the knee vanishes. You push the batch all the way to the memory limit (lesson 04) and ride the §1 throughput ceiling. Same hardware, a different operating point, and the cheapest token you will ever serve.

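Two estimates shift in the batch regime: generation jobs run at the §1 decode ceiling rather than at the knee, and prefill-only jobs (embeddings, bulk classification) run at the §2 compute roof.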

Worked: one 8B model, three jobs, three throughputs
H100, 8B, fp16, 2K context. Online chat (25 ms TPOT SLO): knee ≈ 160 → ~6,300 tok/s. Offline generation (no SLO, batch to the memory limit): ~8,900 tok/s — the decode ceiling. Offline embeddings (prefill only): ~30,900 tok/s — the compute roof. The model never changed; the SLO (or its absence) moved throughput ~5× — and the bill with it.

That spread — and the spot-GPU, length-bucketing, and scheduling tactics that bank it — is the whole subject of the batch case study, lesson 18. The estimate to carry: dropping the SLO lifts decode to its ceiling (~1.4× the knee here), while a prefill-only embedding job reaches the compute roof (~5×) — with $/token falling to match. Flip the widget to batch mode to watch the operating point leave the knee.
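The three operating points can be put side by side in one sketch. The constants are this lesson's assumptions. Offline generation is taken at the decode ceiling, which is an upper bound, since the real memory-limited batch is left to lesson 04. The numbers land near the ~1.4× and ~5× quoted above.

```python
N_PARAMS = 8e9
BW, MBU = 3.35e12, 0.7
PEAK_FLOPS, MFU = 9.89e14, 0.5   # assumed H100 peak, 50% MFU
KV_PER_TOK, CONTEXT = 131072, 2000
GPU_HOURLY_USD = 2.0
SLO_S = 0.025

kv_bytes = CONTEXT * KV_PER_TOK   # KV bytes read per request per decode step

# Online chat: largest batch with TPOT <= SLO (closed form of the sweep).
knee = int((SLO_S * BW * MBU - 2 * N_PARAMS) / kv_bytes)
knee_tok_s = knee / ((2 * N_PARAMS + knee * kv_bytes) / (BW * MBU))

# Offline generation: upper bound is the decode ceiling.
ceiling_tok_s = BW * MBU / kv_bytes

# Offline embeddings: prefill only, at the compute roof.
prefill_tok_s = PEAK_FLOPS * MFU / (2 * N_PARAMS)

jobs = {
    "online chat (25 ms SLO)": knee_tok_s,
    "offline generation": ceiling_tok_s,
    "offline embeddings": prefill_tok_s,
}

if __name__ == "__main__":
    for name, tok_s in jobs.items():
        usd = GPU_HOURLY_USD / (tok_s * 3600) * 1e6
        print(f"{name:26s} {tok_s:9,.0f} tok/s  {tok_s / knee_tok_s:4.1f}x  "
              f"${usd:.3f}/1M")
```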

Why throughput stays bandwidth-bound (the roofline, once more)
A decode step's arithmetic intensity is roughly B FLOP/byte while weight-bound (B tokens share one weight read), and it saturates near B* (~61 for our 8B/2K case) once KV-bound — it never climbs far. To become compute-bound you'd need it past the H100 ridge of ~295 (lesson 02). But the KV memory budget (lesson 04) usually caps the batch in the dozens-to-low-hundreds — below the ridge — so decode almost always stays left of it, bandwidth-bound. That's the deep reason the throughput ceiling in §1 is written in bandwidth, not FLOPs, and why "spend fewer bytes per token" (GQA, fp8 KV, paging) is the throughput game.
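The roofline argument can be checked numerically. This sketch counts only weight FLOPs (2N per token) and ignores attention FLOPs, which is a simplifying assumption; the peak-FLOPs value is also assumed, chosen to match the ~295 ridge quoted above.

```python
N_PARAMS = 8e9
BW = 3.35e12            # HBM bandwidth, bytes/s
PEAK_FLOPS = 9.89e14    # assumed H100 peak, FLOP/s
KV_PER_TOK, CONTEXT = 131072, 2000


def arithmetic_intensity(batch: int) -> float:
    """FLOPs per byte moved in one decode step (weight FLOPs only)."""
    flops = 2 * N_PARAMS * batch
    bytes_moved = 2 * N_PARAMS + batch * CONTEXT * KV_PER_TOK
    return flops / bytes_moved


B_STAR = 2 * N_PARAMS / (CONTEXT * KV_PER_TOK)   # batch where KV bytes == weight bytes
RIDGE = PEAK_FLOPS / BW                          # FLOP/byte needed to be compute-bound

if __name__ == "__main__":
    print(f"B* = {B_STAR:.0f}, ridge = {RIDGE:.0f} FLOP/byte")
    for b in (1, 16, 61, 256, 4096):
        print(f"B={b:5d}  intensity = {arithmetic_intensity(b):5.1f} FLOP/byte")
```

In this model the intensity equals B / (1 + B / B*), so it approaches B* from below and stays under the ridge at every batch size.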

What carries forward