Estimating throughput — tokens/s, requests/s, and the frontier
Latency (03a) is what one request feels; throughput is what the cluster bills for. The two are the same arithmetic read two ways: throughput is just batch ÷ TPOT. This lesson turns that into tokens/sec, requests/sec, and dollars-per-token — and draws the latency–throughput frontier that lesson 03 sketched by hand, now as a curve you compute.
1 · Decode throughput = batch ÷ TPOT
Every decode step advances all B requests by one token and takes TPOT seconds (03a). So the replica emits B tokens every TPOT seconds:
throughput (tok/s) = B / TPOT = B·(BW·MBU) / (2N + B·L·kv)
Read the fraction's behavior — it's the whole story:
- Small B (weight-bound): denominator ≈ 2N (constant), so throughput grows almost linearly with batch. Doubling the batch nearly doubles tokens/s at virtually no latency cost. This is the free lunch.
- Large B (KV-bound): the B·L·kv term dominates, the B's cancel, and throughput flattens to a ceiling that no batch size can beat:
ceiling (tok/s) = (BW·MBU) / (L·kv)
Ceiling = 2.35e12 / 2.62e8 ≈ 8,950 tok/s per GPU — no matter how big the batch. Compare single-stream (03a): ~147 tok/s. So batching buys up to ~60× here. That multiple is the economic case for batched serving — and it shrinks as context L grows, because long contexts spend the bandwidth on KV instead of on new tokens.
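The two regimes can be checked numerically. The sketch below is a minimal illustration, assuming the rounded figures quoted in this section (BW·MBU = 2.35e12 bytes/s, 2N = 16e9 bytes, L·kv = 2.62e8 bytes). The constant and function names are hypothetical and exist only for this example.

```python
# Rounded figures from this section; names are illustrative, not a library API.
EFFECTIVE_BW = 2.35e12     # BW * MBU, bytes/s
WEIGHT_BYTES = 16e9        # 2N: weight bytes read on every decode step
KV_BYTES_PER_REQ = 2.62e8  # L * kv: KV-cache bytes read per request per step


def tpot(batch):
    """Seconds per decode step: bytes moved divided by effective bandwidth."""
    return (WEIGHT_BYTES + batch * KV_BYTES_PER_REQ) / EFFECTIVE_BW


def decode_throughput(batch):
    """Tokens per second emitted by one replica at the given batch size."""
    return batch / tpot(batch)


# Large-B limit: the B's cancel and only the KV term remains.
ceiling = EFFECTIVE_BW / KV_BYTES_PER_REQ

for b in (1, 2, 8, 64, 512, 4096):
    print(f"B={b:5d}  TPOT={tpot(b) * 1e3:7.2f} ms  tok/s={decode_throughput(b):8.0f}")
print(f"ceiling = {ceiling:,.0f} tok/s")
```

With these rounded constants the ceiling comes out near 8,970 tok/s rather than 8,950; the gap is presumably rounding in the inputs. The sweep shows tokens/s almost doubling from B=1 to B=2, then flattening as B grows.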
2 · Prefill throughput — the other half
Prefill is compute-bound (03a), so its throughput is set by the FLOP roof, not bandwidth. One GPU ingests prompt tokens at:
prefill throughput (tok/s) = (peak·MFU) / 2N
For 8B on an H100 at 50% MFU: 4.95e14 / 16e9 ≈ 30,900 prompt tok/s — far above the decode ceiling, because prefill does the arithmetic-dense work the GPU is built for. This asymmetry is why a replica's two phases have wildly different appetites, and why disaggregating prefill from decode onto separate machines (lesson 05) can let each run at its own optimum instead of one starving the other.
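The same arithmetic as a short sketch, assuming the figures above (peak·MFU = 4.95e14 FLOP/s, 2N = 16e9 FLOPs per token) and the rounded §1 decode ceiling. The 2,000-token prompt and all names are hypothetical, chosen only to make the example concrete.

```python
# Figures from the text; names and the prompt length are illustrative.
EFFECTIVE_FLOPS = 4.95e14          # peak * MFU, FLOP/s (H100 at 50% MFU)
FLOPS_PER_TOKEN = 16e9             # 2N for an 8B-parameter model
DECODE_CEILING = 2.35e12 / 2.62e8  # section 1 ceiling, tok/s (rounded inputs)


def prefill_throughput(effective_flops, flops_per_token):
    """Prompt tokens ingested per second when prefill is compute-bound."""
    return effective_flops / flops_per_token


def prefill_seconds(prompt_tokens, rate):
    """Time to ingest one prompt at the given prefill rate."""
    return prompt_tokens / rate


rate = prefill_throughput(EFFECTIVE_FLOPS, FLOPS_PER_TOKEN)
print(f"prefill rate: {rate:,.0f} prompt tok/s")
print(f"ratio to decode ceiling: {rate / DECODE_CEILING:.2f}x")
print(f"2,000-token prompt: {prefill_seconds(2000, rate) * 1e3:.1f} ms")
```

The two rates are printed side by side rather than added, since they measure different phases of the same replica.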
3 · The frontier, computed
Lesson 03 drew the latency–throughput trade-off freehand and called the sweet spot "the knee." You can now plot it. Sweep batch B from 1 upward and compute both curves from the 03a formula:
- Throughput = B/TPOT rises, then saturates at the §1 ceiling.
- TPOT stays flat, then climbs linearly past B* (03a).
The SLO is a horizontal line on the TPOT axis. The knee is the largest batch whose TPOT still clears the SLO: the cheapest tokens you can serve without breaking the latency promise. The binding cap is that batch or lesson 04's KV-memory ceiling, whichever is smaller, and it is exactly the requests/GPU number that lesson 03's Little's-Law calculator and lesson 04's replica both consume. The widget below draws all of it live.
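The sweep is easy to reproduce outside the widget. The sketch below is one way to do it, assuming the rounded §1 figures and a 25 ms TPOT SLO. The `memory_cap` parameter stands in for lesson 04's KV-memory ceiling and is a hypothetical input here, not a computed value.

```python
# Rounded figures from section 1; names are illustrative.
EFFECTIVE_BW = 2.35e12     # BW * MBU, bytes/s
WEIGHT_BYTES = 16e9        # 2N
KV_BYTES_PER_REQ = 2.62e8  # L * kv


def tpot(batch):
    """Seconds per decode step at the given batch size."""
    return (WEIGHT_BYTES + batch * KV_BYTES_PER_REQ) / EFFECTIVE_BW


def frontier(max_batch):
    """(batch, TPOT seconds, tokens/s) for every batch from 1 to max_batch."""
    return [(b, tpot(b), b / tpot(b)) for b in range(1, max_batch + 1)]


def knee_batch(slo_s, max_batch=4096, memory_cap=None):
    """Largest batch whose TPOT meets the SLO, optionally capped by KV memory.

    Returns 0 when even a batch of one misses the SLO.
    """
    feasible = [b for b, t, _ in frontier(max_batch) if t <= slo_s]
    best = max(feasible, default=0)
    if memory_cap is not None:
        best = min(best, memory_cap)  # binding cap is the lower of the two
    return best


knee = knee_batch(0.025)
print(f"knee batch = {knee}, TPOT = {tpot(knee) * 1e3:.2f} ms, "
      f"tok/s = {knee / tpot(knee):,.0f}")
```

With these rounded inputs the knee lands at B = 163 and roughly 6,500 tok/s, the same neighborhood as the round ~6,000 tok/s figure used in §4. A linear scan is used for clarity; because TPOT is linear in B, the knee could also be solved in closed form.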
4 · From tokens/s to dollars — closing lesson 02's loop
Lesson 02 gave the serving-cost formula and left achieved tokens/sec/GPU as the unknown. You just computed it. Plug the knee throughput in:
$/1M tok = price_per_GPU_hr / (tok_per_s·3600) · 1e6
At the 8B knee (say ~6,000 tok/s within a 25 ms SLO) and $2/GPU-hr: 2 / (6000·3600) · 1e6 ≈ $0.093 / 1M tokens. Now every lever in this track has a price tag. A bigger model raises 2N → lower throughput → higher $/token. fp8 KV halves kv_per_tok → lifts the ceiling → cheaper. Longer context raises L → lower ceiling → pricier. The chain bytes → batch → throughput → dollars is the spine of every serving design, and you can now walk it end to end without a profiler.
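The price tag on each lever can be sketched in a few lines. This example assumes the $2/GPU-hr price and ~6,000 tok/s knee quoted above, and prices the KV and context levers at the decode ceiling using the rounded §1 figures. Function names are hypothetical.

```python
# Rounded figures from the text; names are illustrative.
EFFECTIVE_BW = 2.35e12     # BW * MBU, bytes/s
KV_BYTES_PER_REQ = 2.62e8  # L * kv at the baseline context and KV precision


def dollars_per_million_tokens(price_per_gpu_hr, tok_per_s):
    """Serving cost per 1M tokens for one GPU at a sustained token rate."""
    tokens_per_hour = tok_per_s * 3600
    return price_per_gpu_hr / tokens_per_hour * 1e6


def decode_ceiling(effective_bw, kv_bytes_per_req):
    """Large-batch decode ceiling in tok/s."""
    return effective_bw / kv_bytes_per_req


knee_cost = dollars_per_million_tokens(2.0, 6000)

base = decode_ceiling(EFFECTIVE_BW, KV_BYTES_PER_REQ)
fp8_kv = decode_ceiling(EFFECTIVE_BW, KV_BYTES_PER_REQ / 2)    # halved kv_per_tok
long_ctx = decode_ceiling(EFFECTIVE_BW, KV_BYTES_PER_REQ * 2)  # doubled L

print(f"knee:            ${knee_cost:.3f} / 1M tok")
for label, rate in (("ceiling", base), ("fp8 KV", fp8_kv), ("2x context", long_ctx)):
    cost = dollars_per_million_tokens(2.0, rate)
    print(f"{label:16s} ${cost:.3f} / 1M tok")
```

Halving the KV bytes doubles the ceiling and halves the cost at that ceiling, while doubling the context does the reverse.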
5 · From tokens/s to requests/s — handing back to Little's Law
Capacity planning (lesson 03) speaks in requests, not tokens. Convert with the average output length G. A replica emitting throughput tok/s completes requests at:
requests/s = throughput / G
That is the per-replica service rate you multiply into a fleet. The loop closes cleanly: 03a gave you TPOT, 03b turned it into throughput and then into requests/s, lesson 03's Little's Law turns requests/s + latency into the concurrency a fleet must hold, and lesson 04 confirms the per-replica batch the memory budget actually permits. Four lessons, one unbroken chain from a model spec to a GPU count and a monthly bill.
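A small sketch of the hand-off, assuming the ~6,000 tok/s knee from §4. The output length G = 300 tokens, the 100 requests/s target, and the approximation that a request spends about G·TPOT in decode (ignoring prefill and queueing) are all illustrative assumptions, not figures from this track.

```python
import math


def requests_per_second(tok_per_s, avg_output_tokens):
    """Per-replica service rate: requests/s = throughput / G."""
    return tok_per_s / avg_output_tokens


def replicas_needed(target_rps, per_replica_rps):
    """Whole replicas required to carry a target request rate."""
    return math.ceil(target_rps / per_replica_rps)


def in_flight(arrival_rps, latency_s):
    """Little's Law: average concurrency = arrival rate * time in system."""
    return arrival_rps * latency_s


G = 300        # hypothetical average output length, tokens
TPOT = 0.025   # seconds per token at the SLO

rps = requests_per_second(6000, G)
latency = G * TPOT  # decode time only; prefill and queueing ignored
print(f"per-replica rate: {rps:.1f} req/s")
print(f"replicas for 100 req/s: {replicas_needed(100, rps)}")
print(f"in flight per replica: {in_flight(rps, latency):.0f}")
```

Under these assumptions the in-flight count per replica comes back close to the decode batch, which is the consistency check Little's Law provides.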
6 · Batch / offline throughput — operate at the ceiling, not the knee
§3's knee exists only because an SLO caps the batch. Remove the human — an offline job (embeddings, bulk classification, dataset generation) carries no TPOT SLO — and the knee vanishes. You push the batch all the way to the memory limit (lesson 04) and ride the §1 throughput ceiling. Same hardware, a different operating point, and the cheapest token you will ever serve.
Two estimates shift in the batch regime:
- Decode-bound jobs (generation): operate at the §1 ceiling (BW·MBU)/(L·kv), not the knee. Push the batch toward the lesson-04 memory limit and stop there; $/token falls to its floor.
- Prefill-only jobs (embeddings, scoring, classification): there is no decode loop — you want one vector per input, not a stream. Throughput is the prefill rate from §2, (peak·MFU)/2N — compute-bound, and typically 3–5× the decode ceiling. This is why embedding a corpus costs far less per token than generating one.
That spread, along with the spot-GPU, length-bucketing, and scheduling tactics that bank it, is the whole subject of the batch case study, lesson 18. The estimate to carry: dropping the SLO lifts decode to its ceiling (~1.4× the knee here), while a prefill-only embedding job reaches the compute roof (~5× the knee), with $/token falling to match. Flip the widget to batch mode to watch the operating point leave the knee.
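The three operating points can be compared directly. The sketch below assumes the rounded figures from §1 and §2, a 25 ms SLO for the online knee, and $2/GPU-hr. Names are hypothetical, and the knee is solved in closed form from the TPOT expression rather than read from the widget.

```python
import math

# Rounded figures from sections 1 and 2; names are illustrative.
EFFECTIVE_BW = 2.35e12     # BW * MBU, bytes/s
WEIGHT_BYTES = 16e9        # 2N (bytes for decode)
KV_BYTES_PER_REQ = 2.62e8  # L * kv
EFFECTIVE_FLOPS = 4.95e14  # peak * MFU, FLOP/s
FLOPS_PER_TOKEN = 16e9     # 2N (FLOPs for prefill)
PRICE_PER_GPU_HR = 2.0
SLO_S = 0.025


def tpot(batch):
    """Seconds per decode step at the given batch size."""
    return (WEIGHT_BYTES + batch * KV_BYTES_PER_REQ) / EFFECTIVE_BW


def cost_per_million(tok_per_s):
    """Dollars per 1M tokens at a sustained token rate."""
    return PRICE_PER_GPU_HR / (tok_per_s * 3600) * 1e6


# Online: largest whole batch with TPOT <= SLO (TPOT is linear in batch).
knee_b = math.floor((SLO_S * EFFECTIVE_BW - WEIGHT_BYTES) / KV_BYTES_PER_REQ)
knee_rate = knee_b / tpot(knee_b)

# Offline decode: no SLO, so the large-batch ceiling applies.
ceiling_rate = EFFECTIVE_BW / KV_BYTES_PER_REQ

# Prefill-only: compute roof.
prefill_rate = EFFECTIVE_FLOPS / FLOPS_PER_TOKEN

for label, rate in (("online knee", knee_rate),
                    ("offline decode", ceiling_rate),
                    ("prefill-only", prefill_rate)):
    print(f"{label:15s} {rate:9,.0f} tok/s  {rate / knee_rate:4.2f}x knee  "
          f"${cost_per_million(rate):.3f} / 1M tok")
```

The ceiling here is the asymptote; a real job stops at the lesson-04 memory limit and lands slightly below it.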
What carries forward
- Throughput = B / TPOT. It rises near-linearly while weight-bound, then saturates at (BW·MBU)/(L·kv_per_tok), a ceiling no batch size beats.
- Prefill throughput = (peak·MFU)/2N is far higher and compute-bound; quote it separately, never summed with decode.
- The frontier is computable: the knee is the max batch under the TPOT SLO, and it's the requests/GPU the rest of the track consumes.
- $/1M tok = price / (tok_per_s·3600)·1e6 closes lesson 02's loop — bytes → batch → throughput → dollars, end to end.
- requests/s = throughput / G hands the number back to Little's Law (03) and replica sizing (04).
- Batch / offline drops the SLO — operate at the decode ceiling (~1.4× the knee) or, for prefill-only embeddings, the compute roof (peak·MFU)/2N (up to ~5×). Same model, the cheapest token (lesson 18).