
Case study — high-throughput batch (offline) inference & embeddings

This is the mirror image of lesson 13. There, latency was everything and batching was the enemy of TTFT; the design bent itself around a 200 ms wall. Here there is no latency SLO at all — so batching is not a tax, it is the entire game. The only objective is tokens-per-dollar. We end the six-case series on this clean inversion: same transformer, same loop, the opposite binding wall.

The brief
Offline jobs over a large corpus: score / classify / summarize / embed ~1B documents overnight, or generate synthetic training data. No per-request latency SLO — nobody is waiting on a single output. The constraints are only two: (1) finish within a window (say 12 h), and (2) minimize total $. Throughput and cost are the whole specification.

Run the loop. Before reading on: with no latency budget, where on lesson 03's frontier do you operate — and what stops being a constant that was fixed in every online design so far?

Stage 0 · Requirements → the only number that matters (lesson 03)

In lesson 03 every design negotiated the latency–throughput frontier and settled at the knee under the SLO ceiling. Remove the SLO ceiling and the knee disappears: you operate at the far right of the frontier, the largest batch you can fit, where each token is cheapest. Little's Law is irrelevant here — there is no W to bound, no concurrency target to hit, no GPU count falling out of L = λW. The objective collapses to a single formula (lesson 02's serving cost):

$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu · 3600) · 1e6
Binding constraint, stage 0
Throughput-per-dollar. Every lever in this design serves exactly two numbers: tok/s/GPU (the denominator) and $/GPU-hr (the numerator). Nothing else. A change that raises tok/s/GPU or lowers $/GPU-hr is good; everything else is noise. This is the simplest objective in the whole series — and that simplicity is the point of ending here.
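The cost formula above can be sketched as a small helper. The input values below ($3/GPU-hr, 1,000 vs 5,000 tok/s/GPU) are illustrative assumptions, not measurements; only the formula itself comes from the lesson.

```python
def dollars_per_million_tokens(gpu_hr_price: float, tok_per_s_per_gpu: float) -> float:
    """$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu * 3600) * 1e6."""
    if tok_per_s_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return gpu_hr_price / tokens_per_gpu_hour * 1e6


# Assumed, illustrative inputs: same GPU price, 5x the throughput.
online_cost = dollars_per_million_tokens(3.0, 1_000)
offline_cost = dollars_per_million_tokens(3.0, 5_000)
print(f"online  ${online_cost:.3f} / 1M tok")
print(f"offline ${offline_cost:.3f} / 1M tok")
```

Because price sits in the numerator and throughput in the denominator, a 5× throughput gain and an 80% price cut move the result by the same factor.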

Stage 1 · Max out the GPU — push past the roofline ridge (lesson 02, 04)

Online decode at small batch sits far left of the roofline ridge (~295 FLOP/byte on an H100, lesson 02) — bandwidth-bound, re-reading the weights for a handful of tokens. Batching moves you rightward: more tokens per weight-read. With no latency SLO you keep pushing batch until you are firmly right of the ridge — compute-bound, the FLOPs fully used, and tok/s/GPU at its ceiling. This is why offline throughput per GPU is routinely many× the latency-constrained online number: online had to stop at the knee; you don't.
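As a rough sketch of how far batch has to move, assume a weights-only model of decode: each parameter is read once per step and contributes about 2 FLOPs per token in the batch, with KV and activation traffic ignored. That simplification, the function names, and the 2-bytes-per-parameter default (fp16) are assumptions for illustration; the ~295 FLOP/byte ridge is the figure quoted above.

```python
import math

RIDGE_FLOP_PER_BYTE = 295.0  # H100 ridge as quoted in the lesson


def decode_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOP/byte for one decode step (weights-only assumption).

    Each parameter costs bytes_per_param bytes of traffic and does
    ~2 FLOPs (multiply + add) for every sequence in the batch.
    """
    return 2.0 * batch / bytes_per_param


def min_batch_past_ridge(ridge: float = RIDGE_FLOP_PER_BYTE,
                         bytes_per_param: float = 2.0) -> int:
    """Smallest batch whose approximate intensity reaches the ridge."""
    return math.ceil(ridge * bytes_per_param / 2.0)


for b in (1, 8, 64, 512):
    side = "compute-bound" if decode_intensity(b) >= RIDGE_FLOP_PER_BYTE else "bandwidth-bound"
    print(f"batch={b:4d}  intensity~{decode_intensity(b):6.1f} FLOP/byte  {side}")
print("batch needed to reach the ridge (fp16 weights):", min_batch_past_ridge())
```

Under this simplification an fp16 model needs a batch in the hundreds to cross the ridge, which is why an online design stuck at single-digit batch sits so far to the left.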

The ceiling on batch is not latency — it is KV memory (lesson 04). Bigger batch needs more KV resident, and KV is the scarce HBM tenant (lesson 02). So the levers are the ones that buy batch headroom and cheaper FLOPs:

unsorted batch (pad to longest):
  doc A  xxxx........  ← 60% pad
  doc B  xx..........  ← 80% pad
  doc C  xxxxxx......  ← 40% pad
  doc D  xxxxxxxxxxxx  ← the tax-setter

length-bucketed batch:
  bin "short":  AAAA BBBB CCCC      ~0% pad
  bin "long":   DDDDDDDD EEEEEEE    ~5% pad

→ wasted $ ∝ padding fraction, recovered
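The padding effect in the figure can be reproduced with a minimal sketch. The document lengths and batch size here are made up for illustration, and sorting by length is only one simple way to form buckets; a production batcher would likely bucket by token budget rather than by a fixed document count.

```python
from typing import List


def make_batches(lengths: List[int], batch_size: int, bucketed: bool) -> List[List[int]]:
    """Group document lengths into batches, optionally sorting by length first."""
    order = sorted(lengths) if bucketed else list(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_fraction(batches: List[List[int]]) -> float:
    """Fraction of processed token slots that are padding (pad to longest in batch)."""
    padded_slots = sum(max(batch) * len(batch) for batch in batches)
    real_tokens = sum(sum(batch) for batch in batches)
    return 1.0 - real_tokens / padded_slots


# Hypothetical token counts for eight documents.
doc_lengths = [4, 12, 2, 6, 3, 11, 5, 12]

unsorted_pad = padding_fraction(make_batches(doc_lengths, batch_size=4, bucketed=False))
bucketed_pad = padding_fraction(make_batches(doc_lengths, batch_size=4, bucketed=True))
print(f"unsorted padding: {unsorted_pad:.1%}")
print(f"bucketed padding: {bucketed_pad:.1%}")
```

The padding fraction is paid for in GPU time without producing useful tokens, so it multiplies straight into $/1M tokens.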

Stage 2 · Cheaper hardware — the lever online couldn't pull (lesson 11)

Online serving treats $/gpu-hr as a constant: you cannot drop a live user, so you pay the on-demand price for reliable GPUs. Batch jobs are fault-tolerant — a failed shard just retries, and no one notices. That unlocks spot / preemptible GPUs at ~30–70% off (lesson 11). The discipline is the same as fault-tolerant training (lesson 07): checkpoint progress — persist which documents are done — so a preemption costs minutes of re-work, not hours.
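A minimal sketch of "persist which documents are done", assuming a local append-only file of completed document IDs. The names `run_shard` and `load_done` and the one-ID-per-line format are hypothetical; a real job would more likely record progress in an object store or database, and would persist each output before marking its document done.

```python
import os
from typing import Callable, Iterable, Set


def load_done(ckpt_path: str) -> Set[str]:
    """Return the IDs already recorded as finished (empty if no checkpoint yet)."""
    if not os.path.exists(ckpt_path):
        return set()
    with open(ckpt_path) as f:
        return {line.strip() for line in f if line.strip()}


def run_shard(doc_ids: Iterable[str], process: Callable[[str], None], ckpt_path: str) -> int:
    """Process every doc not yet checkpointed; return how many were processed now."""
    done = load_done(ckpt_path)
    processed = 0
    with open(ckpt_path, "a") as f:
        for doc_id in doc_ids:
            if doc_id in done:
                continue  # finished before the last preemption, skip the re-work
            process(doc_id)
            f.write(doc_id + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before moving on
            processed += 1
    return processed
```

Calling `run_shard` again after a preemption with the same `ckpt_path` resumes from the first unfinished document, so the re-work is bounded by what was in flight when the node disappeared. Syncing after every document is the simplest choice; batching the syncs trades a little re-work for less I/O.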

Binding constraint, stage 2
$/GPU-hr is now a lever, not a constant. This is the structural difference from every online case in the series. In lesson 13 the price of compute was fixed and you fought on the latency axis; here you attack both terms of the cost formula. A 70% spot discount alone cuts $/1M tokens by about 3.3× (1 / (1 − 0.7)), before a single throughput optimization.

Stage 3 · Scheduling & packing — finish at the window, not before

The deadline turns this into a packing problem: keep every GPU busy until the 12 h window, then stop. A job scheduler shards the corpus, packs shards onto GPUs, and right-sizes the fleet to finish exactly at the window. Both errors cost money:

Fleet sizing      | Result                              | Cost
Over-provisioned  | finishes in 6 h, GPUs idle for 6 h  | you paid for headroom you didn't use
Right-sized       | finishes at ~12 h                   | minimal (the target)
Under-provisioned | misses the window                   | blows constraint (1); corpus not ready

The subtle failure is the straggler (callback to lesson 08): one slow shard (an oversized bin, a flaky object-store read, a preempted node mid-retry) delays the whole window, because the job is done only when its slowest shard is. Treat the slowest shard as your real completion time: cap per-shard size, replicate or re-issue stragglers near the deadline, and keep shards small enough that a preemption re-runs cheaply. The fleet size that finishes at 12 h is GPUs ≈ total tokens ÷ (tok/s/GPU × 12 h × 3600 s/h), padded by a straggler margin.
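The sizing rule can be sketched by solving for the GPU count directly. The 500 tokens per document, 5,000 tok/s/GPU, and 10% straggler margin below are assumptions chosen only to make the arithmetic concrete.

```python
import math


def gpus_needed(total_tokens: float,
                tok_per_s_per_gpu: float,
                window_hours: float,
                straggler_margin: float = 0.10) -> int:
    """GPUs required to finish total_tokens inside the window, with headroom."""
    tokens_per_gpu_in_window = tok_per_s_per_gpu * window_hours * 3600
    ideal = total_tokens / tokens_per_gpu_in_window
    return math.ceil(ideal * (1.0 + straggler_margin))


# Assumed workload: 1B documents at ~500 tokens each, 12 h window.
total_tokens = 1_000_000_000 * 500
fleet_no_margin = gpus_needed(total_tokens, 5_000, 12, straggler_margin=0.0)
fleet_with_margin = gpus_needed(total_tokens, 5_000, 12, straggler_margin=0.10)
print("GPUs, no margin:  ", fleet_no_margin)
print("GPUs, 10% margin: ", fleet_with_margin)
```

The margin is the price of the straggler insurance: too small and one slow shard misses the window, too large and the fleet drifts toward the over-provisioned case.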

Stage 4 · The embeddings sub-case — the extreme throughput regime

Embedding a corpus is the purest version of this workload. An embedding model is an encoder: one forward pass per document, no decode loop, so there is no KV cache growth and nothing to stream token-by-token. It is pure prefill — already compute-bound, trivially batchable, the rightmost point on the frontier by construction. The generation bottleneck simply vanishes.

So the bottleneck shifts off the GPU entirely: for 1B documents the cost moves to the ANN index build and vector storage — writing, sharding, and indexing billions of vectors downstream — not the forward passes that produced them. The GPU stops being the constraint; the embedding step is so cheap per token that the design question becomes "how do I store and index a billion vectors," which is a data-systems problem, not an inference one.
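To get a feel for the downstream data-systems load, here is a back-of-the-envelope sketch of raw vector storage. The 1024-dimensional embedding and the component widths are assumptions, and the figures exclude ANN index overhead and replication.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed to store the vectors alone, before any index structure."""
    return num_vectors * dim * bytes_per_component


NUM_DOCS = 1_000_000_000
DIM = 1024  # assumed embedding width; varies by model

fp32_bytes = raw_vector_bytes(NUM_DOCS, DIM, bytes_per_component=4)
fp16_bytes = raw_vector_bytes(NUM_DOCS, DIM, bytes_per_component=2)
print(f"fp32 vectors: {fp32_bytes / 1e12:.3f} TB")
print(f"fp16 vectors: {fp16_bytes / 1e12:.3f} TB")
```

Even before indexing, the output is terabytes of dense floats, which is why writing, sharding, and compressing vectors ends up setting the budget here rather than the forward passes.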

Interactive · tokens per dollar

This widget is the inversion of lesson 13's TTFT cliff: there you watched latency fall off a wall; here you watch $/1M tokens fall as you push batch past the ridge, quantize, and price in spot. Small batch leaves throughput — and money — on the floor. Push right, then attack the price.

Tokens per dollar (offline)

Compute-bound tok/s ≈ 990e12·quant_factor·MFU / (2·N·1e9), where N is the parameter count in billions (about 2 FLOPs per parameter per token) and MFU ≈ 0.4. Batch scales this by batch/(batch+const) to show diminishing returns toward the ridge plateau, and the result is then multiplied by (1 − padding%). fp8 gives ≈ 2× the throughput of fp16 and fits more batch. Effective $/GPU-hr = $3·(1 − spot%). The online baseline is the same model at batch ≈ 8 (the knee, latency-constrained). All figures are order-of-magnitude, per the series' ±30% contract.
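The widget's model can be approximated in a few lines. The value of the batch saturation constant is not given in the text, so `BATCH_CONST = 64` is a placeholder assumption, as are the function names and the 7B model size in the example; the remaining constants follow the description above.

```python
PEAK_FLOPS = 990e12      # peak FLOP/s used in the widget formula
MFU = 0.4                # model FLOPs utilization, as stated
BASE_PRICE = 3.0         # on-demand $/GPU-hr, as stated
BATCH_CONST = 64         # ASSUMED: the widget's "const" is not specified
ONLINE_BATCH = 8         # latency-constrained baseline batch


def tok_per_s(params_b: float, batch: int, quant_factor: float = 1.0,
              padding: float = 0.0) -> float:
    """Modelled tok/s/GPU: compute ceiling x batch saturation x (1 - padding)."""
    ceiling = PEAK_FLOPS * quant_factor * MFU / (2 * params_b * 1e9)
    return ceiling * batch / (batch + BATCH_CONST) * (1.0 - padding)


def cost_per_million(params_b: float, batch: int, quant_factor: float = 1.0,
                     padding: float = 0.0, spot: float = 0.0) -> float:
    """Modelled $/1M tokens at an effective price of $3 * (1 - spot)."""
    price = BASE_PRICE * (1.0 - spot)
    return price / (tok_per_s(params_b, batch, quant_factor, padding) * 3600) * 1e6


# Example: an assumed 7B model, online baseline vs an offline configuration.
baseline = cost_per_million(7, ONLINE_BATCH)
offline = cost_per_million(7, 512, quant_factor=2.0, padding=0.05, spot=0.7)
print(f"online baseline: ${baseline:.4f} / 1M tok")
print(f"offline:         ${offline:.4f} / 1M tok  ({baseline / offline:.0f}x cheaper)")
```

The absolute numbers depend heavily on the assumed constant, so only the direction and rough size of each lever should be read from this sketch.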

Widget readouts: tok/s/GPU · eff $/GPU-hr · $/1M tokens · vs online baseline

What this case teaches