Case study — high-throughput batch (offline) inference & embeddings

This is the mirror image of lesson 13. There, latency was everything and batching was the enemy of TTFT; the design bent itself around a 200 ms wall. Here there is no latency SLO at all — so batching is not a tax, it is the entire game. The only objective is tokens-per-dollar. We end the six-case series on this clean inversion: same transformer, same loop, the opposite binding wall.

The brief

Offline jobs over a large corpus: score / classify / summarize / embed ~1B documents overnight, or generate synthetic training data. No per-request latency SLO — nobody is waiting on a single output. The constraints are only two: (1) finish within a window (say 12 h), and (2) minimize total $. Throughput and cost are the whole specification.

Run the loop. Before reading on: with no latency budget, where on lesson 03's frontier do you operate — and what stops being a constant that was fixed in every online design so far?

Stage 0 · Requirements → the only number that matters (lesson 03)

In lesson 03 every design negotiated the latency–throughput frontier and settled at the knee under the SLO ceiling. Remove the SLO ceiling and the knee disappears: you operate at the far right of the frontier, the largest batch you can fit, where each token is cheapest. Little's Law is irrelevant here — there is no W to bound, no concurrency target to hit, no GPU count falling out of L = λW. The objective collapses to a single formula (lesson 02's serving cost):

$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu · 3600) · 1e6

Binding constraint, stage 0

Throughput-per-dollar. Every lever in this design serves exactly two numbers: tok/s/GPU (the denominator) and $/GPU-hr (the numerator). Nothing else. A change that raises tok/s/GPU or lowers $/GPU-hr is good; everything else is noise. This is the simplest objective in the whole series — and that simplicity is the point of ending here.
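The cost formula above can be sketched directly in code. This is a minimal illustration, not a pricing tool: the $3/GPU-hr figure is the on-demand price used later in this lesson, while the 2,000 tok/s/GPU starting point is a hypothetical number chosen only to make the arithmetic visible.

```python
def dollars_per_million_tokens(gpu_hr_price: float, tok_per_s_per_gpu: float) -> float:
    """$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu * 3600) * 1e6."""
    if tok_per_s_per_gpu <= 0:
        raise ValueError("tok_per_s_per_gpu must be positive")
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return gpu_hr_price / tokens_per_gpu_hour * 1e6


# Hypothetical baseline: $3/GPU-hr at 2,000 tok/s/GPU.
base = dollars_per_million_tokens(3.0, 2_000)
# Lever 1: double the denominator (tok/s/GPU).
faster = dollars_per_million_tokens(3.0, 4_000)
# Lever 2: halve the numerator ($/GPU-hr).
cheaper = dollars_per_million_tokens(1.5, 2_000)

print(f"base    ${base:.3f} per 1M tokens")
print(f"faster  ${faster:.3f} per 1M tokens")
print(f"cheaper ${cheaper:.3f} per 1M tokens")
```

Doubling throughput and halving the hourly price land on the same $/1M tokens, which is why the two terms are treated as interchangeable levers.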

Stage 1 · Max out the GPU — push past the roofline ridge (lesson 02, 04)

Online decode at small batch sits far left of the roofline ridge (~295 FLOP/byte on an H100, lesson 02) — bandwidth-bound, re-reading the weights for a handful of tokens. Batching moves you rightward: more tokens per weight-read. With no latency SLO you keep pushing batch until you are firmly right of the ridge — compute-bound, the FLOPs fully used, and tok/s/GPU at its ceiling. This is why offline throughput per GPU is routinely many× the latency-constrained online number: online had to stop at the knee; you don't.
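As a rough check on where the ridge sits, the sketch below divides peak compute by memory bandwidth and compares it with the arithmetic intensity of a decode step. The hardware constants (about 989 TFLOP/s dense 16-bit compute and 3.35 TB/s of HBM bandwidth for an H100) are assumptions, and the intensity model counts only weight reads, ignoring KV-cache and activation traffic, so treat the result as napkin math.

```python
PEAK_FLOPS = 989e12  # assumed H100 dense 16-bit peak, FLOP/s
HBM_BW = 3.35e12     # assumed H100 HBM bandwidth, bytes/s


def ridge_intensity(peak_flops: float = PEAK_FLOPS, hbm_bw: float = HBM_BW) -> float:
    """FLOP/byte at which a kernel stops being bandwidth-bound."""
    return peak_flops / hbm_bw


def decode_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOP/byte of one decode step.

    Each sequence costs ~2*N FLOPs and the N weights (bytes_per_param each)
    are read once for the whole batch, so N cancels out.
    """
    return 2.0 * batch / bytes_per_param


def is_compute_bound(batch: int, bytes_per_param: float = 2.0) -> bool:
    return decode_intensity(batch, bytes_per_param) >= ridge_intensity()


print(f"ridge ~ {ridge_intensity():.0f} FLOP/byte")
for b in (8, 64, 512):
    side = "compute-bound" if is_compute_bound(b) else "bandwidth-bound"
    print(f"batch {b:>3}: {decode_intensity(b):.0f} FLOP/byte, {side}")
```

Under these assumptions a 16-bit model needs a batch in the hundreds before decode crosses the ridge, which is the region an online design with a TTFT wall cannot reach.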

The ceiling on batch is not latency — it is KV memory (lesson 04). Bigger batch needs more KV resident, and KV is the scarce HBM tenant (lesson 02). So the levers are the ones that buy batch headroom and cheaper FLOPs:

fp8 / int8 quantization (lesson 06). Halving weight bytes frees HBM for more KV → bigger batch, and the cheaper compute lifts tok/s directly. Unlike the latency cases, you have no qualms about it here: the cheap tier is exactly where a small quality drop is acceptable, gated by an eval (lesson 10).
A smaller or MoE model, if quality allows. Cost is 2N FLOPs/token; an MoE routing top-2 of 8 experts spends ~2N_active — roughly a quarter of the FLOPs of a same-total-size dense model (lesson 06). When you are compute-bound, fewer active FLOPs/token is more tok/s/GPU. Drop to the smallest model that still passes the corpus's eval bar.
Sort/bucket inputs by sequence length. A mixed batch pads every sequence to the longest one, and the padding does real FLOPs that produce nothing. Wasted $ is proportional to padding fraction: a batch that is 40% padding wastes 40% of the compute, so only 60% of each GPU-hour is useful and you pay about 1/0.6 ≈ 1.67× the GPU-hours, roughly 1.67B documents' worth for 1B documents of useful work. Bucketing inputs into length bins before batching recovers almost all of it. This is free money that online serving (which must take requests as they arrive) cannot collect.

unsorted batch (pad to longest):
  doc A  xxxx......  ← 60% pad
  doc B  xx........  ← 80% pad
  doc C  xxxxxx....  ← 40% pad
  doc D  xxxxxxxxxx  ← the tax-setter

length-bucketed batch:
  bin "short":  AAAA BBBB CCCC     ~0% pad
  bin "long":   DDDDDDDD EEEEEEE   ~5% pad

→ wasted $ ∝ padding fraction, recovered
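The padding argument can be made concrete with a small sketch that measures the padded fraction of token slots with and without length bucketing. The document lengths and the batch size of 4 are hypothetical, and sorting the whole list stands in for a real bucketing scheme, which would typically use a few fixed length bins.

```python
from typing import List


def padding_fraction(batches: List[List[int]]) -> float:
    """Fraction of token slots that are padding when each batch pads to its longest member."""
    padded_slots = sum(max(b) * len(b) for b in batches)
    real_tokens = sum(sum(b) for b in batches)
    return 1.0 - real_tokens / padded_slots


def make_batches(lengths: List[int], batch_size: int, bucket: bool = False) -> List[List[int]]:
    """Chunk lengths into batches, optionally sorting first so similar lengths share a batch."""
    ordered = sorted(lengths) if bucket else list(lengths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


lengths = [4, 10, 2, 6, 4, 10, 2, 6]  # hypothetical token counts per document
unsorted_waste = padding_fraction(make_batches(lengths, batch_size=4))
bucketed_waste = padding_fraction(make_batches(lengths, batch_size=4, bucket=True))

for name, waste in (("unsorted", unsorted_waste), ("bucketed", bucketed_waste)):
    # GPU-hours paid per unit of useful work is 1 / (1 - waste).
    print(f"{name}: {waste:.0%} padding, {1 / (1 - waste):.2f}x GPU-hours")
```

On this toy input the padded fraction drops from about 45% to about 21%; with finer bins and larger batches the residual shrinks further.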

Stage 2 · Cheaper hardware — the lever online couldn't pull (lesson 11)

Online serving treats $/gpu-hr as a constant: you cannot drop a live user, so you pay the on-demand price for reliable GPUs. Batch jobs are fault-tolerant — a failed shard just retries, and no one notices. That unlocks spot / preemptible GPUs at ~30–70% off (lesson 11). The discipline is the same as fault-tolerant training (lesson 07): checkpoint progress — persist which documents are done — so a preemption costs minutes of re-work, not hours.
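One way to "persist which documents are done" is an append-only checkpoint file per shard, sketched below. Everything here is illustrative: the JSON-lines format, the `run_shard` and `load_done` helpers, and the simulated preemption are assumptions, and a real job would more likely write checkpoints to durable object storage than to a local temp directory.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def load_done(ckpt_path: str) -> Set[str]:
    """Return the doc IDs already recorded in the checkpoint file."""
    if not os.path.exists(ckpt_path):
        return set()
    with open(ckpt_path) as f:
        return {json.loads(line)["doc_id"] for line in f if line.strip()}


def run_shard(doc_ids: Iterable[str], process: Callable[[str], None], ckpt_path: str) -> int:
    """Process a shard, skipping finished docs and recording each one as it completes."""
    done = load_done(ckpt_path)
    processed = 0
    with open(ckpt_path, "a") as ckpt:
        for doc_id in doc_ids:
            if doc_id in done:
                continue
            process(doc_id)
            ckpt.write(json.dumps({"doc_id": doc_id}) + "\n")
            ckpt.flush()  # keep the record if the node disappears right after
            processed += 1
    return processed


docs = [f"doc-{i}" for i in range(5)]
ckpt_file = os.path.join(tempfile.mkdtemp(), "shard-0.done.jsonl")
calls = []


def process_until_preempted(doc_id: str) -> None:
    # Simulate a spot preemption after three documents.
    if len(calls) == 3:
        raise RuntimeError("simulated preemption")
    calls.append(doc_id)


try:
    run_shard(docs, process_until_preempted, ckpt_file)
except RuntimeError:
    pass

# The retry only touches the documents the first attempt never finished.
resumed = run_shard(docs, calls.append, ckpt_file)
print(f"retry processed {resumed} documents; total calls {len(calls)}")
```

The smaller the unit recorded in the checkpoint, the less work a preemption throws away, at the price of more checkpoint writes.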

Binding constraint, stage 2

$/GPU-hr is now a lever, not a constant. This is the structural difference from every online case in the series. In lesson 13 the price of compute was fixed and you fought on the latency axis; here you attack both terms of the cost formula. A 70% spot discount alone cuts $/1M tokens by about 3.3× (you pay 0.3 of the price, and 1/0.3 ≈ 3.3), before a single throughput optimization.

Stage 3 · Scheduling & packing — finish at the window, not before

The deadline turns this into a packing problem: keep every GPU busy until the 12 h window, then stop. A job scheduler shards the corpus, packs shards onto GPUs, and right-sizes the fleet to finish exactly at the window. Both errors cost money:

Fleet sizing	Result	Cost
Over-provisioned	finishes in 6 h, GPUs idle for 6 h	you paid for headroom you didn't use
Right-sized	finishes at ~12 h	minimal — the target
Under-provisioned	misses the window	blows constraint (1); corpus not ready

The subtle failure is the straggler (callback to lesson 08): one slow shard (an oversized bin, a flaky object-store read, a preempted node mid-retry) delays the whole window, because the job is done only when its slowest shard is. Treat the slowest shard as your real completion time: cap per-shard size, replicate or re-issue stragglers near the deadline, and keep shards small enough that a preemption re-runs cheaply. The fleet size that finishes at 12 h is GPUs = total tokens ÷ (tok/s/GPU × 12 h × 3600 s/h), rounded up, with the straggler margin baked in.
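The fleet-sizing arithmetic can be sketched as a single function. The corpus size in tokens, the per-GPU throughput, and the 10% straggler margin below are hypothetical inputs; the margin is modeled simply as a fraction of the window held back for retries and slow shards, which is one of several reasonable ways to bake it in.

```python
import math


def gpus_needed(
    total_tokens: float,
    tok_per_s_per_gpu: float,
    window_hours: float,
    straggler_margin: float = 0.1,
) -> int:
    """GPUs required to finish total_tokens inside the window.

    straggler_margin is the fraction of the window reserved for retries and
    slow shards, so only (1 - margin) of the window counts as planned work.
    """
    if not 0 <= straggler_margin < 1:
        raise ValueError("straggler_margin must be in [0, 1)")
    usable_seconds = window_hours * 3600 * (1 - straggler_margin)
    return math.ceil(total_tokens / (tok_per_s_per_gpu * usable_seconds))


# Hypothetical job: 1B documents at ~500 tokens each, 10,000 tok/s/GPU, 12 h window.
total = 1_000_000_000 * 500
no_margin = gpus_needed(total, 10_000, 12, straggler_margin=0.0)
with_margin = gpus_needed(total, 10_000, 12, straggler_margin=0.1)
print(f"no margin: {no_margin} GPUs, with 10% margin: {with_margin} GPUs")
```

Rounding up and holding back a margin both push toward the over-provisioned side of the table, which is the cheaper mistake when missing the window means the corpus is not ready.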

Stage 4 · The embeddings sub-case — the extreme throughput regime

Embedding a corpus is the purest version of this workload. An embedding model is an encoder: one forward pass per document, no decode loop, so there is no KV cache growth and nothing to stream token-by-token. It is pure prefill — already compute-bound, trivially batchable, the rightmost point on the frontier by construction. The generation bottleneck simply vanishes.

So the bottleneck shifts off the GPU entirely: for 1B documents the cost moves to the ANN index build and vector storage — writing, sharding, and indexing billions of vectors downstream — not the forward passes that produced them. The GPU stops being the constraint; the embedding step is so cheap per token that the design question becomes "how do I store and index a billion vectors," which is a data-systems problem, not an inference one.
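To see why storage becomes the design question, a quick estimate of the raw vector footprint helps. The 1024-dimensional embedding and the float32/float16 encodings are assumptions for illustration; the figure excludes ANN index overhead, replication, and metadata, all of which add to it.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the vectors alone, with no index or metadata."""
    return num_vectors * dim * bytes_per_value


NUM_DOCS = 1_000_000_000  # ~1B documents, from the brief
DIM = 1024                # hypothetical embedding width

fp32_tb = raw_vector_bytes(NUM_DOCS, DIM, 4) / 1e12
fp16_tb = raw_vector_bytes(NUM_DOCS, DIM, 2) / 1e12
print(f"float32: {fp32_tb:.3f} TB, float16: {fp16_tb:.3f} TB")
```

Several terabytes of vectors before any index is built is well beyond a single machine's memory, so sharding and index construction dominate the planning.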

Interactive · tokens per dollar

This widget is the inversion of lesson 13's TTFT cliff: there you watched latency fall off a wall; here you watch $/1M tokens fall as you push batch past the ridge, quantize, and price in spot. Small batch leaves throughput — and money — on the floor. Push right, then attack the price.

Tokens per dollar (offline)

Compute-bound tok/s ≈ 990e12·quant_factor·MFU / (2·N·1e9), with MFU≈0.4. Batch scales it by batch/(batch+const) to show diminishing returns up to the ridge plateau, then ×(1−padding%). fp8 ≈ 2× the throughput & fits more batch vs fp16. eff $/GPU-hr = $3·(1−spot%). Online baseline = the same model at batch≈8 (knee, latency-constrained). Order-of-magnitude per the series' ±30% contract.
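The widget's model can be approximated in a few lines. This is a sketch of the formula as described above, not the widget's actual source: the value of `BATCH_CONST` is a guess (the text only says "const"), and mapping fp8 to `quant_factor=2.0` and fp16 to `1.0` is an assumption read off "fp8 ≈ 2× the throughput".

```python
PEAK_FLOPS = 990e12   # from the formula in the text
MFU = 0.4             # from the text
ON_DEMAND_PRICE = 3.0 # $/GPU-hr, from the text
BATCH_CONST = 32      # hypothetical; the text does not give the constant
ONLINE_BATCH = 8      # online baseline batch, from the text


def tok_per_s(n_billion: float, batch: int, quant_factor: float = 1.0,
              padding_frac: float = 0.0) -> float:
    """Compute-bound ceiling, scaled by batch saturation and padding waste."""
    ceiling = PEAK_FLOPS * quant_factor * MFU / (2 * n_billion * 1e9)
    return ceiling * batch / (batch + BATCH_CONST) * (1 - padding_frac)


def cost_per_million(n_billion: float, batch: int, quant_factor: float = 1.0,
                     spot_frac: float = 0.0, padding_frac: float = 0.0) -> float:
    """$/1M tokens at the effective (spot-discounted) hourly price."""
    price = ON_DEMAND_PRICE * (1 - spot_frac)
    rate = tok_per_s(n_billion, batch, quant_factor, padding_frac)
    return price / (rate * 3600) * 1e6


online = cost_per_million(7, ONLINE_BATCH)                        # fp16, on-demand
offline = cost_per_million(7, 256, quant_factor=2.0, spot_frac=0.7)  # fp8, spot
print(f"online  ${online:.4f} per 1M tokens")
print(f"offline ${offline:.4f} per 1M tokens ({online / offline:.1f}x cheaper)")
```

The absolute numbers depend heavily on the guessed constant; the shape is what matters, with cost falling as batch grows and again as the spot discount is applied.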

Inputs (defaults): model params N in billions (7), batch size (256), quant selector (1 = fp8, 2 = fp16; default 1), spot discount % (0), padding waste % (0).

Outputs: tok/s/GPU, effective $/GPU-hr, $/1M tokens, and the ratio vs the online baseline.

What this case teaches

It is lesson 13 turned inside out. Remove the latency SLO and the wall flips: batching stops being the enemy of TTFT and becomes the entire mechanism. You operate at the far right of the frontier, compute-bound, where online serving was forbidden to go.
The objective is one number — $/1M tokens — and you attack both of its terms. Raise tok/s/GPU (huge batch past the ridge, fp8, smaller/MoE model, length-bucketing to kill padding waste) and lower $/GPU-hr (spot + checkpointing). Online could only touch the first; fault-tolerance unlocks the second.
The deadline replaces the SLO. Right-size the fleet to finish at the window — and beware the straggler (lesson 08), since the job ends only when its slowest shard does.
Embeddings are the extreme case: encoder, one forward pass, no KV growth — the bottleneck leaves the GPU and becomes index build + storage.
Six cases, one method. Code assistant (TTFT), consumer chatbot (KV × cost), RAG (prefill + retrieval), agentic platform (tail + state), long context (seq² + KV), and this (tokens/$) — same loop, same napkin math, same frontier. Only the binding wall moved, and each time the wall picked the system. That is the whole skill: find the wall, then design straight at it.