
Estimating latency — TTFT and TPOT in milliseconds

Lesson 03 named the latency metrics and used them as SLO targets. It never told you how to predict them. This lesson does: given a model, a GPU, and a prompt, you compute TTFT and TPOT in milliseconds — before you write a line of serving code — using nothing but lesson 02's FLOPs, bytes, and the roofline. Two phases, two formulas. That's the whole lesson.

The one idea
Latency is just work ÷ rate. Prefill is compute-work limited by the FLOP/s roof, so TTFT = compute_work / compute_rate. Decode is memory-traffic limited by the bandwidth roof, so TPOT = bytes_moved / bandwidth. The roofline from lesson 02 told you which roof binds each phase; here we divide by that roof to get time. Everything below is filling in the two numerators.

1 · Why two formulas, not one

Lesson 01 split a request into prefill then decode; lesson 02's roofline told you they live on opposite sides of the arithmetic-intensity ridge (~295 FLOP/byte on an H100). That split is the reason a single "latency" number is meaningless for an LLM — the two phases are limited by two different pieces of hardware:

| Phase | Work it does | Roof that binds | Latency it sets |
|---|---|---|---|
| Prefill | one big matmul over all prompt tokens at once → high arithmetic intensity | compute (FLOP/s) | TTFT: time to first token |
| Decode | one tiny matmul per step (one token), re-reading every weight → low arithmetic intensity | memory bandwidth (B/s) | TPOT: time per output token |

So we estimate them with two different rates. Get this table into your bones and the arithmetic writes itself.
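A minimal sketch of the roof test behind this table, assuming the H100 figures this lesson uses (990 TFLOP/s, 3.35 TB/s) and a weights-only approximation: with 2N FLOPs per token over 2N bytes of fp16 weights, a forward pass over T tokens has an intensity of roughly T FLOP/byte. The function names are illustrative, and the approximation ignores KV-cache and activation traffic.

```python
# Sketch: which roof binds a phase, under a weights-only intensity approximation.
PEAK_FLOPS = 990e12   # H100 peak FLOP/s (figure used in this lesson)
BANDWIDTH = 3.35e12   # H100 HBM bytes/s (figure used in this lesson)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def weights_only_intensity(tokens_per_pass: int) -> float:
    """Assumption: 2N FLOPs per token over 2N bytes of weights gives T FLOP/byte."""
    return float(tokens_per_pass)


def binding_roof(tokens_per_pass: int) -> str:
    """Return the roof that limits a forward pass over this many tokens."""
    ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
    if weights_only_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory bandwidth"


print(f"ridge = {ridge_point(PEAK_FLOPS, BANDWIDTH):.1f} FLOP/byte")
print("prefill, 2,000-token prompt:", binding_roof(2000))
print("decode, batch of 1:", binding_roof(1))
```

Prefill over thousands of tokens lands far above the ridge, and single-stream decode lands far below it, which is the split the table records.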

2 · TTFT — the compute-bound estimate

Prefill pushes the whole prompt through one forward pass. From lesson 02, a forward pass costs 2N FLOPs per token, so a prompt of P tokens costs 2 · N · P FLOPs. Divide by the rate the GPU actually achieves — peak FLOP/s scaled by MFU (you never get peak; lesson 02):

TTFT_compute ≈ (2 · N · P) / (peak_FLOP/s · MFU)
Worked: Llama-3-8B, 2,000-token prompt, one H100
Work: 2 · 8e9 · 2000 = 3.2e13 FLOPs. Rate at 50% MFU: 990e12 · 0.5 = 4.95e14 FLOP/s.
TTFT ≈ 3.2e13 / 4.95e14 ≈ 0.065 s = 65 ms. A snappy "it heard me." Double the prompt to 4K → ~130 ms; the relationship is linear in prompt length, which is why TTFT is a prefill problem and why prefix caching (lesson 06) — which lets you skip prompt tokens you've already processed — is the highest-leverage TTFT optimization.
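The worked numbers above can be reproduced with a few lines. This is a sketch of the compute-bound estimate only; the function name and argument names are illustrative, and the 50% MFU is the assumption the worked example already makes.

```python
def ttft_seconds(n_params: float, prompt_tokens: int,
                 peak_flops: float, mfu: float) -> float:
    """Compute-bound prefill time: (2 * N * P) / (peak FLOP/s * MFU)."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    work_flops = 2 * n_params * prompt_tokens   # 2N FLOPs per prompt token
    achieved_rate = peak_flops * mfu            # FLOP/s actually delivered
    return work_flops / achieved_rate


# Llama-3-8B on one H100 at 50% MFU, for a 2K and a 4K prompt.
for prompt in (2000, 4000):
    ms = ttft_seconds(8e9, prompt, 990e12, 0.5) * 1e3
    print(f"P = {prompt}: TTFT ~ {ms:.0f} ms")
```

Doubling the prompt doubles the result, since the estimate is linear in P.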

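Two corrections keep this honest, both deferred to later lessons but flagged now so you know the formula's edges: the seq² attention term, which the 2N-per-token count ignores and which grows with prompt length, and queueing delay, which an unloaded estimate leaves out.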

3 · TPOT — the bandwidth-bound estimate

This is the one people get wrong, so we build it slowly. Each decode step produces one token per request but must read every weight out of HBM to do it. The step's latency is the time to move its bytes across memory bandwidth:

TPOT ≈ bytes_read_per_step / (peak_bandwidth · MBU)

What gets read per step, for a running batch of B requests each holding a context of L tokens?

| What's read | Bytes | Scales with |
|---|---|---|
| Model weights (once, shared by the whole batch) | 2N bytes (fp16): 2 bytes/param, not §2's 2N FLOPs | model size only |
| KV cache (every token of every request in the batch) | B · L · kv_per_tok (KV bytes per token) | batch × context |
TPOT ≈ (2N + B · L · kv_per_tok) / (BW · MBU)

MBU is "memory-bandwidth utilization," the bandwidth analogue of MFU — typically 0.6–0.8 on a well-tuned decode kernel. The leading 2N is read once per step regardless of batch size — that is the whole reason batching is nearly free for decode, the engine of lesson 03's throughput story (and lesson 03b's).

Worked: single-stream decode, Llama-3-8B on an H100
At B = 1 the KV term is tiny, so weights dominate: bytes ≈ 2N = 16e9. At BW 3.35 TB/s and MBU 0.7: TPOT ≈ 16e9 / (3.35e12 · 0.7) ≈ 6.8 ms → ~147 tokens/s for one stream. That matches what an 8B model actually does on an H100. Notice the model didn't even need its KV cache to set this number — at small batch, TPOT is just "how long to read the weights once."
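A sketch of the bandwidth-bound estimate, using the fp16 weight size (2 bytes/param) and the 128 KB-per-token KV figure this lesson uses for Llama-3-8B. Function and argument names are illustrative; the 2,000-token context and the batch sizes in the loop are assumptions chosen for the example.

```python
def tpot_seconds(n_params: float, batch: int, context_tokens: int,
                 kv_per_tok: float, bandwidth: float, mbu: float,
                 bytes_per_param: int = 2) -> float:
    """Bandwidth-bound decode step: (2N + B * L * kv_per_tok) / (BW * MBU)."""
    weight_bytes = bytes_per_param * n_params        # read once per step
    kv_bytes = batch * context_tokens * kv_per_tok   # read for every request
    return (weight_bytes + kv_bytes) / (bandwidth * mbu)


KV_PER_TOK = 128 * 1024  # bytes/token for Llama-3-8B (figure used in this lesson)

for batch in (1, 8, 32):
    t = tpot_seconds(8e9, batch, 2000, KV_PER_TOK, 3.35e12, 0.7)
    print(f"B = {batch:>2}: TPOT ~ {t * 1e3:.1f} ms, "
          f"aggregate ~ {batch / t:.0f} tok/s")
```

At these batch sizes TPOT grows only modestly while aggregate tokens per second climbs almost in proportion to B, which is the "batching is nearly free" effect described above.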

4 · The two decode regimes — the flip that matters

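The formula TPOT = (2N + B·L·kv_per_tok)/(BW·MBU) has two regimes, and knowing which you're in is the design insight. At small batch the 2N weight term dominates: decode is weight-bound and TPOT stays nearly flat as B grows. At large batch or long context the B·L·kv_per_tok term dominates: decode is KV-bound and TPOT rises with B.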

The crossover — where the KV traffic equals the weight traffic — is a number you can compute:

B* = 2N / (L · kv_per_tok)
Worked: where 8B flips, at 2K context
kv_per_tok for Llama-3-8B is 128 KB (lesson 02/04). So B* = 16e9 / (2000 · 131072) ≈ 61. Below ~61 concurrent requests, adding load barely moves TPOT (weight-bound — batch for free). Above it, each request measurably slows the stream (KV-bound). That single number tells you how much headroom you have before latency starts fighting back — and it drops as context L grows, which is why long-context serving turns latency-hostile fast.
[Figure: TPOT (ms) against running batch B. Left of B* (where weight traffic equals KV traffic) decode is weight-bound and TPOT is roughly flat (batch is free); right of B* it is KV-bound and TPOT rises with B.]
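To see how quickly the crossover falls as context grows, here is a small sketch of the B* formula. The function name is illustrative, and the context lengths other than 2,000 are assumptions picked to show the trend.

```python
def crossover_batch(n_params: float, context_tokens: int,
                    kv_per_tok: float, bytes_per_param: int = 2) -> float:
    """B* = weight bytes / (L * kv_per_tok): the batch where KV traffic equals weight traffic."""
    return bytes_per_param * n_params / (context_tokens * kv_per_tok)


KV_PER_TOK = 128 * 1024  # bytes/token for Llama-3-8B (figure used in this lesson)

for context in (500, 2000, 8000, 32000):
    b_star = crossover_batch(8e9, context, KV_PER_TOK)
    print(f"L = {context:>6}: B* ~ {b_star:.0f}")
```

B* is inversely proportional to L, so quadrupling the context cuts the free-batching headroom to a quarter.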

5 · End-to-end latency

For a request that emits G output tokens, stitch the two phases together — one prefill, then G−1 more decode steps after the first token:

E2E ≈ TTFT + TPOT · (G − 1)

The mix tells you which phase to optimize. A code-completion request (big prompt, ~10 output tokens) is TTFT-dominated — work the prefill side (lesson 13). A reasoning request (short prompt, 5,000 thinking tokens) is utterly TPOT-dominated: at 7 ms/token that's 35 seconds of decode, and shaving TPOT is the only thing that matters. Same model, opposite optimization targets — and you can tell which before building anything, just by plugging the output length into this line.
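A sketch of the stitched formula, with a helper that reports how much of the wait is decode. The two workloads are illustrative inputs, not measurements: the 260 ms and 10 ms TTFT values are assumptions, and the 7 ms TPOT is the round figure used in the paragraph above.

```python
def e2e_seconds(ttft: float, tpot: float, output_tokens: int) -> float:
    """E2E = TTFT + TPOT * (G - 1): one prefill, then G - 1 further decode steps."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft + tpot * (output_tokens - 1)


def decode_share(ttft: float, tpot: float, output_tokens: int) -> float:
    """Fraction of end-to-end time spent in decode."""
    decode = tpot * (output_tokens - 1)
    return decode / (ttft + decode)


# Hypothetical workloads (assumed values, for illustration only).
cases = {
    "code completion": dict(ttft=0.260, tpot=0.007, output_tokens=10),
    "reasoning": dict(ttft=0.010, tpot=0.007, output_tokens=5000),
}

for name, case in cases.items():
    total = e2e_seconds(**case)
    share = decode_share(**case)
    print(f"{name}: E2E ~ {total:.2f} s, decode share ~ {share:.0%}")
```

Under these assumed inputs the completion request spends most of its time in prefill and the reasoning request spends nearly all of it in decode.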

The estimate is a floor, not a promise
These formulas give the best case on an unloaded replica: no queueing, no prefill–decode interference, and MFU and MBU holding at their assumed values. Real p99 sits above them because of the queue (§2) and because a long prefill freezes everyone's decode (lesson 04's contention). Use the estimate to (a) sanity-check whether an SLO is even physically achievable, and (b) locate the binding phase. For the tail, you measure. A design that fails the napkin estimate will never pass in production; that's the estimate's real job.
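Use (a) can be written as a one-way check: if the floor already exceeds the target, the design is out. This is a sketch with a hypothetical helper name and hypothetical SLO targets; passing it says nothing about p99, which still has to be measured.

```python
def phases_failing_slo(ttft_floor: float, tpot_floor: float,
                       ttft_slo: float, tpot_slo: float) -> list[str]:
    """Return the phases whose best-case estimate already misses its SLO target."""
    failing = []
    if ttft_floor > ttft_slo:
        failing.append("TTFT")
    if tpot_floor > tpot_slo:
        failing.append("TPOT")
    return failing


# Floors from the worked examples (65 ms TTFT, 6.8 ms TPOT) against assumed targets.
print(phases_failing_slo(0.065, 0.0068, ttft_slo=0.200, tpot_slo=0.005))
print(phases_failing_slo(0.065, 0.0068, ttft_slo=0.200, tpot_slo=0.020))
```

The first call flags TPOT: a 5 ms per-token target is below the time needed to read the weights once, so no amount of tuning on this model and GPU reaches it. The second call returns an empty list, which means only that the target is not physically ruled out.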

6 · Batch / offline latency — makespan, not per-token

Everything above assumed a human waiting on a stream. Flip to an offline batch job — embed a 10M-document corpus, translate a dataset overnight, score every row in a warehouse table — and the latency metric changes entirely. No one watches any single response, so TTFT and TPOT stop mattering on their own. What matters is makespan: the wall-clock to drain the whole job.

Makespan is just throughput wearing a clock. Run the job as back-to-back waves of B requests:

makespan ≈ ⌈N_req / B⌉ · (TTFT + TPOT·(G−1)) ≈ total_tokens / aggregate_throughput

The two forms agree whenever decode dominates the wave (the per-wave TTFT is a rounding error) — and that agreement is the lesson: for batch work, minimizing latency means maximizing throughput (lesson 03b), the exact opposite of the online knee. You'd gladly make any single request slower if a fuller batch drained the queue sooner.

Worked: generate 100k responses offline
100,000 requests × 500 output tokens = 5·10⁷ tokens. With no SLO, push the batch to the lesson-04 memory limit (~256 on our 8B replica) and ride near the decode ceiling — ~7,200 tok/s, comfortably above the online knee's ~6,300 (03b §3). makespan ≈ 5e7 / 7,200 ≈ 6,900 s ≈ 1.9 h. Cross-check the wave form: ⌈100000/256⌉ = 391 waves × (TTFT + 499·TPOT) at ~35 ms TPOT ≈ 391 · 17.7 s ≈ 6,900 s — the same number (the per-wave TTFT is a rounding error at this output length). Want it faster? Raise the throughput — bigger batch, fp8 KV, more GPUs — never "lower the latency."
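Both forms of the makespan estimate can be checked against each other in code. This is a sketch with illustrative function names; the 0.2 s per-wave TTFT is an assumption (the text only says it is small), while the batch of 256, the 35 ms TPOT, and the 7,200 tok/s come from the worked example.

```python
import math


def makespan_waves(n_requests: int, batch: int, ttft: float,
                   tpot: float, output_tokens: int) -> float:
    """Wave form: ceil(N_req / B) waves, each lasting TTFT + TPOT * (G - 1)."""
    waves = math.ceil(n_requests / batch)
    return waves * (ttft + tpot * (output_tokens - 1))


def makespan_throughput(total_tokens: float, tokens_per_second: float) -> float:
    """Throughput form: total output tokens / aggregate tokens per second."""
    return total_tokens / tokens_per_second


wave_form = makespan_waves(100_000, 256, ttft=0.2, tpot=0.035, output_tokens=500)
rate_form = makespan_throughput(100_000 * 500, 7_200)

print(f"wave form:       {wave_form:,.0f} s ({wave_form / 3600:.2f} h)")
print(f"throughput form: {rate_form:,.0f} s ({rate_form / 3600:.2f} h)")
```

The two results land within about one percent of each other, as expected when decode dominates each wave.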
The straggler tax — and why offline jobs sort
Under static batching a wave can't return until its longest member finishes (lesson 04). Drop one 2,000-token output into a wave of 50-token ones and all the slots pay for 2,000 — makespan balloons. Offline, the fix is free because every request is in hand up front: sort or bucket by expected length so each wave is uniform. (Continuous batching removes most of the tax automatically; bucketing mops up the rest.) Pure embedding jobs dodge decode entirely — one forward pass per input, no token loop — so their makespan is set by prefill throughput, computed in 03b (at corpus scale the bottleneck then moves off the GPU to index build — the subject of lesson 18).
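The straggler tax is easy to see in a toy model. The sketch below counts decode steps under static batching, where each wave runs as long as its longest member; it assumes a constant per-step time and ignores prefill. The function name and the 16-request job are illustrative.

```python
def static_batch_steps(output_lengths: list[int], batch: int) -> int:
    """Decode steps to drain a job when every wave waits for its longest output."""
    return sum(
        max(output_lengths[i:i + batch])
        for i in range(0, len(output_lengths), batch)
    )


# Hypothetical job: 16 requests, one 2,000-token output per group of four.
lengths = [2000, 50, 50, 50] * 4

arrival_order = static_batch_steps(lengths, batch=4)
bucketed = static_batch_steps(sorted(lengths), batch=4)

print(f"arrival order: {arrival_order} steps")
print(f"sorted:        {bucketed} steps")
```

In arrival order every wave contains a straggler and costs 2,000 steps, for 8,000 in total. After sorting, the four long outputs share one wave and the job drains in 2,150 steps.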

Interactive · latency calculator

Pick a model, GPU, prompt, batch, and output length. The widget runs §2–§5 and shows TTFT, TPOT, and E2E — plus which decode regime you're in and how each step's bytes split between weights and KV. Find the batch where the bar flips from blue (weight-bound) to amber (KV-bound): that's B* (reported live in the note), and it's where your latency stops being free. Drag the prompt short and B* shoots past the slider — that is the lesson: short contexts stay weight-bound, so batching never costs you latency. The job-size knob turns the same per-request numbers into an offline makespan (§6): raise the batch and watch a 100k-request job's wall-clock collapse — the batch latency you cut by maximizing throughput.

TTFT / TPOT / E2E estimator

Assumptions, stated honestly (lesson 02 precision, ±30%): weights fp16 = 2N; KV/token interpolated between lesson 02's anchors (128 KB at 8B → 320 KB at 70B); context L = prompt + half the output (a mid-decode average — so the B* shown below differs slightly from §4's L = 2000 box); seq² prefill term ignored; no queueing. It's a floor.
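A sketch of two of these assumptions, to show why the widget's B* differs from §4's. The straight-line interpolation in parameter count is a guess at how the widget interpolates between the two anchors, not something the text specifies, and the function names and the 500-token output length are illustrative.

```python
def kv_per_tok_bytes(n_params: float) -> float:
    """KV bytes/token, linearly interpolated between the anchors (assumed scheme)."""
    lo_n, lo_kv = 8e9, 128 * 1024     # 128 KB at 8B
    hi_n, hi_kv = 70e9, 320 * 1024    # 320 KB at 70B
    frac = (n_params - lo_n) / (hi_n - lo_n)
    return lo_kv + frac * (hi_kv - lo_kv)


def mid_decode_context(prompt_tokens: int, output_tokens: int) -> float:
    """Average context during decode: the prompt plus half the output."""
    return prompt_tokens + output_tokens / 2


# 8B model, 2,000-token prompt, 500 output tokens (assumed).
context = mid_decode_context(2000, 500)
b_star = 2 * 8e9 / (context * kv_per_tok_bytes(8e9))
print(f"L = {context:.0f} tokens, B* ~ {b_star:.0f}")
```

With the mid-decode context of 2,250 tokens, B* comes out near 54 rather than the 61 computed at L = 2000.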

[Interactive widget. Outputs: TTFT (prefill), TPOT (per token), single-stream tok/s, E2E latency (online), and makespan (offline job). The first bar splits each decode step's bytes read into weights (blue) and KV (amber). The second bar splits E2E into prefill TTFT (green) and decode TPOT·(G−1); drag output G and watch which phase owns the wait.]

What carries forward