Estimating latency — TTFT and TPOT in milliseconds
Lesson 03 named the latency metrics and used them as SLO targets. It never told you how to predict them. This lesson does: given a model, a GPU, and a prompt, you compute TTFT and TPOT in milliseconds — before you write a line of serving code — using nothing but lesson 02's FLOPs, bytes, and the roofline. Two phases, two formulas. That's the whole lesson.
1 · Why two formulas, not one
Lesson 01 split a request into prefill then decode; lesson 02's roofline told you they live on opposite sides of the arithmetic-intensity ridge (~295 FLOP/byte on an H100). That split is the reason a single "latency" number is meaningless for an LLM — the two phases are limited by two different pieces of hardware:
| Phase | Work it does | Roof that binds | Latency it sets |
|---|---|---|---|
| Prefill | one big matmul over all prompt tokens at once → high arithmetic intensity | compute (FLOP/s) | TTFT — time to first token |
| Decode | one tiny matmul per step (one token), re-reading every weight → low arithmetic intensity | memory bandwidth (B/s) | TPOT — time per output token |
So we estimate them with two different rates. Get this table into your bones and the arithmetic writes itself.
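To see why the two phases land on opposite sides of the ridge, here is a minimal sketch that counts only weight traffic (KV cache and activations are ignored). The H100 figures (989 TFLOP/s dense BF16, 3.35 TB/s HBM) are assumptions chosen to reproduce the ~295 FLOP/byte ridge quoted above, and the function names are illustrative.

```python
# Assumed H100 SXM figures; swap in your own hardware numbers.
PEAK_FLOPS = 989e12   # dense BF16 FLOP/s (assumption)
HBM_BW = 3.35e12      # bytes/s (assumption)


def ridge(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the compute and bandwidth roofs meet."""
    return peak_flops / bandwidth


def weight_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per weight byte for one forward pass.

    A pass over T tokens does 2*N*T FLOPs while reading bytes_per_param*N
    weight bytes, so N cancels and the intensity is 2*T / bytes_per_param.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def binding_roof(tokens_per_pass: int) -> str:
    """Name the roof that limits a pass processing this many tokens at once."""
    above = weight_intensity(tokens_per_pass) > ridge(PEAK_FLOPS, HBM_BW)
    return "compute" if above else "bandwidth"


if __name__ == "__main__":
    print(f"ridge ~ {ridge(PEAK_FLOPS, HBM_BW):.0f} FLOP/byte")
    print("prefill, 2000 prompt tokens:", binding_roof(2000))
    print("decode, one token per step: ", binding_roof(1))
```

With fp16 weights the intensity is simply the number of tokens processed per pass, which is why a 2,000-token prefill sits far above the ridge and a single-token decode step sits far below it.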
2 · TTFT — the compute-bound estimate
Prefill pushes the whole prompt through one forward pass. From lesson 02, a forward pass costs 2N FLOPs per token, so a prompt of P tokens costs 2 · N · P FLOPs. Divide by the rate the GPU actually achieves, which is peak FLOP/s scaled by MFU (you never get peak; lesson 02): TTFT ≈ 2 · N · P / (peak · MFU).
Worked example, for a 2K-token prompt: with 2 · N · P ≈ 3.2e13 FLOPs of prefill work and an achieved rate of 4.95e14 FLOP/s, TTFT ≈ 3.2e13 / 4.95e14 ≈ 0.065 s = 65 ms. A snappy "it heard me." Double the prompt to 4K and you get ~130 ms. The relationship is linear in prompt length, which is why TTFT is a prefill problem and why prefix caching (lesson 06), which lets you skip prompt tokens you've already processed, is the highest-leverage TTFT optimization.
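The same arithmetic as a small function. The inputs below are assumptions picked because they reproduce the 3.2e13 and 4.95e14 figures above: an 8B-parameter model, a 2,000-token prompt, an H100 at 989 TFLOP/s dense BF16, and MFU = 0.5. Treat it as a sketch of the estimate, not a measurement.

```python
def ttft_seconds(n_params: float, prompt_tokens: int,
                 peak_flops: float, mfu: float) -> float:
    """Compute-bound TTFT estimate: prefill FLOPs divided by achieved FLOP/s."""
    prefill_flops = 2.0 * n_params * prompt_tokens   # 2N FLOPs per prompt token
    achieved_flops = peak_flops * mfu                # you never get peak
    return prefill_flops / achieved_flops


if __name__ == "__main__":
    N = 8e9          # parameters (assumption)
    PEAK = 989e12    # H100 dense BF16 FLOP/s (assumption)
    MFU = 0.5        # model FLOPs utilization (assumption)
    for prompt in (2000, 4000):
        ms = 1e3 * ttft_seconds(N, prompt, PEAK, MFU)
        print(f"P = {prompt:>5} tokens -> TTFT ~ {ms:.0f} ms")
```

Because the estimate is linear in P, every prompt token you skip with a cache comes straight off TTFT.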
Two corrections keep this honest, both deferred to later lessons but flagged now so you know the formula's edges:
- Queueing delay. The formula is the service time. Real TTFT = queue_wait + TTFT_compute. Under load the queue wait dominates the tail — it's why p99 TTFT >> p50, and why admission control (lesson 04) and chunked prefill (lesson 06) exist.
- The seq² term. Attention adds an O(P²) cost the 2N·P rule ignores. Negligible below a few thousand tokens; at 100K+ it dominates and prefill goes quadratic — the entire subject of lesson 17.
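To get a feel for where the seq² term starts to matter, the sketch below compares attention FLOPs with the 2N·P term. It uses the common approximation of about 4 · P² · d_model FLOPs per layer (QKᵀ plus the attention-weighted sum over V) and ignores any saving from causal masking, which would roughly halve it. The layer count and width are assumed values for a typical 8B-class model, not figures from this lesson.

```python
def linear_flops(n_params: float, prompt_tokens: int) -> float:
    """The 2*N*P term: weight matmuls over the whole prompt."""
    return 2.0 * n_params * prompt_tokens


def attention_flops(prompt_tokens: int, n_layers: int, d_model: int) -> float:
    """Approximate score and value matmul cost: ~4 * P^2 * d_model per layer.

    No causal-mask saving is counted, so this is an upper-end estimate.
    """
    return 4.0 * n_layers * d_model * prompt_tokens ** 2


def attention_ratio(prompt_tokens: int, n_params: float,
                    n_layers: int, d_model: int) -> float:
    """Attention FLOPs as a fraction of the 2*N*P term."""
    return (attention_flops(prompt_tokens, n_layers, d_model)
            / linear_flops(n_params, prompt_tokens))


if __name__ == "__main__":
    N, LAYERS, D_MODEL = 8e9, 32, 4096   # assumed 8B-class shape
    for prompt in (2_000, 16_000, 100_000):
        r = attention_ratio(prompt, N, LAYERS, D_MODEL)
        print(f"P = {prompt:>7}: attention / (2NP) ~ {r:.2f}")
```

Under these assumptions the ratio grows linearly with P: a few percent at 2K tokens, and above 1 at 100K, where the quadratic term outweighs the linear one.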
3 · TPOT — the bandwidth-bound estimate
This is the one people get wrong, so we build it slowly. Each decode step produces one token per request but must read every weight out of HBM to do it. The step's latency is the time to move its bytes across memory bandwidth: TPOT ≈ bytes read per step / achieved memory bandwidth.
What gets read per step, for a running batch of B requests each holding a context of L tokens?
| What's read | Bytes | Scales with |
|---|---|---|
| Model weights (once, shared by the whole batch) | 2N bytes (fp16) — 2 B/param, not §2's 2N FLOPs | model size only |
| KV cache (every token of every request in the batch) | B · L · kv_bytes_per_token | batch × context |
Add the two rows and divide by the achieved bandwidth: TPOT ≈ (2N + B · L · kv_bytes_per_token) / (BW · MBU). MBU is "memory-bandwidth utilization," the bandwidth analogue of MFU, typically 0.6–0.8 on a well-tuned decode kernel. The leading 2N is read once per step regardless of batch size. That is the whole reason batching is nearly free for decode, and it is the engine of lesson 03's throughput story (and lesson 03b's).
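A minimal sketch of the per-step byte count and the resulting TPOT. The KV size assumes grouped-query attention with 32 layers, 8 KV heads, and head dimension 128 in fp16; those shape numbers, the 3.35 TB/s bandwidth, and MBU = 0.7 are all assumptions for illustration.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of K and V cached per token, summed over all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tpot_seconds(n_params: float, batch: int, context: int, kv_per_tok: int,
                 bandwidth: float, mbu: float,
                 bytes_per_param: float = 2.0) -> float:
    """Bandwidth-bound TPOT estimate: bytes read per step / achieved bandwidth."""
    weight_bytes = bytes_per_param * n_params   # read once, shared by the batch
    kv_bytes = batch * context * kv_per_tok     # every token of every request
    return (weight_bytes + kv_bytes) / (bandwidth * mbu)


if __name__ == "__main__":
    KV = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    N, BW, MBU, CONTEXT = 8e9, 3.35e12, 0.7, 2000   # all assumptions
    print(f"kv_bytes_per_token = {KV} bytes")
    for batch in (1, 8, 32, 128):
        ms = 1e3 * tpot_seconds(N, batch, CONTEXT, KV, BW, MBU)
        print(f"B = {batch:>3}: TPOT ~ {ms:.1f} ms")
```

At B = 1 the weight term is nearly all of the traffic, which is where the roughly 7 ms/token figure for this configuration comes from.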
4 · The two decode regimes — the flip that matters
The formula TPOT = (2N + B·L·kv_bytes_per_token)/(BW·MBU) has two regimes, and knowing which you're in is the design insight:
- Weight-bound (small batch): the 2N term dominates. TPOT is roughly constant as you add requests — you're re-reading the same weights and serving more tokens with them. This is the free-throughput zone.
- KV-bound (large batch / long context): the B·L·kv term dominates. Now every added request adds real bytes, so TPOT climbs linearly with batch. Throughput stops being free; latency degrades for everyone.
The crossover batch B*, where the KV traffic equals the weight traffic, is a number you can compute. Set B · L · kv_bytes_per_token = 2N and solve: B* = 2N / (L · kv_bytes_per_token).
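The crossover can be sketched by setting KV bytes equal to weight bytes and solving for the batch size. The 8B model and the 131,072 bytes/token KV footprint (32 layers, 8 KV heads, head dimension 128, fp16) are assumptions carried for illustration; plug in your own model's numbers.

```python
def crossover_batch(n_params: float, context: int, kv_per_tok: int,
                    bytes_per_param: float = 2.0) -> float:
    """Batch size B* at which KV traffic equals weight traffic."""
    return bytes_per_param * n_params / (context * kv_per_tok)


def regime(batch: int, n_params: float, context: int, kv_per_tok: int) -> str:
    """Label the decode regime for a given running batch."""
    b_star = crossover_batch(n_params, context, kv_per_tok)
    return "weight-bound" if batch < b_star else "KV-bound"


if __name__ == "__main__":
    N, KV = 8e9, 131072   # assumed model size and KV bytes per token
    for context in (500, 2_000, 8_000, 32_000):
        b_star = crossover_batch(N, context, KV)
        print(f"L = {context:>6}: B* ~ {b_star:6.1f}, "
              f"batch of 16 is {regime(16, N, context, KV)}")
```

B* scales as 1/L, so a fourfold longer context cuts the free-batching zone to a quarter.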
5 · End-to-end latency
For a request that emits G output tokens, stitch the two phases together, one prefill and then G−1 more decode steps after the first token: E2E ≈ TTFT + TPOT · (G − 1).
The mix tells you which phase to optimize. A code-completion request (big prompt, ~10 output tokens) is TTFT-dominated — work the prefill side (lesson 13). A reasoning request (short prompt, 5,000 thinking tokens) is utterly TPOT-dominated: at 7 ms/token that's 35 seconds of decode, and shaving TPOT is the only thing that matters. Same model, opposite optimization targets — and you can tell which before building anything, just by plugging the output length into this line.
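A short sketch of that plug-in step. The TTFT and TPOT inputs are illustrative placeholders (260 ms for a large prompt, 10 ms for a short one, 7 ms/token throughout), not measurements; the point is only how the output length shifts the split.

```python
def e2e_seconds(ttft: float, tpot: float, output_tokens: int) -> float:
    """End-to-end latency: one prefill plus G-1 further decode steps."""
    return ttft + tpot * (output_tokens - 1)


def decode_share(ttft: float, tpot: float, output_tokens: int) -> float:
    """Fraction of end-to-end latency spent in decode."""
    total = e2e_seconds(ttft, tpot, output_tokens)
    return tpot * (output_tokens - 1) / total


if __name__ == "__main__":
    # (label, TTFT s, TPOT s, output tokens): hypothetical request profiles
    profiles = [
        ("code completion", 0.260, 0.007, 10),
        ("reasoning", 0.010, 0.007, 5000),
    ]
    for label, ttft, tpot, g in profiles:
        total = e2e_seconds(ttft, tpot, g)
        share = 100 * decode_share(ttft, tpot, g)
        print(f"{label:>16}: E2E ~ {total:6.2f} s, decode share ~ {share:.0f}%")
```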
6 · Batch / offline latency — makespan, not per-token
Everything above assumed a human waiting on a stream. Flip to an offline batch job — embed a 10M-document corpus, translate a dataset overnight, score every row in a warehouse table — and the latency metric changes entirely. No one watches any single response, so TTFT and TPOT stop mattering on their own. What matters is makespan: the wall-clock to drain the whole job.
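Makespan is just throughput wearing a clock. Run the job as back-to-back waves of B requests: makespan = ⌈N_req / B⌉ · E2E ≈ total_tokens / throughput, where N_req is the number of requests in the job (not the parameter count N) and E2E is the time to finish one wave.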
The two forms agree whenever decode dominates the wave (the per-wave TTFT is a rounding error) — and that agreement is the lesson: for batch work, minimizing latency means maximizing throughput (lesson 03b), the exact opposite of the online knee. You'd gladly make any single request slower if a fuller batch drained the queue sooner.
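Both forms of the makespan estimate, side by side. This sketch assumes every request in a wave emits the same number of output tokens and that decode throughput is B / TPOT tokens per second; the job size, batch, and latencies below are made-up inputs.

```python
import math


def makespan_waves(n_requests: int, batch: int,
                   ttft: float, tpot: float, output_tokens: int) -> float:
    """Wave form: number of waves times the end-to-end time of one wave."""
    e2e = ttft + tpot * (output_tokens - 1)
    return math.ceil(n_requests / batch) * e2e


def makespan_throughput(n_requests: int, batch: int,
                        tpot: float, output_tokens: int) -> float:
    """Throughput form: total output tokens / (batch tokens per decode step)."""
    total_tokens = n_requests * output_tokens
    tokens_per_second = batch / tpot
    return total_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical job: 100k requests, 500 output tokens each.
    N_REQ, G, TTFT, TPOT = 100_000, 500, 0.5, 0.010
    for batch in (8, 64):
        waves = makespan_waves(N_REQ, batch, TTFT, TPOT, G)
        thru = makespan_throughput(N_REQ, batch, TPOT, G)
        print(f"B = {batch:>3}: waves ~ {waves / 3600:.1f} h, "
              f"throughput form ~ {thru / 3600:.1f} h")
```

TPOT is held fixed across batch sizes here for simplicity; in practice it rises once the batch crosses B*, so the gain from a larger batch flattens.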
Interactive · latency calculator
Pick a model, GPU, prompt, batch, and output length. The widget runs §2–§5 and shows TTFT, TPOT, and E2E — plus which decode regime you're in and how each step's bytes split between weights and KV. Find the batch where the bar flips from blue (weight-bound) to amber (KV-bound): that's B* (reported live in the note), and it's where your latency stops being free. Drag the prompt short and B* shoots past the slider — that is the lesson: short contexts stay weight-bound, so batching never costs you latency. The job-size knob turns the same per-request numbers into an offline makespan (§6): raise the batch and watch a 100k-request job's wall-clock collapse — the batch latency you cut by maximizing throughput.
What carries forward
- Latency = work ÷ rate, once per phase. TTFT = 2NP / (peak·MFU) (compute-bound); TPOT = (2N + B·L·kv) / (BW·MBU) (bandwidth-bound).
- At small batch, TPOT ≈ 2N/BW, just the cost of reading the weights once. This is why decode batching is nearly free until B* = 2N/(L·kv_bytes_per_token).
- Cross B* and TPOT climbs linearly — latency stops being free, and longer contexts push B* down.
- E2E = TTFT + TPOT·(G−1) tells you which phase to optimize from the output length alone.
- The estimate is a floor — add queueing and contention for the real tail. But an SLO that fails the floor is physically impossible; that check alone is worth the arithmetic.
- Offline/batch latency is makespan = ⌈N_req/B⌉·E2E ≈ total_tokens/throughput, where N_req is the number of requests in the job. Batch jobs are therefore optimized by maximizing throughput (03b) and by sorting requests by length to kill the straggler tax.
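The floor check from the list above can be sketched in a few lines. With MFU and MBU left at 1.0 it gives the hard hardware limit for a single GPU in the small-batch case; the 8B model, fp16 weights, and H100 figures are assumptions, and KV traffic is left out, so real floors are higher.

```python
def latency_floors(n_params: float, prompt_tokens: int,
                   peak_flops: float, bandwidth: float,
                   mfu: float = 1.0, mbu: float = 1.0,
                   bytes_per_param: float = 2.0) -> tuple[float, float]:
    """Return (TTFT floor, TPOT floor) in seconds; KV traffic is ignored."""
    ttft = 2.0 * n_params * prompt_tokens / (peak_flops * mfu)
    tpot = bytes_per_param * n_params / (bandwidth * mbu)
    return ttft, tpot


def slo_is_possible(ttft_slo: float, tpot_slo: float,
                    floors: tuple[float, float]) -> bool:
    """An SLO below either floor cannot be met on this hardware."""
    ttft_floor, tpot_floor = floors
    return ttft_floor <= ttft_slo and tpot_floor <= tpot_slo


if __name__ == "__main__":
    floors = latency_floors(8e9, 2000, peak_flops=989e12, bandwidth=3.35e12)
    print(f"TTFT floor ~ {1e3 * floors[0]:.1f} ms, "
          f"TPOT floor ~ {1e3 * floors[1]:.2f} ms")
    for tpot_slo in (0.003, 0.010):
        ok = slo_is_possible(0.200, tpot_slo, floors)
        print(f"TPOT SLO {1e3 * tpot_slo:.0f} ms: "
              f"{'possible' if ok else 'impossible'}")
```

Passing the check does not mean the SLO will hold under load; failing it means no amount of tuning on this setup will rescue it.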