Estimating latency — TTFT and TPOT in milliseconds

Lesson 03 named the latency metrics and used them as SLO targets. It never told you how to predict them. This lesson does: given a model, a GPU, and a prompt, you compute TTFT and TPOT in milliseconds — before you write a line of serving code — using nothing but lesson 02's FLOPs, bytes, and the roofline. Two phases, two formulas. That's the whole lesson.

The one idea

Latency is just work ÷ rate. Prefill is compute-work limited by the FLOP/s roof, so TTFT = compute_work / compute_rate. Decode is memory-traffic limited by the bandwidth roof, so TPOT = bytes_moved / bandwidth. The roofline from lesson 02 told you which roof binds each phase; here we divide by that roof to get time. Everything below is filling in the two numerators.

1 · Why two formulas, not one

Lesson 01 split a request into prefill then decode; lesson 02's roofline told you they live on opposite sides of the arithmetic-intensity ridge (~295 FLOP/byte on an H100). That split is the reason a single "latency" number is meaningless for an LLM — the two phases are limited by two different pieces of hardware:

Phase	Work it does	Roof that binds	Latency it sets
Prefill	one big matmul over all prompt tokens at once → high arithmetic intensity	compute (FLOP/s)	TTFT — time to first token
Decode	one tiny matmul per step (one token), re-reading every weight → low arithmetic intensity	memory bandwidth (B/s)	TPOT — time per output token

So we estimate them with two different rates. Get this table into your bones and the arithmetic writes itself.
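To see which side of the ridge each phase lands on, compare its arithmetic intensity with peak FLOP/s divided by bandwidth. The sketch below is a simplification under stated assumptions: fp16 weights, weight reads as the only memory traffic, and the H100 figures used in this lesson's worked examples. The function names are illustrative.

```python
# Rough roofline check: which roof binds prefill vs. decode?
# Assumptions (illustrative, not measured): fp16 weights (2 bytes/param),
# weight reads dominate memory traffic, and H100 peaks of 990 TFLOP/s and
# 3.35 TB/s as used in this lesson's worked examples.

PEAK_FLOPS = 990e12   # FLOP/s
PEAK_BW = 3.35e12     # bytes/s


def ridge(peak_flops: float = PEAK_FLOPS, peak_bw: float = PEAK_BW) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bw


def intensity(n_params: float, tokens_per_pass: int) -> float:
    """FLOPs per byte for one forward pass over tokens_per_pass tokens."""
    flops = 2 * n_params * tokens_per_pass   # 2N FLOPs per token
    bytes_read = 2 * n_params                # weights read once, fp16
    return flops / bytes_read


def binding_roof(n_params: float, tokens_per_pass: int) -> str:
    """Name the roof that limits a pass with this many tokens."""
    if intensity(n_params, tokens_per_pass) > ridge():
        return "compute"
    return "bandwidth"


if __name__ == "__main__":
    print(f"ridge: {ridge():.1f} FLOP/byte")
    print("prefill over 2000 prompt tokens:", binding_roof(8e9, 2000))
    print("decode, one token per step:     ", binding_roof(8e9, 1))
```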

2 · TTFT — the compute-bound estimate

Prefill pushes the whole prompt through one forward pass. From lesson 02, a forward pass costs 2N FLOPs per token, so a prompt of P tokens costs 2 · N · P FLOPs. Divide by the rate the GPU actually achieves — peak FLOP/s scaled by MFU (you never get peak; lesson 02):

TTFT_compute ≈ (2 · N · P) / (peak_FLOPs · MFU)

Worked: Llama-3-8B, 2,000-token prompt, one H100

Work: 2 · 8e9 · 2000 = 3.2e13 FLOPs. Rate at 50% MFU: 990e12 · 0.5 = 4.95e14 FLOP/s.
TTFT ≈ 3.2e13 / 4.95e14 ≈ 0.065 s = 65 ms. A snappy "it heard me." Double the prompt to 4K → ~130 ms; the relationship is linear in prompt length, which is why TTFT is a prefill problem and why prefix caching (lesson 06) — which lets you skip prompt tokens you've already processed — is the highest-leverage TTFT optimization.
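The same arithmetic as a small function, so you can sweep prompt length or MFU. This is a minimal sketch of the formula above; the function name and default values are illustrative, and it returns service time only (no queueing, no attention seq² term).

```python
# Compute-bound TTFT estimate: TTFT = 2*N*P / (peak FLOP/s * MFU).
# Service time only: no queueing delay and no O(P^2) attention term.

def ttft_seconds(n_params: float, prompt_tokens: int,
                 peak_flops: float = 990e12, mfu: float = 0.5) -> float:
    """Best-case prefill time in seconds for one prompt."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    work = 2 * n_params * prompt_tokens   # FLOPs for one prefill pass
    rate = peak_flops * mfu               # FLOP/s actually achieved
    return work / rate


if __name__ == "__main__":
    for p in (2000, 4000):
        print(f"P={p}: TTFT ~ {ttft_seconds(8e9, p) * 1e3:.0f} ms")
```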

Two corrections keep this honest, both deferred to later lessons but flagged now so you know the formula's edges:

Queueing delay. The formula is the service time. Real TTFT = queue_wait + TTFT_compute. Under load the queue wait dominates the tail — it's why p99 TTFT >> p50, and why admission control (lesson 04) and chunked prefill (lesson 06) exist.
The seq² term. Attention adds an O(P²) cost the 2N·P rule ignores. Negligible below a few thousand tokens; at 100K+ it dominates and prefill goes quadratic — the entire subject of lesson 17.
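To get a feel for when the seq² term starts to matter, the sketch below compares it with the 2N·P term. The attention cost model (about 4 · n_layers · d_model · P² FLOPs, halved for a causal kernel that skips the masked half) and the Llama-3-8B-like shape (32 layers, hidden size 4096) are assumptions brought in for illustration; they are not derived in this lesson.

```python
# How big is the O(P^2) attention term relative to the 2*N*P rule?
# Assumption (not from this lesson): the attention score and value matmuls
# cost about 4 * n_layers * d_model * P^2 FLOPs, halved when the kernel
# skips the masked half of a causal attention matrix.

def linear_flops(n_params: float, p: int) -> float:
    """The 2*N*P term used by the TTFT formula."""
    return 2 * n_params * p


def attention_flops(n_layers: int, d_model: int, p: int,
                    causal: bool = True) -> float:
    """Approximate FLOPs of the P^2 attention term."""
    full = 4 * n_layers * d_model * p * p
    return full / 2 if causal else full


def attention_share(n_params: float, n_layers: int, d_model: int, p: int,
                    causal: bool = True) -> float:
    """Fraction of total prefill FLOPs spent in the P^2 term."""
    attn = attention_flops(n_layers, d_model, p, causal)
    return attn / (attn + linear_flops(n_params, p))


if __name__ == "__main__":
    # Llama-3-8B-like shape, assumed here: 32 layers, hidden size 4096.
    for p in (2_000, 8_000, 32_000, 128_000):
        share = attention_share(8e9, 32, 4096, p)
        print(f"P={p:>7}: attention is {share:.0%} of prefill FLOPs")
```

Under these assumptions the share stays in the low single digits at 2K tokens and passes one half somewhere past 60K, which is consistent with the rule of thumb above.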

3 · TPOT — the bandwidth-bound estimate

This is the one people get wrong, so we build it slowly. Each decode step produces one token per request but must read every weight out of HBM to do it. The step's latency is the time to move its bytes across memory bandwidth:

TPOT ≈ bytes_read_per_step / (peak_bandwidth · MBU)

What gets read per step, for a running batch of B requests each holding a context of L tokens?

What's read	Bytes	Scales with
Model weights (once, shared by the whole batch)	2N bytes (fp16: 2 bytes/param; not to be confused with §2's 2N FLOPs)	model size only
KV cache (every token of every request in the batch)	B · L · kv_bytes_per_token	batch × context

TPOT ≈ (2N + B · L · kv_per_tok) / (BW · MBU)

MBU is "memory-bandwidth utilization," the bandwidth analogue of MFU — typically 0.6–0.8 on a well-tuned decode kernel. The leading 2N is read once per step regardless of batch size — that is the whole reason batching is nearly free for decode, the engine of lesson 03's throughput story (and lesson 03b's).

Worked: single-stream decode, Llama-3-8B on an H100

At B = 1 the KV term is tiny, so weights dominate: bytes ≈ 2N = 16e9. At BW 3.35 TB/s and MBU 0.7: TPOT ≈ 16e9 / (3.35e12 · 0.7) ≈ 6.8 ms → ~147 tokens/s for one stream. That matches what an 8B model actually does on an H100. Notice the model didn't even need its KV cache to set this number — at small batch, TPOT is just "how long to read the weights once."
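A minimal sketch of the TPOT formula, useful for checking how little the KV term contributes at B = 1. The function name and defaults are illustrative; the 128 KB per-token KV figure is the Llama-3-8B value this lesson uses in §4.

```python
# Bandwidth-bound TPOT estimate:
#   TPOT = (2N + B * L * kv_per_tok) / (BW * MBU)

def tpot_seconds(n_params: float, batch: int, context_tokens: int,
                 kv_bytes_per_token: float,
                 peak_bw: float = 3.35e12, mbu: float = 0.7,
                 bytes_per_param: float = 2.0) -> float:
    """Best-case time for one decode step, in seconds."""
    weight_bytes = bytes_per_param * n_params              # read once per step
    kv_bytes = batch * context_tokens * kv_bytes_per_token # read for every request
    return (weight_bytes + kv_bytes) / (peak_bw * mbu)


if __name__ == "__main__":
    kv_8b = 128 * 1024  # bytes per token
    weights_only = tpot_seconds(8e9, 1, 0, kv_8b)
    with_kv = tpot_seconds(8e9, 1, 2000, kv_8b)
    print(f"weights only:        {weights_only * 1e3:.2f} ms")
    print(f"with 2K-token KV:    {with_kv * 1e3:.2f} ms")
    print(f"single-stream tok/s: {1 / weights_only:.0f}")
```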

4 · The two decode regimes — the flip that matters

The TPOT = (2N + B·L·kv_per_tok)/(BW·MBU) formula has two regimes, and knowing which you're in is the design insight:

Weight-bound (small batch): the 2N term dominates. TPOT is roughly constant as you add requests — you're re-reading the same weights and serving more tokens with them. This is the free-throughput zone.
KV-bound (large batch / long context): the B·L·kv term dominates. Now every added request adds real bytes, so TPOT climbs linearly with batch. Throughput stops being free; latency degrades for everyone.
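A short sweep over batch size makes the two regimes visible. This is a sketch with the lesson's Llama-3-8B and H100 numbers hard-coded as constants (an assumption for brevity); the helper names are illustrative.

```python
# Sweep batch size at a fixed 2K context and report TPOT plus the regime.
# Constants are the lesson's worked-example values, hard-coded for brevity.

KV_PER_TOK = 128 * 1024         # bytes/token, Llama-3-8B figure from this lesson
WEIGHT_BYTES = 2 * 8e9          # fp16 weights for an 8B model
EFFECTIVE_BW = 3.35e12 * 0.7    # H100 bandwidth at MBU 0.7, bytes/s


def step_bytes(batch: int, context: int) -> tuple:
    """Bytes read in one decode step, split into (weights, kv)."""
    return WEIGHT_BYTES, batch * context * KV_PER_TOK


def regime(batch: int, context: int) -> str:
    """Which term of the TPOT numerator is larger."""
    weights, kv = step_bytes(batch, context)
    return "weight-bound" if weights >= kv else "KV-bound"


def tpot_ms(batch: int, context: int) -> float:
    """Best-case decode step time in milliseconds."""
    weights, kv = step_bytes(batch, context)
    return (weights + kv) / EFFECTIVE_BW * 1e3


if __name__ == "__main__":
    for b in (1, 8, 32, 64, 128, 256):
        print(f"B={b:>3}: TPOT ~ {tpot_ms(b, 2000):5.1f} ms  ({regime(b, 2000)})")
```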

The crossover — where the KV traffic equals the weight traffic — is a number you can compute:

B* = 2N / (L · kv_per_tok)

Worked: where 8B flips, at 2K context

kv_per_tok for Llama-3-8B is 128 KB (131,072 bytes; lesson 02/04). So B* = 16e9 / (2000 · 131072) ≈ 61. Below ~61 concurrent requests, adding load barely moves TPOT (weight-bound: batch for free). Above it, each request measurably slows the stream (KV-bound). That single number tells you how much headroom you have before latency starts fighting back, and it drops as context L grows, which is why long-context serving turns latency-hostile fast.
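The crossover is a one-line function, and evaluating it at a few context lengths shows how quickly the headroom shrinks. A minimal sketch; the function name is illustrative and the inputs are the Llama-3-8B values from the worked example.

```python
# Crossover batch: B* = 2N / (L * kv_per_tok), the batch size at which KV
# traffic equals weight traffic in a decode step.

def crossover_batch(n_params: float, context_tokens: int,
                    kv_bytes_per_token: float,
                    bytes_per_param: float = 2.0) -> float:
    """Batch size where KV bytes per step equal weight bytes per step."""
    return bytes_per_param * n_params / (context_tokens * kv_bytes_per_token)


if __name__ == "__main__":
    kv_8b = 128 * 1024  # bytes per token
    for context in (500, 2_000, 8_000, 32_000):
        print(f"L={context:>6}: B* ~ {crossover_batch(8e9, context, kv_8b):.1f}")
```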

5 · End-to-end latency

For a request that emits G output tokens, stitch the two phases together — one prefill, then G−1 more decode steps after the first token:

E2E ≈ TTFT + TPOT · (G − 1)

The mix tells you which phase to optimize. A code-completion request (big prompt, ~10 output tokens) is TTFT-dominated — work the prefill side (lesson 13). A reasoning request (short prompt, 5,000 thinking tokens) is utterly TPOT-dominated: at 7 ms/token that's 35 seconds of decode, and shaving TPOT is the only thing that matters. Same model, opposite optimization targets — and you can tell which before building anything, just by plugging the output length into this line.
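The two request shapes above can be compared directly. In this sketch the TTFT values are hypothetical inputs (0.26 s roughly corresponds to an 8K-token prompt at §2's rate; 0.01 s stands in for a short prompt), and the function names are illustrative.

```python
# End-to-end latency: E2E = TTFT + TPOT * (G - 1), plus the prefill share.

def e2e_seconds(ttft: float, tpot: float, output_tokens: int) -> float:
    """Total time from request arrival to the last output token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be >= 1")
    return ttft + tpot * (output_tokens - 1)


def prefill_share(ttft: float, tpot: float, output_tokens: int) -> float:
    """Fraction of E2E latency spent waiting for the first token."""
    return ttft / e2e_seconds(ttft, tpot, output_tokens)


if __name__ == "__main__":
    # Hypothetical inputs: (label, TTFT in s, TPOT in s, output tokens).
    cases = [
        ("code completion", 0.26, 0.007, 10),
        ("reasoning", 0.01, 0.007, 5000),
    ]
    for label, ttft, tpot, g in cases:
        total = e2e_seconds(ttft, tpot, g)
        share = prefill_share(ttft, tpot, g)
        print(f"{label:>16}: E2E ~ {total:6.2f} s, prefill share {share:.0%}")
```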

The estimate is a floor, not a promise

These formulas give the best case on an unloaded replica: no queueing, no prefill–decode interference, and kernels that actually reach the MFU and MBU you assumed. Real p99 sits above them because of the queue (§2) and because a long prefill freezes everyone's decode (lesson 04's contention). Use the estimate to (a) sanity-check whether an SLO is even physically achievable, and (b) locate the binding phase. For the tail, you measure. A design that fails the napkin estimate will never pass in production; catching that is the estimate's real job.
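The feasibility check in (a) amounts to comparing each floor with its SLO. A minimal sketch, with hypothetical SLO targets and an illustrative function name; passing it says nothing about the tail, only that the target is not ruled out by the floor.

```python
# Floor check: an SLO tighter than the best-case estimate cannot be met.
# The SLO values below are hypothetical targets chosen for illustration.

def failed_slos(floor_ms: dict, slo_ms: dict) -> list:
    """Return the metrics whose best-case floor already exceeds the SLO."""
    return [name for name, limit in slo_ms.items() if floor_ms[name] > limit]


if __name__ == "__main__":
    # Floors from the worked examples: 65 ms TTFT, 6.8 ms TPOT.
    floor = {"ttft": 65.0, "tpot": 6.8}
    print(failed_slos(floor, {"ttft": 200.0, "tpot": 5.0}))
    print(failed_slos(floor, {"ttft": 200.0, "tpot": 20.0}))
```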

6 · Batch / offline latency — makespan, not per-token

Everything above assumed a human waiting on a stream. Flip to an offline batch job — embed a 10M-document corpus, translate a dataset overnight, score every row in a warehouse table — and the latency metric changes entirely. No one watches any single response, so TTFT and TPOT stop mattering on their own. What matters is makespan: the wall-clock to drain the whole job.

Makespan is just throughput wearing a clock. Run the job as back-to-back waves of B requests:

makespan ≈ ⌈N_req / B⌉ · (TTFT + TPOT·(G−1)) ≈ total_tokens / aggregate_throughput

The two forms agree whenever decode dominates the wave (the per-wave TTFT is a rounding error) — and that agreement is the lesson: for batch work, minimizing latency means maximizing throughput (lesson 03b), the exact opposite of the online knee. You'd gladly make any single request slower if a fuller batch drained the queue sooner.

Worked: generate 100k responses offline

100,000 requests × 500 output tokens = 5·10⁷ tokens. With no SLO, push the batch to the lesson-04 memory limit (~256 on our 8B replica) and ride near the decode ceiling — ~7,200 tok/s, comfortably above the online knee's ~6,300 (03b §3). makespan ≈ 5e7 / 7,200 ≈ 6,900 s ≈ 1.9 h. Cross-check the wave form: ⌈100000/256⌉ = 391 waves × (TTFT + 499·TPOT) at ~35 ms TPOT ≈ 391 · 17.7 s ≈ 6,900 s — the same number (the per-wave TTFT is a rounding error at this output length). Want it faster? Raise the throughput — bigger batch, fp8 KV, more GPUs — never "lower the latency."
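Both forms of the makespan estimate fit in a few lines, which makes the cross-check repeatable. A sketch under stated assumptions: the 0.2 s per-wave TTFT is a placeholder (the worked example does not state it), and the function names are illustrative.

```python
import math

# Offline makespan, two ways:
#   wave form:       ceil(N_req / B) * (TTFT + TPOT * (G - 1))
#   throughput form: total_tokens / aggregate_throughput


def makespan_waves(n_requests: int, batch: int, ttft: float, tpot: float,
                   output_tokens: int) -> float:
    """Seconds to drain the job as back-to-back waves of `batch` requests."""
    waves = math.ceil(n_requests / batch)
    return waves * (ttft + tpot * (output_tokens - 1))


def makespan_throughput(n_requests: int, output_tokens: int,
                        tokens_per_second: float) -> float:
    """Seconds to emit every output token at a given aggregate rate."""
    return n_requests * output_tokens / tokens_per_second


if __name__ == "__main__":
    # ttft=0.2 is a placeholder; the other inputs are from the worked example.
    by_waves = makespan_waves(100_000, 256, ttft=0.2, tpot=0.035,
                              output_tokens=500)
    by_rate = makespan_throughput(100_000, 500, tokens_per_second=7_200)
    print(f"wave form:       {by_waves:,.0f} s ({by_waves / 3600:.2f} h)")
    print(f"throughput form: {by_rate:,.0f} s ({by_rate / 3600:.2f} h)")
```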

The straggler tax — and why offline jobs sort

Under static batching a wave can't return until its longest member finishes (lesson 04). Drop one 2,000-token output into a wave of 50-token ones and all the slots pay for 2,000 — makespan balloons. Offline, the fix is free because every request is in hand up front: sort or bucket by expected length so each wave is uniform. (Continuous batching removes most of the tax automatically; bucketing mops up the rest.) Pure embedding jobs dodge decode entirely — one forward pass per input, no token loop — so their makespan is set by prefill throughput, computed in 03b (at corpus scale the bottleneck then moves off the GPU to index build — the subject of lesson 18).
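The straggler tax is easy to reproduce on a toy job. The sketch below counts decode steps under static batching for a hypothetical eight-request job, and assumes output lengths are known (or predicted exactly) before the job starts, which real jobs can only approximate.

```python
# Straggler tax under static batching: a wave runs until its longest member
# finishes, so every slot in the wave pays for the maximum output length.
# Assumption: output lengths are known (or predicted) before the job starts.

def static_wave_steps(lengths: list, batch: int) -> int:
    """Total decode steps to drain `lengths` in fixed waves of `batch`."""
    if batch < 1:
        raise ValueError("batch must be >= 1")
    return sum(max(lengths[i:i + batch])
               for i in range(0, len(lengths), batch))


if __name__ == "__main__":
    # Hypothetical job: mostly 50-token outputs with two 2,000-token stragglers.
    lengths = [50, 50, 2000, 50, 50, 2000, 50, 50]
    print("arrival order:   ", static_wave_steps(lengths, batch=4), "steps")
    print("sorted by length:", static_wave_steps(sorted(lengths), batch=4), "steps")
```

In arrival order each wave contains one straggler, so both waves run for 2,000 steps; after sorting, the stragglers share a single wave and the other wave finishes in 50.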

Interactive · latency calculator

Pick a model, GPU, prompt, batch, and output length. The widget runs §2–§5 and shows TTFT, TPOT, and E2E — plus which decode regime you're in and how each step's bytes split between weights and KV. Find the batch where the bar flips from blue (weight-bound) to amber (KV-bound): that's B* (reported live in the note), and it's where your latency stops being free. Drag the prompt short and B* shoots past the slider — that is the lesson: short contexts stay weight-bound, so batching never costs you latency. The job-size knob turns the same per-request numbers into an offline makespan (§6): raise the batch and watch a 100k-request job's wall-clock collapse — the batch latency you cut by maximizing throughput.

TTFT / TPOT / E2E estimator

Assumptions, stated honestly (lesson 02 precision, ±30%): weights fp16 = 2N; KV/token interpolated between lesson 02's anchors (128 KB at 8B → 320 KB at 70B); context L = prompt + half the output (a mid-decode average — so the B* shown below differs slightly from §4's L = 2000 box); seq² prefill term ignored; no queueing. It's a floor.
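For readers without the interactive page, the sketch below strings §2 through §6 together under the assumptions just listed. It is not the widget's code: the linear-in-N interpolation of KV bytes per token, the H100 defaults, and all names are assumptions made for illustration.

```python
import math

KIB = 1024


def kv_bytes_per_token(n_billion: float) -> float:
    """KV bytes/token interpolated between the lesson's two anchors.

    Anchors: 128 KB at 8B params, 320 KB at 70B. Interpolating linearly in
    parameter count is an assumption of this sketch.
    """
    lo_n, lo_kv = 8.0, 128 * KIB
    hi_n, hi_kv = 70.0, 320 * KIB
    t = (n_billion - lo_n) / (hi_n - lo_n)
    return lo_kv + t * (hi_kv - lo_kv)


def estimate(n_billion: float = 8.0, prompt: int = 2048, output: int = 300,
             batch: int = 1, mfu: float = 0.5, mbu: float = 0.7,
             peak_flops: float = 990e12, peak_bw: float = 3.35e12,
             job_size: int = 100_000) -> dict:
    """Best-case TTFT, TPOT, E2E, B* and offline makespan (all in seconds)."""
    n = n_billion * 1e9
    context = prompt + output / 2          # mid-decode average context
    kv_tok = kv_bytes_per_token(n_billion)

    ttft = 2 * n * prompt / (peak_flops * mfu)
    weight_bytes = 2 * n                   # fp16 weights
    kv_bytes = batch * context * kv_tok
    tpot = (weight_bytes + kv_bytes) / (peak_bw * mbu)
    e2e = ttft + tpot * (output - 1)
    b_star = weight_bytes / (context * kv_tok)

    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "e2e_s": e2e,
        "b_star": b_star,
        "regime": "weight-bound" if batch < b_star else "KV-bound",
        "kv_share": kv_bytes / (weight_bytes + kv_bytes),
        "makespan_s": math.ceil(job_size / batch) * e2e,
    }


if __name__ == "__main__":
    for b in (1, 64, 256):
        r = estimate(batch=b)
        print(f"B={b:>3}: TTFT {r['ttft_s'] * 1e3:.0f} ms, "
              f"TPOT {r['tpot_s'] * 1e3:.1f} ms, E2E {r['e2e_s']:.1f} s, "
              f"{r['regime']}, makespan {r['makespan_s'] / 3600:.1f} h")
```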

Widget inputs (defaults): GPU; params N = 8 (billions); prompt P = 2048 tok; output G = 300 tok; batch B = 1; MFU 50%; MBU 70%; job size 100,000 requests.

Widget readouts: TTFT (prefill); TPOT (per token); single-stream tok/s; E2E latency (online); makespan (offline job).

Bar 1, decode step: bytes read = weights (blue) + KV (amber).

Bar 2, E2E split: prefill TTFT (green) vs decode TPOT·(G−1) (track). Drag output G and watch which phase owns the wait.

What carries forward

Latency = work ÷ rate, once per phase. TTFT = 2NP / (peak·MFU) (compute-bound); TPOT = (2N + B·L·kv) / (BW·MBU) (bandwidth-bound).
At small batch, TPOT ≈ 2N/BW: just the cost of reading the weights once. This is why decode batching is nearly free until B* = 2N/(L·kv_per_tok).
Cross B* and TPOT climbs linearly — latency stops being free, and longer contexts push B* down.
E2E = TTFT + TPOT·(G−1) tells you which phase to optimize from the output length alone.
The estimate is a floor — add queueing and contention for the real tail. But an SLO that fails the floor is physically impossible; that check alone is worth the arithmetic.
Offline/batch latency is makespan = ⌈N_req/B⌉·E2E ≈ total_tokens/throughput, so batch jobs are optimized by maximizing throughput (03b) and sorted by length to kill the straggler tax.