
Case study — a code assistant (Copilot-style)

The capstone designed a system where output length dominated. Here the opposite wall binds: outputs are tiny, but the user is typing, so TTFT is everything — and the input is a 4–8K-token file that barely changes between keystrokes. This is the first of six case studies; each takes the same loop to a workload whose binding constraint is different, so the resulting system looks different even though the model is ordinary.

The brief
Inline autocomplete in an IDE. Context: ~6,000 tokens (the open file + a few retrieved neighbours). Output: 10–40 tokens (a line or two). Trigger: on every typing pause — bursty, up to 2,000 req/s across users. SLO: p95 TTFT < 200 ms (slower than that and the suggestion arrives after the human has already typed the line). Plus a chat tier with normal SLOs. Budget: it has to be cheap enough to give away.

Run the loop. Before reading on: what's the binding constraint — memory, bandwidth, compute, or latency? And what single mechanism saves this design?

Stage 0 · Requirements → the number that matters (lesson 03)

Output is ~25 tokens, so time in system is short: W ≈ TTFT + 25·TPOT ≈ 0.2 + 25·0.01 ≈ 0.45 s, taking TTFT at its 200 ms ceiling and TPOT ≈ 10 ms. By Little's Law, concurrency is L = λW = 2000·0.45 = 900, which is modest. This is not a throughput or memory problem; it is a TTFT problem, because the SLO is on TTFT. And TTFT is prefill latency (lesson 04), so the whole design is about making prefill of a 6K-token context disappear.
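The Stage 0 numbers can be reproduced in a few lines. This is a minimal sketch using only the figures from the brief and the ~10 ms TPOT used above; the function and constant names are illustrative, not from any library.

```python
# Little's Law, L = lambda * W, with the brief's numbers.
ARRIVAL_RATE = 2000.0   # req/s, peak across users (from the brief)
TTFT_S = 0.200          # s, taken at the SLO ceiling
TPOT_S = 0.010          # s per output token (the ~10 ms used in the text)
OUTPUT_TOKENS = 25      # midpoint of the 10-40 token range


def time_in_system(ttft_s, tpot_s, output_tokens):
    """W: seconds one request spends in the system."""
    return ttft_s + output_tokens * tpot_s


def concurrency(arrival_rate, w_s):
    """L = lambda * W: average number of requests in flight."""
    return arrival_rate * w_s


W = time_in_system(TTFT_S, TPOT_S, OUTPUT_TOKENS)
L = concurrency(ARRIVAL_RATE, W)
print(f"W = {W:.2f} s, L = {L:.0f} requests in flight")
```

Changing `OUTPUT_TOKENS` across the 10 to 40 range moves L between roughly 600 and 1,200, so the "modest concurrency" conclusion does not depend on the midpoint.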

Stage 1 · The arithmetic says you're already over budget (lesson 02, 04)

Prefill is compute-bound (lesson 01). For a 7B model at ~40% MFU on one H100: prefill throughput = 990e12·0.4 / (2·7e9) ≈ 28,000 tokens/s. A cold 6,000-token prefill therefore costs 6000 / 28000 ≈ 214 ms: you've blown the 200 ms TTFT budget before decoding a single token. A bigger model makes it strictly worse (13B → ~400 ms at the same MFU). The naive design fails on arithmetic alone.
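The same estimate as a small calculator, assuming the text's figures (990e12 FLOP/s peak, 40% MFU, 2 FLOPs per parameter per token). The names are illustrative. Without rounding the throughput to 28,000, the 7B figure comes out near 212 ms and the 13B figure near 394 ms, both well inside the series' ±30% tolerance.

```python
H100_PEAK_FLOPS = 990e12   # peak FLOP/s figure used in the text
MFU = 0.40                 # model FLOPs utilization assumed in the text


def prefill_tokens_per_s(params, peak_flops=H100_PEAK_FLOPS, mfu=MFU):
    """Compute-bound prefill rate: about 2 FLOPs per parameter per token."""
    return peak_flops * mfu / (2 * params)


def cold_prefill_ms(context_tokens, params):
    """Milliseconds to prefill the whole context with no cache reuse."""
    return 1000.0 * context_tokens / prefill_tokens_per_s(params)


for name, params in (("7B", 7e9), ("13B", 13e9)):
    rate = prefill_tokens_per_s(params)
    cold = cold_prefill_ms(6000, params)
    print(f"{name}: {rate:,.0f} tok/s, cold 6K prefill ~ {cold:.0f} ms")
```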

Binding constraint, stage 1
Prefill latency on a near-static 6K context. But look at the workload: between two keystroke-triggered requests the context changed by a handful of tokens. You are re-prefilling 6,000 tokens to handle 5 new ones. That redundancy is the whole opportunity — exactly the SGLang premise (lesson 03's prefix-share parameter, lesson 06's prefix caching).

Stage 2 · Prefix caching turns the problem inside out (lesson 06)

With a prefix cache / RadixAttention (lesson 06, SGLang 04), a request reuses the KV of the longest matching prefix already in cache. On a cache hit, you prefill only the new tokens:

TTFT_warm ≈ prefill(new tokens) + 1 decode step ≈ 5/28000 s + ~3 ms ≈ 3–4 ms
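A toy model of the reuse makes the warm number concrete. This is a sketch under stated assumptions, not how SGLang or any real engine stores KV: the "cache" here is just a list of previously seen token sequences, and the 28,000 tok/s and ~3 ms decode step are the text's figures.

```python
PREFILL_TOK_PER_S = 28_000   # 7B prefill rate from the text
DECODE_STEP_MS = 3.0         # one decode step, the ~3 ms used in the text


def shared_prefix_len(cached, request):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cache, request):
    """Tokens left to prefill after reusing the longest cached prefix."""
    best = max((shared_prefix_len(c, request) for c in cache), default=0)
    return len(request) - best


def ttft_ms(new_tokens):
    """Prefill of the uncached tokens plus one decode step."""
    return 1000.0 * new_tokens / PREFILL_TOK_PER_S + DECODE_STEP_MS


previous = list(range(6000))                        # stand-in token IDs for the last request
current = previous + [9001, 9002, 9003, 9004, 9005] # same context plus 5 new tokens

cold = tokens_to_prefill([], current)               # empty cache: prefill everything
warm = tokens_to_prefill([previous], current)       # cache hit: prefill only the tail
print(f"cold: {cold} tokens, ~{ttft_ms(cold):.0f} ms")
print(f"warm: {warm} tokens, ~{ttft_ms(warm):.1f} ms")
```

Note that reuse stops at the first differing token. The example appends the new tokens at the end; an edit at position k in the prompt would leave only the first k tokens reusable.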

That is a ~50× collapse in TTFT, and it moves the binding constraint from "can I afford the prefill?" to "can I hit the cache?" — which is now a routing and capacity question, not a compute one:

Stage 3 · The remaining latency, squeezed (lesson 06)

Even warm, you want margin under 200 ms. Two more levers, each justified by the binding wall:

Binding constraint, final
Cache hit rate. Once prefix caching is in, every percentage point of hit rate is TTFT and cost. The system's job becomes maximizing residency and routing accuracy — a memory-and-scheduling problem wearing a latency problem's clothes. Notice we never bought faster compute; the wall was redundant prefill, so the fix was to delete it.
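Routing accuracy here means that consecutive requests for the same file land on the replica that already holds its prefix. The lesson does not prescribe a policy; one common option is rendezvous (highest-random-weight) hashing on a session key, sketched below. The key format and replica names are hypothetical.

```python
import hashlib


def _score(session_key, replica):
    """Stable pseudo-random weight for a (session, replica) pair."""
    digest = hashlib.sha256(f"{session_key}|{replica}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_replica(session_key, replicas):
    """Rendezvous hashing: the replica with the highest weight wins."""
    return max(replicas, key=lambda r: _score(session_key, r))


replicas = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]   # hypothetical replica names
key = "user-42:src/main.py"                        # hypothetical session key

first = pick_replica(key, replicas)
again = pick_replica(key, replicas)
print(first, again)   # the same replica both times, so the prefix stays warm
```

Because the choice depends only on per-pair scores, removing a replica reroutes only the sessions that were pinned to it; every other session keeps its warm cache.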

Interactive · the cache-hit cliff

Slide the prefix-cache hit rate. Watch p95 TTFT fall off the 200 ms cliff the instant hits become common, and watch how a bigger model just shifts the cliff; it doesn't remove the need to be on the right side of it.

Autocomplete TTFT vs cache hit rate

Assumptions: cold requests pay full prefill (compute-bound, ~28k tok/s per H100 at 7B, scaled by model size); warm requests prefill only the new tokens. p95 ≈ the 95th percentile of the resulting warm/cold mix. Figures are order-of-magnitude, per the series' ±30% contract.

Readouts: cold TTFT · warm TTFT · p95 TTFT (est.) · verdict vs 200 ms
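The plot's behaviour can be approximated offline. The widget's exact formula is not given in the text, so this sketch takes one simple reading of "the 95th-percentile mix": every request is either fully cold or fully warm, which puts p95 at the cold value whenever more than 5% of requests miss. Holding the decode step at 3 ms across model sizes is a further assumption.

```python
BASE_TOK_PER_S = 28_000   # 7B prefill rate per H100, from the text
DECODE_STEP_MS = 3.0      # one decode step; held constant across sizes (assumption)
SLO_MS = 200.0            # p95 TTFT budget from the brief


def ttft_ms(prefill_tokens, params_b=7.0):
    """Prefill time plus one decode step; prefill rate scales as 1/model size."""
    tok_per_s = BASE_TOK_PER_S * 7.0 / params_b
    return 1000.0 * prefill_tokens / tok_per_s + DECODE_STEP_MS


def p95_ttft_ms(hit_rate, params_b=7.0, context=6000, new_tokens=5):
    """Two-point mixture: p95 is warm only when at least 95% of requests hit."""
    if hit_rate >= 0.95:
        return ttft_ms(new_tokens, params_b)
    return ttft_ms(context, params_b)


for hit_rate in (0.50, 0.90, 0.95, 0.99):
    p95 = p95_ttft_ms(hit_rate)
    verdict = "ok" if p95 < SLO_MS else "over budget"
    print(f"hit rate {hit_rate:.0%}: p95 TTFT ~ {p95:.0f} ms ({verdict})")
```

Under this reading a 90% hit rate still fails the SLO, because the tail is made entirely of cold requests; the mean improves smoothly with hit rate, but the p95 does not.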

What this case teaches