Case study — a code assistant (Copilot-style)
The capstone designed a system where output length dominated. Here the opposite wall binds: outputs are tiny, but the user is typing, so TTFT is everything — and the input is a 4–8K-token file that barely changes between keystrokes. This is the first of six case studies; each takes the same loop to a workload whose binding constraint is different, so the resulting system looks different even though the model is ordinary.
Run the loop. Before reading on: what's the binding constraint — memory, bandwidth, compute, or latency? And what single mechanism saves this design?
Stage 0 · Requirements → the number that matters (lesson 03)
Output is ~25 tokens, so time-in-system is short: W ≈ TTFT + 25·TPOT ≈ 0.2 + 25·0.01 ≈ 0.45 s. By Little's Law, concurrency is L = λW = 2000·0.45 = 900, which is modest. This is not a throughput or memory problem; it is a TTFT problem, because the wait for the first token is what a typing user perceives. And TTFT is prefill latency (lesson 04), so the whole design is about making prefill of a 6K-token context disappear.
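The Stage 0 arithmetic can be reproduced in a few lines. This is a minimal sketch using only the figures quoted above; the function and parameter names are illustrative, and the arrival rate of 2000 is assumed to be in requests per second since W is in seconds.

```python
def time_in_system(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """W = TTFT + output_tokens * TPOT, in seconds."""
    return ttft_s + output_tokens * tpot_s


def concurrency(arrival_rate_per_s: float, w_s: float) -> float:
    """Little's Law: L = lambda * W (average requests in flight)."""
    return arrival_rate_per_s * w_s


# Figures from the text: 200 ms TTFT budget, ~25 output tokens, 10 ms TPOT.
w = time_in_system(ttft_s=0.2, output_tokens=25, tpot_s=0.01)
in_flight = concurrency(arrival_rate_per_s=2000, w_s=w)

print(f"W = {w:.2f} s, L = {in_flight:.0f} requests in flight")
```

Changing `output_tokens` to a capstone-style value in the hundreds shows how quickly concurrency, rather than TTFT, would become the concern.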
Stage 1 · The arithmetic says you're already over budget (lesson 02, 04)
Prefill is compute-bound (lesson 01). For a 7B model at ~40% MFU on one H100: prefill throughput = 990e12·0.4 / (2·7e9) ≈ 28,000 tokens/s. A cold 6,000-token prefill therefore costs 6000 / 28000 ≈ 214 ms: you've blown the 200 ms TTFT budget before decoding a single token. A bigger model makes it strictly worse (13B → ~400 ms, since prefill time scales linearly with parameter count). The naive design fails on arithmetic alone.
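As a rough check of the Stage 1 numbers, the sketch below applies the same formula (peak FLOP/s × MFU, divided by 2 FLOPs per parameter per token). The constant and function names are illustrative; the 990e12 peak and 40% MFU are the values quoted above.

```python
H100_PEAK_FLOPS = 990e12  # peak FLOP/s figure used in the text
MFU = 0.40                # model FLOPs utilization assumed in the text


def prefill_tokens_per_s(params: float, peak_flops: float = H100_PEAK_FLOPS,
                         mfu: float = MFU) -> float:
    """Compute-bound prefill throughput: ~2 FLOPs per parameter per token."""
    return peak_flops * mfu / (2 * params)


def cold_prefill_ms(context_tokens: int, params: float) -> float:
    """Time to prefill the full context with no cached prefix, in ms."""
    return 1000 * context_tokens / prefill_tokens_per_s(params)


for name, params in [("7B", 7e9), ("13B", 13e9)]:
    ms = cold_prefill_ms(6000, params)
    verdict = "over" if ms > 200 else "within"
    print(f"{name}: {prefill_tokens_per_s(params):,.0f} tok/s, "
          f"cold prefill {ms:.0f} ms ({verdict} the 200 ms budget)")
```

Without rounding the throughput to 28,000 tokens/s, the 7B figure comes out a couple of milliseconds below 214 ms; either way it exceeds the budget.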
Stage 2 · Prefix caching turns the problem inside out (lesson 06)
With a prefix cache / RadixAttention (lesson 06, SGLang 04), a request reuses the KV of the longest matching prefix already in cache. On a cache hit, you prefill only the new tokens:
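A minimal sketch of the warm-versus-cold comparison follows. The number of new tokens per keystroke is not stated here, so `NEW_TOKENS = 120` is a hypothetical value, chosen only because it is consistent with a roughly 50× reduction; the 28,000 tokens/s throughput is the Stage 1 figure.

```python
PREFILL_TOKENS_PER_S = 28_000  # 7B prefill throughput from Stage 1
CONTEXT_TOKENS = 6_000         # full file context
NEW_TOKENS = 120               # hypothetical: tokens changed since the cached prefix


def prefill_ms(tokens: int, tokens_per_s: float = PREFILL_TOKENS_PER_S) -> float:
    """Prefill time in ms for the tokens that actually need computing."""
    return 1000 * tokens / tokens_per_s


cold_ms = prefill_ms(CONTEXT_TOKENS)  # cache miss: recompute the whole context
warm_ms = prefill_ms(NEW_TOKENS)      # cache hit: only the uncached suffix
speedup = cold_ms / warm_ms

print(f"cold {cold_ms:.0f} ms, warm {warm_ms:.1f} ms, speedup {speedup:.0f}x")
```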
That is a ~50× collapse in TTFT, and it moves the binding constraint from "can I afford the prefill?" to "can I hit the cache?" — which is now a routing and capacity question, not a compute one:
- Cache-aware routing (lesson 05). A user's requests must land on the replica holding their file's KV. Use hash-by-session or by-file-prefix routing, not round-robin: round-robin would give a cold cache on most requests and resurrect the 214 ms problem (see SGLang 05).
- KV residency. The file's KV must survive between keystrokes (seconds) without eviction. At 6K tokens × ~128 KB/token (7B) ≈ 0.77 GB per active file, each warm session is cheap: an 80 GB H100 holding ~14 GB of FP16 weights has room for roughly 80 such files, so a modest fleet keeps thousands of editing sessions warm. This is why a small model wins twice: less prefill compute and more KV room for warm sessions.
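The two bullets above can be sketched together. This is not any router's actual API: `route` is a hypothetical helper that uses rendezvous (highest-random-weight) hashing as one possible way to pin a session to a replica, and `warm_files_per_gpu` is a back-of-envelope capacity estimate that assumes 80 GB of HBM, ~14 GB of FP16 weights, and ignores activations and fragmentation.

```python
import hashlib


def route(session_key: str, replicas: list[str]) -> str:
    """Pick a replica by rendezvous hashing.

    The same key always maps to the same replica, and removing a replica
    only remaps the keys that were assigned to it.
    """
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{session_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


def warm_files_per_gpu(hbm_gb: float = 80, weights_gb: float = 14,
                       context_tokens: int = 6000,
                       kv_kb_per_token: float = 128) -> int:
    """How many files' KV fit beside the weights (assumed figures, no overheads)."""
    kv_gb_per_file = context_tokens * kv_kb_per_token / 1e6  # KB -> GB
    return int((hbm_gb - weights_gb) // kv_gb_per_file)


replicas = ["replica-a", "replica-b", "replica-c"]  # hypothetical replica names
key = "user42:src/main.py"                          # hypothetical session/file key

print(route(key, replicas))   # stable across calls for the same key
print(warm_files_per_gpu())   # warm 6K-token files per GPU under these assumptions
```

Rendezvous hashing is chosen here only because it keeps most sessions in place when a replica is added or removed; a production router would also need to account for load and for prefixes shared across users.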
Stage 3 · The remaining latency, squeezed (lesson 06)
Even warm, you want margin under 200 ms. Two more levers, each justified by the binding wall:
- Small model + speculative decoding. Autocomplete is small-batch and latency-critical — exactly where speculative decoding pays (lesson 06: spare compute, bandwidth-bound decode). A 1–3B draft against a 7B target, or a dedicated tiny autocomplete model, keeps decode steps few. Don't use the chat model here.
- Two tiers, two models. Autocomplete (tiny model, aggressive prefix cache, spec decode) and chat (bigger model, normal serving) are different SLOs → different replicas, possibly different hardware. Trying to serve both from one pool means the chat prefills stall the autocomplete TTFT (lesson 04's contention) — so disaggregate by tier.
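To see why speculative decoding keeps decode steps few, the sketch below uses the standard expected-tokens-per-step estimate, (1 − α^(k+1)) / (1 − α), which assumes each drafted token is accepted independently with probability α. The acceptance rate and draft length are hypothetical values, not measurements, and the step count is a rough estimate from the expectation.

```python
import math


def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes independent per-token acceptance with probability accept_rate.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def target_steps(output_tokens: int, accept_rate: float, draft_len: int) -> int:
    """Approximate target-model passes needed for the whole completion."""
    return math.ceil(output_tokens / expected_tokens_per_step(accept_rate, draft_len))


# Hypothetical: 80% acceptance, 4 drafted tokens per step, 25-token completion.
per_step = expected_tokens_per_step(accept_rate=0.8, draft_len=4)
steps = target_steps(output_tokens=25, accept_rate=0.8, draft_len=4)

print(f"{per_step:.2f} tokens/step -> about {steps} target passes instead of 25")
```

The estimate ignores the draft model's own cost, so it is an upper bound on the benefit rather than a latency prediction.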
Interactive · the cache-hit cliff
Slide the prefix-cache hit rate. Watch p95 TTFT fall off the 200 ms cliff the instant hits become common — and watch how a bigger model just shifts the cliff, it doesn't remove the need to be on the right side of it.
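The cliff can be approximated offline with a two-point latency model: every hit costs the warm prefill time and every miss costs the cold one. This is a deliberately crude sketch that ignores queueing and variance; the warm latencies are hypothetical, and the cold latencies are the Stage 1 estimates.

```python
def p95_ttft_ms(hit_rate: float, warm_ms: float, cold_ms: float,
                percentile: float = 0.95) -> float:
    """Percentile TTFT when hits cost warm_ms and misses cost cold_ms.

    The percentile drops to warm_ms only once hits cover that fraction
    of requests; below that, it sits at the cold latency.
    """
    return warm_ms if hit_rate >= percentile else cold_ms


# (warm_ms, cold_ms): warm values are hypothetical, cold values from Stage 1.
MODELS = {"7B": (4.0, 214.0), "13B": (8.0, 400.0)}

for hit_rate in [0.50, 0.80, 0.90, 0.95, 0.99]:
    row = ", ".join(
        f"{name}: {p95_ttft_ms(hit_rate, warm, cold):.0f} ms"
        for name, (warm, cold) in MODELS.items()
    )
    print(f"hit rate {hit_rate:.0%} -> p95 TTFT {row}")
```

Under this model the larger model changes the heights on either side of the cliff, but being on the warm side of it still decides whether the 200 ms budget is met.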
What this case teaches
- The workload picks the wall. Same transformer as the capstone; because outputs are tiny and inputs are near-static, the binding wall is TTFT-via-redundant-prefill, not concurrency-via-output-length.
- Prefix caching is the load-bearing mechanism — and it converts a compute problem into a routing + KV-residency problem (lessons 05–06).
- Tier by SLO. Autocomplete and chat are different systems; co-locating them recreates lesson 04's prefill/decode contention at the product level.
- Smaller is faster here — less prefill compute and more warm KV. The instinct to "use the best model" is wrong when TTFT binds.