Case study — a code assistant (Copilot-style)

The capstone designed a system where output length dominated. Here the opposite wall binds: outputs are tiny, but the user is typing, so TTFT is everything — and the input is a 4–8K-token file that barely changes between keystrokes. This is the first of six case studies; each takes the same loop to a workload whose binding constraint is different, so the resulting system looks different even though the model is ordinary.

The brief

Inline autocomplete in an IDE.
Context: ~6,000 tokens (the open file + a few retrieved neighbours).
Output: 10–40 tokens (a line or two).
Trigger: on every typing pause; bursty, up to 2,000 req/s across users.
SLO: p95 TTFT < 200 ms (slower than that and the suggestion arrives after the human has already typed the line).
Plus a chat tier with normal SLOs.
Budget: it has to be cheap enough to give away.

Run the loop. Before reading on: what's the binding constraint — memory, bandwidth, compute, or latency? And what single mechanism saves this design?

Stage 0 · Requirements → the number that matters (lesson 03)

Output is ~25 tokens, so each request is short-lived: W ≈ TTFT + 25·TPOT ≈ 0.2 + 25·0.01 ≈ 0.45 s (taking TTFT at its SLO ceiling and TPOT ≈ 10 ms). By Little's Law, concurrency L = λW = 2000·0.45 = 900, which is modest. This is not a throughput or memory problem; it is a TTFT problem, because a suggestion is only useful if it starts arriving before the user types on. And TTFT is prefill latency (lesson 04), so the whole design is about making prefill of a 6K-token context disappear.
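The Stage 0 arithmetic can be checked in a few lines. This is a minimal sketch: the function names are illustrative, and the 10 ms TPOT is the value implied by the estimate above rather than a measured figure.

```python
def time_in_system(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """W: seconds one request spends in the system (first token plus decode)."""
    return ttft_s + output_tokens * tpot_s


def concurrency(arrival_rate_per_s: float, w_s: float) -> float:
    """Little's Law: average requests in flight, L = lambda * W."""
    return arrival_rate_per_s * w_s


# Inputs from the brief: TTFT at the 200 ms SLO ceiling, ~25 output tokens,
# an assumed 10 ms per output token, and a 2,000 req/s peak arrival rate.
w = time_in_system(ttft_s=0.2, output_tokens=25, tpot_s=0.010)
in_flight = concurrency(arrival_rate_per_s=2000, w_s=w)
print(f"W = {w:.2f} s, L = {in_flight:.0f} requests in flight")
```

Doubling the output length to 50 tokens would raise W to 0.7 s and L to 1,400, which is still small; concurrency is not where this workload hurts.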

Stage 1 · The arithmetic says you're already over budget (lesson 02, 04)

Prefill is compute-bound (lesson 01). For a 7B model at ~40% MFU on one H100: prefill throughput = 990e12·0.4 / (2·7e9) ≈ 28,000 tokens/s. A cold 6,000-token prefill therefore costs 6000 / 28000 ≈ 214 ms: you've blown the 200 ms TTFT budget before decoding a single token. A bigger model makes it strictly worse, because prefill time scales linearly with parameter count (13B → ~400 ms). The naive design fails on arithmetic alone.
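A small sketch of the same estimate, useful for trying other model sizes. It assumes the 2·N FLOPs-per-token approximation and the 990 TFLOP/s peak and 40% MFU figures used above; the function names are illustrative.

```python
def prefill_tokens_per_s(n_params: float, peak_flops: float = 990e12,
                         mfu: float = 0.4) -> float:
    """Compute-bound prefill throughput, at roughly 2 * N FLOPs per token."""
    return peak_flops * mfu / (2 * n_params)


def cold_prefill_ms(context_tokens: int, n_params: float) -> float:
    """Time to prefill the whole context with nothing cached, in milliseconds."""
    return 1000 * context_tokens / prefill_tokens_per_s(n_params)


for name, n_params in (("7B", 7e9), ("13B", 13e9)):
    ms = cold_prefill_ms(6000, n_params)
    verdict = "over" if ms > 200 else "under"
    print(f"{name}: {ms:.0f} ms cold prefill ({verdict} the 200 ms budget)")
```

Unrounded, the 7B case comes to about 212 ms (the 214 ms above rounds throughput down to 28,000 tokens/s) and the 13B case to about 394 ms. Both miss the budget before any decoding happens.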

Binding constraint, stage 1

Prefill latency on a near-static 6K context. But look at the workload: between two keystroke-triggered requests the context changed by a handful of tokens. You are re-prefilling 6,000 tokens to handle 5 new ones. That redundancy is the whole opportunity — exactly the SGLang premise (lesson 03's prefix-share parameter, lesson 06's prefix caching).

Stage 2 · Prefix caching turns the problem inside out (lesson 06)

With a prefix cache / RadixAttention (lesson 06, SGLang 04), a request reuses the KV of the longest matching prefix already in cache. On a cache hit, you prefill only the new tokens:

TTFT_warm ≈ prefill(new tokens) + 1 decode step ≈ 5/28000 s + ~3 ms ≈ 3–4 ms

That is a ~50× collapse in TTFT, and it moves the binding constraint from "can I afford the prefill?" to "can I hit the cache?" — which is now a routing and capacity question, not a compute one:

Cache-aware routing (lesson 05, SGLang 05). A user's requests must land on the replica holding their file's KV. Use hash-by-session or by-file-prefix routing, not round-robin: round-robin would give a cold cache on most requests and resurrect the 214 ms problem.
KV residency. The file's KV must survive between keystrokes (seconds) without eviction. At 6K tokens × ~128 KB/token (7B) ≈ 0.77 GB per active file, that is cheap enough to keep on the order of 80 files warm per 80 GB H100 (after ~14 GB of FP16 weights), and thousands of editing sessions warm across a fleet. This is why a small model wins twice: less prefill compute and more KV room for warm sessions.
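Both requirements can be illustrated with a toy model. The sketch below is not SGLang's router: `sticky_replica`, `count_cold`, the fleet size and the session counts are all hypothetical, eviction is ignored, and a replica counts as warm for a session once it has served that session at all. The free-memory figure assumes an 80 GB H100 holding ~14 GB of FP16 weights and ignores activation overhead.

```python
import hashlib
import random


def kv_gb_per_file(context_tokens: int, kv_bytes_per_token: float = 128e3) -> float:
    """KV footprint of one warm file, using the ~128 KB/token figure above."""
    return context_tokens * kv_bytes_per_token / 1e9


def sticky_replica(session_key: str, n_replicas: int) -> int:
    """Stable hash routing: a session always maps to the same replica.
    hashlib is used because Python's built-in str hash varies per process."""
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_replicas


def count_cold(requests, pick_replica) -> int:
    """Count requests landing on a replica that holds no KV for that session."""
    warm = set()  # (replica, session) pairs already prefilled
    cold = 0
    for i, session in enumerate(requests):
        key = (pick_replica(i, session), session)
        if key not in warm:
            cold += 1
            warm.add(key)
    return cold


N_REPLICAS, N_SESSIONS, KEYSTROKES = 8, 100, 8  # hypothetical fleet and load
requests = [f"session-{s}" for s in range(N_SESSIONS) for _ in range(KEYSTROKES)]
random.Random(0).shuffle(requests)  # interleave users in a fixed, seeded order

cold_sticky = count_cold(requests, lambda i, s: sticky_replica(s, N_REPLICAS))
cold_round_robin = count_cold(requests, lambda i, s: i % N_REPLICAS)

per_file_gb = kv_gb_per_file(6000)
free_gb = 80 - 14  # assumed: H100 memory minus FP16 weights of a 7B model
warm_files_per_gpu = int(free_gb // per_file_gb)

print(f"cold prefills, sticky routing: {cold_sticky} of {len(requests)}")
print(f"cold prefills, round-robin:    {cold_round_robin} of {len(requests)}")
print(f"KV per file: {per_file_gb:.2f} GB, warm files per GPU: {warm_files_per_gpu}")
```

With sticky routing each session pays for exactly one cold prefill. Round-robin pays again every time a session touches a new replica, and it also stores the same file's KV on several replicas, which eats into the residency budget.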

Stage 3 · The remaining latency, squeezed (lesson 06)

Even warm, you want margin under 200 ms. Two more levers, each justified by the binding wall:

Small model + speculative decoding. Autocomplete is small-batch and latency-critical — exactly where speculative decoding pays (lesson 06: spare compute, bandwidth-bound decode). A 1–3B draft against a 7B target, or a dedicated tiny autocomplete model, keeps decode steps few. Don't use the chat model here.
Two tiers, two models. Autocomplete (tiny model, aggressive prefix cache, spec decode) and chat (bigger model, normal serving) have different SLOs, so they get different replicas and possibly different hardware. Serving both from one pool lets chat prefills stall autocomplete TTFT (lesson 04's contention), so disaggregate by tier.
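To see why speculative decoding "keeps decode steps few", a rough estimate helps. The sketch below uses the standard expected-tokens-per-step formula from the speculative decoding literature, which assumes each drafted token is accepted independently with the same probability. The acceptance rate of 0.8 and draft length of 4 are illustrative assumptions, not numbers from this lesson, and the cost of running the draft model is ignored.

```python
import math


def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model step when the draft proposes
    draft_len tokens and each is accepted independently with acceptance_rate."""
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def target_steps(output_tokens: int, acceptance_rate: float, draft_len: int) -> int:
    """Approximate number of target-model forward passes for one completion."""
    per_step = expected_tokens_per_step(acceptance_rate, draft_len)
    return math.ceil(output_tokens / per_step)


plain = target_steps(25, acceptance_rate=0.0, draft_len=4)  # nothing accepted
spec = target_steps(25, acceptance_rate=0.8, draft_len=4)   # assumed rate
print(f"target steps for 25 tokens: {plain} without speculation, about {spec} with")
```

Under these assumptions a 25-token completion needs about 8 target-model steps instead of 25. The real gain depends on how well the draft predicts code in the user's file, which is something to measure rather than assume.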

Binding constraint, final

Cache hit rate. Once prefix caching is in, every percentage point of hit rate is TTFT and cost. The system's job becomes maximizing residency and routing accuracy — a memory-and-scheduling problem wearing a latency problem's clothes. Notice we never bought faster compute; the wall was redundant prefill, so the fix was to delete it.

Interactive · the cache-hit cliff

Slide the prefix-cache hit rate. Watch p95 TTFT fall off the 200 ms cliff the instant hits become common — and watch how a bigger model just shifts the cliff, it doesn't remove the need to be on the right side of it.
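Without the slider, the cliff can be reproduced with a two-point model: every request is either fully warm or fully cold. This is a simplifying assumption (real requests can hit part of a prefix), and the constants are the ones used earlier in this case study. Under it, the 95th percentile is warm only once at least 95% of requests hit the cache.

```python
PEAK_FLOPS = 990e12   # H100 dense peak, as used in Stage 1
MFU = 0.4             # assumed utilization, as used in Stage 1
DECODE_STEP_MS = 3.0  # the ~3 ms decode step from Stage 2


def prefill_ms(tokens: int, n_params: float) -> float:
    """Compute-bound prefill time at roughly 2 * N FLOPs per token."""
    return 1000 * tokens * 2 * n_params / (PEAK_FLOPS * MFU)


def p95_ttft_ms(hit_rate: float, n_params: float,
                context_tokens: int = 6000, new_tokens: int = 5) -> float:
    """p95 TTFT when each request is warm with probability hit_rate.
    Cold and warm costs follow the expressions used in Stages 1 and 2."""
    cold = prefill_ms(context_tokens, n_params)
    warm = prefill_ms(new_tokens, n_params) + DECODE_STEP_MS
    # The 95th percentile falls in the warm mass only if it covers >= 95%.
    return warm if hit_rate >= 0.95 else cold


for name, n_params in (("7B", 7e9), ("13B", 13e9)):
    for hit_rate in (0.50, 0.90, 0.94, 0.95, 0.99):
        ms = p95_ttft_ms(hit_rate, n_params)
        status = "meets SLO" if ms < 200 else "misses SLO"
        print(f"{name} hit={hit_rate:.2f}: p95 TTFT = {ms:6.1f} ms ({status})")
```

In this simplified model a larger model raises the cold plateau (about 394 ms for 13B against about 212 ms for 7B) but does nothing to remove the cliff: below a 95% hit rate the SLO is missed for both, and above it both sit at a few milliseconds.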

What this case teaches

The workload picks the wall. Same transformer as the capstone; because outputs are tiny and inputs are near-static, the binding wall is TTFT-via-redundant-prefill, not concurrency-via-output-length.
Prefix caching is the load-bearing mechanism — and it converts a compute problem into a routing + KV-residency problem (lessons 05–06).
Tier by SLO. Autocomplete and chat are different systems; co-locating them recreates lesson 04's prefill/decode contention at the product level.
Smaller is faster here — less prefill compute and more warm KV. The instinct to "use the best model" is wrong when TTFT binds.