
Case study — a consumer chatbot at scale (ChatGPT-style)

Lesson 13's autocomplete had tiny outputs and near-static inputs, so the wall was TTFT and the fix was deleting redundant prefill. Flip every one of those properties. Here outputs are long, multi-turn history grows every turn, and traffic swings 10× between day and night at tens of thousands of req/s. The transformer is the same; the binding wall is now KV-memory × concurrency × cost, and the system that falls out is unrecognizable next to the code assistant.

The brief
A consumer chat assistant. Sessions: multi-turn, average ~6 turns. Output: ~300 tokens/turn. Context grows each turn — by turn 6 the history is ~3–4K tokens. Peak: ~50,000 req/s with a 10× day/night swing. SLO: p95 TTFT < 1 s, TPOT < 60 ms. Tiers: a free tier and a paid tier. Budget: it has to be cheap — most users pay nothing.

Run the loop. Before reading on: what binds — compute, bandwidth, memory, or cost? And what does multi-turn history do to the answer that a single-shot chat workload (lesson 03) wouldn't?

Stage 0 · Requirements → the number that matters (lesson 03)

One turn emits ~300 tokens, so the per-turn time-in-system W, the term Little's Law needs, is decode-dominated, not TTFT-dominated:

W ≈ TTFT + 300·TPOT ≈ 1 + 300·0.06 ≈ 19 s per turn

Concurrency at peak: L = λW = 50,000 · 19 ≈ 950,000 turns in flight at once. Compare lesson 13's 900. This is a near-million-way concurrency problem, and every one of those in-flight turns is holding a KV cache. Output length × request rate, not the model, just produced the headline number — and unlike the capstone's reasoning model, the length isn't even pathological; it's plain chat at consumer scale.
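The stage 0 arithmetic can be reproduced in a few lines. This is a minimal sketch that plugs the brief's SLO ceilings (1 s TTFT, 60 ms TPOT) straight into Little's Law, so it gives the worst-case concurrency the SLO permits rather than a measured value; the function names are illustrative only.

```python
def time_in_system_s(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """Per-turn time in system W: time to first token plus decode time."""
    return ttft_s + output_tokens * tpot_s


def concurrency(arrival_rate_per_s: float, time_in_system: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * time_in_system


w = time_in_system_s(ttft_s=1.0, output_tokens=300, tpot_s=0.060)
in_flight = concurrency(arrival_rate_per_s=50_000, time_in_system=w)
print(f"W = {w:.0f} s per turn, L = {in_flight:,.0f} turns in flight")
```

Halving either the output length or the request rate roughly halves L, which is why the product of the two, not the model, sets the headline number.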

Binding constraint, stage 0
Concurrency, and therefore KV memory. 950k concurrent turns each carrying a KV cache is a bytes problem before it is anything else. TTFT has a whole second of budget here — generous — so we will spend that budget to save memory, the exact opposite trade from lesson 13.

Stage 1 · Multi-turn history is the crux (lessons 02, 06)

The thing the brief adds over a single-shot chat is that a session is a sequence of turns sharing a growing history. By turn 6 the model attends over ~3–4K tokens of conversation. For a 70B model at 320 KB/token (lesson 02), a 4K-token history is:

4000 · 320 KB ≈ 1.3 GB of KV — per active session
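For readers who want to see where a figure like 320 KB/token can come from, here is a sketch of the standard KV-size formula. The layer count (80), KV-head count (8), and head dimension (128) are assumptions chosen to match a Llama-2-70B-like GQA-8 shape; the lesson itself only states the 320 KB result.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes for one token: keys and values (x2) across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed shape: 80 layers, 8 KV heads (GQA-8), head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
history_tokens = 4000
per_session = per_token * history_tokens

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_session / 1e9:.2f} GB for a {history_tokens}-token history")
```

Switching `bytes_per_elem` to 1 models an fp8 KV cache and halves both numbers.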

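Now you face a fork between turns of one session, and both branches look unaffordable: (a) keep the history KV resident between turns and hold ~1.3 GB of memory per session even while the user is idle, or (b) drop the KV after every turn and re-prefill the full 3–4K-token history on the next one.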

Prefix caching (lesson 06, SGLang 04) resolves the fork: the conversation history is a shared prefix across the turns of a session. Cache the history KV; on the next turn, prefill only the new user message (tens of tokens), reusing the cached history. You pay the memory of (a) only while the session is hot, and you pay the compute of (b) only for the genuinely new tokens.

Binding constraint, stage 1
KV memory vs prefill-compute, on the growing history. Keeping every session resident is impossible; re-prefilling every turn is slow and wasteful. Prefix caching collapses the trade — but it cannot keep all sessions hot, so you need an eviction policy: evict idle sessions' KV, and re-prefill on their next turn. A resumed session eats one cold prefill (~350 ms, still inside the 1 s TTFT budget). That deliberate TTFT hit on resume is what buys back the memory. You are spending stage 0's latency slack exactly here.
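The caching-plus-eviction policy described above can be sketched as a toy accounting model. This is not how any particular serving engine implements prefix caching; it tracks token counts only, uses a plain LRU order, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class SessionKVCache:
    """Toy per-session prefix cache with LRU eviction (token counts, no tensors)."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget_tokens = budget_tokens
        self.resident: "OrderedDict[str, int]" = OrderedDict()  # session -> cached tokens
        self.used_tokens = 0

    def run_turn(self, session_id: str, context_tokens: int, output_tokens: int) -> int:
        """Account for one turn; return how many tokens had to be prefilled.

        context_tokens is the full history plus the new user message.
        """
        cached = self.resident.pop(session_id, 0)  # 0 means the session is cold
        self.used_tokens -= cached
        prefill = context_tokens - cached  # hot: new message only; cold: everything

        total = context_tokens + output_tokens  # KV held once the reply is generated
        self.resident[session_id] = total  # re-inserted at the most-recently-used end
        self.used_tokens += total

        # Evict idle sessions, oldest first, until we are back under budget.
        while self.used_tokens > self.budget_tokens and len(self.resident) > 1:
            _, freed = self.resident.popitem(last=False)
            self.used_tokens -= freed
        return prefill


cache = SessionKVCache(budget_tokens=1000)
print(cache.run_turn("a", context_tokens=50, output_tokens=300))   # cold first turn
print(cache.run_turn("a", context_tokens=390, output_tokens=300))  # hot: 40 new tokens
print(cache.run_turn("b", context_tokens=50, output_tokens=300))   # pushes "a" out
print(cache.run_turn("a", context_tokens=720, output_tokens=300))  # resumed cold
```

The last call is the resumed-session case: "a" was evicted while idle, so its whole 720-token context is prefilled again, trading TTFT on that one turn for the memory freed in between.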

Stage 2 · Fit more sessions per GPU (lessons 02, 04, 06)

With caching deciding which sessions occupy memory, the next lever is making each session cheaper in bytes so more fit per GPU. Every byte saved per token multiplies across the resident-session count:

The compounding is the point: fp8 (×2) on top of a smaller free-tier model (×~10 in KV at 7B, per the table below) is a ~20× swing in sessions/GPU, and sessions/GPU is the denominator of the entire bill.

| Lever | Effect on KV bytes/token | Effect on sessions/GPU |
|---|---|---|
| 70B, fp16 KV | 320 KB | baseline |
| 70B, fp8 KV | 160 KB | ~2× |
| 7B free tier, fp8 KV | ~16 KB | ~20× (and weights drop 140→14 GB) |

Stage 3 · The 10× diurnal swing (lessons 05, 11)

Peak is 50k req/s; the trough is ~5k. Provision for peak 24/7 and you pay for capacity that sits idle most of the day:

wasted fraction ≈ (peak − avg)/peak; with a 10× swing, ~90% of peak-provisioned capacity sits idle at the trough (5k of 50k req/s)
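The waste formula is easy to evaluate against a load curve. The sketch below uses a made-up 24-hour profile purely for illustration; only the 5k trough and 50k peak come from the brief, and the intermediate hours are assumptions.

```python
def wasted_fraction(hourly_load: list[float]) -> float:
    """Share of a peak-provisioned fleet that sits idle, averaged over the day."""
    peak = max(hourly_load)
    avg = sum(hourly_load) / len(hourly_load)
    return (peak - avg) / peak


def idle_fraction_at(load: float, peak: float) -> float:
    """Idle share of a peak-provisioned fleet at one instant."""
    return 1 - load / peak


# Hypothetical req/s per hour: night trough, morning ramp, midday peak, evening tail.
profile = [5_000] * 6 + [20_000] * 4 + [50_000] * 4 + [35_000] * 6 + [10_000] * 4

print(f"idle at trough: {idle_fraction_at(5_000, 50_000):.0%}")
print(f"wasted over the day: {wasted_fraction(profile):.0%}")
```

The daily figure depends entirely on the shape of the curve, so it should be computed from real traffic rather than read off the peak-to-trough ratio.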

Autoscaling (lesson 05) tracks the swing — but on the right signal. GPU utilization is the trap metric (lesson 11): a memory-bound decode fleet runs near-full on KV while showing modest SM utilization, so scaling on GPU-util under-provisions right as you breach SLO. Scale on queue depth and KV-utilization instead — the signals that actually predict an SLO breach.
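As a sketch of what scaling on the right signal might look like, the function below sizes the fleet from KV-utilization and queue depth and ignores GPU utilization entirely. The thresholds (`target_kv_util`, `max_queue_per_replica`) and the `FleetSnapshot` type are hypothetical; a real autoscaler would also add smoothing and cooldowns.

```python
import math
from dataclasses import dataclass


@dataclass
class FleetSnapshot:
    replicas: int
    kv_utilization: float   # fleet-average fraction of KV blocks in use, 0..1
    queue_depth: int        # requests waiting for admission, fleet total
    gpu_utilization: float  # collected, but deliberately not used for scaling


def desired_replicas(
    snap: FleetSnapshot,
    target_kv_util: float = 0.75,
    max_queue_per_replica: int = 4,
) -> int:
    """Pick a replica count from KV pressure and queueing, whichever asks for more."""
    # Enough replicas to bring average KV utilization back to the target.
    by_kv = snap.replicas * snap.kv_utilization / target_kv_util
    # Any queue at all means the current fleet is short; add capacity to drain it.
    by_queue = 0.0
    if snap.queue_depth > 0:
        by_queue = snap.replicas + snap.queue_depth / max_queue_per_replica
    return max(1, math.ceil(max(by_kv, by_queue)))


busy = FleetSnapshot(replicas=100, kv_utilization=0.90, queue_depth=80, gpu_utilization=0.45)
quiet = FleetSnapshot(replicas=100, kv_utilization=0.30, queue_depth=0, gpu_utilization=0.10)
print(desired_replicas(busy), desired_replicas(quiet))
```

Note that the `busy` snapshot shows only 45% GPU utilization, the case where a GPU-util rule would see no reason to scale.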

The catch is cold-start. Bringing a 70B replica up means loading 140 GB of weights (35 GB/GPU across TP-4) from storage — minutes, not seconds. If you scale reactively you're always minutes behind a rising ramp. So keep a warm pool sized to cover the cold-start window of the steepest expected ramp, and pre-warm ahead of the predictable morning rise rather than chasing it.
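A rough way to size the warm pool is to ask how much demand can arrive during one cold start. In the sketch below the storage bandwidth, ramp rate, cold-start time, and per-replica capacity are all hypothetical placeholders; only the 35 GB/GPU weight share comes from the text.

```python
import math


def weight_load_minutes(weights_gb_per_gpu: float, storage_gb_per_s: float) -> float:
    """Time to stream one GPU's weight shard from storage (ignores process startup)."""
    return weights_gb_per_gpu / storage_gb_per_s / 60


def warm_pool_replicas(
    ramp_req_per_s_per_min: float,
    cold_start_min: float,
    replica_capacity_req_per_s: float,
) -> int:
    """Replicas needed to absorb the demand that arrives during one cold start."""
    demand_growth = ramp_req_per_s_per_min * cold_start_min
    return math.ceil(demand_growth / replica_capacity_req_per_s)


# 35 GB per GPU (TP-4) at an assumed 0.25 GB/s of effective storage bandwidth.
print(f"{weight_load_minutes(35, 0.25):.1f} min just to load weights")

# Assumed steepest ramp: +300 req/s every minute, 5 min cold start, 20 req/s per replica.
print(warm_pool_replicas(300, 5, 20), "warm replicas")
```

Pre-warming ahead of the predictable morning rise shrinks the ramp this pool has to cover, which is why the two measures are used together.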

Binding constraint, stage 3
Cost, set by how tightly you track the swing. Once KV is shrunk (stage 2), the bill is no longer "can we fit?" but "how much of peak do we pay for around the clock?" Tight tracking on KV-util + a warm pool gets you toward paying for ~average + spike headroom instead of peak 24/7 — close to a 2–3× cost reduction off the naive peak-provisioned design. The wall moved from bytes to dollars.

Stage 4 · Multi-tenancy & QoS (lesson 11)

Free and paid share a fleet's economics but not its priority. Under a spike that outruns even the warm pool, something must give, and graceful degradation (lesson 11) decides what:

Interactive · sessions per GPU & the monthly bill

The whole case in four KPIs. Shrink the KV (fp8, smaller model) and watch sessions/GPU rise and the fleet shrink; flip autoscale and watch the bill track the swing instead of peak. The note tells you which wall is binding.


KV/token anchored at 320 KB/tok for 70B (GQA-8, fp16), scaled linearly in N; ×0.5 for fp8. Weights (2N GB) are spread across the GPUs they need (TP, ~60 GB usable each), leaving usable KV/GPU ≈ 80 − 2N/TP − 4 after the per-GPU weight share and overhead. $3/GPU-hr · 730 hr/mo. Autoscale-on bills ~40% of a peak-provisioned-24/7 fleet (avg + spike headroom across the 10× swing); off bills the full peak. Order-of-magnitude per the series' ±30% contract.

KPIs: KV / session · sessions / GPU · GPUs needed · $ / month
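The model in the note above can be approximated offline. This sketch follows the stated formulas (320 KB/token anchor scaled linearly in N, ×0.5 for fp8, usable KV/GPU ≈ 80 − 2N/TP − 4 GB, $3/GPU-hr, 730 hr/mo, autoscale billing ~40% of peak). Two inputs are assumptions not given by the note: a 4,000-token context per session and 950,000 concurrent sessions, taken from stages 1 and 0 as a worst case. The TP degree is passed in explicitly.

```python
import math

KV_ANCHOR_KB = 320.0           # 70B, GQA-8, fp16
GPU_MEM_GB = 80.0
OVERHEAD_GB = 4.0
DOLLARS_PER_GPU_HOUR = 3.0
HOURS_PER_MONTH = 730
AUTOSCALE_BILL_FRACTION = 0.40


def kpis(
    params_b: float,
    tp: int,
    fp8_kv: bool,
    autoscale: bool,
    context_tokens: int = 4000,           # assumption: turn-6 history from stage 1
    concurrent_sessions: int = 950_000,   # assumption: in-flight turns from stage 0
) -> dict:
    """Return the four KPIs for one configuration (order-of-magnitude only)."""
    kv_kb_per_token = KV_ANCHOR_KB * params_b / 70.0 * (0.5 if fp8_kv else 1.0)
    kv_gb_per_session = kv_kb_per_token * context_tokens / 1e6
    usable_kv_gb = GPU_MEM_GB - 2.0 * params_b / tp - OVERHEAD_GB
    sessions_per_gpu = math.floor(usable_kv_gb / kv_gb_per_session)
    if sessions_per_gpu < 1:
        raise ValueError("no room for KV: raise tp or shrink the model")
    gpus = math.ceil(concurrent_sessions / sessions_per_gpu)
    bill_fraction = AUTOSCALE_BILL_FRACTION if autoscale else 1.0
    monthly = gpus * DOLLARS_PER_GPU_HOUR * HOURS_PER_MONTH * bill_fraction
    return {
        "kv_gb_per_session": kv_gb_per_session,
        "sessions_per_gpu": sessions_per_gpu,
        "gpus": gpus,
        "dollars_per_month": monthly,
    }


for label, cfg in {
    "70B fp16, peak 24/7": dict(params_b=70, tp=4, fp8_kv=False, autoscale=False),
    "70B fp8, autoscale": dict(params_b=70, tp=4, fp8_kv=True, autoscale=True),
    "7B fp8, autoscale": dict(params_b=7, tp=1, fp8_kv=True, autoscale=True),
}.items():
    print(label, kpis(**cfg))
```

Under these assumptions the 70B fp16 baseline fits 32 sessions per GPU and the 7B fp8 configuration fits 968. That ratio exceeds the ~20× from KV bytes alone because the smaller weights also free memory for KV.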

What this case teaches