Case study — a consumer chatbot at scale (ChatGPT-style)
Lesson 13's autocomplete had tiny outputs and near-static inputs, so the wall was TTFT and the fix was deleting redundant prefill. Flip every one of those properties. Here outputs are long, multi-turn history grows every turn, and traffic swings 10× between day and night at tens of thousands of req/s. The transformer is the same; the binding wall is now KV-memory × concurrency × cost, and the system that falls out is unrecognizable next to the code assistant.
Run the loop. Before reading on: what binds — compute, bandwidth, memory, or cost? And what does multi-turn history do to the answer that a single-shot chat workload (lesson 03) wouldn't?
Stage 0 · Requirements → the number that matters (lesson 03)
One turn emits ~300 tokens, so the per-turn time-in-system W is dominated by decode rather than TTFT: W ≈ 19 s at a peak arrival rate of λ = 50,000 req/s. Little's Law then turns those two numbers into concurrency.
Concurrency at peak: L = λW = 50,000 · 19 ≈ 950,000 turns in flight at once. Compare lesson 13's 900. This is a near-million-way concurrency problem, and every one of those in-flight turns is holding a KV cache. Output length × request rate, not the model, just produced the headline number — and unlike the capstone's reasoning model, the length isn't even pathological; it's plain chat at consumer scale.
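A minimal sketch of the Stage 0 arithmetic. The arrival rate and the 19 s time-in-system come from the text; the split of W into a 1 s TTFT and 60 ms per output token is a hypothetical decomposition chosen only to show why decode dominates, not a measured figure.

```python
def turn_time_s(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """Time-in-system for one turn: prefill latency plus per-token decode time."""
    return ttft_s + output_tokens * tpot_s


def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: mean number in the system, L = lambda * W."""
    return arrival_rate_per_s * time_in_system_s


# Hypothetical split of W: 1 s TTFT + 300 tokens at 60 ms each (assumed values).
w = turn_time_s(ttft_s=1.0, output_tokens=300, tpot_s=0.06)
decode_share = (w - 1.0) / w

# Figures from the text: 50,000 req/s at peak, W of about 19 s.
peak_turns = in_flight(50_000, 19)

print(f"W = {w:.1f} s, decode share = {decode_share:.0%}")
print(f"turns in flight at peak = {peak_turns:,.0f}")
print(f"ratio to lesson 13's 900 = {peak_turns / 900:,.0f}x")
```

Under that assumed split, decode accounts for well over 90% of W, which is why shaving TTFT barely moves the concurrency number.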
Stage 1 · Multi-turn history is the crux (lessons 02, 06)
The thing the brief adds over a single-shot chat is that a session is a sequence of turns sharing a growing history. By turn 6 the model attends over ~3–4K tokens of conversation. For a 70B model at 320 KB/token (lesson 02), a 4K-token history is roughly 4,000 × 320 KB ≈ 1.3 GB of KV cache per session.
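A minimal sketch of the per-session KV arithmetic. The model shape (80 layers, 8 KV heads, head dimension 128, fp16) is an assumption chosen because it reproduces the 320 KB/token figure; the text itself only gives the per-token total.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV-cache bytes stored per token: one K and one V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed 70B shape with GQA-8 and fp16 KV (2 bytes per element).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

history_tokens = 4096
history_bytes = per_token * history_tokens

print(f"KV per token: {per_token / 1024:.0f} KiB")
print(f"KV for a {history_tokens}-token history: {history_bytes / 1e9:.2f} GB")
```

Every resident session carries that much HBM for as long as its history stays cached, which is what makes the next decision expensive either way.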
Now you face a fork between turns of one session, and both branches look unaffordable:
- (a) Keep the history KV resident between a user's turns. Memory cost: 1.3 GB × hundreds of thousands of live sessions = hundreds of terabytes of HBM. An H100 has 80 GB, so that is thousands of GPUs holding history for sessions that are mostly waiting on the user to type. No realistic budget covers it.
- (b) Re-prefill the whole history at the start of every turn. Compute cost: turn 6 re-prefills the ~3.5K tokens accumulated over the previous 5 turns, all of which were already processed once, and it blows TTFT. Prefill of 4K tokens on a 70B at ~40% MFU on a TP-4 replica runs at 4·990e12·0.4 / (2·70e9) ≈ 11,300 tokens/s, so it takes 4000 / 11,300 ≈ 0.35 s ≈ 350 ms, paid every turn and growing with the conversation.
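Both branches of the fork can be priced with a few lines. This is a sketch under the text's own figures (1.3 GB per session, 80 GB per H100, 990 TFLOP/s peak, 40% MFU, TP-4); the 300,000 live sessions is a hypothetical stand-in for "hundreds of thousands".

```python
def resident_hbm_tb(kv_gb_per_session: float, live_sessions: int) -> float:
    """Branch (a): HBM needed to keep every live session's history resident."""
    return kv_gb_per_session * live_sessions / 1000


def prefill_seconds(tokens: int, params: float, n_gpus: int,
                    peak_flops_per_gpu: float, mfu: float) -> float:
    """Branch (b): prefill time, using ~2 FLOPs per parameter per token."""
    tokens_per_s = n_gpus * peak_flops_per_gpu * mfu / (2 * params)
    return tokens / tokens_per_s


# Branch (a), with an assumed 300,000 live sessions.
hbm_tb = resident_hbm_tb(1.3, 300_000)
h100s_for_kv = hbm_tb * 1000 / 80

# Branch (b), with the figures from the text.
ttft_s = prefill_seconds(4000, params=70e9, n_gpus=4, peak_flops_per_gpu=990e12, mfu=0.4)

print(f"(a) resident KV: {hbm_tb:.0f} TB, about {h100s_for_kv:,.0f} H100s of HBM")
print(f"(b) re-prefill of 4K tokens: {ttft_s * 1000:.0f} ms per turn")
```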
Prefix caching (lesson 06, SGLang 04) resolves the fork: the conversation history is a shared prefix across the turns of a session. Cache the history KV; on the next turn, prefill only the new user message (tens of tokens), reusing the cached history. You pay the memory of (a) only while the session is hot, and you pay the compute of (b) only for the genuinely new tokens.
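A toy model of that behavior, tracking token counts only rather than real KV blocks. The class name, the LRU policy, and the token-denominated capacity are all illustrative assumptions; production engines manage the cache at block granularity with their own eviction logic.

```python
from collections import OrderedDict


class SessionPrefixCache:
    """Toy per-session prefix cache with LRU eviction (counts tokens, holds no KV)."""

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self._cached = OrderedDict()  # session_id -> tokens of history cached
        self._used = 0

    def start_turn(self, session_id: str, history_tokens: int, new_tokens: int) -> int:
        """Return how many tokens must be prefilled for this turn."""
        cached = self._cached.get(session_id, 0)
        if cached:
            self._cached.move_to_end(session_id)  # mark as most recently used
        reused = min(cached, history_tokens)
        return history_tokens - reused + new_tokens

    def end_turn(self, session_id: str, context_tokens: int) -> None:
        """Store KV for the full context (history + new message + reply)."""
        self._used -= self._cached.pop(session_id, 0)
        # Evict least recently used sessions until the new context fits.
        while self._cached and self._used + context_tokens > self.capacity_tokens:
            _, evicted = self._cached.popitem(last=False)
            self._used -= evicted
        if context_tokens <= self.capacity_tokens:
            self._cached[session_id] = context_tokens
            self._used += context_tokens


cache = SessionPrefixCache(capacity_tokens=8000)

cold = cache.start_turn("s1", history_tokens=3500, new_tokens=40)  # nothing cached yet
cache.end_turn("s1", context_tokens=3500 + 40 + 300)

warm = cache.start_turn("s1", history_tokens=3840, new_tokens=40)  # history is a cache hit

cache.end_turn("s2", context_tokens=6000)  # another session pushes s1 out
evicted = cache.start_turn("s1", history_tokens=4180, new_tokens=40)

print(cold, warm, evicted)
```

The hot turn prefills only the new message; the evicted session falls back to branch (b) and re-prefills its whole history once, then is hot again.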
Stage 2 · Fit more sessions per GPU (lessons 02, 04, 06)
With caching deciding which sessions occupy memory, the next lever is making each session cheaper in bytes so more fit per GPU. Every byte saved per token multiplies across the resident-session count:
- GQA + fp8 KV cache. GQA already set n_kv_heads low (the 320 KB/token figure assumes GQA-8). Storing K/V at fp8 instead of fp16 halves bytes/token again → 160 KB/token → ~double the session density for the same HBM. Validate quality on the regression suite (lesson 10) before trusting fp8 KV.
- Tier by model. The free tier doesn't need the 70B. A smaller model (say 7–13B) has far smaller KV/token (KV bytes scale linearly with layers · KV heads · head dimension, so a 7B with fewer layers and few KV heads can be roughly an order of magnitude cheaper per token) and smaller weights, freeing HBM for more sessions. Quality/cost trade: cheap model for free users, big model for paid. Route tiers to different replica pools (lesson 05) and never co-mingle them, or the expensive model's KV starves the cheap one's session count.
- Continuous batching (lesson 04). Turns are wildly length-skewed (a one-line reply vs a 300-token essay). Continuous batching backfills finished slots immediately so decode stays full across the skew, instead of a static batch idling on its longest member.
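The continuous-batching point can be seen in a small step-count simulation. It is a sketch under simplifying assumptions: all requests are queued at time zero, each slot emits one token per step, and prefill is ignored.

```python
import heapq


def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest member finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled at once by the next request."""
    slots = [0] * batch_size  # step at which each slot next becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)       # earliest slot to free up
        heapq.heappush(slots, free_at + n)   # occupy it for n decode steps
    return max(slots)


# Length-skewed turns: a few 300-token essays among one-line replies.
lengths = [300, 5, 5, 5, 300, 5, 5, 5]

static_steps = static_batch_steps(lengths, batch_size=4)
continuous_steps = continuous_batch_steps(lengths, batch_size=4)

print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With this workload the static schedule spends most of its steps with three of four slots empty, while the continuous schedule overlaps the two long replies.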
The compounding is the point: fp8 (×2) on top of a smaller free-tier model (×~10 in KV) is a ~20× swing in sessions/GPU, and sessions/GPU is the denominator of the entire bill.
| Configuration | KV bytes/token | Sessions/GPU vs. baseline |
|---|---|---|
| 70B, fp16 KV | 320 KB | baseline |
| 70B, fp8 KV | 160 KB | ~2× |
| 7B free tier, fp8 KV | ~16 KB | ~20× (and weights drop 140→14 GB) |
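A rough density estimate for the three rows, as a sketch under stated assumptions: 80 GB per GPU, the 70B on a TP-4 replica and the 7B on a single GPU, a 4,000-token resident history per session, and all HBM left after weights given to KV (activations and fragmentation are ignored).

```python
def sessions_per_gpu(n_gpus, hbm_gb_per_gpu, weights_gb,
                     kv_bytes_per_token, tokens_per_session):
    """Resident sessions per GPU if all HBM left after weights holds KV cache."""
    kv_budget = (n_gpus * hbm_gb_per_gpu - weights_gb) * 10**9
    per_session = kv_bytes_per_token * tokens_per_session
    return kv_budget // per_session / n_gpus


HISTORY = 4_000  # tokens resident per session (assumed)

configs = {
    "70B, fp16 KV": sessions_per_gpu(4, 80, 140, 320_000, HISTORY),
    "70B, fp8 KV": sessions_per_gpu(4, 80, 140, 160_000, HISTORY),
    "7B free tier, fp8 KV": sessions_per_gpu(1, 80, 14, 16_000, HISTORY),
}

baseline = configs["70B, fp16 KV"]
for name, density in configs.items():
    print(f"{name:22s} {density:8.2f} sessions/GPU  ({density / baseline:.1f}x)")
```

The table's multipliers follow from bytes/token alone; under these assumptions the memory freed by the smaller weights pushes the free-tier figure higher still.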
Stage 3 · The 10× diurnal swing (lessons 05, 11)
Peak is 50k req/s; the trough is ~5k. Provision for peak 24/7 and you pay for capacity that sits idle most of the day: at the trough, roughly 90% of a peak-sized fleet has nothing to do.
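To put a number on the idle capacity, here is a sketch that assumes a sinusoidal daily load between the stated trough and peak, peaking at 14:00. The curve shape and peak hour are assumptions; real consumer traffic is usually lumpier.

```python
import math

PEAK, TROUGH = 50_000, 5_000  # req/s, from the text


def hourly_load(hour: int) -> float:
    """Assumed diurnal curve: a cosine between TROUGH and PEAK, peaking at 14:00."""
    mid = (PEAK + TROUGH) / 2
    amplitude = (PEAK - TROUGH) / 2
    return mid + amplitude * math.cos(2 * math.pi * (hour - 14) / 24)


loads = [hourly_load(h) for h in range(24)]
mean_load = sum(loads) / len(loads)

# Fraction of a peak-sized fleet's GPU-hours that serve no traffic.
idle_fraction = 1 - mean_load / PEAK

print(f"mean load: {mean_load:,.0f} req/s")
print(f"idle share of a peak-provisioned fleet: {idle_fraction:.0%}")
```

Under this curve a fleet that tracks load perfectly needs a bit over half the GPU-hours of one sized for peak; the gap between the two is what autoscaling is trying to recover.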
Autoscaling (lesson 05) tracks the swing — but on the right signal. GPU utilization is the trap metric (lesson 11): a memory-bound decode fleet runs near-full on KV while showing modest SM utilization, so scaling on GPU-util under-provisions right as you breach SLO. Scale on queue depth and KV-utilization instead — the signals that actually predict an SLO breach.
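A sketch of a scaling decision driven by those signals. It uses the proportional rule (current replicas × observed metric / target metric, rounded up) and takes the larger of the two answers; the targets, the metric names, and the `FleetMetrics` type are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FleetMetrics:
    replicas: int
    queue_depth: int        # requests waiting for a decode slot, fleet-wide
    kv_utilization: float   # fraction of KV blocks in use, 0..1
    gpu_utilization: float  # SM utilization, shown only for comparison


def desired_replicas(m: FleetMetrics, target_kv_util: float = 0.75,
                     target_queue_per_replica: int = 4) -> int:
    """Scale on KV utilization and queue depth; take whichever asks for more."""
    by_kv = math.ceil(m.replicas * m.kv_utilization / target_kv_util)
    by_queue = math.ceil(m.queue_depth / target_queue_per_replica)
    return max(1, by_kv, by_queue)


def naive_gpu_util_replicas(m: FleetMetrics, target_gpu_util: float = 0.5) -> int:
    """The trap: the same proportional rule applied to GPU utilization."""
    return max(1, math.ceil(m.replicas * m.gpu_utilization / target_gpu_util))


# A memory-bound decode fleet: KV nearly full, SMs only moderately busy.
now = FleetMetrics(replicas=100, queue_depth=400, kv_utilization=0.9, gpu_utilization=0.45)

print("scale on KV/queue:", desired_replicas(now))
print("scale on GPU util:", naive_gpu_util_replicas(now))
```

On the same snapshot the KV-driven rule asks for more replicas while the GPU-util rule asks for fewer, which is the under-provisioning failure in miniature.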
The catch is cold-start. Bringing a 70B replica up means loading 140 GB of weights (35 GB/GPU across TP-4) from storage — minutes, not seconds. If you scale reactively you're always minutes behind a rising ramp. So keep a warm pool sized to cover the cold-start window of the steepest expected ramp, and pre-warm ahead of the predictable morning rise rather than chasing it.
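Warm-pool sizing reduces to "how much demand arrives while a cold replica is still loading". The sketch below is illustrative only: the load bandwidth, the init overhead, the ramp rate, and the per-replica throughput are hypothetical values, and only the 35 GB per GPU comes from the text.

```python
import math


def cold_start_seconds(weights_gb_per_gpu: float, load_gb_per_s: float,
                       init_overhead_s: float) -> float:
    """Time to bring a replica up: weight load (GPUs load in parallel) plus init."""
    return weights_gb_per_gpu / load_gb_per_s + init_overhead_s


def warm_pool_replicas(ramp_req_per_s_per_min: float, cold_start_s: float,
                       replica_req_per_s: float) -> int:
    """Replicas needed to absorb the demand that arrives during one cold start."""
    demand_growth = ramp_req_per_s_per_min * cold_start_s / 60
    return math.ceil(demand_growth / replica_req_per_s)


# 35 GB per GPU from the text; 0.5 GB/s per GPU and 50 s of init are assumed.
cold_s = cold_start_seconds(weights_gb_per_gpu=35, load_gb_per_s=0.5, init_overhead_s=50)

# Assumed steepest ramp of 300 req/s per minute, 10 req/s per replica.
pool = warm_pool_replicas(ramp_req_per_s_per_min=300, cold_start_s=cold_s,
                          replica_req_per_s=10)

print(f"cold start: {cold_s:.0f} s, warm pool: {pool} replicas")
```

Faster weight loading shrinks the pool linearly, so load bandwidth is a cost lever and not only a latency one.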
Stage 4 · Multi-tenancy & QoS (lesson 11)
Free and paid share a fleet's economics but not its priority. Under a spike that outruns even the warm pool, something must give, and graceful degradation (lesson 11) decides what:
- Priority by tier. Paid requests jump the queue; free requests wait. KV admission favors paid sessions when memory is tight.
- Admission control / load shedding. When queue depth crosses a threshold, shed or queue free tier first — degrade gracefully (slower, queued, or "try again") rather than letting the whole fleet's TTFT collapse for everyone. Never shed paid before free.
- Per-tenant KV budgets. Cap resident history per free session (e.g. evict aggressively, summarize old turns) so one tier can't starve the other's session count.
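The first two policies above can be sketched as a small priority queue with a shed threshold. The class and its threshold are hypothetical, and the linear scan used to evict a queued free request is fine for a sketch but not for a production scheduler.

```python
import heapq
import itertools

PAID, FREE = 0, 1  # lower value = higher priority


class TieredQueue:
    """Toy admission queue: paid jumps the line, free is shed first under overload."""

    def __init__(self, shed_threshold: int):
        self.shed_threshold = shed_threshold
        self.shed = []                 # ids of requests dropped from the queue
        self._heap = []                # entries are (tier, arrival_seq, request_id)
        self._seq = itertools.count()

    def submit(self, request_id: str, tier: int) -> bool:
        """Queue the request; return False if it is shed on arrival."""
        if len(self._heap) >= self.shed_threshold:
            if tier == FREE:
                return False           # overloaded: free arrivals get "try again"
            worst = max(self._heap)    # lowest priority, most recent arrival
            if worst[0] == FREE:       # make room by dropping a queued free request
                self._heap.remove(worst)
                heapq.heapify(self._heap)
                self.shed.append(worst[2])
        heapq.heappush(self._heap, (tier, next(self._seq), request_id))
        return True

    def next_request(self):
        """Pop the highest-priority, oldest request, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


q = TieredQueue(shed_threshold=2)
results = [
    q.submit("f1", FREE),   # admitted
    q.submit("p1", PAID),   # admitted
    q.submit("f2", FREE),   # queue full: shed on arrival
    q.submit("p2", PAID),   # queue full: displaces queued f1
]
order = [q.next_request(), q.next_request(), q.next_request()]

print(results, q.shed, order)
```

Paid requests are never rejected here, so a paid-only overload still grows the queue; a real system would pair this with the autoscaling and warm-pool measures from Stage 3.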
Interactive · sessions per GPU & the monthly bill
The whole case in four KPIs. Shrink the KV (fp8, smaller model) and watch sessions/GPU rise and the fleet shrink; flip autoscale and watch the bill track the swing instead of peak. The note tells you which wall is binding.
What this case teaches
- The workload picks the wall — again. Lesson 13 was TTFT-bound because outputs were tiny and inputs static. Here outputs are long and history grows, so the wall is KV-memory × concurrency × cost. Same transformer, opposite system.
- Multi-turn history is the new variable. A growing per-session KV turns "keep it resident vs re-prefill" into the central trade; prefix caching plus an eviction policy resolves it. Hot sessions reuse their cached history, and cold sessions are evicted and re-prefilled on return, spending the generous TTFT budget to save memory.
- Sessions/GPU is the denominator of the bill. fp8 KV, GQA, and a smaller free-tier model compound into an order-of-magnitude swing in density — and density, not FLOPs, sets cost.
- The diurnal swing makes cost a design parameter. Autoscale on KV-util/queue depth (not GPU-util, lesson 11) with a warm pool for the 140 GB cold-start, and you pay for average-plus-spike instead of peak 24/7.
- Multi-tenancy is a graceful-degradation problem. Shed free before paid; per-tenant KV budgets keep one tier from starving the other.