Case study — a consumer chatbot at scale (ChatGPT-style)

Lesson 13's autocomplete had tiny outputs and near-static inputs, so the wall was TTFT and the fix was deleting redundant prefill. Flip every one of those properties. Here outputs are long, multi-turn history grows every turn, and traffic swings 10× between day and night at tens of thousands of req/s. The transformer is the same; the binding wall is now KV-memory × concurrency × cost, and the system that falls out is unrecognizable next to the code assistant.

The brief

A consumer chat assistant. Sessions: multi-turn, average ~6 turns. Output: ~300 tokens/turn. Context grows each turn — by turn 6 the history is ~3–4K tokens. Peak: ~50,000 req/s with a 10× day/night swing. SLO: p95 TTFT < 1 s, TPOT < 60 ms. Tiers: a free tier and a paid tier. Budget: it has to be cheap — most users pay nothing.

Run the loop. Before reading on: what binds — compute, bandwidth, memory, or cost? And what does multi-turn history do to the answer that a single-shot chat workload (lesson 03) wouldn't?

Stage 0 · Requirements → the number that matters (lesson 03)

One turn emits ~300 tokens, so the per-turn time in system W (the W that Little's Law needs) is decode-dominated, not TTFT-dominated:

W ≈ TTFT + 300·TPOT ≈ 1 + 300·0.06 ≈ 19 s per turn

Concurrency at peak: L = λW = 50,000 · 19 ≈ 950,000 turns in flight at once. Compare lesson 13's 900. This is a near-million-way concurrency problem, and every one of those in-flight turns is holding a KV cache. Output length × request rate, not the model, just produced the headline number — and unlike the capstone's reasoning model, the length isn't even pathological; it's plain chat at consumer scale.
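The two stage-0 formulas can be checked in a few lines. This is a minimal sketch that uses only the brief's numbers; the function names are illustrative and not taken from any library.

```python
def time_in_system(ttft_s: float, out_tokens: int, tpot_s: float) -> float:
    """Per-turn time in system: W = TTFT + out_tokens * TPOT."""
    return ttft_s + out_tokens * tpot_s


def concurrency(arrival_rate_rps: float, w_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_rps * w_s


# Brief: TTFT budget 1 s, ~300 output tokens, TPOT budget 60 ms, peak 50,000 req/s.
w = time_in_system(ttft_s=1.0, out_tokens=300, tpot_s=0.06)
turns_in_flight = concurrency(50_000, w)
print(f"W = {w:.0f} s, L = {turns_in_flight:,.0f} turns in flight")
```

Because W is dominated by the 300·TPOT term, halving TTFT barely moves L; shortening outputs or speeding up decode does.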

Binding constraint, stage 0

Concurrency, and therefore KV memory. 950k concurrent turns each carrying a KV cache is a bytes problem before it is anything else. TTFT has a whole second of budget here — generous — so we will spend that budget to save memory, the exact opposite trade from lesson 13.

Stage 1 · Multi-turn history is the crux (lessons 02, 06)

The thing the brief adds over a single-shot chat is that a session is a sequence of turns sharing a growing history. By turn 6 the model attends over ~3–4K tokens of conversation. For a 70B model at 320 KB/token (lesson 02), a 4K-token history is:

4000 · 320 KB ≈ 1.3 GB of KV — per active session
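For readers who want to see where a figure like 320 KB/token can come from, here is a hedged sketch of the usual per-token KV formula. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption for a 70B-class GQA-8 model; the lesson states only the 320 KB/token result.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed 70B-class shape (not stated in the lesson): 80 layers, 8 KV heads, head_dim 128.
per_token = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)  # fp16 = 2 bytes
history_tokens = 4000
per_session_gb = per_token * history_tokens / 1e9

print(f"{per_token / 1024:.0f} KiB/token, {per_session_gb:.2f} GB per 4K-token session")
```

Under that assumed shape the result is 327,680 bytes per token, which is the lesson's 320 KB read as KiB, and a 4K-token history lands at about 1.3 GB.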

Now you face a fork between turns of one session, and both branches look unaffordable:

(a) Keep the history KV resident between a user's turns. Memory cost: 1.3 GB × hundreds of thousands of live sessions = hundreds of terabytes of HBM. An H100 has 80 GB, so that is thousands of GPUs holding nothing but idle history: unaffordable at this budget.
(b) Re-prefill the whole history at the start of every turn. Compute cost: by turn 6 the session has re-prefilled its growing history on five consecutive turns, the last one ~3.5K tokens long, and every re-prefill lands on TTFT. Prefill of 4K tokens on a 70B at ~40% MFU on a TP-4 replica (4 GPUs at 990 TFLOP/s peak, 2 FLOPs per parameter per token) is 4000 / (4·990e12·0.4/(2·70e9)) ≈ 4000/11,300 tok/s ≈ 350 ms, paid every turn and growing with the conversation.
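The prefill estimate in branch (b) is a one-line throughput calculation. The sketch below reproduces it with the same inputs; it ignores attention FLOPs and communication overhead, so treat it as an order-of-magnitude check rather than a benchmark. The function name is illustrative.

```python
def prefill_seconds(prompt_tokens: int, n_params: float, n_gpus: int,
                    peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimated prefill time from a FLOPs budget."""
    # About 2 FLOPs per parameter per token for a forward pass.
    flops_per_token = 2 * n_params
    tokens_per_s = n_gpus * peak_flops_per_gpu * mfu / flops_per_token
    return prompt_tokens / tokens_per_s


# Same numbers as the text: 4K tokens, 70B params, TP-4, 990 TFLOP/s peak, 40% MFU.
cold_prefill_s = prefill_seconds(4000, 70e9, 4, 990e12, 0.4)
# With the history cached, only the new user message is prefilled (50 tokens assumed).
hot_prefill_s = prefill_seconds(50, 70e9, 4, 990e12, 0.4)

print(f"cold: {cold_prefill_s * 1e3:.0f} ms, hot: {hot_prefill_s * 1e3:.1f} ms")
```

The gap between the cold and hot figures is the compute that caching the history removes from every turn.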

Prefix caching (lesson 06, SGLang 04) resolves the fork: the conversation history is a shared prefix across the turns of a session. Cache the history KV; on the next turn, prefill only the new user message (tens of tokens), reusing the cached history. You pay the memory of (a) only while the session is hot, and you pay the compute of (b) only for the genuinely new tokens.

Binding constraint, stage 1

KV memory vs prefill-compute, on the growing history. Keeping every session resident is impossible; re-prefilling every turn is slow and wasteful. Prefix caching collapses the trade — but it cannot keep all sessions hot, so you need an eviction policy: evict idle sessions' KV, and re-prefill on their next turn. A resumed session eats one cold prefill (~350 ms, still inside the 1 s TTFT budget). That deliberate TTFT hit on resume is what buys back the memory. You are spending stage 0's latency slack exactly here.
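The caching-plus-eviction policy described above can be sketched as a toy LRU keyed by session. This is an illustration only: the class and method names are hypothetical, it counts tokens rather than managing real KV blocks, and it is not how SGLang or any other engine implements prefix caching.

```python
from collections import OrderedDict


class SessionKVCache:
    """Toy LRU over per-session history KV, sized in tokens (hypothetical)."""

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self.cached = OrderedDict()  # session_id -> cached history length in tokens
        self.used_tokens = 0

    def tokens_to_prefill(self, session_id: str, history_tokens: int,
                          new_tokens: int) -> int:
        """Return how many tokens this turn must prefill, then cache the result."""
        hit = self.cached.pop(session_id, None)
        if hit is not None:
            self.used_tokens -= hit
            to_prefill = new_tokens                   # hot: only the new message
        else:
            to_prefill = history_tokens + new_tokens  # cold: re-prefill everything
        self._insert(session_id, history_tokens + new_tokens)
        return to_prefill

    def _insert(self, session_id: str, total_tokens: int) -> None:
        # Evict least-recently-used sessions until the new entry fits.
        while self.cached and self.used_tokens + total_tokens > self.capacity_tokens:
            _, evicted = self.cached.popitem(last=False)
            self.used_tokens -= evicted
        if total_tokens <= self.capacity_tokens:
            self.cached[session_id] = total_tokens
            self.used_tokens += total_tokens


cache = SessionKVCache(capacity_tokens=7200)
a1 = cache.tokens_to_prefill("a", history_tokens=0, new_tokens=50)     # first turn
a2 = cache.tokens_to_prefill("a", history_tokens=350, new_tokens=40)   # hot hit
b1 = cache.tokens_to_prefill("b", history_tokens=7000, new_tokens=60)  # evicts "a"
a3 = cache.tokens_to_prefill("a", history_tokens=690, new_tokens=30)   # cold resume
print(a1, a2, b1, a3)
```

Session "a" pays only its new tokens while hot and pays the full history once after eviction, which is the resume penalty the 1 s TTFT budget is meant to absorb.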

Stage 2 · Fit more sessions per GPU (lessons 02, 04, 06)

With caching deciding which sessions occupy memory, the next lever is making each session cheaper in bytes so more fit per GPU. Every byte saved per token multiplies across the resident-session count:

GQA + fp8 KV cache. GQA already set n_kv_heads low (the 320 KB/token figure assumes GQA-8). Storing K/V at fp8 instead of fp16 halves bytes/token again → 160 KB/token → ~double the session density for the same HBM. Validate quality on the regression suite (lesson 10) before trusting fp8 KV.
Tier by model. The free tier doesn't need the 70B. A smaller model (say 7–13B) has far smaller KV/token (KV scales ~linearly with layers × KV heads × head dim, so a 7B that also uses GQA is roughly an order of magnitude cheaper per token) and smaller weights, freeing HBM for more sessions. Quality/cost trade: cheap model for free users, big model for paid. Route tiers to different replica pools (lesson 05) and never co-mingle them, or the expensive model's KV starves the cheap one's session count.
Continuous batching (lesson 04). Turns are wildly length-skewed (a one-line reply vs a 300-token essay). Continuous batching backfills finished slots immediately so decode stays full across the skew, instead of a static batch idling on its longest member.

The compounding is the point: fp8 (×2) on top of a smaller free-tier model (×~5–10 in KV, for 13B down to 7B) is a 10–20× swing in sessions/GPU, and sessions/GPU is the denominator of the entire bill.

Configuration	KV bytes/token	Sessions/GPU vs baseline
70B, fp16 KV	320 KB	baseline
70B, fp8 KV	160 KB	~2×
7B free tier, fp8 KV	~16 KB	~20× (and weights drop 140→14 GB)
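The table's rows follow from two multipliers. The sketch below assumes, as the lesson's calculator note does, that KV/token scales linearly with parameter count N from the 320 KB anchor; real models deviate from that depending on their layer and head counts. The function name is illustrative.

```python
BASE_KB_PER_TOKEN = 320.0  # 70B, GQA-8, fp16 KV (the lesson's anchor)


def kv_kb_per_token(params_b: float, fp8: bool) -> float:
    """KV KB/token under the linear-in-N approximation (an assumption)."""
    return BASE_KB_PER_TOKEN * (params_b / 70.0) * (0.5 if fp8 else 1.0)


configs = [("70B, fp16 KV", 70, False), ("70B, fp8 KV", 70, True),
           ("7B free tier, fp8 KV", 7, True)]
for label, params_b, fp8 in configs:
    kb = kv_kb_per_token(params_b, fp8)
    density = BASE_KB_PER_TOKEN / kb  # sessions per byte of KV, relative to baseline
    print(f"{label}: {kb:.0f} KB/token, ~{density:.0f}x")
```

The density column counts only KV bytes; the smaller model's weights also free HBM, so the real gain for the free tier is somewhat larger.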

Stage 3 · The 10× diurnal swing (lessons 05, 11)

Peak is 50k req/s; the trough is ~5k. Provision for peak 24/7 and you pay for capacity that sits idle most of the day:

wasted fraction ≈ (peak − avg)/peak over the day; at the trough of a 10× swing, (50k − 5k)/50k = 90% of a peak-sized fleet sits idle

Autoscaling (lesson 05) tracks the swing — but on the right signal. GPU utilization is the trap metric (lesson 11): a memory-bound decode fleet runs near-full on KV while showing modest SM utilization, so scaling on GPU-util under-provisions right as you breach SLO. Scale on queue depth and KV-utilization instead — the signals that actually predict an SLO breach.

The catch is cold-start. Bringing a 70B replica up means loading 140 GB of weights (35 GB/GPU across TP-4) from storage — minutes, not seconds. If you scale reactively you're always minutes behind a rising ramp. So keep a warm pool sized to cover the cold-start window of the steepest expected ramp, and pre-warm ahead of the predictable morning rise rather than chasing it.
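A rough sketch of the two sizing decisions in this stage: how many replicas the scaling signals ask for, and how large a warm pool must be to cover the cold-start window. All thresholds and the example ramp are assumed values for illustration, and the function names are hypothetical rather than any autoscaler's API.

```python
import math


def desired_replicas(current: int, kv_util: float, queue_depth: int,
                     kv_target: float = 0.8, queue_per_replica: int = 4) -> int:
    """Scale on KV utilization and queue depth, whichever asks for more."""
    # Enough replicas to bring KV utilization back to the target.
    by_kv = math.ceil(current * kv_util / kv_target)
    # A non-empty queue is an explicit request for extra capacity.
    by_queue = current + math.ceil(queue_depth / queue_per_replica) if queue_depth else 0
    return max(by_kv, by_queue, 1)


def warm_pool_size(ramp_rps_per_min: float, cold_start_min: float,
                   rps_per_replica: float) -> int:
    """Replicas needed to absorb load that arrives while cold replicas load weights."""
    return math.ceil(ramp_rps_per_min * cold_start_min / rps_per_replica)


print(desired_replicas(current=100, kv_util=0.95, queue_depth=0))   # KV pressure
print(desired_replicas(current=100, kv_util=0.50, queue_depth=20))  # queue pressure
# Assumed: traffic ramps 2,000 req/s per minute, cold start 5 min, 40 req/s per replica.
print(warm_pool_size(ramp_rps_per_min=2000, cold_start_min=5, rps_per_replica=40))
```

Neither function looks at SM utilization, which is the point: the inputs are the signals the text says predict an SLO breach.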

Binding constraint, stage 3

Cost, set by how tightly you track the swing. Once KV is shrunk (stage 2), the bill is no longer "can we fit?" but "how much of peak do we pay for around the clock?" Tight tracking on KV-util + a warm pool gets you toward paying for ~average + spike headroom instead of peak 24/7 — close to a 2–3× cost reduction off the naive peak-provisioned design. The wall moved from bytes to dollars.

Stage 4 · Multi-tenancy & QoS (lesson 11)

Free and paid share a fleet's economics but not its priority. Under a spike that outruns even the warm pool, something must give, and graceful degradation (lesson 11) decides what:

Priority by tier. Paid requests jump the queue; free requests wait. KV admission favors paid sessions when memory is tight.
Admission control / load shedding. When queue depth crosses a threshold, shed or queue free tier first — degrade gracefully (slower, queued, or "try again") rather than letting the whole fleet's TTFT collapse for everyone. Never shed paid before free.
Per-tenant KV budgets. Cap resident history per free session (e.g. evict aggressively, summarize old turns) so one tier can't starve the other's session count.
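The first two policies above, priority by tier and shedding free before paid, fit in a small priority queue. This is a minimal sketch with hypothetical names; it never sheds paid requests and leaves out per-tenant KV budgets, which a production admission controller would also need to handle.

```python
import heapq
import itertools

PAID, FREE = 0, 1  # lower value = higher priority


class TieredQueue:
    """Toy admission queue: paid jumps ahead, free is shed first under load."""

    def __init__(self, shed_threshold: int):
        self.shed_threshold = shed_threshold
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a tier

    def submit(self, request_id: str, tier: int) -> bool:
        """Admit a request; shed free-tier arrivals once the queue is too deep."""
        if tier == FREE and len(self._heap) >= self.shed_threshold:
            return False  # caller replies "try again" instead of queueing
        heapq.heappush(self._heap, (tier, next(self._seq), request_id))
        return True

    def next_request(self):
        """Pop the highest-priority request: paid first, then oldest."""
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id


q = TieredQueue(shed_threshold=2)
admitted = [q.submit("free-1", FREE), q.submit("free-2", FREE),
            q.submit("free-3", FREE), q.submit("paid-1", PAID)]
order = [q.next_request() for _ in range(3)]
print(admitted, order)
```

The third free request is refused at the door while the paid request is admitted past the same threshold and served first.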

Interactive · sessions per GPU & the monthly bill

The whole case in four KPIs. Shrink the KV (fp8, smaller model) and watch sessions/GPU rise and the fleet shrink; flip autoscale and watch the bill track the swing instead of peak. The note tells you which wall is binding.


KV/token anchored at 320 KB/tok for 70B (GQA-8, fp16), scaled linearly in N; ×0.5 for fp8. Weights (2N GB) are spread across the GPUs they need (TP, ~60 GB usable each), leaving usable KV/GPU ≈ 80 − 2N/TP − 4 after the per-GPU weight share and overhead. $3/GPU-hr · 730 hr/mo. Autoscale-on bills ~40% of a peak-provisioned-24/7 fleet (avg + spike headroom across the 10× swing); off bills the full peak. Order-of-magnitude per the series' ±30% contract.

Interactive calculator (not reproduced here). Default inputs: model params N (B) = 70; KV dtype (1=fp8, 2=fp16) = 2; avg history tokens = 3500; concurrent sessions = 500000; autoscale (0/1) = 1. Outputs: KV / session, sessions / GPU, GPUs needed, $ / month.
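The four KPIs can be approximated offline from the note above. The sketch below is a reconstruction from that note, not the page's own widget code: the rule for choosing TP (smallest power of two that keeps the weight share at or under 60 GB per GPU) is an assumption, and the GPU count is not rounded up to whole TP groups.

```python
import math


def fleet_kpis(params_b: float, kv_dtype: int, history_tokens: int,
               sessions: int, autoscale: int) -> dict:
    """Approximate KV/session, sessions/GPU, GPUs needed, and $/month."""
    # 320 KB/token anchor for 70B fp16, linear in N, halved for fp8 (kv_dtype == 1).
    kv_kb_per_token = 320.0 * (params_b / 70.0) * (0.5 if kv_dtype == 1 else 1.0)
    kv_gb_per_session = kv_kb_per_token * history_tokens / 1e6  # KB -> GB
    weights_gb = 2.0 * params_b  # fp16 weights
    # Assumption: TP is the smallest power of two with <= 60 GB of weights per GPU.
    tp = 1
    while weights_gb / tp > 60.0:
        tp *= 2
    kv_gb_per_gpu = 80.0 - weights_gb / tp - 4.0  # 80 GB HBM minus weights and overhead
    sessions_per_gpu = kv_gb_per_gpu / kv_gb_per_session
    gpus = math.ceil(sessions / sessions_per_gpu)
    # $3/GPU-hr, 730 hr/mo; autoscale bills ~40% of the peak-provisioned fleet.
    monthly_usd = gpus * 3.0 * 730 * (0.4 if autoscale else 1.0)
    return {"tp": tp, "kv_gb_per_session": kv_gb_per_session,
            "sessions_per_gpu": sessions_per_gpu, "gpus": gpus,
            "monthly_usd": monthly_usd}


defaults = fleet_kpis(70, 2, 3500, 500_000, 1)
free_tier = fleet_kpis(7, 1, 3500, 500_000, 1)
for name, k in [("70B fp16", defaults), ("7B fp8", free_tier)]:
    print(f"{name}: {k['kv_gb_per_session']:.3f} GB/session, "
          f"{k['sessions_per_gpu']:.0f} sessions/GPU, {k['gpus']:,} GPUs, "
          f"${k['monthly_usd']:,.0f}/mo")
```

Comparing the two rows shows the stage 2 argument in dollar terms: the same session count on the smaller fp8 configuration needs a small fraction of the GPUs.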

What this case teaches

The workload picks the wall — again. Lesson 13 was TTFT-bound because outputs were tiny and inputs static. Here outputs are long and history grows, so the wall is KV-memory × concurrency × cost. Same transformer, opposite system.
Multi-turn history is the new variable. A growing per-session KV turns "keep it resident vs re-prefill" into the central trade; prefix caching plus an eviction policy resolves it, spending the generous TTFT budget to save memory.
Sessions/GPU is the denominator of the bill. fp8 KV, GQA, and a smaller free-tier model compound into an order-of-magnitude swing in density — and density, not FLOPs, sets cost.
The diurnal swing makes cost a design parameter. Autoscale on KV-util/queue depth (not GPU-util, lesson 11) with a warm pool for the 140 GB cold-start, and you pay for average-plus-spike instead of peak 24/7.
Multi-tenancy is a graceful-degradation problem. Shed free before paid; per-tenant KV budgets keep one tier from starving the other.