Production — reliability, observability, and cost
Lessons 04–10 built a system that works on a good day: it fits in HBM, hits its SLOs, scales out, and passes the eval gate. Production is the other days — the bad one, at scale, on a budget. Three pillars carry a system across them. Observability lets you see the bad day coming; reliability lets you survive it; cost lets you afford the good day — and it is the lever that decides whether success kills you. This lesson quantifies all three in dollars, availability percentages, and minutes.
1 · Observability — see the bad day coming
You cannot fix what you cannot see, and most dashboards measure the wrong thing. The distinction that matters is leading vs lagging: a lagging metric (error rate, dropped requests) tells you the incident already happened; a leading metric tells you it is about to. For LLM serving the leading indicators all come from constraints you already met in earlier lessons.
| Metric | What it predicts | Alert threshold |
|---|---|---|
| Queue depth (requests waiting) | under-provisioning: by Little's Law (lesson 03) the work in flight is L = λ·W, so once λ·W exceeds the concurrency the fleet can hold, the excess waits and the queue grows without bound; this is the autoscale signal (lesson 05) | rising trend, or > ~1× in-flight count |
| KV-cache utilization | imminent preemption / OOM — the KV cache is the batch-size ceiling (lesson 04); near 100% means the next request gets swapped or rejected | p95 > ~85% |
| TTFT / TPOT p99 | SLO breach in progress (lesson 03); the tail moves before the median does | p99 > SLO budget |
| Preemption / swap rate | memory pressure already biting — requests are being evicted and re-run, inflating cost and latency | > 0 sustained |
| Empty-output / token error rate | a broken model or decode path — often the first sign of a bad deploy | > baseline + noise |
| Quality drift (online eval score) | silent regression — the model is up and fast but getting worse (feeds lesson 10's eval) | score below gate |
The discipline: a metric earns a place on the on-call dashboard only if it changes before users feel pain. Throughput, total request count, and GPU utilization are vanity or lagging — useful for capacity reviews, not for catching an incident. Queue depth and KV-util are the smoke alarm.
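The thresholds in the table can be sketched as a single check over a metrics snapshot. This is a minimal illustration, not a real monitoring API: the `Snapshot` fields and the `leading_alerts` function are hypothetical names. It also tests one instant, whereas a production rule would require the condition to hold for a sustained window.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One scrape of the leading indicators (hypothetical field names)."""
    queue_depth: int            # requests waiting
    in_flight: int              # requests currently being served
    kv_util_p95: float          # KV-cache utilization, 0.0 to 1.0
    ttft_p99_ms: float          # time to first token, p99
    preemptions_per_min: float  # evictions / swaps per minute


def leading_alerts(s: Snapshot, ttft_slo_ms: float) -> list[str]:
    """Return the names of the leading indicators past their threshold."""
    alerts = []
    if s.queue_depth > s.in_flight:      # > ~1x in-flight count
        alerts.append("queue_depth")
    if s.kv_util_p95 > 0.85:             # p95 > ~85%
        alerts.append("kv_cache_utilization")
    if s.ttft_p99_ms > ttft_slo_ms:      # p99 > SLO budget
        alerts.append("ttft_p99")
    if s.preemptions_per_min > 0:        # > 0 (should be "sustained")
        alerts.append("preemption_rate")
    return alerts


snap = Snapshot(queue_depth=40, in_flight=32, kv_util_p95=0.91,
                ttft_p99_ms=400.0, preemptions_per_min=0.0)
print(leading_alerts(snap, ttft_slo_ms=500.0))
```

In this example the queue and the KV cache fire while TTFT p99 is still inside its budget, which is the point of a leading indicator.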
2 · Deploy & rollback for stateful GPU services
A stateless web server deploys in seconds: pull the new image, restart, done. A GPU serving replica cannot. The weights are huge (about 140 GB for a 70B-parameter model at 16-bit precision, lesson 02), and loading them into HBM from storage at a few GB/s takes minutes, the same cold-start lag that made reactive autoscaling impossible in lesson 05. You cannot hot-swap a model. That single fact shapes every deploy strategy.
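The cold-start arithmetic is just size divided by bandwidth. The sketch below assumes a single sequential read and ignores deserialization, sharding, and container start, so it gives a lower bound. The bandwidth values are illustrative, not measurements.

```python
def weight_load_seconds(weights_gb: float, read_gb_per_s: float) -> float:
    """Lower bound on the time to stream weights from storage into HBM."""
    return weights_gb / read_gb_per_s


# 140 GB of weights at a few assumed storage bandwidths.
for bandwidth in (1.0, 2.0, 5.0):
    minutes = weight_load_seconds(140, bandwidth) / 60
    print(f"{bandwidth:.0f} GB/s -> {minutes:.1f} min")
```

Even the optimistic end of that range is far slower than restarting a stateless process, before any warmup is counted.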
| Strategy | How | Cost / risk |
|---|---|---|
| Blue/green | stand up the full new fleet (green), shift traffic over, keep the old fleet (blue) warm for instant rollback | 2× capacity during the cutover; fastest rollback (flip traffic back) |
| Rolling | replace replicas a few at a time | cheap (no double fleet) but capacity dips during the roll — size for it or you breach SLO mid-deploy |
| Canary | route ~1% of traffic to the new version, watch guardrails, then ramp 1% → 10% → 100% | limits the blast radius of a bad model to the canary's share of traffic (1% of users at the first stage) |
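A canary ramp can be sketched as a loop that advances only while guardrails hold. This is a minimal sketch. `run_canary` and the `guardrails_ok` callback are hypothetical names; a real system would shift router weights and watch metrics for a soak period at each stage.

```python
from typing import Callable

RAMP = (0.01, 0.10, 1.00)  # traffic fractions from the canary row above


def run_canary(guardrails_ok: Callable[[float], bool]) -> float:
    """Walk the ramp and return the final traffic fraction on the new version.

    guardrails_ok(fraction) stands in for "serve this fraction for a while
    and check the guardrail metrics". A failure at any stage rolls back to 0.
    """
    for fraction in RAMP:
        if not guardrails_ok(fraction):
            return 0.0  # roll back: all traffic returns to the old version
    return RAMP[-1]


print(run_canary(lambda fraction: True))             # healthy release
print(run_canary(lambda fraction: fraction < 0.10))  # regression shows at 10%
```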
Warm up before taking traffic. The first requests to a fresh replica pay for CUDA graph capture and kernel compilation (lesson 06) — TTFT on those is wildly inflated. A replica must be sent synthetic warmup requests until its caches are built, then added to the router. Skip this and every deploy spikes p99 TTFT for real users.
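One way to express the warmup gate is to send synthetic requests until TTFT settles and only then register the replica. This is a sketch under assumptions: `send_request` is a hypothetical callable returning the observed TTFT in milliseconds, and "settled" is taken to mean a few consecutive requests under a target.

```python
from typing import Callable


def warm_up(send_request: Callable[[], float], target_ttft_ms: float,
            stable_runs: int = 3, max_requests: int = 50) -> bool:
    """Return True once `stable_runs` consecutive TTFTs are under target.

    Returns False if the replica never settles within `max_requests`,
    in which case it should not be added to the router.
    """
    streak = 0
    for _ in range(max_requests):
        ttft_ms = send_request()
        streak = streak + 1 if ttft_ms <= target_ttft_ms else 0
        if streak >= stable_runs:
            return True
    return False


# Simulated replica: the first requests pay for graph capture and compilation.
latencies = iter([5000.0, 900.0, 120.0, 110.0, 100.0])
print(warm_up(lambda: next(latencies), target_ttft_ms=200.0))
```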
3 · Reliability math — redundancy and blast radius
At 1,000-GPU scale, hardware fails routinely (lesson 07): a GPU with a 0.1%/day failure rate means a 1,000-GPU fleet loses roughly one a day. You do not design to prevent failure; you design to absorb it. The tool is N+k redundancy — provision k spare replicas beyond the N you need so the service rides through up to k simultaneous failures with no capacity loss.
Size the fleet as replicas = peak_need + k_spare + deploy_headroom. The three terms each defend against a different threat: peak_need covers the diurnal peak (lesson 03), k_spare covers random hardware death, and deploy_headroom covers the capacity a rolling deploy removes while in flight. Drop any term and the bad day finds the gap.
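Summing the three terms gives the fleet size. The sketch below assumes peak demand and per-replica throughput are both known in requests per second. The function name and the example numbers are illustrative.

```python
import math


def replicas_needed(peak_rps: float, rps_per_replica: float,
                    k_spare: int, deploy_headroom: int) -> int:
    """replicas = peak_need + k_spare + deploy_headroom."""
    peak_need = math.ceil(peak_rps / rps_per_replica)  # covers the diurnal peak
    return peak_need + k_spare + deploy_headroom


# 900 req/s at peak, 40 req/s per replica, 2 spares, 3 replicas out per roll.
print(replicas_needed(900, 40, k_spare=2, deploy_headroom=3))  # 23 + 2 + 3 = 28
```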
Availability follows from the per-replica numbers. If each replica is independently up with probability p, an N+k fleet stays at full capacity as long as no more than k of its N+k replicas are down at once, which is a binomial probability. That pushes effective availability from, say, 99% per replica toward the three-to-four nines a product promises. But the math only holds for independent failures. Correlated failures set the real floor: a tensor-parallel group fails as a unit, and a bad deploy takes down every replica that runs it, so no amount of k buys that back.
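Under the independence assumption, the probability of holding full capacity is a binomial tail. This minimal sketch treats each replica as up with probability p. It deliberately ignores correlated failures, which is exactly the case where the result stops being trustworthy.

```python
from math import comb


def full_capacity_probability(n: int, k: int, p: float) -> float:
    """P(at most k of the n + k replicas are down), assuming independence."""
    total = n + k
    q = 1.0 - p  # probability that a single replica is down
    return sum(comb(total, down) * q**down * p**(total - down)
               for down in range(k + 1))


for spares in (0, 1, 2, 3):
    prob = full_capacity_probability(10, spares, 0.99)
    print(f"N=10, k={spares}: {prob:.6f}")
```

With N = 10 and p = 0.99, zero spares gives roughly 0.904 and two spares about 0.9998, which is where the "three-to-four nines" figure comes from.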
Graceful degradation — degrade the product, don't drop it
When demand exceeds even the redundant capacity, the choice is not "serve perfectly or fail." Degrade the product instead of dropping the user:
- Shed load by priority — reject or queue low-priority traffic (free-tier, batch) so paying realtime users keep their SLO. Goodput (lesson 03) for the requests that matter stays intact.
- Fall back to a smaller / cheaper model — a quantized 8B (lesson 06) serves a degraded-but-useful answer when the 70B fleet is saturated.
- Shorten max output — cap output length to cut W (lesson 03) and free KV, trading verbosity for availability.
Each is a deliberate, pre-designed lever, not a panic move. A system with no degradation path has exactly one failure mode: drop everything.
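The three levers can be written down ahead of time as an explicit policy. The sketch below is one possible ordering, not a recommendation: the load thresholds, token caps, and model labels are hypothetical, and `load` is assumed to be demand divided by the capacity of the 70B fleet.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str             # "serve" or "reject"
    model: str              # which fleet answers the request
    max_output_tokens: int  # output cap applied to the request


def degrade(priority: str, load: float) -> Decision:
    """Pick a pre-designed degradation step for one request."""
    if load <= 1.0:
        return Decision("serve", "70B", 1024)      # normal service
    if priority == "low":
        return Decision("reject", "", 0)           # shed free-tier / batch first
    if load <= 1.2:
        return Decision("serve", "70B", 256)       # shorten output to free KV
    return Decision("serve", "8B-quantized", 256)  # fall back to the small model


print(degrade("high", 0.8))
print(degrade("low", 1.1))
print(degrade("high", 1.5))
```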
4 · Cost governance — the survival lever
Cost is not an afterthought to bolt on once traffic grows; it is a design constraint that ranks alongside the latency SLOs. Track $/1M tokens (lesson 02's serving formula) as a first-class SLO with its own dashboard and alert.
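Lesson 02's formula is not reproduced here, so the sketch below assumes a common form: fleet cost per hour divided by tokens produced per hour. The function name and the example numbers are illustrative.

```python
def dollars_per_million_tokens(gpu_dollars_per_hour: float, gpus: int,
                               tokens_per_second: float) -> float:
    """Assumed form: (fleet $/hour) / (tokens/hour), scaled to 1M tokens."""
    fleet_dollars_per_hour = gpu_dollars_per_hour * gpus
    tokens_per_hour = tokens_per_second * 3600
    return fleet_dollars_per_hour / tokens_per_hour * 1_000_000


# 8 GPUs at $2/hour each, sustaining 4,000 tokens/s across the fleet.
print(f"${dollars_per_million_tokens(2.0, 8, 4000):.2f} per 1M tokens")
```

Because throughput sits in the denominator, an idle or under-batched fleet raises $/1M tokens even when the hourly bill is unchanged.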
| Lever | Mechanism | When |
|---|---|---|
| Right-size to load | autoscale toward actual demand instead of provisioning the 10× peak 24/7 (lesson 05) | always — over-provisioning for a diurnal peak is the #1 waste |
| Tiered serving | realtime premium tier vs batch/async cheap tier on the same fleet | workload has latency-insensitive traffic |
| Spot / preemptible GPUs | 30–70% cheaper, but can be reclaimed any time | fault-tolerant training (lesson 07 checkpointing) and batch inference only |
| Quantization / smaller models | int8/int4 weights, distilled models (lesson 06) | the cheap tier, where a small quality drop is acceptable |
| Reserved vs on-demand | commit to a baseline at a discount, burst on-demand | load has a stable floor plus spiky top |
Interactive · provisioning, redundancy, and the idle bill
The calculator takes your average load, the diurnal swing, your redundancy and deploy headroom, and the GPU price, with autoscaling switched on or off. It reports the monthly bill and the slice of it you are spending on idle GPUs. Its note flips between "you're paying for peak 24/7" and "autoscaling saves $X, but watch the cold-start lag from lesson 05."
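The same trade-off can be approximated offline. This is a simplified sketch, not the logic of the interactive tool. It assumes load is measured in replicas, that autoscaling tracks average demand perfectly (no cold-start lag), and that a month has 730 hours. All names and numbers are illustrative.

```python
import math

HOURS_PER_MONTH = 730  # assumed average month length


def monthly_bill(avg_replicas: float, peak_to_avg: float, k_spare: int,
                 deploy_headroom: int, gpus_per_replica: int,
                 gpu_dollars_per_hour: float,
                 autoscale: bool) -> tuple[float, float]:
    """Return (total monthly bill, portion spent on idle capacity)."""
    peak_replicas = math.ceil(avg_replicas * peak_to_avg)
    base = avg_replicas if autoscale else peak_replicas  # static = peak 24/7
    provisioned = base + k_spare + deploy_headroom
    dollars_per_replica_month = (gpus_per_replica * gpu_dollars_per_hour
                                 * HOURS_PER_MONTH)
    total = provisioned * dollars_per_replica_month
    idle = (provisioned - avg_replicas) * dollars_per_replica_month
    return total, idle


args = dict(avg_replicas=10, peak_to_avg=3.0, k_spare=2, deploy_headroom=1,
            gpus_per_replica=2, gpu_dollars_per_hour=2.0)
static_total, static_idle = monthly_bill(**args, autoscale=False)
scaled_total, scaled_idle = monthly_bill(**args, autoscale=True)
print(f"static:    ${static_total:,.0f} total, ${static_idle:,.0f} idle")
print(f"autoscale: ${scaled_total:,.0f} total, ${scaled_idle:,.0f} idle")
print(f"autoscaling saves ${static_total - scaled_total:,.0f} per month")
```

With these inputs the static fleet spends most of its bill on idle GPUs. The autoscaled figure is an upper bound on the savings because it ignores the cold-start lag from lesson 05.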
What carries forward
- Observe leading indicators, not vanity ones: queue depth, KV-cache utilization, TTFT/TPOT p99, preemption rate, empty-output rate, and quality drift predict incidents. GPU utilization is a trap — a full decode GPU looks idle.
- Stateful GPU services deploy slowly (minutes to load weights), so plan for rollback: blue/green for instant traffic-flip rollback, canary to cap blast radius, warmup before traffic. Tie deploy to lesson 10's regression gate.
- Redundancy is replicas = peak_need + k_spare + deploy_headroom, but tensor-parallel (TP) groups fail as a unit and a bad deploy is a correlated failure that redundancy can't cover. Keep a graceful-degradation path: shed by priority, fall back to a smaller model, shorten output.
- Cost is a first-class SLO. Track $/1M tokens; right-size to load (idle peak capacity is the #1 waste), tier serving, use spot for fault-tolerant work, quantize the cheap tier. A 10× jump in traffic without a cost model is a way to die of success.