Production — reliability, observability, and cost

Lessons 04–10 built a system that works on a good day: it fits in HBM, hits its SLOs, scales out, and passes the eval gate. Production is the other days — the bad one, at scale, on a budget. Three pillars carry a system across them. Observability lets you see the bad day coming; reliability lets you survive it; cost lets you afford the good day — and it is the lever that decides whether success kills you. This lesson quantifies all three in dollars, availability percentages, and minutes.

The good-day system is the easy 80%

Every earlier lesson optimized the steady state. But a fleet of 1,000 GPUs (lesson 07) has hardware failing daily, a 10× diurnal swing (lesson 03) that idles most of it at 4 a.m., and a deploy pipeline that can take down every replica at once. None of that shows up in a benchmark. Production engineering is the discipline of designing for the moments the benchmark never tests.

1 · Observability — see the bad day coming

You cannot fix what you cannot see, and most dashboards measure the wrong thing. The distinction that matters is leading vs lagging: a lagging metric (error rate, dropped requests) tells you the incident already happened; a leading metric tells you it is about to. For LLM serving the leading indicators all come from constraints you already met in earlier lessons.

Metric	What it predicts	Alert threshold
Queue depth (requests waiting)	under-provisioning: by Little's Law the concurrency you must hold is λ·W, and once that exceeds capacity the queue grows without bound (lesson 03); this is the autoscale signal (lesson 05)	rising trend, or > ~1× in-flight count
KV-cache utilization	imminent preemption / OOM — the KV cache is the batch-size ceiling (lesson 04); near 100% means the next request gets swapped or rejected	p95 > ~85%
TTFT / TPOT p99	SLO breach in progress (lesson 03); the tail moves before the median does	p99 > SLO budget
Preemption / swap rate	memory pressure already biting — requests are being evicted and re-run, inflating cost and latency	> 0 sustained
Empty-output / token error rate	a broken model or decode path — often the first sign of a bad deploy	> baseline + noise
Quality drift (online eval score)	silent regression — the model is up and fast but getting worse (feeds lesson 10's eval)	score below gate
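The thresholds in the table can be sketched as a small rule check. This is a minimal illustration, not a monitoring system: the `Snapshot` fields and the `leading_alerts` function are hypothetical names, and a real "sustained" or "rising trend" rule would look at a time window rather than a single sample.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One scrape of the serving metrics (field names are illustrative)."""
    queue_depth: int            # requests waiting
    in_flight: int              # requests currently being served
    kv_util_p95: float          # KV-cache utilization, 0.0 to 1.0
    ttft_p99_s: float           # p99 time to first token, seconds
    preemptions_per_min: float  # evictions / swaps per minute


def leading_alerts(s: Snapshot, ttft_slo_s: float) -> list[str]:
    """Return the names of the leading indicators that are over threshold."""
    alerts = []
    if s.queue_depth > s.in_flight:   # queue > ~1x in-flight count
        alerts.append("queue_depth")
    if s.kv_util_p95 > 0.85:          # p95 > ~85%
        alerts.append("kv_cache_utilization")
    if s.ttft_p99_s > ttft_slo_s:     # p99 over the SLO budget
        alerts.append("ttft_p99")
    if s.preemptions_per_min > 0:     # any preemption (single sample here)
        alerts.append("preemption_rate")
    return alerts


snap = Snapshot(queue_depth=40, in_flight=32, kv_util_p95=0.91,
                ttft_p99_s=0.4, preemptions_per_min=0.0)
print(leading_alerts(snap, ttft_slo_s=0.5))
# ['queue_depth', 'kv_cache_utilization']
```

Note that latency is still inside its SLO in this sample while queue depth and KV utilization already fire, which is the point of alerting on leading indicators.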

GPU utilization is a trap metric

The instinct is to watch GPU utilization, and it is almost useless for serving. A memory-bound, fully-batched decode GPU (lesson 01: decode is bandwidth-bound) can show modest SM utilization while completely full — every KV block taken, queue building. Read "45% utilized" and you'll conclude there's headroom when you are saturated — the exact autoscaling failure from lesson 05. Utilization measures whether the SMs are busy; KV-cache utilization and queue depth measure whether you have room for the next user. Alert on the latter.

The discipline: a metric earns a place on the on-call dashboard only if it changes before users feel pain. Throughput, total request count, and GPU utilization are vanity or lagging — useful for capacity reviews, not for catching an incident. Queue depth and KV-util are the smoke alarm.

2 · Deploy & rollback for stateful GPU services

A stateless web server deploys in seconds: pull the new image, restart, done. A GPU serving replica cannot. The weights are huge (140 GB for a 70B model at 16-bit precision, lesson 02), and loading them into HBM from storage at a few GB/s takes on the order of minutes once process start and warmup are included. This is the same cold-start lag that made reactive autoscaling impossible in lesson 05. You cannot hot-swap a model. That single fact shapes every deploy strategy.
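A back-of-the-envelope check of the load time, assuming the only cost is streaming the weights at a fixed read bandwidth. The function name and the sample bandwidths are illustrative; the result is a lower bound, since image pull, process start, and warmup come on top.

```python
def weight_load_seconds(weights_gb: float, read_gb_per_s: float) -> float:
    """Time to stream the weights from storage, ignoring all other startup cost."""
    return weights_gb / read_gb_per_s


# 140 GB of weights at a few assumed storage bandwidths.
for bw in (1.0, 2.0, 4.0):
    t = weight_load_seconds(140, bw)
    print(f"{bw:.0f} GB/s: {t:.0f} s ({t / 60:.1f} min)")
```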

Strategy	How	Cost / risk
Blue/green	stand up the full new fleet (green), shift traffic over, keep the old fleet (blue) warm for instant rollback	2× capacity during the cutover; fastest rollback (flip traffic back)
Rolling	replace replicas a few at a time	cheap (no double fleet) but capacity dips during the roll — size for it or you breach SLO mid-deploy
Canary	route ~1% of traffic to the new version, watch guardrails, then ramp 1% → 10% → 100%	limits blast radius of a bad model to 1% of users

Warm up before taking traffic. The first requests to a fresh replica pay for CUDA graph capture and kernel compilation (lesson 06) — TTFT on those is wildly inflated. A replica must be sent synthetic warmup requests until its caches are built, then added to the router. Skip this and every deploy spikes p99 TTFT for real users.
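One way to sketch the warmup gate: keep sending synthetic requests until the last few latencies have flattened out, and only then register the replica with the router. Everything here is an assumption for illustration: `probe` is a hypothetical callable that sends one synthetic request and returns its TTFT in seconds, and the stability rule (last `window` samples within 20% of each other) is just one reasonable choice.

```python
from typing import Callable


def warm_up(probe: Callable[[], float], max_requests: int = 50,
            window: int = 3, stable_ratio: float = 1.2) -> bool:
    """Send synthetic requests until latency stops improving.

    Returns True once the last `window` latencies are within `stable_ratio`
    of each other, or False if that never happens within `max_requests`.
    """
    latencies: list[float] = []
    for _ in range(max_requests):
        latencies.append(probe())
        recent = latencies[-window:]
        if len(recent) == window and max(recent) <= stable_ratio * min(recent):
            return True
    return False


# Simulated replica: the first requests pay for graph capture and compilation.
fake_ttft = iter([5.0, 2.0, 0.21, 0.20, 0.20, 0.20])
ready = warm_up(lambda: next(fake_ttft))
print("add to router" if ready else "keep out of rotation")
```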

Rollback speed is a reliability number, and the gate decides it

A bad model deploy is the most common serious incident, and lesson 10's online guardrails are what catch it (quality drift, empty outputs, a failing eval score). Tie the deploy pipeline to that regression gate: the gate's verdict triggers an automatic rollback. With blue/green the rollback is a traffic flip, which takes seconds. With rolling you must reload the old weights, which takes minutes. If a bad model serves users for 8 minutes instead of 8 seconds, that is 60× the exposure: nearly two orders of magnitude more degraded experience from the same mistake. Deploying fast is optional; rolling back fast is not.
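The exposure argument can be put in numbers. The sketch below assumes a flat request rate and counts how many requests the bad version serves before rollback completes, then shows a gate-driven canary ramp. The names `exposed_requests` and `ramp_with_gate`, the 1,000 requests/s figure, and the gate callback are all hypothetical.

```python
from typing import Callable, Sequence


def exposed_requests(rps: float, traffic_fraction: float,
                     seconds_to_rollback: float) -> float:
    """Requests served by the bad version before rollback completes."""
    return rps * traffic_fraction * seconds_to_rollback


def ramp_with_gate(stages: Sequence[float],
                   gate_passes: Callable[[float], bool]) -> float:
    """Walk the canary stages in order.

    Returns the final traffic fraction on the new version, or 0.0 if the
    gate fails at any stage (meaning: roll back).
    """
    for fraction in stages:
        if not gate_passes(fraction):
            return 0.0
    return stages[-1]


RPS = 1000  # assumed request rate, for illustration only
print("blue/green, 8 s flip:   ", exposed_requests(RPS, 1.0, 8))
print("rolling, 8 min reload:  ", exposed_requests(RPS, 1.0, 8 * 60))
print("1% canary, 8 min reload:", exposed_requests(RPS, 0.01, 8 * 60))

# A gate that fails once the canary reaches 10% of traffic triggers rollback.
print(ramp_with_gate([0.01, 0.10, 1.0], lambda frac: frac < 0.10))
```

Under these assumptions a 1% canary with a slow rollback exposes fewer requests than a full-traffic deploy with a fast one, which is why the two techniques are usually combined.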

3 · Reliability math — redundancy and blast radius

At 1,000-GPU scale, hardware fails routinely (lesson 07): a GPU with a 0.1%/day failure rate means a 1,000-GPU fleet loses roughly one a day. You do not design to prevent failure; you design to absorb it. The tool is N+k redundancy — provision k spare replicas beyond the N you need so the service rides through up to k simultaneous failures with no capacity loss.

replicas = peak_need + k_spare + deploy_headroom

The three terms each defend against a different threat: peak_need covers the diurnal peak (lesson 03), k_spare covers random hardware death, and deploy_headroom covers the capacity a rolling deploy removes while in flight. Drop any term and the bad day finds the gap.
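A small sizing sketch for the formula above. It assumes deploy headroom is expressed as a percentage of peak need (a convention chosen here, not stated by the formula), and the sample numbers are illustrative.

```python
import math


def expected_failures_per_day(gpus: int, daily_failure_rate: float) -> float:
    """Mean number of GPUs lost per day, assuming independent failures."""
    return gpus * daily_failure_rate


def replicas_to_provision(peak_need: int, k_spare: int,
                          deploy_headroom_pct: float) -> int:
    """replicas = peak_need + k_spare + deploy_headroom."""
    deploy_headroom = math.ceil(peak_need * deploy_headroom_pct / 100)
    return peak_need + k_spare + deploy_headroom


print(expected_failures_per_day(1000, 0.001))  # about one GPU per day
print(replicas_to_provision(peak_need=400, k_spare=3, deploy_headroom_pct=20))
# 483
```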

Failure domains — a TP group dies as a unit

Redundancy math assumes replicas fail independently, but they don't always. A replica that spans 8 GPUs as a TP=8 group (lesson 05) fails as a single unit: lose one GPU and the all-reduce can't complete, so the whole replica is down — you lose 8 GPUs' worth of capacity per failure, not one. Worse are correlated failures: a bad rack takes out every replica on it; a bad model deploy hits all replicas at once. The deploy is the single largest correlated-failure source in the system — which is exactly why canary and fast rollback (pillar 2) exist. Redundancy protects against independent failures; it does nothing against a bad deploy pushed everywhere.
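To see what "fails as a unit" does to the numbers, here is a sketch that assumes each GPU fails independently with the same daily probability. A TP group is down if any one of its GPUs is down, so the replica-level failure probability is the complement of all GPUs surviving. The function name is illustrative.

```python
def replica_daily_failure_prob(gpu_daily_failure_prob: float,
                               tp_degree: int) -> float:
    """Probability that at least one GPU in a TP group fails in a day."""
    return 1.0 - (1.0 - gpu_daily_failure_prob) ** tp_degree


p_gpu = 0.001  # 0.1% per day, as in the fleet example
for tp in (1, 2, 8):
    p = replica_daily_failure_prob(p_gpu, tp)
    print(f"TP={tp}: {p:.4%} per day, {tp} GPUs of capacity lost per failure")
```

For small per-GPU rates the replica failure rate is close to TP degree times the GPU rate, so a TP=8 replica fails roughly eight times as often as a single-GPU one and takes eight GPUs with it each time.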

Availability composes the usual way. If one replica is up with probability p, an N+k fleet stays at full capacity as long as no more than k are down — pushing effective availability from, say, 99% per replica toward the three-to-four nines a product promises. But that math only holds for independent failures; the correlated ones set the real floor, and no amount of k buys it back.
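The composition can be written out as a binomial sum. This sketch assumes every replica is up independently with the same probability `p_up`, which is exactly the assumption the paragraph above warns about, so treat the result as an upper bound. The function name and the N=10 example are illustrative.

```python
from math import comb


def full_capacity_availability(n_needed: int, k_spare: int, p_up: float) -> float:
    """P(at most k_spare of the n_needed + k_spare replicas are down)."""
    total = n_needed + k_spare
    p_down = 1.0 - p_up
    return sum(
        comb(total, down) * p_down ** down * p_up ** (total - down)
        for down in range(k_spare + 1)
    )


# Ten replicas needed, each up 99% of the time, with 0 to 3 spares.
for k in range(4):
    print(f"k={k}: {full_capacity_availability(10, k, 0.99):.6f}")
```

With no spares, ten 99% replicas are all up only about 90% of the time; a couple of spares is enough to pass three nines in this independent-failure model.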

Graceful degradation — degrade the product, don't drop it

When demand exceeds even the redundant capacity, the choice is not "serve perfectly or fail." Degrade the product instead of dropping the user:

Shed load by priority — reject or queue low-priority traffic (free-tier, batch) so paying realtime users keep their SLO. Goodput (lesson 03) for the requests that matter stays intact.
Fall back to a smaller / cheaper model — a quantized 8B (lesson 06) serves a degraded-but-useful answer when the 70B fleet is saturated.
Shorten max output — cap output length to cut W (lesson 03) and free KV, trading verbosity for availability.

Each is a deliberate, pre-designed lever, not a panic move. A system with no degradation path has exactly one failure mode: drop everything.
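The three levers can be sketched as a pre-designed ladder keyed on load. This is illustrative only: the `Decision` type, the load thresholds, the token caps, the model labels, and the order in which the levers are pulled are all assumptions, and `load` is taken to mean demand divided by capacity.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    admit: bool
    model: str              # which fleet serves the request
    max_output_tokens: int


def degrade(load: float, priority: str) -> Decision:
    """Pick a degradation level; load = demand / capacity (1.0 = exactly full)."""
    if load <= 1.0:
        return Decision(True, "large", 1024)  # normal service
    if priority == "low":
        return Decision(False, "none", 0)     # shed low-priority traffic first
    if load <= 1.2:
        return Decision(True, "large", 256)   # shorten max output
    return Decision(True, "small", 256)       # fall back to the smaller model


for load, prio in [(0.8, "low"), (1.1, "low"), (1.1, "high"), (1.5, "high")]:
    print(load, prio, degrade(load, prio))
```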

4 · Cost governance — the survival lever

Cost is not an afterthought to bolt on once traffic grows; it is a design constraint that ranks alongside the latency SLOs. Track $/1M tokens (lesson 02's serving formula) as a first-class SLO with its own dashboard and alert.

$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu · 3600) · 1e6
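The formula as a function, with a couple of made-up price and throughput points to show the sensitivity. The function name and the sample inputs are illustrative, not measurements.

```python
def dollars_per_million_tokens(gpu_hr_usd: float,
                               tok_per_s_per_gpu: float) -> float:
    """$/1M tok = ($/gpu-hr) / (tok_per_s_per_gpu * 3600) * 1e6."""
    return gpu_hr_usd / (tok_per_s_per_gpu * 3600) * 1e6


for price, tps in [(2.0, 1000), (2.0, 2500), (4.0, 1000)]:
    cost = dollars_per_million_tokens(price, tps)
    print(f"${price:.2f}/gpu-hr at {tps} tok/s/gpu: ${cost:.3f} per 1M tokens")
```

Cost per token scales linearly with GPU price and inversely with per-GPU throughput, so a throughput gain from batching or quantization lowers the number exactly as much as the same percentage price cut.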

Lever	Mechanism	When
Right-size to load	autoscale toward actual demand instead of provisioning the 10× peak 24/7 (lesson 05)	always — over-provisioning for a diurnal peak is the #1 waste
Tiered serving	realtime premium tier vs batch/async cheap tier on the same fleet	workload has latency-insensitive traffic
Spot / preemptible GPUs	30–70% cheaper, but can be reclaimed any time	fault-tolerant training (lesson 07 checkpointing) and batch inference only
Quantization / smaller models	int8/int4 weights, distilled models (lesson 06)	the cheap tier, where a small quality drop is acceptable
Reserved vs on-demand	commit to a baseline at a discount, burst on-demand	load has a stable floor plus spiky top

The #1 cost waste: paying for peak 24 hours a day

A service with a 10× diurnal swing that provisions for peak around the clock idles ~90% of that fleet at the trough: at 4 a.m. you are renting 10 GPUs to use 1. Autoscaling toward the demand curve reclaims most of that idle bill, but the catch is lesson 05's cold-start lag: you cannot scale up reactively when the reaction takes minutes, so you keep a warm pool or lead the curve. The trade is explicit money (idle GPUs) vs explicit risk (autoscale lag breaching SLO during a spike). The widget below puts a dollar figure on both sides.

Success that kills you

A product that 10×'s its traffic without a cost model dies of its own success: the bill 10×'s, the unit economics were never checked, and the company is now paying more per token than it charges. Cost-per-token must be a design constraint from day one — the same way the latency SLO is. Scaling up is a celebration only if $/1M tok was under the revenue-per-token line before you scaled.
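A minimal unit-economics check, assuming a single blended price and cost per million tokens. The function name and all figures are hypothetical; the point is only that a negative margin scales with traffic just as faithfully as a positive one.

```python
def monthly_margin_usd(tokens_per_month: float, price_per_1m_usd: float,
                       cost_per_1m_usd: float) -> float:
    """Revenue minus serving cost for one month of traffic."""
    return tokens_per_month / 1e6 * (price_per_1m_usd - cost_per_1m_usd)


before = monthly_margin_usd(1e9, price_per_1m_usd=0.50, cost_per_1m_usd=0.60)
after = monthly_margin_usd(10e9, price_per_1m_usd=0.50, cost_per_1m_usd=0.60)
print(f"before: {before:,.0f} USD/month, after 10x traffic: {after:,.0f} USD/month")
```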

Interactive · provisioning, redundancy, and the idle bill

Set your average load, the diurnal swing, your redundancy and deploy headroom, and the GPU price. Flip autoscaling on and off. Watch the monthly bill — and the slice of it you are spending on idle GPUs — move. The note flips between "you're paying for peak 24/7" and "autoscaling saves $X, but watch the cold-start lag from lesson 05."

Provisioning: peak, redundancy, and the idle bill

Assumptions: 730 hr/month. Autoscale ON ≈ you run near peak only during the ~⅓ of the day that is peak and near average the rest, plus k spares always on; OFF ≈ you run the full peak fleet 24/7. Idle $ = replica-hours provisioned above the average need × $/replica-hr. Order-of-magnitude planning, not a billing forecast.

Input	Default
avg concurrency need (replicas)	100
peak / avg ratio	4
N+k spare replicas	3
deploy headroom %	20
$ / replica-hr	12
autoscale (0 off / 1 on)	off

Outputs: provisioned replicas, $/month, idle $/month, availability note.
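The widget's arithmetic can be approximated in a few lines. This is one reading of the stated assumptions, not the widget's actual source: in particular, applying deploy headroom as a percentage of whatever fleet is serving at the time is a guess, and the function and key names are illustrative.

```python
HOURS_PER_MONTH = 730


def monthly_bill(avg_need: float, peak_ratio: float, k_spare: int,
                 deploy_headroom_pct: float, usd_per_replica_hr: float,
                 autoscale: bool) -> dict:
    """Time-averaged replicas, monthly bill, and the idle share of that bill."""
    peak_need = avg_need * peak_ratio
    if autoscale:
        # Near peak for ~1/3 of the day, near average for the other ~2/3.
        base = (peak_need + 2 * avg_need) / 3
    else:
        # Full peak fleet around the clock.
        base = peak_need
    headroom = base * deploy_headroom_pct / 100
    replicas = base + headroom + k_spare  # k spares are always on
    rate = HOURS_PER_MONTH * usd_per_replica_hr
    return {
        "replicas": replicas,
        "monthly_usd": replicas * rate,
        "idle_usd": (replicas - avg_need) * rate,
    }


defaults = dict(avg_need=100, peak_ratio=4, k_spare=3,
                deploy_headroom_pct=20, usd_per_replica_hr=12)
off = monthly_bill(**defaults, autoscale=False)
on = monthly_bill(**defaults, autoscale=True)
print("autoscale off:", off)
print("autoscale on: ", on)
print("saved per month:", off["monthly_usd"] - on["monthly_usd"])
```

With the default inputs, this reading gives 483 replicas and $4,231,080 per month with autoscaling off, of which $3,355,080 is idle, against a time-averaged 243 replicas and $2,128,680 per month with autoscaling on.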

What carries forward

Observe leading indicators, not vanity ones: queue depth, KV-cache utilization, TTFT/TPOT p99, preemption rate, empty-output rate, and quality drift predict incidents. GPU utilization is a trap: a saturated decode GPU can still show modest SM utilization.
Stateful GPU services deploy slowly (minutes to load weights), so plan for rollback: blue/green for instant traffic-flip rollback, canary to cap blast radius, warmup before traffic. Tie deploy to lesson 10's regression gate.
Redundancy is replicas = peak_need + k_spare + deploy_headroom, but TP groups fail as a unit and a bad deploy is a correlated failure that redundancy can't cover. Keep a graceful-degradation path: shed by priority, fall back smaller, shorten output.
Cost is a first-class SLO. Track $/1M tokens; right-size to load (idle peak is the #1 waste), tier serving, use spot for fault-tolerant work, quantize the cheap tier. A 10× jump in traffic without a cost model is a way to die of success.