Designing evaluation and the feedback flywheel
Every lesson so far made the system faster, cheaper, or bigger. Lesson 04 cut decode latency, 06 deleted redundant prefill, 07 defended MFU, 09 balanced the RL loop. Not one of them checked whether the model got any better. A system you can't measure, you can't improve — and worse, you can't safely deploy. Evaluation is the control system that closes the loop: it tells you which change to keep, what is safe to ship, and feeds the data that trains the next model. This lesson designs eval itself as a system, with its own SLOs.
1 · The three eval tiers — by cost and fidelity
Evaluation is a hierarchy. Each successive tier is more faithful to "did real users get a better experience?" and correspondingly slower and more expensive. You run the cheap ones constantly and the expensive ones rarely, as gates.
| Tier | What it is | Cost / latency | Run cadence | Fidelity |
|---|---|---|---|---|
| (a) Offline static | Held-out sets, public benchmarks, a fixed regression suite | Cents; seconds–minutes | Every commit / PR | Low — proxy, can leak or saturate |
| (b) LLM-as-judge | A strong model scores or ranks outputs vs a reference | $1–$100s; minutes | Per candidate / nightly | Medium — approximates human preference, but biased |
| (c) Human / online A/B | Human raters, or live traffic measured against real metrics | $1k–$100k+; days–weeks | Release gate only | High — the actual question |
(a) Offline static — cheap, reproducible, leaky
A frozen set of inputs with known-good answers (or a scoring function): exact-match on math, pass/fail on unit tests, perplexity on a held-out corpus. These are the CI of ML — fast, deterministic, and the only tier you can afford on every commit. Two failure modes kill them. Saturation: once the model scores 99%, the benchmark can no longer distinguish a better model from a worse one — your ruler has run out of ticks. Contamination: if the benchmark text leaked into pretraining data, the model memorized the answers and the score is fiction. This is exactly why the data plane runs decontamination — substring/n-gram matching the eval sets against the training corpus (data_engineering 06). An undecontaminated benchmark is a thermometer dipped in the boiler.
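The n-gram matching described above can be sketched in a few lines. This is a minimal illustration, not the data plane's actual implementation: the function names, the whitespace tokenizer, and the default window of 8 words are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of `text` (lowercased, whitespace-split).

    Texts shorter than n words yield an empty set, so they are never flagged.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(
    eval_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,  # assumed window size; real pipelines tune this
) -> List[str]:
    """Return eval items sharing at least one n-gram with any corpus document."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [item for item in eval_items if ngrams(item, n) & corpus_grams]


if __name__ == "__main__":
    evals = [
        "what is the capital of france and why is it famous",
        "compute the integral of x squared from zero to one",
    ]
    corpus = [
        "forum post: what is the capital of france and why is it famous, asked a user",
        "unrelated text about cooking pasta with tomato sauce tonight",
    ]
    # Flags only the first eval item, which appears verbatim in the corpus.
    print(contaminated_items(evals, corpus, n=5))
```

Holding the whole corpus's n-grams in a Python set only works for toy data; at corpus scale the same idea is usually backed by hashing or an index, which this sketch leaves out.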
(b) LLM-as-judge — scalable preference, systematically biased
Human preference is the gold standard but costs ~$0.50–$5 per comparison and takes hours. An LLM judge approximates it at ~$0.001–$0.01 per comparison in seconds, so you can score thousands of outputs nightly. The catch: judges carry structured biases that, unhandled, produce confidently wrong rankings.
| Bias | What the judge does | Mitigation |
|---|---|---|
| position | Favors whichever answer is shown first (A over B) | Randomize order; score both orderings and average |
| verbosity | Rates longer answers higher regardless of quality | Length-control: regress out length, or cap/normalize tokens |
| self-preference | Rates its own model family higher | Use a different judge family than the one under test |
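The position-bias mitigation in the table (score both orderings and average) can be sketched as follows. The `Judge` signature is hypothetical: it stands in for whatever call returns the judge's probability that the first-shown answer is better, and is not any particular library's API.

```python
from typing import Callable

# Hypothetical judge: (prompt, first_shown, second_shown) -> P(first_shown is better), in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> float:
    """Estimate P(A is better than B), averaged over both presentation orders."""
    a_shown_first = judge(prompt, answer_a, answer_b)  # judge's P(A better) with A first
    b_shown_first = judge(prompt, answer_b, answer_a)  # judge's P(B better) with B first
    # Convert the second call to P(A better) and average, so a constant
    # first-position bonus cancels out.
    return 0.5 * (a_shown_first + (1.0 - b_shown_first))


if __name__ == "__main__":
    # Toy judge with pure position bias: it always gives the first answer 0.7.
    def biased(prompt: str, first: str, second: str) -> float:
        return 0.7

    print(debiased_preference(biased, "q", "answer A", "answer B"))
```

With the toy judge, the two orderings disagree by exactly the position bonus, so the average lands at 0.5: no real preference. Verbosity and self-preference need the separate mitigations listed in the table; this sketch handles position only.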
(c) Human / online A/B — highest fidelity, slowest, reserved for gates
Humans rating outputs, or real users on real traffic, answer the only question that ultimately matters. They are too slow and expensive for iteration, so they sit at the top of the pyramid as the release gate: the last check before a change reaches everyone. §2 covers how to run the online version without burning weeks.
2 · Online A/B with guardrails
The release decision is an experiment. Route a fraction of live traffic to the new model (treatment) against the current model (control), and compare a primary metric — the thing you're trying to improve (task success, user thumbs-up rate, completion rate) — while watching a panel of guardrail metrics that must not regress even if the primary improves.
| Role | Examples | Decision rule |
|---|---|---|
| Primary | Task success, thumbs-up rate, resolution rate | Must improve with significance to justify shipping |
| Guardrails | p99 latency (lesson 03), refusal rate, safety-violation rate, cost / request | Must not regress — a guardrail breach blocks the ship even if primary wins |
The guardrail panel is where lessons 03 and 09 cash out. A new model that's 2% better at the task but blows the p99 TTFT SLO from lesson 03, or doubles cost/request, or quietly raises the refusal rate, is not a ship — it's a different product with a worse latency contract. You decide guardrail thresholds before the experiment so you can't rationalize a breach after seeing the data.
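The decision rule in the table can be written down as code, which also forces the thresholds to exist before the data does. This is a sketch under stated assumptions: the function names and metric keys are illustrative, every guardrail is treated as "higher is worse", and guardrails are compared as point estimates without their own significance test.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict, List, Tuple


def two_sided_p_value(success_c: int, n_c: int, success_t: int, n_t: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test.

    Assumes the pooled success rate is strictly between 0 and 1.
    """
    pooled = (success_c + success_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (success_t / n_t - success_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def ship_decision(
    lift: float,                  # treatment rate minus control rate on the primary
    p_value: float,
    observed: Dict[str, float],   # guardrail values measured on the treatment arm
    limits: Dict[str, float],     # ceilings fixed BEFORE the experiment starts
    alpha: float = 0.05,
) -> Tuple[str, List[str]]:
    """Return ("block" | "ship" | "hold", list of breached guardrails)."""
    breaches = [name for name, limit in limits.items() if observed[name] > limit]
    if breaches:
        return "block", breaches  # a breach blocks even if the primary wins
    if lift > 0 and p_value < alpha:
        return "ship", []
    return "hold", []             # primary not significantly better


if __name__ == "__main__":
    p = two_sided_p_value(5000, 10_000, 5200, 10_000)
    limits = {"p99_latency_s": 1.0, "refusal_rate": 0.05}
    print(ship_decision(0.02, p, {"p99_latency_s": 1.2, "refusal_rate": 0.03}, limits))
```

The guardrail check runs first on purpose: a significant primary win never gets the chance to outvote a breach.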
Sample size — how long until you know?
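Detecting a small effect needs a lot of data. For a rate metric with baseline p (so variance σ² = p(1−p)) and a minimum detectable effect MDE (absolute), at α=0.05 and power=0.8 the per-arm sample size is the standard two-proportion result:

$$n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, p(1-p)}{\mathrm{MDE}^2} \approx \frac{16\, p(1-p)}{\mathrm{MDE}^2}$$

Here z₁₋α/₂ ≈ 1.96 and z₁₋β ≈ 0.84 are the standard normal quantiles for α = 0.05 and power 1−β = 0.8, so 2(1.96 + 0.84)² ≈ 15.7, conventionally rounded to 16.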
Halving the MDE you want to detect quadruples n. This ties straight back to lesson 03's traffic numbers: if your service does 1,000 req/s (≈86M/day) you reach huge sample sizes in hours; if the feature only triggers for 1k eligible users/day, the same experiment runs for months — and by then the world has shifted. The full statistical toolkit (CUPED variance reduction, the peeking problem, SRM, ratio metrics) is in deep_learning 10 · A/B testing; the widget below is the sizing core.
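As a quick check on the sizing rule, the per-arm sample size can be computed directly from the normal quantiles rather than the rounded constant 16. The function name and the example values (p = 0.5, MDE = 1 percentage point) are assumptions chosen for illustration.

```python
from statistics import NormalDist


def per_arm_sample_size(p: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Per-arm n for a two-sided two-proportion test, using variance p(1-p) in both arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at power = 0.8
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2


if __name__ == "__main__":
    n = per_arm_sample_size(p=0.5, mde=0.01)
    rule_of_16 = 16 * 0.5 * 0.5 / 0.01 ** 2
    print(f"exact quantiles: {n:,.0f} per arm; rule of 16: {rule_of_16:,.0f}")
    # Halving the MDE quadruples the requirement.
    print(per_arm_sample_size(0.5, 0.005) / n)
```

For these example inputs the exact quantiles give a little over 39,000 per arm against 40,000 from the rounded rule, so the rule of 16 errs slightly on the safe side.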
3 · The offline ↔ online gap
The cruel fact of evaluation: offline wins routinely fail to transfer online. A model that gains +1% on an offline benchmark can show no online lift, or even regress. Two reasons:
- Distribution shift. Offline sets are drawn from yesterday's traffic and yesterday's policy. Real users send different prompts, and a deployed model changes the inputs it sees next (the feedback loop of §5). The benchmark measured a world that no longer exists.
- Goodhart's law. "When a measure becomes a target, it ceases to be a good measure." Optimize hard against a fixed benchmark and you learn the benchmark's quirks, not the underlying capability — especially once §1's saturation sets in.
4 · Regression gating — the release contract
Make the offline filter a literal CI gate. A deploy is blocked unless the candidate clears fixed thresholds on the regression suite and introduces no guardrail regression. This is the model's equivalent of "tests must pass before merge" — and it connects directly to the deploy machinery in lesson 11.
The thresholds are a contract: no metric in the suite drops by more than X, no safety metric drops at all. Why this matters in dollars: a bad model reaching production can mean an emergency rollback, hours of degraded user experience across millions of requests, and — if it's a safety regression — reputational damage that dwarfs the GPU bill. The gate is cheap; a bad deploy is not. The cost asymmetry is the entire argument for paying the offline-eval tax on every change.
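The contract above ("no metric drops by more than X, no safety metric drops at all") maps onto a small check that a CI job could run. This is a sketch, assuming every metric is higher-is-better and that scores arrive as plain dictionaries; the function name, metric names, and the 0.01 threshold are illustrative, not part of lesson 11's deploy machinery.

```python
from typing import Dict, List, Set


def regression_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float,            # the "X" in the contract
    safety_metrics: Set[str],   # metrics that may not drop at all
) -> List[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    violations: List[str] = []
    for name, base in baseline.items():
        if name not in candidate:
            violations.append(f"{name}: missing from candidate results")
            continue
        drop = base - candidate[name]  # positive means the candidate is worse
        allowed = 0.0 if name in safety_metrics else max_drop
        if drop > allowed:
            violations.append(f"{name}: dropped {drop:.4f} (allowed {allowed:.4f})")
    return violations


if __name__ == "__main__":
    baseline = {"math_em": 0.80, "code_pass": 0.60, "safety_pass": 0.99}
    candidate = {"math_em": 0.795, "code_pass": 0.62, "safety_pass": 0.985}
    problems = regression_gate(baseline, candidate, max_drop=0.01, safety_metrics={"safety_pass"})
    for line in problems:
        print("BLOCKED:", line)
    # In CI, a non-empty list would translate to a non-zero exit code.
```

In the example, the small dip in `math_em` is within tolerance, but the same-sized dip in `safety_pass` blocks the deploy, which is the asymmetry the contract is meant to encode.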
5 · The data flywheel — the compounding loop
Here is what turns a live system into a self-improving one, and where lessons 08 and 09 reconnect. Production traffic is not just load to serve — it is the richest dataset you will ever have, generated for free by real users on the real distribution.
Serve → log → curate → retrain → deploy → more traffic. The logged prompts, outputs, and user signals (thumbs, edits, abandonment, downstream success) become the next SFT set, the next preference pairs, the next RL prompt distribution — feeding the RL flywheel of lesson 09 through the data plane of lesson 08 (and its online variant, data_engineering 10). Each turn lands a better model on a fresher distribution; the gap to a competitor without your traffic widens every cycle. That compounding is the durable moat.
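To make the "curate" step concrete, here is one way logged thumbs signals could become preference pairs. Everything about the log schema (`LogRecord`, its field names, the +1/−1/0 thumb encoding) is hypothetical, and a real pipeline would also need the consent, privacy, and quality filtering that this sketch omits.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical shape of one logged interaction."""
    prompt: str
    output: str
    thumb: int  # +1 thumbs-up, -1 thumbs-down, 0 no signal


def preference_pairs(logs: List[LogRecord]) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) triples from outputs rated on the same prompt."""
    liked: Dict[str, List[str]] = defaultdict(list)
    disliked: Dict[str, List[str]] = defaultdict(list)
    for rec in logs:
        if rec.thumb > 0:
            liked[rec.prompt].append(rec.output)
        elif rec.thumb < 0:
            disliked[rec.prompt].append(rec.output)

    pairs: List[Tuple[str, str, str]] = []
    for prompt, chosen_outputs in liked.items():
        for chosen in chosen_outputs:
            for rejected in disliked.get(prompt, []):
                if chosen != rejected:  # skip outputs that got conflicting votes
                    pairs.append((prompt, chosen, rejected))
    return pairs


if __name__ == "__main__":
    logs = [
        LogRecord("reset my password", "Go to Settings > Security.", +1),
        LogRecord("reset my password", "I cannot help with that.", -1),
        LogRecord("reset my password", "Try again later.", 0),
    ]
    print(preference_pairs(logs))
```

Only prompts that received both a positive and a negative signal produce pairs, so the yield depends heavily on how often identical prompts recur in traffic.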
Interactive · A/B significance — how long until you know?
Set the baseline success rate, the effect you need to detect, your daily eligible traffic (lesson 03's λ), and the fraction routed to treatment. The widget computes the per-arm sample size, the total, and how many days until the result is significant — then tells you whether your eval or your model is the bottleneck.
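For readers without the interactive widget, the same sizing can be approximated in code. This is a sketch of what such a calculation might look like, not the widget's actual logic: the function name is made up, and the handling of an uneven split (total N scales with 1 / (f(1−f)) for treatment fraction f) is an assumption derived from the two-proportion variance.

```python
from math import ceil
from statistics import NormalDist
from typing import Dict


def days_to_significance(
    p: float,                     # baseline success rate
    mde: float,                   # absolute minimum detectable effect
    daily_traffic: float,         # eligible requests or users per day
    treatment_fraction: float = 0.5,
    alpha: float = 0.05,
    power: float = 0.8,
) -> Dict[str, float]:
    """Sample sizes and run time for a two-arm test with an arbitrary traffic split."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    f = treatment_fraction
    # Var(diff) = p(1-p) / (N * f * (1-f)); solve z * sqrt(Var) = MDE for N.
    total = z ** 2 * p * (1 - p) / (mde ** 2 * f * (1 - f))
    return {
        "n_treatment": ceil(total * f),
        "n_control": ceil(total * (1 - f)),
        "total": ceil(total),
        "days": total / daily_traffic,
    }


if __name__ == "__main__":
    for fraction in (0.5, 0.1):
        result = days_to_significance(p=0.5, mde=0.01, daily_traffic=1_000,
                                      treatment_fraction=fraction)
        print(fraction, result)
```

With the example inputs, a 50/50 split at 1,000 eligible users per day needs on the order of 78 days, and a cautious 10% treatment split needs well over twice that, which is the "months" regime described in §2.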
What carries forward
- Eval is a system with SLOs — cheap enough per-PR, reproducible, low-variance. A lying control system ships regressions with a green check.
- Three tiers by cost & fidelity: offline static (every commit, but saturates and leaks — decontaminate per lesson 08), LLM-as-judge (scalable, biased — randomize position, length-control, cross-family judge, calibrate to κ>0.6 against humans), human/online A/B (the gold standard, reserved as the release gate).
- Online A/B compares a primary metric against guardrails (p99 latency from lesson 03, refusal rate, safety, cost/req) that must not regress. Sample size n ≈ 16·p(1−p)/MDE²; small effects or thin traffic mean weeks.
- Offline ↔ online gap is real (distribution shift, Goodhart): offline gates what's allowed to ship, online decides what does ship.
- Regression gating is the release contract — the literal CI gate for models, blocking deploy on any threshold or guardrail breach (lesson 11). The gate is cheap; a bad deploy costs rollbacks, degraded UX across millions of requests, and trust.
- The data flywheel (serve → log → curate → retrain → deploy) feeds lessons 09 and 08 and is the compounding moat — bounded by bias amplification, privacy/consent, and human labeling as the rate-limiter.