
Designing evaluation and the feedback flywheel

Every lesson so far made the system faster, cheaper, or bigger. Lesson 04 cut decode latency, 06 deleted redundant prefill, 07 defended MFU, 09 balanced the RL loop. Not one of them checked whether the model got any better. A system you can't measure, you can't improve — and worse, you can't safely deploy. Evaluation is the control system that closes the loop: it tells you which change to keep, what is safe to ship, and feeds the data that trains the next model. This lesson designs eval itself as a system, with its own SLOs.

The framing: eval is a system, not a script
Treat your eval like a serving replica with requirements (lesson 03). It has its own SLOs: cheap enough to run on every pull request (seconds–minutes, not days), reproducible (same inputs → same score, so a number means something), and low-variance (the noise floor must sit below the effect you care about, or you're measuring nothing). Miss any of these and the "control system" lies to you — and a lying control system is worse than none, because it ships regressions with a green check.
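One way to check the low-variance requirement is to estimate the suite's noise floor directly. The sketch below bootstraps the standard error of a mean score over eval items and compares it with the effect you hope to detect. The function names, the 500-item suite, and the "two standard errors" cutoff are illustrative assumptions, not part of the lesson; a paired comparison of two models on the same items would usually be tighter than this.

```python
import random
import statistics

def bootstrap_stderr(per_item_scores, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean eval score by resampling items."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    n = len(per_item_scores)
    means = []
    for _ in range(n_resamples):
        sample = [per_item_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

def effect_is_measurable(per_item_scores, effect, z=2.0):
    """True if the effect exceeds roughly z standard errors of the suite."""
    return effect > z * bootstrap_stderr(per_item_scores)

# Hypothetical suite: 500 pass/fail items with a 70% pass rate.
scores = [1.0] * 350 + [0.0] * 150
noise = bootstrap_stderr(scores)
print(f"noise floor (1 s.e.): {noise:.3f}")
print("1-point gain measurable:", effect_is_measurable(scores, 0.01))
print("10-point gain measurable:", effect_is_measurable(scores, 0.10))
```

On a suite this small the standard error is about two points, so a one-point improvement sits inside the noise; the remedy is more items or a lower-variance metric, not more runs of the same suite.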

1 · The three eval tiers — by cost and fidelity

Evaluation is a hierarchy. Each tier is more faithful to "did real users get a better experience?" and proportionally slower and more expensive. You run the cheap ones constantly and the expensive ones rarely, as gates.

Tier | What it is | Cost / latency | Run cadence | Fidelity
(a) Offline static | Held-out sets, public benchmarks, a fixed regression suite | Cents; seconds–minutes | Every commit / PR | Low: proxy, can leak or saturate
(b) LLM-as-judge | A strong model scores or ranks outputs vs a reference | $1–$100s; minutes | Per candidate / nightly | Medium: approximates human preference, but biased
(c) Human / online A/B | Human raters, or live traffic measured against real metrics | $1k–$100k+; days–weeks | Release gate only | High: the actual question

(a) Offline static — cheap, reproducible, leaky

A frozen set of inputs with known-good answers (or a scoring function): exact-match on math, pass/fail on unit tests, perplexity on a held-out corpus. These are the CI of ML — fast, deterministic, and the only tier you can afford on every commit. Two failure modes kill them. Saturation: once the model scores 99%, the benchmark can no longer distinguish a better model from a worse one — your ruler has run out of ticks. Contamination: if the benchmark text leaked into pretraining data, the model memorized the answers and the score is fiction. This is exactly why the data plane runs decontamination — substring/n-gram matching the eval sets against the training corpus (data_engineering 06). An undecontaminated benchmark is a thermometer dipped in the boiler.

Contamination quietly inflates every number above it
Leakage of even a few percent of eval items into training can move a benchmark several points — enough to "win" a comparison that is pure memorization. The discipline from lesson 08's data plane is non-negotiable: decontaminate the training corpus against every held-out and benchmark set, and treat any benchmark you can't decontaminate as untrustworthy.
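The n-gram matching described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, lowercase normalization, and an 8-token window; all three are choices made for the example, and the function names are hypothetical. A production data plane would add stronger normalization and would scale the index beyond an in-memory set.

```python
def ngrams(text, n=8):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_items, n=8):
    """Union of n-grams over every held-out and benchmark item."""
    index = set()
    for item in eval_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(train_doc, eval_index, n=8):
    """True if the training document shares any n-gram with the eval sets."""
    return not ngrams(train_doc, n).isdisjoint(eval_index)

def decontaminate(train_docs, eval_items, n=8):
    """Drop every training document that overlaps the eval sets."""
    index = build_eval_index(eval_items, n)
    return [doc for doc in train_docs if not is_contaminated(doc, index, n)]

# Toy data with a short window (n=4) so the overlap is visible.
eval_items = ["What is the capital of France? Paris"]
train_docs = [
    "Trivia dump: what is the capital of France? Paris is the answer.",
    "A recipe for sourdough bread with a long cold ferment.",
]
print(decontaminate(train_docs, eval_items, n=4))
```

Only the sourdough document survives; the trivia dump shares a 4-gram with the eval item and is dropped.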

(b) LLM-as-judge — scalable preference, systematically biased

Human preference is the gold standard but costs ~$0.50–$5 per comparison and takes hours. An LLM judge approximates it at ~$0.001–$0.01 per comparison in seconds, so you can score thousands of outputs nightly. The catch: judges carry structured biases that, unhandled, produce confidently wrong rankings.

Bias | What the judge does | Mitigation
position | Favors whichever answer is shown first (A over B) | Randomize order; score both orderings and average
verbosity | Rates longer answers higher regardless of quality | Length-control: regress out length, or cap/normalize tokens
self-preference | Rates its own model family higher | Use a different judge family than the one under test
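The position-bias mitigation ("score both orderings and average") can be sketched in a few lines. The `judge` callable here is a hypothetical stand-in for a real judge-model call: it is assumed to take a prompt and two answers and return the probability that the first-shown answer wins. The toy judge ignores content entirely, which makes the effect of the swap easy to see.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Preference for A over B, averaged over both presentation orders."""
    a_first = judge(prompt, answer_a, answer_b)           # A shown first
    b_first = 1.0 - judge(prompt, answer_b, answer_a)     # B shown first, flipped to A's view
    return (a_first + b_first) / 2.0

def position_biased_judge(prompt, first, second):
    """Toy judge: always gives the first-shown answer a 70% win probability."""
    return 0.7

single_pass = position_biased_judge("q", "answer A", "answer B")
swapped = debiased_preference(position_biased_judge, "q", "answer A", "answer B")
print(single_pass, swapped)
```

A single pass reports 0.7 for whichever answer happens to be shown first; the swapped average lands at 0.5, which is the honest result for a judge with no content signal.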
Never trust a judge you haven't calibrated against humans
Before a judge gates anything, measure its agreement with human labels on a few hundred items and report Cohen's κ (chance-corrected agreement: κ = 0 is chance level, κ = 1 is perfect). A rule of thumb: κ < 0.4 is poor (agreement only modestly above chance), 0.4–0.6 moderate, > 0.6 substantial. Strong LLM judges reach κ ≈ 0.6–0.8 on well-specified rubrics, roughly human-to-human agreement, but only after position-swapping and length-control. An uncalibrated judge is a benchmark of unknown error; you cannot build a control system on an instrument you haven't characterized.
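For reference, Cohen's κ is (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement comes from each rater's label frequencies. The sketch below computes it for two label lists; the 100-item example is made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical calibration set: humans vs judge on 100 pairwise comparisons.
human = ["A"] * 50 + ["B"] * 50
judge = ["A"] * 40 + ["B"] * 10 + ["A"] * 10 + ["B"] * 40
print(round(cohens_kappa(human, judge), 3))
```

Here raw agreement is 80% but chance agreement is 50%, so κ comes out at 0.6: the boundary between moderate and substantial under the rule of thumb above.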

(c) Human / online A/B — highest fidelity, slowest, reserved for gates

Humans rating outputs, or real users on real traffic, answer the only question that ultimately matters. They are too slow and expensive for iteration, so they sit at the top of the pyramid as the release gate — the last check before a change reaches everyone. The rest of the lesson is mostly about §2: how to run the online version without burning weeks.

2 · Online A/B with guardrails

The release decision is an experiment. Route a fraction of live traffic to the new model (treatment) against the current model (control), and compare a primary metric — the thing you're trying to improve (task success, user thumbs-up rate, completion rate) — while watching a panel of guardrail metrics that must not regress even if the primary improves.

Role | Examples | Decision rule
Primary | Task success, thumbs-up rate, resolution rate | Must improve with significance to justify shipping
Guardrails | p99 latency (lesson 03), refusal rate, safety-violation rate, cost / request | Must not regress: a guardrail breach blocks the ship even if primary wins

The guardrail panel is where lessons 03 and 09 cash out. A new model that's 2% better at the task but blows the p99 TTFT SLO from lesson 03, or doubles cost/request, or quietly raises the refusal rate, is not a ship — it's a different product with a worse latency contract. You decide guardrail thresholds before the experiment so you can't rationalize a breach after seeing the data.
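The decision rule can be written down as code so the thresholds are fixed before launch rather than negotiated afterwards. This is a sketch under stated assumptions: every guardrail metric is "lower is better", the metric names and thresholds are hypothetical, and a real system would apply a statistical test to each guardrail instead of comparing point estimates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    name: str
    max_regression: float  # largest tolerated increase (treatment minus control)

def ship_decision(primary_lift, primary_significant, control, treatment, guardrails):
    """Return (ship, breached_guardrails). Any breach blocks the ship."""
    breaches = [
        g.name for g in guardrails
        if treatment[g.name] - control[g.name] > g.max_regression
    ]
    if breaches:
        return False, breaches
    return (primary_lift > 0 and primary_significant), []

# Thresholds are declared before the experiment starts (hypothetical values).
guardrails = [Guardrail("p99_ttft_ms", 50.0), Guardrail("cost_per_request", 0.001)]
control = {"p99_ttft_ms": 800.0, "cost_per_request": 0.010}
treatment = {"p99_ttft_ms": 950.0, "cost_per_request": 0.010}

print(ship_decision(0.02, True, control, treatment, guardrails))
```

The candidate wins the primary metric by 2 points with significance, yet the call returns a block because p99 TTFT regressed by 150 ms against a 50 ms allowance.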

Sample size — how long until you know?

Detecting a small effect needs a lot of data. For a rate metric with baseline p (so variance σ² = p(1−p)) and a minimum detectable effect MDE (absolute), at α=0.05 and power=0.8 the per-arm sample size is the standard two-proportion result:

n ≈ 16 · p(1−p) / MDE²   (16 ≈ 2·(z_{α/2} + z_β)² for α = 0.05, power = 0.8)

Halving the MDE you want to detect quadruples n. This ties straight back to lesson 03's traffic numbers: if your service does 1,000 req/s (≈86M/day) you reach huge sample sizes in hours; if the feature only triggers for 1k eligible users/day, the same experiment runs for months — and by then the world has shifted. The full statistical toolkit (CUPED variance reduction, the peeking problem, SRM, ratio metrics) is in deep_learning 10 · A/B testing; the widget below is the sizing core.
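As a worked check of the sizing formula, the sketch below computes the per-arm sample size twice: once with the rule-of-16 shortcut and once with the z-values taken from the standard normal distribution. The baseline of 20% and MDE of 1 percentage point are example inputs, not figures from the lesson.

```python
import math
from statistics import NormalDist

def per_arm_rule_of_16(p, mde):
    """Shortcut: n is about 16 * p(1-p) / MDE^2 at alpha=0.05, power=0.8."""
    return 16 * p * (1 - p) / mde ** 2

def per_arm_exact(p, mde, alpha=0.05, power=0.8):
    """Same formula with the constant 2 * (z_{alpha/2} + z_beta)^2 computed explicitly."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2)

p, mde = 0.20, 0.01  # 20% baseline, detect a 1-point absolute change
print(round(per_arm_rule_of_16(p, mde)))      # about 25,600 per arm
print(per_arm_exact(p, mde))                  # slightly lower: the true constant is about 15.7
print(round(per_arm_rule_of_16(p, mde / 2)))  # halving the MDE: about 4x as many samples
```

The shortcut overshoots the explicit constant by about 2%, which is well inside the order-of-magnitude precision this kind of sizing needs.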

3 · The offline ↔ online gap

The cruel fact of evaluation: offline wins routinely fail to transfer online. A model with +1% offline benchmark score shows no online lift, or regresses. Two reasons:

The discipline that resolves the gap
Offline eval gates what is allowed to ship; online A/B decides what does ship. Offline is a cheap, fast filter that catches obvious regressions before you spend a single user-impression — necessary, never sufficient. Online is the slow, faithful arbiter for anything that clears the filter. Confusing the two — shipping on an offline win, or A/B-testing every commit — is the most common eval-design mistake.

4 · Regression gating — the release contract

Make the offline filter a literal CI gate. A deploy is blocked unless the candidate clears fixed thresholds on the regression suite and introduces no guardrail regression. This is the model's equivalent of "tests must pass before merge" — and it connects directly to the deploy machinery in lesson 11.

[Figure: release gate. candidate model (new checkpoint) → REGRESSION GATE (offline thresholds + guardrails) → PASS: canary → A/B (lesson 11) → deploy; FAIL on any threshold: blocked, back to the bench.]

The thresholds are a contract: no metric in the suite drops by more than X, no safety metric drops at all. Why this matters in dollars: a bad model reaching production can mean an emergency rollback, hours of degraded user experience across millions of requests, and — if it's a safety regression — reputational damage that dwarfs the GPU bill. The gate is cheap; a bad deploy is not. The cost asymmetry is the entire argument for paying the offline-eval tax on every change.
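The contract above ("no metric drops by more than X, no safety metric drops at all") might look like this as a CI check. It is a sketch, assuming every metric is "higher is better" and that scores arrive as plain dictionaries; the metric names and the 0.01 tolerance are hypothetical.

```python
def regression_gate(baseline, candidate, max_drop, safety_metrics):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, base_score in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate results")
            continue
        drop = base_score - candidate[metric]
        # Safety metrics get zero tolerance; everything else may drop by max_drop.
        allowed = 0.0 if metric in safety_metrics else max_drop
        if drop > allowed:
            failures.append(f"{metric}: dropped {drop:.4f} (allowed {allowed:.4f})")
    return failures

baseline = {"math_exact_match": 0.620, "unit_test_pass": 0.710, "safety_pass_rate": 0.990}
candidate = {"math_exact_match": 0.615, "unit_test_pass": 0.730, "safety_pass_rate": 0.985}

failures = regression_gate(baseline, candidate, max_drop=0.01,
                           safety_metrics={"safety_pass_rate"})
for message in failures:
    print("BLOCKED:", message)
# In CI, a non-empty list would translate to a non-zero exit code.
```

The half-point dip on math is within tolerance and the unit-test score improved, but the half-point dip on the safety metric blocks the deploy on its own.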

5 · The data flywheel — the compounding loop

Here is what turns a live system into a self-improving one, and where lessons 08 and 09 reconnect. Production traffic is not just load to serve — it is the richest dataset you will ever have, generated for free by real users on the real distribution.

[Figure: the data flywheel. serve (live traffic) → log (prompts, outputs, signals) → curate / label (filter, dedup, annotate) → retrain (SFT / pref / RL, lesson 09) → deploy (behind the gate, §4) → back to serve. Caption: each turn of the wheel = a better model on a fresher distribution; the compounding moat.]

Serve → log → curate → retrain → deploy → more traffic. The logged prompts, outputs, and user signals (thumbs, edits, abandonment, downstream success) become the next SFT set, the next preference pairs, the next RL prompt distribution — feeding the RL flywheel of lesson 09 through the data plane of lesson 08 (and its online variant, data_engineering 10). Each turn lands a better model on a fresher distribution; the gap to a competitor without your traffic widens every cycle. That compounding is the durable moat.

Three things that corrode the flywheel
Bias amplification: the deployed model shapes the very data that trains its successor, so any bias becomes self-reinforcing — the model teaches its replacement its own blind spots. Counter it by mixing in held-out and exploration traffic. Privacy / consent: logged user data carries legal and ethical constraints (PII, consent scope, retention) — the data plane must redact and honor opt-outs before anything reaches training. Labeling is the rate-limiter: human curation/labeling is the slow, expensive step that caps how fast the wheel turns; this is precisely why LLM-as-judge (§1) exists — to scale the labeling that would otherwise bottleneck the entire loop.
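A minimal sketch of the curate step, covering two of the constraints above: opt-outs are honored before anything reaches training, and duplicate prompts are dropped. The record fields (`user_id`, `prompt`, `output`, `signal`) are hypothetical, and PII redaction is deliberately left out because it needs a dedicated pipeline rather than a few lines of code.

```python
def curate(log_records, opted_out_users):
    """Filter opt-outs, dedup by normalized prompt, and strip user identifiers."""
    seen_prompts = set()
    curated = []
    for record in log_records:
        if record["user_id"] in opted_out_users:
            continue  # consent comes first: never train on opted-out traffic
        key = " ".join(record["prompt"].lower().split())
        if key in seen_prompts:
            continue  # keep only the first occurrence of each prompt
        seen_prompts.add(key)
        curated.append({
            "prompt": record["prompt"],
            "output": record["output"],
            "signal": record.get("signal"),  # e.g. thumbs up/down, may be absent
        })
    return curated

logs = [
    {"user_id": "u1", "prompt": "Summarize this doc", "output": "...", "signal": "thumbs_up"},
    {"user_id": "u2", "prompt": "summarize  this doc", "output": "...", "signal": None},
    {"user_id": "u3", "prompt": "Translate to French", "output": "...", "signal": "thumbs_down"},
]
print(curate(logs, opted_out_users={"u3"}))
```

Of the three logged records, one survives: the second is a near-duplicate of the first after normalization, and the third belongs to an opted-out user.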

Interactive · A/B significance — how long until you know?

Set the baseline success rate, the effect you need to detect, your daily eligible traffic (lesson 03's λ), and the fraction routed to treatment. The widget computes the per-arm sample size, the total, and how many days until the result is significant — then tells you whether your eval or your model is the bottleneck.

A/B significance & time-to-decision

Assumptions: two-proportion z-test, α=0.05, power=0.8 fixed. Per-arm n ≈ 16·p(1−p)/MDE² (the 16 ≈ 2·(z_{α/2}+z_β)²). Total = 2n. frac is the fraction of daily eligible traffic routed to treatment, with an equal-sized slice held as control, so the experiment collects daily·frac·2 samples per day across the two arms. Days = total / (daily · frac · 2) = n / (daily · frac). MDE is absolute (percentage points). Order-of-magnitude, per lesson 02's ±30% contract; for real launches add CUPED and a fixed horizon (deep_learning 10).
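The same arithmetic as a small function, for readers without the interactive widget. It assumes, as the formulas above do, that the treatment fraction is matched by an equal-sized control slice (so it cannot exceed 0.5); the function name and example inputs are illustrative, and the widget's verdict logic is not reproduced here.

```python
def time_to_decision(p, mde, daily_eligible, frac):
    """Return (n_per_arm, total_samples, days) under the rule-of-16 sizing."""
    if not 0 < frac <= 0.5:
        raise ValueError("frac must be in (0, 0.5]: control needs an equal slice")
    n_per_arm = 16 * p * (1 - p) / mde ** 2
    total = 2 * n_per_arm
    samples_per_day = daily_eligible * frac * 2  # treatment plus equal-sized control
    return n_per_arm, total, total / samples_per_day

# Low-traffic feature: 1,000 eligible users/day, all traffic in the experiment.
n, total, days = time_to_decision(p=0.20, mde=0.01, daily_eligible=1_000, frac=0.5)
print(f"n per arm ~ {n:,.0f}, total ~ {total:,.0f}, days ~ {days:.1f}")

# High-traffic service: 1,000 req/s (about 86.4M/day), 5% routed to treatment.
_, _, days_big = time_to_decision(p=0.20, mde=0.01, daily_eligible=86_400_000, frac=0.05)
print(f"days ~ {days_big:.4f}")
```

With a 20% baseline and a 1-point MDE, the low-traffic feature needs roughly 51 days while the high-traffic service clears the same sample size in minutes, which is the contrast the sizing discussion draws.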


What carries forward