Designing evaluation and the feedback flywheel

Every lesson so far made the system faster, cheaper, or bigger. Lesson 04 cut decode latency, 06 deleted redundant prefill, 07 defended MFU, 09 balanced the RL loop. Not one of them checked whether the model got any better. A system you can't measure, you can't improve — and worse, you can't safely deploy. Evaluation is the control system that closes the loop: it tells you which change to keep, what is safe to ship, and feeds the data that trains the next model. This lesson designs eval itself as a system, with its own SLOs.

The framing: eval is a system, not a script

Treat your eval like a serving replica with requirements (lesson 03). It has its own SLOs: cheap enough to run on every pull request (seconds–minutes, not days), reproducible (same inputs → same score, so a number means something), and low-variance (the noise floor must sit below the effect you care about, or you're measuring nothing). Miss any of these and the "control system" lies to you — and a lying control system is worse than none, because it ships regressions with a green check.
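
To make the low-variance requirement concrete, here is a minimal sketch that estimates the sampling noise of an accuracy score and checks whether a target effect rises above it. It assumes items are scored independently as pass/fail (binomial noise) and uses a roughly 95% band; the function names are illustrative, not part of any library.

```python
import math


def accuracy_standard_error(p: float, n_items: int) -> float:
    """Binomial standard error of an accuracy p measured on n_items items."""
    return math.sqrt(p * (1.0 - p) / n_items)


def effect_is_measurable(p: float, n_items: int, effect: float, z: float = 1.96) -> bool:
    """True when the effect is larger than the ~95% noise band of one score.

    Simplification: this looks at the noise of a single score. Comparing two
    independently measured scores widens the band by about sqrt(2).
    """
    return effect > z * accuracy_standard_error(p, n_items)


# A 500-item suite at 80% accuracy has a noise band of about 3.5 points,
# so a 1-point change is invisible while a 5-point change is not.
print(effect_is_measurable(0.8, 500, 0.01))  # prints False
print(effect_is_measurable(0.8, 500, 0.05))  # prints True
```

The practical reading: if the effect you care about is smaller than the band, grow the suite or reduce variance before trusting the number.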

1 · The three eval tiers — by cost and fidelity

Evaluation is a hierarchy. Each tier up is more faithful to "did real users get a better experience?" and orders of magnitude slower and more expensive. You run the cheap ones constantly and the expensive ones rarely, as gates.

Tier	What it is	Cost / latency	Run cadence	Fidelity
(a) Offline static	Held-out sets, public benchmarks, a fixed regression suite	Cents; seconds–minutes	Every commit / PR	Low — proxy, can leak or saturate
(b) LLM-as-judge	A strong model scores or ranks outputs vs a reference	$1–$100s; minutes	Per candidate / nightly	Medium — approximates human preference, but biased
(c) Human / online A/B	Human raters, or live traffic measured against real metrics	$1k–$100k+; days–weeks	Release gate only	High — the actual question

(a) Offline static — cheap, reproducible, leaky

A frozen set of inputs with known-good answers (or a scoring function): exact-match on math, pass/fail on unit tests, perplexity on a held-out corpus. These are the CI of ML — fast, deterministic, and the only tier you can afford on every commit. Two failure modes kill them. Saturation: once the model scores 99%, the benchmark can no longer distinguish a better model from a worse one — your ruler has run out of ticks. Contamination: if the benchmark text leaked into pretraining data, the model memorized the answers and the score is fiction. This is exactly why the data plane runs decontamination — substring/n-gram matching the eval sets against the training corpus (data_engineering 06). An undecontaminated benchmark is a thermometer dipped in the boiler.

Contamination quietly inflates every number above it

Leakage of even a few percent of eval items into training can move a benchmark several points — enough to "win" a comparison that is pure memorization. The discipline from lesson 08's data plane is non-negotiable: decontaminate the training corpus against every held-out and benchmark set, and treat any benchmark you can't decontaminate as untrustworthy.
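
The n-gram matching mentioned above can be sketched as follows. This is a toy version under stated assumptions: whitespace tokenization, lowercasing, and an in-memory set of training n-grams. A real corpus would need a scalable index, and the choice of n = 8 is only an illustrative default.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_items: list[str], training_docs: list[str], n: int = 8) -> list[int]:
    """Indices of eval items that share at least one n-gram with the training docs."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & train_grams]


training_docs = ["the quick brown fox jumps over the lazy dog near the river bank today"]
eval_items = [
    "What happens when the quick brown fox jumps over the lazy dog near the river",
    "Compute the sum of the first ten prime numbers and explain each step clearly",
]
print(contaminated_items(eval_items, training_docs))  # prints [0]
```

Flagged items can be dropped from the eval set or the matching documents removed from training; which side to cut is a policy choice this sketch does not make.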

(b) LLM-as-judge — scalable preference, systematically biased

Human preference is the gold standard but costs ~$0.50–$5 per comparison and takes hours. An LLM judge approximates it at ~$0.001–$0.01 per comparison in seconds, so you can score thousands of outputs nightly. The catch: judges carry structured biases that, unhandled, produce confidently wrong rankings.

Bias	What the judge does	Mitigation
position	Favors whichever answer is shown first (A over B)	Randomize order; score both orderings and average
verbosity	Rates longer answers higher regardless of quality	Length-control: regress out length, or cap/normalize tokens
self-preference	Rates its own model family higher	Use a different judge family than the one under test
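
The position-bias mitigation from the table (score both orderings and average) can be sketched like this. The judge interface is hypothetical: any callable that takes a prompt and two answers and returns the probability that the answer shown first is better. The toy judge stands in for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> probability
# in [0, 1] that the answer shown FIRST is the better one.
Judge = Callable[[str, str, str], float]


def position_debiased_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> float:
    """Probability that answer_a is better, averaged over both presentation orders."""
    a_shown_first = judge(prompt, answer_a, answer_b)
    b_shown_first = judge(prompt, answer_b, answer_a)
    # The second call scores answer_b, so flip it to answer_a's perspective.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def toy_biased_judge(prompt: str, first: str, second: str) -> float:
    """Stand-in judge with no real preference and a constant lean toward slot one."""
    return 0.7


# The constant first-slot lean cancels out, leaving an even split.
print(position_debiased_preference(toy_biased_judge, "q", "answer x", "answer y"))
```

Averaging removes a bias that is symmetric across orderings; it costs two judge calls per comparison and does nothing for verbosity or self-preference, which need their own controls.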

Never trust a judge you haven't calibrated against humans

Before a judge gates anything, measure its agreement with human labels on a few hundred items and report Cohen's κ (chance-corrected agreement: κ = 0 is what chance alone would produce, κ = 1 is perfect agreement). A rule of thumb: κ < 0.4 is poor (the judge agrees with humans only modestly more often than chance would), 0.4–0.6 moderate, > 0.6 substantial. Strong LLM judges reach κ ≈ 0.6–0.8 on well-specified rubrics, roughly human-to-human agreement, but only after position-swapping and length-control. An uncalibrated judge is a benchmark of unknown error; you cannot build a control system on an instrument you haven't characterized.
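
As a worked illustration of the calibration step, here is a small, dependency-free Cohen's κ computation. The ten labels are made-up data for the example only; a real calibration set would have a few hundred items, as noted above.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance if each rater labeled independently
    # according to their own label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Illustrative labels: which answer (A or B) each rater preferred on ten items.
human = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
judge = ["A", "A", "B", "A", "A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(human, judge), 3))  # prints 0.583
```

Raw agreement here is 80%, yet κ lands in the moderate band because a 60/40 label split already produces 52% agreement by chance. That gap is the reason to report κ rather than accuracy.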

(c) Human / online A/B — highest fidelity, slowest, reserved for gates

Humans rating outputs, or real users on real traffic, answer the only question that ultimately matters. They are too slow and expensive for iteration, so they sit at the top of the pyramid as the release gate: the last check before a change reaches everyone. §2 covers how to run the online version without burning weeks.

2 · Online A/B with guardrails

The release decision is an experiment. Route a fraction of live traffic to the new model (treatment) against the current model (control), and compare a primary metric — the thing you're trying to improve (task success, user thumbs-up rate, completion rate) — while watching a panel of guardrail metrics that must not regress even if the primary improves.

Role	Examples	Decision rule
Primary	Task success, thumbs-up rate, resolution rate	Must improve with significance to justify shipping
Guardrails	p99 latency (lesson 03), refusal rate, safety-violation rate, cost / request	Must not regress — a guardrail breach blocks the ship even if primary wins

The guardrail panel is where lessons 03 and 09 cash out. A new model that's 2% better at the task but blows the p99 TTFT SLO from lesson 03, or doubles cost/request, or quietly raises the refusal rate, is not a ship — it's a different product with a worse latency contract. You decide guardrail thresholds before the experiment so you can't rationalize a breach after seeing the data.
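
The decision rule in the table can be written down as a small function. This is a sketch under simplifying assumptions: every guardrail is "lower is better", each threshold is an absolute allowed increase fixed before the experiment, and the significance of the primary metric is computed elsewhere and passed in as a flag. All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    control: float         # value measured on the control arm
    treatment: float       # value measured on the treatment arm
    max_regression: float  # allowed absolute increase, fixed before the experiment


def ship_decision(primary_lift: float, primary_significant: bool,
                  guardrails: list[Guardrail]) -> tuple[bool, list[str]]:
    """Ship only if the primary metric wins significantly and no guardrail is breached."""
    breaches = [g.name for g in guardrails if g.treatment - g.control > g.max_regression]
    ship = bool(primary_lift > 0 and primary_significant and not breaches)
    return ship, breaches


# A 2-point primary win is blocked because p99 latency regressed past its threshold.
decision = ship_decision(
    primary_lift=0.02,
    primary_significant=True,
    guardrails=[
        Guardrail("p99_latency_ms", control=800.0, treatment=950.0, max_regression=50.0),
        Guardrail("refusal_rate", control=0.03, treatment=0.031, max_regression=0.005),
    ],
)
print(decision)  # prints (False, ['p99_latency_ms'])
```

Keeping the thresholds in data rather than in someone's head is what makes "decided before the experiment" checkable after the fact.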

Sample size — how long until you know?

Detecting a small effect needs a lot of data. For a rate metric with baseline p (so variance σ² = p(1−p)) and a minimum detectable effect MDE (absolute), at α=0.05 and power=0.8 the per-arm sample size is the standard two-proportion result:

n ≈ 16 · p(1−p) / MDE², where 16 ≈ 2·(z_{α/2} + z_β)² = 2·(1.96 + 0.84)² ≈ 15.7 for α = 0.05, power = 0.8

Halving the MDE you want to detect quadruples n. This ties straight back to lesson 03's traffic numbers: if your service does 1,000 req/s (≈86M/day) you reach huge sample sizes in hours; if the feature only triggers for 1k eligible users/day, the same experiment runs for months — and by then the world has shifted. The full statistical toolkit (CUPED variance reduction, the peeking problem, SRM, ratio metrics) is in deep_learning 10 · A/B testing; the widget below is the sizing core.
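
To see where the constant 16 comes from and how n scales, here is a short sketch that computes the exact z-values with Python's standard library instead of hard-coding them. It uses the same two-proportion approximation as the formula above, with both arms assumed to have variance p(1−p); the function names are illustrative.

```python
import math
from statistics import NormalDist


def z_constant(alpha: float = 0.05, power: float = 0.8) -> float:
    """2 * (z_{alpha/2} + z_beta)^2, the factor the text rounds up to 16."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return 2.0 * (z_alpha + z_beta) ** 2


def per_arm_sample_size(p: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm n for baseline rate p and absolute minimum detectable effect mde."""
    return math.ceil(z_constant(alpha, power) * p * (1.0 - p) / mde ** 2)


print(round(z_constant(), 2))           # about 15.7, rounded up to 16 in the text
n_2pt = per_arm_sample_size(0.6, 0.02)  # 60% baseline, 2-point MDE
n_1pt = per_arm_sample_size(0.6, 0.01)  # same baseline, half the MDE
print(n_2pt, n_1pt, round(n_1pt / n_2pt, 2))
```

For a 60% baseline and a 2-point MDE this gives roughly 9,400 per arm, slightly under the 9,600 the rule of 16 yields, and halving the MDE multiplies it by about four.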

3 · The offline ↔ online gap

The cruel fact of evaluation: offline wins routinely fail to transfer online. A model that gains 1% on an offline benchmark may show no online lift, or may even regress. Two reasons:

Distribution shift. Offline sets are drawn from yesterday's traffic and yesterday's policy. Real users send different prompts, and a deployed model changes the inputs it sees next (the feedback loop of §5). The benchmark measured a world that no longer exists.
Goodhart's law. "When a measure becomes a target, it ceases to be a good measure." Optimize hard against a fixed benchmark and you learn the benchmark's quirks, not the underlying capability — especially once §1's saturation sets in.

The discipline that resolves the gap

Offline eval gates what is allowed to ship; online A/B decides what does ship. Offline is a cheap, fast filter that catches obvious regressions before you spend a single user-impression — necessary, never sufficient. Online is the slow, faithful arbiter for anything that clears the filter. Confusing the two — shipping on an offline win, or A/B-testing every commit — is the most common eval-design mistake.

4 · Regression gating — the release contract

Make the offline filter a literal CI gate. A deploy is blocked unless the candidate clears fixed thresholds on the regression suite and introduces no guardrail regression. This is the model's equivalent of "tests must pass before merge" — and it connects directly to the deploy machinery in lesson 11.

The thresholds are a contract: no metric in the suite drops by more than X, no safety metric drops at all. Why this matters in dollars: a bad model reaching production can mean an emergency rollback, hours of degraded user experience across millions of requests, and — if it's a safety regression — reputational damage that dwarfs the GPU bill. The gate is cheap; a bad deploy is not. The cost asymmetry is the entire argument for paying the offline-eval tax on every change.
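
The contract above (no metric drops by more than X, no safety metric drops at all) can be sketched as a CI check. The assumptions here are that every metric is "higher is better" and that scores arrive as plain dictionaries; the metric names, the value of X, and the function name are hypothetical.

```python
def gate_failures(baseline: dict[str, float], candidate: dict[str, float],
                  max_drop: float, safety_metrics: set[str]) -> list[str]:
    """Reasons to block the deploy; an empty list means the gate passes."""
    failures = []
    for name, base_score in baseline.items():
        if name not in candidate:
            failures.append(f"{name}: missing from candidate")
            continue
        drop = base_score - candidate[name]
        # Safety metrics may not drop at all; everything else may drop by max_drop.
        allowed = 0.0 if name in safety_metrics else max_drop
        if drop > allowed:
            failures.append(f"{name}: dropped {drop:.4f}, allowed {allowed:.4f}")
    return failures


failures = gate_failures(
    baseline={"math_exact_match": 0.82, "safety_pass_rate": 0.99},
    candidate={"math_exact_match": 0.815, "safety_pass_rate": 0.985},
    max_drop=0.01,
    safety_metrics={"safety_pass_rate"},
)
# The math drop is within tolerance; the safety drop blocks the deploy.
print(failures)
```

In a CI job the caller would exit non-zero whenever the list is non-empty, which is what turns the thresholds into a merge-blocking gate.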

5 · The data flywheel — the compounding loop

Here is what turns a live system into a self-improving one, and where lessons 08 and 09 reconnect. Production traffic is not just load to serve — it is the richest dataset you will ever have, generated for free by real users on the real distribution.

Serve → log → curate → retrain → deploy → more traffic. The logged prompts, outputs, and user signals (thumbs, edits, abandonment, downstream success) become the next SFT set, the next preference pairs, the next RL prompt distribution — feeding the RL flywheel of lesson 09 through the data plane of lesson 08 (and its online variant, data_engineering 10). Each turn lands a better model on a fresher distribution; the gap to a competitor without your traffic widens every cycle. That compounding is the durable moat.

Three things that corrode the flywheel

Bias amplification: the deployed model shapes the very data that trains its successor, so any bias becomes self-reinforcing; the model teaches its replacement its own blind spots. Counter it by mixing in held-out and exploration traffic.
Privacy / consent: logged user data carries legal and ethical constraints (PII, consent scope, retention), so the data plane must redact and honor opt-outs before anything reaches training.
Labeling is the rate-limiter: human curation/labeling is the slow, expensive step that caps how fast the wheel turns; this is precisely why LLM-as-judge (§1) exists, to scale the labeling that would otherwise bottleneck the entire loop.
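
As one illustration of the privacy constraint, the curate step might filter and redact logs before they become training data. This is a deliberately small sketch: the record fields (`user_id`, `prompt`, `output`, `signal`) are hypothetical, and the single email regex stands in for a real PII pipeline, which would cover far more than email addresses.

```python
import re

# Illustrative only: one pattern for email addresses, not a complete PII detector.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def curate(records: list[dict], opted_out: set[str]) -> list[dict]:
    """Drop records from opted-out users and redact emails from the rest."""
    curated = []
    for record in records:
        if record["user_id"] in opted_out:
            continue  # honor the opt-out before anything reaches training
        curated.append({
            # user_id is intentionally not carried forward into the training set
            "prompt": EMAIL_PATTERN.sub("[EMAIL]", record["prompt"]),
            "output": EMAIL_PATTERN.sub("[EMAIL]", record["output"]),
            "signal": record["signal"],
        })
    return curated


logs = [
    {"user_id": "u1", "prompt": "mail me at jo@example.com please", "output": "ok", "signal": "thumbs_up"},
    {"user_id": "u2", "prompt": "summarize this", "output": "done", "signal": "abandoned"},
]
print(curate(logs, opted_out={"u2"}))
```

Running the filter inside the data plane, rather than at training time, keeps unredacted text from ever landing in a training shard.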

Interactive · A/B significance — how long until you know?

Set the baseline success rate, the effect you need to detect, your daily eligible traffic (lesson 03's λ), and the fraction routed to treatment. The widget computes the per-arm sample size, the total, and how many days until the result is significant — then tells you whether your eval or your model is the bottleneck.

Assumptions: two-proportion z-test with α=0.05 and power=0.8 fixed. Per-arm n ≈ 16·p(1−p)/MDE² (the 16 ≈ (z_α/2+z_β)²·2). Total = 2n. Here frac is the share of daily traffic routed to treatment, and control receives an equal share, so the experiment accrues daily·frac·2 samples per day across the two arms. Days = total / (daily · frac · 2), which is the same as n / (daily · frac). MDE is absolute (percentage points). Order-of-magnitude only, per lesson 02's ±30% contract; for real launches add CUPED and a fixed horizon (deep_learning 10).

Default inputs: baseline success rate 60%; minimum detectable effect 2% (absolute); daily eligible requests 100,000; fraction routed to treatment 50%.

Outputs: n per arm, total samples, days to significance, verdict.
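
The widget's arithmetic can be reproduced offline with a few lines. This sketch follows the stated assumptions (rule of 16, total = 2n, days = total / (daily · frac · 2)). The verdict rule is a guess: the 14-day cutoff is a hypothetical threshold chosen for illustration, not the widget's actual logic.

```python
def time_to_decision(baseline_rate: float, mde_abs: float, daily_requests: float,
                     frac_treatment: float, max_days: float = 14.0) -> dict:
    """Per-arm n, total samples, days to reach them, and a coarse verdict."""
    n_per_arm = 16 * baseline_rate * (1 - baseline_rate) / mde_abs ** 2
    total = 2 * n_per_arm
    # Control is assumed to get the same share as treatment, hence the factor of 2.
    days = total / (daily_requests * frac_treatment * 2)
    # Hypothetical rule: an experiment longer than max_days means eval is the bottleneck.
    verdict = "eval is the bottleneck" if days > max_days else "model is the bottleneck"
    return {"n_per_arm": n_per_arm, "total": total, "days": days, "verdict": verdict}


# Defaults: 60% baseline, 2-point MDE, 100,000 requests/day, 50% to treatment.
default_case = time_to_decision(0.60, 0.02, 100_000, 0.50)
# Thin traffic: the same experiment with only 1,000 eligible requests/day.
thin_case = time_to_decision(0.60, 0.02, 1_000, 0.50)
print(default_case)
print(thin_case)
```

With the defaults the experiment needs about 9,600 samples per arm and finishes in about a fifth of a day; at 1,000 eligible requests per day the same experiment takes roughly 19 days.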

What carries forward

Eval is a system with SLOs — cheap enough per-PR, reproducible, low-variance. A lying control system ships regressions with a green check.
Three tiers by cost & fidelity: offline static (every commit, but saturates and leaks — decontaminate per lesson 08), LLM-as-judge (scalable, biased — randomize position, length-control, cross-family judge, calibrate to κ>0.6 against humans), human/online A/B (the gold standard, reserved as the release gate).
Online A/B compares treatment against control on a primary metric while watching guardrails (p99 latency from lesson 03, refusal rate, safety, cost/req) that must not regress. Sample size n ≈ 16·p(1−p)/MDE² per arm; small effects or thin traffic mean weeks to months.
Offline ↔ online gap is real (distribution shift, Goodhart): offline gates what's allowed to ship, online decides what does ship.
Regression gating is the release contract — the literal CI gate for models, blocking deploy on any threshold or guardrail breach (lesson 11). The gate is cheap; a bad deploy costs rollbacks, degraded UX across millions of requests, and trust.
The data flywheel (serve → log → curate → retrain → deploy) feeds lessons 09 and 08 and is the compounding moat — bounded by bias amplification, privacy/consent, and human labeling as the rate-limiter.