Evaluation — offline metrics and A/B tests
You've built the ranker. Now the harder question: is it actually better? Offline metrics filter what gets tested. A/B tests decide what ships. Senior candidates quote both numbers when they discuss a launch.
Offline metrics — what each actually measures
Every offline metric is a summary statistic on logged predictions. The choice of metric is a claim about what kind of mistake matters. Wrong claim, wrong prediction of the A/B test.
| Metric | Definition | Measures | Blind to |
|---|---|---|---|
| ROC-AUC | P(score(x⁺) > score(x⁻)) over a random pos/neg pair. | Pure rank quality. Threshold-free. | Calibration. Magnitudes. Class imbalance. |
| PR-AUC | Area under precision-recall curve. | Rank quality when positives are rare. More sensitive to imbalance than ROC-AUC. | Calibration. Comparison across datasets with different base rates. |
| Log-loss | −[y log p̂ + (1−y) log(1 − p̂)] | The training loss. Sensitive to calibration — confidently-wrong predictions are heavily penalised. | Rank-only improvements when calibration is already tight. |
| NDCG@K | DCG/IDCG, with DCG = Σ (2^rel_i − 1)/log₂(i+1). | Graded relevance with position discount. Standard for 5-star / ad-quality-bucket labels. | Calibration. Score gap between adjacent positions. |
| MAP | Mean over queries of average precision. | Binary-relevance rank quality averaged across queries. | Graded relevance. |
| MRR | 1 / rank_of_first_relevant, averaged. | Single-correct-answer tasks (definitional search). | Everything past the first hit. Useless for browsing. |
| Recall@K | Fraction of all relevant items that appear in the top K. | The only metric retrieval should care about. | Order within the K. Precision. |
| ECE | Frequency-weighted average of |emp − pred| over bins (each bin weighted by the number of predictions falling in it). | Calibration — whether p̂ = 0.3 matches a 30% empirical rate. | Rank order. A perfectly calibrated random predictor has ECE ≈ 0. |
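The rank metrics in the table are short enough to write out. A minimal sketch, assuming `relevances` is the list of graded labels in the order the ranker returned them, and assuming that list contains every relevant item for the query (otherwise the ideal DCG and the recall denominator are too small). The function names and the sample labels are illustrative, not from any particular library.

```python
import math

def dcg_at_k(relevances, k):
    """DCG with gain (2^rel - 1) and discount log2(i + 1), positions i = 1..k."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG of the served order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant item; 0 if nothing relevant was returned."""
    for i, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0

def recall_at_k(relevances, k):
    """Share of all relevant items that landed in the top k."""
    total = sum(1 for rel in relevances if rel > 0)
    hits = sum(1 for rel in relevances[:k] if rel > 0)
    return hits / total if total else 0.0

# Hypothetical graded labels (0 = irrelevant, 3 = best) in served order.
ranked = [3, 0, 2, 0, 1]
print("NDCG@3  ", round(ndcg_at_k(ranked, 3), 4))
print("RR      ", reciprocal_rank(ranked))
print("Recall@3", round(recall_at_k(ranked, 3), 4))
```

MRR and MAP are these per-query values averaged over queries. Note what each function ignores: `recall_at_k` never looks at order inside the top k, and `reciprocal_rank` stops at the first hit.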
Common interview misstep: treating AUC and log-loss as interchangeable. AUC is invariant to monotone transforms — great AUC and broken calibration is fatal for any auction or threshold-dependent system. Log-loss penalises miscalibration directly. Downstream is an auction (lesson 9)? Care about log-loss and ECE. Pure-sort feed? AUC and NDCG.
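The invariance claim is easy to check numerically. The sketch below uses synthetic data (an assumption: true click probabilities drawn uniformly between 2% and 30%) and compares a calibrated score with a monotone distortion of it (a square root, chosen only for illustration). AUC should not move; log-loss and ECE should get worse.

```python
import math
import random

def roc_auc(labels, scores):
    """P(score(pos) > score(neg)) over all pos/neg pairs; ties count 0.5. O(n^2) demo."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def ece(labels, probs, n_bins=10):
    """Frequency-weighted mean of |empirical rate - mean prediction| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    err = 0.0
    for b in bins:
        if b:
            emp = sum(y for y, _ in b) / len(b)
            pred = sum(p for _, p in b) / len(b)
            err += len(b) / len(labels) * abs(emp - pred)
    return err

rng = random.Random(0)
true_p = [rng.uniform(0.02, 0.30) for _ in range(2000)]
labels = [1 if rng.random() < p else 0 for p in true_p]
calibrated = true_p                            # scores equal the true probabilities
distorted = [math.sqrt(p) for p in true_p]     # same ordering, inflated probabilities

for name, probs in [("calibrated", calibrated), ("distorted", distorted)]:
    print(name, round(roc_auc(labels, probs), 4),
          round(log_loss(labels, probs), 4), round(ece(labels, probs), 4))
```

A model selected on AUC alone cannot tell these two scorers apart, which is exactly the failure mode when the score feeds a bid or a threshold.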
Ranking: NDCG@K (if graded labels) or AUC (if binary), plus log-loss and ECE if the score feeds an auction.
Re-ranking: list-level metrics — NDCG with a diversity discount, list-CTR, or a learned listwise loss. Per-item metrics miss the inter-item effects that re-rank exists to handle.
The fundamental problem with offline metrics
Every offline metric is computed on logged data generated by the old ranker. Two consequences:
- You can only score items the old ranker showed. If the new ranker would surface an item the old one never put in front of users, there's no label. Offline metrics are biased toward changes that re-order items the old policy already surfaced.
- The metric is a proxy. CTR-optimised offline metrics can rise while watch-time falls. NDCG can rise while diversity collapses. Every offline metric is one summary of behaviour; the business cares about long-term retention.
Offline metrics are a filter, not a decision. They tell you whether the model earns A/B-test traffic. A/B tests tell you whether to ship.
Counterfactual / off-policy evaluation
If you've trained a new ranker but haven't A/B tested, you can estimate online performance from logged data via importance-weighted estimators (the IPS machinery of lesson 7). For a logged tuple (context, action, reward) with old-policy propensity π_old(a | context):
Ê[reward | π_new] = (1/N) · Σ reward · π_new(a | context) / π_old(a | context)
- Requires logged propensities. You needed to write down π_old(a | context) at serving time. Post-hoc estimation adds bias.
- High variance. When π_new and π_old disagree strongly, weights blow up. Doubly-robust estimators (Dudík et al. 2011) combine IPS with a learned reward model — unbiased if either propensity or reward model is correct.
Off-policy evaluation never replaces an A/B test for shipping. It reduces how many candidate A/B tests you have to run.
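A toy sketch of both estimators, under loudly stated assumptions: three actions, no real context, known click rates, and a deliberately crude reward model that predicts 0.10 for everything. All names (`ips_estimate`, `doubly_robust_estimate`, `PI_OLD`, `PI_NEW`) are hypothetical. Because the logged propensities are correct, both estimates should land near the new policy's true value even though the reward model is wrong.

```python
import random

def ips_estimate(logs, pi_new):
    """logs: (context, action, reward, logged propensity under the old policy)."""
    return sum(r * pi_new(a, x) / p_old for x, a, r, p_old in logs) / len(logs)

def doubly_robust_estimate(logs, pi_new, reward_model, actions):
    """Reward-model estimate plus an importance-weighted correction on its residual."""
    total = 0.0
    for x, a, r, p_old in logs:
        direct = sum(pi_new(b, x) * reward_model(x, b) for b in actions)
        correction = pi_new(a, x) / p_old * (r - reward_model(x, a))
        total += direct + correction
    return total / len(logs)

# Synthetic setup (assumed values, not from any real system).
ACTIONS = [0, 1, 2]
TRUE_CTR = [0.05, 0.10, 0.20]
PI_OLD = [0.6, 0.3, 0.1]   # logging policy rarely shows the best action
PI_NEW = [0.2, 0.3, 0.5]   # candidate policy favours it

def pi_new(action, context):
    return PI_NEW[action]

def reward_model(context, action):
    return 0.10            # crude on purpose: same prediction for every action

rng = random.Random(0)
logs = []
for _ in range(20000):
    a = rng.choices(ACTIONS, weights=PI_OLD)[0]
    r = 1.0 if rng.random() < TRUE_CTR[a] else 0.0
    logs.append((None, a, r, PI_OLD[a]))   # propensity written down at serving time

true_value = sum(p * c for p, c in zip(PI_NEW, TRUE_CTR))
ips = ips_estimate(logs, pi_new)
dr = doubly_robust_estimate(logs, pi_new, reward_model, ACTIONS)
print("true", round(true_value, 4), "IPS", round(ips, 4), "DR", round(dr, 4))
```

The largest weight here is 0.5 / 0.1 = 5. Push `PI_OLD[2]` toward zero and the same code shows the variance problem: a handful of logged impressions end up carrying most of the estimate.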
A/B testing — the gold standard, with traps
Randomly assign users to treatment (new ranker) and control (old ranker). Measure the metric in each arm. Run a statistical test. The mechanics are simple; the traps are subtle and most of the questions interviewers ask are about traps.
Power and minimum detectable effect
For a two-sample test of means with per-arm sample size N and per-user metric standard deviation σ, the MDE at 80% power and 5% two-sided significance is
MDE ≈ (z_{α/2} + z_{β}) · σ · √(2/N) ≈ 2.8 · σ · √(2/N)
(For a Bernoulli click-rate metric, σ² = p(1−p), so the absolute MDE ≈ 2.8·√(2p(1−p)/N), with N users per arm.)
This is the trap that gets juniors. A 0.5% relative lift on a 5% baseline CTR is an absolute difference of 0.025 percentage points, and detecting it needs roughly 12 million users per arm. Underestimating sample size by 10× is the modal mistake. The reflex when asked "how long would you run this?" is to compute the MDE first.
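The formula inverts directly into a sample-size calculation. A minimal sketch using only the standard library; the function names and the `daily_users` figure are assumptions for illustration, and the variance uses the baseline rate for both arms, as in the approximation above.

```python
import math
from statistics import NormalDist

def z_sum(alpha=0.05, power=0.80):
    """z_{alpha/2} + z_{beta}; about 2.8 at 5% two-sided significance and 80% power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def mde_absolute(sigma, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute difference in means detectable with n_per_arm users per arm."""
    return z_sum(alpha, power) * sigma * math.sqrt(2 / n_per_arm)

def n_per_arm_for_lift(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Users per arm needed to detect a relative lift on a Bernoulli rate."""
    delta = baseline_rate * relative_lift            # absolute effect
    var = baseline_rate * (1 - baseline_rate)        # sigma^2 = p(1 - p)
    return math.ceil(2 * z_sum(alpha, power) ** 2 * var / delta ** 2)

n_needed = n_per_arm_for_lift(baseline_rate=0.05, relative_lift=0.005)
daily_users = 1_000_000                              # hypothetical traffic, split across two arms
days = math.ceil(2 * n_needed / daily_users)
print("users per arm:", n_needed, "days at 1M users/day:", days)
print("MDE with 1M users per arm:", mde_absolute(math.sqrt(0.05 * 0.95), 1_000_000))
```

Days of traffic is the quantity to quote in an interview, and it should be rounded up to whole weeks so weekday cycles are covered evenly.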
Type I and Type II errors
Running 20 simultaneous tests at α = 0.05 produces, in expectation, one false positive when every null hypothesis is true. Mitigations:
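- Bonferroni / Holm-Bonferroni: Bonferroni tests every hypothesis at α divided by the number of tests; Holm's step-down variant relaxes the threshold after each rejection and is never less powerful. Both control FWER, at a real cost in power.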
- FDR (Benjamini-Hochberg): controls expected false-discovery rate, not family-wise error. Right knob for screening many variants.
- Always-valid p-values / sequential testing (Howard et al. 2018): peek continuously without inflating Type I. Mandatory at any lab running thousands of experiments.
Type II — under-powered tests missing real effects — is the quieter failure. Teams running 1-week tests on low-traffic surfaces routinely declare "flat" when the true effect was real but undetectable.
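The three corrections differ only in the threshold each sorted p-value is compared against. A minimal sketch; the p-values are made up for illustration, and in practice a stats library would be the safer choice.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down: compare the k-th smallest p to alpha / (m - k + 1); stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):             # rank = 0 for the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """Step-up: find the largest k with p_(k) <= q * k / m and reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject

p_values = [0.001, 0.012, 0.02, 0.035, 0.30]     # hypothetical results for five variants
n_bonf = sum(bonferroni(p_values))
n_holm = sum(holm(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print("rejections:", n_bonf, n_holm, n_bh)

# Chance of at least one false positive across 20 independent true-null tests.
fwer_uncorrected = 1 - (1 - 0.05) ** 20
print("uncorrected FWER for 20 tests:", round(fwer_uncorrected, 3))
```

On these inputs the ordering is the usual one: Bonferroni rejects the fewest, Holm at least as many, Benjamini-Hochberg the most, because it controls a weaker guarantee.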
Variance reduction — the cheapest power upgrade
| Technique | How it works | Reduction |
|---|---|---|
| CUPED (Deng et al. 2013) | Replace Y by Y − θ(X − E[X]) with pre-experiment covariate X and θ = Cov(Y,X)/Var(X). | 30–50% on most user-level engagement metrics. Free power. |
| Stratification | Pre-stratify by covariate (country, device, heavy/light user); analyse within strata. | Modest; composable with CUPED. |
| Triggered analysis | Only count users who hit the changed code path. | Large when trigger rate is low. Careful definition needed to avoid post-treatment selection bias. |
CUPED is standard at every major experimentation platform. Naming it signals you've shipped experiments at scale.
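CUPED fits in a dozen lines. The sketch below runs on simulated users, with assumed parameters: a pre-experiment metric X, an in-experiment metric Y correlated with it, and a true treatment effect of 0.2. θ and the mean of X are estimated on the pooled data from both arms, which is one common choice and not the only one.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)) with theta = Cov(Y, X) / Var(X)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)
n = 4000                       # users per arm (simulated)
true_effect = 0.2
pre, post, arm = [], [], []
for i in range(2 * n):
    treated = i % 2 == 1       # alternate assignment; X is drawn before treatment
    x = rng.gauss(10, 3)       # pre-experiment engagement
    y = 0.8 * x + rng.gauss(0, 2) + (true_effect if treated else 0.0)
    pre.append(x)
    post.append(y)
    arm.append(treated)

adjusted = cuped_adjust(post, pre)

def diff_in_means(values):
    t = [v for v, is_t in zip(values, arm) if is_t]
    c = [v for v, is_t in zip(values, arm) if not is_t]
    return mean(t) - mean(c)

raw_lift = diff_in_means(post)
cuped_lift = diff_in_means(adjusted)
variance_ratio = variance(adjusted) / variance(post)
print("raw lift", round(raw_lift, 3), "CUPED lift", round(cuped_lift, 3),
      "variance ratio", round(variance_ratio, 3))
```

The variance ratio is approximately 1 − ρ², where ρ is the correlation between X and Y. The adjustment leaves the overall mean unchanged, and because X is fixed before assignment it cannot absorb any of the treatment effect.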
Network effects, novelty, primacy
| Effect | What goes wrong | Mitigation |
|---|---|---|
| Network interference | Treatment users influence control users (sharing a video). Effect leaks across the assignment boundary, biasing toward zero. | Cluster-level randomization, ego-cluster designs, side-by-side surfaces, or bound the bias. |
| Novelty bias | Users initially engage more with anything new. The week-1 lift is partly novelty, not the model. | Run for weeks; report the trajectory. Or burn-in: discard the first N days. |
| Primacy effect | Opposite — users initially worse because they're used to the old thing, then adapt. Underestimates real effect. | Same — long enough experiments, examine the trajectory. |
| Twyman's law | "+50% CTR" wins are almost always instrumentation: misrouting, double-counting, broken control. Real ranker wins are 0.1–2%. | Sample-ratio-mismatch (SRM) checks, A/A tests, instrumentation review before believing any large effect. |
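The SRM check in the last row is a one-degree-of-freedom chi-square test on the arm counts, which needs nothing beyond the standard library. A minimal sketch; the counts are invented, and the strict 0.001 alert threshold is an assumption, not a standard.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed split vs. the designed split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001                          # assumed alerting threshold

healthy = srm_p_value(500_000, 500_700)        # small imbalance, plausible by chance
broken = srm_p_value(500_000, 505_000)         # 1% imbalance on a 50/50 design

for name, p in [("healthy", healthy), ("broken", broken)]:
    verdict = "SRM: distrust the result" if p < SRM_THRESHOLD else "ok"
    print(name, p, verdict)
```

A failed SRM check says assignment or logging is broken. The right response is to stop reading the metric deltas, not to adjust them.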
Long-term metrics
Short-term CTR is easy; long-term DAU and retention are what the business cares about. The two regularly disagree — a CTR-optimised ranker that shortens sessions (clickbait → dissatisfied user closes the app) wins the short experiment and loses the company.
Standard pattern: a long-term holdout keeps a fixed cohort on the pre-launch experience for 3–6 months after the regular A/B test concludes; compare retention and revenue trajectories. Most orgs run one product-wide holdout with new launches stacking against it, rather than one per experiment.
Trade-offs at a glance
| | AUC | Log-loss | NDCG@K | A/B test |
|---|---|---|---|---|
| Latency to result | Minutes (offline) | Minutes | Minutes | 1–8 weeks |
| Cost | Trivial | Trivial | Trivial | Engineering + opportunity cost of traffic |
| Calibration-sensitive? | No | Yes | No (rank only) | Indirectly (auctions, thresholding) |
| Position-bias-sensitive? | Yes — logged labels are biased | Yes | Yes | No — randomisation breaks the link |
| Counterfactual to old policy? | No (logged data only) | No | No | Yes — by construction |
| Decision risk | Misleads launches | Misses rank issues | Misses calibration | The actual launch criterion |
Interview prompts you should be ready for
- "Offline NDCG@10 improved 3%; A/B test is flat. Diagnose." (Probes: retrieval ↔ ranker mismatch, position bias in labels, lift on items the old ranker never surfaced, novelty masking, offline metric not matching production scoring, re-rank overrides.)
- "Walk me through CUPED. Why does it work?" (Probes: pre-experiment covariate fixed before treatment so estimator is unbiased; variance reduction proportional to ρ² between pre- and in-experiment metric.)
- "How do you decide how long to run an A/B test?" (Probes: power calculation from baseline and σ, novelty burn-in, weekday cycles, multiple-tests adjustment.)
- "Why is AUC sometimes misleading as a model-selection metric?" (Probes: invariance to monotone transforms → blind to calibration; insensitive to where in the ranking the improvement happens; matters for auctions.)
- "New ranker has big A/B lift in week 1; week 4 flat. What's happening?" (Probes: novelty decay, instrumentation broke, traffic mix shifted, feature distribution drift, contaminated holdout.)
- "Test 10 ranker variants — naive Bonferroni or something smarter?" (Probes: Bonferroni is fine for small numbers but conservative; FDR for screening; sequential/bandit for adaptive allocation; tournament against a champion.)
- "You can't run an A/B test. How do you evaluate?" (Probes: off-policy IPS/doubly-robust, interleaving, shadow scoring, human raters, replay on held-out logs.)