Evaluation — offline metrics and A/B tests

You've built the ranker. Now the harder question: is it actually better? Offline metrics filter what gets tested. A/B tests decide what ships. Senior candidates quote both numbers when they discuss a launch.

Offline metrics — what each actually measures

Every offline metric is a summary statistic on logged predictions. The choice of metric is a claim about what kind of mistake matters. Wrong claim, wrong prediction of the A/B test.

Metric	Definition	Measures	Blind to
ROC-AUC	P(score(x⁺) > score(x⁻)) over random pos/neg pair.	Pure rank quality. Threshold-free.	Calibration. Magnitudes. Class imbalance.
PR-AUC	Area under precision-recall curve.	Rank quality when positives are rare. More sensitive to imbalance than ROC-AUC.	Calibration. Comparison across datasets with different base rates.
Log-loss	−[y log p̂ + (1−y) log(1 − p̂)]	The training loss. Sensitive to calibration — confidently-wrong predictions are heavily penalised.	Rank-only improvements when calibration is already tight.
NDCG@K	DCG/IDCG, with DCG = Σ (2^rel_i − 1)/log₂(i+1).	Graded relevance with position discount. Standard for 5-star / ad-quality-bucket labels.	Calibration. Score gap between adjacent positions.
MAP	Mean over queries of average precision.	Binary-relevance rank quality averaged across queries.	Graded relevance.
MRR	1 / rank_of_first_relevant, averaged.	Single-correct-answer tasks (definitional search).	Everything past the first hit. Useless for browsing.
Recall@K	Fraction of all relevant items that appear in the top K.	The only metric retrieval should care about.	Order within the K. Precision.
ECE	Frequency-weighted average of |emp − pred| over bins (each bin weighted by the number of predictions falling in it).	Calibration: whether p̂ = 0.3 matches a 30% empirical rate.	Rank order. A predictor that outputs the base rate for every item is perfectly calibrated (ECE ≈ 0) yet ranks no better than chance.

Common interview misstep: treating AUC and log-loss as interchangeable. AUC is invariant to strictly increasing transforms of the score, so a model can have excellent AUC and badly broken calibration, which is fatal for any auction or threshold-dependent system. Log-loss penalises miscalibration directly. If the score feeds an auction downstream (lesson 9), care about log-loss and ECE. For a pure-sort feed, watch AUC and NDCG.
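The AUC versus log-loss distinction can be made concrete with a small sketch. The toy dataset, the function names, and the `p ** 4` distortion are all illustrative assumptions: the scores start out exactly calibrated, then a strictly increasing transform is applied. Rank order, and therefore AUC, is unchanged, while log-loss and ECE both get worse.

```python
import math

def auc(labels, scores):
    """P(score(pos) > score(neg)) over all pos/neg pairs; ties count as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, probs, eps=1e-12):
    """Mean of -[y log p + (1 - y) log(1 - p)], with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def ece(labels, probs, n_bins=10):
    """Frequency-weighted |empirical rate - mean prediction| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    total = 0.0
    for b in bins:
        if b:
            emp = sum(y for y, _ in b) / len(b)
            pred = sum(p for _, p in b) / len(b)
            total += len(b) / len(labels) * abs(emp - pred)
    return total

# Toy data (assumed): ten items per score level, with exactly that share of positives.
labels, probs = [], []
for p, n_pos in [(0.1, 1), (0.3, 3), (0.5, 5), (0.7, 7), (0.9, 9)]:
    labels += [1] * n_pos + [0] * (10 - n_pos)
    probs += [p] * 10

# Strictly increasing distortion: same ordering, very different probabilities.
squashed = [p ** 4 for p in probs]

for name, scores in [("calibrated", probs), ("squashed", squashed)]:
    print(name, round(auc(labels, scores), 4),
          round(log_loss(labels, scores), 4), round(ece(labels, scores), 4))
```

A model selected on AUC alone cannot tell these two score vectors apart; an auction that multiplies the score by a bid can.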

Which metric for which stage

Retrieval: recall@K is the only metric that matters. Order within the K doesn't; the ranker fixes that.
Ranking: NDCG@K (if graded labels) or AUC (if binary), plus log-loss and ECE if the score feeds an auction.
Re-ranking: list-level metrics — NDCG with a diversity discount, list-CTR, or a learned listwise loss. Per-item metrics miss the inter-item effects that re-rank exists to handle.
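As a rough illustration of the retrieval and ranking metrics named above, here is a minimal sketch of recall@K and NDCG@K, using the DCG gain (2^rel − 1)/log₂(i+1) from the table. The item ids, relevance grades, and function names are made up for the example.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k (order ignored)."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

def dcg_at_k(gains, k):
    """Sum of (2^rel - 1) / log2(position + 1) over the first k positions."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

relevant = {"b", "e", "z"}                        # hypothetical ground truth
retrieved = ["a", "b", "c", "d"]                  # hypothetical candidate list
reordered = ["c", "b", "a", "d"]                  # same top 3, different order

recall_original = recall_at_k(retrieved, relevant, 3)
recall_reordered = recall_at_k(reordered, relevant, 3)

best_first = ndcg_at_k([3, 2, 1, 0], 4)           # grades already sorted
worst_first = ndcg_at_k([0, 1, 2, 3], 4)          # same grades, reversed

print(recall_original, recall_reordered, best_first, round(worst_first, 4))
```

Recall@K does not move when the top K is shuffled, which is why it suits retrieval; NDCG@K drops as soon as high grades slide down the list, which is why it suits ranking.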

The fundamental problem with offline metrics

Every offline metric is computed on logged data generated by the old ranker. Two consequences:

You can only score items the old ranker showed. If the new ranker would surface an item the old one never put in front of users, there's no label. Offline metrics are biased toward changes that re-order items the old policy already surfaced.
The metric is a proxy. CTR-optimised offline metrics can rise while watch-time falls. NDCG can rise while diversity collapses. Every offline metric is one summary of behaviour; the business cares about long-term retention.

Offline metrics are a filter, not a decision. They tell you whether the model earns A/B-test traffic. A/B tests tell you whether to ship.

Counterfactual / off-policy evaluation

If you've trained a new ranker but haven't A/B tested, you can estimate online performance from logged data via importance-weighted estimators (the IPS machinery of lesson 7). For a logged tuple (context, action, reward) with old-policy propensity π_old(a | context):

Ê[reward | π_new] = (1/N) · Σ reward · π_new(a | context) / π_old(a | context)

Requires logged propensities. π_old(a | context) has to be recorded at serving time; estimating it after the fact adds bias.
High variance. When π_new and π_old disagree strongly, weights blow up. Doubly-robust estimators (Dudík et al. 2011) combine IPS with a learned reward model — unbiased if either propensity or reward model is correct.

Off-policy evaluation never replaces an A/B test for shipping. It reduces how many candidate A/B tests you have to run.
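The IPS formula above, and the doubly-robust variant, can be sketched in a few lines. Everything concrete here is an assumption for illustration: the log format `(context, action, reward, propensity_old)`, the `pi_new(action, context)` signature, the two-action toy problem, and the deliberately wrong constant reward model.

```python
def ips_estimate(logs, pi_new):
    """Mean of reward * pi_new(a | context) / pi_old(a | context) over logged tuples."""
    total = 0.0
    for context, action, reward, p_old in logs:
        total += reward * pi_new(action, context) / p_old
    return total / len(logs)

def doubly_robust_estimate(logs, pi_new, reward_model, actions):
    """Model-based value of pi_new plus an IPS correction on the model's residual."""
    total = 0.0
    for context, action, reward, p_old in logs:
        # Direct-method term: expected modelled reward under the new policy.
        direct = sum(pi_new(a, context) * reward_model(context, a) for a in actions)
        # Importance-weighted residual for the action that was actually logged.
        residual = reward - reward_model(context, action)
        total += direct + pi_new(action, context) / p_old * residual
    return total / len(logs)

# Toy setup (assumed): action "A" always pays 1, action "B" always pays 0.
# The old policy chose uniformly, so every logged propensity is 0.5.
actions = ["A", "B"]
logs = [("ctx", "A", 1.0, 0.5), ("ctx", "A", 1.0, 0.5),
        ("ctx", "B", 0.0, 0.5), ("ctx", "B", 0.0, 0.5)]

def pi_new(action, context):
    # Hypothetical new policy: picks "A" 90% of the time, so its true value is 0.9.
    return 0.9 if action == "A" else 0.1

def bad_reward_model(context, action):
    # Deliberately wrong model: predicts 0.5 for every action.
    return 0.5

ips_value = ips_estimate(logs, pi_new)
dr_value = doubly_robust_estimate(logs, pi_new, bad_reward_model, actions)
print(ips_value, dr_value)
```

With correct propensities the doubly-robust estimate lands on the true value even though the reward model is wrong, which is the "either one correct" property in action. In practice the weights `pi_new / p_old` are often clipped or self-normalised to tame variance, at the cost of some bias.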

A/B testing — the gold standard, with traps

Randomly assign users to treatment (new ranker) and control (old ranker). Measure the metric in each arm. Run a statistical test. The mechanics are simple; the traps are subtle and most of the questions interviewers ask are about traps.

Power and minimum detectable effect

For a two-sample test of means with per-arm sample size N and per-user metric standard deviation σ, the MDE at 80% power and 5% two-sided significance is

MDE ≈ (z_{α/2} + z_{β}) · σ · √(2/N) ≈ 2.8 · σ · √(2/N)

(For a Bernoulli click-rate metric, σ² = p(1−p), so MDE ≈ 2.8·√(2p(1−p)/N) in absolute terms, with N users per arm.)

This is the trap that gets juniors. Detecting a 0.5% relative lift on a 5% baseline CTR means an absolute difference of δ = 0.00025, and solving the formula for N gives N ≈ 2·2.8²·p(1−p)/δ² ≈ 12 million users per arm. Underestimating sample size by 10× is the modal mistake. The reflex when asked "how long would you run this?" is to compute the MDE first.
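The MDE formula can be inverted to get the required sample size. The sketch below assumes the same setup as the text (two-sided z-test, equal arm sizes, σ² = p(1−p) in both arms); the function names are illustrative.

```python
import math
from statistics import NormalDist

def z_multiplier(alpha=0.05, power=0.80):
    """z_{alpha/2} + z_{beta}; about 2.8 at 5% two-sided significance, 80% power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def mde_absolute(sigma, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute difference in means detectable with n_per_arm users."""
    return z_multiplier(alpha, power) * sigma * math.sqrt(2 / n_per_arm)

def required_n_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Users per arm needed to detect a relative lift on a Bernoulli metric."""
    delta = baseline_rate * relative_lift             # absolute effect size
    sigma_sq = baseline_rate * (1 - baseline_rate)    # Bernoulli variance
    return math.ceil(2 * z_multiplier(alpha, power) ** 2 * sigma_sq / delta ** 2)

# The example from the text: 0.5% relative lift on a 5% baseline CTR.
n_needed = required_n_per_arm(0.05, 0.005)
mde_at_n = mde_absolute(math.sqrt(0.05 * 0.95), n_needed)
print(f"{n_needed:,} users per arm, MDE at that size = {mde_at_n:.6f}")
```

Because N scales with 1/δ², halving the lift you want to detect quadruples the traffic you need.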

Type I and Type II errors

Running 20 simultaneous tests at α = 0.05 produces, in expectation, one false positive even when none of the changes does anything. Mitigations:

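Bonferroni / Holm-Bonferroni: Bonferroni tests each hypothesis at α divided by the number of tests; Holm is the step-down version, which never rejects fewer hypotheses than Bonferroni. Both control FWER but cost power.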
FDR (Benjamini-Hochberg): controls expected false-discovery rate, not family-wise error. Right knob for screening many variants.
Always-valid p-values / sequential testing (Howard et al. 2018): peek continuously without inflating Type I. Mandatory at any lab running thousands of experiments.

Type II — under-powered tests missing real effects — is the quieter failure. Teams running 1-week tests on low-traffic surfaces routinely declare "flat" when the true effect was real but undetectable.
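To see how the corrections above differ in practice, here is a minimal sketch of Bonferroni, Holm, and Benjamini-Hochberg applied to the same p-values. The five p-values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down: compare the k-th smallest p to alpha / (m - k + 1), stop at first miss."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank is 0-based here
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """Step-up: find the largest k with p_(k) <= q * k / m and reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p_values = [0.001, 0.012, 0.014, 0.03, 0.2]   # hypothetical variant-vs-control tests

n_bonferroni = sum(bonferroni(p_values))
n_holm = sum(holm(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_bonferroni, n_holm, n_bh)
```

On these inputs the three procedures reject progressively more hypotheses, in the order Bonferroni, Holm, Benjamini-Hochberg, which mirrors the trade of stricter error control for power.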

Variance reduction — the cheapest power upgrade

Technique	How it works	Reduction
CUPED (Deng et al. 2013)	Replace Y by Y − θ(X − E[X]) with pre-experiment covariate X and θ = Cov(Y,X)/Var(X).	30–50% on most user-level engagement metrics. Free power.
Stratification	Pre-stratify by covariate (country, device, heavy/light user); analyse within strata.	Modest; composable with CUPED.
Triggered analysis	Only count users who hit the changed code path.	Large when trigger rate is low. Careful definition needed to avoid post-treatment selection bias.

CUPED is standard at every major experimentation platform. Naming it signals you've shipped experiments at scale.
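The CUPED adjustment in the table can be sketched on simulated data. The data-generating process, the sample size, and the helper names are assumptions for the example; a real analysis would typically estimate θ on data pooled across arms and apply the same adjustment to both.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)) with theta = Cov(Y, X) / Var(X)."""
    theta = covariance(y, x) / variance(x)
    mx = mean(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]

random.seed(0)
n_users = 5000
# x: pre-experiment metric per user; y: in-experiment metric correlated with x.
x = [random.gauss(10, 3) for _ in range(n_users)]
y = [0.8 * xi + random.gauss(0, 2) for xi in x]

y_adj = cuped_adjust(y, x)

rho = covariance(y, x) / (variance(x) * variance(y)) ** 0.5
variance_ratio = variance(y_adj) / variance(y)
print(f"rho^2 = {rho ** 2:.3f}, variance ratio = {variance_ratio:.3f}")
```

The mean of the adjusted metric equals the mean of the raw metric, so the treatment-effect estimate is not shifted, while the variance shrinks by the factor 1 − ρ². The stronger the correlation between pre-period and in-experiment behaviour, the more power you get for free.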

Network effects, novelty, primacy

Effect	What goes wrong	Mitigation
Network interference	Treatment users influence control users (sharing a video). Effect leaks across the assignment boundary, biasing toward zero.	Cluster-level randomization, ego-cluster designs, side-by-side surfaces, or bound the bias.
Novelty bias	Users initially engage more with anything new. The week-1 lift is partly novelty, not the model.	Run for weeks; report the trajectory. Or burn-in: discard the first N days.
Primacy effect	The opposite: users initially do worse because they're used to the old thing, then adapt. The early readout underestimates the real effect.	Same as novelty: run long enough and examine the trajectory.
Twyman's law	"+50% CTR" wins are almost always instrumentation: misrouting, double-counting, broken control. Real ranker wins are 0.1–2%.	SRM checks, A/A tests, instrumentation review before believing any large effect.
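One of the checks named in the table, the SRM (sample ratio mismatch) check, is simple enough to sketch. This version assumes a two-arm test and a chi-square goodness-of-fit test with one degree of freedom; the function name and the user counts are hypothetical.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed arm sizes (1 dof)."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

balanced = srm_p_value(100_000, 100_000)     # arms match the planned 50/50 split
skewed = srm_p_value(100_000, 101_500)       # 1.5% more users in treatment
print(balanced, skewed)
```

A very small p-value says the assignment or logging pipeline is dropping or duplicating users unevenly, so the metric comparison should not be trusted until that is explained. The alert threshold is a platform choice rather than something fixed by the test.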

Long-term metrics

Short-term CTR is easy; long-term DAU and retention are what the business cares about. The two regularly disagree — a CTR-optimised ranker that shortens sessions (clickbait → dissatisfied user closes the app) wins the short experiment and loses the company.

Standard pattern: a long-term holdout keeps a fixed cohort on the pre-launch experience for 3–6 months after the regular A/B test concludes; compare retention and revenue trajectories. Most orgs run one product-wide holdout with new launches stacking against it, rather than one per experiment.

Sizing an experiment

Worked inputs for a sizing exercise, all per arm: baseline CTR 5.0%, target relative lift 1.0%, 200 thousand daily users per arm, observed lift so far 0.40%, 7 days running, 1 simultaneous test. The quantities to read off are N per arm so far, the MDE at 80% power (relative %), the days needed to reach the target lift, the two-sided p-value, and the Bonferroni threshold when many tests run in parallel.

The MDE formula assumes a two-sided z-test on the difference of proportions with σ² ≈ p(1−p) per user. Real platforms use cluster-robust or CUPED-adjusted variance; the order of magnitude is the same.
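The sizing quantities listed above (N per arm, MDE, days to target, p-value, Bonferroni threshold) can be computed with a short sketch. It assumes a two-sided z-test with σ² ≈ p(1−p) in both arms, and it treats the observed lift as a relative lift, which is a guess about what that input means. The function and key names are illustrative.

```python
import math
from statistics import NormalDist

def size_experiment(baseline, target_rel_lift, daily_users_per_arm,
                    observed_rel_lift, days_running, n_tests=1,
                    alpha=0.05, power=0.80):
    """Return the sizing readouts for a two-arm test on a Bernoulli metric."""
    nd = NormalDist()
    n_per_arm = daily_users_per_arm * days_running
    sigma = math.sqrt(baseline * (1 - baseline))
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

    # Standard error of the difference in rates at the current sample size.
    std_err = sigma * math.sqrt(2 / n_per_arm)
    mde_rel = z_sum * std_err / baseline

    # Users and days needed before the target lift becomes detectable.
    target_abs = baseline * target_rel_lift
    n_needed = 2 * z_sum ** 2 * sigma ** 2 / target_abs ** 2
    days_needed = math.ceil(n_needed / daily_users_per_arm)

    # Two-sided p-value for the lift observed so far (assumed relative).
    z_obs = observed_rel_lift * baseline / std_err
    p_value = 2 * (1 - nd.cdf(abs(z_obs)))

    threshold = alpha / n_tests                 # Bonferroni-adjusted alpha
    return {
        "n_per_arm": n_per_arm,
        "mde_rel": mde_rel,
        "days_needed": days_needed,
        "p_value": p_value,
        "threshold": threshold,
        "significant": p_value <= threshold,
    }

# Inputs: 5% baseline, 1% target lift, 200k users/day/arm, 0.4% observed, 7 days.
result = size_experiment(0.05, 0.01, 200_000, 0.004, 7, n_tests=1)
for key, value in result.items():
    print(key, value)
```

Under these inputs the sketch reports an MDE of roughly 1.5% relative after one week, about 15 days to reach the 1% target, and an observed lift that is nowhere near significant. The honest reading is "underpowered so far", not "flat".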

Trade-offs at a glance

	AUC	Log-loss	NDCG@K	A/B test
Latency to result	Minutes (offline)	Minutes	Minutes	1–8 weeks
Cost	Trivial	Trivial	Trivial	Engineering + opportunity cost of traffic
Calibration-sensitive?	No	Yes	No (rank only)	Indirectly (auctions, thresholding)
Position-bias-sensitive?	Yes — logged labels are biased	Yes	Yes	No — randomisation breaks the link
Counterfactual to old policy?	No (logged data only)	No	No	Yes — by construction
Decision risk	Misleads launches	Misses rank issues	Misses calibration	The actual launch criterion

Where junior candidates trip

Asked "how would you evaluate the ranker?" a junior candidate names one offline metric and stops. A senior candidate names the offline metric, names a calibration check, names the planned A/B test with rough sample size, and names what they would do if offline improved but the A/B test was flat. The last part is the test — the interviewer wants to know whether you understand that offline metrics are advisory.

Interview prompts you should be ready for

"Offline NDCG@10 improved 3%; A/B test is flat. Diagnose." (Probes: retrieval ↔ ranker mismatch, position bias in labels, lift on items the old ranker never surfaced, novelty masking, offline metric not matching production scoring, re-rank overrides.)
"Walk me through CUPED. Why does it work?" (Probes: pre-experiment covariate fixed before treatment so estimator is unbiased; variance reduction proportional to ρ² between pre- and in-experiment metric.)
"How do you decide how long to run an A/B test?" (Probes: power calculation from baseline and σ, novelty burn-in, weekday cycles, multiple-tests adjustment.)
"Why is AUC sometimes misleading as a model-selection metric?" (Probes: invariance to monotone transforms → blind to calibration; insensitive to where in the ranking the improvement happens; matters for auctions.)
"New ranker has big A/B lift in week 1; week 4 flat. What's happening?" (Probes: novelty decay, instrumentation broke, traffic mix shifted, feature distribution drift, contaminated holdout.)
"Test 10 ranker variants — naive Bonferroni or something smarter?" (Probes: Bonferroni is fine for small numbers but conservative; FDR for screening; sequential/bandit for adaptive allocation; tournament against a champion.)
"You can't run an A/B test. How do you evaluate?" (Probes: off-policy IPS/doubly-robust, interleaving, shadow scoring, human raters, replay on held-out logs.)

Takeaway

Offline metrics filter. A/B tests decide. Each offline metric encodes a specific claim about what kind of mistake matters — AUC about rank, log-loss about calibration, NDCG about graded position, recall about retrieval. A/B tests have their own failure modes — underpower, novelty, network effects, multiple-testing — and the senior signal is naming the mitigations (CUPED, sequential tests, long-term holdouts) before the interviewer prompts for them. When asked about a launch, quote both numbers: the offline lift that earned the test, and the online lift that earned the ship.