Statistical testing & A/B testing for ML

"Ship it" is a statistical claim. The four formulas every PM-facing MLE memorises, the variance-reduction trick that doubles experiment throughput, and the pitfalls that ship bad models with confidence.

The four numbers — Type I, Type II, MDE, sample size

Setup: a metric Y with mean μ_A in control, μ_B in treatment, variance σ² in both. Null hypothesis H_0: μ_B = μ_A. You run the experiment, observe a difference δ̂ = ȳ_B − ȳ_A, and either reject or fail to reject H_0.

Quantity	Symbol	What it means	Typical value
Significance level / Type I rate	α	P(reject H_0 | H_0 true), the false positive rate	0.05
Power / 1 − Type II rate	1 − β	P(reject H_0 | true effect = MDE), the true positive rate at the assumed effect	0.8
Minimum Detectable Effect	MDE	Smallest effect the experiment can reliably detect at given α, β, N	"What ship-able lift means"
Sample size per arm	N	Number of units in each variant	"How long do we run"

The four are linked by one master equation. For a two-sided z-test on means with equal sample sizes:

MDE ≈ (z_{1-α/2} + z_{1-β}) · σ · √(2/N)

For standard α = 0.05, β = 0.2: z_{0.975} = 1.96, z_{0.8} = 0.84, so z_{1-α/2} + z_{1-β} ≈ 2.80:

MDE ≈ 2.8 · σ · √(2/N) ≈ 4 · σ / √N

Inverting: to detect an effect of size MDE, you need N ≈ 16 · σ² / MDE² per arm.
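
As a rough sketch of the master equation, the snippet below computes N per arm and the MDE from exact normal quantiles rather than the rounded constant 16. The function names `sample_size_per_arm` and `mde_for_sample_size` are illustrative (not from any library), and the sketch assumes a two-sided z-test with equal arms and a known, common variance.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """z_{1-alpha/2} + z_{1-beta} for a two-sided test with power = 1 - beta."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def sample_size_per_arm(sigma2: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """N per arm = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / MDE^2."""
    return ceil(2 * sigma2 * _z_sum(alpha, power) ** 2 / mde ** 2)


def mde_for_sample_size(sigma2: float, n_per_arm: int,
                        alpha: float = 0.05, power: float = 0.8) -> float:
    """MDE = (z_{1-alpha/2} + z_{1-beta}) * sigma * sqrt(2 / N)."""
    return _z_sum(alpha, power) * sqrt(2 * sigma2 / n_per_arm)


# Unit variance, MDE of 0.1 standard deviations:
# the result lands close to the 16 / 0.1^2 = 1600 rule of thumb.
print(sample_size_per_arm(sigma2=1.0, mde=0.1))
print(mde_for_sample_size(sigma2=1.0, n_per_arm=1600))
```

The exact constant is 2 · 2.80² ≈ 15.7, so the "16" shortcut overestimates N by about 2%, which is harmless for back-of-envelope sizing.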

The interview soundbite

Halving the MDE you want to detect = 4× the sample size. Doubling the metric variance = 2× the sample size. Conversely, halving the variance is mathematically equivalent to running 2× longer. Hence CUPED.

Worked example — when do I ship?

You run an A/B test on click-through rate. CTR baseline is 5%, you want to detect a 0.5% lift (0.5pp absolute = 10% relative). What sample do you need?

For a Bernoulli with p ≈ 0.05, σ² = p(1−p) ≈ 0.0475. MDE = 0.005:

N = 16 · 0.0475 / (0.005)² = 16 · 0.0475 / 2.5e-5 = ~30,400 per arm

So ~30k per arm, ~60k total, plus a multiplier for ratio metrics, multiple variants, etc. If your daily traffic on a variant is 20k, that's ~1.5 days. Acceptable. If it's 200, that's 150 days. Don't bother.

CUPED — variance reduction for free

CUPED stands for "Controlled-experiment Using Pre-Experiment Data" (Deng et al. 2013). The idea: each user has a pre-experiment metric X (e.g. their CTR before the experiment). The experiment metric Y is correlated with X. Subtract the part of Y that's explained by X:

Y_CUPED = Y − θ · (X − E[X]), where θ = Cov(X, Y) / Var(X)

The variance of the new metric is:

Var(Y_CUPED) = Var(Y) · (1 − ρ²), where ρ = correlation(X, Y)

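If ρ = 0.8, then ρ² = 0.64: you've reduced variance by 64%, so the effective sample size is 1/(1 − 0.64) ≈ 2.8× bigger, and you can detect smaller effects or run shorter experiments. (A more modest ρ = 0.6 still removes 36% of the variance, about 1.6× the effective sample.) CUPED is "free" because the pre-experiment data is already there.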
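
A minimal simulation can make the variance formula concrete. The sketch below draws a synthetic pre-experiment metric X and an experiment metric Y with an assumed correlation of 0.7 (an arbitrary illustrative value), applies the adjustment, and compares the variance ratio to 1 − ρ². The helper name `cuped_adjust` is hypothetical; in a real experiment θ would normally be estimated on the pooled data from both arms.

```python
import random
from math import sqrt
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return (adjusted metric, theta), with theta = Cov(X, Y) / Var(X)."""
    x_bar, y_bar = mean(x), mean(y)
    n = len(x)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    theta = cov_xy / variance(x)
    # Subtract the part of Y that is linearly explained by X.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)], theta


rng = random.Random(0)
rho = 0.7  # assumed correlation between X and Y (illustrative only)
x = [rng.gauss(0, 1) for _ in range(10_000)]
y = [rho * xi + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]

y_cuped, theta = cuped_adjust(y, x)
ratio = variance(y_cuped) / variance(y)
print(f"theta={theta:.3f}  variance ratio={ratio:.3f}  1 - rho^2={1 - rho ** 2:.3f}")
```

The adjustment leaves the mean of the metric unchanged, so the treatment-effect estimate stays the same in expectation while its standard error shrinks.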

The peeking problem — sequential testing

You set up an A/B test for 14 days. After 3 days, you check: p < 0.05. Can you ship?

No. Every time you check the experiment, you re-run the hypothesis test. Standard inference assumes you check once; checking every day inflates the Type I rate.

Daily checks	Effective α (each check run at nominal α = 0.05)
1	5%
5	~14%
10	~19%
14 (every day for 2 weeks)	~22%
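
The inflation in the table can be reproduced approximately by simulation. This is a sketch under simplifying assumptions: the null is true, the variance is known, and each look adds an equal-sized batch of data, so the z-statistic at look k behaves like a standard random walk divided by √k. The function name `peeking_false_positive_rate` is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(n_looks: int, alpha: float = 0.05,
                                n_sims: int = 20_000, seed: int = 0) -> float:
    """Share of null experiments that cross |z| > z_{1-alpha/2} at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # new batch of data, no true effect
            if abs(running_sum / sqrt(look)) > z_crit:
                false_positives += 1             # "ship" at the first significant look
                break
    return false_positives / n_sims


rates = {k: peeking_false_positive_rate(k) for k in (1, 5, 10, 14)}
for k, rate in rates.items():
    print(f"{k:>2} looks: effective alpha ~ {rate:.3f}")
```

Because the looks share data, the inflation is smaller than the 1 − 0.95^K you would get from K independent tests, but it is still several times the nominal 5%.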

Mitigations:

Pre-register the experiment length. Decide N up front; only test once at the end. Boring but correct.
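Bonferroni / α-spending. If you peek K times, Bonferroni uses α/K per check: conservative, throws away power. α-spending functions (Pocock, O'Brien-Fleming) split α across the planned looks less wastefully.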
Sequential probability ratio tests / mSPRT. Use a test that's valid at any stopping time. Used by Optimizely, Microsoft ExP. The right answer if you actually want to peek.
Always-valid p-values (e.g., e-processes; Howard et al. 2021). Modern theory; small efficiency cost vs. fixed-horizon tests but always-valid.

Multiple testing — the Bonferroni / FDR pair

You're testing 20 variants against control. Each has an α = 5% false positive rate, so if none truly differs and the tests are roughly independent, the chance that at least one looks significant is 1 − (1−0.05)^{20} ≈ 64%. You will ship a random variant.

Correction	Adjusted threshold	Controls
Bonferroni	α / K (where K = tests run)	Family-wise error rate (FWER)
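Benjamini-Hochberg	Sort p-values ascending; find the largest i with p_(i) ≤ (i/K) · α and reject hypotheses 1..i	False Discovery Rate (FDR)
Holm-Bonferroni	Step-down Bonferroni: sort ascending, compare p_(i) to α / (K − i + 1), stop at the first failure	FWER, uniformly at least as powerful as Bonferroni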

Choose based on the question. "I never want a false ship" → Bonferroni / FWER. "I'm okay with some false positives in a shortlist" → BH / FDR.
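
The three corrections are short enough to sketch directly. The functions below are illustrative stdlib implementations (the names are not from any library); each takes a list of p-values and returns a parallel list of reject decisions.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject p_i <= alpha / K. Controls FWER."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare the i-th smallest p to alpha / (K - i + 1); stop at first failure."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, idx in enumerate(order):           # rank is 0-based
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up: find the largest i with p_(i) <= (i / K) * alpha; reject the i smallest. Controls FDR."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):  # rank is 1-based
        if p_values[idx] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


p = [0.01, 0.012, 0.02, 0.04]
print(sum(bonferroni(p)), sum(holm(p)), sum(benjamini_hochberg(p)))  # 2 4 4
```

On this toy input Bonferroni keeps two discoveries while Holm and BH keep all four, which illustrates why plain Bonferroni is rarely the best choice when a step-wise procedure is available.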

Ratio metrics and the delta method

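Some metrics are ratios: CTR = clicks / impressions. Randomisation is per user but the denominator (impressions) varies per user, so impressions are not independent samples and the naive binomial variance is wrong. The average of per-user CTRs is also a different quantity from Σ clicks / Σ impressions. Two approaches: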

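Cluster by user. Treat each user's ratio as the unit; compute the variance of the per-user ratios. Note this estimates the mean per-user CTR, which weights light and heavy users equally.
Delta method. For a ratio of sums over n users, with Y and X the per-user numerator and denominator, the variance is approximately:
Var(ΣY/ΣX) ≈ (1/(n·X̄²)) · [Var(Y) − 2·(Ȳ/X̄)·Cov(X,Y) + (Ȳ/X̄)² · Var(X)]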

The delta method is the standard derivation: linearise around X̄, Ȳ using a Taylor expansion. Pitfall: it's an approximation, breaks for highly skewed denominators (e.g. heavy-tailed impressions).
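
As a sketch, the delta-method variance can be computed from per-user totals as follows. The function name `delta_method_ratio` is illustrative; the inputs are assumed to be one (clicks, impressions) pair per user, and the per-user moments are divided by the number of users n to give the variance of the ratio estimate itself.

```python
from statistics import mean


def delta_method_ratio(clicks, impressions):
    """Return (ratio of sums, approximate variance of that ratio) from per-user totals."""
    n = len(clicks)
    y_bar, x_bar = mean(clicks), mean(impressions)
    ratio = y_bar / x_bar                        # equals sum(clicks) / sum(impressions)
    var_y = sum((y - y_bar) ** 2 for y in clicks) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in impressions) / (n - 1)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(impressions, clicks)) / (n - 1)
    # First-order Taylor expansion of Y_bar / X_bar around the means.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * x_bar ** 2)
    return ratio, var_ratio


# Made-up per-user totals, for illustration only.
clicks = [1, 0, 3, 2, 5]
impressions = [10, 5, 40, 20, 60]
ctr, var_ctr = delta_method_ratio(clicks, impressions)
print(f"CTR={ctr:.4f}  SE={var_ctr ** 0.5:.4f}")
```

A quick sanity check on the formula: if every user had exactly the same CTR, clicks would be proportional to impressions and the bracketed term would cancel to zero.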

Heterogeneous treatment effects (HTE)

The overall lift is +0.3%. But:

New users: +2% (love it).
Power users: −1% (hate it).
Mobile: +0.5%.
Desktop: 0%.

HTE methods estimate τ(x) = E[Y | T=1, X=x] − E[Y | T=0, X=x] as a function of features x. Standard approaches:

Subgroup analysis: pre-register subgroups, test each (with multiple-testing correction).
Causal forests (Athey & Wager 2019): random forest variant that estimates τ(x).
Meta-learners (S-, T-, X-learner, Künzel et al. 2019; R-learner, Nie & Wager 2021): wrap any regression model. The simplest, the T-learner, fits one outcome model per arm and takes the difference of their predictions.

The trap is finding HTE post-hoc and claiming significance — you've just multiple-tested across subgroups.
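
A pre-registered subgroup analysis with a correction might look like the sketch below. It estimates the difference in means per segment, computes a two-sided z-test p-value, and applies a Bonferroni threshold across the segments. The function name `subgroup_effects`, the segment labels, and the synthetic data are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def subgroup_effects(rows, alpha=0.05):
    """rows: iterable of (segment, treated, outcome). Returns per-segment effect estimates."""
    by_segment = {}
    for segment, treated, outcome in rows:
        control, treatment = by_segment.setdefault(segment, ([], []))
        (treatment if treated else control).append(outcome)

    k = len(by_segment)                          # number of subgroup tests
    results = {}
    for segment, (control, treatment) in by_segment.items():
        tau = mean(treatment) - mean(control)
        se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
        p_value = 2 * (1 - NormalDist().cdf(abs(tau / se)))
        results[segment] = {"tau": tau, "se": se, "p": p_value,
                            "significant": p_value <= alpha / k}  # Bonferroni across segments
    return results


# Synthetic data: a +2.0 effect for "new" users, no effect for "power" users.
rng = random.Random(1)
rows = []
for segment, effect in [("new", 2.0), ("power", 0.0)]:
    for _ in range(500):
        rows.append((segment, False, rng.gauss(10, 1)))
        rows.append((segment, True, rng.gauss(10 + effect, 1)))

result = subgroup_effects(rows)
for segment, stats in result.items():
    print(segment, f"tau={stats['tau']:.2f}", f"p={stats['p']:.3g}", stats["significant"])
```

The correction only protects you if the list of segments was fixed before looking at the data; slicing until something is significant and then correcting for the slices you kept does not count.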

Counterfactual evaluation — when A/B is impossible

Sometimes you can't run an A/B: cold-start launches, irreversible changes, policy decisions. Use offline evaluation on logged data:

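Inverse Propensity Scoring (IPS). Reweight each logged outcome by π_new(action | context) / π_old(action | context), then average to estimate the new policy's expected outcome. Unbiased when π_old is known and covers every action π_new can take; high variance when the weights are large.
Doubly robust: use a learned reward model as the baseline and apply IPS only to its residuals. Unbiased if either the propensities or the reward model is correct, and usually lower variance than IPS.
Direct method: fit a reward model on logged data, evaluate the new policy with the model. Lowest variance, but biased wherever the reward model is wrong.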
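
The three estimators can be compared on simulated logs. Everything in this sketch is an assumption made for illustration: a binary context, two actions, a logging policy with known propensities, a made-up table of true reward rates, and a deliberately crude reward model that ignores context so that the direct method is visibly biased.

```python
import random

# Hypothetical ground truth, used only to simulate logs: P(reward = 1 | context, action).
TRUE_REWARD = {(0, 0): 0.10, (0, 1): 0.05, (1, 0): 0.04, (1, 1): 0.12}


def pi_old(action, context):
    """Logging policy: plays action 1 with probability 0.3, whatever the context."""
    return 0.3 if action == 1 else 0.7


def pi_new(action, context):
    """Candidate policy: deterministically plays the action equal to the context."""
    return 1.0 if action == context else 0.0


rng = random.Random(0)
logs = []
for _ in range(50_000):
    x = rng.randint(0, 1)
    a = 1 if rng.random() < 0.3 else 0
    r = 1.0 if rng.random() < TRUE_REWARD[(x, a)] else 0.0
    logs.append((x, a, r))

# Deliberately misspecified reward model: mean logged reward per action, ignoring context.
q_hat = {}
for action in (0, 1):
    rewards = [r for _, a, r in logs if a == action]
    q_hat[action] = sum(rewards) / len(rewards)


def model_value(context):
    """Reward-model estimate of the new policy's value in this context."""
    return sum(pi_new(a, context) * q_hat[a] for a in (0, 1))


n = len(logs)
ips = sum(pi_new(a, x) / pi_old(a, x) * r for x, a, r in logs) / n
dm = sum(model_value(x) for x, _, _ in logs) / n
dr = sum(model_value(x) + pi_new(a, x) / pi_old(a, x) * (r - q_hat[a])
         for x, a, r in logs) / n
true_value = 0.5 * TRUE_REWARD[(0, 0)] + 0.5 * TRUE_REWARD[(1, 1)]

print(f"true={true_value:.3f}  IPS={ips:.3f}  DM={dm:.3f}  DR={dr:.3f}")
```

With known propensities, the importance-weighted residual term corrects the reward model's error, so the doubly robust estimate tracks the true value even though the direct method, using the same model, does not.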

(Detailed coverage in the search/ads track, 07_bias_debiasing.)

Common pitfalls — every one ships bad models

Pitfall	How it ships	Fix
Peeking	Check every day; ship when p < 0.05.	Pre-register N, or use sequential test.
SRM (sample ratio mismatch)	You expected a 50/50 traffic split; you got 51/49 on a large sample. That means assignment or logging is biased.	Chi-square test on N_A vs N_B; if it flags SRM, the entire experiment is suspect.
Novelty / primacy effect	Treatment lift in week 1; drops to 0 in week 4. The novelty wore off.	Run for at least 2× the user revisit cycle; segment by time-on-treatment.
Spillover / network effects	Treating one user affects untreated users (Facebook News Feed, marketplace bidding).	Cluster randomisation (whole geographic region or social cluster).
Multiple comparisons	20 metrics, one is significant.	Bonferroni or BH correction. Pre-register a primary metric.
Ratio metric trap	Compute mean per user instead of Σ clicks / Σ impressions.	Delta method, or cluster bootstrapping.
Triggered analysis confusion	Some users never see the change. Computing lift over all users dilutes the signal.	Triggered analysis: restrict to users who saw the treatment-eligible state, in both arms.
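
The SRM check in the table is a one-degree-of-freedom chi-square test, which needs only the standard library. In the sketch below the function name `srm_check` is illustrative, and the 0.001 alert threshold is an assumed convention (stricter than 0.05 because the check runs on every experiment), not something prescribed above.

```python
from math import erfc, sqrt


def srm_check(n_a, n_b, expected_share_a=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test for the traffic split. Returns (chi2, p_value, flagged)."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Survival function of a chi-square with 1 degree of freedom: P(Z^2 > chi2).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# The same 51/49 split at three different sample sizes.
for n_a, n_b in [(51, 49), (5_100, 4_900), (510_000, 490_000)]:
    chi2, p_value, flagged = srm_check(n_a, n_b)
    print(f"{n_a}/{n_b}: chi2={chi2:.2f}  p={p_value:.3g}  SRM={flagged}")
```

The same 51/49 split is unremarkable at 100 users and a clear red flag at a million, which is why SRM has to be judged with a test rather than by eyeballing the ratio.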

Offline metric ≠ online lift

You have a new ranking model with +1% offline AUC. The A/B shows no lift (or worse). Why?

Distribution shift. Offline data is from the previous policy. The new model creates a feedback loop (popular items get clicked more, get ranked higher next time, get more clicks…).
Counterfactual data. Offline AUC measures ranking quality on items the old policy showed. The new policy shows different items — there's no offline data on those.
Metric mismatch. AUC is rank quality; the A/B measures clicks (or revenue, or sessions). They're related but not identical.
Long-term effects. Offline measures one-shot; A/B at week 2 includes return-user behaviour the offline metric ignores.

This is exactly the "offline ≠ online" gap covered in detail in search_ads_recsys/08_evaluation. The interview signal is naming the gap and proposing counterfactual evaluation (IPS / doubly-robust) as the partial mitigation.

The interview probes

"You want to detect a 1% relative lift on a 10% baseline. How many samples?" Absolute MDE = 0.001. σ² ≈ 0.09. N = 16 · 0.09 / 0.001² = 1.44M per arm. Big. Either tolerate a larger MDE or use CUPED.
"What's the cost of running 5 variants instead of 2?" Each variant needs N samples → 5N total (vs 2N for A/B). Plus multiple-testing correction shrinks the effective α → larger required N per arm. Total cost ~3× a simple A/B.
"Your experiment shows p=0.04 lift in week 1. Ship it?" Only if (a) the experiment was pre-registered to be analysed at this size; (b) no peeking has been done; (c) the metric is the pre-registered primary; (d) no novelty / primacy effects are likely. Otherwise, run to completion.
"What's the difference between FWER and FDR?" FWER controls P(at least one false discovery). FDR controls E[false discoveries / discoveries]. For exploratory analysis with many hypotheses, FDR is more powerful and the right choice. For ship-or-not decisions, FWER is safer.
"How do you A/B test a model that affects future model training (recommender feedback loop)?" The treatment data affects the world the control sees → spillover. Mitigations: temporal split (run for a fixed period and analyse the pre-feedback window), cluster randomisation (treat whole geographies), separate model training so each variant has its own loop.

Interview prompts you should be ready for

"Derive the sample size formula." (Two-sample z-test on means. Test statistic Z = (ȳ_B − ȳ_A) / (σ√(2/N)) ~ N(δ/(σ√(2/N)), 1). Reject if |Z| > z_{1-α/2}. Power = P(|Z| > z_{1-α/2} | δ = MDE). For one-sided, this is z_{1-α} = z_{1-β} + MDE/(σ√(2/N)). Rearranging: N = 2σ²(z_{1-α/2} + z_{1-β})²/MDE². For α=0.05, β=0.2: N ≈ 16 · σ² / MDE².)
"Your offline AUC went up by 0.5%, your A/B shows no lift. What's wrong?" (Several possibilities: offline data is from the old policy (covariate shift); AUC and your A/B metric are loosely connected; long-term effects not in offline; instrumentation bug. Diagnose: compute AUC on the A/B's logged data — if it matches offline, the model is fine and the metric mapping is broken; if it doesn't, the data distribution shifted.)
"What's CUPED in one sentence?" (Use a pre-experiment metric to subtract the user's baseline level from their experiment metric; the remainder has lower variance, equivalent to running a longer experiment.)
"Bayesian A/B vs frequentist?" (Bayesian: posterior P(B better than A | data). Avoids the peeking problem because posteriors update validly. Requires a prior, which is a choice. Frequentist: p-value, requires fixed horizon. In practice both work; Bayesian is more interpretable to stakeholders ("85% chance B is better"); frequentist matches the literature.)
"You see p = 0.03 in week 1 of a 14-day experiment. Action?" (Do not stop. (a) The Type I rate is inflated for sequential checks. (b) Novelty effects often disappear over time. (c) If pre-registered for 14 days, complete it. If you must stop early, do it for a valid reason like a clear safety issue.)
"How do you do A/B on a 0.1% population (rare event)?" (Need huge sample sizes; consider boosting power via: triggered analysis (only count users who could see the treatment), proxy metrics correlated with the rare event, variance reduction (CUPED), longer windows. If still infeasible, run a holdout vs treat-all comparison rather than a balanced A/B.)

Takeaway

Four numbers: α, β, MDE, N. They are linked by N ≈ 16σ²/MDE². CUPED shrinks σ² by a factor of (1 − ρ²) for free. Peeking inflates α; pre-register or use sequential tests. Multiple comparisons inflate α; correct with Bonferroni or BH. Ratio metrics need the delta method. Offline ≠ online lift; counterfactual evaluation is the partial fix. The interview signal is being able to size an experiment in 60 seconds and articulate which validity assumption is at risk.