Statistical testing & A/B testing for ML
"Ship it" is a statistical claim. The four formulas every PM-facing MLE memorises, the variance-reduction trick that doubles experiment throughput, and the pitfalls that ship bad models with confidence.
The four numbers — Type I, Type II, MDE, sample size
Setup: a metric Y with mean μ_A in control, μ_B in treatment, variance σ² in both. Null hypothesis H_0: μ_B = μ_A. You run the experiment, observe a difference δ̂ = ȳ_B − ȳ_A, and reject or fail to reject H_0.
| Quantity | Symbol | What it means | Typical value |
|---|---|---|---|
| Significance level / Type I rate | α | P(reject H_0 \| H_0 true), the false positive rate | 0.05 |
| Power / 1 − Type II rate | 1 − β | P(reject H_0 \| true effect = MDE), the true positive rate at the assumed effect | 0.8 |
| Minimum Detectable Effect | MDE | Smallest effect the experiment can reliably detect at given α, β, N | "What ship-able lift means" |
| Sample size per arm | N | Number of units in each variant | "How long do we run" |
The four are linked by one master equation. For a two-sided z-test on means with equal sample sizes:
MDE = (z_{1-α/2} + z_{1-β}) · σ · √(2/N)
For standard α = 0.05, β = 0.2: z_{0.975} = 1.96, z_{0.8} = 0.84, so z_{1-α/2} + z_{1-β} ≈ 2.80:
MDE ≈ 2.80 · σ · √(2/N) ≈ 4σ / √N
Inverting: to detect an effect of size MDE, you need N ≈ 16 · σ² / MDE² per arm.
Worked example — when do I ship?
You run an A/B test on click-through rate. CTR baseline is 5%, you want to detect a 0.5% lift (0.5pp absolute = 10% relative). What sample do you need?
For a Bernoulli with p ≈ 0.05, σ² = p(1−p) ≈ 0.0475. MDE = 0.005:
N ≈ 16 · 0.0475 / 0.005² = 30,400 per arm
So ~30k per arm, ~60k total, plus a multiplier for ratio metrics, multiple variants, etc. If your daily traffic on a variant is 20k, that's ~1.5 days. Acceptable. If it's 200, that's 150 days. Don't bother.
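The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the two-sided z-test setup from this section; the function names `sample_size_per_arm` and `mde_for_sample_size` are illustrative and do not come from any library.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(sigma2, mde, alpha=0.05, power=0.8):
    """N per arm for a two-sided z-test on means with equal arm sizes."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(2 * sigma2 * (z(1 - alpha / 2) + z(power)) ** 2 / mde ** 2)


def mde_for_sample_size(sigma2, n, alpha=0.05, power=0.8):
    """Smallest absolute effect detectable with n units per arm."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(2 * sigma2 / n)


# CTR example from the text: 5% baseline, 0.5pp absolute lift.
p, mde = 0.05, 0.005
n_exact = sample_size_per_arm(p * (1 - p), mde)  # uses exact normal quantiles
n_rule = 16 * p * (1 - p) / mde ** 2             # the "16 sigma^2 / MDE^2" shortcut
print(n_exact, round(n_rule))
```

The exact quantiles give slightly under 30k per arm. The shortcut rounds the constant 2 · 2.80² ≈ 15.7 up to 16, so it lands a little higher.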
CUPED — variance reduction for free
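CUPED stands for "Controlled-experiment Using Pre-Experiment Data" (Deng et al. 2013). The idea: each user has a pre-experiment metric X (e.g. their CTR before the experiment). The experiment metric Y is correlated with X. Subtract the part of Y that's explained by X:
Y_cuped = Y − θ · (X − E[X]), with θ = Cov(X, Y) / Var(X)
Because X is measured before assignment, the adjustment has the same expectation in both arms and leaves the treatment effect unbiased. The variance of the new metric is:
Var(Y_cuped) = Var(Y) · (1 − ρ²), where ρ = Corr(X, Y)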
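If ρ = 0.6, the variance drops by ρ² = 36% (a factor of 1 − ρ² = 0.64), so the effective sample size is 1/0.64 ≈ 1.56× bigger: you can detect smaller effects, or run shorter experiments. At ρ = 0.8 the reduction is 64% and the effective sample size is ~2.8× bigger. CUPED is "free" because the pre-experiment data is already there.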
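The variance reduction can be checked on synthetic data. The sketch below assumes a single covariate X with correlation 0.6 to Y, drawn from Gaussians purely for illustration; `cuped_adjust` is a hypothetical helper name. In a real experiment θ would typically be estimated on data pooled across both arms.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return (theta, y_cuped) where y_cuped = y - theta * (x - mean(x))."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return theta, [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Synthetic users: X is the pre-experiment metric, Y the in-experiment metric.
rng = random.Random(0)
rho = 0.6
x = [rng.gauss(0, 1) for _ in range(20_000)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

theta, y_cuped = cuped_adjust(y, x)
ratio = variance(y_cuped) / variance(y)
print(f"theta={theta:.3f}  variance ratio={ratio:.3f}  (theory: {1 - rho ** 2:.2f})")
```

The adjusted metric keeps the same mean as Y, so a difference in means between arms is estimated on the same scale, only with less noise.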
The peeking problem — sequential testing
You set up an A/B test for 14 days. After 3 days, you check: p < 0.05. Can you ship?
No. Every time you check the experiment, you re-run the hypothesis test. Standard inference assumes you check once; checking every day inflates the Type I rate.
| Number of checks | Effective α (each check at nominal α = 0.05) |
|---|---|
| 1 | 5% |
| 5 | ~14% |
| 10 | ~19% |
| 14 (every day for 2 weeks) | ~22% |
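The inflation in the table can be approximated by simulation. This is a rough Monte Carlo sketch, assuming equal-sized daily batches and a z-test re-run on the cumulative data after each day; the function name and defaults are illustrative.

```python
import random
from math import sqrt


def peeking_false_positive_rates(max_looks=14, n_sims=20_000, z_crit=1.96, seed=0):
    """Monte Carlo Type I rate when testing after each of `max_looks` equal batches.

    Under H0 each day's standardised difference is N(0, 1); the z-statistic
    after k days is the cumulative sum divided by sqrt(k).
    """
    rng = random.Random(seed)
    first_reject = [0] * (max_looks + 1)  # first_reject[k]: runs first rejected at look k
    for _ in range(n_sims):
        total = 0.0
        for k in range(1, max_looks + 1):
            total += rng.gauss(0, 1)
            if abs(total / sqrt(k)) > z_crit:
                first_reject[k] += 1
                break  # the experimenter stops and "ships" at the first significant look
    rates, cumulative = {}, 0
    for k in range(1, max_looks + 1):
        cumulative += first_reject[k]
        rates[k] = cumulative / n_sims  # P(at least one rejection in the first k looks)
    return rates


rates = peeking_false_positive_rates()
for k in (1, 5, 10, 14):
    print(f"{k:>2} looks: {rates[k]:.3f}")
```

There is no true effect anywhere in this simulation, so every rejection is a false positive.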
Mitigations:
- Pre-register the experiment length. Decide N up front; only test once at the end. Boring but correct.
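- Bonferroni / α-spending. If you peek K times, use α/K per check (Bonferroni), or spend α unevenly across looks with a group-sequential boundary such as Pocock or O'Brien-Fleming. Bonferroni is conservative and throws away power.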
- Sequential probability ratio tests / mSPRT. Use a test that's valid at any stopping time. Used by Optimizely, Microsoft ExP. The right answer if you actually want to peek.
- Always-valid p-values (e.g., e-processes; Howard et al. 2021). Modern theory; small efficiency cost vs. fixed-horizon tests but always-valid.
Multiple testing — the Bonferroni / FDR pair
You're testing 20 variants against control. Any one has α=5% false positive rate, so the chance that at least one looks significant is 1 − (1−0.05)^{20} ≈ 64%. You will ship a random variant.
| Correction | Adjusted threshold | Controls |
|---|---|---|
| Bonferroni | α / K (where K = tests run) | Family-wise error rate (FWER) |
| Benjamini-Hochberg | Sort p-values ascending; find the largest i with p_(i) ≤ (i/K) · α and reject hypotheses 1..i | False Discovery Rate (FDR) |
| Holm-Bonferroni | Bonferroni with sequential pruning | FWER, slightly more powerful than Bonferroni |
Choose based on the question. "I never want a false ship" → Bonferroni / FWER. "I'm okay with some false positives in a shortlist" → BH / FDR.
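The three corrections in the table can be sketched directly. The code below is a minimal illustration with hypothetical function names and made-up p-values, not a replacement for a vetted statistics library.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / K."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down Bonferroni: compare sorted p-values to alpha / (K - rank)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break  # stop at the first failure; all larger p-values are kept
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the `cutoff` smallest p-values, where cutoff is the largest
    rank i with p_(i) <= (i / K) * alpha."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for i in order[:cutoff]:
        reject[i] = True
    return reject


pvals = [0.03, 0.001, 0.5, 0.011, 0.02]  # illustrative values only
for name, fn in [("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)]:
    print(f"{name:>10}: {sum(fn(pvals))} rejections")
```

On these p-values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, which matches the ordering of power described in the table.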
Ratio metrics and the delta method
Some metrics are ratios: CTR = clicks / impressions. The denominator varies per user; you can't just take the per-user CTR average. Two approaches:
- Cluster by user. Treat each user's ratio as the unit; compute the variance of a per-user ratio.
- Delta method. For a ratio of sums, the variance is approximately:
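Var(ΣY/ΣX) ≈ (1/(n·X̄²)) · [Var(Y) − 2·(Ȳ/X̄)·Cov(X,Y) + (Ȳ/X̄)² · Var(X)]
Here n is the number of users, X̄ and Ȳ are per-user means, and Var and Cov are per-user (co)variances.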
The delta method is the standard derivation: linearise around X̄, Ȳ using a Taylor expansion. Pitfall: it is an approximation, and it breaks down for highly skewed denominators (e.g. heavy-tailed impressions).
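As a sketch of the delta-method variance for one arm, the function below takes per-user clicks and impressions and returns the ratio of sums together with its approximate variance, including the 1/n factor for n users. The name `ratio_with_delta_variance` and the toy data are illustrative assumptions.

```python
from math import sqrt
from statistics import fmean, variance


def ratio_with_delta_variance(y, x):
    """Ratio of sums sum(y)/sum(x) and its delta-method variance.

    y, x: per-user numerators and denominators (e.g. clicks, impressions).
    """
    n = len(y)
    y_bar, x_bar = fmean(y), fmean(x)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    r = y_bar / x_bar  # equals sum(y) / sum(x)
    var_r = (variance(y) - 2 * r * cov_xy + r ** 2 * variance(x)) / (n * x_bar ** 2)
    return r, var_r


clicks = [1, 0, 3, 2, 0, 5]            # toy per-user data
impressions = [10, 4, 40, 25, 8, 60]
ratio, var_ratio = ratio_with_delta_variance(clicks, impressions)
half_width = 1.96 * sqrt(var_ratio)
print(f"CTR={ratio:.4f}  95% CI=({ratio - half_width:.4f}, {ratio + half_width:.4f})")
```

When every user has the same number of impressions the covariance and Var(X) terms vanish, and the formula reduces to the ordinary variance of a mean divided by X̄².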
Heterogeneous treatment effects (HTE)
The overall lift is +0.3%. But:
- New users: +2% (love it).
- Power users: −1% (hate it).
- Mobile: +0.5%.
- Desktop: 0%.
HTE methods estimate τ(x) = E[Y | T=1, X=x] − E[Y | T=0, X=x] as a function of features x. Standard approaches:
- Subgroup analysis: pre-register subgroups, test each (with multiple-testing correction).
- Causal forests (Athey & Wager 2019): random forest variant that estimates τ(x).
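- Meta-learners (T-, S-, X-learner: Künzel et al. 2019; R-learner: Nie & Wager 2021): build τ(x) from standard regressors. The T-learner, for example, fits one outcome model per arm and takes the difference of their predictions.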
The trap is finding HTE post-hoc and claiming significance — you've just multiple-tested across subgroups.
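Pre-registered subgroup analysis with a multiple-testing correction can be sketched as follows. The segments, effect sizes, and the helper `subgroup_lifts` are all assumptions made for illustration; the data is synthetic Gaussian noise around assumed true effects.

```python
import random
from math import erfc, sqrt
from statistics import fmean, variance


def subgroup_lifts(rows, alpha=0.05):
    """rows: iterable of (segment, treated, outcome).

    Returns {segment: (lift, p_value, significant_after_bonferroni)}.
    """
    groups = {}
    for segment, treated, y in rows:
        arms = groups.setdefault(segment, {False: [], True: []})
        arms[bool(treated)].append(y)
    k = len(groups)  # number of subgroup tests, used for the Bonferroni threshold
    results = {}
    for segment, arms in groups.items():
        ctrl, trt = arms[False], arms[True]
        lift = fmean(trt) - fmean(ctrl)
        se = sqrt(variance(trt) / len(trt) + variance(ctrl) / len(ctrl))
        p_value = erfc(abs(lift / se) / sqrt(2))  # two-sided z-test
        results[segment] = (lift, p_value, p_value <= alpha / k)
    return results


# Synthetic data with assumed true effects per segment (illustrative only).
rng = random.Random(0)
true_effect = {"new": 0.2, "power": -0.1, "other": 0.0}
rows = [
    (seg, treated, rng.gauss(effect if treated else 0.0, 1.0))
    for seg, effect in true_effect.items()
    for treated in (False, True)
    for _ in range(5_000)
]
results = subgroup_lifts(rows)
for seg, (lift, p, sig) in results.items():
    print(f"{seg:>5}: lift={lift:+.3f}  p={p:.4f}  significant={sig}")
```

The Bonferroni threshold is fixed by the number of subgroups declared up front, which is what separates this from slicing the data after the fact.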
Counterfactual evaluation — when A/B is impossible
Sometimes you can't run an A/B: cold-start launches, irreversible changes, policy decisions. Use offline evaluation on logged data:
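- Inverse Propensity Scoring (IPS). Reweight each logged outcome by π_new(action | context) / π_old(action | context), then average to estimate the new policy's expected outcome. Unbiased when the logging propensities are known and π_old gives non-zero probability to every action π_new can take; high variance when the weights are large.
- Doubly robust: combine IPS with a learned reward model, using the model as a baseline and IPS on its residuals. Lower variance than IPS, and still consistent if either the propensities or the reward model is correct; biased only when both are wrong.
- Direct method: fit a reward model on logged data, evaluate the new policy with the model. Lowest variance, but biased wherever the reward model is wrong.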
(Detailed coverage in the search/ads track, 07_bias_debiasing.)
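The three estimators can be compared on a toy logged-bandit problem. Everything below is an assumption for illustration: two contexts, two actions, a context-blind logging policy with known propensities, and a reward model that is deliberately misspecified (it ignores context). With a deterministic new policy, the IPS weight reduces to 1/π_old on logged actions the new policy would also have taken, and 0 otherwise.

```python
import random
from statistics import fmean

ACTIONS = (0, 1)
TRUE_REWARD = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8}  # P(r=1 | x, a)


def pi_old(a, x):
    """Logging policy: picks action 1 with probability 0.3, ignoring context."""
    return 0.3 if a == 1 else 0.7


def pi_new(a, x):
    """Target policy: deterministically matches the action to the context."""
    return 1.0 if a == x else 0.0


rng = random.Random(0)
logs = []
for _ in range(50_000):
    x = rng.randint(0, 1)
    a = 1 if rng.random() < 0.3 else 0
    r = 1.0 if rng.random() < TRUE_REWARD[(x, a)] else 0.0
    logs.append((x, a, r))

# Deliberately misspecified reward model: per-action mean that ignores context.
q_hat = {b: fmean(r for _, a, r in logs if a == b) for b in ACTIONS}


def model_value(x):
    """Expected reward of pi_new in context x according to the reward model."""
    return sum(pi_new(b, x) * q_hat[b] for b in ACTIONS)


dm = fmean(model_value(x) for x, _, _ in logs)
ips = fmean(r * pi_new(a, x) / pi_old(a, x) for x, a, r in logs)
dr = fmean(
    model_value(x) + pi_new(a, x) / pi_old(a, x) * (r - q_hat[a])
    for x, a, r in logs
)
true_value = fmean(TRUE_REWARD[(x, x)] for x in (0, 1))  # contexts are uniform
print(f"true={true_value:.3f}  DM={dm:.3f}  IPS={ips:.3f}  DR={dr:.3f}")
```

In this setup the direct method inherits the reward model's error, while IPS and the doubly robust estimate both land near the true value because the logging propensities are known exactly.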
Common pitfalls — every one ships bad models
| Pitfall | How it ships | Fix |
|---|---|---|
| Peeking | Check every day; ship when p < 0.05. | Pre-register N, or use sequential test. |
| SRM (sample ratio mismatch) | You expected a 50/50 traffic split and got 51/49 on a large sample. That is more than chance allows, so assignment or logging is biased. | Chi-square test on N_A vs N_B against the planned split; if it flags SRM, the entire experiment is suspect. |
| Novelty / primacy effect | Treatment lift in week 1; drops to 0 in week 4. The novelty wore off. | Run for at least 2× the user revisit cycle; segment by time-on-treatment. |
| Spillover / network effects | Treating one user affects untreated users (Facebook News Feed, marketplace bidding). | Cluster randomisation (whole geographic region or social cluster). |
| Multiple comparisons | 20 metrics, one is significant. | Bonferroni or BH correction. Pre-register a primary metric. |
| Ratio metric trap | Compute mean per user instead of Σ clicks / Σ impressions. | Delta method, or cluster bootstrapping. |
| Triggered analysis confusion | Some users never see the change. Computing lift over all users dilutes the signal. | Triggered analysis: restrict to users who saw the treatment-eligible state, in both arms. |
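The SRM check in the table is a one-degree-of-freedom chi-square test, which can be sketched with the standard library alone. The function name `srm_p_value` is illustrative, and the alert threshold is left to the reader (a strict cut-off such as p < 0.001 is a common convention, stated here as an assumption).

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed arm counts.

    With two arms there is 1 degree of freedom, so the chi-square tail
    probability equals erfc(sqrt(chi2 / 2)).
    """
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))


# The same 51/49 split at two different sample sizes.
print(f"1,000 users:   p = {srm_p_value(510, 490):.3f}")
print(f"100,000 users: p = {srm_p_value(51_000, 49_000):.2e}")
```

A 51/49 split is unremarkable at 1,000 users and overwhelming evidence of a mismatch at 100,000, which is why the check needs a test rather than an eyeballed ratio.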
Offline metric ≠ online lift
You have a new ranking model with +1% offline AUC. The A/B shows no lift (or worse). Why?
- Distribution shift. Offline data is from the previous policy. The new model creates a feedback loop (popular items get clicked more, get ranked higher next time, get more clicks…).
- Counterfactual data. Offline AUC measures ranking quality on items the old policy showed. The new policy shows different items — there's no offline data on those.
- Metric mismatch. AUC is rank quality; the A/B measures clicks (or revenue, or sessions). They're related but not identical.
- Long-term effects. Offline measures one-shot; A/B at week 2 includes return-user behaviour the offline metric ignores.
This is exactly the "offline ≠ online" gap covered in detail in search_ads_recsys/08_evaluation. The interview signal is naming the gap and proposing counterfactual evaluation (IPS / doubly-robust) as the partial mitigation.
The interview probes
- "You want to detect a 1% relative lift on a 10% baseline. How many samples?" Absolute MDE = 0.001. σ² ≈ 0.09. N = 16 · 0.09 / 0.001² = 1.44M per arm. Big. Either tolerate a larger MDE or use CUPED.
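- "What's the cost of running 5 variants instead of 2?" Each variant needs N samples → 5N total (vs 2N for A/B), i.e. 2.5×. Plus multiple-testing correction shrinks the per-test α → larger required N per arm: with Bonferroni over the 4 treatment-vs-control comparisons (α/4 = 0.0125), N per arm grows by about 1.4×. Total cost is roughly 3.5× a simple A/B.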
- "Your experiment shows a lift with p = 0.04 in week 1. Ship it?" Only if (a) the experiment was pre-registered to be analysed at this size; (b) no peeking has been done; (c) the metric is the pre-registered primary; (d) no novelty / primacy effects are likely. Otherwise, run to completion.
- "What's the difference between FWER and FDR?" FWER controls P(at least one false discovery). FDR controls E[false discoveries / discoveries]. For exploratory analysis with many hypotheses, FDR is more powerful and the right choice. For ship-or-not decisions, FWER is safer.
- "How do you A/B test a model that affects future model training (recommender feedback loop)?" The treatment data affects the world the control sees → spillover. Mitigations: temporal split (run for a fixed period and analyse the pre-feedback window), cluster randomisation (treat whole geographies), separate model training so each variant has its own loop.
Interview prompts you should be ready for
- "Derive the sample size formula." (Two-sample z-test on means. Test statistic Z = (ȳ_B − ȳ_A) / (σ√(2/N)) ~ N(δ/(σ√(2/N)), 1). Reject if |Z| > z_{1-α/2}. Power = P(|Z| > z_{1-α/2} | δ = MDE). For MDE > 0 the lower tail is negligible, so power ≈ P(Z > z_{1-α/2}) = 1 − β, which requires MDE/(σ√(2/N)) = z_{1-α/2} + z_{1-β}. Rearranging: N = 2σ²(z_{1-α/2} + z_{1-β})²/MDE². For α=0.05, β=0.2: N ≈ 16 · σ² / MDE².)
- "Your offline AUC went up by 0.5%, your A/B shows no lift. What's wrong?" (Several possibilities: offline data is from the old policy (covariate shift); AUC and your A/B metric are loosely connected; long-term effects not in offline; instrumentation bug. Diagnose: compute AUC on the A/B's logged data — if it matches offline, the model is fine and the metric mapping is broken; if it doesn't, the data distribution shifted.)
- "What's CUPED in one sentence?" (Use a pre-experiment metric to subtract the user's baseline level from their experiment metric; the remainder has lower variance, equivalent to running a longer experiment.)
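- "Bayesian A/B vs frequentist?" (Bayesian: posterior P(B better than A | data). The posterior stays a valid summary however often you look, which is why it is often said to avoid the peeking problem; but a "stop when P(B better) > 95%" rule still inflates the frequentist false-positive rate. Requires a prior, which is a choice. Frequentist: p-value, which requires a fixed horizon unless you use a sequential test. In practice both work; Bayesian is more interpretable to stakeholders ("85% chance B is better"); frequentist matches the literature.)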
- "You see p = 0.03 in week 1 of a 14-day experiment. Action?" (Do not stop. (a) The Type I rate is inflated for sequential checks. (b) Novelty effects often disappear over time. (c) If pre-registered for 14 days, complete it. If you must stop early, do it for a valid reason like a clear safety issue.)
- "How do you do A/B on a 0.1% population (rare event)?" (Need huge sample sizes; consider boosting power via: triggered analysis (only count users who could see the treatment), proxy metrics correlated with the rare event, variance reduction (CUPED), longer windows. If still infeasible, run a holdout vs treat-all comparison rather than a balanced A/B.)