Evaluation & class imbalance
Picking the metric is half the modelling job. Almost every "good offline, bad in production" complaint is a metric-choice mistake or an imbalance mistake — usually both. First-principles tour of the metrics, when each one lies, and the right answer to imbalance in 2026.
The confusion matrix — four numbers, everything else is a ratio
A binary classifier emits a score, you threshold it, you compare to the label. Four counts result; every classification metric below is a function of these four (or a curve over them across thresholds).
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
The derived rates that the rest of the lesson rests on:
| Name | Formula | Reading |
|---|---|---|
| Precision | TP / (TP + FP) | Of items the model flagged, how many were correct. The "if you alert, are you right?" rate. |
| Recall (sensitivity, TPR) | TP / (TP + FN) | Of all true positives in the world, how many did you catch. The "do you miss things?" rate. |
| Specificity (TNR) | TN / (TN + FP) | Of all true negatives, how many you correctly left alone. 1 − FPR. |
| F1 | 2PR / (P + R) | Harmonic mean of precision and recall. Drops fast when either is small — that's the whole point. |
| Fβ | (1 + β²) PR / (β² P + R) | Weighted harmonic mean with weight β² on recall and 1 on precision (so β = 2 weights recall 4×). Use β = 2 when missing positives is worse than false alarms (medical screening). |
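A minimal sketch of the table above as code. The function name `derived_rates` and the counts are hypothetical, chosen only to illustrate the formulas; empty denominators are mapped to 0.0 as a convention, not a rule.

```python
def derived_rates(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, specificity and F-beta from the four counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }

# Hypothetical counts: 1,000 examples, 50 actual positives.
counts = dict(tp=40, fp=40, fn=10, tn=910)
f1_view = derived_rates(**counts)            # beta = 1 gives F1
f2_view = derived_rates(**counts, beta=2.0)  # beta = 2 leans toward recall
print(f1_view)
print(f2_view["f_beta"])
```

Here recall (0.8) is higher than precision (0.5), so F2 comes out above F1: the recall-heavy weighting rewards the model for the side it is good at.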
Why accuracy is the metric you almost never want
Accuracy is (TP + TN) / N. On a dataset 99% negative, "always predict negative" scores 99% and is useless. Accuracy is defensible only when classes are roughly balanced and FP/FN costs are symmetric. Fraud, spam, click prediction, disease screening, ad targeting all violate both. Reporting accuracy on an imbalanced problem without comment is a junior tell.
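The 99%-negative example can be checked in a few lines. The dataset below is synthetic and exists only to illustrate the arithmetic.

```python
# Hypothetical dataset: 10,000 examples, 1% positive.
labels = [1] * 100 + [0] * 9900
always_negative = [0] * len(labels)  # the trivial "predict negative" model

accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The accuracy is high while recall on the positive class is zero, which is the gap the rest of the lesson is about.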
AUC — the rank-order metric
The ROC curve plots TPR (recall) vs FPR as the threshold varies. AUC is the area under it. Two key facts:
- Probabilistic interpretation. AUC = P(score(pos) > score(neg)) for a random pos/neg pair — the probability your model wins a random pairwise ranking.
- Rank-only. Depends on rank order alone, not absolute calibration. Multiply every score by 10, add 7 — same AUC.
These properties are a feature and a trap. Feature: TPR and FPR are each computed within a single class, so AUC is invariant to class imbalance: same AUC at 50/50 or 1% positives. Trap: an AUC of 0.85 sounds great, but the operating point you'd deploy may have terrible precision or terrible recall. AUC averages over all thresholds, including useless ones.
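Both facts about AUC can be checked directly from the pairwise definition. This is a sketch on synthetic Gaussian scores (an assumption, not data from this lesson); `pairwise_auc` is a hypothetical helper that is quadratic in the sample size, so it suits small examples only.

```python
import random

def pairwise_auc(scores, labels):
    """AUC as P(score(pos) > score(neg)); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
labels = [1] * 50 + [0] * 950  # 5% positives
scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]

auc = pairwise_auc(scores, labels)
# Rank-only: an increasing affine map (x10, +7) leaves every comparison unchanged.
auc_affine = pairwise_auc([10 * s + 7 for s in scores], labels)
print(auc, auc_affine)
```

The two values agree because only the ordering of positive versus negative scores enters the computation.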
PR-AUC — the imbalance-aware sibling
PR-AUC is the area under the precision-vs-recall curve. Unlike ROC-AUC, it is sensitive to the base rate: at 1% positives, a ROC-AUC of 0.85 can easily correspond to PR-AUC of 0.30 — same separability, but the precision side collapses because each FP is "loud" when positives are rare.
| Property | ROC-AUC | PR-AUC |
|---|---|---|
| Threshold-free | Yes | Yes |
| Invariant to class balance | Yes | No (base-rate dependent) |
| Reflects practical operating point on rare classes | Poorly | Well |
| Reflects useful negatives (TN) | Yes (via specificity) | No (ignored) |
| Random-baseline value | 0.5 | = positive base rate |
Rule of thumb: modest imbalance (10%+) → ROC-AUC is fine. Severe imbalance (fraud, rare-disease, cold-inventory ad clicks) → PR-AUC is the informative one.
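To see the base-rate dependence, one can hold separability fixed and change only the positive rate. The sketch below assumes synthetic Gaussian scores and uses average precision (mean precision at each positive hit) as one common estimator of PR-AUC; `roc_auc`, `average_precision` and `simulate` are hypothetical helper names.

```python
import random

def roc_auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """Step-wise area under the PR curve; assumes at least one positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k  # precision at this positive's rank
    return total / hits

def simulate(base_rate, n=20000, seed=0):
    """Same class separation every time; only the positive rate changes."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
    return roc_auc(scores, labels), average_precision(scores, labels)

roc_balanced, ap_balanced = simulate(0.5)
roc_rare, ap_rare = simulate(0.01)
print(f"50% positives: ROC-AUC={roc_balanced:.3f} PR-AUC={ap_balanced:.3f}")
print(f" 1% positives: ROC-AUC={roc_rare:.3f} PR-AUC={ap_rare:.3f}")
```

ROC-AUC should stay roughly constant across the two runs, while the PR-AUC estimate drops sharply at 1% positives.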
Log-loss and Brier — the calibration-sensitive metrics
AUC and PR-AUC only care about rank. They don't notice if probabilities are systematically too high or low. Two metrics close that gap.
| Metric | Formula (per example) | Behaviour |
|---|---|---|
| Log-loss | −[y log p̂ + (1 − y) log(1 − p̂)] | Goes to +∞ for confident wrong predictions. The actual training loss for logistic regression and most classifiers. Heavily punishes overconfidence. |
| Brier | (p̂ − y)² | Bounded, smoother. Less punitive on confident-wrong than log-loss. Murphy-decomposes into reliability − resolution + uncertainty. |
Both are proper scoring rules: minimized in expectation when p̂ = P(y = 1 | x). AUC is not — it cares only about ranks — which is why two models with the same AUC can have very different log-loss.
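A sketch of "same AUC, different log-loss" on synthetic data. Labels are drawn from the calibrated probabilities, then a monotone distortion (cubing the odds, an arbitrary choice for illustration) makes the predictions overconfident without changing their rank order. All names here are hypothetical.

```python
import math
import random

def log_loss(probs, labels, eps=1e-15):
    """Mean log-loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def brier(probs, labels):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def ranking(values):
    """Indices sorted by value; identical rankings imply identical AUC."""
    return sorted(range(len(values)), key=lambda i: values[i])

rng = random.Random(0)
calibrated = [rng.random() for _ in range(5000)]
labels = [1 if rng.random() < p else 0 for p in calibrated]  # y ~ Bernoulli(p)

# Monotone distortion: odds are cubed, pushing predictions toward 0 and 1.
overconfident = [p ** 3 / (p ** 3 + (1 - p) ** 3) for p in calibrated]

same_ranking = ranking(calibrated) == ranking(overconfident)
print("same ranking:", same_ranking)
print("log-loss:", log_loss(calibrated, labels), log_loss(overconfident, labels))
print("brier:   ", brier(calibrated, labels), brier(overconfident, labels))
```

Both proper scoring rules get worse under the distortion even though any rank-based metric is unchanged.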
Calibration — when probabilities are used downstream
Calibration asks "of all the times you predicted 0.7, did 70% happen?" Separate from ranking. Two diagnostics (also covered in search_ads_recsys / 05):
- Reliability diagram. Bin predictions by predicted probability; plot empirical rate vs predicted. Perfect = diagonal.
- ECE. Frequency-weighted average of |emp_bin − pred_bin|. Popular scalar; bin-count-sensitive.
Why this matters: probabilities feed downstream cost-sensitive decisions (τ* = cost_FP / (cost_FP + cost_FN)), ad auctions (bid × p̂), anomaly thresholds. A model with AUC 0.95 but predictions systematically 3× too high will overbid in every ad auction: it wins impressions it should not and overpays for them.
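ECE as described above can be sketched with equal-width bins. The bin count of 10 is an assumption (the result is sensitive to it), and the data is synthetic: true probabilities below 0.3, with a second set of predictions inflated 3× to mirror the ad-auction example.

```python
import random

def expected_calibration_error(probs, labels, n_bins=10):
    """Frequency-weighted mean |empirical rate - mean prediction| over bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_pred = sum(p for p, _ in members) / len(members)
        emp_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(emp_rate - mean_pred)
    return ece

rng = random.Random(0)
true_p = [0.3 * rng.random() for _ in range(5000)]
labels = [1 if rng.random() < p else 0 for p in true_p]
inflated = [3 * p for p in true_p]  # same ranking, probabilities 3x too high

ece_calibrated = expected_calibration_error(true_p, labels)
ece_inflated = expected_calibration_error(inflated, labels)
print(ece_calibrated, ece_inflated)
```

The inflated predictions rank examples identically, so AUC cannot tell the two apart; ECE can.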
Regression metrics — the parallel cheat sheet
| Metric | Favours | Failure mode |
|---|---|---|
| MSE / RMSE mean((ŷ−y)²) | Mean-optimal. | Outlier-sensitive; one bad prediction dominates. |
| MAE mean(|ŷ−y|) | Median-optimal; robust to outliers. | Not differentiable at 0 (Huber loss fixes both this and MSE's outlier sensitivity). |
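| MAPE mean(|ŷ−y|/|y|) | Scale-free percent error. | Blows up near y=0; asymmetric (for non-negative forecasts, under-prediction error is capped at 100% while over-prediction error is unbounded, so minimizing it biases forecasts low). |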
| SMAPE | Symmetric-ish percent error. | Still ill-defined when both small; bounded in [0,2]. |
| RMSLE / log-MAE | Multiplicative errors (revenue, demand). | Undefined for negatives; under-prediction penalised more. |
| R² 1 − SS_res/SS_tot | Variance explained, bounded above by 1 (but not comparable across datasets with different Var(y)). | Unbounded below; inflates on trended data. |
| Adjusted R² | Penalises noise features (added params). | Inherits R²'s trend problem. |
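A few of the failure modes in the table can be reproduced with toy numbers. The values below are made up for illustration, and `rmsle` uses the log(1 + y) convention, which is one common definition rather than the only one.

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    return sum(abs(a - b) / abs(a) for a, b in zip(y, yhat)) / len(y)

def rmsle(y, yhat):
    # log1p convention: requires y, yhat > -1.
    sq = [(math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, yhat)]
    return math.sqrt(sum(sq) / len(y))

# One bad prediction out of five dominates RMSE far more than MAE.
y_true = [10, 10, 10, 10, 10]
y_pred = [10, 10, 10, 10, 60]
outlier_rmse, outlier_mae = rmse(y_true, y_pred), mae(y_true, y_pred)

# An absolute error of 1.0 becomes a huge percent error when y is near zero.
near_zero_mape = mape([0.01], [1.01])

# Same absolute error of 50 on y = 100: RMSLE penalises the under-prediction more.
under_rmsle = rmsle([100], [50])
over_rmsle = rmsle([100], [150])

print(outlier_rmse, outlier_mae, near_zero_mape, under_rmsle, over_rmsle)
```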
Class imbalance — when it actually hurts
Pop intuition: "imbalance breaks everything, resample." Reality is more nuanced. Properly chosen metrics (AUC, PR-AUC, F1, log-loss) stay informative under imbalance. The real problems are upstream:
- Algorithm. Trees split on impurity → 99/1 means majority wins by default; unregularized logistic regression drifts toward the prior. GBDTs with proper loss + shrinkage are robust; naive Bayes is not.
- Sample size. The deeper issue is the absolute count of positives, not the ratio. 100 positives total → no resampling rescues you. 10⁵ positives at 0.1% → fine.
- Threshold. A well-trained probabilistic model on imbalanced data is often correct — the default 0.5 cut is what's wrong. Fix is threshold tuning, not retraining.
The four fixes — and which one to actually use
| Family | Mechanism | When it helps | When it hurts |
|---|---|---|---|
| Resample | Oversample minority (SMOTE — Chawla et al. 2002 — interpolates between minority neighbours); or undersample majority. | Algorithms that collapse on the minority class; quick-and-dirty. | Breaks calibration — probabilities reflect resampled prior. SMOTE invents points that may not lie on the true manifold. Fix: post-hoc calibration (Platt or isotonic) or analytic prior correction: p_true = (p_resampled · π) / (p_resampled · π + (1 − p_resampled) · (1 − π) · (π'/(1 − π'))), where π is the true positive prior and π' is the resampled prior. |
| Reweight loss | class_weight='balanced'; weights inversely proportional to frequency. Data unchanged, per-example penalty changes. | Proper scoring rules; cleaner than resampling. | Doesn't rescue too-few-positives. Calibration still needs a check. |
| Threshold tune | Train on natural distribution; pick threshold to minimize cost_FP·FP + cost_FN·FN (equivalently cost_FP·(1 − π)·FPR + cost_FN·π·FNR, with π the positive prior) or hit a P/R target. | Almost always. No retrain, works with any calibrated model. | Can't recover information the model never learned. |
| Switch metric | Stop reporting accuracy. Use PR-AUC, F1, class-weighted log-loss. | Always — costs nothing. | — |
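The analytic prior correction from the Resample row can be written as a small function. `correct_prior` is a hypothetical name; the formula is the one in the table, with `true_prior` as π and `resampled_prior` as π'.

```python
def correct_prior(p_resampled, true_prior, resampled_prior):
    """Map a probability learned under the resampled prior back to the true prior."""
    ratio = resampled_prior / (1 - resampled_prior)
    num = p_resampled * true_prior
    den = num + (1 - p_resampled) * (1 - true_prior) * ratio
    return num / den

# Model trained on a 50/50 resample of a problem with 1% true positives.
at_prior = correct_prior(0.5, true_prior=0.01, resampled_prior=0.5)
confident = correct_prior(0.9, true_prior=0.01, resampled_prior=0.5)
# No resampling (both priors equal): the probability is left unchanged.
unchanged = correct_prior(0.3, true_prior=0.2, resampled_prior=0.2)
print(at_prior, confident, unchanged)
```

A score equal to the resampled prior (0.5) maps back to the true prior (0.01), which is a quick sanity check on the formula.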
Cost-sensitive learning — the unifying frame
When FP and FN costs differ (missing a tumor ≫ a false alarm), inject the cost matrix in one of three places:
- Metric. Utility = −(cost_FP · FP + cost_FN · FN); pick the model and threshold that maximize it.
- Loss. Per-example weights = per-example cost.
- Threshold. Bayes-optimal cut on a calibrated probability: τ* = cost_FP / (cost_FP + cost_FN).
Threshold-tuning a calibrated proper-scoring-rule model is almost always equivalent to a cost-sensitive loss — and simpler. Senior take: proper scoring rule → calibrated probabilities → encode cost asymmetry at decision time.
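A sketch of encoding cost asymmetry at decision time. It assumes calibrated probabilities (simulated here by drawing labels from them) and hypothetical costs of 1 per false positive and 9 per false negative; the function names are illustrative.

```python
import random

def bayes_threshold(cost_fp, cost_fn):
    """Cost-optimal cut on a calibrated probability."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Realised cost of flagging every example with p >= threshold."""
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    return cost_fp * fp + cost_fn * fn

rng = random.Random(0)
probs = [rng.random() ** 3 for _ in range(5000)]        # skewed toward low p
labels = [1 if rng.random() < p else 0 for p in probs]  # calibrated by construction

cost_fp, cost_fn = 1.0, 9.0
tau = bayes_threshold(cost_fp, cost_fn)  # 1 / (1 + 9) = 0.1

cost_at_tau = total_cost(probs, labels, tau, cost_fp, cost_fn)
cost_at_half = total_cost(probs, labels, 0.5, cost_fp, cost_fn)

# Empirical check: grid-search the threshold that minimises realised cost.
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn))
print(tau, cost_at_tau, cost_at_half, best)
```

The default 0.5 cut pays for many missed positives here; the cost-derived threshold needs no retraining, only a different cut on the same scores. The grid-searched optimum is a noisy estimate and should land near τ* only when the probabilities are actually calibrated.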
Trade-offs at a glance
| Metric | Handles imbalance | Calibration-sensitive | Threshold-free | Rank-sensitive | Uses absolute p̂ |
|---|---|---|---|---|---|
| Accuracy | No | No | No (0.5) | Indirect | No |
| Precision / Recall | Partial | No | No | Indirect | No |
| F1 | Better | No | No | Indirect | No |
| ROC-AUC | Invariant | No | Yes | Yes | No |
| PR-AUC | Yes (base-rate aware) | No | Yes | Yes | No |
| Log-loss | Yes (reweighted) | Yes | Yes | Yes | Yes |
| Brier | Yes | Yes | Yes | Yes | Yes |
| ECE | Yes | Yes (directly) | Yes | No | Yes |
Interview prompts you should be ready for
- "Why is AUC misleading for highly imbalanced classes?" (Invariant to base rate — same AUC at 50% or 1%. PR-AUC is the informative single number when positives are rare.)
- "Walk through PR-AUC vs ROC-AUC. Pick which when?" (ROC summarises ranking; PR summarises the precision side, which users feel on rare-positive problems. Modest imbalance → ROC. Severe → PR.)
- "Your boss wants 'just one number.' What do you report?" (Calibrated probabilities downstream → log-loss. Ranking → ROC-AUC. Rare-positive alerting → PR-AUC or F1 at the deployed threshold. Push back: pair the number with base-rate and threshold.)
- "99% accuracy on a 99%-negative dataset. What's wrong?" (The trivial "always negative" baseline already hits 99%. Need precision/recall on positives, PR-AUC, or expected cost.)
- "SMOTE vs class weights vs threshold tuning — when each?" (Threshold tuning first; free, no retrain. Class weights when the algorithm collapses to majority. SMOTE last — changes the data distribution, must re-calibrate.)
- "Fraud model AUC 0.95 but team says it misses real fraud. Diagnose." (AUC averages over all thresholds. At the deployed cut, recall on high-value sub-segments may be low; high AUC may come from easy negatives. Check PR-AUC, segment-stratified recall, operating-point P/R.)
- "Why is log-loss the training loss but rarely the headline?" (Unbounded, base-rate-dependent, hard to communicate across datasets. AUC/PR-AUC are easier to compare. Log-loss belongs in every report — just not the single bullet.)