Evaluation & class imbalance

Picking the metric is half the modelling job. Almost every "good offline, bad in production" complaint is a metric-choice mistake or an imbalance mistake — usually both. First-principles tour of the metrics, when each one lies, and the right answer to imbalance in 2026.

The confusion matrix — four numbers, everything else is a ratio

A binary classifier emits a score, you threshold it, you compare to the label. Four counts result; every classification metric below is a function of these four (or a curve over them across thresholds).

	Predicted positive	Predicted negative
Actual positive	TP	FN
Actual negative	FP	TN

The four derived rates that the rest of the lesson rests on:

Name	Formula	Reading
Precision	TP / (TP + FP)	Of items the model flagged, how many were correct. The "if you alert, are you right?" rate.
Recall (sensitivity, TPR)	TP / (TP + FN)	Of all true positives in the world, how many did you catch. The "do you miss things?" rate.
Specificity (TNR)	TN / (TN + FP)	Of all true negatives, how many you correctly left alone. 1 − FPR.
F1	2PR / (P + R)	Harmonic mean of precision and recall. Drops fast when either is small — that's the whole point.
F_β	(1 + β²) PR / (β² P + R)	Weights recall β² times as much as precision (so β=2 weights recall 4×). β = 2 for "missing positives is worse than false alarms" (medical screening).
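
A minimal sketch of the table above as code, assuming you already have the four counts. The function name `rates` and the example counts are hypothetical, chosen only for illustration.

```python
def rates(tp, fp, fn, tn, beta=1.0):
    """Derive the table's rates from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }

# Hypothetical counts: 1,000 examples, 50 actual positives.
counts = dict(tp=40, fp=40, fn=10, tn=910)
f1_view = rates(**counts)             # beta = 1 gives F1
f2_view = rates(**counts, beta=2.0)   # beta = 2 leans toward recall

print(f1_view)
print("F2:", round(f2_view["f_beta"], 4))
```

With recall (0.8) higher than precision (0.5) in this example, F2 comes out above F1, which is the recall-favouring behaviour the table describes.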

The harmonic-mean trick

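F1 uses the harmonic mean specifically so a 0 on either P or R produces a 0 F1. Arithmetic mean of P = 1.0, R = 0.0 is 0.5, which looks fine; harmonic mean is 0. That blocks the cheat of maxing one side while ignoring the other. It is not full immunity to the trivial "predict-everything-positive" model: that model gets R = 1 and P = base rate π, so F1 = 2π / (1 + π), which collapses only when positives are rare.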
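
A quick numeric check of the harmonic-mean behaviour, as a sketch. The helper names and the three base rates are illustrative assumptions; the all-positive case uses the fact that such a model has recall 1 and precision equal to the base rate.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def arithmetic(p, r):
    return (p + r) / 2

# Degenerate case from the text: perfect precision, zero recall.
print(arithmetic(1.0, 0.0), f1(1.0, 0.0))  # 0.5 0.0

# "Predict everything positive": recall = 1, precision = base rate.
all_positive_f1 = {rate: f1(rate, 1.0) for rate in (0.5, 0.1, 0.01)}
for rate, score in all_positive_f1.items():
    print(f"base rate {rate}: F1 = {score:.4f}")
```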

Why accuracy is the metric you almost never want

Accuracy is (TP + TN) / N. On a dataset 99% negative, "always predict negative" scores 99% and is useless. Accuracy is defensible only when classes are roughly balanced and FP/FN costs are symmetric. Fraud, spam, click prediction, disease screening, ad targeting all violate both. Reporting accuracy on an imbalanced problem without comment is a junior tell.

AUC — the rank-order metric

The ROC curve plots TPR (recall) vs FPR as the threshold varies. AUC is the area under it. Two key facts:

Probabilistic interpretation. AUC = P(score(pos) > score(neg)) for a random pos/neg pair — the probability your model wins a random pairwise ranking.
Rank-only. Depends on rank order alone, not absolute calibration. Multiply every score by 10, add 7 — same AUC.

These properties are a feature and a trap. Feature: TPR and FPR are each computed within a single class, so AUC is invariant to class imbalance (same AUC at 50/50 or 1%). Trap: AUC 0.85 sounds great, but the operating point you'd deploy may have terrible precision or terrible recall. AUC averages over all thresholds, including useless ones.
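
Both key facts can be checked directly. The sketch below computes AUC as the fraction of positive/negative pairs ranked correctly (ties counted as half), then confirms that the "multiply by 10, add 7" transform leaves it unchanged. The Gaussian scores are synthetic and the function name `pairwise_auc` is hypothetical.

```python
import random

def pairwise_auc(scores, labels):
    """P(score(pos) > score(neg)) over all pos/neg pairs; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
labels = [1] * 200 + [0] * 800
# Synthetic scores: positives are shifted up by 1 on average.
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

auc_raw = pairwise_auc(scores, labels)
auc_affine = pairwise_auc([10 * s + 7 for s in scores], labels)
print(f"raw: {auc_raw:.4f}  after 10*s + 7: {auc_affine:.4f}")
```

The O(n_pos × n_neg) loop is fine for a demonstration; a rank-based formula gives the same number in O(n log n).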

PR-AUC — the imbalance-aware sibling

PR-AUC is the area under the precision-vs-recall curve. Unlike ROC-AUC, it is sensitive to the base rate: at 1% positives, a ROC-AUC of 0.85 can easily correspond to a PR-AUC of 0.30 or lower. Separability is the same, but the precision side collapses because each FP is "loud" when positives are rare.

Property	ROC-AUC	PR-AUC
Threshold-free	Yes	Yes
Invariant to class balance	Yes	No (base-rate dependent)
Reflects practical operating point on rare classes	Poorly	Well
Reflects useful negatives (TN)	Yes (via specificity)	No (ignored)
Random-baseline value	0.5	= positive base rate

Rule of thumb: modest imbalance (10%+) → ROC-AUC is fine. Severe imbalance (fraud, rare-disease, cold-inventory ad clicks) → PR-AUC is the informative one.
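
To see the base-rate dependence concretely, here is a sketch that holds separability fixed and varies only the positive rate. It assumes scores drawn from two unit-variance Gaussians (an illustrative choice, not a claim about real models) and uses average precision as the PR-AUC estimate. `simulate` and its `shift` parameter are hypothetical names.

```python
import random

def roc_auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """Step-wise area under the PR curve: mean precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k
    return total / hits

def simulate(base_rate, n=20000, shift=1.5, seed=0):
    """Same separability every time; only the share of positives changes."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    scores = [rng.gauss(shift * y, 1.0) for y in labels]
    return roc_auc(scores, labels), average_precision(scores, labels)

results = {rate: simulate(rate) for rate in (0.5, 0.1, 0.01)}
for rate, (roc, ap) in results.items():
    print(f"base rate {rate:>4}: ROC-AUC {roc:.3f}  PR-AUC (AP) {ap:.3f}")
```

ROC-AUC should stay roughly flat across the three runs while average precision drops as positives get rarer.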

Log-loss and Brier — the calibration-sensitive metrics

AUC and PR-AUC only care about rank. They don't notice if probabilities are systematically too high or low. Two metrics close that gap.

Metric	Formula (per example)	Behaviour
Log-loss	−[y log p̂ + (1 − y) log(1 − p̂)]	Goes to +∞ for confident wrong predictions. The actual training loss for logistic regression and most classifiers. Heavily punishes overconfidence.
Brier	(p̂ − y)²	Bounded in [0, 1], smoother. Less punitive on confident-wrong than log-loss. Murphy decomposition: reliability − resolution + uncertainty (resolution enters with a minus sign, so more resolution lowers the score).

Both are proper scoring rules: minimized in expectation when p̂ = P(y = 1 | x). AUC is not — it cares only about ranks — which is why two models with the same AUC can have very different log-loss.
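
A small sketch of "same AUC, different log-loss". The labels and probabilities are made up; model B is model A pushed through a monotone transform (p to the power 0.2), which preserves rank order but inflates every probability.

```python
import math

def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def brier(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def auc(probs, labels):
    """Fraction of pos/neg pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    return sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 0, 1, 0, 0]
model_a = [0.8, 0.3, 0.7, 0.2, 0.1, 0.35, 0.4, 0.25]
# Hypothetical model B: identical ranking, every probability inflated.
model_b = [p ** 0.2 for p in model_a]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: AUC {auc(probs, labels):.3f}  "
          f"log-loss {log_loss(probs, labels):.3f}  Brier {brier(probs, labels):.3f}")
```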

Calibration — when probabilities are used downstream

Calibration asks "of all the times you predicted 0.7, did 70% happen?" Separate from ranking. Two diagnostics (also covered in search_ads_recsys / 05):

Reliability diagram. Bin predictions by predicted probability; plot empirical rate vs predicted. Perfect = diagonal.
ECE. Frequency-weighted average of |emp_bin − pred_bin|. Popular scalar; bin-count-sensitive.
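
A minimal ECE sketch, assuming equal-width bins and the frequency-weighted definition above. The data is synthetic: labels are drawn from the true probabilities, then scored once with those probabilities and once with every prediction tripled. The function name is hypothetical.

```python
import random

def expected_calibration_error(probs, labels, n_bins=10):
    """Frequency-weighted mean |empirical rate - mean predicted| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_pred = sum(p for p, _ in members) / len(members)
        emp_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(emp_rate - mean_pred)
    return ece

rng = random.Random(0)
true_p = [0.3 * rng.random() for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in true_p]

ece_calibrated = expected_calibration_error(true_p, labels)
# Same ranking, every probability 3x too high.
ece_inflated = expected_calibration_error([3 * p for p in true_p], labels)
print(f"calibrated: {ece_calibrated:.3f}  inflated 3x: {ece_inflated:.3f}")
```

Tripling is a monotone transform, so the two score sets have the same AUC; only the calibration diagnostic separates them. Changing `n_bins` will move the number, which is the bin-count sensitivity noted above.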

Why this matters: probabilities feed downstream cost-sensitive decisions (τ* = cost_FP / (cost_FP + cost_FN)), ad auctions (bid × p̂), anomaly thresholds. A model with AUC 0.95 but predictions systematically 3× too high will overbid in every ad auction (bid × p̂ is inflated 3×) and overpay for the impressions it wins.

Regression metrics — the parallel cheat sheet

Metric	Favours	Failure mode
MSE / RMSE mean((ŷ−y)²)	Mean-optimal.	Outlier-sensitive; one bad prediction dominates.
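MAE mean(|ŷ−y|)	Median-optimal; robust to outliers.	Not differentiable at 0 (Huber loss is smooth at 0 and still robust to outliers, addressing both this and MSE's outlier problem).
MAPE mean(|ŷ−y|/|y|)	Scale-free percent error.	Blows up near y=0; asymmetric (for non-negative forecasts, under-prediction is capped at 100% while over-prediction is unbounded, so it rewards under-forecasting).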
SMAPE	Symmetric-ish percent error.	Still ill-defined when both small; bounded in [0,2].
RMSLE / log-MAE	Multiplicative errors (revenue, demand).	Undefined for negatives; under-prediction penalised more.
R² 1 − SS_res/SS_tot	Variance explained, bounded above by 1 (but not comparable across datasets with different Var(y)).	Unbounded below; inflates on trended data.
Adjusted R²	Penalises noise features (added params).	Inherits R²'s trend problem.
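
The asymmetries in the table are easy to verify numerically. The sketch below implements several of the metrics in plain Python. It assumes the SMAPE variant with (|y| + |ŷ|) / 2 in the denominator and the log1p form of RMSLE; other definitions exist. The single-point examples are made up.

```python
import math

def mae(y, yhat):
    return sum(abs(b - a) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(y, yhat)) / len(y))

def mape(y, yhat):
    # Undefined (division by zero) when any actual value is 0.
    return sum(abs(b - a) / abs(a) for a, b in zip(y, yhat)) / len(y)

def smape(y, yhat):
    # Range [0, 2]; a term with y = yhat = 0 is counted as 0 here by convention.
    terms = [
        0.0 if a == b == 0 else abs(b - a) / ((abs(a) + abs(b)) / 2)
        for a, b in zip(y, yhat)
    ]
    return sum(terms) / len(y)

def rmsle(y, yhat):
    # Uses log1p, so every value must be greater than -1.
    sq = [(math.log1p(b) - math.log1p(a)) ** 2 for a, b in zip(y, yhat)]
    return math.sqrt(sum(sq) / len(y))

actual = [100.0]
# MAPE: the worst non-negative under-forecast (0) vs an over-forecast of 300.
print("MAPE  under vs over:", mape(actual, [0.0]), mape(actual, [300.0]))
# RMSLE: the same absolute miss of 50, below vs above the actual value.
print("RMSLE under vs over:", rmsle(actual, [50.0]), rmsle(actual, [150.0]))
```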

R²-on-time-series trap

A "predict yesterday's value" model gets high R² on a trending series because SS_tot is dominated by the long-run trend, not by the day-to-day deviations you should be predicting. Report R² on residuals after detrending, use MAE/RMSE directly, or score against a naïve-persistence baseline.
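
A sketch of the trap on a synthetic series (linear trend plus i.i.d. noise, an assumption made only for illustration). The same persistence forecast is scored twice: on levels, and on day-to-day changes, where predicting "same as yesterday" means predicting a change of zero.

```python
import random

def r2(y, yhat):
    """1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

rng = random.Random(0)
# Hypothetical series: slope 0.5 per step, noise standard deviation 2.
series = [0.5 * t + rng.gauss(0, 2.0) for t in range(500)]

actual = series[1:]
persistence = series[:-1]  # "predict yesterday's value"
r2_levels = r2(actual, persistence)

# Score the same forecast on changes: persistence predicts a change of 0.
changes = [b - a for a, b in zip(series[:-1], series[1:])]
r2_changes = r2(changes, [0.0] * len(changes))

print(f"R² on levels:  {r2_levels:.4f}")
print(f"R² on changes: {r2_changes:.4f}")
```

On changes, a constant-zero forecast can never beat the mean of the changes, so R² is at most 0 there, while the level-based R² is dominated by the trend in SS_tot.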

Class imbalance — when it actually hurts

Pop intuition: "imbalance breaks everything, resample." Reality is more nuanced. Properly chosen metrics (AUC, PR-AUC, F1, log-loss) stay informative under imbalance, even though PR-AUC and F1 shift with the base rate. The real problems are upstream:

Algorithm. Trees split on impurity → 99/1 means majority wins by default; unregularized logistic regression drifts toward the prior. GBDTs with proper loss + shrinkage are robust; naive Bayes is not.
Sample size. The deeper issue is the absolute count of positives, not the ratio. 100 positives total → no resampling rescues you. 10⁵ positives at 0.1% → fine.
Threshold. A well-trained probabilistic model on imbalanced data is often correct — the default 0.5 cut is what's wrong. Fix is threshold tuning, not retraining.

The four fixes — and which one to actually use

Family	Mechanism	When it helps	When it hurts
Resample	Oversample minority (SMOTE — Chawla et al. 2002 — interpolates between minority neighbours); or undersample majority.	Algorithms that collapse on the minority class; quick-and-dirty.	Breaks calibration — probabilities reflect resampled prior. SMOTE invents points that may not lie on the true manifold. Fix: post-hoc calibration (Platt or isotonic) or analytic prior correction: p_true = (p_resampled · π) / (p_resampled · π + (1 − p_resampled) · (1 − π) · (π'/(1 − π'))), where π is the true positive prior and π' is the resampled prior.
Reweight loss	`class_weight='balanced'`; weights inversely proportional to frequency. Data unchanged, per-example penalty changes.	Proper scoring rules — cleaner than resampling.	Doesn't rescue too-few-positives. Calibration still needs a check.
Threshold tune	Train on natural distribution; pick threshold to minimize cost_FP·FP + cost_FN·FN on validation (equivalently cost_FP·FPR·(1 − π) + cost_FN·FNR·π, with π the positive base rate) or hit a P/R target.	Almost always. No retrain, works with any calibrated model.	Can't recover information the model never learned.
Switch metric	Stop reporting accuracy. Use PR-AUC, F1, class-weighted log-loss.	Always — costs nothing.	—
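
The analytic prior correction from the Resample row, written out as a sketch. The function and argument names are hypothetical; the arithmetic follows the formula in the table term by term.

```python
def correct_prior(p_resampled, true_prior, resampled_prior):
    """Map a probability learned under the resampled prior back to the true prior.

    p_true = p*pi / (p*pi + (1 - p)*(1 - pi) * pi'/(1 - pi'))
    with p = p_resampled, pi = true_prior, pi' = resampled_prior.
    """
    ratio = resampled_prior / (1 - resampled_prior)
    numerator = p_resampled * true_prior
    denominator = numerator + (1 - p_resampled) * (1 - true_prior) * ratio
    return numerator / denominator

# Hypothetical setup: model trained on a 50/50 resample, true positive rate 1%.
for p in (0.1, 0.5, 0.9, 0.99):
    print(f"resampled p = {p:<4} -> corrected p = {correct_prior(p, 0.01, 0.5):.4f}")
```

Two sanity checks worth keeping next to any implementation: when the two priors are equal the probability is returned unchanged, and a score of 0.5 from a 50/50-trained model maps back to the true base rate.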

The 2026 default

Train on natural distribution with a proper scoring rule (log-loss). Evaluate with PR-AUC or class-weighted F1. Pick the operating threshold that minimizes expected cost on validation. Resampling is a fallback for when the model itself collapses on the minority — usually an algorithm choice (raw trees, unregularized linear) you should reconsider before reaching for SMOTE.

Cost-sensitive learning — the unifying frame

When FP and FN costs differ (missing a tumor ≫ a false alarm), inject the cost matrix in one of three places:

Metric. Utility = −(cost_FP · FP + cost_FN · FN); pick the model and threshold that maximize it.
Loss. Per-example weights = per-example cost.
Threshold. Bayes-optimal cut on a calibrated probability: τ* = cost_FP / (cost_FP + cost_FN).

Threshold-tuning a calibrated proper-scoring-rule model is almost always equivalent to a cost-sensitive loss — and simpler. Senior take: proper scoring rule → calibrated probabilities → encode cost asymmetry at decision time.
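
A sketch of the threshold route, assuming calibrated scores. The validation set is synthetic (labels drawn from the scores themselves, so calibration holds by construction), and the 1:9 cost ratio is a made-up example. It compares the default 0.5 cut, the analytic τ*, and a threshold picked by grid search on total cost.

```python
import random

def bayes_threshold(cost_fp, cost_fn):
    """Cost-minimizing cut on a calibrated probability."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """cost_FP * FP + cost_FN * FN when predicting positive iff p >= threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

rng = random.Random(0)
# Hypothetical validation set with calibrated scores: y ~ Bernoulli(p).
probs = [rng.random() ** 3 for _ in range(5000)]
labels = [1 if rng.random() < p else 0 for p in probs]

cost_fp, cost_fn = 1.0, 9.0
tau_star = bayes_threshold(cost_fp, cost_fn)
grid = [i / 100 for i in range(1, 100)]
tau_tuned = min(grid, key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn))

for name, t in [("default", 0.5), ("analytic", tau_star), ("grid-tuned", tau_tuned)]:
    cost = total_cost(probs, labels, t, cost_fp, cost_fn)
    print(f"{name:>10}: threshold {t:.2f}, total cost {cost:.0f}")
```

If the grid-tuned threshold lands far from τ* on real data, that is itself a hint that the probabilities are not calibrated.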

Trade-offs at a glance

Metric	Informative under imbalance	Sensitive to calibration	Threshold-free	Sensitive to ranking	Uses absolute p̂
Accuracy	No	No	No (0.5)	Indirect	No
Precision / Recall	Partial	No	No	Indirect	No
F1	Better	No	No	Indirect	No
ROC-AUC	Invariant	No	Yes	Yes	No
PR-AUC	Yes (base-rate aware)	No	Yes	Yes	No
Log-loss	Yes (reweighted)	Yes	Yes	Yes	Yes
Brier	Yes	Yes	Yes	Yes	Yes
ECE	Yes	Yes (directly)	Yes	No	Yes

Interview prompts you should be ready for

"Why is AUC misleading for highly imbalanced classes?" (Invariant to base rate — same AUC at 50% or 1%. PR-AUC is the informative single number when positives are rare.)
"Walk through PR-AUC vs ROC-AUC. Pick which when?" (ROC summarises ranking; PR summarises the precision side, which users feel on rare-positive problems. Modest imbalance → ROC. Severe → PR.)
"Your boss wants 'just one number.' What do you report?" (Calibrated probabilities downstream → log-loss. Ranking → ROC-AUC. Rare-positive alerting → PR-AUC or F1 at the deployed threshold. Push back: pair the number with base-rate and threshold.)
"99% accuracy on a 99%-negative dataset. What's wrong?" (The trivial "always negative" baseline already hits 99%. Need precision/recall on positives, PR-AUC, or expected cost.)
"SMOTE vs class weights vs threshold tuning — when each?" (Threshold tuning first; free, no retrain. Class weights when the algorithm collapses to majority. SMOTE last — changes the data distribution, must re-calibrate.)
"Fraud model AUC 0.95 but team says it misses real fraud. Diagnose." (AUC averages over all thresholds. At the deployed cut, recall on high-value sub-segments may be low; high AUC may come from easy negatives. Check PR-AUC, segment-stratified recall, operating-point P/R.)
"Why is log-loss the training loss but rarely the headline?" (Unbounded, base-rate-dependent, hard to communicate across datasets. AUC/PR-AUC are easier to compare. Log-loss belongs in every report — just not the single bullet.)

Takeaway

Pick the metric that matches the decision your model feeds. Accuracy almost never qualifies. ROC-AUC for ranking on balanced-ish data; PR-AUC for rare positives; log-loss/Brier when probabilities are used downstream; calibration as a separate axis. For imbalance: natural distribution + proper scoring rule + base-rate-aware metric + threshold-tune for cost. SMOTE is a fallback, not a default.