Evaluation & class imbalance

Picking the metric is half the modelling job. Almost every "good offline, bad in production" complaint is a metric-choice mistake or an imbalance mistake — usually both. First-principles tour of the metrics, when each one lies, and the right answer to imbalance in 2026.

The confusion matrix — four numbers, everything else is a ratio

A binary classifier emits a score, you threshold it, you compare to the label. Four counts result; every classification metric below is a function of these four (or a curve over them across thresholds).

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

The four derived rates that the rest of the lesson rests on:

| Name | Formula | Reading |
|---|---|---|
| Precision | TP / (TP + FP) | Of items the model flagged, how many were correct. The "if you alert, are you right?" rate. |
| Recall (sensitivity, TPR) | TP / (TP + FN) | Of all true positives in the world, how many did you catch. The "do you miss things?" rate. |
| Specificity (TNR) | TN / (TN + FP) | Of all true negatives, how many you correctly left alone. Equals 1 − FPR. |
| F1 | 2PR / (P + R) | Harmonic mean of precision and recall. Drops fast when either is small; that's the whole point. |
| Fβ | (1 + β²) PR / (β² P + R) | Weights recall β² times as much as precision (so β = 2 weights recall 4×). Use β = 2 when missing positives is worse than false alarms (medical screening). |

The harmonic-mean trick
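F1 uses the harmonic mean specifically so that a near-zero P or R drags F1 to near zero. The arithmetic mean of P = 1.0, R = 0.0 is 0.5, which looks fine; the harmonic mean is 0. That is what stops a degenerate classifier from scoring well by maxing out one side. F1 is not fully immune to the trivial "predict-everything-positive" cheat, though: that gives R = 1 and P = base rate π, so F1 = 2π / (1 + π), which is about 0.02 at 1% positives but 0.67 on balanced data.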
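A minimal sketch of the four counts and the derived rates, in plain Python so every ratio is visible. The toy labels and scores are made up for illustration, and the helper names (`confusion_counts`, `safe_div`, `f_beta`) are hypothetical, not from any library. Returning 0.0 on a zero denominator is a convention chosen here, not a rule.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (TP, FP, FN, TN) for binary labels and scores cut at `threshold`."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num, den):
    """Ratio that returns 0.0 when the denominator is 0 (a convention)."""
    return num / den if den else 0.0


def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    return safe_div((1 + b2) * precision * recall, b2 * precision + recall)


# Toy data (illustrative only): 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.6, 0.2, 0.2, 0.1, 0.1, 0.05]

tp, fp, fn, tn = confusion_counts(y_true, scores, threshold=0.5)
precision = safe_div(tp, tp + fp)    # 2 / 4
recall = safe_div(tp, tp + fn)       # 2 / 3
specificity = safe_div(tn, tn + fp)  # 5 / 7
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)  # leans toward recall

# Harmonic vs arithmetic mean when one side is zero.
arithmetic = (1.0 + 0.0) / 2      # 0.5, looks fine
harmonic = f_beta(1.0, 0.0)       # 0.0, does not

print(tp, fp, fn, tn)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall exceeds precision in this toy set, F2 comes out higher than F1, which is the β-weighting doing its job.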

Why accuracy is the metric you almost never want

Accuracy is (TP + TN) / N. On a dataset 99% negative, "always predict negative" scores 99% and is useless. Accuracy is defensible only when classes are roughly balanced and FP/FN costs are symmetric. Fraud, spam, click prediction, disease screening, ad targeting all violate both. Reporting accuracy on an imbalanced problem without comment is a junior tell.
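The 99%-negative case can be checked in a few lines. This is an illustrative sketch with a made-up dataset of 1,000 examples; the "model" is simply a constant prediction.

```python
# 1,000 examples at a 1% positive rate: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990

# A "model" that never flags anything.
y_pred = [0] * len(y_true)

correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
accuracy = correct / len(y_true)

tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall   = {recall:.2f}")    # 0.00
```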

AUC — the rank-order metric

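The ROC curve plots TPR (recall) against FPR as the threshold varies. AUC is the area under it. Two key facts:

  1. Rank interpretation. AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count half): 0.5 is random ordering, 1.0 is perfect.
  2. Base-rate invariance. TPR and FPR are each computed within a single class, so AUC depends only on how the two score distributions overlap, not on the class ratio, and it summarises every threshold at once.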

The second property is a feature and a trap. Feature: AUC is invariant to class imbalance — same AUC at 50/50 or 1%. Trap: AUC 0.85 sounds great, but the operating point you'd deploy may have terrible precision or terrible recall. AUC averages over all thresholds, including useless ones.
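A small sketch of both sides of that trade, assuming AUC is computed as the fraction of (positive, negative) pairs ranked correctly. The data are invented, and `auc_pairwise` and `precision_at` are hypothetical helper names. Replicating the negatives changes the class ratio without changing the score distributions, so AUC holds still while precision at a fixed cut does not.

```python
def auc_pairwise(y_true, scores):
    """AUC as P(score_pos > score_neg), ties counted as 1/2. O(n_pos * n_neg)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at(y_true, scores, threshold):
    """Precision among examples scored at or above `threshold`."""
    flagged = [y for y, s in zip(y_true, scores) if s >= threshold]
    return sum(flagged) / len(flagged)


# Roughly balanced toy set: 3 positives, 4 negatives.
y_bal = [1, 1, 1, 0, 0, 0, 0]
s_bal = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Same score distributions, but every negative replicated 10x.
y_imb = [1, 1, 1] + [0] * 40
s_imb = [0.9, 0.6, 0.3] + [0.7, 0.4, 0.2, 0.1] * 10

auc_bal = auc_pairwise(y_bal, s_bal)
auc_imb = auc_pairwise(y_imb, s_imb)
prec_bal = precision_at(y_bal, s_bal, 0.5)
prec_imb = precision_at(y_imb, s_imb, 0.5)

print(f"AUC:            {auc_bal:.3f} vs {auc_imb:.3f}")
print(f"precision@0.5:  {prec_bal:.3f} vs {prec_imb:.3f}")
```

The pairwise loop is quadratic and only meant to make the definition concrete; a rank-based formula gives the same number in O(n log n).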

PR-AUC — the imbalance-aware sibling

PR-AUC is the area under the precision-vs-recall curve. Unlike ROC-AUC, it is sensitive to the base rate: at 1% positives, a ROC-AUC of 0.85 can easily correspond to PR-AUC of 0.30 — same separability, but the precision side collapses because each FP is "loud" when positives are rare.

| Property | ROC-AUC | PR-AUC |
|---|---|---|
| Threshold-free | Yes | Yes |
| Invariant to class balance | Yes | No (base-rate dependent) |
| Reflects practical operating point on rare classes | Poorly | Well |
| Reflects useful negatives (TN) | Yes (via specificity) | No (ignored) |
| Random-baseline value | 0.5 | Positive base rate |

Rule of thumb: modest imbalance (10%+) → ROC-AUC is fine. Severe imbalance (fraud, rare-disease, cold-inventory ad clicks) → PR-AUC is the informative one.

[Figure: same model at three class balances, with ROC curves (FPR vs TPR) above PR curves (recall vs precision). Separability is identical across all six panels; only the positive base rate changes. ROC-AUC barely moves while PR-AUC collapses: 50/50 balanced, ROC-AUC 0.85, PR-AUC 0.87; 10/90 moderate, ROC-AUC 0.84, PR-AUC 0.52; 1/99 severe, ROC-AUC 0.85, PR-AUC 0.18.]
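The base-rate effect can be reproduced with a small simulation. This is a sketch under stated assumptions: scores for each class are drawn from unit-variance Gaussians whose means differ by `SEPARATION` (a value picked here to land near ROC-AUC 0.85), PR-AUC is approximated by step-wise average precision, and the helper names are hypothetical. In practice scikit-learn's `roc_auc_score` and `average_precision_score` would be the usual tools.

```python
import random


def roc_auc(y_true, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean of precision evaluated at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def simulate(n, base_rate, separation, rng):
    """Labels at the given base rate; scores ~ N(separation, 1) or N(0, 1)."""
    n_pos = int(round(n * base_rate))
    y = [1] * n_pos + [0] * (n - n_pos)
    s = [rng.gauss(separation if label else 0.0, 1.0) for label in y]
    return y, s


rng = random.Random(0)
SEPARATION = 1.47  # assumption: chosen so ROC-AUC is close to 0.85

results = {}
for rate in (0.5, 0.1, 0.01):
    y, s = simulate(50_000, rate, SEPARATION, rng)
    results[rate] = (roc_auc(y, s), average_precision(y, s))
    auc, ap = results[rate]
    print(f"base rate {rate:>4}: ROC-AUC={auc:.3f}  PR-AUC={ap:.3f}")
```

The exact PR-AUC values depend on the score distributions assumed here, so they will not match the figure's numbers; the pattern (flat ROC-AUC, falling PR-AUC) is the point.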

Log-loss and Brier — the calibration-sensitive metrics

AUC and PR-AUC only care about rank. They don't notice if probabilities are systematically too high or low. Two metrics close that gap.

| Metric | Formula (per example) | Behaviour |
|---|---|---|
| Log-loss | −[y log p̂ + (1 − y) log(1 − p̂)] | Goes to +∞ for confident wrong predictions. The actual training loss for logistic regression and most probabilistic classifiers. Heavily punishes overconfidence. |
| Brier | (p̂ − y)² | Bounded, smoother. Less punitive on confident-wrong predictions than log-loss. Murphy-decomposes into reliability − resolution + uncertainty. |

Both are proper scoring rules: minimized in expectation when p̂ = P(y = 1 | x). AUC is not — it cares only about ranks — which is why two models with the same AUC can have very different log-loss.
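A short sketch of the "same AUC, different log-loss" point. Model B is model A pushed through a strictly increasing "sharpening" transform, so the ranking (and AUC) is unchanged while the probabilities become overconfident. The data, the transform, and the exponent `k=4` are all illustrative assumptions.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def brier(y_true, probs):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def sharpen(p, k=4):
    """Strictly increasing map that pushes p toward 0 or 1 (overconfidence)."""
    return p ** k / (p ** k + (1 - p) ** k)


y_true = [1, 1, 0, 1, 0, 0, 0, 0]
probs_a = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1]
probs_b = [sharpen(p) for p in probs_a]  # same order, more extreme values

auc_a, auc_b = auc_pairwise(y_true, probs_a), auc_pairwise(y_true, probs_b)
ll_a, ll_b = log_loss(y_true, probs_a), log_loss(y_true, probs_b)
br_a, br_b = brier(y_true, probs_a), brier(y_true, probs_b)

print(f"AUC      A={auc_a:.3f}  B={auc_b:.3f}")
print(f"log-loss A={ll_a:.3f}  B={ll_b:.3f}")
print(f"Brier    A={br_a:.3f}  B={br_b:.3f}")
```

Model B pays for its two confidently wrong predictions on both log-loss and Brier, and AUC cannot see the difference.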

Calibration — when probabilities are used downstream

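Calibration asks "of all the times you predicted 0.7, did 70% happen?" It is a separate question from ranking. Two diagnostics (also covered in search_ads_recsys / 05):

  1. Reliability diagram. Bin predictions by p̂ and compare each bin's mean prediction with its observed positive rate; a calibrated model sits on the diagonal.
  2. Expected calibration error (ECE). The count-weighted average of |observed rate − mean p̂| across those bins, a single-number summary of the diagram.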

Why this matters: probabilities feed downstream cost-sensitive decisions (flag when p̂ ≥ τ* = cost_FP / (cost_FP + cost_FN)), ad auctions (bid × p̂), and anomaly thresholds. A model with AUC 0.95 whose predictions are systematically 3× too high still ranks well, but it overbids by 3× in every ad auction it enters and loses money on the impressions it wins.
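A minimal sketch of a binned calibration check, assuming equal-width bins and a count-weighted expected calibration error (ECE). The data are synthetic: labels are drawn from known probabilities, and the "inflated" model multiplies those probabilities by 3, mirroring the 3×-too-high example. The function names are hypothetical; scikit-learn's `calibration_curve` covers the binning step in practice.

```python
import random


def reliability_bins(y_true, probs, n_bins=10):
    """Equal-width bins on [0, 1]; (mean predicted, observed rate, count) per non-empty bin."""
    acc = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        acc[b][0] += p
        acc[b][1] += y
        acc[b][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in acc if n > 0]


def expected_calibration_error(y_true, probs, n_bins=10):
    """ECE = sum over bins of (bin share) * |observed rate - mean predicted|."""
    total = len(y_true)
    return sum(n / total * abs(obs - pred)
               for pred, obs, n in reliability_bins(y_true, probs, n_bins))


rng = random.Random(0)
true_p = [rng.uniform(0.0, 0.3) for _ in range(20_000)]
y = [1 if rng.random() < p else 0 for p in true_p]

calibrated = true_p
inflated = [3.0 * p for p in true_p]  # same ranking, probabilities 3x too high

ece_calibrated = expected_calibration_error(y, calibrated)
ece_inflated = expected_calibration_error(y, inflated)

print(f"ECE calibrated = {ece_calibrated:.3f}")
print(f"ECE inflated   = {ece_inflated:.3f}")
for pred, obs, n in reliability_bins(y, inflated):
    print(f"  predicted {pred:.2f}  observed {obs:.2f}  n={n}")
```

Both score vectors rank the examples identically, so their AUC is the same; only the calibration check separates them.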

Regression metrics — the parallel cheat sheet

| Metric | Favours | Failure mode |
|---|---|---|
| MSE / RMSE: mean((ŷ − y)²) | Mean-optimal. | Outlier-sensitive; one bad prediction dominates. |
| MAE: mean(\|ŷ − y\|) | Median-optimal; robust to outliers. | Not differentiable at 0 (Huber loss is smooth at 0 and also tames MSE's outlier sensitivity). |
| MAPE: mean(\|ŷ − y\| / \|y\|) | Scale-free percent error. | Blows up near y = 0; asymmetric (for non-negative forecasts, under-prediction error is capped at 100% while over-prediction is unbounded, so it favours under-forecasts). |
| SMAPE | Symmetric-ish percent error. | Still ill-defined when both y and ŷ are small; bounded in [0, 2]. |
| RMSLE / log-MAE | Multiplicative errors (revenue, demand). | Undefined for negatives; under-prediction penalised more. |
| R²: 1 − SS_res/SS_tot | Variance explained, bounded above by 1 (but not comparable across datasets with different Var(y)). | Unbounded below; inflates on trended data. |
| Adjusted R² | Penalises noise features (added params). | Inherits R²'s trend problem. |

R²-on-time-series trap
A "predict yesterday's value" model gets high R² on a trending series because SS_tot is dominated by the long-run trend, not by the day-to-day deviations you should be predicting. Report R² on residuals after detrending, use MAE/RMSE directly, or score against a naïve-persistence baseline.
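The trap is easy to reproduce on a synthetic series. This sketch assumes a linear trend plus unit Gaussian noise (an invented example) and scores a naive persistence forecast two ways: on levels, and on day-to-day changes, where persistence amounts to predicting "no change".

```python
import random


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)
series = [t + rng.gauss(0.0, 1.0) for t in range(200)]  # trend + unit noise

# Persistence forecast: tomorrow equals today.
actual = series[1:]
persistence = series[:-1]
r2_levels = r2(actual, persistence)

# Same forecast scored on changes: persistence predicts a change of 0.
actual_change = [b - a for a, b in zip(series[:-1], series[1:])]
predicted_change = [0.0] * len(actual_change)
r2_changes = r2(actual_change, predicted_change)

print(f"R^2 on levels   = {r2_levels:.4f}")
print(f"R^2 on changes  = {r2_changes:.4f}")
print(f"MAE (either way) = {mae(actual, persistence):.3f}")
```

On levels the trend dominates SS_tot, so R² is close to 1. On changes a constant-zero forecast can never beat the mean, so R² is at most 0. MAE is identical in both views, which is one reason to report it directly.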

Class imbalance — when it actually hurts

Pop intuition: "imbalance breaks everything, resample." Reality is more nuanced. Most properly chosen metrics (AUC, PR-AUC, F1, log-loss) are robust to imbalance. The real problems are upstream:

  1. Algorithm. Trees split on impurity → 99/1 means majority wins by default; unregularized logistic regression drifts toward the prior. GBDTs with proper loss + shrinkage are robust; naive Bayes is not.
  2. Sample size. The deeper issue is the absolute count of positives, not the ratio. 100 positives total → no resampling rescues you. 10⁵ positives at 0.1% → fine.
  3. Threshold. A well-trained probabilistic model on imbalanced data is often correct — the default 0.5 cut is what's wrong. Fix is threshold tuning, not retraining.

The four fixes — and which one to actually use

| Family | Mechanism | When it helps | When it hurts |
|---|---|---|---|
| Resample | Oversample minority (SMOTE, Chawla et al. 2002, interpolates between minority neighbours) or undersample majority. | Algorithms that collapse on the minority class; quick-and-dirty. | Breaks calibration: probabilities reflect the resampled prior. SMOTE invents points that may not lie on the true manifold. Fix: post-hoc calibration (Platt or isotonic) or analytic prior correction: p_true = (p_resampled · π) / (p_resampled · π + (1 − p_resampled) · (1 − π) · (π'/(1 − π'))), where π is the true positive prior and π' is the resampled prior. |
| Reweight loss | class_weight='balanced'; weights inversely proportional to frequency. Data unchanged, per-example penalty changes. | Proper scoring rules; cleaner than resampling. | Doesn't rescue too-few-positives. Calibration still needs a check. |
| Threshold tune | Train on natural distribution; pick the threshold that minimizes cost_FP·FP + cost_FN·FN on validation (equivalently cost_FP·(1 − π)·FPR + cost_FN·π·FNR, with π the positive base rate) or hits a P/R target. | Almost always. No retrain; works with any calibrated model. | Can't recover information the model never learned. |
| Switch metric | Stop reporting accuracy. Use PR-AUC, F1, class-weighted log-loss. | Always; costs nothing. | |

The 2026 default
Train on natural distribution with a proper scoring rule (log-loss). Evaluate with PR-AUC or class-weighted F1. Pick the operating threshold that minimizes expected cost on validation. Resampling is a fallback for when the model itself collapses on the minority — usually an algorithm choice (raw trees, unregularized linear) you should reconsider before reaching for SMOTE.
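If resampling is used anyway, the analytic prior correction from the table can be written as a small function. This is a sketch of that formula only; `correct_prior` is a hypothetical name, and it assumes the model's output is calibrated with respect to the resampled distribution.

```python
def correct_prior(p_resampled, true_prior, resampled_prior):
    """Map a probability learned under the resampled prior back to the true prior.

    Implements p_true = p*pi / (p*pi + (1 - p)*(1 - pi) * pi'/(1 - pi')),
    with pi = true_prior and pi' = resampled_prior.
    """
    ratio = resampled_prior / (1.0 - resampled_prior)
    num = p_resampled * true_prior
    den = num + (1.0 - p_resampled) * (1.0 - true_prior) * ratio
    return num / den


# Model trained on 50/50 resampled data; true positive rate is 1%.
TRUE_PRIOR = 0.01
RESAMPLED_PRIOR = 0.5

for p in (0.1, 0.5, 0.9, 0.99):
    corrected = correct_prior(p, TRUE_PRIOR, RESAMPLED_PRIOR)
    print(f"resampled p = {p:.2f}  ->  corrected p = {corrected:.4f}")
```

Two sanity checks follow from the formula: a score equal to the resampled prior maps back to the true prior, and when the two priors are equal the correction is the identity.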

Cost-sensitive learning — the unifying frame

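When FP and FN costs differ (missing a tumor ≫ a false alarm), inject the cost matrix in one of three places:

  1. Data. Resample so the costly class is over-represented.
  2. Loss. Weight each example's loss by its misclassification cost (class weights are the simplest case).
  3. Decision. Train on the natural distribution, then threshold the calibrated probability at τ* = cost_FP / (cost_FP + cost_FN).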

Threshold-tuning a calibrated proper-scoring-rule model is almost always equivalent to a cost-sensitive loss — and simpler. Senior take: proper scoring rule → calibrated probabilities → encode cost asymmetry at decision time.
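A sketch of encoding cost asymmetry at decision time, in two flavours: the analytic threshold τ* = cost_FP / (cost_FP + cost_FN), which assumes calibrated probabilities, and an empirical search that minimizes cost_FP·FP + cost_FN·FN on validation data. The costs, the tiny validation set, and the function names are illustrative assumptions.

```python
def bayes_threshold(cost_fp, cost_fn):
    """tau* = cost_FP / (cost_FP + cost_FN); flag positive when p_hat >= tau*."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Empirical cost of thresholding at `threshold`: cost_FP*FP + cost_FN*FN."""
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def tune_threshold(y_true, probs, cost_fp, cost_fn):
    """Candidate threshold with the lowest empirical cost on validation data."""
    candidates = sorted(set(probs)) + [1.0 + 1e-9]  # last candidate flags nothing
    return min(candidates,
               key=lambda t: total_cost(y_true, probs, t, cost_fp, cost_fn))


# Illustrative validation set: a miss costs 9x a false alarm.
COST_FP, COST_FN = 1, 9
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.02, 0.05, 0.05, 0.08, 0.15, 0.2, 0.3, 0.4, 0.7, 0.9]

tau_star = bayes_threshold(COST_FP, COST_FN)
tau_tuned = tune_threshold(y_val, p_val, COST_FP, COST_FN)

for name, tau in (("default 0.5", 0.5), ("analytic", tau_star), ("tuned", tau_tuned)):
    cost = total_cost(y_val, p_val, tau, COST_FP, COST_FN)
    print(f"{name:>12}: threshold={tau:.2f}  cost={cost}")
```

On ten examples the tuned threshold fits the validation sample more closely than the analytic one, which is expected; with realistic validation sizes and a well-calibrated model the two should land close together, and a large gap is itself a hint that calibration needs checking.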

Trade-offs at a glance

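| Metric | Imbalance-robust | Calibration-sensitive | Threshold-free | Measures ranking | Uses absolute p̂ |
|---|---|---|---|---|---|
| Accuracy | No | No | No (0.5) | Indirect | No |
| Precision / Recall | Partial | No | No | Indirect | No |
| F1 | Better | No | No | Indirect | No |
| ROC-AUC | Invariant | No | Yes | Yes | No |
| PR-AUC | Yes (base-rate aware) | No | Yes | Yes | No |
| Log-loss | Yes (reweighted) | Yes | Yes | Yes | Yes |
| Brier | Yes | Yes | Yes | Yes | Yes |
| ECE | Yes | Yes (directly) | Yes | No | Yes |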

Interactive · the metric explorer

Synthetic binary classification. Set class imbalance, separability (AUC-equivalent), and decision threshold. Watch every metric move; the diagnosis below tells you what story each tells.

Try: 1% positives, AUC 0.85, threshold 0.5 — note the precision at the default cut. Slide threshold to 0.8 — same AUC, very different precision/recall trade. The headline "0.85" hides both operating points.

Interview prompts you should be ready for

  1. "Why is AUC misleading for highly imbalanced classes?" (Invariant to base rate — same AUC at 50% or 1%. PR-AUC is the informative single number when positives are rare.)
  2. "Walk through PR-AUC vs ROC-AUC. Pick which when?" (ROC summarises ranking; PR summarises the precision side, which users feel on rare-positive problems. Modest imbalance → ROC. Severe → PR.)
  3. "Your boss wants 'just one number.' What do you report?" (Calibrated probabilities downstream → log-loss. Ranking → ROC-AUC. Rare-positive alerting → PR-AUC or F1 at the deployed threshold. Push back: pair the number with base-rate and threshold.)
  4. "99% accuracy on a 99%-negative dataset. What's wrong?" (The trivial "always negative" baseline already hits 99%. Need precision/recall on positives, PR-AUC, or expected cost.)
  5. "SMOTE vs class weights vs threshold tuning — when each?" (Threshold tuning first; free, no retrain. Class weights when the algorithm collapses to majority. SMOTE last — changes the data distribution, must re-calibrate.)
  6. "Fraud model AUC 0.95 but team says it misses real fraud. Diagnose." (AUC averages over all thresholds. At the deployed cut, recall on high-value sub-segments may be low; high AUC may come from easy negatives. Check PR-AUC, segment-stratified recall, operating-point P/R.)
  7. "Why is log-loss the training loss but rarely the headline?" (Unbounded, base-rate-dependent, hard to communicate across datasets. AUC/PR-AUC are easier to compare. Log-loss belongs in every report — just not the single bullet.)
Takeaway
Pick the metric that matches the decision your model feeds. Accuracy almost never qualifies. ROC-AUC for ranking on balanced-ish data; PR-AUC for rare positives; log-loss/Brier when probabilities are used downstream; calibration as a separate axis. For imbalance: natural distribution + proper scoring rule + base-rate-aware metric + threshold-tune for cost. SMOTE is a fallback, not a default.