Calibration & uncertainty
A model predicts "80% confident". If those events happen 80% of the time, it's calibrated; if they happen 60% of the time, it's overconfident. Calibration is orthogonal to accuracy — and once you start using model probabilities (auctions, selective prediction, RAG retrieval), it's the metric that matters.
Confidence ≠ correctness
An accurate classifier gets the right answer. A calibrated classifier's predicted probabilities match the empirical frequencies. They're independent:
| Model | Accuracy | Calibration | Usable for |
|---|---|---|---|
| "Always 99% confident, randomly right" | 50% | terrible (overconfident) | Nothing |
| "Always 50% confident, randomly right" | 50% | perfect | Trivially right; uninformative |
| Deep net, post-softmax | 90% | overconfident (ECE ~0.1) | Argmax decisions only — not thresholding |
| Logistic regression on log-features | 85% | well-calibrated | Auctions, gating, abstention |
For tasks where you only need the argmax (image classification top-1), miscalibration is fine. For tasks that consume the probability — pricing, abstention, retrieval scoring, A/B test analysis, RL value estimates — miscalibration is a silent bug.
Reliability diagrams — the picture
Bin predictions by predicted probability (e.g. 10 bins of width 0.1). For each bin, plot:
- x-axis: average predicted probability in the bin.
- y-axis: empirical accuracy in the bin (fraction that were actually positive).
A perfectly calibrated model lies on the y=x line. Bins above the line = under-confident (the model said 60% but was right 80%). Bins below = over-confident (the model said 90% but was right 60%).
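The binning step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `reliability_bins` is a hypothetical helper name, the data is synthetic, and the binary "predicted probability of the positive class" setup is assumed.

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    """Return (mean predicted prob, empirical positive rate, count) per non-empty bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run 0..n_bins-1 and p == 1.0 lands in the last bin.
    bin_ids = np.digitize(probs, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            rows.append((float(probs[in_bin].mean()),
                         float(labels[in_bin].mean()),
                         int(in_bin.sum())))
    return rows

# Synthetic, deliberately over-confident predictions (an assumption for the demo):
# events happen less often than the predicted probability says.
rng = np.random.default_rng(0)
predicted = rng.uniform(0.5, 1.0, size=5_000)
outcomes = (rng.uniform(size=predicted.size) < 0.5 + 0.6 * (predicted - 0.5)).astype(float)

for conf, freq, count in reliability_bins(predicted, outcomes):
    side = "below the diagonal (over-confident)" if freq < conf else "on or above the diagonal"
    print(f"conf={conf:.2f}  freq={freq:.2f}  n={count:4d}  {side}")
```

Plotting the first two columns against each other gives the reliability diagram; the count column is worth keeping because sparsely populated bins are noisy.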
ECE — Expected Calibration Error
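The headline calibration metric. Take the weighted average of the absolute gap between accuracy and confidence across bins:
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{N} \, \bigl| \mathrm{acc}(B_b) - \mathrm{conf}(B_b) \bigr|$$
where B_b is the set of predictions in bin b, N is the total number of predictions, acc(B_b) is the empirical accuracy in the bin, conf(B_b) is the mean predicted probability in the bin. ECE = 0 is perfect; ECE = 0.5 is degenerate (confidence is off by 50 points on average). Typical deep nets pre-calibration: ECE = 0.05–0.15. Post-calibration: ECE < 0.02.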
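As a sketch of how that weighted average is computed in practice, assuming equal-width bins and the binary setup used above (predicted probability of the positive class against 0/1 labels). The function name is illustrative; for multi-class models the usual variant bins the top-label confidence against whether the top label was correct.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |acc(B_b) - conf(B_b)| over equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(probs, edges[1:-1])  # bin ids 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(labels[in_bin].mean() - probs[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the weight |B_b| / N
    return float(ece)

# Toy case: the model says 0.9 every time, but only 6 of 10 events happen.
says_ninety = np.full(10, 0.9)
happened = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(expected_calibration_error(says_ninety, happened))
```

In the toy case every prediction falls in one bin, so the ECE is just the single gap between 0.9 and 0.6.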
Brier score and the proper-scoring-rule view
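The Brier score for binary classification is just MSE on probabilities:
$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$$
where p_i is the predicted probability of the positive class and y_i ∈ {0, 1} is the label.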
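It decomposes neatly. One standard form is Murphy's decomposition over the bins B_b, with ȳ the overall positive rate (exact when all predictions in a bin share the same value):
$$\mathrm{BS} = \underbrace{\sum_{b} \frac{|B_b|}{N} \bigl(\mathrm{conf}(B_b) - \mathrm{acc}(B_b)\bigr)^2}_{\text{reliability}} \; - \; \underbrace{\sum_{b} \frac{|B_b|}{N} \bigl(\mathrm{acc}(B_b) - \bar{y}\bigr)^2}_{\text{resolution}} \; + \; \underbrace{\bar{y}\,(1 - \bar{y})}_{\text{uncertainty}}$$
Reliability is the calibration term (lower is better), resolution rewards sharpness (higher is better), and uncertainty depends only on the data.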
So a low Brier score requires both calibration and sharpness (informative — not just always predicting 0.5). It's a proper scoring rule: it's minimised in expectation by reporting the true posterior. So is log-loss (cross-entropy): -y log p - (1-y) log(1-p).
Reporting only ECE can hide a model that's calibrated but uninformative. Brier or log-loss penalises both.
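A small synthetic comparison makes the point concrete. The sketch below assumes a toy world where the true probability of each event is known; both predictors are calibrated, but only one is informative. Function names and data are illustrative.

```python
import numpy as np

def brier_score(probs, labels):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

def log_loss(probs, labels, eps=1e-12):
    # Clip so that log(0) never occurs for hard 0/1 predictions.
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Toy world: each event has its own true probability, drawn uniformly from [0, 1].
rng = np.random.default_rng(0)
true_p = rng.uniform(0.0, 1.0, size=50_000)
labels = (rng.uniform(size=true_p.size) < true_p).astype(float)

uninformative = np.full_like(true_p, labels.mean())  # calibrated, but says the base rate every time
informative = true_p                                 # calibrated and sharp

for name, preds in [("uninformative", uninformative), ("informative", informative)]:
    print(f"{name:14s} Brier={brier_score(preds, labels):.3f}  log-loss={log_loss(preds, labels):.3f}")
```

Both predictors would score close to zero on ECE, yet the proper scoring rules separate them clearly.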
Why deep nets are overconfident
Three contributing causes, all empirical:
- Training to cross-entropy on hard labels. The CE loss is minimised when the model puts all mass on the true label. On any example the model already gets right, more confidence always lowers the loss; the only penalty is for being confidently wrong. So once the training set is fit almost perfectly, pushing predictions from 90% to 99.9% keeps reducing training loss, even if held-out accuracy is only 90%. The implicit regularisation against overconfidence (limited capacity, finite data) wears off as models scale up.
- Capacity to memorise. A high-capacity model can drive its training loss to near-zero by becoming arbitrarily confident on training points. It then carries that confidence to test points it shouldn't be so sure about.
- The softmax temperature is fixed at 1. The logits' magnitude is determined by training; nothing constrains them to produce calibrated probabilities. Modern architectures have high logit norms → sharp softmaxes → overconfidence.
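A quick numeric sketch of the first and third causes, using made-up logits for a single three-class example: scaling the logits up sharpens the softmax and keeps lowering the cross-entropy when the top class is the true one, while the same scaling makes the loss worse when the top class is wrong.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target):
    return float(-np.log(softmax(logits)[target]))

logits = np.array([2.0, 1.0, 0.0])  # hypothetical logits; class 0 has the largest logit

for scale in (1, 2, 5, 10):
    scaled = scale * logits
    print(f"scale={scale:2d}  max prob={softmax(scaled).max():.4f}  "
          f"CE if class 0 is true={cross_entropy(scaled, 0):.4f}  "
          f"CE if class 1 is true={cross_entropy(scaled, 1):.4f}")
```

On a training set the model fits almost perfectly, nearly every example looks like the "class 0 is true" column, so gradient descent has every incentive to keep growing the logits.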
Empirically, modern LLMs (GPT-4, LLaMA) are roughly calibrated on multiple-choice tasks after pretraining — they predict 75% on questions they get right 75% of the time. Then RLHF breaks calibration.
RLHF breaks calibration
From the RL track: RLHF trains the policy to maximise human preference. This optimisation pressure pushes the model to be confident and assertive — humans prefer answers that sound certain. Post-RLHF, the model:
- Hedges less even when it should ("I'm not sure" disappears).
- Becomes more confident in wrong answers.
- ECE can rise dramatically: the GPT-4 technical report shows MMLU ECE going from ~0.007 (pre-trained) to ~0.07 (post-RLHF), roughly 10×.
The Anthropic / OpenAI mitigation is to explicitly train for calibration as a sub-objective (e.g., reward models that prefer hedging when uncertain), or post-hoc recalibrate. Hard problem; not fully solved.
Three calibration techniques
| Method | Train data | Form | When to use |
|---|---|---|---|
| Platt scaling | Validation set | Fit a logistic regression on (logit, label): p' = σ(a · logit + b) | Binary classification, simple monotone correction |
| Isotonic regression | Validation set | Fit a non-parametric monotone function from logits to calibrated probabilities | More flexible than Platt; needs more validation data |
| Temperature scaling (Guo et al. 2017) | Validation set | p' = softmax(logits / T); fit a single scalar T > 0 to minimise NLL (T > 1 for an overconfident net) | Multi-class deep nets: simplest, almost always works, doesn't change argmax |
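Temperature scaling is simple enough to sketch end to end. The version below is an illustration under stated assumptions: the validation set is synthetic (labels follow a softmax over "clean" logits, and the model reports logits three times too large), and T is found by a plain grid search over NLL instead of the gradient-based optimiser typically used in practice.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    log_probs = log_softmax(logits / temperature)
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fit_temperature(logits, labels, grid=None):
    """Grid search for the scalar T that minimises validation NLL."""
    if grid is None:
        grid = np.linspace(0.25, 10.0, 400)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set (an assumption for the demo).
rng = np.random.default_rng(0)
clean_logits = rng.normal(0.0, 1.0, size=(20_000, 4))
true_probs = np.exp(log_softmax(clean_logits))
u = rng.uniform(size=len(true_probs))
# Inverse-CDF sampling of one label per row; the minimum guards against round-off.
labels = np.minimum((true_probs.cumsum(axis=1) < u[:, None]).sum(axis=1), 3)
model_logits = 3.0 * clean_logits  # overconfident by construction

T = fit_temperature(model_logits, labels)
print(f"fitted T = {T:.2f}")
print(f"NLL before = {nll(model_logits, labels, 1.0):.4f}, after = {nll(model_logits, labels, T):.4f}")
```

Because every logit in a row is divided by the same positive T, the predicted class is unchanged; only the confidence attached to it moves.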
Uncertainty beyond calibration
Calibration says "if you say 80%, you're right 80% of the time". It does not distinguish:
- Aleatoric uncertainty — irreducible randomness in the data (the coin is genuinely 50/50, no model can do better).
- Epistemic uncertainty — reducible uncertainty from limited data or model capacity (with more training data, this shrinks).
Most uncertainty estimation techniques try to separate these. For interview purposes:
| Method | Captures | Cost |
|---|---|---|
| Softmax entropy | Total uncertainty (aleatoric + epistemic mixed) | Free |
| Deep ensembles | Both; epistemic is variance across ensemble members | N× train + N× inference |
| MC dropout | Approximation to Bayesian uncertainty | K× inference (K stochastic forward passes) |
| Last-layer Laplace | Bayesian over last layer only | Cheap (closed-form Gaussian on the last layer) |
| Conformal prediction | Distribution-free coverage guarantees | One calibration set; one extra threshold |
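For the deep-ensembles row, one common way to split the two kinds of uncertainty is entropy-based: the entropy of the averaged prediction is the total, the average of the members' entropies is the aleatoric part, and the difference (which grows when members disagree) is the epistemic part. This is a sketch of that decomposition, which is an alternative to the plain variance across members mentioned in the table; the member probabilities are invented for illustration.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p + eps)).sum(axis=axis)

def ensemble_uncertainty(member_probs):
    """member_probs has shape (n_members, n_classes) for a single input.

    Returns (total, aleatoric, epistemic) in nats.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy(member_probs).mean()     # average entropy of each member
    return float(total), float(aleatoric), float(total - aleatoric)

# Members agree that the outcome is a coin flip: all uncertainty is aleatoric.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Members are each confident but contradict each other: mostly epistemic.
disagree = [[0.99, 0.01], [0.01, 0.99]]

for name, probs in [("agree", agree), ("disagree", disagree)]:
    total, aleatoric, epistemic = ensemble_uncertainty(probs)
    print(f"{name:9s} total={total:.3f}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}")
```

Softmax entropy from a single model would report the same total for both cases, which is why it cannot separate the two sources.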
For LLMs specifically, "sampling temperature 0.7 and counting answer agreement" is a cheap empirical uncertainty estimate that correlates with correctness probability — used by self-consistency methods (Wang et al. 2022).
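The agreement count itself is a few lines. The sketch below assumes the sampled answers have already been collected as strings (no model call is shown, and the sample list is invented); `agreement_confidence` is a hypothetical helper name.

```python
from collections import Counter

def agreement_confidence(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Five hypothetical samples for the same question at temperature 0.7.
samples = ["42", "42", "41", "42 ", "7"]
answer, confidence = agreement_confidence(samples)
print(answer, confidence)
```

The agreement fraction is a raw score, not a calibrated probability; it can be binned and checked against accuracy like any other confidence.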
Selective prediction — abstain when unsure
If the model can output "I don't know", we trade coverage for accuracy. Two knobs:
- Threshold on the confidence. Below threshold → abstain.
- Reward shape. Right = +1, wrong = -X, abstain = 0. Pick X to encode the cost of wrong answers.
Calibration is what makes the threshold meaningful. An over-confident model rarely drops below the threshold, so it almost never abstains. A well-calibrated model abstains on the right fraction.
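The two knobs are linked. Assuming the confidence p is calibrated, answering has expected reward p − (1 − p)·X, which beats abstaining (reward 0) exactly when p > X / (1 + X). The sketch below turns that into a threshold and reports coverage and selective accuracy; the helper names and toy data are illustrative.

```python
import numpy as np

def abstain_threshold(wrong_penalty):
    """Confidence above which answering has positive expected reward
    (right = +1, wrong = -wrong_penalty, abstain = 0), assuming calibration."""
    return wrong_penalty / (1.0 + wrong_penalty)

def selective_report(confidence, correct, threshold):
    """Return (coverage, accuracy on the answered subset)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    answered = confidence >= threshold
    coverage = float(answered.mean())
    accuracy = float(correct[answered].mean()) if answered.any() else float("nan")
    return coverage, accuracy

confidence = [0.90, 0.80, 0.60, 0.55]
correct = [1, 1, 0, 1]

for penalty in (1.0, 3.0):
    threshold = abstain_threshold(penalty)
    coverage, accuracy = selective_report(confidence, correct, threshold)
    print(f"X={penalty}: threshold={threshold:.2f}  coverage={coverage:.2f}  accuracy={accuracy:.2f}")
```

A larger penalty X raises the threshold, so the model answers fewer questions and is right on more of the ones it does answer.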
The interview probes
- "A model has 90% accuracy and ECE = 0.15. Is it usable?" Depends on use case. For argmax decisions (which class?), yes. For probabilities (auction bid, abstention), no — temperature-scale first, ECE should be < 0.02 before you trust the probabilities.
- "Why does temperature scaling work?" Modern deep nets have high logit norms, so the softmax saturates. T > 1 softens the softmax, distributing probability mass more uniformly. It's a single-scalar correction, so it is extremely data-efficient to fit and carries very little risk of overfitting. The argmax is preserved because softmax(z/T)'s ranking is the same as softmax(z)'s.
- "Will training longer make the model better calibrated?" No — usually worse. Longer training pushes train loss to 0 → confident predictions on train → over-confident on test. Early stopping is one of the few sources of implicit calibration regularisation. The dominant effect of training-longer is on accuracy; calibration follows its own path.
- "How would you check calibration of an LLM that outputs only text?" Constrain the output: ask multiple-choice questions and read off P(A), P(B), P(C), P(D) from the token probabilities of "A", "B", etc. Bin the top-choice probability and compare to empirical accuracy. Published calibration evaluations from OpenAI and Anthropic follow this recipe.
- "Can you have ECE = 0 and be wrong half the time?" Yes. On a balanced binary task, predict 0.5 for every example. ECE = 0 (calibrated), accuracy = 50% (uninformative). This is why you report ECE and accuracy (or Brier) together.
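For the text-only LLM probe in the list above, the one mechanical step is turning the log-probabilities of the option tokens into a distribution over the options. A minimal sketch, assuming the log-probabilities for "A" to "D" have already been obtained from whatever API is in use (the numbers below are invented and no specific API is implied):

```python
import math

def option_probabilities(option_logprobs):
    """Renormalise token log-probs over the answer options so they sum to 1."""
    top = max(option_logprobs.values())
    weights = {k: math.exp(v - top) for k, v in option_logprobs.items()}  # shift for stability
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical log-probabilities of the next token being each option letter.
logprobs = {"A": -0.51, "B": -1.61, "C": -2.30, "D": -3.00}
probs = option_probabilities(logprobs)
choice = max(probs, key=probs.get)
print(choice, round(probs[choice], 3))
```

The confidence of the chosen option, paired with whether that option was correct, is what gets binned for the reliability diagram and ECE.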
Interview prompts you should be ready for
- "Derive that softmax with temperature preserves argmax." (softmax(z/T)_i = e^(z_i/T) / Σ e^(z_j/T). For T > 0, dividing all logits by T is a strictly monotone transformation. argmax(softmax(z/T)) = argmax(z/T) = argmax(z), unchanged.)
- "You're shipping a content moderator. It has 95% accuracy and ECE 0.20. False positive cost is high. Action?" (Calibrate first — temperature scaling on a labelled validation set. Once probabilities are reliable, choose a threshold so that P(positive) > threshold ↔ predicted positive, where threshold is tuned to your FP/FN cost. Without calibration, the threshold is meaningless.)
- "What's a proper scoring rule?" (A loss function for probabilistic predictions whose expectation is minimised by the true posterior. Log-loss, Brier, and spherical scores are proper. ECE is not a scoring rule at all: it is an aggregate over bins rather than a per-example loss, and an uninformative base-rate predictor also reaches ECE = 0. Training to a proper scoring rule incentivises calibrated predictions.)
- "Why is ECE evaluated over bins?" (Predictions are continuous; "P(positive) = 0.73 exactly" is too narrow to compute an empirical accuracy. Binning groups nearby predictions for statistical power. The downside: binning choice affects the metric — equal-width vs equal-mass bins can give different ECE values. Always report both or use Brier.)
- "Conformal prediction — when is it useful?" (When you need finite-sample coverage guarantees, e.g., "this prediction set contains the true label 95% of the time". Distribution-free; requires only an exchangeable calibration set. Used in medical and safety-critical settings where probability estimates need finite-sample backing, not asymptotic.)
- "Your LLM after RLHF is overconfident on math. What do you do?" (1. Recalibrate post-hoc with temperature scaling on a math eval set. 2. Add calibration to the RLHF reward (penalise overconfidence on incorrect answers). 3. Verify-and-edit — second-stage critic model that re-scores the LLM's confidence. 4. Tool use: delegate math to a calculator so the LLM never has to "know" the answer.)
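The conformal prediction prompt above is easier to answer with the recipe in hand. Below is a minimal sketch of split conformal prediction for classification, assuming the common "one minus the probability of the true class" nonconformity score and synthetic data in which labels really are drawn from the model's probabilities; function names are illustrative.

```python
import math
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Score threshold giving roughly (1 - alpha) coverage on exchangeable data."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity of the true label
    k = math.ceil((n + 1) * (1.0 - alpha))              # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return float(np.sort(scores)[k - 1])

def prediction_sets(probs, qhat):
    """Boolean mask of shape (n, n_classes): True where the label is in the set."""
    return (1.0 - probs) <= qhat

rng = np.random.default_rng(0)

def sample(n, n_classes=5):
    # Synthetic "model": random probabilities, with labels drawn from them.
    logits = rng.normal(0.0, 2.0, size=(n, n_classes))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    u = rng.uniform(size=n)
    labels = np.minimum((probs.cumsum(axis=1) < u[:, None]).sum(axis=1), n_classes - 1)
    return probs, labels

cal_probs, cal_labels = sample(2_000)
test_probs, test_labels = sample(20_000)

qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(test_probs, qhat)
coverage = sets[np.arange(len(test_labels)), test_labels].mean()
print(f"qhat={qhat:.3f}  coverage={coverage:.3f}  mean set size={sets.sum(axis=1).mean():.2f}")
```

The guarantee is marginal (averaged over inputs) and holds even if the underlying probabilities are miscalibrated; miscalibration shows up as larger prediction sets, not as lost coverage.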