Calibration & uncertainty

A model predicts "80% confident". If those events happen 80% of the time, it's calibrated; if they happen 60% of the time, it's overconfident. Calibration is orthogonal to accuracy — and once you start using model probabilities (auctions, selective prediction, RAG retrieval), it's the metric that matters.

Confidence ≠ correctness

An accurate classifier gets the right answer. A calibrated classifier's predicted probabilities match the empirical frequencies. They're independent:

| Model | Accuracy | Calibration | Usable for |
|---|---|---|---|
| "Always 99% confident, randomly right" | 50% | terrible (overconfident) | Nothing |
| "Always 50% confident, randomly right" | 50% | perfect | Trivially right; uninformative |
| Deep net, post-softmax | 90% | overconfident (ECE ~0.1) | Argmax decisions only, not thresholding |
| Logistic regression on log-features | 85% | well-calibrated | Auctions, gating, abstention |

For tasks where you only need the argmax (image classification top-1), miscalibration is fine. For tasks that consume the probability — pricing, abstention, retrieval scoring, A/B test analysis, RL value estimates — miscalibration is a silent bug.

Reliability diagrams — the picture

Bin predictions by predicted probability (e.g. 10 bins of width 0.1). For each bin, plot the mean predicted probability (x-axis) against the empirical accuracy (y-axis).

A perfectly calibrated model lies on the y=x line. Bins above the line = under-confident (the model said 60% but was right 80%). Bins below = over-confident (the model said 90% but was right 60%).

[Figure: reliability diagram. x-axis: predicted probability (0.0 to 1.0); y-axis: accuracy (0.0 to 1.0). The diagonal y=x marks perfect calibration; the deep net's bins curve below y=x (overconfident).]

ECE — Expected Calibration Error

The headline calibration metric. Take the average absolute gap between accuracy and confidence across bins, weighting each bin by its share of the predictions:

ECE = Σ_b (|B_b|/N) · |acc(B_b) − conf(B_b)|

where B_b is the set of predictions in bin b, acc(B_b) is the empirical accuracy in the bin, and conf(B_b) is the mean predicted probability in the bin. ECE = 0 is perfect; ECE = 0.5 is degenerate (for example, a binary classifier that is always 100% confident but right only half the time). Typical deep nets pre-calibration: ECE = 0.05–0.15. Post-calibration: ECE < 0.02.
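
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `reliability_bins` and `expected_calibration_error` and the toy data are assumptions made for the example, and `confidences` is taken to be the top-class probability with `correct` a 0/1 flag per prediction.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty equal-width bin."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [confidence sum, hits, count]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 would index past the end, so clamp it.
        b = min(int(conf * n_bins), n_bins - 1)
        stats[b][0] += conf
        stats[b][1] += hit
        stats[b][2] += 1
    return [(s / n, h / n, n) for s, h, n in stats if n > 0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)|."""
    total = len(confidences)
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


# Toy data (made up for illustration):
# ten predictions at 0.9 confidence, six correct -> overconfident bin
# ten predictions at 0.6 confidence, six correct -> calibrated bin
confidences = [0.9] * 10 + [0.6] * 10
correct = [1] * 6 + [0] * 4 + [1] * 6 + [0] * 4

bins = reliability_bins(confidences, correct)
ece = expected_calibration_error(confidences, correct)
for conf, acc, n in bins:
    print(f"conf={conf:.2f}  acc={acc:.2f}  n={n}")
print(f"ECE = {ece:.3f}")
```

The per-bin tuples are exactly the points you would plot on a reliability diagram, so the same helper serves both the picture and the metric.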

ECE failure modes — they're real
ECE depends on the binning scheme. With equal-width bins (10 bins of width 0.1), if most predictions are concentrated in the high-probability bin, that single bin dominates the weighted sum while the sparsely populated bins contribute noisy accuracy estimates. Equal-mass bins (10 bins each with N/10 samples) are more robust. Always report both, or use a binning-free metric like Brier score (next section).
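
To see how the two binning schemes distribute samples, here is a rough sketch on synthetic data. Everything in it is an assumption for illustration: the helper names, the seed, and the synthetic model, which puts about 90% of its confidences in [0.9, 1.0] and is right 10 points less often than it claims.

```python
import random


def ece_from_bins(bins, total):
    """Weighted mean |accuracy - confidence| over non-empty bins of (conf, hit) pairs."""
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(hit for _, hit in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        ece += len(b) / total * abs(acc - conf)
    return ece


def equal_width_bins(pairs, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, hit in pairs:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, hit))
    return bins


def equal_mass_bins(pairs, n_bins=10):
    ordered = sorted(pairs)  # sort by confidence, then slice into equal-size chunks
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


rng = random.Random(0)
pairs = []
for _ in range(5000):
    # Synthetic overconfident model: true hit rate is confidence minus 0.1.
    conf = rng.uniform(0.9, 1.0) if rng.random() < 0.9 else rng.uniform(0.5, 0.9)
    hit = 1 if rng.random() < conf - 0.1 else 0
    pairs.append((conf, hit))

width_bins = equal_width_bins(pairs)
mass_bins = equal_mass_bins(pairs)
ece_width = ece_from_bins(width_bins, len(pairs))
ece_mass = ece_from_bins(mass_bins, len(pairs))

print("equal-width bin sizes:", [len(b) for b in width_bins])
print("equal-mass bin sizes: ", [len(b) for b in mass_bins])
print(f"equal-width ECE = {ece_width:.3f}")
print(f"equal-mass  ECE = {ece_mass:.3f}")
```

On this synthetic model the miscalibration is uniform, so both schemes land near the same value; the bin sizes are the interesting output, since they show how few samples the lower equal-width bins get to estimate accuracy from.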

Brier score and the proper-scoring-rule view

The Brier score for binary classification is just MSE on probabilities:

Brier = (1/N) Σ_i (p_i − y_i)²   — y_i ∈ {0, 1}

It decomposes neatly:

Brier = (reliability − resolution) + uncertainty = (calibration error) − (sharpness) + (intrinsic noise)

So a low Brier score requires both calibration and sharpness (informative — not just always predicting 0.5). It's a proper scoring rule: it's minimised in expectation by reporting the true posterior. So is log-loss (cross-entropy): -y log p - (1-y) log(1-p).

Reporting only ECE can hide a model that's calibrated but uninformative. Brier or log-loss penalises both.
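
A small worked example makes the point concrete. The sketch below compares two hand-built binary predictors that are both calibrated; the data, the function names, and the choice to bin on P(y = 1) rather than top-class confidence are assumptions for this illustration.

```python
import math


def brier(probs, labels):
    """Mean squared error between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def log_loss(probs, labels):
    """Mean binary cross-entropy: -y log p - (1 - y) log(1 - p)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(probs)


def binary_ece(probs, labels, n_bins=10):
    """Binned gap between predicted P(y=1) and the observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return sum(len(b) / len(probs)
               * abs(sum(y for _, y in b) / len(b) - sum(p for p, _ in b) / len(b))
               for b in bins if b)


# 100 examples, half positive. Both predictors below are calibrated by construction.
labels = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45
uninformative = [0.5] * 100           # always predicts the base rate
sharp = [0.9] * 50 + [0.1] * 50       # says 0.9 where 90% are positive, 0.1 where 10% are

for name, probs in [("uninformative", uninformative), ("sharp", sharp)]:
    print(f"{name:>13}: ECE={binary_ece(probs, labels):.3f}  "
          f"Brier={brier(probs, labels):.3f}  log-loss={log_loss(probs, labels):.3f}")
```

Both predictors score an ECE of essentially zero, but the Brier score and log-loss are clearly lower for the sharp one, which is the distinction ECE alone cannot make.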

Why deep nets are overconfident

Three contributing causes, all empirical:

  1. Training to cross-entropy on hard labels. The CE loss is minimised when the model puts all mass on the true label. On every training example the model already gets right, raising confidence from 90% to 99.9% still lowers the loss, so nothing in the objective penalises excess confidence on fitted points; the only penalty is for being wrong. The implicit regularisation against overconfidence (capacity, finite data) wears off as models scale up.
  2. Capacity to memorise. A high-capacity model can drive its training loss to near-zero by becoming arbitrarily confident on training points. It then carries that confidence to test points it shouldn't be so sure about.
  3. The softmax temperature is fixed at 1. The logits' magnitude is determined by training; nothing constrains them to produce calibrated probabilities. Modern architectures have high logit norms → sharp softmaxes → overconfidence.
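
Causes 1 and 3 can be seen together in a tiny numeric sketch. The logits and scale factors below are made up for illustration: scaling the logits of a correctly classified example never changes the argmax, but it keeps pushing the cross-entropy toward zero, so training has every incentive to grow the logit norm.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical example; class 0 is the true label
results = []
for scale in (1, 2, 5, 10):
    probs = softmax([scale * z for z in logits])
    loss = -math.log(probs[0])  # cross-entropy against the hard label
    results.append((scale, max(probs), loss))
    print(f"scale={scale:>2}  max prob={max(probs):.4f}  CE={loss:.5f}")
```

Each row has the same predicted class; only the confidence and the loss change.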

Empirically, modern LLMs (GPT-4, LLaMA) are roughly calibrated on multiple-choice tasks after pretraining — they predict 75% on questions they get right 75% of the time. Then RLHF breaks calibration.

RLHF breaks calibration

From the RL track: RLHF trains the policy to maximise human preference. This optimisation pressure pushes the model to be confident and assertive, because humans prefer answers that sound certain. Post-RLHF, the model's stated confidence no longer tracks its accuracy the way the pretrained model's did.

The Anthropic / OpenAI mitigation is to explicitly train for calibration as a sub-objective (e.g., reward models that prefer hedging when uncertain), or post-hoc recalibrate. Hard problem; not fully solved.

Three calibration techniques

| Method | Train data | Form | When to use |
|---|---|---|---|
| Platt scaling | Validation set | Fit a logistic regression on (logit, label): p' = σ(a · logit + b) | Binary classification, simple monotone correction |
| Isotonic regression | Validation set | Fit a non-parametric monotone function from logits to calibrated probabilities | More flexible than Platt; needs more validation data |
| Temperature scaling (Guo et al. 2017) | Validation set | p' = softmax(logits / T); fit a single scalar T > 0 to minimise NLL (T > 1 for an overconfident net) | Multi-class deep nets: simplest, almost always works, doesn't change argmax |

Why temperature scaling is the default
Three properties: (1) it's a single scalar, so the risk of overfitting the validation set is minimal; (2) it doesn't change the argmax, so accuracy is unchanged; (3) it's calibration-preserving: if the model is already calibrated, the fitted T converges to 1 in the limit of infinite validation data. For multi-class deep nets, T ≈ 1.5–2.5 typically. If T < 1, your model is under-confident (rare in modern nets).
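
Here is a minimal sketch of fitting T, under stated assumptions. The validation data is synthetic: labels are drawn from softmax(z) while the "model" reports logits 2z, so the temperature that undoes the overconfidence is T = 2. The golden-section search, the search bounds, and all function names are choices made for this example; in practice you would more likely fit T with a gradient-based optimiser in your training framework.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Mean negative log-likelihood of softmax(logits / T), via log-sum-exp."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        scaled = [z / T for z in row]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=10.0, iters=60):
    """Golden-section search for the scalar T that minimises validation NLL."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc, fd = nll(logit_rows, labels, c), nll(logit_rows, labels, d)
    for _ in range(iters):
        if fc < fd:   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = nll(logit_rows, labels, c)
        else:         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = nll(logit_rows, labels, d)
    return (a + b) / 2


rng = random.Random(0)
logit_rows, labels = [], []
for _ in range(2000):
    z = [rng.gauss(0.0, 1.5) for _ in range(3)]
    labels.append(rng.choices(range(3), weights=softmax(z))[0])
    logit_rows.append([2.0 * v for v in z])  # overconfident by a factor of 2

T = fit_temperature(logit_rows, labels)
same_argmax = all(
    max(range(3), key=lambda k: row[k]) == max(range(3), key=lambda k: row[k] / T)
    for row in logit_rows
)
print(f"fitted T = {T:.2f}")
print(f"NLL before = {nll(logit_rows, labels, 1.0):.4f}, after = {nll(logit_rows, labels, T):.4f}")
print("argmax unchanged:", same_argmax)
```

The fit touches only the validation logits and labels; the network's weights, and therefore its accuracy, stay exactly as they were.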

Uncertainty beyond calibration

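Calibration says "if you say 80%, you're right 80% of the time". It does not distinguish aleatoric uncertainty (irreducible noise in the data itself) from epistemic uncertainty (the model's own lack of knowledge, which more data could reduce).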

Most uncertainty estimation techniques try to separate these. For interview purposes:

| Method | Captures | Cost |
|---|---|---|
| Softmax entropy | Total uncertainty (aleatoric + epistemic mixed) | Free |
| Deep ensembles | Both; epistemic is variance across ensemble members | N× train + N× inference |
| MC dropout | Approximation to Bayesian uncertainty | K× inference (K stochastic forward passes) |
| Last-layer Laplace | Bayesian over last layer only | Cheap (closed-form Gaussian on the last layer) |
| Conformal prediction | Distribution-free coverage guarantees | One calibration set; one extra threshold |
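
To make the last row concrete, here is a rough sketch of split conformal prediction for classification, using the common nonconformity score 1 − p(true label). The synthetic model, seed, sample sizes, and function names are all assumptions for the example. The model is deliberately overconfident (labels follow softmax(z), the model reports softmax(3z)) to show that coverage does not rely on calibration.

```python
import math
import random


def softmax(logits, scale=1.0):
    m = max(logits)
    exps = [math.exp(scale * (z - m)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """The 'one extra threshold': a quantile of calibration scores 1 - p(true label)."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank, 1-indexed
    # If rank > n the exact threshold is infinite; clamping is a simplification.
    return scores[min(rank, n) - 1]


def prediction_set(probs, qhat):
    """All classes whose score 1 - p falls at or below the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


def sample(rng, n, n_classes=4):
    """Hypothetical overconfident model: labels ~ softmax(z), model reports softmax(3z)."""
    probs_out, labels = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_classes)]
        labels.append(rng.choices(range(n_classes), weights=softmax(z))[0])
        probs_out.append(softmax(z, scale=3.0))
    return probs_out, labels


rng = random.Random(0)
cal_probs, cal_labels = sample(rng, 1000)
test_probs, test_labels = sample(rng, 2000)

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
sets = [prediction_set(p, qhat) for p in test_probs]
coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(test_labels)
avg_size = sum(len(s) for s in sets) / len(sets)
print(f"threshold = {qhat:.3f}  coverage = {coverage:.3f}  avg set size = {avg_size:.2f}")
```

Coverage should sit close to the 90% target despite the miscalibration; the price of a poorly calibrated model shows up in the set sizes rather than in the guarantee.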

For LLMs specifically, "sampling temperature 0.7 and counting answer agreement" is a cheap empirical uncertainty estimate that correlates with correctness probability — used by self-consistency methods (Wang et al. 2022).
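
The agreement count itself is only a few lines. In the sketch below the sampled answers are hypothetical strings standing in for ten completions drawn at temperature 0.7, and `agreement_confidence` is a name chosen for this example; real pipelines also need an answer-extraction step that is not shown here.

```python
from collections import Counter


def agreement_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers extracted from ten sampled completions.
samples = ["42", "42", "41", "42 ", "42", "40", "42", "42", "41", "42"]
answer, confidence = agreement_confidence(samples)
print(answer, confidence)
```

The agreement fraction is an empirical score, not a probability, so it still needs to be checked against accuracy on held-out questions before you threshold on it.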

Selective prediction — abstain when unsure

If the model can output "I don't know", we trade coverage for accuracy. Two knobs:

  1. Threshold on the confidence. Below threshold → abstain.
  2. Reward shape. Right = +1, wrong = -X, abstain = 0. Pick X to encode the cost of wrong answers.

Calibration is what makes the threshold meaningful. An over-confident model rarely drops below the threshold, so it almost never abstains. A well-calibrated model abstains on the right fraction.
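
The two knobs are linked. With reward +1 for right, −X for wrong and 0 for abstaining, answering at confidence p has expected reward p − (1 − p)·X, which is positive exactly when p > X / (1 + X). The sketch below turns that into a threshold and measures coverage and accuracy; the confidences, outcomes, and function names are made up for illustration, and the rule is only as good as the calibration of p.

```python
def abstention_threshold(wrong_cost):
    """Answer only if p > X / (1 + X), where X is the cost of a wrong answer."""
    return wrong_cost / (1.0 + wrong_cost)


def selective_metrics(confidences, correct, threshold):
    """Coverage = fraction answered; accuracy is measured on answered examples only."""
    answered = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return coverage, accuracy


# Hypothetical model outputs: confidence and whether the answer was right.
confidences = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5]
correct = [1, 1, 1, 1, 0, 1, 0, 0]

for wrong_cost in (1, 3):
    t = abstention_threshold(wrong_cost)
    coverage, accuracy = selective_metrics(confidences, correct, t)
    print(f"X={wrong_cost}  threshold={t:.2f}  coverage={coverage:.2f}  accuracy={accuracy:.3f}")
```

Raising X moves the threshold up, which trades coverage for accuracy: one point on the coverage-accuracy curve per value of X.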

[Figure: coverage-accuracy curve. x-axis: coverage (0.5 to 1.0); y-axis: accuracy (0.5 to 1.0). Perfect model: accuracy stays at 1.0 up to full coverage. Real model: accuracy drops as coverage increases. Random baseline: accuracy 0.5.]

The interview probes

  1. "A model has 90% accuracy and ECE = 0.15. Is it usable?" Depends on use case. For argmax decisions (which class?), yes. For probabilities (auction bid, abstention), no — temperature-scale first, ECE should be < 0.02 before you trust the probabilities.
  2. "Why does temperature scaling work?" Modern deep nets have high logit norms, so the softmax saturates. T > 1 softens the softmax, spreading probability mass more uniformly across classes. Because it fits a single scalar, it needs very little validation data and carries minimal risk of overfitting. The argmax is preserved because softmax(z/T)'s ranking is the same as softmax(z)'s.
  3. "Will training longer make the model better calibrated?" No — usually worse. Longer training pushes train loss to 0 → confident predictions on train → over-confident on test. Early stopping is one of the few sources of implicit calibration regularisation. The dominant effect of training-longer is on accuracy; calibration follows its own path.
  4. "How would you check calibration of an LLM that outputs only text?" Constrain the output space: ask multiple-choice questions and measure the predicted P(A vs B vs C vs D) from the token probabilities of "A", "B", etc. Bin and compare to empirical accuracy. The OpenAI / Anthropic eval suites do exactly this.
  5. "Can you have ECE = 0 and be wrong half the time?" Yes. On a balanced binary task, predict 0.5 for every example. ECE = 0 (calibrated), accuracy = 50% (uninformative). This is why you report ECE and accuracy (or Brier) together.
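
For probe 4, the step from token log-probabilities to a confidence can be sketched as follows. The log-probabilities are hypothetical numbers, not output from any real model or API, and `option_probabilities` is a name invented for this example. Renormalising over just the option letters is one reasonable convention, since the model also puts some mass on tokens outside the answer set.

```python
import math


def option_probabilities(option_logprobs):
    """Renormalise next-token log-probabilities over the answer letters only."""
    m = max(option_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical log-probabilities of the tokens "A".."D" after the question prompt.
logprobs = {"A": -0.4, "B": -1.8, "C": -2.9, "D": -3.5}
probs = option_probabilities(logprobs)
prediction = max(probs, key=probs.get)
confidence = probs[prediction]
print(prediction, round(confidence, 3))
```

Collect one (confidence, correct) pair per question this way, then bin the pairs exactly as you would for any classifier.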

Interview prompts you should be ready for

  1. "Derive that softmax with temperature preserves argmax." (softmax(z/T)_i = e^(z_i/T) / Σ_j e^(z_j/T). For T > 0, dividing all logits by T is a strictly increasing transformation, and softmax preserves the ordering of its inputs because the denominator is shared across classes. So argmax(softmax(z/T)) = argmax(z/T) = argmax(z), unchanged.)
  2. "You're shipping a content moderator. It has 95% accuracy and ECE 0.20. False positive cost is high. Action?" (Calibrate first — temperature scaling on a labelled validation set. Once probabilities are reliable, choose a threshold so that P(positive) > threshold ↔ predicted positive, where threshold is tuned to your FP/FN cost. Without calibration, the threshold is meaningless.)
  3. "What's a proper scoring rule?" (A loss function for probabilistic predictions whose expectation is minimised by the true posterior. Log-loss, Brier, and spherical scores are proper. ECE is not: it is an aggregate binned metric rather than a per-example score, and a model that always predicts the base rate reaches ECE = 0 without reporting the true posterior. Training to a proper scoring rule incentivises calibrated predictions.)
  4. "Why is ECE evaluated over bins?" (Predictions are continuous; "P(positive) = 0.73 exactly" is too narrow to compute an empirical accuracy. Binning groups nearby predictions for statistical power. The downside: binning choice affects the metric — equal-width vs equal-mass bins can give different ECE values. Always report both or use Brier.)
  5. "Conformal prediction — when is it useful?" (When you need finite-sample coverage guarantees, e.g., "this prediction set contains the true label 95% of the time". Distribution-free; requires only an exchangeable calibration set. Used in medical and safety-critical settings where probability estimates need finite-sample backing, not asymptotic.)
  6. "Your LLM after RLHF is overconfident on math. What do you do?" (1. Recalibrate post-hoc with temperature scaling on a math eval set. 2. Add calibration to the RLHF reward (penalise overconfidence on incorrect answers). 3. Verify-and-edit — second-stage critic model that re-scores the LLM's confidence. 4. Tool use: delegate math to a calculator so the LLM never has to "know" the answer.)
Takeaway
Calibration is whether your probabilities match reality. It's orthogonal to accuracy. Deep nets are systematically overconfident; modern LLMs lose calibration after RLHF. Fix with temperature scaling (cheap, preserves argmax) or Platt/isotonic regression. Report ECE + Brier or log-loss to catch the "calibrated but uninformative" failure mode. For anything that consumes probabilities downstream (auctions, abstention, RAG scoring, A/B tests), calibration is the metric that matters.