Losses & calibration — why the loss decides whether the auction works
The architecture is settled (lesson 4). Now: what objective do you minimize, and does its output mean P(click) or just a ranking score? The answer is the difference between an ads stack that bids correctly and one that's wrong by 2×.
Pointwise BCE — the workhorse
The default loss for CTR/CVR prediction is binary cross-entropy on per-impression labels. For logit z = f(x) and label y ∈ {0,1}:

$$L_{\mathrm{BCE}}(z, y) = -\big[\, y \log \sigma(z) + (1 - y) \log\big(1 - \sigma(z)\big) \big]$$
Why this and not something else? Because BCE is a proper scoring rule. The optimum of E_(x,y)[L_BCE] is achieved iff σ(f(x)) = P(y = 1 | x). The model is being asked to estimate the true posterior: calibration is the property BCE is solving for, not a separate goal you bolt on. Two practical consequences: (1) the output is directly usable as a probability, so an auction can multiply it by a bid and get expected revenue with the right units; (2) shifting all logits by a constant changes the loss (σ is not shift-invariant), so the model is forced to learn the absolute level, not just the order.
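Both consequences can be checked numerically. The sketch below is illustrative only: the true CTR of 0.04 and the grid of candidate predictions are assumptions chosen for the demo. It searches for the prediction that minimizes expected BCE and then shows that a constant logit shift changes the loss.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce(z: float, y: int) -> float:
    """Binary cross-entropy for one example with logit z and label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def expected_bce(q: float, true_p: float) -> float:
    """Expected BCE when the model predicts q and clicks arrive with probability true_p."""
    return -(true_p * math.log(q) + (1.0 - true_p) * math.log(1.0 - q))

true_p = 0.04  # hypothetical true CTR for one context x
grid = [i / 1000 for i in range(1, 1000)]  # candidate predictions 0.001 .. 0.999

# Proper scoring rule: the minimizer of the expected loss is the true probability.
best_q = min(grid, key=lambda q: expected_bce(q, true_p))

# Not shift-invariant: moving the logit by a constant changes the loss.
loss_at_z = bce(-3.0, 1)
loss_shifted = bce(-3.0 + 2.0, 1)

print(best_q, loss_at_z, loss_shifted)
```

The grid search lands on the true rate because expected BCE is strictly convex in the prediction, with its minimum at q = true_p.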
Pairwise losses — the order is all that matters
If you only ever rank items relative to each other within a request, you don't need probabilities. You can train on preference pairs: an observed positive x⁺ (clicked) and a negative x⁻ (skipped or sampled). With scores s⁺ = f(x⁺), s⁻ = f(x⁻):

$$L_{\mathrm{BPR}} = -\log \sigma\big(s^{+} - s^{-}\big)$$
This is BPR (Rendle et al. 2009). RankNet (Burges et al. 2005, web search) is the same loss; BPR rediscovered it from the recsys community four years later.
The structural fact about pairwise losses: the loss depends only on the score difference. Add a million to every score and nothing changes. The output is a ranking signal, not a probability — it tells you "x⁺ beats x⁻" but emphatically does not tell you "x⁺ has CTR = 0.04". Post-hoc calibration is required if a downstream consumer wants probabilities.
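The shift invariance is easy to verify. The following is a minimal sketch of the pairwise logistic loss −log σ(s⁺ − s⁻); the function name `bpr_loss` and the example scores are illustrative, not taken from any library.

```python
import math

def bpr_loss(s_pos: float, s_neg: float) -> float:
    """Pairwise logistic loss -log sigmoid(s_pos - s_neg), in a numerically stable form."""
    d = s_pos - s_neg
    # -log sigmoid(d) = log(1 + exp(-d)), rewritten so exp never overflows.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

base = bpr_loss(1.2, 0.3)
# Add a million to both scores: only the difference enters the loss.
shifted = bpr_loss(1.2 + 1_000_000.0, 0.3 + 1_000_000.0)

print(base, shifted)
```

The two values agree up to floating-point rounding, which is why nothing in training pins the scores to an absolute level.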
Listwise losses — when the metric is a property of the whole list
Pairwise loss treats all pairs as equally important. But a swap at positions (1, 2) usually moves the ranking metric (NDCG, MRR) far more than a swap at (97, 98). LambdaRank (Burges, Ragno, Le, NIPS 2006; the 2010 paper is the unified survey) and its GBDT incarnation LambdaMART, the historical state of the art for web search, fix this by weighting each pair's gradient by the change in NDCG the swap would produce. For a pair in which item i is more relevant than item j:

$$\lambda_{ij} = -\,\sigma\big(-(s_i - s_j)\big)\,\big|\Delta \mathrm{NDCG}_{ij}\big|$$
LightGBM's lambdarank objective is essentially this. The other listwise family treats the list as a softmax: L_sm = −log [ exp(s⁺) / Σ_j exp(s_j) ]. This is the classic matching/retrieval loss: every positive competes against the rest of the in-batch (or sampled) negatives; lesson 6 picks up the sampled-negatives side. ListNet and ListMLE go further by defining a probability distribution over permutations (Plackett-Luce) and fitting it by cross-entropy (ListNet) or maximum likelihood (ListMLE); theoretically clean, rarely used in production because of cost.
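To see how uneven the |ΔNDCG| pair weights are, the sketch below compares a swap at positions (1, 2) with a swap at (97, 98). It assumes the common NDCG convention of gain 2^rel − 1 and a log2(position + 1) discount; the 100-item list and its relevance labels are made up for the demo.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: gain 2^rel - 1, discount log2(position + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def delta_ndcg(relevances, i, j):
    """Absolute change in NDCG if the items at 0-based positions i and j swap places."""
    ideal = dcg(sorted(relevances, reverse=True))
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevances)) / ideal

# Hypothetical list: one relevant item sits one slot too low near the top,
# another sits one slot too low near the bottom.
rels = [0] * 100
rels[1] = 1    # position 2
rels[97] = 1   # position 98

top_swap = delta_ndcg(rels, 0, 1)      # swap positions (1, 2)
tail_swap = delta_ndcg(rels, 96, 97)   # swap positions (97, 98)

print(top_swap, tail_swap, top_swap / tail_swap)
```

Under these assumptions the top swap carries a weight several orders of magnitude larger than the tail swap, so the gradient concentrates where the metric is sensitive.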
Pointwise vs pairwise vs listwise — the decision rule
| | Pointwise BCE | Pairwise (BPR/RankNet) | Listwise (Lambda/softmax) |
|---|---|---|---|
| Output | Calibrated probability | Ranking score (shift-invariant) | Ranking score (shift-invariant) |
| Calibration | Native — proper scoring rule | None — post-hoc only | None — post-hoc only |
| Sample efficiency | OK; one example, one gradient | Better on hard pairs; pair-mining helps | Best when metric is rank-aware |
| Gradient variance | Low | Moderate; depends on pair sampler | Depends on listwise definition |
| Multi-task heads | Easy — each head BCE on its own label | Awkward — pair structure differs per task | Awkward — list structure differs per task |
| Inference shape | Per-item: probability | Per-item: score; rank-only | Often per-list scoring possible |
Production reality at large platforms: pointwise BCE multi-task is the default. You get calibrated probabilities for every head (CTR, P(watch), P(convert | click)); a policy combines them. Listwise re-rankers may sit on top for inter-item effects — but the per-item scores feeding into them are still BCE-trained.
What "calibrated" actually means
A binary classifier p̂(x) is calibrated if, for every predicted probability level p:

$$P\big(y = 1 \mid \hat{p}(x) = p\big) = p$$
Operationally: take all impressions where the model predicted between 0.29 and 0.31; compute the empirical click rate; if it's near 0.30, the model is calibrated in that bucket. Repeat for every bucket.
- Reliability diagram — bucket predictions into K bins (10–20); plot empirical click rate vs mean predicted probability. Perfect calibration is the diagonal.
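- Expected Calibration Error: ECE = Σ_k (n_k / N) · |emp_k − pred_k|, where n_k is the number of examples in bin k, emp_k the bin's empirical click rate, and pred_k its mean predicted probability. Single-number summary.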
- Calibration ≠ accuracy. AUC is invariant to any monotone transform of the score; calibration cares about the absolute level. Independent dimensions.
Example: distort a calibrated model by mapping every prediction p → p². AUC is unchanged because the transform is monotone, but the reliability curve leaves the diagonal: every probability in (0, 1) is now under-estimated, so an auction multiplying bid × p will under-bid. The ranker still ranks correctly, but the system around it is broken.
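The p → p² distortion can be reproduced on synthetic data. This is a sketch under stated assumptions: true click probabilities are drawn uniformly from [0, 1), ECE uses ten equal-width bins, and the helper names `auc` and `ece` are illustrative rather than a library API.

```python
import random

def auc(scores, labels):
    """P(random positive outscores random negative), rank-sum form; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ece(preds, labels, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of preds, sum of labels
    for p, y in zip(preds, labels):
        k = min(int(p * n_bins), n_bins - 1)
        bins[k][0] += 1
        bins[k][1] += p
        bins[k][2] += y
    n = len(preds)
    return sum(c / n * abs(sy / c - sp / c) for c, sp, sy in bins if c > 0)

rng = random.Random(0)
true_p = [rng.random() for _ in range(20000)]             # synthetic click probabilities
labels = [1 if rng.random() < p else 0 for p in true_p]   # clicks drawn from them

calibrated = true_p                  # a perfectly calibrated model
squared = [p * p for p in true_p]    # same ordering, distorted by p -> p^2

auc_cal, auc_sq = auc(calibrated, labels), auc(squared, labels)
ece_cal, ece_sq = ece(calibrated, labels), ece(squared, labels)

print(auc_cal, auc_sq)   # the same value twice
print(ece_cal, ece_sq)   # small vs large
```

Because squaring preserves the ordering of scores, the two AUC values coincide, while ECE moves from sampling noise to a large systematic gap.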
Why calibration matters for ads (and less for recsys)
The ads auction's expected revenue per impression is, roughly,

$$\mathrm{eCPM} = \mathrm{bid} \times P(\mathrm{click})$$
If P(click) is off by a factor of 2 systematically, eCPM is off by a factor of 2. The auction picks the wrong winner; the second-price clearing rule charges the wrong CPC; pacing controllers (which use absolute CTR to bid-shade against a budget) misfire. Calibration is the difference between charging the right CPC and charging double or half — production ads stacks routinely have a team owning the calibration layer alone.
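A toy auction shows the wrong-winner effect. The two ads, their bids, and their click probabilities below are hypothetical, and the sketch assumes the 2× error hits only one candidate (for example, one traffic segment) while the other is predicted correctly.

```python
def expected_value(bid_per_click: float, p_click: float) -> float:
    """Expected revenue per impression for a cost-per-click bid (bid x P(click))."""
    return bid_per_click * p_click

# Hypothetical candidates: (CPC bid in dollars, true P(click)).
ads = {"A": (2.00, 0.010), "B": (1.00, 0.015)}

true_scores = {name: expected_value(bid, p) for name, (bid, p) in ads.items()}

# Assume the model over-predicts ad B's CTR by 2x and is correct for ad A.
predicted_scores = {
    "A": expected_value(2.00, 0.010),
    "B": expected_value(1.00, 0.015 * 2),
}

true_winner = max(true_scores, key=true_scores.get)
served_winner = max(predicted_scores, key=predicted_scores.get)

print(true_winner, served_winner)
```

With correct probabilities ad A has the higher expected value; with B's CTR doubled, B wins the slot even though it is worth less per impression.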
For a pure recsys feed, the only thing that matters downstream is the order. A miscalibrated model that ranks correctly still produces the correct feed, so recsys teams sometimes get away with pairwise/listwise losses where ads teams cannot. Said another way: ads stacks consume probabilities; recsys feeds consume permutations.
Why models miscalibrate in the first place
- Sampled negatives without log-Q correction. If training replaces "all negatives" with a sample, the logits are biased by the log of the sampling probability (lesson 6).
- Class imbalance + downsampled negatives. Production click rates are 1–3%, but negatives are routinely downsampled at training time. If the prior correction is omitted at serve, every prediction is offset.
- Feature drift between train and serve. When a feature is missing/stale at serve, the model falls back to its training-prior estimate.
- Adversarial logging (selection bias). Logged training data over-represents items the previous policy already ranked at the top. IPS-uncorrected training inherits this bias (lesson 7).
- Distillation and pseudo-labels. A teacher's sharp predictions become hard labels for the student; the student inherits over-confidence.
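For the downsampled-negatives case, the prior correction is a constant logit shift. The sketch below assumes negatives were kept independently at a known rate; `keep_rate` and the function name are hypothetical names chosen for the example.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def correct_for_negative_downsampling(p_train: float, keep_rate: float) -> float:
    """Map a probability learned on downsampled negatives back to the serving distribution.

    keep_rate is the fraction of negatives kept at training time (hypothetical name).
    Dropping negatives multiplies the odds by 1 / keep_rate, so the correction
    adds log(keep_rate) to the logit.
    """
    return sigmoid(logit(p_train) + math.log(keep_rate))

# Example: the serving CTR is 2% and negatives are downsampled until the
# training set is 50% positive.
serve_ctr = 0.02
keep_rate = serve_ctr / (1.0 - serve_ctr)  # about 0.0204
p_train = 0.5                              # what the model predicts on average
p_serve = correct_for_negative_downsampling(p_train, keep_rate)

print(p_serve)
```

In this example the corrected average prediction returns to the 2% serving rate; skipping the shift leaves every prediction inflated by the same log-odds offset.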
Post-hoc calibration techniques
| Method | Form | Parameters | When it shines |
|---|---|---|---|
| Platt scaling | p_cal = σ(a · z + b) | Two scalars (slope, intercept) | Sigmoid-shaped miscalibration; small held-out set. Handles a prior shift via b and slope via a. |
| Isotonic regression | Step function, monotone non-decreasing | Non-parametric (≈ K knots) | Monotone distortions that are not sigmoid-shaped (S-curves with bends). Needs more data so the steps don't overfit. |
| Temperature scaling | p_cal = σ(z / T) | One scalar | Pure over/underconfidence on deep nets. Doesn't move the prior; only sharpens or softens. |
| Beta calibration (Kull et al. 2017) | Three-parameter (a,b,c) logistic regression on (log p, log(1−p)) features | Three scalars | Sigmoid-output models where Platt is the wrong functional form. |
A useful axis: what distortion can each method fix? Temperature only stretches the logit axis: it fixes overconfidence but cannot translate the curve to fix a prior shift. Platt has a bias term, so it can. Isotonic can fit any monotone distortion, including bends that Platt's two parameters cannot follow, but because its output is non-decreasing in the score it cannot repair a reliability curve that is itself non-monotone. The widget below makes this concrete.
[Interactive widget: the calibration explorer. Pick a failure mode and a calibrator, then watch the reliability diagram and ECE update. The point: different distortions need different fixes.]
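One of these failure modes, a pure prior shift, can also be checked offline. The sketch below is a minimal illustration under stated assumptions: synthetic logits offset by +2, a hand-rolled Newton fit for Platt scaling, and a grid search for temperature. All function names are illustrative, and for brevity the calibrators are fit and scored on the same data, which a real pipeline would replace with a held-out set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_nll(logits, labels):
    """Mean binary cross-entropy from logits: softplus(z) - y * z, in stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(logits)

def fit_platt(logits, labels, n_iter=25):
    """Fit p = sigmoid(a * z + b) by Newton's method on the log-loss."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for z, y in zip(logits, labels):
            p = sigmoid(a * z + b)
            w = p * (1.0 - p)
            g_a += (p - y) * z
            g_b += p - y
            h_aa += w * z * z
            h_ab += w * z
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * step = g in closed form.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def fit_temperature(logits, labels):
    """Pick T from a grid (0.25 .. 10.0) to minimize the log-loss of sigmoid(z / T)."""
    grid = [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda t: mean_nll([z / t for z in logits], labels))

rng = random.Random(0)
true_logits = [rng.gauss(-2.0, 1.0) for _ in range(5000)]
labels = [1 if rng.random() < sigmoid(z) else 0 for z in true_logits]
raw_logits = [z + 2.0 for z in true_logits]  # model output with a pure prior shift

a, b = fit_platt(raw_logits, labels)
t = fit_temperature(raw_logits, labels)
nll_platt = mean_nll([a * z + b for z in raw_logits], labels)
nll_temp = mean_nll([z / t for z in raw_logits], labels)

print(a, b, t, nll_platt, nll_temp)
```

Platt's intercept absorbs the offset (a lands near 1 and b near −2 in this setup), while no temperature can move the curve sideways, so its log-loss stays well above Platt's.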
Interview prompts you should be ready for
- "Why does an ads team train with BCE and a search-results team sometimes get away with pairwise loss?" (Probes: do you understand that ads consume probabilities — eCPM = bid × P(click) — while a feed only consumes order? Pairwise loss is shift-invariant; the model never learns the absolute level.)
- "Your CTR model has AUC 0.85 but the auction is bidding wildly. Diagnose." (Probes: AUC is invariant to monotone transforms of the score; the absolute level can be arbitrarily off while AUC stays high. The fix is a reliability diagram + ECE + a post-hoc calibrator. Then check upstream causes: missing log-Q correction, prior shift, feature staleness.)
- "Walk me through Platt scaling vs isotonic regression. When is each appropriate?" (Probes: parametric vs non-parametric, data requirements, the shape of distortions each can correct, the failure modes — Platt assumes a sigmoid distortion, isotonic needs enough data per knot.)
- "Your training data is 50% positive (downsampled), but production is 2% positive. What goes wrong if you don't correct?" (Probes: the model's predicted probabilities will sit near 0.5 on average. The auction multiplies these by bid and over-bids by ≈25×. The fix is the well-known log-odds correction: subtract log(p_train / (1 − p_train)) − log(p_serve / (1 − p_serve)) from the logit, where p_train and p_serve are the positive rates at training and serving time.)
- "When would you reach for LambdaRank instead of pointwise BCE?" (Probes: GBDT-based ranker for web search where the metric is rank-aware (NDCG), no downstream consumer needs a probability, and the dataset is human-judged rather than click-logged. Don't reach for it in an ads stack.)
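- "Calibrated AUC vs uncalibrated AUC — same or different?" (Essentially the same. AUC is invariant to any strictly increasing transform of the score, which covers Platt scaling with a positive slope and temperature scaling. Isotonic regression is only non-decreasing: it can merge neighbouring scores into ties, which may move AUC slightly, but it never reorders items. Junior candidates often think calibration improves AUC.)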