Losses & calibration — why the loss decides whether the auction works

The architecture is settled (lesson 4). Now: what objective do you minimize, and does its output mean P(click) or just a ranking score? The answer is the difference between an ads stack that bids correctly and one that's wrong by 2×.

Pointwise BCE — the workhorse

The default loss for CTR/CVR prediction is binary cross-entropy on per-impression labels. For logit z = f(x) and label y ∈ {0,1}:

L_BCE(x, y) = −[ y · log σ(z) + (1 − y) · log(1 − σ(z)) ]

Why this and not something else? Because BCE is a proper scoring rule. The optimum of E_(x,y)[L_BCE] is achieved iff σ(f(x)) = P(y = 1 | x). The model is being asked to estimate the true posterior — calibration is the property BCE is solving for, not a separate goal you bolt on. Two practical consequences: (1) the output is directly usable as a probability — an auction can multiply it by a bid and get expected revenue with the right units; (2) shifting all logits by a constant changes the loss (σ is not shift-invariant), so the model is forced to learn the absolute level, not just the order.
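The proper-scoring-rule claim can be checked numerically. The sketch below is illustrative only: it assumes a single context with a made-up true click rate of 0.04, scans candidate predictions q, and confirms that expected BCE is minimized at q = 0.04 and rises when the optimal logit is shifted by a constant. The names `expected_bce`, `p_true`, and `grid` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_bce(q, p_true):
    """Expected BCE when the model predicts q and the true click rate is p_true."""
    return -(p_true * np.log(q) + (1.0 - p_true) * np.log(1.0 - q))

p_true = 0.04                               # assumed true P(click), illustrative only
grid = np.linspace(0.001, 0.999, 999)       # candidate predictions q, step 0.001
best_q = grid[np.argmin(expected_bce(grid, p_true))]

z_star = np.log(p_true / (1.0 - p_true))    # logit of the calibrated prediction
loss_at_optimum = expected_bce(sigmoid(z_star), p_true)
loss_shifted = expected_bce(sigmoid(z_star + 1.0), p_true)  # same order, wrong level

print(f"argmin q = {best_q:.3f}")
print(f"loss at optimum = {loss_at_optimum:.4f}, after +1 logit shift = {loss_shifted:.4f}")
```

The shifted model would rank items identically, yet BCE penalizes it; that penalty is what forces the absolute level to be learned.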

Pairwise losses — the order is all that matters

If you only ever rank items relative to each other within a request, you don't need probabilities. You can train on preference pairs: an observed positive x⁺ (clicked) and a negative x⁻ (skipped or sampled). With scores s⁺ = f(x⁺), s⁻ = f(x⁻):

L_BPR = −log σ(s⁺ − s⁻)

This is BPR (Rendle et al. 2009). RankNet (Burges et al. 2005, web search) is the same loss; BPR rediscovered it from the recsys community four years later.

The structural fact about pairwise losses: the loss depends only on the score difference. Add a million to every score and nothing changes. The output is a ranking signal, not a probability — it tells you "x⁺ beats x⁻" but emphatically does not tell you "x⁺ has CTR = 0.04". Post-hoc calibration is required if a downstream consumer wants probabilities.
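Shift invariance is easy to confirm on a single pair. The following is a minimal sketch with made-up scores; `bpr_loss` and `bce_loss` are hypothetical helper names, not functions from any library.

```python
import numpy as np

def bpr_loss(s_pos, s_neg):
    # -log sigmoid(d) = log(1 + exp(-d)), computed stably
    return np.logaddexp(0.0, -(s_pos - s_neg))

def bce_loss(z, y):
    # -[y log sigmoid(z) + (1 - y) log(1 - sigmoid(z))] = log(1 + e^z) - y*z
    return np.logaddexp(0.0, z) - y * z

s_pos, s_neg = 1.2, -0.3    # illustrative scores for a clicked and a skipped item
shift = 1e6                 # "add a million to every score"

pairwise_before = bpr_loss(s_pos, s_neg)
pairwise_after = bpr_loss(s_pos + shift, s_neg + shift)
pointwise_before = bce_loss(s_pos, 1.0)
pointwise_after = bce_loss(s_pos + shift, 1.0)

print(pairwise_before, pairwise_after)     # identical up to float rounding
print(pointwise_before, pointwise_after)   # BCE changes with the shift
```

The pairwise loss sees only s⁺ − s⁻, so nothing in training pins down where the scores sit on the logit axis.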

Listwise losses — when the metric is a property of the whole list

Pairwise loss treats all pairs as equally important. But a swap at positions (1, 2) usually moves the ranking metric (NDCG, MRR) far more than a swap at (97, 98). LambdaRank (Burges, Ragno, Le, NIPS 2006; the 2010 paper is the unified survey) — and its GBDT incarnation LambdaMART, the historical state of the art for web search — fixes this by weighting each pair's gradient by the change in NDCG the swap would produce:

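∂L / ∂s_i ∝ Σ_{j : i ≻ j} [ −σ(s_j − s_i) · |ΔNDCG_ij| ]

Here the sum runs over items j that are less relevant than i; pairs in which j is the more relevant item contribute the mirrored term with the opposite sign.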

LightGBM's lambdarank objective is essentially this. The other listwise family treats the list as a softmax:

L_sm = −log [ exp(s⁺) / Σ_j exp(s_j) ]

This is the classic matching/retrieval loss: every positive competes against the rest of the in-batch (or sampled) negatives; lesson 6 picks up the sampled-negatives side. ListNet and ListMLE go further by defining a probability distribution over permutations (Plackett-Luce) and fitting it by cross-entropy (ListNet) or maximum likelihood (ListMLE). They are theoretically clean but rarely used in production because of cost.
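To see why position-aware weighting matters, the |ΔNDCG| factor can be computed directly. This is a sketch under assumptions: a toy list of 100 items with alternating relevance, and the common (2^rel − 1) / log2(rank + 1) form of DCG. `dcg` and `delta_ndcg` are hypothetical helpers.

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain with the common (2^rel - 1) / log2(rank + 1) form."""
    ranks = np.arange(1, len(rels) + 1)
    return np.sum((2.0 ** rels - 1.0) / np.log2(ranks + 1.0))

def delta_ndcg(rels, i, j):
    """|delta NDCG| from swapping the items at 0-based positions i and j."""
    ideal = dcg(np.sort(rels)[::-1])
    swapped = rels.copy()
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(rels)) / ideal

# Toy list of 100 items with alternating relevance 1, 0, 1, 0, ...
rels = np.tile([1.0, 0.0], 50)

top_swap = delta_ndcg(rels, 0, 1)      # positions (1, 2)
tail_swap = delta_ndcg(rels, 96, 97)   # positions (97, 98)
print(f"|dNDCG| top = {top_swap:.5f}, tail = {tail_swap:.5f}")
```

Both swaps exchange a relevant and an irrelevant item, but the log discount makes the top swap worth orders of magnitude more, and LambdaRank scales the pair gradient accordingly.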

Pointwise vs pairwise vs listwise — the decision rule

	Pointwise BCE	Pairwise (BPR/RankNet)	Listwise (Lambda/softmax)
Output	Calibrated probability	Ranking score (shift-invariant)	Ranking score (shift-invariant)
Calibration	Native — proper scoring rule	None — post-hoc only	None — post-hoc only
Sample efficiency	OK; one example, one gradient	Better on hard pairs; pair-mining helps	Best when metric is rank-aware
Gradient variance	Low	Moderate; depends on pair sampler	Depends on listwise definition
Multi-task heads	Easy — each head BCE on its own label	Awkward — pair structure differs per task	Awkward — list structure differs per task
Inference shape	Per-item: probability	Per-item: score; rank-only	Often per-list scoring possible

Production reality at large platforms: pointwise BCE multi-task is the default. You get calibrated probabilities for every head (CTR, P(watch), P(convert | click)); a policy combines them. Listwise re-rankers may sit on top for inter-item effects — but the per-item scores feeding into them are still BCE-trained.

What "calibrated" actually means

A binary classifier p̂(x) is calibrated if, for every predicted probability level p:

P( y = 1 | p̂(x) ∈ [p, p + dp] ) ≈ p

Operationally: take all impressions where the model predicted between 0.29 and 0.31; compute the empirical click rate; if it's near 0.30, the model is calibrated in that bucket. Repeat for every bucket.

Reliability diagram — bucket predictions into K bins (10–20); plot empirical click rate vs mean predicted probability. Perfect calibration is the diagonal.
Expected Calibration Error — ECE = Σ_k (n_k / N) · |emp_k − pred_k|. Single-number summary.
Calibration ≠ accuracy. AUC is invariant to any monotone transform of the score; calibration cares about the absolute level. Independent dimensions.
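The reliability-diagram buckets and ECE translate into a few lines of NumPy. The sketch below assumes equal-width bins and synthetic data; `reliability_bins` and `ece` are hypothetical names, and the halved predictions are just one convenient way to manufacture miscalibration.

```python
import numpy as np

def reliability_bins(p_pred, y, n_bins=10):
    """Per-bin count, mean predicted probability, and empirical positive rate."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p_pred, edges[1:-1])            # bin index in 0 .. n_bins - 1
    counts = np.bincount(idx, minlength=n_bins)
    sum_pred = np.bincount(idx, weights=p_pred, minlength=n_bins)
    sum_y = np.bincount(idx, weights=y, minlength=n_bins)
    keep = counts > 0                                 # skip empty bins
    return counts[keep], sum_pred[keep] / counts[keep], sum_y[keep] / counts[keep]

def ece(p_pred, y, n_bins=10):
    """ECE = sum_k (n_k / N) * |emp_k - pred_k|."""
    counts, pred_k, emp_k = reliability_bins(p_pred, y, n_bins)
    return np.sum(counts / counts.sum() * np.abs(emp_k - pred_k))

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=200_000)               # calibrated by construction
y = (rng.uniform(size=p.size) < p).astype(float)

ece_calibrated = ece(p, y)
ece_halved = ece(p / 2.0, y)                          # every prediction is 2x too low
print(f"ECE calibrated = {ece_calibrated:.4f}, ECE halved = {ece_halved:.4f}")
```

Plotting the second and third arrays returned by `reliability_bins` against each other gives the reliability diagram.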

The cleanest way to internalize it

Take a perfectly calibrated model and pass its output through p → p². AUC is unchanged (the map is strictly increasing on [0, 1]). The reliability diagram, however, now bows away from the diagonal: a prediction of q corresponds to an empirical rate of √q. Every probability is under-estimated, so an auction multiplying bid × p will under-bid. The ranker still ranks correctly, but the system around it is broken.
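The thought experiment can be run on synthetic data. This is a sketch, assuming click probabilities drawn uniformly from [0, 0.2] and a simple rank-sum AUC that ignores ties; `auc` is a hypothetical helper, not a library call.

```python
import numpy as np

def auc(scores, y):
    """Rank-sum (Mann-Whitney) AUC; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.2, size=100_000)           # calibrated by construction
y = (rng.uniform(size=p.size) < p).astype(int)
p_squared = p ** 2                                # monotone distortion

auc_before = auc(p, y)
auc_after = auc(p_squared, y)
print(f"AUC before = {auc_before:.4f}, after = {auc_after:.4f}")
print(f"empirical rate = {y.mean():.4f}, mean p = {p.mean():.4f}, mean p^2 = {p_squared.mean():.4f}")
```

The two AUC values agree, while the mean of the squared predictions falls far below the empirical click rate.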

Why calibration matters for ads (and less for recsys)

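The quantity the ads auction ranks on is expected revenue per thousand impressions. For a bid expressed per conversion it is, roughly,

eCPM = bid × P(click) × P(conv | click) × 1000

For a per-click (CPC) bid the P(conv | click) factor drops out, leaving eCPM = bid × P(click) × 1000.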

If P(click) is systematically off by a factor of 2, eCPM is off by a factor of 2. The auction can pick the wrong winner whenever the error is not uniform across candidates; the second-price clearing rule charges the wrong CPC; pacing controllers (which use absolute CTR to bid-shade against a budget) misfire. Calibration is the difference between charging the right CPC and charging double or half, which is why production ads stacks routinely have a team owning the calibration layer alone.
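A small worked example of the arithmetic, with every number invented for illustration: a CPC ad whose CTR is over-predicted by 2× competes against a rival whose eCPM is assumed to be known exactly. `ecpm` is a hypothetical helper.

```python
def ecpm(bid, p_click, p_conv_given_click=1.0):
    """Expected revenue per 1000 impressions; leave p_conv_given_click at 1.0 for a CPC bid."""
    return bid * p_click * p_conv_given_click * 1000.0

cpc_bid = 1.00           # hypothetical advertiser bid, dollars per click
true_ctr = 0.02          # hypothetical true P(click)
predicted_ctr = 0.04     # the model over-predicts by 2x
competitor_ecpm = 30.0   # hypothetical rival whose eCPM is known exactly

true_value = ecpm(cpc_bid, true_ctr)          # what the impression is really worth
scored_value = ecpm(cpc_bid, predicted_ctr)   # what the auction believes

should_win = true_value > competitor_ecpm
does_win = scored_value > competitor_ecpm
print(f"true eCPM = {true_value:.2f}, scored eCPM = {scored_value:.2f}")
print(f"should win: {should_win}, does win: {does_win}")
```

With these numbers the over-predicted ad takes an impression that the rival should have won, even though the model's ranking among its own candidates is untouched.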

For a pure recsys feed, the only thing that matters downstream is the order. A miscalibrated model that ranks correctly still produces the correct feed, so recsys teams sometimes get away with pairwise/listwise losses where ads teams cannot. Said another way: ads stacks consume probabilities; recsys feeds consume permutations.

Why models miscalibrate in the first place

Sampled negatives without log-Q correction. If training replaces "all negatives" with a sample, the logits are biased by the log of the sampling probability (lesson 6).
Class imbalance + downsampled negatives. Production click rates are typically a few percent (1–3%), and negatives are routinely downsampled at training time. Keeping negatives at rate w inflates every logit by log(1/w); if that prior correction is omitted at serve, every prediction is too high.
Feature drift between train and serve. When a feature is missing/stale at serve, the model falls back to its training-prior estimate.
Adversarial logging (selection bias). Logged training data over-represents items the previous policy already ranked at the top. IPS-uncorrected training inherits this bias (lesson 7).
Distillation and pseudo-labels. A teacher's sharp predictions become hard labels for the student; the student inherits over-confidence.
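The downsampling offset has a closed-form fix. The sketch below assumes negatives are kept uniformly at random with rate `keep_rate` and uses the 50% training / 2% serving numbers that appear in the interview prompts of this lesson; the function name is hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def correct_for_negative_downsampling(p_sampled, keep_rate):
    """Map a probability learned on downsampled negatives back to the serving prior."""
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)

p_serve = 0.02                           # production positive rate
keep_rate = p_serve / (1.0 - p_serve)    # keep 1 in 49 negatives -> 50% positives in training
p_train = p_serve / (p_serve + (1.0 - p_serve) * keep_rate)

corrected = correct_for_negative_downsampling(p_train, keep_rate)
logit_shift = logit(p_train) - logit(p_serve)   # amount to subtract from the raw logit

print(f"training prior = {p_train:.2f}, corrected = {corrected:.4f}")
print(f"logit shift = {logit_shift:.3f}, log(1 / keep_rate) = {np.log(1.0 / keep_rate):.3f}")
```

The probability-space formula and the logit-space shift are the same correction: subtracting log(1 / keep_rate) from the logit is equivalent to applying `correct_for_negative_downsampling` to the probability.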

Post-hoc calibration techniques

Method	Form	Parameters	When it shines
Platt scaling	p_cal = σ(a · z + b)	Two scalars (slope, intercept)	Sigmoid-shaped miscalibration (affine in the logit); small held-out set. Handles a prior shift via b and over/underconfidence via the slope a.
Isotonic regression	Step function, monotone non-decreasing	Non-parametric (≈ K knots)	Monotone but non-sigmoid shapes (S-curves with bends). Needs more data so the steps don't overfit.
Temperature scaling	p_cal = σ(z / T)	One scalar	Pure over/underconfidence on deep nets. Doesn't move the prior — only sharpens or softens.
Beta calibration (Kull et al. 2017)	Three-parameter (a,b,c) logistic regression on (log p, log(1−p)) features	Three scalars	Sigmoid-output models where Platt is the wrong functional form.

A useful axis: what distortion can each method fix? Temperature only stretches the logit axis; it fixes over- or underconfidence but cannot translate the curve to fix a prior shift. Platt has a bias term, so it can. Isotonic can follow monotone bends that Platt's two parameters can't bend around, but being monotone itself it cannot reorder scores. The widget below makes this concrete.
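The temperature-versus-Platt distinction can also be checked on synthetic data. The following is a minimal sketch, not a production calibrator: it assumes true logits drawn from N(−3, 1), a model whose logits are shifted up by 2 (a pure prior shift), a grid search for T, and a few Newton steps for Platt's (a, b). All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(z, y):
    """Mean BCE on logits z, computed stably as log(1 + e^z) - y*z."""
    return np.mean(np.logaddexp(0.0, z) - y * z)

def fit_temperature(z, y, grid=np.geomspace(0.1, 10.0, 100)):
    """Grid-search the single scalar T in p = sigmoid(z / T)."""
    return grid[np.argmin([log_loss(z / t, y) for t in grid])]

def fit_platt(z, y, n_iter=25):
    """Newton's method for (a, b) in p = sigmoid(a * z + b)."""
    X = np.column_stack([z, np.ones_like(z)])
    w = np.array([1.0, 0.0])                      # start from the identity map
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        w = w - np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(0)
z_true = rng.normal(-3.0, 1.0, size=100_000)
y = (rng.uniform(size=z_true.size) < sigmoid(z_true)).astype(float)
z_model = z_true + 2.0                            # prior shift: every logit too high by 2

T = fit_temperature(z_model, y)
a, b = fit_platt(z_model, y)

loss_raw = log_loss(z_model, y)
loss_temp = log_loss(z_model / T, y)
loss_platt = log_loss(a * z_model + b, y)
print(f"T = {T:.2f}, a = {a:.2f}, b = {b:.2f}")
print(f"log loss raw = {loss_raw:.4f}, temperature = {loss_temp:.4f}, Platt = {loss_platt:.4f}")
```

Platt recovers a slope near 1 and an intercept near −2, undoing the shift; no value of T can do the same, because dividing by T leaves the point z = 0 pinned at probability 0.5.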

Interactive · the calibration explorer

Pick a failure mode and a calibrator. Watch the reliability diagram and ECE update. The point: different distortions need different fixes.

Reliability diagram & calibrators

10 bins on [0,1]. Plotted: empirical click rate vs mean predicted probability. The dotted diagonal is perfect calibration. Try: overconfident + temperature (fixes it). Prior-shift + temperature (does not fix it). Prior-shift + Platt (fixes it).

Interview prompts you should be ready for

"Why does an ads team train with BCE and a search-results team sometimes get away with pairwise loss?" (Probes: do you understand that ads consume probabilities (eCPM = bid × P(click)) while a feed only consumes order? Pairwise loss is shift-invariant; the model never learns the absolute level.)
"Your CTR model has AUC 0.85 but the auction is bidding wildly. Diagnose." (Probes: AUC is invariant to monotone transforms of the score; the absolute level can be arbitrarily off while AUC stays high. The fix is a reliability diagram + ECE + a post-hoc calibrator. Then check upstream causes: missing log-Q correction, prior shift, feature staleness.)
"Walk me through Platt scaling vs isotonic regression. When is each appropriate?" (Probes: parametric vs non-parametric, data requirements, the shape of distortions each can correct, and the failure modes: Platt assumes a sigmoid distortion, isotonic needs enough data per knot.)
"Your training data is 50% positive (downsampled), but production is 2% positive. What goes wrong if you don't correct?" (Probes: the model's predicted probabilities will sit near 0.5 on average. The auction multiplies these by bid and over-bids by ≈25×. The fix is the well-known log-odds correction: subtract log(p_train / (1 − p_train)) − log(p_serve / (1 − p_serve)) from the logit.)
"When would you reach for LambdaRank instead of pointwise BCE?" (Probes: GBDT-based ranker for web search where the metric is rank-aware (NDCG), no downstream consumer needs a probability, and the dataset is human-judged rather than click-logged. Don't reach for it in an ads stack.)
"Calibrated AUC vs uncalibrated AUC: same or different?" (Same, up to ties. AUC is invariant to any strictly increasing transform of the score; Platt scaling with a > 0 and temperature scaling are strictly increasing, so they cannot change AUC. Isotonic regression is only non-decreasing: it can merge neighbouring scores into ties, which may nudge AUC slightly. Junior candidates often expect calibration to move AUC materially.)

Takeaway

The loss decides what the output means. BCE makes the output a probability (calibration is native); pairwise/listwise make the output a ranking score (calibration must be added post-hoc). Ads consume probabilities, so the ads stack uses BCE plus an explicit calibration layer; recsys feeds only consume order and can sometimes skip both. When debugging a "good AUC but bad bidding" story, start on the calibration axis: the right calibrator depends on whether the distortion is over/underconfidence (temperature), prior shift (Platt), or a monotone but non-sigmoid bend (isotonic).