Losses & calibration — why the loss decides whether the auction works

The architecture is settled (lesson 4). Now: what objective do you minimize, and does its output mean P(click) or just a ranking score? The answer is the difference between an ads stack that bids correctly and one that's wrong by 2×.

Pointwise BCE — the workhorse

The default loss for CTR/CVR prediction is binary cross-entropy on per-impression labels. For logit z = f(x) and label y ∈ {0,1}:

L_BCE(x, y) = −[ y · log σ(z) + (1 − y) · log(1 − σ(z)) ]

Why this and not something else? Because BCE is a proper scoring rule. The optimum of E_(x,y)[L_BCE] is achieved iff σ(f(x)) = P(y = 1 | x). The model is being asked to estimate the true posterior, so calibration is the property BCE is solving for, not a separate goal you bolt on. Two practical consequences: (1) the output is directly usable as a probability: an auction can multiply it by a bid and get expected revenue with the right units; (2) shifting all logits by a constant changes the loss (σ is not shift-invariant), so the model is forced to learn the absolute level, not just the order.
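The proper-scoring-rule claim can be checked numerically. The sketch below is illustrative only: the 4% click rate and the logit grid are assumptions, not values from the lesson. It scans logits for a single context and confirms that expected BCE bottoms out where σ(z) equals the true rate, and that shifting the optimal logit raises the loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bce(z, p_true):
    """Expected BCE at logit z when the label is Bernoulli(p_true)."""
    q = sigmoid(z)
    return -(p_true * math.log(q) + (1.0 - p_true) * math.log(1.0 - q))

p_true = 0.04  # hypothetical true click rate for one context x

# Scan logits on a grid (step 0.001) and keep the one with the lowest expected loss.
grid = [i / 1000.0 for i in range(-8000, 8001)]
best_z = min(grid, key=lambda z: expected_bce(z, p_true))
best_p = sigmoid(best_z)

# Shifting the optimal logit by a constant makes the loss strictly worse.
loss_at_opt = expected_bce(best_z, p_true)
loss_shifted = expected_bce(best_z + 1.0, p_true)

print(f"argmin probability ~ {best_p:.4f}, shifted loss is worse: {loss_shifted > loss_at_opt}")
```

The minimizing probability lands on the true rate up to grid resolution, which is the sense in which BCE "solves for" calibration.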

Pairwise losses — the order is all that matters

If you only ever rank items relative to each other within a request, you don't need probabilities. You can train on preference pairs: an observed positive x⁺ (clicked) and a negative x⁻ (skipped or sampled). With scores s⁺ = f(x⁺), s⁻ = f(x⁻):

L_BPR = −log σ(s⁺ − s⁻)

This is BPR (Rendle et al. 2009). RankNet (Burges et al. 2005, web search) is the same loss; BPR rediscovered it from the recsys community four years later.

The structural fact about pairwise losses: the loss depends only on the score difference. Add a million to every score and nothing changes. The output is a ranking signal, not a probability — it tells you "x⁺ beats x⁻" but emphatically does not tell you "x⁺ has CTR = 0.04". Post-hoc calibration is required if a downstream consumer wants probabilities.
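A quick way to see the shift-invariance is to evaluate both losses before and after adding a constant to the scores. This is a minimal sketch with made-up scores; the helper names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bpr_loss(s_pos, s_neg):
    # Depends on the scores only through their difference.
    return -log_sigmoid(s_pos - s_neg)

def bce_loss(z, y):
    # Depends on the absolute logit, so a shift changes it.
    return -(y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z))

s_pos, s_neg = 1.2, -0.3  # illustrative scores

bpr_before = bpr_loss(s_pos, s_neg)
bpr_after = bpr_loss(s_pos + 1_000_000.0, s_neg + 1_000_000.0)

bce_before = bce_loss(s_pos, 1)
bce_after = bce_loss(s_pos + 5.0, 1)

print(f"BPR: {bpr_before:.6f} -> {bpr_after:.6f}")
print(f"BCE: {bce_before:.6f} -> {bce_after:.6f}")
```

The BPR value is unchanged up to floating-point rounding, while the BCE value moves, which is why only the BCE-trained score carries an absolute level.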

Listwise losses — when the metric is a property of the whole list

Pairwise loss treats all pairs as equally important. But a swap at positions (1, 2) usually moves the ranking metric (NDCG, MRR) far more than a swap at (97, 98). LambdaRank (Burges, Ragno, Le, NIPS 2006; the 2010 paper is the unified survey) — and its GBDT incarnation LambdaMART, the historical state of the art for web search — fixes this by weighting each pair's gradient by the change in NDCG the swap would produce:

∂L / ∂s_i ∝ Σ_j [ −σ(s_j − s_i) · |ΔNDCG_ij| ]

LightGBM's lambdarank objective is essentially this. The other listwise family treats the list as a softmax: L_sm = −log [ exp(s⁺) / Σ_j exp(s_j) ]. This is the classic matching/retrieval loss: every positive competes against the rest of the in-batch (or sampled) negatives; lesson 6 picks up the sampled-negatives side. ListNet/ListMLE go further by defining a probability distribution over permutations and minimising KL. They are theoretically clean but rarely used in production because of cost.
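To make the |ΔNDCG| weighting concrete, the sketch below compares a swap at positions (1, 2) with a swap at positions (97, 98). It assumes the common gain 2^rel − 1 with a log2 position discount and a made-up list of 100 binary relevance labels; neither choice comes from the lesson.

```python
import math

def dcg(relevances):
    """DCG with gain 2^rel - 1 and a log2 discount; positions are 1-based."""
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(relevances, start=1))

def delta_ndcg(relevances, i, j):
    """|change in NDCG| from swapping the items at 0-based positions i and j."""
    ideal = dcg(sorted(relevances, reverse=True))
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevances)) / ideal

# Hypothetical ranked list: relevant (1) and irrelevant (0) items alternate.
rels = [1, 0] * 50

top = delta_ndcg(rels, 0, 1)     # swap at positions 1 and 2
tail = delta_ndcg(rels, 96, 97)  # swap at positions 97 and 98

print(f"top swap: {top:.5f}, tail swap: {tail:.7f}, ratio: {top / tail:.0f}x")
```

Under these assumptions the top-of-list swap carries a weight several orders of magnitude larger than the tail swap, which is exactly the emphasis a plain pairwise loss lacks.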

Pointwise vs pairwise vs listwise — the decision rule

| | Pointwise BCE | Pairwise (BPR/RankNet) | Listwise (Lambda/softmax) |
|---|---|---|---|
| Output | Calibrated probability | Ranking score (shift-invariant) | Ranking score (shift-invariant) |
| Calibration | Native (proper scoring rule) | None (post-hoc only) | None (post-hoc only) |
| Sample efficiency | OK; one example, one gradient | Better on hard pairs; pair-mining helps | Best when metric is rank-aware |
| Gradient variance | Low | Moderate; depends on pair sampler | Depends on listwise definition |
| Multi-task heads | Easy: each head BCE on its own label | Awkward: pair structure differs per task | Awkward: list structure differs per task |
| Inference shape | Per-item: probability | Per-item: score; rank-only | Often per-list scoring possible |

Production reality at large platforms: pointwise BCE multi-task is the default. You get calibrated probabilities for every head (CTR, P(watch), P(convert | click)); a policy combines them. Listwise re-rankers may sit on top for inter-item effects — but the per-item scores feeding into them are still BCE-trained.

What "calibrated" actually means

A binary classifier p̂(x) is calibrated if, for every predicted probability level p:

P( y = 1 | p̂(x) ∈ [p, p + dp] ) ≈ p

Operationally: take all impressions where the model predicted between 0.29 and 0.31; compute the empirical click rate; if it's near 0.30, the model is calibrated in that bucket. Repeat for every bucket.
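The bucket check and its aggregate, ECE, fit in a few lines. This is a sketch on simulated data: predictions are drawn uniformly and labels are sampled from them, so the "model" is calibrated by construction. The function names and the sample size are assumptions.

```python
import random

def bucket_rate(preds, labels, lo, hi):
    """Empirical positive rate among examples with lo <= prediction < hi."""
    hits = [y for p, y in zip(preds, labels) if lo <= p < hi]
    return sum(hits) / len(hits)

def ece(preds, labels, n_bins=10):
    """Expected calibration error over equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    for p, y in zip(preds, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        sum_p[b] += p
        sum_y[b] += y
    # n_b * |rate_b - mean_pred_b| equals |sum_y_b - sum_p_b| for each bin.
    gaps = (abs(sum_y[b] - sum_p[b]) for b in range(n_bins) if counts[b])
    return sum(gaps) / len(preds)

rng = random.Random(0)
preds = [rng.random() for _ in range(100_000)]
labels = [1 if rng.random() < p else 0 for p in preds]  # calibrated by construction

rate_030 = bucket_rate(preds, labels, 0.29, 0.31)
ece_calibrated = ece(preds, labels)

print(f"click rate in [0.29, 0.31): {rate_030:.3f}, ECE: {ece_calibrated:.4f}")
```

On real traffic the same two functions would run over logged predictions and click labels; the bucket rate near 0.30 and a small ECE are what "calibrated" looks like in practice.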

The cleanest way to internalize it
Take a perfectly calibrated model and pass its output through p → p². AUC is unchanged (the transform is strictly monotone on [0, 1]). The reliability diagram is no longer on the diagonal: every probability strictly between 0 and 1 is now under-estimated, so an auction multiplying bid × p will under-bid. The ranker still ranks correctly, but the system around it is broken.
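The thought experiment is easy to simulate. The sketch below uses synthetic data (uniform predictions, labels sampled from them) and a hand-rolled rank-based AUC; all names are hypothetical. It squares the predictions and compares AUC and the mean predicted rate before and after.

```python
import random

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

rng = random.Random(1)
p = [rng.random() for _ in range(20_000)]
y = [1 if rng.random() < pi else 0 for pi in p]  # p is calibrated by construction
p_sq = [pi * pi for pi in p]                     # the distorted score

auc_raw, auc_sq = auc(p, y), auc(p_sq, y)
click_rate = sum(y) / len(y)
mean_raw, mean_sq = sum(p) / len(p), sum(p_sq) / len(p_sq)

print(f"AUC raw {auc_raw:.4f} vs squared {auc_sq:.4f}")
print(f"click rate {click_rate:.3f}, mean p {mean_raw:.3f}, mean p^2 {mean_sq:.3f}")
```

The two AUC values coincide because squaring preserves the ordering, while the mean prediction drops well below the observed click rate.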

Why calibration matters for ads (and less for recsys)

The ads auction's expected revenue per impression is, roughly,

eCPM = bid × P(click) × P(conv | click) × 1000

If P(click) is off by a factor of 2 systematically, eCPM is off by a factor of 2. The auction picks the wrong winner; the second-price clearing rule charges the wrong CPC; pacing controllers (which use absolute CTR to bid-shade against a budget) misfire. Calibration is the difference between charging the right CPC and charging double or half — production ads stacks routinely have a team owning the calibration layer alone.
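A toy auction shows how the error propagates. The sketch below assumes per-conversion bids and two hypothetical ads, and it assumes the 2× over-prediction hits only ad B (for example, one traffic segment); none of the numbers come from the lesson.

```python
def ecpm(bid, p_click, p_conv_given_click):
    """Expected revenue per 1000 impressions for a per-conversion bid."""
    return bid * p_click * p_conv_given_click * 1000.0

# Hypothetical candidates: (bid per conversion, true P(click), true P(conv | click)).
ads = {"A": (20.0, 0.030, 0.10), "B": (20.0, 0.020, 0.10)}
true_ecpm = {name: ecpm(*params) for name, params in ads.items()}

# Assumption: the model over-predicts B's CTR by 2x and is correct for A.
predicted = dict(ads)
bid_b, p_click_b, p_conv_b = ads["B"]
predicted["B"] = (bid_b, 2.0 * p_click_b, p_conv_b)
pred_ecpm = {name: ecpm(*params) for name, params in predicted.items()}

true_winner = max(true_ecpm, key=true_ecpm.get)
pred_winner = max(pred_ecpm, key=pred_ecpm.get)

print(f"true winner: {true_winner}, winner under miscalibration: {pred_winner}")
```

The eCPM of B doubles with its CTR estimate, which is enough to hand it an auction that A should have won.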

For a pure recsys feed, the only thing that matters downstream is the order. A miscalibrated model that ranks correctly still produces the correct feed, so recsys teams sometimes get away with pairwise/listwise losses where ads teams cannot. Said another way: ads stacks consume probabilities; recsys feeds consume permutations.

Why models miscalibrate in the first place

Post-hoc calibration techniques

| Method | Form | Parameters | When it shines |
|---|---|---|---|
| Platt scaling | p_cal = σ(a · z + b) | Two scalars (slope, intercept) | Sigmoid-shaped (affine-in-logit) miscalibration; small held-out set. Handles a prior shift via b and slope via a. |
| Isotonic regression | Step function, monotone non-decreasing | Non-parametric (≈ K knots) | Monotone distortions that no single sigmoid fits (S-curves with bends). Needs more data so the steps don't overfit. |
| Temperature scaling | p_cal = σ(z / T) | One scalar | Pure over/underconfidence on deep nets. Doesn't move the prior: it only sharpens or softens. |
| Beta calibration (Kull et al. 2017) | Three-parameter (a, b, c) logistic regression on (log p, log(1 − p)) features | Three scalars | Sigmoid-output models where Platt is the wrong functional form. |
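For the isotonic row, the fit is usually computed with the pool-adjacent-violators (PAV) algorithm. What follows is a minimal unweighted sketch, not a production implementation; the per-bucket click rates are made up, and the input is assumed to be already sorted by model score.

```python
def pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [mean, size]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the newest block breaks monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted

# Hypothetical empirical click rates per score bucket, lowest score first.
observed = [0.01, 0.03, 0.02, 0.08, 0.06, 0.20]
calibrated = pav(observed)

print(calibrated)
```

Each out-of-order neighbouring pair is pooled into a flat step at its mean, so the output is a non-decreasing step function. The flat steps are also why isotonic needs enough data per step to avoid fitting noise.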

A useful axis: what distortion can each method fix? Temperature only stretches the logit axis: it fixes overconfidence but cannot translate the curve to fix a prior shift. Platt has a bias term, so it can. Isotonic can follow bends in the reliability curve that Platt's two parameters can't bend around, but only monotone ones: its fit is non-decreasing by construction, so it flattens any non-monotone stretch rather than correcting it. The widget below makes this concrete.
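The temperature-versus-Platt distinction can be reproduced without any data. The sketch below assumes a model whose logits are the true logits plus a constant (a pure prior shift of 2.0, chosen for illustration) and picks each calibrator's parameters by brute-force grid search on expected BCE. The grids and names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bce(p_true, p_hat):
    return -(p_true * math.log(p_hat) + (1.0 - p_true) * math.log(1.0 - p_hat))

# True logits on a grid, and a model that is off by a constant (prior shift).
true_logits = [-6.0 + 0.1 * i for i in range(81)]  # -6.0 .. 2.0
shift = 2.0                                        # hypothetical
model_logits = [t + shift for t in true_logits]
p_true = [sigmoid(t) for t in true_logits]

def mean_loss(calibrate):
    return sum(expected_bce(p, sigmoid(calibrate(z)))
               for p, z in zip(p_true, model_logits)) / len(p_true)

def mean_abs_err(calibrate):
    return sum(abs(sigmoid(calibrate(z)) - p)
               for p, z in zip(p_true, model_logits)) / len(p_true)

# Temperature scaling: one scalar T, p = sigmoid(z / T).
temps = [0.25 + 0.05 * i for i in range(196)]
best_T = min(temps, key=lambda T: mean_loss(lambda z: z / T))
err_temp = mean_abs_err(lambda z: z / best_T)

# Platt scaling: two scalars, p = sigmoid(a * z + b).
grid = [(0.5 + 0.05 * i, -4.0 + 0.1 * j) for i in range(21) for j in range(81)]
best_a, best_b = min(grid, key=lambda ab: mean_loss(lambda z: ab[0] * z + ab[1]))
err_platt = mean_abs_err(lambda z: best_a * z + best_b)

print(f"temperature: T={best_T:.2f}, mean |error|={err_temp:.3f}")
print(f"Platt: a={best_a:.2f}, b={best_b:.2f}, mean |error|={err_platt:.2e}")
```

Platt recovers a ≈ 1 and b ≈ −shift and drives the error to roughly zero. The best temperature still leaves a sizeable gap, because σ(z / T) always maps z = 0 to 0.5 and so cannot translate the curve.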

Interactive · the calibration explorer

Pick a failure mode and a calibrator. Watch the reliability diagram and ECE update. The point: different distortions need different fixes.

[Interactive widget: reliability diagram & calibrators. 10 bins on [0,1]; plots empirical click rate vs mean predicted probability, with the diagonal marking perfect calibration. Controls: failure mode, calibrator. Readouts: ECE (raw), ECE (after), reading.]
Try: overconfident + temperature (fixes it). Prior-shift + temperature (does not fix). Prior-shift + Platt (fixes it).

Interview prompts you should be ready for

  1. "Why does an ads team train with BCE and a search-results team sometimes get away with pairwise loss?" (Probes: do you understand that ads consume probabilities — eCPM = bid × P(click) — while a feed only consumes order? Pairwise loss is shift-invariant; the model never learns the absolute level.)
  2. "Your CTR model has AUC 0.85 but the auction is bidding wildly. Diagnose." (Probes: AUC is invariant to monotone transforms of the score; the absolute level can be arbitrarily off while AUC stays high. The fix is a reliability diagram + ECE + a post-hoc calibrator. Then check upstream causes: missing log-Q correction, prior shift, feature staleness.)
  3. "Walk me through Platt scaling vs isotonic regression. When is each appropriate?" (Probes: parametric vs non-parametric, data requirements, the shape of distortions each can correct, the failure modes — Platt assumes a sigmoid distortion, isotonic needs enough data per knot.)
  4. "Your training data is 50% positive (downsampled), but production is 2% positive. What goes wrong if you don't correct?" (Probes: the model's predicted probabilities will sit near 0.5 on average. The auction multiplies these by bid and over-bids by ≈25×. The fix is the well-known log-odds correction: subtract log(p_train / (1 − p_train)) − log(p_serve / (1 − p_serve)) from the logit.)
  5. "When would you reach for LambdaRank instead of pointwise BCE?" (Probes: GBDT-based ranker for web search where the metric is rank-aware (NDCG), no downstream consumer needs a probability, dataset is judged not click-logged. Don't reach for it in an ads stack.)
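  6. "Calibrated AUC vs uncalibrated AUC: same or different?" (Same, for any strictly monotone calibrator: AUC is invariant to strictly increasing transforms of the score, which covers Platt scaling with a > 0 and temperature scaling. Isotonic regression is only non-decreasing, so its flat steps create ties and can nudge AUC slightly. Either way, calibration does not improve the ranking. Junior candidates often think it does.)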
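The log-odds correction from prompt 4 can be worked through with the numbers in the question. This is a sketch under the usual assumption that downsampling changes only the class prior and not the feature distribution within each class; the helper names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def correct_logit(z, p_train, p_serve):
    """Subtract the gap between training and serving prior log-odds."""
    return z - (logit(p_train) - logit(p_serve))

p_train, p_serve = 0.50, 0.02  # base rates from the prompt

# A model trained on 50/50 data predicts about 0.5 on average, i.e. logit 0.
z_avg = logit(p_train)
p_uncorrected = sigmoid(z_avg)
p_corrected = sigmoid(correct_logit(z_avg, p_train, p_serve))
overbid_factor = p_uncorrected / p_serve

print(f"uncorrected {p_uncorrected:.3f}, corrected {p_corrected:.3f}, "
      f"over-bid factor {overbid_factor:.0f}x")
```

The correction is a constant offset of about 3.89 in logit space (the log-odds of 2%), which is the same kind of shift Platt's intercept b would learn from held-out serving data.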
Takeaway
The loss decides what the output means. BCE makes the output a probability (calibration is native); pairwise/listwise make the output a ranking score (calibration must be added post-hoc). Ads consume probabilities, so the ads stack uses BCE plus an explicit calibration layer; recsys feeds only consume order and can sometimes skip both. When debugging a "good AUC but bad bidding" story, the answer is almost always somewhere on the calibration axis, and the right calibrator depends on whether the distortion is over/underconfidence (temperature), prior-shift (Platt), or a monotone but non-sigmoid bend (isotonic).