Bias & debiasing — your logs are not the world

Position bias, selection bias, popularity loops. The most subtle topic in production ranking, and the one that separates seniors from juniors in interviews.

The fundamental problem, stated cleanly

You collect training data from production logs — rows like (user, item, position, clicked?) — and fit a model you call "P(click | user, item)". That label is a lie. What you actually learned is:

f(x) ≈ P(click | x, shown at the position the old ranker placed it).

What you wanted is P(click | x, shown at the top) — intrinsic relevance, independent of placement. The gap is bias, with three canonical manifestations every senior ranking MLE names on cue:

Bias	Mechanism	What the model wrongly learns
Position bias	Items at the top get more clicks because they are seen first, not because they are better.	"Top-position items are intrinsically relevant" — confounds position with quality.
Selection bias	The training set only contains items the old ranker chose to surface. Everything else has zero observations.	Items outside the historical action set look uniformly bad — there is no signal on them at all.
Popularity bias	Already-popular items get shown more → get clicked more → become more confident positives → get shown even more. A self-reinforcing loop across retrain cycles.	Collapse onto the head of the catalogue. Long-tail relevance vanishes.

Why this is the senior-vs-junior topic

Juniors describe the model. Seniors describe the data-generating process and the gap between it and the deployment distribution. Most "offline moved but online flat" failures are bias; most "the new model is worse than the old one by week 4" failures are feedback loops. Naming the mechanism is half the interview.

The examination hypothesis

The standard decomposition (Richardson et al. 2007; Joachims et al. 2005) factors a click into two latent events: did the user see the item, and given that they saw it, did they click?

P(click = 1 | u, i, pos) = P(E = 1 | pos) · P(R = 1 | u, i).

Figure. Left: position 1 is examined by nearly every user; position 10 is examined ~4% of the time. Right: under IPS, the rare click at a low position carries a weight of 1 / P(E | pos), about 25× at a 4% examination rate, so the few surviving clicks dominate the loss.

This holds under the assumption that examination depends only on position, and relevance only on (u, i).

The first factor depends only on position, not on the item. The second is the unbiased relevance we want. If we knew P(examine | pos), we could divide it out of every training label and recover unbiased signal.

Key assumption: examination is independent of relevance given position. Sometimes false — a highly relevant item near the top can attract the eye down the page; a satisfied click can terminate examination. The richer click models below relax this.
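The factorization above can be sketched in a few lines. This is a minimal illustration, assuming a PBM-style examination curve that is already known; the values in `P_EXAMINE` and the two relevance numbers are made up for the example, not taken from any real system.

```python
# Hypothetical examination curve P(E = 1 | pos); values are illustrative only.
P_EXAMINE = {1: 1.0, 2: 0.6, 5: 0.2, 10: 0.04}

def click_prob(relevance: float, pos: int) -> float:
    """Examination hypothesis: P(click) = P(E = 1 | pos) * P(R = 1 | u, i)."""
    return P_EXAMINE[pos] * relevance

def debiased_relevance(observed_ctr: float, pos: int) -> float:
    """Divide the examination probability back out of a logged CTR."""
    return observed_ctr / P_EXAMINE[pos]

# Item B at position 5 logs a lower CTR than item A at position 1 ...
ctr_a = click_prob(relevance=0.30, pos=1)
ctr_b = click_prob(relevance=0.50, pos=5)

# ... yet B is the more relevant item once position is divided out.
rel_a = debiased_relevance(ctr_a, pos=1)
rel_b = debiased_relevance(ctr_b, pos=5)
print(f"logged CTR: A={ctr_a:.2f} B={ctr_b:.2f}; debiased: A={rel_a:.2f} B={rel_b:.2f}")
```

The division is only as good as the examination curve: if the independence assumption fails, the recovered number is no longer pure relevance.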

Click models — the spectrum

The click model you assume is the debiasing recipe — different assumptions produce different corrections.

Model	Assumption	Strength / weakness
PBM (position-based)	Examination depends only on position. One global examination curve P(examine \| pos).	Cleanest, easiest to estimate, easiest to divide out. Ignores cascades and click context.
Cascade (Craswell et al. 2008)	User scans top-to-bottom, clicks the first relevant item, then stops examining.	Captures the "satisfied click" termination. Fails on multi-click sessions.
DBN (Chapelle & Zhang 2009)	Cascade + a per-item satisfaction probability; user keeps scanning unless satisfied.	Handles multi-click. More parameters, harder to fit.
UBM (user browsing)	Examination depends on position and the distance to the previous click.	Captures "I clicked something good, I'll scroll past the next few." Heavier inference.

For most production systems PBM is enough — the dominant bias is "top positions get clicked more," and PBM corrects exactly that. The others matter when you have rich multi-click sessions and care about the tail.
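To make the difference between two of these assumptions concrete, here is a small sketch of the click probabilities that PBM and the cascade model predict for the same ranked list. The function names and the example numbers are assumptions for illustration; DBN and UBM are not covered.

```python
from typing import List

def pbm_click_probs(relevance: List[float], examine: List[float]) -> List[float]:
    """PBM: each slot is examined independently of what sits above it."""
    return [e * r for e, r in zip(examine, relevance)]

def cascade_click_probs(relevance: List[float]) -> List[float]:
    """Cascade: scan top-down, click the first relevant item, then stop.

    P(click at rank k) = r_k * prod_{j<k} (1 - r_j).
    """
    probs = []
    p_reach = 1.0  # probability the user is still scanning at this rank
    for r in relevance:
        probs.append(p_reach * r)
        p_reach *= 1.0 - r
    return probs

relevance = [0.5, 0.5, 0.5]                       # three equally relevant items
pbm = pbm_click_probs(relevance, examine=[1.0, 0.6, 0.3])
cascade = cascade_click_probs(relevance)
print("PBM:", pbm)
print("Cascade:", cascade)
```

Under the cascade model the per-rank click probabilities sum to at most 1, which is one way to see why it cannot describe multi-click sessions.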

Estimating position bias — the hard part

You can't debias without an estimate of P(examine | pos). Three approaches, in increasing order of "doesn't hurt the user":

Method	How	Cost
Result randomization (Joachims 2005)	On a slice of traffic, randomly permute positions. CTR of a fixed item averaged over random positions reveals the examination curve.	Hurts users. Usually run on ~1% of traffic, often only on a sub-slice of slots.
EM / joint estimation (Wang et al. 2018; Guo et al. 2019 ("PAL"))	Treat examination and relevance as latents; fit both from logs via EM or a two-tower model where one tower depends only on position.	No traffic cost; identifiability requires assumptions (same item appearing at multiple positions across sessions).
Intervention harvesting (Agarwal et al. 2019)	Find natural randomization already in logs — A/B tests, ranker reshuffles, latency-induced ordering noise — as a quasi-experiment.	Cheap but data-limited; only where you happen to have natural variation.
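A minimal sketch of the result-randomization estimate, assuming the rows come from a traffic slice where items were placed at random so that every position sees the same item mix in expectation. The function name and the toy log are hypothetical. Note that the curve is only identified up to scale, so this sketch normalizes by the top position, which amounts to assuming P(examine | pos = 1) = 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def examination_curve(rows: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Estimate relative P(examine | pos) from randomized (position, clicked) rows.

    With random placement, CTR per position is proportional to examination,
    so dividing by the top position's CTR gives a relative curve.
    Assumes the top position has at least one click.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for pos, clicked in rows:
        shown[pos] += 1
        clicks[pos] += clicked
    ctr = {pos: clicks[pos] / shown[pos] for pos in shown}
    top = ctr[min(ctr)]  # CTR at the lowest-numbered position
    return {pos: ctr[pos] / top for pos in sorted(ctr)}

# Tiny hand-built log: 10 randomized impressions per position.
rows = ([(1, 1)] * 4 + [(1, 0)] * 6 +
        [(2, 1)] * 2 + [(2, 0)] * 8 +
        [(3, 1)] * 1 + [(3, 0)] * 9)
curve = examination_curve(rows)
print(curve)
```

In practice the slice is small, so the tail of the curve is noisy; smoothing or a parametric form is a common follow-up, though that is outside this sketch.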

IPS — Inverse Propensity Scoring

Once you have an examination curve, the unbiased estimator is to reweight each training example by the inverse of its examination probability:

L_IPS = Σ_n [click_n / P(examine | pos_n)] · ℓ(model(x_n), label_n).

A click at position 1 (examination prob ≈ 1) gets weight ≈ 1. A click at position 20 (examination prob ≈ 0.05) gets weight ≈ 20. The rare click that survives the position handicap is "worth more" as signal. The estimator is unbiased but high-variance: a few low-position clicks dominate the loss.

Weight clipping. Cap weights at some max M. Reintroduces bias but kills variance. M is a bias-variance knob, tuned empirically.
Doubly robust (Dudik et al. 2011). Combine IPS with a regression-based imputation of the counterfactual reward. Consistent if either the propensity model or the reward model is correct — weaker than requiring both.
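The IPS loss and the clipping knob can be sketched as follows. This assumes the examination probabilities are already estimated and uses log loss on clicked rows as the per-example loss ℓ; the function names and the clip value are illustrative choices, and the doubly robust variant is not shown.

```python
import math
from typing import List, Sequence

def ips_weights(p_examine: Sequence[float], clip: float = float("inf")) -> List[float]:
    """Inverse-propensity weights 1 / P(examine | pos), capped at `clip`."""
    return [min(1.0 / p, clip) for p in p_examine]

def ips_log_loss(scores: Sequence[float], clicks: Sequence[int],
                 p_examine: Sequence[float], clip: float = float("inf")) -> float:
    """IPS-weighted log loss. Only clicked rows contribute, as in L_IPS above.

    scores: model outputs in (0, 1), one per logged row.
    """
    weights = ips_weights(p_examine, clip)
    return sum(c * w * -math.log(s) for s, c, w in zip(scores, clicks, weights))

# One click at the top (examined almost surely) and one deep in the list.
p_examine = [1.0, 0.05]
scores, clicks = [0.5, 0.5], [1, 1]

unclipped = ips_log_loss(scores, clicks, p_examine)
clipped = ips_log_loss(scores, clicks, p_examine, clip=10.0)
print(ips_weights(p_examine), ips_weights(p_examine, clip=10.0))
print(unclipped, clipped)
```

With the cap at 10, the deep click still counts ten times as much as the top click, but it no longer dominates the loss by a factor of twenty. That is the bias traded for variance.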

The position-feature trick

A cheap and effective alternative to IPS, used by Google (Wang et al. 2016) and most ad systems:

At training time, concatenate the position the item was shown at as an extra feature.
At inference time, set that feature to a constant (e.g., position 1, or 0).

The model absorbs the position-correlated variance into the position feature; the rest of the network becomes (approximately) position-invariant. At serving, with the feature pinned, the score reflects intrinsic relevance.

The catch

This only works if position is a shallow leaf input the model can't combine with other features. Let position interact with item embedding deep in the network and the model encodes "this kind of item at this position is good" — exactly the bias you wanted to remove. Production implementations gate position into a side-tower with no cross-feature interactions.
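A sketch of the side-tower structure described above, with position entering as an additive offset and nothing else. The parameter values, feature vectors, and function names are hypothetical; in a real model both parts would be learned jointly, and the main tower would be a network rather than a dot product.

```python
from typing import Dict, List

# Hypothetical learned parameters, shown as constants for illustration.
POSITION_BIAS: Dict[int, float] = {1: 0.0, 2: -0.7, 3: -1.4}  # side-tower output per slot
PINNED_POSITION = 1                                           # constant used at serving

def relevance_logit(features: List[float], weights: List[float]) -> float:
    """Main tower: sees user/item features only, never the position."""
    return sum(f * w for f, w in zip(features, weights))

def train_logit(features: List[float], weights: List[float], shown_position: int) -> float:
    """Training: position is added as an offset, with no cross-feature interaction."""
    return relevance_logit(features, weights) + POSITION_BIAS[shown_position]

def serve_logit(features: List[float], weights: List[float]) -> float:
    """Serving: position is pinned, so every candidate gets the same offset."""
    return relevance_logit(features, weights) + POSITION_BIAS[PINNED_POSITION]

weights = [1.0, 2.0]
item_a, item_b = [0.5, 0.1], [0.2, 0.4]

# Same item, different logged positions: the training logit differs by the offset only.
print(train_logit(item_a, weights, 1), train_logit(item_a, weights, 3))
# At serving the offset is shared, so ranking depends on the main tower alone.
print(serve_logit(item_a, weights), serve_logit(item_b, weights))
```

Because the offset is the same for every candidate at serving time, the pinned value does not change the ordering. That property is lost as soon as position is allowed to interact with item features.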

Trade-offs: how do you actually pick one?

Method	Bias	Variance	Traffic cost	Eng. cost	Interpretability
Position feature	Low if leaf-only; can leak if interacted	Low (no reweighting)	None	Cheap — one feature	High — easy to explain
IPS	Unbiased if propensities correct	High (rare low-position clicks dominate)	None if propensities estimated from logs; some if randomization used	Moderate — needs a propensity model	Medium
Doubly robust	Unbiased if EITHER propensity or reward model is right	Lower than IPS	Same as IPS	Higher — two models to maintain	Lower — two failure modes to debug
Full randomization	None — gold standard	Low	High — random rankings are bad rankings	Low — just shuffle	Highest

In practice: position feature as a default; IPS or DR if you have a serious bias problem and the engineering budget; randomization only on a tiny slice for propensity estimation, never at scale.

Logged data vs the truth

An old ranker has placed five items at fixed positions. Each position has a declining examination probability; clicks are sampled from P(examine | pos) × relevance_i. As the log grows, the raw-CTR estimate converges, but to the biased value. IPS or the position-feature trick corrects it.

What your logs say vs what's true

Five items with known intrinsic relevance, placed by an old ranker. Item 4 is highly relevant but was placed at slot 8. Item 1 is mediocre but was placed at slot 1. Without debiasing, item 1 looks great and item 4 looks weak. IPS reverses the verdict.
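The scenario above can be reproduced with a short simulation. This is a sketch under stated assumptions: the examination curve is taken to be 1 / slot, the placements and relevances of items 2, 3 and 5 are invented, and the IPS correction uses the true (oracle) examination probability rather than an estimate.

```python
import random

def simulate(n_impressions: int = 20000, seed: int = 0):
    """Simulate PBM clicks for five items at fixed slots; compare raw CTR with IPS."""
    rng = random.Random(seed)
    # (item, slot, true relevance). Items 1 and 4 follow the text; the rest are made up.
    placements = [(1, 1, 0.30), (2, 2, 0.40), (3, 4, 0.20), (4, 8, 0.80), (5, 5, 0.50)]
    results = {}
    for item, slot, relevance in placements:
        p_examine = 1.0 / slot            # assumed examination curve
        p_click = p_examine * relevance   # examination hypothesis
        clicks = sum(rng.random() < p_click for _ in range(n_impressions))
        raw_ctr = clicks / n_impressions
        results[item] = {
            "slot": slot,
            "true": relevance,
            "raw_ctr": raw_ctr,
            "ips": raw_ctr / p_examine,   # oracle propensity, for illustration
        }
    return results

for item, row in simulate().items():
    print(item, row)
```

Increasing `n_impressions` tightens both columns, but only the IPS column tightens around the true relevance; the raw CTR tightens around relevance times examination.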


Selection bias — the items you never showed

Position bias is about where items appeared. Selection bias is about whether they appeared at all. Items the old ranker never surfaced contribute zero rows — the model has no evidence on them. Two fixes, both painful:

Exploration. Surface items the production ranker would not have chosen on some fraction of traffic — uniformly at random, or via epsilon-greedy / Thompson / UCB. Costs user satisfaction; buys unbiased data on the unseen action set.
Counterfactual estimation over actions. IPS, but with the propensity defined over the action space rather than position. Same bias-variance trade, harder propensities, only works for items that occasionally appear.
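Both fixes can be sketched together: an epsilon-greedy logger that records the propensity of whatever it showed, and an IPS estimate over actions that uses those propensities. The function names, the single-slot setting, and the hand-built log are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

def epsilon_greedy(ranked_items: Sequence[str], epsilon: float,
                   rng: random.Random) -> Tuple[str, float]:
    """Pick one item for a slot and return it with its logging propensity.

    With probability epsilon choose uniformly from all candidates,
    otherwise show the production ranker's top item.
    """
    n = len(ranked_items)
    explore = rng.random() < epsilon
    chosen = rng.choice(ranked_items) if explore else ranked_items[0]
    # The top item can be reached by either branch; every other item only by exploring.
    propensity = epsilon / n + ((1.0 - epsilon) if chosen == ranked_items[0] else 0.0)
    return chosen, propensity

def ips_value(logs: List[Tuple[str, int, float]], target_item: str) -> float:
    """IPS estimate of the click rate of a policy that always shows target_item.

    logs: (shown_item, clicked, propensity) rows from the exploring logger.
    """
    return sum(c / p for item, c, p in logs if item == target_item) / len(logs)

# Hand-built log with epsilon = 0.2 over four items: "a" is the ranker's pick.
logs = ([("a", 1, 0.85)] * 17 + [("a", 0, 0.85)] * 17 +
        [("b", 1, 0.05), ("b", 0, 0.05)] +
        [("c", 0, 0.05)] * 2 + [("d", 0, 0.05)] * 2)
print(ips_value(logs, "a"), ips_value(logs, "b"))
```

The estimate for "b" rests on two impressions, each weighted 20×, which is the variance problem from the position case showing up again. An item with propensity zero never appears in the log and cannot be estimated at all.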

Popularity bias and the feedback loop

Position and selection bias are static — they describe a single snapshot of logs. Popularity bias is dynamic. The feedback loop:

popular item → shown more → clicked more → logged as positive with high CTR → more confident in this item → retrained on yesterday's clicks → (back to) popular item

Each retrain concentrates probability mass further onto the head of the catalogue. After enough cycles the model is confidently wrong about anything outside the top 1% of items it has already seen succeed. Selection-bias fixes (exploration, IPS over actions) are necessary but not sufficient — popularity bias usually also wants an active diversification signal at re-rank time (lesson 11).

A common production failure

Retrain daily on yesterday's clicks. Week 1 the model improves. By week 4, online metrics regress: the model has narrowed to a shrinking slice of the catalogue and saturated. "Train more" makes it worse. Fix: inject exploration or reweight against popularity.
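A toy version of this failure, as a sketch rather than a model of any real pipeline. It tracks expected values only (no sampling noise), scores each item by its logged CTR, and gives unseen items a score of zero because there is no evidence on them. The catalogue, the function name, and the exploration rate are all hypothetical.

```python
from typing import List

def run_retrain_loop(relevance: List[float], k: int, days: int,
                     epsilon: float = 0.0) -> List[int]:
    """Expected-value simulation of daily retraining; returns the final top-k item ids.

    Day 0 shows items 0..k-1. Each day a fraction `epsilon` of exposure is spread
    uniformly over the catalogue and the rest goes to the current top-k.
    The "model" is just logged clicks / logged exposure per item.
    """
    n = len(relevance)
    exposure = [0.0] * n
    clicks = [0.0] * n
    top_k = list(range(k))
    for _ in range(days):
        for i in range(n):
            shown = epsilon / n + ((1.0 - epsilon) / k if i in top_k else 0.0)
            exposure[i] += shown
            clicks[i] += shown * relevance[i]   # expected clicks, no noise
        score = [c / e if e > 0 else 0.0 for c, e in zip(clicks, exposure)]
        top_k = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    return sorted(top_k)

# Ten items; the two most relevant (ids 8 and 9) start outside the head.
relevance = [0.35, 0.30, 0.25, 0.20, 0.20, 0.20, 0.20, 0.20, 0.60, 0.70]

locked_in = run_retrain_loop(relevance, k=3, days=30)
with_exploration = run_retrain_loop(relevance, k=3, days=30, epsilon=0.1)
print(locked_in, with_exploration)
```

Without exploration the head never changes, no matter how many days are simulated, because the better items never generate a row. With a small exploration budget they enter the head. Real systems have noise and richer models, so the recovery is slower and messier than this toy suggests.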

Causal vs predictive — what are you actually estimating?

A senior interviewer may probe this. P(click | x, shown) is a predictive quantity. The causal one is closer to "how much does click probability change if we show this item vs the counterfactual ranking?"

For ranking decisions, the predictive quantity (correctly position-debiased) is usually adequate — you're picking the item with the highest expected click rate at a fixed top slot. For uplift decisions (should an advertiser bid more, should we re-rank this user's feed away from the default), causal estimation matters more and is genuinely harder. Lesson 9/10 territory.

Interview prompts you should be ready for

"Your old ranker shows movie A at position 1 and movie B at position 5. B's logged CTR is lower than A's. Is B actually worse?" (Probes: do you immediately reach for position bias? Can you sketch the examination-hypothesis correction?)
"Walk me through the position-feature-at-train-only trick. Why does it work, and when does it break?" (Probes: does the candidate understand it depends on position being a leaf feature with no interactions? Can they articulate what happens if you let position cross with item embedding?)
"Compare IPS and doubly robust. When does each break?" (IPS breaks when propensities are wrong or extreme — variance blows up. DR is unbiased if either model is correct, but you now have two failure modes and a regression model whose own bias may leak in.)
"Estimating position bias from logs — outline the EM approach and what it assumes." (Probes: identifiability. Need same items appearing at different positions across sessions, or a separable functional form, or some other lever.)
"Your training pipeline retrains daily on yesterday's clicks. After 30 days the new model is worse than the original. What's happening?" (Feedback loop / popularity collapse. Fix is exploration or popularity reweighting, not more training data.)
"What's the failure mode of training a two-tower retriever on logged top-K data without any debiasing?" (The retriever inherits the upstream ranker's selection and position bias, then amplifies it: items the old ranker pushed down never appear in K, so they never become positives, so the retriever pushes them down further. A cascade of biased retrains.)

Takeaway

Your logs measure a confounded mixture of relevance and the old ranker's choices. Debiasing is not optional polish — it is the difference between a model that learns the world and one that learns its own history. The senior signal you want to send: name the bias, name the mechanism, name a correction with its trade-off, and know when to spend on full randomization versus a position feature.