search_ads_recsys / 06 · negative sampling lesson 6 / 11

Negative sampling — the 99% of the data you cannot enumerate

Training a retrieval model is fundamentally a "which item out of a billion" problem. The positives are tiny; the negatives are everything else. How you pick the negatives is the model.

Why this problem exists at all

The "correct" loss for a two-tower retrieval model is a softmax over the entire catalogue:

P(item = i | user u) = exp(su,i) / Σj ∈ catalogue exp(su,j)

With N = 10⁹ items, computing the denominator for every training example is infeasible — that's one forward pass per item per gradient step. So the entire field of retrieval training is, in one sentence, "how do you approximate the denominator." Every technique below is an answer to that question.

Retrieval vs ranking — where this matters
For retrieval (two-tower, ANN-indexed) the catalogue is enormous and the denominator is the whole problem. For ranking, you train on logged impressions: a few hundred shown items per user session, each with a click label. The negatives are real shown-not-clicked items, and the volume is manageable — typically no sampling, or at most light subsampling for cost. Most of the negative-sampling literature is a retrieval story; if an interviewer asks "how do you negative-sample for the ranker", the right first move is to ask what they mean.

In-batch negatives — the workhorse

The cheapest trick that works: for each (user, positive item) example in a minibatch of size B, use the other positives in the batch as that example's negatives. You get B − 1 negatives per example at zero additional embedding cost — the item-tower forward passes were going to happen anyway.

Li = − log [ exp(sui, vi) / Σj ∈ batch exp(sui, vj) ]

This is the loss in every paper from sentence-BERT to YouTube two-tower (Yi et al. 2019). It has two properties you need to be precise about.

Property 1: the negative distribution is empirical popularity. An item appears in a batch as a positive in proportion to how often it's interacted with. So the implicit "negative sampling distribution" q(j) is the popularity distribution q(j) ∝ pop(j). Popular items are over-represented as negatives for everyone, which means the model learns to slightly down-score them — a form of automatic popularity correction.

Property 2: there are no hard negatives. The negatives are whatever happens to be in the batch, which is the popular-items distribution. Items in the same niche as the positive — the actually informative negatives — are unlikely to co-occur unless the batch is enormous.

The log-Q correction (a.k.a. sampled softmax correction)

If you sample negative j with probability q(j), the gradient is biased relative to the full softmax. The fix, from Bengio et al. (2003) and rediscovered by Covington et al. (2016) for YouTube: subtract log q(j) from each candidate's score before the softmax.

s'u,j = su,j − log q(j)

Intuition: without the correction, popular items appear as negatives more often than the full softmax would weight them, so the model over-penalises them — pushing their scores down for everyone. The −log q(j) adjustment cancels this: rare items get their training scores boosted (large positive −log q), so the softmax denominator stops being dominated by popular items, and the bias on popular-item scores disappears. Without this correction, popular items get systematically under-scored at inference. Recall on head items collapses.

A senior-level "gotcha"
Skipping log-Q correction is one of the most common bugs in two-tower codebases. The model trains fine, evaluation on a held-out set looks fine, but online recall on head queries is mysteriously bad. The reason: the held-out eval samples its negatives from the same biased distribution, so the bias is invisible offline. You only see it when you serve against the full catalogue.

Uniform sampled softmax

Pick K negatives uniformly at random from the catalogue. Each has q(j) = 1/N, so log q(j) is a constant and drops out — no correction needed. In the limit of large K this is unbiased for the full softmax.

The catch is that uniform random negatives are nearly all easy. With N = 10⁹ items and the positive set being a tiny niche, a uniform random item is almost guaranteed to be obviously unrelated. The score difference su,i⁺ − su,j is already large for these — the gradient is near zero. You're paying for compute that doesn't teach the model anything.

Hard negatives — where the gradient actually lives

The informative negatives are items that are close to the positive but not actually engaged. They're the boundary cases the model has not yet figured out. Two ways to find them:

Hard negatives can be 10–100× more sample-efficient than uniform random. The price: false negatives. An item that's semantically very close to i⁺ might actually be a perfect positive that simply hasn't been interacted with yet (sparsity, not negativity). Push it away in embedding space and you're teaching the model the wrong thing.

Standard mitigations:

The false-negative rate, quantified

Why uniform negatives are nearly safe and hard negatives are not, in arithmetic. Suppose a user has P = 100 true positives in a catalogue of N = 10⁹ items. A uniform random sample has false-negative probability P/N = 10⁻⁷. Effectively zero noise.

Now sample from the user's top-100 ANN neighbours of i⁺. If the embedding is even modestly informative, the user's true positives are concentrated near i⁺ — perhaps 10 of the 100 ANN neighbours are true positives. The false-negative rate is now 10⁻¹, six orders of magnitude worse (this is a worst-case estimate assuming embeddings are already near-perfect; typical hard-negative miners see false-negative rates of a few percent rather than 10%). This is why "exclude the top-K" matters and why blind hard-mining destroys recall.

Trade-offs at a glance

StrategyCostBiasSample efficiencyFalse-negative rateImplementation
Full softmaxInfeasible at scaleNone (it's the target)Maximal in theoryZeroTrivial mathematically, impossible practically
In-batchFree (B−1 per example)Popularity-distributed; needs log-Q to be unbiasedMediumLow (depends on batch composition)One line of code; standard
Uniform sampledO(K) per exampleUnbiased in the limitLow — most negatives are trivially easyEffectively zeroEasy; needs a fast item sampler
Hard-minedHigh — ANN over current embeddings, possibly per epochBiased toward boundary; needs careHighHigh — the central failure modeComplex; requires periodic index refresh

Interactive · negative-sampling mixer

Allocate budget across three negative sources. The widget reports a rough effective-gradient signal, a false-negative rate, the head/tail recall delta you should expect at inference, and a diagnosis of the failure mode.

Mix your negatives
Numbers are illustrative — they're a toy model meant to make the trade-offs concrete, not predict your production metrics. Adjust the three sources; the diagnosis on the right is what an interviewer wants you to articulate.
effective gradient / step
false-negative rate
head − tail recall delta
share total
Diagnosis

Production recipes

There is no canonical mix; the right split is dataset-specific and changes during training. A few stable patterns from published systems:

For an interview a defensible default is "in-batch with log-Q, plus a small fraction of uniform random for tail coverage, plus hard-mining once the model has a reasonable embedding space and the index is cheap to refresh." Then state the trade-offs you'd watch.

Negative sampling for ranking

The ranker trains on logged impressions. Every shown item with its click label is a labeled example — the "negatives" are real shown-not-clicked items, which are already much harder than a uniform catalogue sample (the retrieval funnel pre-filtered them). The interesting biases here are different:

How this lesson connects to the others

LessonHook to negative sampling
02 · candidate generationThe two-tower retriever uses in-batch + log-Q by default. Most retriever-quality work is negative-sampling work.
05 · losses & calibrationSampled-softmax pre-training distorts the score distribution; the calibrator in lesson 5 has to undo it. Subsampling negatives for ranking has the same prior-correction story.
07 · bias & debiasingRanker-distilled hard negatives use the ranker's score as a teacher — and inherit any bias the ranker carries (position, selection). Debiasing has to apply before distillation.

Interview prompts you should be ready for

  1. "Why do you subtract log(popularity) from in-batch negative scores?" (Probes: do you know what unbiased sampled softmax is, and that in-batch sampling implicitly weights by popularity?)
  2. "Your two-tower retrieval is excellent on head queries and terrible on tail. From the negative-sampling side, what's the most likely cause?" (Probes: in-batch only, no uniform or hard mining; tail items never appear as positives so never appear as in-batch negatives; the model never learns to distinguish them.)
  3. "Walk me through hard-negative mining. What's the failure mode?" (Probes: ANN over current embeddings, gradient lives at the boundary; false negatives — items that should be positives — get pushed away; mitigation is exclude top-K and mix with easy.)
  4. "If you had unlimited compute, would you use the full softmax?" (Trick: still no, in practice. Most of the catalogue is uninformative and you'd be burning compute on near-zero gradients. Hard-mined negatives are more sample-efficient than dense full-softmax even when full-softmax is affordable.)
  5. "How does the choice of negative-sampling strategy interact with the calibration story from lesson 5?" (Probes: sampled-softmax scores aren't probabilities over the catalogue; subsampling negatives in ranking distorts the prior; both require post-hoc calibration to match production base rates.)
  6. "For a ranker trained on logged impressions, what's the analog of negative sampling?" (Probes: no sampling — you use all impressions; the "bias" is now position and selection bias, not popularity; subsampling for cost requires reweighting.)
Takeaway
Negative sampling is "how you approximate the full-softmax denominator without paying for it." Three families — in-batch, uniform, hard-mined — each correct different parts of the bias / variance / efficiency triangle and break in different ways. The senior-MLE signal is being able to predict the failure mode of any given mix: head bias from missing log-Q, tail collapse from all-in-batch, embedding pathology from too-hard mining. The mix is task-specific; the reasoning is universal.