Negative sampling — the 99% of the data you cannot enumerate
Training a retrieval model is fundamentally a "which item out of a billion" problem. The positives are tiny; the negatives are everything else. How you pick the negatives is the model.
Why this problem exists at all
The "correct" loss for a two-tower retrieval model is a softmax over the entire catalogue:
P(item = i | user u) = exp(su,i) / Σj ∈ catalogue exp(su,j)
With N = 10⁹ items, computing the denominator for every training example is infeasible — that's one forward pass per item per gradient step. So the entire field of retrieval training is, in one sentence, "how do you approximate the denominator." Every technique below is an answer to that question.
In-batch negatives — the workhorse
The cheapest trick that works: for each (user, positive item) example in a minibatch of size B, use the other positives in the batch as that example's negatives. You get B − 1 negatives per example at zero additional embedding cost — the item-tower forward passes were going to happen anyway.
Li = − log [ exp(sui, vi) / Σj ∈ batch exp(sui, vj) ]
This is the loss in every paper from sentence-BERT to YouTube two-tower (Yi et al. 2019). It has two properties you need to be precise about.
Property 1: the negative distribution is empirical popularity. An item appears in a batch as a positive in proportion to how often it's interacted with. So the implicit "negative sampling distribution" q(j) is the popularity distribution q(j) ∝ pop(j). Popular items are over-represented as negatives for everyone, which means the model learns to slightly down-score them — a form of automatic popularity correction.
Property 2: there are no hard negatives. The negatives are whatever happens to be in the batch, which is the popular-items distribution. Items in the same niche as the positive — the actually informative negatives — are unlikely to co-occur unless the batch is enormous.
The log-Q correction (a.k.a. sampled softmax correction)
If you sample negative j with probability q(j), the gradient is biased relative to the full softmax. The fix, from Bengio et al. (2003) and rediscovered by Covington et al. (2016) for YouTube: subtract log q(j) from each candidate's score before the softmax.
s'u,j = su,j − log q(j)
Intuition: without the correction, popular items appear as negatives more often than the full softmax would weight them, so the model over-penalises them — pushing their scores down for everyone. The −log q(j) adjustment cancels this: rare items get their training scores boosted (large positive −log q), so the softmax denominator stops being dominated by popular items, and the bias on popular-item scores disappears. Without this correction, popular items get systematically under-scored at inference. Recall on head items collapses.
Uniform sampled softmax
Pick K negatives uniformly at random from the catalogue. Each has q(j) = 1/N, so log q(j) is a constant and drops out — no correction needed. In the limit of large K this is unbiased for the full softmax.
The catch is that uniform random negatives are nearly all easy. With N = 10⁹ items and the positive set being a tiny niche, a uniform random item is almost guaranteed to be obviously unrelated. The score difference su,i⁺ − su,j is already large for these — the gradient is near zero. You're paying for compute that doesn't teach the model anything.
Hard negatives — where the gradient actually lives
The informative negatives are items that are close to the positive but not actually engaged. They're the boundary cases the model has not yet figured out. Two ways to find them:
- Item-side mining. For each positive (u, i⁺), find items nearest to i⁺ in the current embedding space (ANN over the item tower's outputs). These are "items the model currently thinks are similar to i⁺". If they aren't actual positives for u, they're high-gradient negatives.
- Ranker-distilled mining. Items the ranker scored high but the user didn't engage with. These are negatives that already passed retrieval and got close to the user's attention — the retriever clearly under-rates the "not relevant" signal here. (This is the funnel-coupling story we revisit in lesson 7.)
Hard negatives can be 10–100× more sample-efficient than uniform random. The price: false negatives. An item that's semantically very close to i⁺ might actually be a perfect positive that simply hasn't been interacted with yet (sparsity, not negativity). Push it away in embedding space and you're teaching the model the wrong thing.
Standard mitigations:
- Exclude the very top-K of the ANN result (e.g. positions 1–5) — these are most likely false negatives. Mine from positions 6–100 instead.
- Mix hard with easy. Hard-only training is unstable; a few uniform random negatives anchor the model to the full-catalogue distribution.
- Curriculum: easy first, hard later. The model needs the coarse landscape before refining boundaries.
The false-negative rate, quantified
Why uniform negatives are nearly safe and hard negatives are not, in arithmetic. Suppose a user has P = 100 true positives in a catalogue of N = 10⁹ items. A uniform random sample has false-negative probability P/N = 10⁻⁷. Effectively zero noise.
Now sample from the user's top-100 ANN neighbours of i⁺. If the embedding is even modestly informative, the user's true positives are concentrated near i⁺ — perhaps 10 of the 100 ANN neighbours are true positives. The false-negative rate is now 10⁻¹, six orders of magnitude worse (this is a worst-case estimate assuming embeddings are already near-perfect; typical hard-negative miners see false-negative rates of a few percent rather than 10%). This is why "exclude the top-K" matters and why blind hard-mining destroys recall.
Trade-offs at a glance
| Strategy | Cost | Bias | Sample efficiency | False-negative rate | Implementation |
|---|---|---|---|---|---|
| Full softmax | Infeasible at scale | None (it's the target) | Maximal in theory | Zero | Trivial mathematically, impossible practically |
| In-batch | Free (B−1 per example) | Popularity-distributed; needs log-Q to be unbiased | Medium | Low (depends on batch composition) | One line of code; standard |
| Uniform sampled | O(K) per example | Unbiased in the limit | Low — most negatives are trivially easy | Effectively zero | Easy; needs a fast item sampler |
| Hard-mined | High — ANN over current embeddings, possibly per epoch | Biased toward boundary; needs care | High | High — the central failure mode | Complex; requires periodic index refresh |
Interactive · negative-sampling mixer
Allocate budget across three negative sources. The widget reports a rough effective-gradient signal, a false-negative rate, the head/tail recall delta you should expect at inference, and a diagnosis of the failure mode.
Production recipes
There is no canonical mix; the right split is dataset-specific and changes during training. A few stable patterns from published systems:
- Two-tower baseline (YouTube-style). YouTube two-tower (Yi et al. 2019) uses in-batch negatives with a streaming frequency estimator for q(·), not raw in-batch popularity. Cheap, scalable, decent recall at head; tail is weak because tail items rarely appear as in-batch positives.
- Mixed (Facebook's EBR, Huang et al. 2020). In-batch + small fraction of curated hard negatives. The hard negatives unlock tail recall; in-batch keeps training stable.
- Curriculum (ANCE, Xiong et al. 2020; in dense retrieval for IR). Periodically re-index the corpus with the current model, mine hard negatives from the new index, train, repeat. The "hardness" is updated alongside the embedding space.
For an interview a defensible default is "in-batch with log-Q, plus a small fraction of uniform random for tail coverage, plus hard-mining once the model has a reasonable embedding space and the index is cheap to refresh." Then state the trade-offs you'd watch.
Negative sampling for ranking
The ranker trains on logged impressions. Every shown item with its click label is a labeled example — the "negatives" are real shown-not-clicked items, which are already much harder than a uniform catalogue sample (the retrieval funnel pre-filtered them). The interesting biases here are different:
- Position bias. Items at the top of the list get more clicks regardless of relevance. The next lesson is entirely this story.
- Selection bias. You only see clicks on items the old system surfaced. Items the old system never showed have no labels at all — they don't appear as "negatives", they don't appear as anything.
- Negative subsampling for cost. If you do subsample the negatives (common for very large logs), you must reweight or recalibrate; this is the downsampled-prior correction from lesson 5. Without it, predicted CTRs are systematically too high.
How this lesson connects to the others
| Lesson | Hook to negative sampling |
|---|---|
| 02 · candidate generation | The two-tower retriever uses in-batch + log-Q by default. Most retriever-quality work is negative-sampling work. |
| 05 · losses & calibration | Sampled-softmax pre-training distorts the score distribution; the calibrator in lesson 5 has to undo it. Subsampling negatives for ranking has the same prior-correction story. |
| 07 · bias & debiasing | Ranker-distilled hard negatives use the ranker's score as a teacher — and inherit any bias the ranker carries (position, selection). Debiasing has to apply before distillation. |
Interview prompts you should be ready for
- "Why do you subtract log(popularity) from in-batch negative scores?" (Probes: do you know what unbiased sampled softmax is, and that in-batch sampling implicitly weights by popularity?)
- "Your two-tower retrieval is excellent on head queries and terrible on tail. From the negative-sampling side, what's the most likely cause?" (Probes: in-batch only, no uniform or hard mining; tail items never appear as positives so never appear as in-batch negatives; the model never learns to distinguish them.)
- "Walk me through hard-negative mining. What's the failure mode?" (Probes: ANN over current embeddings, gradient lives at the boundary; false negatives — items that should be positives — get pushed away; mitigation is exclude top-K and mix with easy.)
- "If you had unlimited compute, would you use the full softmax?" (Trick: still no, in practice. Most of the catalogue is uninformative and you'd be burning compute on near-zero gradients. Hard-mined negatives are more sample-efficient than dense full-softmax even when full-softmax is affordable.)
- "How does the choice of negative-sampling strategy interact with the calibration story from lesson 5?" (Probes: sampled-softmax scores aren't probabilities over the catalogue; subsampling negatives in ranking distorts the prior; both require post-hoc calibration to match production base rates.)
- "For a ranker trained on logged impressions, what's the analog of negative sampling?" (Probes: no sampling — you use all impressions; the "bias" is now position and selection bias, not popularity; subsampling for cost requires reweighting.)