Negative sampling — the 99% of the data you cannot enumerate

Training a retrieval model is fundamentally a "which item out of a billion" problem. The positives are tiny; the negatives are everything else. How you pick the negatives is the model.

Why this problem exists at all

The "correct" loss for a two-tower retrieval model is a softmax over the entire catalogue:

P(item = i | user u) = exp(s_u,i) / Σ_{j ∈ catalogue} exp(s_u,j)

With N = 10⁹ items, computing the denominator for every training example is infeasible — that's one forward pass per item per gradient step. So the entire field of retrieval training is, in one sentence, "how do you approximate the denominator." Every technique below is an answer to that question.

Retrieval vs ranking — where this matters

For retrieval (two-tower, ANN-indexed) the catalogue is enormous and the denominator is the whole problem. For ranking, you train on logged impressions: a few hundred shown items per user session, each with a click label. The negatives are real shown-not-clicked items, and the volume is manageable — typically no sampling, or at most light subsampling for cost. Most of the negative-sampling literature is a retrieval story; if an interviewer asks "how do you negative-sample for the ranker", the right first move is to ask what they mean.

In-batch negatives — the workhorse

The cheapest trick that works: for each (user, positive item) example in a minibatch of size B, use the other positives in the batch as that example's negatives. You get B − 1 negatives per example at zero additional embedding cost — the item-tower forward passes were going to happen anyway.

L_i = − log [ exp(s_{u_i, v_i}) / Σ_{j ∈ batch} exp(s_{u_i, v_j}) ]

This is the loss in every paper from sentence-BERT to YouTube two-tower (Yi et al. 2019). It has two properties you need to be precise about.

Property 1: the negative distribution is empirical popularity. An item appears in a batch as a positive in proportion to how often it's interacted with. So the implicit "negative sampling distribution" q(j) is the popularity distribution q(j) ∝ pop(j). Popular items are over-represented as negatives for everyone, which means the model learns to slightly down-score them — a form of automatic popularity correction.

Property 2: there are no hard negatives. The negatives are whatever happens to be in the batch, which is the popular-items distribution. Items in the same niche as the positive — the actually informative negatives — are unlikely to co-occur unless the batch is enormous.

The log-Q correction (a.k.a. sampled softmax correction)

If you sample negative j with probability q(j), the gradient is biased relative to the full softmax. The fix, from Bengio et al. (2003) and rediscovered by Covington et al. (2016) for YouTube: subtract log q(j) from each candidate's score before the softmax.

s'_u,j = s_u,j − log q(j)

Intuition: without the correction, popular items appear as negatives more often than the full softmax would weight them, so the model over-penalises them — pushing their scores down for everyone. The −log q(j) adjustment cancels this: rare items get their training scores boosted (large positive −log q), so the softmax denominator stops being dominated by popular items, and the bias on popular-item scores disappears. Without this correction, popular items get systematically under-scored at inference. Recall on head items collapses.

A senior-level "gotcha"

Skipping log-Q correction is one of the most common bugs in two-tower codebases. The model trains fine, evaluation on a held-out set looks fine, but online recall on head queries is mysteriously bad. The reason: the held-out eval samples its negatives from the same biased distribution, so the bias is invisible offline. You only see it when you serve against the full catalogue.

Uniform sampled softmax

Pick K negatives uniformly at random from the catalogue. Each has q(j) = 1/N, so log q(j) is a constant and drops out — no correction needed. In the limit of large K this is unbiased for the full softmax.

The catch is that uniform random negatives are nearly all easy. With N = 10⁹ items and the positive set being a tiny niche, a uniform random item is almost guaranteed to be obviously unrelated. The score difference s_u,i⁺ − s_u,j is already large for these — the gradient is near zero. You're paying for compute that doesn't teach the model anything.

Hard negatives — where the gradient actually lives

The informative negatives are items that are close to the positive but not actually engaged. They're the boundary cases the model has not yet figured out. Two ways to find them:

Item-side mining. For each positive (u, i⁺), find items nearest to i⁺ in the current embedding space (ANN over the item tower's outputs). These are "items the model currently thinks are similar to i⁺". If they aren't actual positives for u, they're high-gradient negatives.
Ranker-distilled mining. Items the ranker scored high but the user didn't engage with. These are negatives that already passed retrieval and got close to the user's attention — the retriever clearly under-rates the "not relevant" signal here. (This is the funnel-coupling story we revisit in lesson 7.)

Hard negatives can be 10–100× more sample-efficient than uniform random. The price: false negatives. An item that's semantically very close to i⁺ might actually be a perfect positive that simply hasn't been interacted with yet (sparsity, not negativity). Push it away in embedding space and you're teaching the model the wrong thing.

Standard mitigations:

Exclude the very top-K of the ANN result (e.g. positions 1–5) — these are most likely false negatives. Mine from positions 6–100 instead.
Mix hard with easy. Hard-only training is unstable; a few uniform random negatives anchor the model to the full-catalogue distribution.
Curriculum: easy first, hard later. The model needs the coarse landscape before refining boundaries.

The false-negative rate, quantified

Why uniform negatives are nearly safe and hard negatives are not, in arithmetic. Suppose a user has P = 100 true positives in a catalogue of N = 10⁹ items. A uniform random sample has false-negative probability P/N = 10⁻⁷. Effectively zero noise.

Now sample from the user's top-100 ANN neighbours of i⁺. If the embedding is even modestly informative, the user's true positives are concentrated near i⁺ — perhaps 10 of the 100 ANN neighbours are true positives. The false-negative rate is now 10⁻¹, six orders of magnitude worse (this is a worst-case estimate assuming embeddings are already near-perfect; typical hard-negative miners see false-negative rates of a few percent rather than 10%). This is why "exclude the top-K" matters and why blind hard-mining destroys recall.

Trade-offs at a glance

Strategy	Cost	Bias	Sample efficiency	False-negative rate	Implementation
Full softmax	Infeasible at scale	None (it's the target)	Maximal in theory	Zero	Trivial mathematically, impossible practically
In-batch	Free (B−1 per example)	Popularity-distributed; needs log-Q to be unbiased	Medium	Low (depends on batch composition)	One line of code; standard
Uniform sampled	O(K) per example	Unbiased in the limit	Low — most negatives are trivially easy	Effectively zero	Easy; needs a fast item sampler
Hard-mined	High — ANN over current embeddings, possibly per epoch	Biased toward boundary; needs care	High	High — the central failure mode	Complex; requires periodic index refresh

Interactive · negative-sampling mixer

Allocate budget across three negative sources. The widget reports a rough effective-gradient signal, a false-negative rate, the head/tail recall delta you should expect at inference, and a diagnosis of the failure mode.

Mix your negatives

Numbers are illustrative — they're a toy model meant to make the trade-offs concrete, not predict your production metrics. Adjust the three sources; the diagnosis on the right is what an interviewer wants you to articulate.

in-batch share (%): 50 uniform random share (%): 25 hard-mined share (%): 25 apply log-Q correction: exclude ANN top-5 from hard set:

effective gradient / step

—

false-negative rate

—

head − tail recall delta

—

share total

—

Diagnosis

—

Production recipes

There is no canonical mix; the right split is dataset-specific and changes during training. A few stable patterns from published systems:

Two-tower baseline (YouTube-style). YouTube two-tower (Yi et al. 2019) uses in-batch negatives with a streaming frequency estimator for q(·), not raw in-batch popularity. Cheap, scalable, decent recall at head; tail is weak because tail items rarely appear as in-batch positives.
Mixed (Facebook's EBR, Huang et al. 2020). In-batch + small fraction of curated hard negatives. The hard negatives unlock tail recall; in-batch keeps training stable.
Curriculum (ANCE, Xiong et al. 2020; in dense retrieval for IR). Periodically re-index the corpus with the current model, mine hard negatives from the new index, train, repeat. The "hardness" is updated alongside the embedding space.

For an interview a defensible default is "in-batch with log-Q, plus a small fraction of uniform random for tail coverage, plus hard-mining once the model has a reasonable embedding space and the index is cheap to refresh." Then state the trade-offs you'd watch.

Negative sampling for ranking

The ranker trains on logged impressions. Every shown item with its click label is a labeled example — the "negatives" are real shown-not-clicked items, which are already much harder than a uniform catalogue sample (the retrieval funnel pre-filtered them). The interesting biases here are different:

Position bias. Items at the top of the list get more clicks regardless of relevance. The next lesson is entirely this story.
Selection bias. You only see clicks on items the old system surfaced. Items the old system never showed have no labels at all — they don't appear as "negatives", they don't appear as anything.
Negative subsampling for cost. If you do subsample the negatives (common for very large logs), you must reweight or recalibrate; this is the downsampled-prior correction from lesson 5. Without it, predicted CTRs are systematically too high.

How this lesson connects to the others

Lesson	Hook to negative sampling
02 · candidate generation	The two-tower retriever uses in-batch + log-Q by default. Most retriever-quality work is negative-sampling work.
05 · losses & calibration	Sampled-softmax pre-training distorts the score distribution; the calibrator in lesson 5 has to undo it. Subsampling negatives for ranking has the same prior-correction story.
07 · bias & debiasing	Ranker-distilled hard negatives use the ranker's score as a teacher — and inherit any bias the ranker carries (position, selection). Debiasing has to apply before distillation.

Interview prompts you should be ready for

"Why do you subtract log(popularity) from in-batch negative scores?" (Probes: do you know what unbiased sampled softmax is, and that in-batch sampling implicitly weights by popularity?)
"Your two-tower retrieval is excellent on head queries and terrible on tail. From the negative-sampling side, what's the most likely cause?" (Probes: in-batch only, no uniform or hard mining; tail items never appear as positives so never appear as in-batch negatives; the model never learns to distinguish them.)
"Walk me through hard-negative mining. What's the failure mode?" (Probes: ANN over current embeddings, gradient lives at the boundary; false negatives — items that should be positives — get pushed away; mitigation is exclude top-K and mix with easy.)
"If you had unlimited compute, would you use the full softmax?" (Trick: still no, in practice. Most of the catalogue is uninformative and you'd be burning compute on near-zero gradients. Hard-mined negatives are more sample-efficient than dense full-softmax even when full-softmax is affordable.)
"How does the choice of negative-sampling strategy interact with the calibration story from lesson 5?" (Probes: sampled-softmax scores aren't probabilities over the catalogue; subsampling negatives in ranking distorts the prior; both require post-hoc calibration to match production base rates.)
"For a ranker trained on logged impressions, what's the analog of negative sampling?" (Probes: no sampling — you use all impressions; the "bias" is now position and selection bias, not popularity; subsampling for cost requires reweighting.)

Takeaway

Negative sampling is "how you approximate the full-softmax denominator without paying for it." Three families — in-batch, uniform, hard-mined — each correct different parts of the bias / variance / efficiency triangle and break in different ways. The senior-MLE signal is being able to predict the failure mode of any given mix: head bias from missing log-Q, tail collapse from all-in-batch, embedding pathology from too-hard mining. The mix is task-specific; the reasoning is universal.