search_ads_recsys / 02 · candidate generation lesson 2 / 11

Candidate generation — from CF to two-tower

The retrieval stage has one job: maximize recall at K₁ candidates, with a per-item cost measured in microseconds. That single constraint forces every architectural decision below.

Restating the problem from first principles

Lesson 1 established the funnel. Now zoom in on the first stage. The retrieval stage sees the full catalogue N ≈ 10⁸ – 10¹⁰ and must emit K₁ ≈ 10² – 10⁴ items in around 10 ms. That is roughly 1 – 100 ns of compute per catalogue item if you scan linearly, or ~10 µs per returned item if you have a sub-linear index. Either way: per-item compute is fixed and tiny.

Two architectural constraints follow immediately, and the rest of this lesson is downstream of them:

  1. Item representations must be precomputable. If scoring an item requires the user's identity as a model input at training time (e.g., a learned embedding keyed on user_id shared between the user and item sides), you cannot build an item-only index. Every new user breaks it.
  2. No cross-features at the top. If the final score is f(user_features, item_features) with a non-decomposable interaction (e.g., a transformer that attends across user and item tokens), then scoring item i requires the user's representation at index time. You're forced into a full cross-product over N items at request time, and you're back to the 14-hour problem from lesson 1.

The fix to both is the same: the final operation between the user side and the item side must be a late, cheap interaction — concretely, a dot product or cosine. This is called the two-tower constraint, and it is the single most important architectural fact about retrieval.

The retrieval theorem (informal)
Nearly every production retriever factors this way: score(u, i) = ⟨φU(user), φI(item)⟩ for some tower functions φU, φI. Anything richer than a dot product on top moves work back to request time and breaks the index. You can make the towers arbitrarily deep — that compute happens once at indexing — but the join must be a dot product. (The notable exception is tree-based retrieval, e.g. Alibaba's TDM/JTM, which uses beam search over a learned item tree. It's a minority approach and we don't cover it here.)

Collaborative filtering — the historical baseline

The earliest serious retrieval systems were collaborative filtering (CF). The interaction matrix R ∈ ℝU×I has a value when user u rated or clicked item i, and zeros (or missing) elsewhere. Two flavors:

CF is interpretable ("people who liked X also liked Y"), needs no item features, and works surprisingly well on dense interaction data. But it inherits three weaknesses that motivate everything that comes after: cold start (new items have no column, new users have no row), sparsity (when R is > 99.9% empty, cosines are noisy), and popularity bias (popular items co-occur with everything and dominate the neighbour lists).

Matrix factorization — CF, but with a generalization

Funk's Netflix Prize recipe (2006) reframed CF as a low-rank factorization. Approximate the sparse interaction matrix as a product of two dense low-rank matrices:

R ≈ U · VT,   U ∈ ℝ|users|×d,   V ∈ ℝ|items|×d

Each user gets a d-dimensional embedding ui; each item gets vj; the predicted preference is ⟨ui, vj. The loss is either squared error on observed ratings (explicit feedback) or a pairwise ranking loss like BPR (Rendle et al. 2009 — Bayesian Personalized Ranking) on implicit feedback:

LBPR = − Σ(u, i⁺, i⁻) log σ(⟨u, vi⁺⟩ − ⟨u, vi⁻⟩)

where i⁺ is an item the user engaged with and i⁻ is sampled from the catalogue. SVD++ and friends add bias terms and implicit-feedback features but the skeleton is the same.

The reason MF matters in 2025 is that it is structurally a special case of a two-tower model with one-hot input features. Read that sentence twice. ui = UT · onehot(user_id) is exactly what a single embedding-lookup layer computes. So MF is a two-tower model whose only input feature on each side is the ID. Everything that follows is "what if the towers had more features and more layers?"

The two-tower model

Generalize MF: replace each embedding lookup with a deep network that takes arbitrary features.

user features user MLP φ_U u ∈ ℝᵈ computed online (per request) item features item MLP φ_I v ∈ ℝᵈ precomputed (full catalogue indexed) ⟨ u , v ⟩ → score NO CROSS FEATURES (late interaction)

Crucially: the two towers do not share parameters. They can be wildly different architectures — a transformer over the user's recent watch history on one side, a CNN over the thumbnail and an embedding bag over the title on the other. They only need to produce vectors in the same d-dimensional space, and the final operation is a dot product. d is small by neural-net standards: 64 – 256 is typical, because d sets the ANN index size and recall trade-off (lesson 3).

Training is done with sampled softmax or in-batch negatives: in a minibatch of (user, positive_item) pairs, every other item in the batch is treated as a negative for every user. The loss is

L = − Σu log [ exp(⟨u, v+⟩ / τ) / Σv ∈ batch exp(⟨u, v⟩ / τ) ]

This sidesteps the impossibility of a softmax over N = 10⁹ classes. There are well-known biases this introduces (popular items appear more often as in-batch negatives, so they get over-penalized) and well-known fixes (logQ correction, mixed-negative sampling) — those are the subject of lesson 6.

Why the late-interaction constraint is non-negotiable

Junior candidates often propose "what if I just add a few cross-features at the top of the two-tower?" — e.g., a user × item ID interaction, or a small MLP over [u; v; u⊙v]. Watch what happens at serving time:

ArchitectureServing cost per requestIndex possible?
⟨u, v⟩ (pure two-tower) 1 user forward + ANN lookup (O(log N) or O(N · d) linear scan) Yes — item vectors are static.
MLP([u; v]) (late MLP on top) 1 user forward + N × MLP forward — no factorization No — every item needs the user vector concatenated and pushed through the MLP.
cross-attention(u tokens, v tokens) Same as above, worse. N × attention per request. No.
FM-style: u, v, ⟨u, v⟩, plus learned bias terms Reducible to a dot product over an augmented vector (concat the bias dims into u and v). OK. Yes, with care.

The principle: any operation between the user side and the item side that is not a dot product (or that can't be algebraically rewritten as one) breaks the index. That's the architectural cost of expressiveness. The ranker can afford it — it sees K₁ = 10³ items, not N = 10⁹. The retriever cannot.

One retriever is never enough — multi-source retrieval

A single two-tower model captures one notion of relevance. Production systems run several retrieval sources in parallel and union their outputs, because each source captures a different slice of the catalogue.

SourceWhat it capturesFailure mode in isolation
Collaborative two-tower Co-engagement patterns. Strongest signal on head queries / popular items. Item cold start. Popularity bias.
Content two-tower Semantic match on item features (text, image, taxonomy). Scoreable on day-zero items. Misses purely behavioural patterns (item A and item B share no content but co-occur in real users).
Recent-engagement retrieval Item-item nearest neighbours of items the user just touched. Captures session intent. Echo chamber within the session; doesn't broaden interests.
Popularity / trending The (regional, daily) top items, irrespective of user. Cheap safety net. Non-personalized. Drowns the tail.
Exploration Random or Thompson-sampled items, often weighted toward items with low impression count. By design noisy; needs the ranker to filter.

YouTube's recsys is the canonical example: Covington, Adams, Sargin (2016) — Deep Neural Networks for YouTube Recommendations — describes a multi-source retrieval setup, and later writeups indicate the production system fans out to more than ten candidate sources. Each is a small, focused model with its own training data and ANN index. The ranker reconciles them.

Cold start, solved in the tower features

The seductive answer to cold start is "use a content-based source." The deeper principle: cold start is solved by which features go into the tower, not by which source you pick.

Common interview trap
Asked "how do you handle cold-start items?", a junior answer is "use a content-based recommender." The senior answer is: "I make sure the item tower of my collaborative model itself consumes content features, so the same model handles head and tail items with the same code path. I'd only add a dedicated content source if I have evidence the content tower is being starved of gradient signal on tail items inside the collaborative model."

The four retrieval architectures, compared

Item-item CFMatrix factorizationTwo-tower (content)Two-tower (collaborative)
Cold start (new items)Bad — no columnBad — no vj learnedExcellent — content features carry itGood if tower has content features
Cold start (new users)OK — works from first clickBad — no ui learnedExcellent if user tower has context featuresGood if tower has context features
ExpressivenessLowLow — bilinear onlyMedium — depth of towerMedium — same, with behavioural signal
Sparsity toleranceBad — needs co-occurrenceBetter — embeds through structureBest — features fill the gapsBest — features + behavior
Scalability (serving)Pre-compute item-item table, O(I²) storeANN over vjANN over vjANN over vj
Ease of debuggingVery high — "neighbours of X"High — inspect embeddingsMedium — feature attributionLower — entangled signals

Interactive · the candidate-sources mixer

Allocate your K₁ budget across five candidate sources and read the diagnosis. The numbers are illustrative toy estimates — the point is to feel how the trade-offs interact, not to calibrate a real system.

Mix your retrieval sources
Each slider is the fraction of K₁ slots allocated to that source (the widget renormalizes to sum to 1). The panels show toy estimates of recall on head queries, recall on tail/new items, diversity, and a diagnosis of the most likely failure mode of your mix.
recall · head queries
recall · tail / new items
diversity proxy
verdict
Reading

Interview prompts you should be ready for

  1. "Why don't you train one model that takes (user, item) features and outputs a score, then retrieve top-K from that?" (Probes: do you understand that no cross-features at the top is what makes the item index precomputable? Compute the cost: N × MLP_forward per request.)
  2. "Walk me through the loss function of a two-tower model." (Sampled softmax / in-batch negatives, temperature, logQ correction. Bonus if you mention popularity bias of in-batch negatives and the correction for it.)
  3. "Your two-tower retrieval has terrible recall on long-tail items. Diagnose." (Probes: item tower starved of content features so tail items have noisy embeddings; in-batch negatives over-penalize popular items but the tail isn't actually being trained on; missing exploration source; ranker filters tail out so positive labels never accumulate.)
  4. "When would you use matrix factorization in 2025 instead of a two-tower model?" (Honest answer: rarely, but — tiny catalogue, dense interactions, interpretability requirement, no item features available, baseline you need to ship in a week. MF is a special case of two-tower, not a different model class.)
  5. "Your candidate gen has 10 sources. How do you decide K₁ allocation across sources?" (Probes: offline recall-at-K per source on held-out positives, online A/B on allocation, marginal-utility curve per source. Sources with overlapping coverage waste slots — measure incremental recall, not raw recall.)
  6. "What's the failure mode of training the two towers end-to-end on logged-click data?" (Probes: position bias propagates into the retriever. Items shown at top get clicked more, retriever learns to surface them again, the loop tightens. This is the setup for lesson 7 on debiasing.)
Takeaway
The retrieval stage is shaped by one constraint — late, cheap interaction between user and item — and everything from CF to two-tower is a different way of populating the two sides of that dot product. Multi-source retrieval is the rule, not the exception, because no single tower captures head, tail, intent, and exploration at once. Cold start lives in the tower features, not in the source mix.