Candidate generation — from CF to two-tower
The retrieval stage has one job: maximize recall at K₁ candidates, with a per-item cost measured in microseconds. That single constraint forces every architectural decision below.
Restating the problem from first principles
Lesson 1 established the funnel. Now zoom in on the first stage. The retrieval stage sees the full catalogue N ≈ 10⁸ – 10¹⁰ and must emit K₁ ≈ 10² – 10⁴ items in around 10 ms. That is roughly 1 – 100 ns of compute per catalogue item if you scan linearly, or ~10 µs per returned item if you have a sub-linear index. Either way: per-item compute is fixed and tiny.
Two architectural constraints follow immediately, and the rest of this lesson is downstream of them:
- Item representations must be precomputable. If scoring an item requires the user's identity as a model input at training time (e.g., a learned embedding keyed on
user_idshared between the user and item sides), you cannot build an item-only index. Every new user breaks it. - No cross-features at the top. If the final score is f(user_features, item_features) with a non-decomposable interaction (e.g., a transformer that attends across user and item tokens), then scoring item i requires the user's representation at index time. You're forced into a full cross-product over N items at request time, and you're back to the 14-hour problem from lesson 1.
The fix to both is the same: the final operation between the user side and the item side must be a late, cheap interaction — concretely, a dot product or cosine. This is called the two-tower constraint, and it is the single most important architectural fact about retrieval.
Collaborative filtering — the historical baseline
The earliest serious retrieval systems were collaborative filtering (CF). The interaction matrix R ∈ ℝU×I has a value when user u rated or clicked item i, and zeros (or missing) elsewhere. Two flavors:
- Item-item CF. Compute similarity between items by taking the cosine (or adjusted-cosine, mean-centered per user) between their columns of R. To recommend for user u, look at items u already engaged with, retrieve their nearest neighbours, union and de-duplicate. Famously what powered Amazon recommendations in the 2000s.
- User-user CF. Symmetric: similarity over rows of R. Recommend items popular among u's nearest user-neighbours.
CF is interpretable ("people who liked X also liked Y"), needs no item features, and works surprisingly well on dense interaction data. But it inherits three weaknesses that motivate everything that comes after: cold start (new items have no column, new users have no row), sparsity (when R is > 99.9% empty, cosines are noisy), and popularity bias (popular items co-occur with everything and dominate the neighbour lists).
Matrix factorization — CF, but with a generalization
Funk's Netflix Prize recipe (2006) reframed CF as a low-rank factorization. Approximate the sparse interaction matrix as a product of two dense low-rank matrices:
R ≈ U · VT, U ∈ ℝ|users|×d, V ∈ ℝ|items|×d
Each user gets a d-dimensional embedding ui; each item gets vj; the predicted preference is ⟨ui, vj⟩. The loss is either squared error on observed ratings (explicit feedback) or a pairwise ranking loss like BPR (Rendle et al. 2009 — Bayesian Personalized Ranking) on implicit feedback:
LBPR = − Σ(u, i⁺, i⁻) log σ(⟨u, vi⁺⟩ − ⟨u, vi⁻⟩)
where i⁺ is an item the user engaged with and i⁻ is sampled from the catalogue. SVD++ and friends add bias terms and implicit-feedback features but the skeleton is the same.
The reason MF matters in 2025 is that it is structurally a special case of a two-tower model with one-hot input features. Read that sentence twice. ui = UT · onehot(user_id) is exactly what a single embedding-lookup layer computes. So MF is a two-tower model whose only input feature on each side is the ID. Everything that follows is "what if the towers had more features and more layers?"
The two-tower model
Generalize MF: replace each embedding lookup with a deep network that takes arbitrary features.
Crucially: the two towers do not share parameters. They can be wildly different architectures — a transformer over the user's recent watch history on one side, a CNN over the thumbnail and an embedding bag over the title on the other. They only need to produce vectors in the same d-dimensional space, and the final operation is a dot product. d is small by neural-net standards: 64 – 256 is typical, because d sets the ANN index size and recall trade-off (lesson 3).
Training is done with sampled softmax or in-batch negatives: in a minibatch of (user, positive_item) pairs, every other item in the batch is treated as a negative for every user. The loss is
L = − Σu log [ exp(⟨u, v+⟩ / τ) / Σv ∈ batch exp(⟨u, v⟩ / τ) ]
This sidesteps the impossibility of a softmax over N = 10⁹ classes. There are well-known biases this introduces (popular items appear more often as in-batch negatives, so they get over-penalized) and well-known fixes (logQ correction, mixed-negative sampling) — those are the subject of lesson 6.
Why the late-interaction constraint is non-negotiable
Junior candidates often propose "what if I just add a few cross-features at the top of the two-tower?" — e.g., a user × item ID interaction, or a small MLP over [u; v; u⊙v]. Watch what happens at serving time:
| Architecture | Serving cost per request | Index possible? |
|---|---|---|
| ⟨u, v⟩ (pure two-tower) | 1 user forward + ANN lookup (O(log N) or O(N · d) linear scan) | Yes — item vectors are static. |
| MLP([u; v]) (late MLP on top) | 1 user forward + N × MLP forward — no factorization | No — every item needs the user vector concatenated and pushed through the MLP. |
| cross-attention(u tokens, v tokens) | Same as above, worse. N × attention per request. | No. |
| FM-style: u, v, ⟨u, v⟩, plus learned bias terms | Reducible to a dot product over an augmented vector (concat the bias dims into u and v). OK. | Yes, with care. |
The principle: any operation between the user side and the item side that is not a dot product (or that can't be algebraically rewritten as one) breaks the index. That's the architectural cost of expressiveness. The ranker can afford it — it sees K₁ = 10³ items, not N = 10⁹. The retriever cannot.
One retriever is never enough — multi-source retrieval
A single two-tower model captures one notion of relevance. Production systems run several retrieval sources in parallel and union their outputs, because each source captures a different slice of the catalogue.
| Source | What it captures | Failure mode in isolation |
|---|---|---|
| Collaborative two-tower | Co-engagement patterns. Strongest signal on head queries / popular items. | Item cold start. Popularity bias. |
| Content two-tower | Semantic match on item features (text, image, taxonomy). Scoreable on day-zero items. | Misses purely behavioural patterns (item A and item B share no content but co-occur in real users). |
| Recent-engagement retrieval | Item-item nearest neighbours of items the user just touched. Captures session intent. | Echo chamber within the session; doesn't broaden interests. |
| Popularity / trending | The (regional, daily) top items, irrespective of user. Cheap safety net. | Non-personalized. Drowns the tail. |
| Exploration | Random or Thompson-sampled items, often weighted toward items with low impression count. | By design noisy; needs the ranker to filter. |
YouTube's recsys is the canonical example: Covington, Adams, Sargin (2016) — Deep Neural Networks for YouTube Recommendations — describes a multi-source retrieval setup, and later writeups indicate the production system fans out to more than ten candidate sources. Each is a small, focused model with its own training data and ANN index. The ranker reconciles them.
Cold start, solved in the tower features
The seductive answer to cold start is "use a content-based source." The deeper principle: cold start is solved by which features go into the tower, not by which source you pick.
- New users. If the user tower's only input is
user_id(i.e., a learned embedding lookup), a new user has a random embedding and recommendations are useless. If the user tower also consumes context features (device, locale, query, time-of-day, demographic priors), then a new user has a sensible vector from request 1. - New items. Same on the item side. If the item tower consumes only
item_id, a new item is invisible until it accrues clicks. If the item tower consumes content features (title, image, creator, category), it has a sensible vector at indexing time and is retrievable from day zero. This is why content towers are essential for marketplaces with fresh inventory (jobs, news, listings, ads).
The four retrieval architectures, compared
| Item-item CF | Matrix factorization | Two-tower (content) | Two-tower (collaborative) | |
|---|---|---|---|---|
| Cold start (new items) | Bad — no column | Bad — no vj learned | Excellent — content features carry it | Good if tower has content features |
| Cold start (new users) | OK — works from first click | Bad — no ui learned | Excellent if user tower has context features | Good if tower has context features |
| Expressiveness | Low | Low — bilinear only | Medium — depth of tower | Medium — same, with behavioural signal |
| Sparsity tolerance | Bad — needs co-occurrence | Better — embeds through structure | Best — features fill the gaps | Best — features + behavior |
| Scalability (serving) | Pre-compute item-item table, O(I²) store | ANN over vj | ANN over vj | ANN over vj |
| Ease of debugging | Very high — "neighbours of X" | High — inspect embeddings | Medium — feature attribution | Lower — entangled signals |
Interactive · the candidate-sources mixer
Allocate your K₁ budget across five candidate sources and read the diagnosis. The numbers are illustrative toy estimates — the point is to feel how the trade-offs interact, not to calibrate a real system.
Interview prompts you should be ready for
- "Why don't you train one model that takes (user, item) features and outputs a score, then retrieve top-K from that?" (Probes: do you understand that no cross-features at the top is what makes the item index precomputable? Compute the cost: N × MLP_forward per request.)
- "Walk me through the loss function of a two-tower model." (Sampled softmax / in-batch negatives, temperature, logQ correction. Bonus if you mention popularity bias of in-batch negatives and the correction for it.)
- "Your two-tower retrieval has terrible recall on long-tail items. Diagnose." (Probes: item tower starved of content features so tail items have noisy embeddings; in-batch negatives over-penalize popular items but the tail isn't actually being trained on; missing exploration source; ranker filters tail out so positive labels never accumulate.)
- "When would you use matrix factorization in 2025 instead of a two-tower model?" (Honest answer: rarely, but — tiny catalogue, dense interactions, interpretability requirement, no item features available, baseline you need to ship in a week. MF is a special case of two-tower, not a different model class.)
- "Your candidate gen has 10 sources. How do you decide K₁ allocation across sources?" (Probes: offline recall-at-K per source on held-out positives, online A/B on allocation, marginal-utility curve per source. Sources with overlapping coverage waste slots — measure incremental recall, not raw recall.)
- "What's the failure mode of training the two towers end-to-end on logged-click data?" (Probes: position bias propagates into the retriever. Items shown at top get clicked more, retriever learns to surface them again, the loop tightens. This is the setup for lesson 7 on debiasing.)