The funnel — retrieve, rank, re-rank

Why production ranking is never one model. The latency, recall, and precision budget that forces the cascade. The first thing you draw in any system-design interview.

From first principles: why not a single model?

Imagine you have a perfect oracle: a single function f(user, item, context) → score that returns exactly the relevance you want. You run it on every (user, item) pair and pick the top K. Done.

The reason that fails in production is arithmetic, not modelling. Suppose:

Catalogue size N = 10⁹ items (YouTube videos, products, ads).
p99 serving budget: 100 ms end-to-end. The ML budget is maybe 50 ms of that — the rest is network, feature fetch, business logic, rendering.
The "good" ranking model is a deep network with hundreds of features and embedding lookups. A reasonable per-item scoring cost: ~100 µs. (This is generous — production DLRM-class models with sparse features and cross-feature lookups can be 100–500 µs.)

Score every item: 10⁹ × 100 µs = 10⁵ seconds ≈ 28 hours. Per request. The math doesn't allow one model.

So you cascade. A cheap model — typically dot product between two precomputed embeddings — over the full catalogue produces a few hundred or a few thousand candidates. The expensive model scores only those. Optionally a third stage applies business logic (diversity, freshness, advertiser/policy filters) on the top dozens.

┌────────────────────┐ │ 1 user request │ user_id, query, context └─────────┬──────────┘ │ ▼ ┌────────────────────┐ │ CANDIDATE GEN │ N = 10⁹ → K₁ = 10³ │ (retrieval) │ per-item cost: O(µs) via ANN │ optimize: RECALL │ latency budget: ~10 ms └─────────┬──────────┘ │ ▼ ┌────────────────────┐ │ RANKING │ K₁ = 10³ → K₂ = 10² │ (heavy model) │ per-item cost: O(100 µs) │ optimize: PRECISION│ latency budget: ~30 ms └─────────┬──────────┘ │ ▼ ┌────────────────────┐ │ RE-RANKING │ K₂ = 10² → shown items │ (business logic, │ per-item cost: O(ms) ok │ diversity, ads │ latency budget: ~10 ms │ blending, etc.) │ └────────────────────┘

What each stage optimizes for — and why they can't be the same metric

This is the trap interviewers love to set: "if every stage optimizes for the same thing, why have stages?" The answer is that they don't.

Stage	Metric	Why this one
Candidate gen	Recall@K₁	You only have to not drop good items. Ranking is the ranker's job. False negatives here are fatal; false positives are cheap (they get filtered later).
Ranking	NDCG / AUC / calibrated CTR	You have to order the K₁ candidates well, and (for ads) emit a probability you can multiply with a bid. Precision matters because only the top dozens are shown.
Re-ranking	Long-term engagement, diversity, business constraints	The ranker scores items in isolation; users see a list. Re-ranking accounts for inter-item effects, freshness, exploration, ad load, policy violations.

A way to internalize this: the candidate generator is allowed to be wrong about ordering as long as good items are in the K₁. The ranker is allowed to ignore items that didn't make it into K₁. The re-ranker is allowed to ignore most ranking detail and worry about the top-of-list user experience. Each stage narrows, and each stage gets richer features and a more expensive model as it does.

The recall/precision asymmetry, stated cleanly

A recall-stage error is unrecoverable downstream: a missed item is permanently gone. Precision is bounded: a precision-stage error costs you one slot in the result list. So the stage facing the largest catalogue (candidate gen) optimizes recall, and the stage just before the user (ranking) optimizes precision. This isn't a convention — it's forced by the cost asymmetry.

Interactive · the latency budget

Drag the sliders to set the catalogue size, end-to-end latency budget, and per-item model cost. The widget tells you whether you can serve from one stage, two, or three — and where the bottleneck is.

The recsys ↔ ads parallel

The same funnel works for both. Naming and metrics differ, but the structure is one-to-one.

	Recommender (e.g. YouTube Home)	Search ads (e.g. Google Sponsored Search)
Retrieve from	All ~10⁹ candidate items	All eligible ads matching the query (typically 10³–10⁵ after match-type and policy filters)
Retrieval signal	User × item affinity (two-tower)	Query × ad relevance (keyword + semantic match)
Ranker predicts	P(watch ≥ 30s), P(like), engagement	P(click), P(convert) — multiplied by advertiser bid
Re-rank applies	Diversity, freshness, creator caps	Auction, reserve, ad-load, policy
Optimizes	Long-term engagement	Long-term auction efficiency (≈ user clicks × advertiser ROI × revenue)

The auction layer is the only thing ads adds. Everything else — retrieval, ranking, calibration, debiasing — is the same machine. This is why senior ads MLEs and senior recsys MLEs are roughly interchangeable; only the last layer of vocabulary differs.

Multi-objective ranking — the ranker rarely predicts one thing

In an interview you'll be expected to know that "the ranking model" almost never has a single output. Modern production rankers are multi-task: they predict several engagement signals simultaneously, then combine them into one scalar score for sorting.

For YouTube-style recsys, a paper-canonical example (Zhao et al. 2019, "Recommending What Video to Watch Next") predicts roughly:

P(click) — did the user click the thumbnail?
P(watch ≥ 10s) — past the bounce threshold?
E[watchtime | click] — regression head, conditional on a click.
P(like), P(share), P(subscribe) — explicit engagement.
P(skip), P(dislike) — negative signals.

The final score is a hand-tuned (or learned-to-rank) combination, e.g. α·P(watch) + β·E[wt] − γ·P(skip) + …. The weights are policy decisions, not model outputs — and interviewers love to ask "who picks the weights and how?" The honest answer is some combination of online A/B testing and Pareto-frontier evaluation; there is no off-the-shelf calibration for "long-term user happiness".

For ads, the multi-task ranker predicts P(click) and P(conversion | click), since the auction's expected revenue is bid × P(click) × P(conv | click) (or cpc_bid × P(click) for click-based campaigns). One model produces both — multi-task heads share embeddings and are much more sample-efficient than two separate models.

Where junior candidates trip

Asked "what does the ranker predict?", a junior answer is "CTR". A correct answer is "several things, combined into one score; the combination is a policy choice." If the interviewer then asks "what if a video has high P(click) but low watchtime?" you have the framework to answer: the policy weights P(watch ≥ 30s) higher because that's the long-term metric, so high-CTR clickbait gets demoted.

Where the funnel breaks — and what production fixes look like

Three failure modes that show up in interviews because they show up in production.

Failure	Symptom	Production fix
Retrieval ↔ ranker mismatch	The ranker is great offline but the funnel underperforms online: the items the ranker thinks are best aren't in K₁.	Co-train. Ranker-distillation into the retriever (the retriever's loss includes the ranker's score on hard negatives). Or, periodically, hard-mine ranker top-K items the retriever missed and add them as positives.
Position bias compounds across stages	The training data for the ranker is clicks on items the old retriever surfaced at top positions. New retrievals never get a fair shot because their items appear lower (or never).	IPS / examination-model debiasing (lesson 7). Explicit exploration slot. Counterfactual evaluation before launch.
Re-rank starves the ranker	Heuristics at the re-rank stage (diversity, ad-load) routinely override the top-ranked items, so improving ranking quality has near-zero online impact.	Audit re-rank overrides as if they were a separate model with their own metric. Move policy-as-code to a learned re-ranker (e.g., generative re-rankers, listwise transformers).

Interview prompts you should be ready for

"Design YouTube Home feed end-to-end." (Draw the funnel within 30 seconds. Then go deep on whichever stage the interviewer steers you toward.)
"Why can't you train one neural network on (user, item, context) over the full catalogue?" (Answer in arithmetic: per-item cost × catalogue size vs latency budget. Do the multiplication out loud.)
"Your ranker improves NDCG offline by 5% but the A/B test is flat. What happened?" (Probes: retrieval mismatch, position bias, the ranker improvement was on items the retrieval never surfaces, novelty bias, or the metric used offline doesn't match the production metric.)
"How do you decide K₁ — the candidate-set size?" (Trade-off between recall and ranker cost. Empirically: plot recall@K vs K, pick the elbow, then verify the ranker latency budget. There's no closed-form answer.)
"What's the difference between the metric a ranker optimizes and the metric a feed optimizes?" (Probes: ranker outputs per-item probabilities; feed quality is a property of the list — diversity, freshness, ad-load. Different objects.)
"For ads, the auction is at the end. Why not earlier?" (You can only auction items you have a calibrated CTR for. So auction is downstream of ranking. Retrieving cheaply and ranking carefully is what feeds the auction useful inputs.)

Takeaway

The funnel exists because arithmetic forces it: one model can't see the whole catalogue. Each stage has its own metric, model class, latency budget, and failure mode. The interview signal you want to send is that you can draw the funnel without thinking, then reason about which stage is the bottleneck for any specific question.