The funnel — retrieve, rank, re-rank
Why production ranking is never one model. The latency, recall, and precision budget that forces the cascade. The first thing you draw in any system-design interview.
From first principles: why not a single model?
Imagine you have a perfect oracle: a single function f(user, item, context) → score that returns exactly the relevance you want. You run it on every (user, item) pair and pick the top K. Done.
The reason that fails in production is arithmetic, not modelling. Suppose:
- Catalogue size N = 10⁹ items (YouTube videos, products, ads).
- p99 serving budget: 100 ms end-to-end. The ML budget is maybe 50 ms of that — the rest is network, feature fetch, business logic, rendering.
- The "good" ranking model is a deep network with hundreds of features and embedding lookups. A reasonable per-item scoring cost: ~100 µs. (This is generous — production DLRM-class models with sparse features and cross-feature lookups can be 100–500 µs.)
Score every item: 10⁹ × 100 µs = 10⁵ seconds ≈ 28 hours. Per request. The math doesn't allow one model.
So you cascade. A cheap model — typically dot product between two precomputed embeddings — over the full catalogue produces a few hundred or a few thousand candidates. The expensive model scores only those. Optionally a third stage applies business logic (diversity, freshness, advertiser/policy filters) on the top dozens.
What each stage optimizes for — and why they can't be the same metric
This is the trap interviewers love to set: "if every stage optimizes for the same thing, why have stages?" The answer is that they don't.
| Stage | Metric | Why this one |
|---|---|---|
| Candidate gen | Recall@K₁ | You only have to not drop good items. Ranking is the ranker's job. False negatives here are fatal; false positives are cheap (they get filtered later). |
| Ranking | NDCG / AUC / calibrated CTR | You have to order the K₁ candidates well, and (for ads) emit a probability you can multiply with a bid. Precision matters because only the top dozens are shown. |
| Re-ranking | Long-term engagement, diversity, business constraints | The ranker scores items in isolation; users see a list. Re-ranking accounts for inter-item effects, freshness, exploration, ad load, policy violations. |
A way to internalize this: the candidate generator is allowed to be wrong about ordering as long as good items are in the K₁. The ranker is allowed to ignore items that didn't make it into K₁. The re-ranker is allowed to ignore most ranking detail and worry about the top-of-list user experience. Each stage narrows, and each stage gets richer features and a more expensive model as it does.
Interactive · the latency budget
Drag the sliders to set the catalogue size, end-to-end latency budget, and per-item model cost. The widget tells you whether you can serve from one stage, two, or three — and where the bottleneck is.
The recsys ↔ ads parallel
The same funnel works for both. Naming and metrics differ, but the structure is one-to-one.
| Recommender (e.g. YouTube Home) | Search ads (e.g. Google Sponsored Search) | |
|---|---|---|
| Retrieve from | All ~10⁹ candidate items | All eligible ads matching the query (typically 10³–10⁵ after match-type and policy filters) |
| Retrieval signal | User × item affinity (two-tower) | Query × ad relevance (keyword + semantic match) |
| Ranker predicts | P(watch ≥ 30s), P(like), engagement | P(click), P(convert) — multiplied by advertiser bid |
| Re-rank applies | Diversity, freshness, creator caps | Auction, reserve, ad-load, policy |
| Optimizes | Long-term engagement | Long-term auction efficiency (≈ user clicks × advertiser ROI × revenue) |
The auction layer is the only thing ads adds. Everything else — retrieval, ranking, calibration, debiasing — is the same machine. This is why senior ads MLEs and senior recsys MLEs are roughly interchangeable; only the last layer of vocabulary differs.
Multi-objective ranking — the ranker rarely predicts one thing
In an interview you'll be expected to know that "the ranking model" almost never has a single output. Modern production rankers are multi-task: they predict several engagement signals simultaneously, then combine them into one scalar score for sorting.
For YouTube-style recsys, a paper-canonical example (Zhao et al. 2019, "Recommending What Video to Watch Next") predicts roughly:
- P(click) — did the user click the thumbnail?
- P(watch ≥ 10s) — past the bounce threshold?
- E[watchtime | click] — regression head, conditional on a click.
- P(like), P(share), P(subscribe) — explicit engagement.
- P(skip), P(dislike) — negative signals.
The final score is a hand-tuned (or learned-to-rank) combination, e.g. α·P(watch) + β·E[wt] − γ·P(skip) + …. The weights are policy decisions, not model outputs — and interviewers love to ask "who picks the weights and how?" The honest answer is some combination of online A/B testing and Pareto-frontier evaluation; there is no off-the-shelf calibration for "long-term user happiness".
For ads, the multi-task ranker predicts P(click) and P(conversion | click), since the auction's expected revenue is bid × P(click) × P(conv | click) (or cpc_bid × P(click) for click-based campaigns). One model produces both — multi-task heads share embeddings and are much more sample-efficient than two separate models.
Where the funnel breaks — and what production fixes look like
Three failure modes that show up in interviews because they show up in production.
| Failure | Symptom | Production fix |
|---|---|---|
| Retrieval ↔ ranker mismatch | The ranker is great offline but the funnel underperforms online: the items the ranker thinks are best aren't in K₁. | Co-train. Ranker-distillation into the retriever (the retriever's loss includes the ranker's score on hard negatives). Or, periodically, hard-mine ranker top-K items the retriever missed and add them as positives. |
| Position bias compounds across stages | The training data for the ranker is clicks on items the old retriever surfaced at top positions. New retrievals never get a fair shot because their items appear lower (or never). | IPS / examination-model debiasing (lesson 7). Explicit exploration slot. Counterfactual evaluation before launch. |
| Re-rank starves the ranker | Heuristics at the re-rank stage (diversity, ad-load) routinely override the top-ranked items, so improving ranking quality has near-zero online impact. | Audit re-rank overrides as if they were a separate model with their own metric. Move policy-as-code to a learned re-ranker (e.g., generative re-rankers, listwise transformers). |
Interview prompts you should be ready for
- "Design YouTube Home feed end-to-end." (Draw the funnel within 30 seconds. Then go deep on whichever stage the interviewer steers you toward.)
- "Why can't you train one neural network on (user, item, context) over the full catalogue?" (Answer in arithmetic: per-item cost × catalogue size vs latency budget. Do the multiplication out loud.)
- "Your ranker improves NDCG offline by 5% but the A/B test is flat. What happened?" (Probes: retrieval mismatch, position bias, the ranker improvement was on items the retrieval never surfaces, novelty bias, or the metric used offline doesn't match the production metric.)
- "How do you decide K₁ — the candidate-set size?" (Trade-off between recall and ranker cost. Empirically: plot recall@K vs K, pick the elbow, then verify the ranker latency budget. There's no closed-form answer.)
- "What's the difference between the metric a ranker optimizes and the metric a feed optimizes?" (Probes: ranker outputs per-item probabilities; feed quality is a property of the list — diversity, freshness, ad-load. Different objects.)
- "For ads, the auction is at the end. Why not earlier?" (You can only auction items you have a calibrated CTR for. So auction is downstream of ranking. Retrieving cheaply and ranking carefully is what feeds the auction useful inputs.)