System design — the capstone walkthrough
The interviewer says "design Instagram Reels ranking" or "design Google Sponsored Search." This is the answer, with every previous lesson slotted into its place. The latency-budget simulator near the end is the arithmetic you should be ready to do out loud.
The interview has a predictable rhythm — ride it
Most system-design interviews follow the same arc, regardless of which product the interviewer names. Senior candidates don't fight it; they ride it, because the rhythm is also a natural way to actually design a system.
- Clarify scope. Scale, latency, single- vs multi-task, cold start, what counts as success. Two minutes, five questions.
- Define metrics. Top-line online metric, offline proxy, guardrails. One minute.
- Draw the funnel. Retrieve → rank → re-rank, with K₁, K₂, K_shown and a per-stage latency budget. Two minutes. This is the diagram you must be able to draw in 30 seconds.
- Zoom in where steered. The interviewer will pick a stage: cold start, debiasing, the auction, calibration drift. The remaining 25 minutes live here.
- Failure modes and operations. Train/serve skew, calibration monitoring, A/B test power, on-call playbook. Five minutes, often near the end.
Worked example 1 — "Design Instagram Reels Home ranking"
Step 1 · Clarify
Five questions, asked out loud:
- Scale. Roughly a billion monthly users, on the order of 10⁹ active short videos, ~10² impressions per session.
- Latency. End-to-end per-impression on the order of 50–100 ms; the ML budget is maybe 40 ms inside that.
- Cold start. Both. New users (no history) and new videos (just uploaded) both matter — Reels grows by both.
- Objective. Multi-task. Watch time, like, share, retain. Not just CTR — clickbait is a known failure mode for video.
- Constraints. Diversity (no five basketball clips in a row), creator caps, ad slots inserted, policy filters.
Step 2 · Metrics
| Layer | Metric | Why |
|---|---|---|
| Top-line online | DAU + time-spent + 28-day retention | What the business actually cares about. Slow-moving, noisy, but the truth. |
| A/B proxy | Session watch-time, completion rate, like-rate | Powered enough to move in a 1–2 week test. |
| Offline proxy | NDCG / AUC per head, calibration error | What you can compute on yesterday's logs to gate launches. |
| Guardrails | Skip rate, dislike rate, complaint rate, creator concentration | Things you must not regress on, even if top-line goes up. |
Step 3 · Draw the funnel
Retrieve (corpus → K₁) → rank (K₁ → K₂) → re-rank (K₂ → K_shown), each stage with its own latency budget.
Step 4 · Each stage in one breath
Retrieve (lessons 2–3). Multiple sources because no single retrieval signal covers the space. A collaborative two-tower (user_id × video_id, trained on watch/like history) for behavioral affinity. A content two-tower (text/audio/video embedding × user-history embedding) for new videos with no engagement signal. A recent-network source pulling videos posted by accounts the user follows. A trending baseline. An exploration source — a small budget for items the system is uncertain about. Each source has its own index (HNSW for the dense towers, inverted index for trending). Union, dedupe, hand the top K₁ to the ranker.
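The "union, dedupe, hand the top K₁ to the ranker" step can be sketched as follows. This is a minimal illustration, not a production merger: the per-source quotas, the round-robin interleave, and the name `merge_sources` are assumptions made here, chosen because scores from different sources are generally not comparable with each other.

```python
from typing import Dict, List, Tuple


def merge_sources(
    candidates_by_source: Dict[str, List[Tuple[str, float]]],
    quotas: Dict[str, int],
    k1: int,
) -> List[str]:
    """Union per-source candidates, dedupe by video id, return at most k1 ids.

    Each source is sorted by its own score and cut to its quota (hypothetical
    policy), then sources are interleaved round-robin so that no single
    source crowds out the others.
    """
    ranked: Dict[str, List[str]] = {}
    for name, cands in candidates_by_source.items():
        top = sorted(cands, key=lambda c: c[1], reverse=True)
        ranked[name] = [video_id for video_id, _ in top[: quotas.get(name, 0)]]

    merged: List[str] = []
    seen = set()
    depth = 0
    max_depth = max((len(ids) for ids in ranked.values()), default=0)
    while len(merged) < k1 and depth < max_depth:
        for name in ranked:
            if depth < len(ranked[name]):
                video_id = ranked[name][depth]
                if video_id not in seen:  # dedupe across sources
                    seen.add(video_id)
                    merged.append(video_id)
                    if len(merged) == k1:
                        break
        depth += 1
    return merged
```

In an interview, the point worth saying out loud is why the merge is quota-based rather than score-based: a cosine similarity from a two-tower and a trending count live on different scales.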
Rank (lessons 4–5). A DLRM-class network. Sparse features: user_id, video_id, creator_id, hashtags, audio_id. Dense features: video age, duration, prior engagement rates, geo. User history: last 100 actions, pooled (mean or small transformer). Multi-task heads share the trunk: sigmoid heads per binary signal (P(watch ≥ 5s), P(like), P(skip), P(share)) plus a regression head for E[watchtime]. Final score is a hand-tuned linear combination — policy weights, not learned end-to-end.
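The hand-tuned linear combination at the end of the ranker is small enough to write down. The head names and weight values below are hypothetical placeholders, not values from any real system; the only thing taken from the design above is that the weights are policy constants rather than learned parameters.

```python
from typing import Dict

# Hypothetical policy weights. The negative weight penalizes predicted skips.
POLICY_WEIGHTS: Dict[str, float] = {
    "p_watch_5s": 1.0,
    "p_like": 2.0,
    "p_share": 4.0,
    "p_skip": -1.5,
    "e_watchtime_s": 0.05,  # expected watch time in seconds
}


def final_score(head_outputs: Dict[str, float],
                weights: Dict[str, float] = POLICY_WEIGHTS) -> float:
    """Combine per-head predictions into one ranking score.

    Raises KeyError if a head named in `weights` is missing, which is
    preferable to silently scoring with a partial set of heads.
    """
    return sum(weights[name] * head_outputs[name] for name in weights)
```

Because the weights sit outside the model, changing the product trade-off (say, valuing shares more) is a config change, not a retrain.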
Re-rank. Listwise constraints the per-item ranker can't see. Creator caps in the shown window, topic diversity, ads inserted at policy positions, one exploration slot in the first 10, policy filters.
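A greedy pass is one common way to apply listwise constraints like these. The sketch below covers only two of them, a creator cap and a limit on consecutive same-topic items; the `Candidate` fields, the default limits, and the choice to stop when nothing satisfies the constraints are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    video_id: str
    creator_id: str
    topic: str
    score: float


def rerank(candidates: List[Candidate], k_shown: int,
           max_per_creator: int = 2, max_topic_run: int = 2) -> List[Candidate]:
    """Greedily fill k_shown slots with the best-scoring candidate that
    violates neither the creator cap nor the same-topic run limit."""
    remaining = sorted(candidates, key=lambda c: c.score, reverse=True)
    shown: List[Candidate] = []
    creator_counts: Dict[str, int] = {}
    while remaining and len(shown) < k_shown:
        pick = None
        for cand in remaining:
            if creator_counts.get(cand.creator_id, 0) >= max_per_creator:
                continue  # creator cap reached
            recent = shown[-max_topic_run:]
            if len(recent) == max_topic_run and all(s.topic == cand.topic for s in recent):
                continue  # would extend a same-topic run past the limit
            pick = cand
            break
        if pick is None:
            break  # nothing satisfies the constraints; a real system might relax them
        shown.append(pick)
        remaining.remove(pick)
        creator_counts[pick.creator_id] = creator_counts.get(pick.creator_id, 0) + 1
    return shown
```

Ads insertion, the exploration slot, and policy filters would be further passes over the same list.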
Step 5 · Train and eval
Pointwise BCE per binary head + MSE / Huber on log-watchtime. Daily retrain from yesterday's impressions joined with engagement labels (24-hour attribution window). Position-as-feature at train time only, set to a constant (e.g., position 0 or 1) at inference (lesson 7). Calibration re-fit hourly because the daily model drifts. Offline: NDCG and AUC per head, calibration error on a held-out hour. Online: A/B on session watch-time and DAU as primary; CUPED variance reduction (lesson 8). Long-term holdout cohort to catch retention regressions a 2-week A/B can't see.
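Of the pieces above, CUPED is the one interviewers most often ask you to write out. A minimal sketch, assuming one pre-experiment covariate per user (for example, the same user's watch time in the weeks before the test) and a single pooled coefficient; the function name is hypothetical.

```python
from statistics import mean
from typing import List


def cuped_adjust(y: List[float], x: List[float]) -> List[float]:
    """Return CUPED-adjusted metric values.

    y: in-experiment metric per user (e.g. session watch time).
    x: pre-experiment covariate for the same users, in the same order.
    theta = cov(x, y) / var(x), estimated on the pooled data passed in.
    The adjustment leaves the mean of y unchanged and removes the part
    of its variance that x explains.
    """
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    theta = cov / var if var > 0 else 0.0
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]
```

In practice theta is estimated once over control and treatment together, and the A/B comparison is then run on the adjusted values.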
Step 6 · Cold start
New users: skip the collaborative tower and retrieve from content-only and trending sources for the first ~10 sessions, then mix in the collaborative source. New videos: compute content embeddings at upload time, push into the ANN index immediately; the exploration source guarantees first impressions so the video has behavioral signal by the time the collaborative retriever needs it.
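The new-user rule amounts to switching the retrieval mix on session count. A sketch under stated assumptions: the percentage splits and source names are invented for illustration, and including the exploration source in the cold mix is a choice made here; only the ~10-session threshold comes from the text.

```python
from typing import Dict


def source_quotas(num_sessions: int, k1: int,
                  warm_after_sessions: int = 10) -> Dict[str, int]:
    """Split the K1 retrieval budget across sources.

    Cold users get no collaborative candidates; warm users get a mix.
    All percentages are hypothetical.
    """
    if num_sessions < warm_after_sessions:
        shares_pct = {"content": 60, "trending": 30, "exploration": 10}
    else:
        shares_pct = {"collaborative": 50, "content": 20, "network": 15,
                      "trending": 10, "exploration": 5}
    return {name: k1 * pct // 100 for name, pct in shares_pct.items()}
```

A hard switch is the simplest version; ramping the collaborative share up gradually is an obvious refinement to mention if asked.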
Worked example 2 — "Design Google Sponsored Search ads"
Step 1 · Clarify
- Scale. Billions of queries per day, with a head-heavy, long-tailed distribution: the top 1% of queries get most of the traffic.
- Latency. ~100 ms end-to-end for the ad system, parallel with organic search.
- Surfaces. Multi-slot — top mainline, bottom mainline, side. Each has different reserves.
- Objective. Three-way: revenue, user click quality (don't show bad ads), advertiser efficiency (advertisers reaching their CPA targets).
Step 2 · The funnel — five stages, not three
Match → pre-rank → rank → auction → pricing.
Step 3 · Each stage in one breath
Match is ads-specific. Three modes. Exact match: query string == bid keyword. Phrase match: query contains the bid keyword. Broad match: semantic — a query-tower and keyword-tower (two-tower again) decide whether the query is close enough to the advertiser's keyword. Broad match is itself a recall problem and uses retrieval-style ML.
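The three match modes can be sketched in a few lines. Assumptions to flag: matching is done on lowercased whitespace tokens, "contains" is read as a contiguous token run, the embeddings are passed in as plain lists standing in for the two towers' outputs, and the 0.8 cosine threshold is a made-up value.

```python
import math
from typing import List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_type(query: str, keyword: str,
               query_vec: List[float], keyword_vec: List[float],
               broad_threshold: float = 0.8) -> Optional[str]:
    """Return 'exact', 'phrase', 'broad', or None if the ad is not eligible."""
    q_tokens = query.lower().split()
    k_tokens = keyword.lower().split()
    if q_tokens == k_tokens:
        return "exact"
    n = len(k_tokens)
    # Phrase match: the keyword appears as a contiguous run inside the query.
    if n and any(q_tokens[i:i + n] == k_tokens
                 for i in range(len(q_tokens) - n + 1)):
        return "phrase"
    # Broad match: semantic closeness in the (hypothetical) two-tower space.
    if cosine(query_vec, keyword_vec) >= broad_threshold:
        return "broad"
    return None
```

At serving scale the broad-match branch would be an ANN lookup over keyword embeddings rather than a pairwise cosine, which is what makes it a recall problem.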
Pre-rank. A cheap CTR model (linear or shallow MLP) scores the eligible 10²–10⁴ ads. Keep the top ~100.
Rank. Full DLRM. Sparse features: query × ad cross, advertiser_id, keyword_id, vertical. Two heads: P(click) and P(conversion | click). Calibration is mandatory — the auction multiplies pCTR by bid, so a calibration error is a revenue or efficiency error of the same size. Position-feature at train time, set to a constant (e.g., position 0 or 1) at inference.
Auction (lesson 9). Rank-score = bid · pCTR · quality_score. GSP or VCG-style depending on surface. Reserve prices per query.
Pricing. Under GSP, the price per click for position k is score_{k+1} / (pCTR_k · quality_score_k), the minimum bid that would have kept you in your slot given Rank-score = bid · pCTR · quality_score. VCG pays the externality you impose.
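Auction and pricing together fit in one short function. This sketch solves bid · pCTR · quality_score = next rank score for the bid, so the per-click price divides the runner-up's score by the winner's own pCTR and quality score. Treating the reserve as a minimum rank score is an assumption made here; the `Ad` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Ad(NamedTuple):
    ad_id: str
    bid: float      # advertiser's maximum cost per click
    pctr: float     # calibrated click probability
    quality: float  # quality score multiplier


def rank_score(ad: Ad) -> float:
    return ad.bid * ad.pctr * ad.quality


def gsp_auction(ads: List[Ad], num_slots: int,
                reserve_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return (ad_id, price_per_click) for each filled slot, best slot first.

    Each winner pays the smallest bid that would still beat the ad ranked
    directly below it (or the reserve, if there is no ad below).
    """
    eligible = sorted(
        (ad for ad in ads if rank_score(ad) > 0 and rank_score(ad) >= reserve_score),
        key=rank_score,
        reverse=True,
    )
    results = []
    for k, ad in enumerate(eligible[:num_slots]):
        if k + 1 < len(eligible):
            next_score = rank_score(eligible[k + 1])
        else:
            next_score = reserve_score
        price = next_score / (ad.pctr * ad.quality)
        results.append((ad.ad_id, price))
    return results
```

Note that the price never exceeds the winner's own bid, because the winner's rank score is at least the next one's.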
Autobidder (lesson 10). In parallel on the advertiser side, a separate system per advertiser account sets the per-impression bid to hit a target CPA or ROAS, using the same calibrated pCTR/pCVR signals.
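The core arithmetic behind a target-CPA bid is one line, shown here as a deliberately simplified sketch. If a fraction pCVR of clicks convert, paying b per click costs b / pCVR per conversion, so setting that equal to the target gives b = target_cpa · pCVR. The `pacing_multiplier` parameter is a hypothetical stand-in for the feedback control a real autobidder would run on top.

```python
def target_cpa_bid(target_cpa: float, p_cvr: float,
                   pacing_multiplier: float = 1.0) -> float:
    """Per-click bid at which expected cost per conversion equals target_cpa.

    pacing_multiplier (hypothetical) lets a controller scale bids up or
    down when realized CPA or spend drifts from the target.
    """
    return target_cpa * p_cvr * pacing_multiplier


def value_per_impression(target_cpa: float, p_ctr: float, p_cvr: float) -> float:
    """Expected value of one impression to the advertiser at the target CPA."""
    return target_cpa * p_ctr * p_cvr
```

This is also why calibration matters twice: a pCVR that runs 10% high makes every bid 10% too high.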
Step 4 · Train and eval
Pointwise BCE for the click head; BCE for the conversion head trained only on clicked impressions (beware selection bias). IPS-debias on position. Daily retrain. Calibration monitoring is the on-call alarm — drift beyond threshold on a major vertical pages someone. Revenue eval: A/B by advertiser segment, advertiser CPA as secondary, user-side click quality (search abandonment, refinement) as a guardrail.
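The calibration alarm can be as simple as comparing summed predictions to summed clicks per segment. A minimal sketch, assuming logged rows of (segment, predicted CTR, clicked) and a hypothetical 5% tolerance; the function names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def calibration_ratio_by_segment(
    rows: Iterable[Tuple[str, float, int]]
) -> Dict[str, float]:
    """Return sum(predicted CTR) / sum(observed clicks) per segment.

    A ratio of 1.0 means the model is calibrated in aggregate on that
    segment; above 1.0 it over-predicts, below 1.0 it under-predicts.
    """
    sums: Dict[str, Tuple[float, int]] = {}
    for segment, pctr, clicked in rows:
        pred, obs = sums.get(segment, (0.0, 0))
        sums[segment] = (pred + pctr, obs + clicked)
    return {
        seg: (pred / obs if obs else float("inf"))
        for seg, (pred, obs) in sums.items()
    }


def drifted_segments(ratios: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Segments whose calibration ratio is further than `tolerance` from 1.0."""
    return sorted(seg for seg, r in ratios.items() if abs(r - 1.0) > tolerance)
```

A production monitor would also require a minimum impression count per segment before alerting, so that small verticals do not page on noise.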
Operational concerns every senior interviewer will probe
| Concern | What to say |
|---|---|
| Latency budgeting | Do the arithmetic out loud. ML budget is a fraction of end-to-end; per-stage budget a fraction of ML budget. Per-item cost × stage-K must fit. (Lesson 1.) |
| Index / model freshness | Item embeddings nightly batch; user embeddings on the fly from last-N actions; model daily retrain; calibration hourly. Pure online learning is rare — stability dominates. |
| Feature stores | Online (Redis-backed, p99 < 5 ms) and offline (BigQuery / HDFS for training). Feature versioning is non-optional. |
| A/B infrastructure | Hash-based assignment on stable id. Hold-out cohorts for long-term. Network-effect mitigation (cluster-randomized or switchback) for marketplaces. (Lesson 8.) |
| Train/serve skew | Same feature code path at train and serve. Log features at serve, replay at train. Mismatch on a single feature = silent regressions you can't diagnose. The bug to mention unprompted. |
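Hash-based assignment from the A/B row is worth being able to write on a whiteboard. A minimal sketch, assuming SHA-256 over the experiment name and a stable user id; salting with the experiment name is the design choice that keeps assignments independent across experiments. The function name and id formats are hypothetical.

```python
import hashlib
from typing import Sequence


def assign_arm(user_id: str, experiment: str,
               arms: Sequence[str] = ("control", "treatment")) -> str:
    """Deterministically map a stable id to an experiment arm.

    The same (user_id, experiment) pair always lands in the same arm,
    with no assignment table to store or look up.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % len(arms)
    return arms[bucket]
```

Usage: `assign_arm("user-42", "reels_ranker_v2")` returns the same arm on every call and on every server.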
The five most common follow-up directions
Whichever product the interviewer names, one of these five follow-ups arrives:
| Question | Where it lives | Headline answer |
|---|---|---|
| "How do you handle a new user with no history?" | Lesson 2 | Content-only retrieval until first N sessions; exploration source; demographic priors; do not use a randomly-initialized user embedding. |
| "Your offline metric improved but A/B is flat." | Lessons 7 + 8 | Position bias on training data, retrieval/ranker mismatch, novelty, or the offline metric isn't measuring what the A/B measures. Walk through which it is. |
| "How do you scale the embedding table to 10B rows?" | Lesson 4 | Hash trick / feature hashing, sparse-aware optimizers (Adagrad on the embedding sub-graph), parameter-server sharding, frequency-based pruning, learned hash buckets for long-tail. |
| "What does the loss look like?" | Lesson 5 | Multi-task: L = Σ_k w_k · BCE_k + w_reg · MSE_reg, possibly with uncertainty-weighting (Kendall et al.) or PCGrad-style gradient surgery if heads conflict. |
| "What's the auction's incentive structure?" | Lessons 9 + 10 | GSP is not truthful; VCG is. Most platforms run GSP for legacy reasons. The autobidder reasons about your bidding strategy on top of the platform's auction. |
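For the "what does the loss look like" follow-up, the formula in the table translates directly into code. This sketch computes the loss for a single example with fixed weights; it assumes the regression head predicts log(1 + watch seconds), in line with the log-watchtime target used in the Reels example, and leaves out uncertainty weighting and gradient surgery. All names are hypothetical.

```python
import math
from typing import Dict


def bce(label: int, prob: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one example, with clipping for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(binary_labels: Dict[str, int],
                   binary_preds: Dict[str, float],
                   head_weights: Dict[str, float],
                   watch_seconds: float,
                   pred_log_watch: float,
                   w_reg: float) -> float:
    """L = sum_k w_k * BCE_k + w_reg * MSE_reg for one example."""
    loss = sum(
        head_weights[k] * bce(binary_labels[k], binary_preds[k])
        for k in head_weights
    )
    # Regression target in log space (assumption): log(1 + watch seconds).
    loss += w_reg * (math.log1p(watch_seconds) - pred_log_watch) ** 2
    return loss
```

The follow-up to expect is how the w_k are chosen; fixed weights like these are the baseline that uncertainty weighting or PCGrad would replace.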
Interactive · latency-budget timing simulator
Lesson 1 said the budget arithmetic is the load-bearing senior signal. This widget makes it concrete. You allocate milliseconds across four pipeline stages out of a 100 ms end-to-end budget; under each stage the widget tells you how many items that budget can score given a representative per-item cost. The diagnosis at the bottom names the bottleneck — the stage that constrains end-to-end K — and tells you whether the funnel's K-values are internally consistent.
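The arithmetic the widget performs can be reproduced in a few lines. This is a sketch of the same idea, not the widget's actual logic: stage names, per-item costs, and K-values are hypothetical, and it assumes items in a stage are scored one after another with no batching or parallelism.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    budget_ms: int    # milliseconds allocated to this stage
    per_item_us: int  # representative cost to score one item, in microseconds
    k_in: int         # number of items the stage is asked to score


def capacity(stage: Stage) -> int:
    """How many items the stage can score inside its budget."""
    return stage.budget_ms * 1000 // stage.per_item_us


def check_funnel(stages: List[Stage], total_budget_ms: int = 100) -> List[str]:
    """Return human-readable problems; an empty list means the funnel fits."""
    problems = []
    spent = sum(s.budget_ms for s in stages)
    if spent > total_budget_ms:
        problems.append(
            f"stages sum to {spent} ms, over the {total_budget_ms} ms budget"
        )
    for s in stages:
        cap = capacity(s)
        if cap < s.k_in:
            problems.append(f"{s.name}: can score {cap} items but receives {s.k_in}")
    return problems
```

With a 40 ms ranking budget and 50 µs per item, the ranker can score 800 items, so a K₁ of 1,000 is inconsistent: either cut K₁, buy a cheaper model, or move budget from another stage.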
What separates senior from junior, distilled
| | Junior answer | Senior answer |
|---|---|---|
| Opening | Starts drawing the funnel immediately. | Asks 3–5 clarifying questions, then draws the funnel. |
| Depth | Equal time on every stage, depth-first. | Skims the funnel, then asks the interviewer which stage to go deep on. |
| Naming | Names DLRM, two-tower, embeddings. Stops. | Names the components and their failure modes: "DLRM with multi-task heads; watch out for calibration drift when one head's loss dominates." |
| Operations | Skipped or mentioned only if asked. | Volunteered. Train/serve skew, monitoring, on-call playbook, A/B power. |
| Trade-offs | Picks one design and defends it. | Names the design and the trade-off it's making, plus the alternative they considered. |
| Offline / online gap | Surprised when asked about it. | Brings it up themselves: "we'd expect offline NDCG to improve more than online watch-time, and here's why." |
Interview prompts you should be ready for
- "Design YouTube Home ranking. Walk me through it end to end." (The canonical Reels-adjacent prompt. Use the Reels worked example above and swap the names. Watch-time, multi-task, the recsys funnel.)
- "Design Google Sponsored Search ads. Where does the auction enter the loop?" (After ranking, before pricing. Be ready to defend GSP vs VCG and to discuss the autobidder running in parallel on the advertiser side.)
- "Your design has a 10 ms p99 budget. Walk through which components fit." (Lesson 1 arithmetic. ANN retrieval fits, single-stage DLRM doesn't, you need to pre-filter aggressively. Show the math.)
- "How would you instrument this system to detect a CTR-prediction drift?" (Calibration monitoring — logged pCTR vs observed CTR, per major segment, with alerts on drift >X%. PSI on input feature distributions. Daily holdout AUC.)
- "You ship a re-ranker that improves diversity but reduces immediate CTR by 0.3%. How do you decide whether to launch?" (Long-term holdout, retention, complaint rate, the diversity-engagement trade-off curve. Not a CTR-only decision.)
- "Walk me through what you'd monitor on day one of launching this system." (Latency p50/p99, calibration error, feature-fetch failures, head-conflict in multi-task loss, top-line metric per segment, on-call playbook for the top three failure modes.)
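For the drift-detection prompt, PSI on input features is the piece most easily shown in code. A minimal sketch of the population stability index over pre-binned counts; the bin edges are assumed to be fixed from a reference window, and the epsilon floor for empty bins is an assumption made here.

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population stability index between two binned distributions.

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the
    fractions of the reference and current samples falling in bin i.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total
```

A common rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, though the alert threshold should be tuned per feature.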
That's the full search/ads/recsys series. The interactive widgets in lessons 1, 5, 7, 8, 9, and this one are the ones worth replaying before an interview — they exercise the arithmetic you'll need to do out loud. Good luck.