System design — the capstone walkthrough

The interviewer says "design Instagram Reels ranking" or "design Google Sponsored Search." This is the answer, with every previous lesson slotted into its place. The latency-budget simulator at the bottom is the arithmetic you should be ready to do out loud.

The interview has a predictable rhythm — ride it

Most system-design interviews follow the same arc, regardless of which product the interviewer names. Senior candidates don't fight it; they ride it, because the rhythm is also a natural way to actually design a system.

Clarify scope. Scale, latency, single- vs multi-task, cold start, what counts as success. Two minutes, five questions.
Define metrics. Top-line online metric, offline proxy, guardrails. One minute.
Draw the funnel. Retrieve → rank → re-rank, with K₁, K₂, K_shown and a per-stage latency budget. Two minutes. This is the diagram you must be able to draw in 30 seconds.
Zoom in where steered. The interviewer will pick a stage: cold start, debiasing, the auction, calibration drift. The bulk of the time, roughly 25 minutes, is spent here.
Failure modes and operations. Train/serve skew, calibration monitoring, A/B test power, on-call playbook. Five minutes, often near the end.

The senior move

After you draw the funnel, ask which piece the interviewer wants to dig into. Don't depth-first into retrieval just because retrieval comes first in the diagram. A junior candidate dumps everything they know; a senior candidate asks where the signal is and aims their depth there.

Worked example 1 — "Design Instagram Reels Home ranking"

Step 1 · Clarify

Five questions, asked out loud:

Scale. Roughly a billion monthly users, on the order of 10⁹ active short videos, ~10² impressions per session.
Latency. End-to-end per-impression on the order of 50–100 ms; the ML budget is maybe 40 ms inside that.
Cold start. Both. New users (no history) and new videos (just uploaded) both matter — Reels grows by both.
Objective. Multi-task. Watch time, like, share, retain. Not just CTR — clickbait is a known failure mode for video.
Constraints. Diversity (no five basketball clips in a row), creator caps, ad slots inserted, policy filters.

Step 2 · Metrics

Layer	Metric	Why
Top-line online	DAU + time-spent + 28-day retention	What the business actually cares about. Slow-moving, noisy, but the truth.
A/B proxy	Session watch-time, completion rate, like-rate	Powered enough to move in a 1–2 week test.
Offline proxy	NDCG / AUC per head, calibration error	What you can compute on yesterday's logs to gate launches.
Guardrails	Skip rate, dislike rate, complaint rate, creator concentration	Things you must not regress on, even if top-line goes up.

Step 3 · Draw the funnel

request (user_id, context, device, geo)
                     │
                     ▼
┌──────────────────────────────────────────┐
│ RETRIEVAL: N≈10⁹ → K₁≈10³                │  ~10 ms
│  • two-tower collab (user × video)       │
│  • two-tower content (text/audio/video)  │
│  • recent-network (your follows)         │
│  • local/trending baseline               │
│  • exploration source                    │
└────────────────────┬─────────────────────┘
                     │ union, dedupe
                     ▼
┌──────────────────────────────────────────┐
│ RANKING: K₁ → K₂≈10²                     │  ~25 ms
│  DLRM / DCN-v2, multi-task heads         │
│  P(watch≥5s), E[wt|impression], P(like), │
│  P(skip), P(share)                       │
│  score = α·P(wt) + β·E[wt] − γ·P(skip)   │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│ RE-RANK: K₂ → K_shown≈25/session         │  ~5 ms
│  diversity, creator caps, ad blending,   │
│  freshness boost, policy filter          │
└──────────────────────────────────────────┘

Step 4 · Each stage in one breath

Retrieve (lessons 2–3). Multiple sources because no single retrieval signal covers the space. A collaborative two-tower (user_id × video_id, trained on watch/like history) for behavioral affinity. A content two-tower (text/audio/video embedding × user-history embedding) for new videos with no engagement signal. A recent-network source pulling videos posted by accounts the user follows. A trending baseline. An exploration source — a small budget for items the system is uncertain about. Each source has its own index (HNSW for the dense towers, inverted index for trending). Union, dedupe, hand the top K₁ to the ranker.
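
The union-and-dedupe step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the production merge: the function name `merge_candidates`, the source names, and the per-source quotas are all hypothetical. It caps each source by quota instead of comparing raw scores, since scores from different retrievers are generally not on a common scale.

```python
from typing import Dict, List, Tuple


def merge_candidates(
    sources: Dict[str, List[Tuple[str, float]]],
    quotas: Dict[str, int],
    k1: int,
) -> List[str]:
    """Union per-source candidates, dedupe by video id, cap the result at k1.

    sources maps a (hypothetical) source name to (video_id, score) pairs.
    Each source contributes at most quotas[name] of its own top-scored items.
    """
    merged: List[str] = []
    seen = set()
    for name, candidates in sources.items():
        quota = quotas.get(name, 0)
        top = sorted(candidates, key=lambda c: c[1], reverse=True)[:quota]
        for video_id, _score in top:
            if video_id not in seen:  # dedupe across sources
                seen.add(video_id)
                merged.append(video_id)
    return merged[:k1]


sources = {
    "collab": [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    "content": [("b", 0.7), ("d", 0.6)],
    "explore": [("e", 0.0)],
}
quotas = {"collab": 2, "content": 2, "explore": 1}
candidates = merge_candidates(sources, quotas, k1=10)
```

In this sketch truncation to K₁ happens after the union, so sources listed later lose candidates first. A real system would probably reserve the exploration budget explicitly.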

Rank (lessons 4–5). A DLRM-class network. Sparse features: user_id, video_id, creator_id, hashtags, audio_id. Dense features: video age, duration, prior engagement rates, geo. User history: last 100 actions, pooled (mean or small transformer). Multi-task heads share the trunk: sigmoid heads per binary signal (P(watch ≥ 5s), P(like), P(skip), P(share)) plus a regression head for E[watchtime]. Final score is a hand-tuned linear combination — policy weights, not learned end-to-end.
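
The hand-tuned combination of head outputs can be written down directly. The sketch below follows the funnel's formula, score = α·P(wt) + β·E[wt] − γ·P(skip); the class name, field names, and default weights are assumptions chosen for illustration, not values from any real system.

```python
from dataclasses import dataclass


@dataclass
class HeadOutputs:
    """Per-item predictions from the multi-task heads (hypothetical field names)."""
    p_watch_5s: float        # P(watch >= 5s)
    expected_watch_s: float  # E[watchtime] in seconds
    p_skip: float            # P(skip)


def policy_score(h: HeadOutputs, alpha: float = 1.0, beta: float = 0.05,
                 gamma: float = 0.5) -> float:
    """Linear policy score; the weights are set by hand, not learned end to end."""
    return alpha * h.p_watch_5s + beta * h.expected_watch_s - gamma * h.p_skip


score = policy_score(HeadOutputs(p_watch_5s=0.8, expected_watch_s=10.0, p_skip=0.2))
```

Keeping the weights outside the model means product policy can change (for example, penalizing skips more heavily) without a retrain.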

Re-rank. Listwise constraints the per-item ranker can't see. Creator caps in the shown window, topic diversity, ads inserted at policy positions, one exploration slot in the first 10, policy filters.
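
One of these listwise constraints, the creator cap, can be illustrated with a greedy pass over the ranker's output. This is a simplified sketch: the function name and the (video_id, creator_id) input shape are assumptions, and a production re-ranker would combine several constraints at once.

```python
from collections import Counter
from typing import List, Tuple


def rerank_with_creator_cap(
    ranked: List[Tuple[str, str]], max_per_creator: int, k_shown: int
) -> List[str]:
    """Greedy re-rank. ranked holds (video_id, creator_id) pairs, best first."""
    shown: List[str] = []
    per_creator: Counter = Counter()
    for video_id, creator_id in ranked:
        if len(shown) >= k_shown:
            break
        if per_creator[creator_id] >= max_per_creator:
            continue  # creator already at the cap in the shown window
        shown.append(video_id)
        per_creator[creator_id] += 1
    return shown


ranked = [("v1", "c1"), ("v2", "c1"), ("v3", "c1"), ("v4", "c2"), ("v5", "c3")]
window = rerank_with_creator_cap(ranked, max_per_creator=2, k_shown=4)
```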

Step 5 · Train and eval

Pointwise BCE per binary head + MSE / Huber on log-watchtime. Daily retrain from yesterday's impressions joined with engagement labels (24-hour attribution window). Position is fed as a feature at train time (the logged slot) and fixed to a constant (e.g., position 0 or 1) at inference, so every candidate is scored as if shown in the same slot (lesson 7). Calibration is re-fit hourly because the daily model drifts between retrains. Offline: NDCG and AUC per head, calibration error on a held-out hour. Online: A/B with session watch-time and DAU as primary metrics, with CUPED variance reduction (lesson 8). A long-term holdout cohort catches retention regressions a 2-week A/B can't see.
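
As a rough illustration of the training objective, here is a per-example version of the loss in plain Python. It is a sketch under assumptions: the head names, the weights, and the use of log(1 + seconds) as the "log-watchtime" target are illustrative choices, and a real trainer would compute this in batches inside a deep-learning framework.

```python
import math
from typing import Dict


def bce(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def huber(error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def multitask_loss(
    pred_probs: Dict[str, float],
    labels: Dict[str, int],
    pred_log_watch: float,
    watch_seconds: float,
    head_weights: Dict[str, float],
    reg_weight: float,
) -> float:
    """Weighted sum of per-head BCE plus Huber on the log-watchtime target."""
    cls = sum(head_weights[h] * bce(pred_probs[h], labels[h]) for h in labels)
    reg = huber(pred_log_watch - math.log1p(watch_seconds))
    return cls + reg_weight * reg


loss = multitask_loss(
    pred_probs={"like": 0.5}, labels={"like": 1},
    pred_log_watch=math.log1p(9.0), watch_seconds=9.0,
    head_weights={"like": 1.0}, reg_weight=0.5,
)
```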

Step 6 · Cold start

New users: skip the collaborative tower, retrieve from content-only and trending until the first ~10 sessions, then mix in the collaborative source. New videos: compute content embeddings at upload time, push into the ANN index immediately; the exploration source guarantees first impressions so the video has behavioral signal by the time the collaborative retriever needs it.
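
The new-user policy amounts to a switch on how many sessions the user has had. The sketch below encodes it as per-source retrieval quotas; the function name, source names, and percentage splits are assumptions for illustration, and only the "about 10 sessions" warm-up comes from the text above.

```python
from typing import Dict


def source_quotas(sessions_seen: int, k1: int = 1000,
                  warmup_sessions: int = 10) -> Dict[str, int]:
    """Split the K1 retrieval budget across sources for a given user."""
    if sessions_seen < warmup_sessions:
        # New user: no collaborative source yet, content and trending only.
        shares = {"content": 60, "trending": 40}
    else:
        # Warmed-up user: mix the collaborative tower in.
        shares = {"collaborative": 50, "content": 30, "trending": 20}
    return {name: k1 * pct // 100 for name, pct in shares.items()}


new_user = source_quotas(sessions_seen=0)
warm_user = source_quotas(sessions_seen=25)
```

A hard switch is the simplest form. Ramping the collaborative share up gradually over the first sessions is a natural refinement.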

Where the conversation lands

At this point the interviewer steers. Common pivots: "go deep on the cold-start exploration policy" (lesson 2), "what about negative sampling for the two-tower" (lesson 6), "how do you debias position bias" (lesson 7), "your offline NDCG improves but A/B is flat — what happened" (lessons 7 + 8). You have a pointer for each.

Worked example 2 — "Design Google Sponsored Search ads"

Step 1 · Clarify

Scale. Billions of queries per day, with a head-heavy, long-tailed distribution: the top 1% of queries get most of the traffic.
Latency. ~100 ms end-to-end for the ad system, parallel with organic search.
Surfaces. Multi-slot — top mainline, bottom mainline, side. Each has different reserves.
Objective. Three-way: revenue, user click quality (don't show bad ads), advertiser efficiency (advertisers reaching their CPA targets).

Step 2 · The funnel — five stages, not three

query
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ MATCH: query → eligible ads                  │
│  exact / phrase / broad (semantic) match     │
│  eligible set: ~10²–10⁴ ads                  │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ PRE-RANK: cheap CTR model                    │
│  keep top ~100                               │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ RANK: full DLRM                              │
│  P(click), P(conv | click), calibrated       │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ AUCTION: GSP / VCG-style                     │
│  score = bid · pCTR · quality                │
│  reserve per query                           │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ PRICING: next-rank score / (pCTR · quality)  │
└──────────────────────────────────────────────┘

Step 3 · Each stage in one breath

Match is ads-specific. Three modes. Exact match: query string == bid keyword. Phrase match: query contains the bid keyword. Broad match: semantic — a query-tower and keyword-tower (two-tower again) decide whether the query is close enough to the advertiser's keyword. Broad match is itself a recall problem and uses retrieval-style ML.
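
The three match modes can be sketched as three predicates. This is a simplification under assumptions: real match types involve normalization, close variants, and negative keywords, and the broad-match embeddings here are just example vectors standing in for the outputs of the query and keyword towers. The function names and the 0.8 similarity threshold are hypothetical.

```python
import math
from typing import List, Sequence


def _tokens(text: str) -> List[str]:
    return text.lower().split()


def exact_match(query: str, keyword: str) -> bool:
    """Query equals the bid keyword after lowercasing and whitespace splitting."""
    return _tokens(query) == _tokens(keyword)


def phrase_match(query: str, keyword: str) -> bool:
    """Query contains the bid keyword as a contiguous token sequence."""
    q, k = _tokens(query), _tokens(keyword)
    if not k or len(k) > len(q):
        return False
    return any(q[i:i + len(k)] == k for i in range(len(q) - len(k) + 1))


def broad_match(query_vec: Sequence[float], keyword_vec: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Semantic match: cosine similarity of the two tower outputs."""
    dot = sum(a * b for a, b in zip(query_vec, keyword_vec))
    norm = math.sqrt(sum(a * a for a in query_vec)) * math.sqrt(
        sum(b * b for b in keyword_vec))
    return norm > 0 and dot / norm >= threshold
```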

Pre-rank. A cheap CTR model (linear or shallow MLP) scores the eligible 10²–10⁴ ads. Keep the top ~100.

Rank. Full DLRM. Sparse features: query × ad cross, advertiser_id, keyword_id, vertical. Two heads: P(click) and P(conversion | click). Calibration is mandatory — the auction multiplies pCTR by bid, so a calibration error is a revenue or efficiency error of the same size. Position-feature at train time, set to a constant (e.g., position 0 or 1) at inference.
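
A cross feature such as query × ad has far too many distinct values for one embedding row each, so a common approach is to hash the pair into a fixed number of buckets. The sketch below is illustrative only: the function names, key format, and bucket counts are assumptions. It uses `hashlib` instead of Python's built-in `hash()` because string hashing is randomized per process by default, which would put train and serve in different buckets.

```python
import hashlib


def hashed_cross(query_id: str, ad_id: str, num_buckets: int) -> int:
    """Map a (query, ad) pair to a stable embedding-table row index."""
    key = f"query={query_id}|ad={ad_id}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


def table_bytes(num_rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Memory for an embedding table of float32 values by default."""
    return num_rows * dim * bytes_per_value


row = hashed_cross("q123", "ad456", num_buckets=1_000_000)
size = table_bytes(num_rows=1_000_000, dim=64)  # 256 MB at float32
```

The bucket count is the knob: fewer buckets means a smaller table but more collisions between unrelated pairs.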

Auction (lesson 9). Rank-score = bid · pCTR · quality_score. GSP or VCG-style depending on surface. Reserve prices per query.

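Pricing. Under GSP, the price per click for position k is score_{k+1} / (pCTR_k · quality_k): the minimum bid that would have kept you in your slot, given rank-score = bid · pCTR · quality_score. VCG instead charges you the externality you impose on the other advertisers.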
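
A small GSP example makes the ranking and pricing rules concrete. This is a minimal sketch with hypothetical names (`Ad`, `run_gsp`, `reserve_score`). Because the rank score here includes the quality term, the sketch divides the next-ranked score by pCTR · quality so that the price comes out in the same units as the bid. The last winner pays against the reserve.

```python
from typing import List, NamedTuple, Tuple


class Ad(NamedTuple):
    ad_id: str
    bid: float      # advertiser's maximum cost per click
    pctr: float     # calibrated predicted click-through rate
    quality: float  # quality score multiplier


def rank_score(ad: Ad) -> float:
    return ad.bid * ad.pctr * ad.quality


def run_gsp(ads: List[Ad], num_slots: int,
            reserve_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return (ad_id, price_per_click) for each filled slot, best slot first."""
    eligible = sorted(
        (ad for ad in ads
         if ad.pctr * ad.quality > 0 and rank_score(ad) >= reserve_score),
        key=rank_score,
        reverse=True,
    )
    results = []
    for k, ad in enumerate(eligible[:num_slots]):
        # Score to beat: the next eligible ad, or the reserve if none is left.
        if k + 1 < len(eligible):
            floor = rank_score(eligible[k + 1])
        else:
            floor = reserve_score
        results.append((ad.ad_id, floor / (ad.pctr * ad.quality)))
    return results


ads = [Ad("A", 2.0, 0.10, 1.0), Ad("B", 1.0, 0.15, 1.0), Ad("C", 3.0, 0.02, 1.0)]
outcome = run_gsp(ads, num_slots=2)
```

With these numbers the rank scores are 0.20, 0.15, and 0.06, so A wins the top slot and pays about 1.50 per click, and B pays about 0.40. Neither pays more than its own bid.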

Autobidder (lesson 10). In parallel on the advertiser side, a separate system per advertiser account sets the per-impression bid to hit a target CPA or ROAS, using the same calibrated pCTR/pCVR signals.
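
The simplest possible target-CPA rule shows how the calibrated signals turn into a bid. This is a deliberately naive sketch with hypothetical function names; it ignores budget pacing, auction dynamics, and feedback control, all of which a real autobidder would need.

```python
def tcpa_max_cpc(target_cpa: float, pcvr: float) -> float:
    """Most one click is worth: target cost per conversion times P(conv | click)."""
    return target_cpa * pcvr


def tcpa_value_per_impression(target_cpa: float, pctr: float, pcvr: float) -> float:
    """Expected value of one impression under the same target."""
    return target_cpa * pctr * pcvr


cpc_bid = tcpa_max_cpc(target_cpa=20.0, pcvr=0.05)
impression_value = tcpa_value_per_impression(target_cpa=20.0, pctr=0.10, pcvr=0.05)
```

If the price paid per click never exceeds this bid and pCVR is calibrated, the expected cost per conversion stays at or below the target. That dependence is why calibration errors show up directly as advertiser CPA misses.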

Step 4 · Train and eval

Pointwise BCE for the click head; BCE for the conversion head trained only on clicked impressions (beware selection bias). IPS-debias on position. Daily retrain. Calibration monitoring is the on-call alarm — drift beyond threshold on a major vertical pages someone. Revenue eval: A/B by advertiser segment, advertiser CPA as secondary, user-side click quality (search abandonment, refinement) as a guardrail.
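
A calibration alarm can be as simple as comparing summed predictions with observed clicks per segment. The sketch below assumes logs grouped by a segment key such as vertical; the function names, the 10% tolerance, and the minimum-click guard are illustrative assumptions, not recommended thresholds.

```python
from typing import Dict, List, Sequence, Tuple


def calibration_ratio(pctrs: Sequence[float], clicks: Sequence[int]) -> float:
    """Predicted clicks over observed clicks; 1.0 means perfectly calibrated."""
    return sum(pctrs) / sum(clicks)


def drifted_segments(
    logs: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    tolerance: float = 0.1,
    min_clicks: int = 5,
) -> List[str]:
    """Segments whose calibration ratio is more than `tolerance` away from 1."""
    alerts = []
    for segment, (pctrs, clicks) in logs.items():
        if sum(clicks) < max(min_clicks, 1):
            continue  # too little data to judge; also avoids dividing by zero
        if abs(calibration_ratio(pctrs, clicks) - 1.0) > tolerance:
            alerts.append(segment)
    return alerts


observed = [1] * 10 + [0] * 90  # 10 clicks in 100 impressions
logs = {
    "travel": ([0.1] * 100, observed),  # predicts ~10 clicks: calibrated
    "retail": ([0.2] * 100, observed),  # predicts ~20 clicks: over-predicting
    "tiny": ([0.5] * 2, [0, 0]),        # skipped: not enough clicks
}
alerts = drifted_segments(logs)
```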

Operational concerns every senior interviewer will probe

Concern	What to say
Latency budgeting	Do the arithmetic out loud. ML budget is a fraction of end-to-end; per-stage budget a fraction of ML budget. Per-item cost × stage-K must fit. (Lesson 1.)
Index / model freshness	Item embeddings nightly batch; user embeddings on the fly from last-N actions; model daily retrain; calibration hourly. Pure online learning is rare — stability dominates.
Feature stores	Online (Redis-backed, p99 < 5 ms) and offline (BigQuery / HDFS for training). Feature versioning is non-optional.
A/B infrastructure	Hash-based assignment on stable id. Hold-out cohorts for long-term. Network-effect mitigation (cluster-randomized or switchback) for marketplaces. (Lesson 8.)
Train/serve skew	Same feature code path at train and serve. Log features at serve, replay at train. Mismatch on a single feature = silent regressions you can't diagnose. The bug to mention unprompted.
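
Hash-based assignment on a stable id, from the A/B infrastructure row, fits in a few lines. This is a sketch with hypothetical names; salting the hash with the experiment name is an assumption added here so that different experiments get independent splits.

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically map a stable id to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


arm = assign_arm("user_42", "reels_ranker_v2")
```

The same user always lands in the same arm for a given experiment, with no assignment table to store or look up at serve time.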

The five most common follow-up directions

Whichever product the interviewer names, one of these five follow-ups arrives:

Question	Where it lives	Headline answer
"How do you handle a new user with no history?"	Lesson 2	Content-only retrieval until first N sessions; exploration source; demographic priors; do not use a randomly-initialized user embedding.
"Your offline metric improved but A/B is flat."	Lessons 7 + 8	Position bias on training data, retrieval/ranker mismatch, novelty, or the offline metric isn't measuring what the A/B measures. Walk through which it is.
"How do you scale the embedding table to 10B rows?"	Lesson 4	Hash trick / feature hashing, sparse-aware optimizers (Adagrad on the embedding sub-graph), parameter-server sharding, frequency-based pruning, learned hash buckets for long-tail.
"What does the loss look like?"	Lesson 5	Multi-task: L = Σ_k w_k · BCE_k + w_reg · MSE_reg, possibly with uncertainty-weighting (Kendall et al.) or PCGrad-style gradient surgery if heads conflict.
"What's the auction's incentive structure?"	Lessons 9 + 10	GSP is not truthful; VCG is. Most platforms run GSP for legacy reasons. The autobidder reasons about your bidding strategy on top of the platform's auction.

Interactive · latency-budget timing simulator

Lesson 1 said the budget arithmetic is the load-bearing senior signal. This widget makes it concrete. You allocate milliseconds across four pipeline stages out of a 100 ms end-to-end budget; under each stage the widget tells you how many items that budget can score given a representative per-item cost. The diagnosis at the bottom names the bottleneck — the stage that constrains end-to-end K — and tells you whether the funnel's K-values are internally consistent.

Allocate your 100 ms budget

Per-item costs are profile-dependent. Load a preset to see Reels-style vs Google-Sponsored-Search-style profiles. The "network + render" remainder is whatever the four stages don't consume.

Default allocation: retrieve 10 ms, rank 25 ms, re-rank 5 ms, auction 0 ms.

Widget outputs: total used (ms), network + render (ms), retrieval K₁ ceiling, ranker K₂ ceiling, re-rank K_shown ceiling, bottleneck stage, and a diagnosis.
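
The same arithmetic can be reproduced offline. The sketch below is not the widget's actual logic: the per-item costs (in microseconds, amortized over batching), the planned stage sizes, and the "headroom" definition of the bottleneck are all assumptions, chosen so that the default allocation of 10 / 25 / 5 ms exactly fits the Reels funnel of K₁ ≈ 10³ and K₂ ≈ 10².

```python
from typing import Dict

# Hypothetical amortized per-item costs in microseconds.
COST_US = {"retrieve": 10, "rank": 25, "rerank": 50}
# Items each stage is expected to handle: K1 returned, K1 scored, K2 re-ranked.
PLANNED = {"retrieve": 1000, "rank": 1000, "rerank": 100}


def simulate_budget(budget_ms: Dict[str, int], cost_us: Dict[str, int],
                    planned_items: Dict[str, int],
                    total_ms: int = 100) -> Dict[str, object]:
    """Per-stage item ceilings, leftover budget, and the tightest stage."""
    used = sum(budget_ms.values())
    ceilings = {s: (budget_ms[s] * 1000) // cost_us[s] for s in cost_us}
    # Headroom below 1.0 means the stage cannot process its planned input.
    headroom = {s: ceilings[s] / planned_items[s] for s in cost_us}
    return {
        "total_used_ms": used,
        "network_render_ms": total_ms - used,
        "ceilings": ceilings,
        "bottleneck": min(headroom, key=headroom.get),
        "consistent": used <= total_ms and all(h >= 1.0 for h in headroom.values()),
    }


default = simulate_budget({"retrieve": 10, "rank": 25, "rerank": 5, "auction": 0},
                          COST_US, PLANNED)
tight = simulate_budget({"retrieve": 10, "rank": 20, "rerank": 5, "auction": 0},
                        COST_US, PLANNED)
```

Under these assumed costs, cutting the ranker from 25 ms to 20 ms drops its ceiling to 800 items, below the 1,000 the retriever hands over, so the ranker becomes the bottleneck and the funnel's K-values stop being consistent.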

What separates senior from junior, distilled

	Junior answer	Senior answer
Opening	Starts drawing the funnel immediately.	Asks 3–5 clarifying questions, then draws the funnel.
Depth	Equal time on every stage, depth-first.	Skims the funnel, then asks the interviewer which stage to go deep on.
Naming	Names DLRM, two-tower, embeddings. Stops.	Names the components and their failure modes: "DLRM with multi-task heads; watch out for calibration drift when one head's loss dominates."
Operations	Skipped or mentioned only if asked.	Volunteered. Train/serve skew, monitoring, on-call playbook, A/B power.
Trade-offs	Picks one design and defends it.	Names the design and the trade-off it's making, plus the alternative they considered.
Offline / online gap	Surprised when asked about it.	Brings it up themselves: "we'd expect offline NDCG to improve more than online watch-time, and here's why."

The single most common failure

Strong technical candidates often blow the system-design round by going depth-first on the first stage they draw. They spend 20 minutes on retrieval, never reach ranking, and the interviewer never gets to ask their planned follow-up. Draw the funnel once, completely, fast, then ask "which piece do you want to go deep on?" That single sentence resets the budget and reads as senior.

Interview prompts you should be ready for

"Design YouTube Home ranking. Walk me through it end to end." (The canonical Reels/Reels-adjacent prompt. Use the worked example above, swap names. Watch-time, multi-task, the recsys funnel.)
"Design Google Sponsored Search ads. Where does the auction enter the loop?" (After ranking, before pricing. Be ready to defend GSP vs VCG and to discuss the autobidder running in parallel on the advertiser side.)
"Your design has a 10 ms p99 budget. Walk through which components fit." (Lesson 1 arithmetic. ANN retrieval fits, single-stage DLRM doesn't, you need to pre-filter aggressively. Show the math.)
"How would you instrument this system to detect a CTR-prediction drift?" (Calibration monitoring — logged pCTR vs observed CTR, per major segment, with alerts on drift >X%. PSI on input feature distributions. Daily holdout AUC.)
"You ship a re-ranker that improves diversity but reduces immediate CTR by 0.3%. How do you decide whether to launch?" (Long-term holdout, retention, complaint rate, the diversity-engagement trade-off curve. Not a CTR-only decision.)
"Walk me through what you'd monitor on day one of launching this system." (Latency p50/p99, calibration error, feature-fetch failures, head-conflict in multi-task loss, top-line metric per segment, on-call playbook for the top three failure modes.)

Takeaway

The system-design round is a rhythm — clarify, metrics, funnel, deep-dive, operations. Senior candidates ride it, ask which stage to go deep on, and volunteer the things juniors wait to be asked about (calibration, train/serve skew, position bias, the offline/online gap, freshness, cold start). Every previous lesson in this series is one such "thing to mention." When you can draw the funnel in 30 seconds and have a one-sentence answer to each of the five most-likely follow-ups, you're ready.

That's the full search/ads/recsys series. The interactive widgets in lessons 1, 5, 7, 8, 9, and this one are the ones worth replaying before an interview — they exercise the arithmetic you'll need to do out loud. Good luck.