The ranking model — LR to DLRM to DCN-v2

Retrieval handed you K₁ candidates. The ranker decides which dozen a human sees. Fifteen years of architecture history, compressed.

What the ranker actually eats

The shaping constraint: a production ranker takes (user, item, context) and must emit a calibrated score in O(100 µs). Inside the triple, features come in four shapes, each handled by a different sub-network. Confusing the shapes is the most common junior error.

Feature kind	Examples	Cardinality / shape	Handled by
Sparse categorical	user_id, item_id, ad_id, query token, geo, device_model	10⁶ – 10⁹ unique values, one ID per slot	Embedding table lookup. One row per ID (often after hashing).
Sparse multi-valued	last 100 clicked item_ids, queries in session, watched videos	Variable-length list of IDs per row	Lookup each ID, then pool (sum/mean) or attend (transformer over history).
Dense numerical	item age (days), price, ctr_30d, user account age	Scalars in ℝ	Standardize / bucket / log-transform, then feed an MLP.
Cross / context	query × ad cosine, time-of-day, hour-of-week, slot position	Mostly derived at scoring time	Either pre-computed feature, or implicit via the cross-network.

An interviewer who asks "how do you featurize a user" is asking whether you know those four shapes exist and enter the network through different doors.
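A minimal sketch of the four doors, assuming toy 8-row embedding tables and modulo hashing. Every name here (`make_table`, `featurize`, the table sizes) is hypothetical and chosen for illustration; a real ranker would use learned tables and a proper hash.

```python
import math

# Hypothetical toy sizes: production tables hold 10^6 to 10^9 rows.
EMB_DIM = 4
NUM_BUCKETS = 8

def make_table(num_rows, dim, seed):
    """Deterministic toy embedding table (stand-in for learned weights)."""
    return [[math.sin(seed + r * dim + c) for c in range(dim)] for r in range(num_rows)]

user_table = make_table(NUM_BUCKETS, EMB_DIM, seed=1)
item_table = make_table(NUM_BUCKETS, EMB_DIM, seed=2)

def lookup(table, raw_id):
    """Sparse categorical: one ID per slot -> one row (modulo as a toy hash)."""
    return table[raw_id % len(table)]

def mean_pool(table, raw_ids):
    """Sparse multi-valued: look up each ID, then mean-pool to a fixed size."""
    rows = [lookup(table, i) for i in raw_ids]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(col) / len(rows) for col in zip(*rows)]

def dense_transform(price, item_age_days):
    """Dense numerical: log-transform heavy-tailed scalars before the MLP."""
    return [math.log1p(price), math.log1p(item_age_days)]

def cosine(u, v):
    """Cross / context: a feature derived at scoring time."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def featurize(user_id, item_id, clicked_item_ids, price, item_age_days):
    """Each feature shape enters through its own door, then everything is concatenated."""
    user_vec = lookup(user_table, user_id)
    item_vec = lookup(item_table, item_id)
    history_vec = mean_pool(item_table, clicked_item_ids)
    dense_vec = dense_transform(price, item_age_days)
    cross = [cosine(history_vec, item_vec)]
    return user_vec + item_vec + history_vec + dense_vec + cross

x = featurize(user_id=12345, item_id=42, clicked_item_ids=[7, 42, 99],
              price=19.99, item_age_days=3)
```

The output is a flat vector of length 4 + 4 + 4 + 2 + 1 = 15. The variable-length history collapses to a fixed size only because of the pooling step.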

The embedding table is the model

A hashed user_id space of 10⁸ at embedding dim 64 is 6.4 × 10⁹ parameters. In one table. A production ranker has tens of such tables. The MLP on top is maybe a million parameters, roughly four orders of magnitude smaller than that single table and closer to five below the sparse total.

TYPICAL RANKER PARAMETER BUDGET
──────────────────────────────────────────────────────────
user_id table        : 10⁸ rows × 64 = 6.4 B params
item_id table        : 10⁹ rows × 32 = 32 B params
query_token table    : 10⁶ rows × 64 = 64 M params
ad_creative_id       : 10⁸ rows × 32 = 3.2 B params
geo / device / …     : 10⁴ rows × 16 = ~M params
──────────────────────────────────────────────────────────
TOTAL SPARSE         : O(10¹⁰ – 10¹¹) params
MLP / top of model   : O(10⁶) params   ← rounding error
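The arithmetic in the budget can be checked in a few lines. This sketch uses the four large tables only; the fp32 storage figure is an assumption (4 bytes per weight, no optimizer state), and the MLP size is the lesson's order-of-magnitude figure, not a measurement.

```python
# Rows and embedding dims copied from the budget above.
tables = {
    "user_id": (10**8, 64),
    "item_id": (10**9, 32),
    "query_token": (10**6, 64),
    "ad_creative_id": (10**8, 32),
}
MLP_PARAMS = 10**6       # order-of-magnitude figure from the text
BYTES_PER_PARAM = 4      # assumption: fp32 weights, optimizer state excluded

sparse_params = {name: rows * dim for name, (rows, dim) in tables.items()}
total_sparse = sum(sparse_params.values())
total_bytes = total_sparse * BYTES_PER_PARAM
ratio = total_sparse / MLP_PARAMS

for name, n in sparse_params.items():
    print(f"{name:16s} {n / 1e9:8.3f} B params")
print(f"total sparse     {total_sparse / 1e9:8.3f} B params, "
      f"{total_bytes / 1e9:.0f} GB in fp32")
print(f"sparse / MLP     {ratio:,.0f}x")
```

Under these assumptions the sparse side comes to about 41.7 B parameters, or roughly 167 GB in fp32, against a dense network that is tens of thousands of times smaller.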

This drives the production-systems story. The MLP fits on one GPU; the embeddings don't fit on a rack. Hence: hashing tricks, model-parallel tables, sharded parameter servers, hot-row caching — covered in the system_ml folder. For this lesson, the imbalance is the headline fact.

The one-line summary you should be able to give

A modern ranker is an enormous sparse embedding store with a small dense network on top. The embeddings are where the knowledge lives; the network is where the interactions get computed.

The architecture timeline

Logistic regression (~2010)

One scalar weight per feature, one bias, sigmoid: p(click | x) = σ(w · x + b). "Features" are one-hot encodings of every sparse ID and bucketed dense values. Interactions — e.g., the (query="running shoes", ad_id=42) pair being high-CTR even though query and ad in isolation aren't — must be hand-engineered: query_token × ad_id becomes a new sparse field. The number of these crosses explodes combinatorially.
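A minimal sketch of LR over one-hot features with a hand-written cross. With one-hot inputs, w · x reduces to summing the weights of the active features. The weights and the feature-naming scheme (`query_x_ad=...`) are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(active_features, weights, bias):
    """p(click | x) for one-hot features: w . x is a sum over active features."""
    z = bias + sum(weights.get(f, 0.0) for f in active_features)
    return sigmoid(z)

def with_cross(query_token, ad_id):
    """Feature list including the hand-engineered query_token x ad_id cross."""
    return [
        f"query={query_token}",
        f"ad_id={ad_id}",
        f"query_x_ad={query_token}|{ad_id}",
    ]

# Hypothetical weights: neither side is strong alone, but the pair is.
weights = {
    "query=running shoes": 0.1,
    "ad_id=42": 0.1,
    "query_x_ad=running shoes|42": 2.0,
}
bias = -3.0

p_pair = lr_predict(with_cross("running shoes", 42), weights, bias)
p_other = lr_predict(with_cross("running shoes", 7), weights, bias)
```

The model can only score the pair highly because someone created the `query_x_ad` field. Any cross nobody wrote down has weight zero, which is the feature-engineering load in one line.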

LR is interpretable, trivially parallelizable (FTRL, online learning), and almost free to serve. It defined the "feature-engineering era" of Big Tech ads ML from roughly 2008–2015 — engineers writing cross-feature recipes.

Factorization Machines (Rendle 2010) and FFM (Juan 2016)

The first move toward learned crosses. FM keeps the linear term, then adds explicit pairwise interactions via latent vectors:

ŷ(x) = w₀ + Σᵢ wᵢ xᵢ + Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢ xⱼ

Each feature gets a k-dim latent; their interaction is the dot product. You don't enumerate O(d²) crosses — you learn them in O(d k) parameters. FFM gives each feature a different latent per "field" (interactions can differ across query vs ad). FM/FFM dominated the Criteo/Avazu Kaggle era.
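The pairwise term also never has to be computed pair by pair. The sketch below checks the standard reformulation from the FM paper, 0.5 · Σ_f [(Σᵢ v_if xᵢ)² − Σᵢ (v_if xᵢ)²], against brute-force enumeration on toy values. All numbers are made up.

```python
def fm_pairwise_naive(x, V):
    """Sum over i<j of <v_i, v_j> x_i x_j, enumerating all O(d^2) pairs."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity in O(d k) time using the square-of-sum identity."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total

def fm_predict(x, w0, w, V):
    """Bias + linear term + pairwise term, matching the formula above."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return linear + fm_pairwise_fast(x, V)

# Toy values: d = 4 features, k = 2 latent dims.
x = [1.0, 0.0, 1.0, 1.0]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]
w0, w = 0.1, [0.2, -0.3, 0.0, 0.4]
y_hat = fm_predict(x, w0, w, V)
```

Both functions return 0.01 for the pairwise term here, so `y_hat` is 0.71. Parameter count is d·k for `V` plus d + 1 for the linear part, regardless of how many pairs exist.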

Wide & Deep (Cheng et al. 2016, Google Play)

The template every subsequent architecture inherits: a wide LR head plus a deep MLP head, summed at logit. p = σ(w_wide · x_wide + MLP_deep(x_deep) + b).

Wide handles memorization: hand-crafted crosses, exact (user, item) co-occurrence.
Deep handles generalization: embeddings → MLP, so similar users see similar items even without co-occurrence.

This decomposition — explicit crosses for memorization + neural net for generalization — is the pattern DLRM and DCN both inherit. The idea from Wide & Deep that survived is the hybrid itself.
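A minimal sketch of the summed-logit structure, assuming a one-hidden-layer deep side and a dictionary of cross weights for the wide side. The parameters and the cross-feature names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(v):
    return [max(0.0, a) for a in v]

def dense_layer(x, W, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def wide_logit(active_crosses, w_wide):
    """Memorization: a linear model over sparse hand-crafted cross features."""
    return sum(w_wide.get(f, 0.0) for f in active_crosses)

def deep_logit(x_deep, hidden, out):
    """Generalization: concatenated embeddings -> one hidden layer -> scalar."""
    h = relu(dense_layer(x_deep, *hidden))
    return dense_layer(h, *out)[0]

def wide_and_deep(active_crosses, x_deep, w_wide, hidden, out, b):
    """Both heads contribute to one logit before the sigmoid."""
    z = wide_logit(active_crosses, w_wide) + deep_logit(x_deep, hidden, out) + b
    return sigmoid(z)

# Hypothetical toy parameters.
w_wide = {"installed=app_a&impression=app_b": 1.5}
hidden = ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1])   # 2 -> 2
out = ([[1.0, -1.0]], [0.0])                         # 2 -> 1
x_deep = [0.2, 0.4]  # stand-in for concatenated embeddings

p_seen = wide_and_deep(["installed=app_a&impression=app_b"],
                       x_deep, w_wide, hidden, out, b=-1.0)
p_unseen = wide_and_deep(["installed=app_c&impression=app_b"],
                         x_deep, w_wide, hidden, out, b=-1.0)
```

With identical deep inputs, the only difference between the two scores is the memorized cross weight. An unseen cross falls back to whatever the deep side generalizes.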

DLRM (Naumov et al. 2019, Meta)

The production-canonical Meta-style ranker. Forward pass:

sparse_1 → emb_1 ─┐
sparse_2 → emb_2 ─┤
...               ├─► pairwise dot products ─► TOP MLP ─► logit
sparse_K → emb_K ─┤                              ▲
                  │                              │
dense → BOTTOM MLP┴──────────────────────────────┘

Note: the dense bottom MLP's output is treated as an additional "embedding" that participates in the pairwise dot products alongside the sparse-feature embeddings; it ALSO skip-connects directly to the top MLP.

Three pieces: bottom MLP projects dense features to embedding dim; pairwise dot products ⟨eᵢ, eⱼ⟩ compute explicit second-order interactions (the deep analogue of FM's pairwise term); top MLP eats the dot-product vector + dense bottom output and emits the logit.

DLRM's architectural commitment is that second-order interactions matter and should be computed explicitly. The top MLP can still learn higher-order effects, but order-2 is given for free.
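A minimal sketch of that forward pass, assuming single-layer bottom and top networks and tiny dimensions so the numbers can be checked by hand. This is an illustration of the structure, not the reference DLRM implementation; all weights are made up.

```python
import math

def dense_layer(x, W, b, activate=True):
    """y = W x + b with optional ReLU; W is a list of rows."""
    y = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [max(0.0, v) for v in y] if activate else y

def pairwise_dots(vectors):
    """All <e_i, e_j> for i < j: K vectors -> K(K-1)/2 scalars."""
    out = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            out.append(sum(a * b for a, b in zip(vectors[i], vectors[j])))
    return out

def dlrm_forward(sparse_embs, dense_x, bottom, top):
    # Bottom MLP projects dense features to the embedding dim.
    dense_emb = dense_layer(dense_x, *bottom)
    # The dense output joins the sparse embeddings in the interaction.
    dots = pairwise_dots(sparse_embs + [dense_emb])
    # Top network sees the dots plus the skip-connected dense output.
    logit = dense_layer(dots + dense_emb, *top, activate=False)[0]
    return 1.0 / (1.0 + math.exp(-logit)), dots

# Toy shapes: 2 sparse embeddings of dim 2, 3 dense features.
sparse_embs = [[1.0, 0.0], [0.5, 0.5]]
dense_x = [1.0, 2.0, 3.0]
bottom = ([[0.1, 0.0, 0.0], [0.0, 0.1, 0.1]], [0.0, 0.0])  # 3 -> 2
top = ([[1.0, 1.0, 1.0, 0.0, 0.0]], [0.0])                 # 3 dots + 2 dense -> 1
p, dots = dlrm_forward(sparse_embs, dense_x, bottom, top)
```

Three vectors enter the interaction (two sparse, one dense), so there are 3 dot products. With K vectors the count is K(K−1)/2, which is where the O(K²) inference cost comes from.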

DCN / DCN-v2 (Wang et al. 2017 / 2021)

DCN replaces "all pairwise dots" with a stack of "cross layers" that learn higher-order interactions efficiently. The DCN-v2 cross layer:

x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l

Each layer multiplies x₀ elementwise with a learned linear projection of x_l, then adds the residual. Stacking L layers expresses interactions up to order L+1 — with only L matrices of parameters, not O(d^{L+1}). DCN-v1 used a rank-1 vector form; v2's matrix form is more expressive and what production deploys.
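The order claim can be verified on a two-dimensional toy. In this sketch W is chosen (as an assumption, purely for readability) to swap the two coordinates, with zero bias, so the polynomial terms are easy to track by hand.

```python
def cross_layer(x0, xl, W, b):
    """x_{l+1} = x0 * (W xl + b) + xl, with * elementwise."""
    proj = [sum(wij * xj for wij, xj in zip(row, xl)) + bi for row, bi in zip(W, b)]
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, proj, xl)]

def cross_network(x0, layers):
    """Stack L cross layers; every layer re-multiplies by the same x0."""
    x = x0
    for W, b in layers:
        x = cross_layer(x0, x, W, b)
    return x

# Toy check with d = 2 and W chosen to swap coordinates.
x0 = [2.0, 3.0]
swap = [[0.0, 1.0], [1.0, 0.0]]
zero_b = [0.0, 0.0]

one_layer = cross_network(x0, [(swap, zero_b)])
two_layers = cross_network(x0, [(swap, zero_b), (swap, zero_b)])
```

Writing x₀ = (a, b): one layer gives (ab + a, ab + b), which contains the order-2 term ab. Two layers give (a²b + 2ab + a, ab² + 2ab + b), which contains order-3 terms. With a = 2, b = 3 those evaluate to [8, 9] and [26, 33].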

DCN-v2 is usually paired with a parallel deep MLP, again echoing Wide & Deep: an explicit-cross tower next to an implicit MLP tower.

Why GBDTs lost

From roughly 2010–2016, gradient-boosted trees (XGBoost, LightGBM) were the strongest off-the-shelf CTR model on tabular data. Then they stopped winning, and the reason is structural, not fashion-driven.

	GBDT	Deep ranker (DLRM / DCN)
Sparse high-cardinality features	Painful — can't split on 10⁹ user_ids; you bucket or target-encode and lose information.	Native — embedding tables give every ID a learned vector.
Parameter sharing across users	None — every leaf is its own decision.	Two users with similar histories share gradient signal through shared embeddings.
Multi-task heads	One tree per task; embeddings not shared.	Shared trunk + per-task heads; sample-efficient.
Sequence features (user history)	Flatten or hand-aggregate.	Pool / attend natively.
Strengths that remain	Small dense tabular problems, fast iteration, no GPU.	Anything with sparse-categorical + multi-task + sequence.

The transitional architecture was Facebook's 2014 "Practical Lessons" paper: a GBDT extracts leaf-index features that feed an LR head. It ran in production for a few years. Once embedding tables made sparse IDs first-class, the GBDT step became dead weight.

What "feature engineering" means now

Feature engineering didn't go away with deep learning — it moved. You no longer write query_token × ad_id by hand; the network learns it. But you do decide:

Vocabulary and hash sizes. A 10⁸ user_id space hashed into 10⁶ buckets averages 100 IDs per bucket, so every user, popular or not, shares an embedding row with strangers.
Embedding dimensions per field. Geo with 200 values wants dim 8; user_id with 10⁹ values can carry dim 128. Uniform dims waste parameters.
Which user history to surface. Clicks? Impressions? Watched-to-completion? Each is a different multi-valued feature.
Bucketing dense features. Raw price is fine; log(1+price) in 32 quantiles is usually better — MLPs learn monotonic responses faster from buckets.
Position as a side input. Position is a feature at training time (it causes clicks). It is not a feature at inference (you're deciding it). Lesson 7.
Defining the label. Most consequential of all — lesson 5.
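Two of the decisions above, hash size and dense bucketing, come down to a few lines of arithmetic. This is a minimal sketch: `zlib.crc32` stands in for whatever hash a production system uses, the price sample is made up, and 4 bins are used instead of 32 so the result fits on one line.

```python
import bisect
import math
import zlib

def hash_bucket(raw_id, num_buckets):
    """Stable hash of an ID into [0, num_buckets). crc32 is used only because
    it is deterministic across processes, unlike Python's built-in hash()."""
    return zlib.crc32(str(raw_id).encode("utf-8")) % num_buckets

def expected_ids_per_bucket(vocab_size, num_buckets):
    """Average number of distinct IDs sharing one embedding row."""
    return vocab_size / num_buckets

def quantile_edges(values, num_bins):
    """Interior bin edges at equal-frequency quantiles of log(1 + v)."""
    logged = sorted(math.log1p(v) for v in values)
    return [logged[(len(logged) * q) // num_bins] for q in range(1, num_bins)]

def bucketize(value, edges):
    """Bucket index in [0, len(edges)] for a raw dense value."""
    return bisect.bisect_right(edges, math.log1p(value))

load = expected_ids_per_bucket(10**8, 10**6)

prices = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]  # made-up training sample
edges = quantile_edges(prices, num_bins=4)
bins = [bucketize(p, edges) for p in prices]
```

`load` is 100.0, the average number of IDs per row in the 10⁸-into-10⁶ example. The twelve prices land three per bin even though the raw values span two orders of magnitude, which is the point of quantile bucketing. The bucket index would then feed an embedding lookup or a one-hot input.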

Trade-off table

	Capacity	Train cost	Inference cost	Interpretability	Feature-eng load
LR	Linear only	Trivial	Trivial	High — per-feature weights	Very high — all crosses by hand
FM / FFM	Linear + 2nd order	Low	Low	Medium	Medium
Wide & Deep	LR + MLP	Medium	Medium	Low	Medium — wide side still hand-crossed
DLRM	Explicit 2nd order + MLP	High (sparse-heavy)	High — pairwise dots are O(K²)	Low	Low — model learns crosses
DCN-v2	Learned order ≤ L+1	High	Medium — L cross layers	Low	Low

DCN-v2 vs DLRM is the live debate in current production. DCN-v2 often matches or modestly beats DLRM on the same feature set; gains are dataset-dependent and many production teams report parity. DLRM is older, simpler, and easier to scale on the embedding side. Many teams run both: DCN cross-stack and DLRM dots, concatenated before the top MLP.

Interactive · feature-importance composer

Toggle features. The widget shows an expressiveness bar and a heuristic ΔAUC vs LR (toy — directionally correct, not literal). The diagnosis calls out the failure mode of each combination.

Compose a ranker feature set

All toggles default to "on" — that's a reasonable production ranker. Turn things off and read what breaks. The diagnoses are the interesting part; the numbers are toy.

Toggles:
sparse user_id embedding
sparse item_id / ad_id embedding
user history (pooled clicked-item embeddings)
query × item cross (cosine + learned)
dense item features (price, age, ctr_30d)
position feature at TRAINING time
position feature at INFERENCE time

Widget readouts: expressiveness, ΔAUC vs LR baseline (toy), personalization, verdict, reading.

Interview prompts you should be ready for

"Walk me through DLRM's forward pass." (Sparse → embeddings; dense → bottom MLP; all pairwise dot products; concat with dense bottom; top MLP → logit. Name the three sub-networks.)
"Why do production rankers have an embedding table with billions of parameters but an MLP with maybe a million?" (Capacity-per-parameter argument: each sparse ID needs its own row to be distinguishable; the MLP is shared across all (user, item) pairs and gets reused on every example.)
"Why use DCN-v2 instead of DLRM?" (Probes: DLRM hardcodes order-2 interactions via dot products and leaves anything higher to the top MLP, which has to learn it implicitly; DCN-v2 cross layers learn order ≤ L+1 explicitly with a stack of L matrices. If higher-order effects matter, and they usually do, DCN-v2 expresses them directly instead of relying on the MLP to find them.)
"Your model has 6B parameters and 99% of them are in one user_id table. How do you train this on 8 GPUs?" (Embedding model-parallelism: shard the table by ID across GPUs, all-to-all the looked-up rows, then data-parallel the dense top. This is the canonical TorchRec / FBGEMM / "DLRM-style" answer.)
"Embed all features into 64-dim, or use per-feature dim? Trade-offs?" (Uniform dim is simpler and lets you concat freely; per-feature dim saves parameters on low-cardinality fields and lets high-cardinality fields breathe. Production systems typically use per-feature dims.)
"When would you keep an LR baseline alive in 2025?" (Three reasons: as a sanity-check during ranker rollouts; for cold-start segments where the deep model has no signal; for explainability in policy/legal contexts. Plus: LR is what you fall back to during a deep-model serving outage.)
"You add 100 new sparse features to a DLRM. AUC barely moves. Why?" (Several real causes: the embedding tables are under-trained for the new fields, hashing collisions, embedding dim too small, pairwise dots now have 100 noisy terms swamping the signal, or the features are correlated with existing ones.)

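The sharding answer above can be simulated without any GPUs. This is a pure-Python sketch, not TorchRec or FBGEMM code: "ranks" are list indices, the all-to-all exchanges are plain loops, and the ownership rule (row r lives on rank r mod world size) is an assumption chosen for simplicity.

```python
WORLD_SIZE = 4   # stand-in for GPUs; the prompt uses 8
NUM_ROWS = 16
DIM = 2

# Reference unsharded table, used only to check the result.
full_table = [[float(r), float(r) * 0.5] for r in range(NUM_ROWS)]

def owner(row_id):
    """Row-wise sharding rule (an assumption): row r lives on rank r % WORLD_SIZE."""
    return row_id % WORLD_SIZE

# Each rank stores only the rows it owns.
shards = [
    {r: full_table[r] for r in range(NUM_ROWS) if owner(r) == rank}
    for rank in range(WORLD_SIZE)
]

def sharded_lookup(batches):
    """batches[rank] = IDs in that rank's data-parallel mini-batch."""
    # Step 1 (first all-to-all): each rank sends every ID to the rank owning it.
    inbox = [[] for _ in range(WORLD_SIZE)]
    for src, ids in enumerate(batches):
        for pos, row_id in enumerate(ids):
            inbox[owner(row_id)].append((src, pos, row_id))
    # Step 2: owners look up rows in their local shard.
    # Step 3 (second all-to-all): rows travel back to the requesting rank.
    results = [[None] * len(ids) for ids in batches]
    for rank, requests in enumerate(inbox):
        for src, pos, row_id in requests:
            results[src][pos] = shards[rank][row_id]
    return results

batches = [[3, 7], [0, 15], [8, 8], [5, 2]]
looked_up = sharded_lookup(batches)
expected = [[full_table[i] for i in ids] for ids in batches]
```

After the lookup every rank holds the embedding rows for its own mini-batch, so the dense network can run data-parallel as usual. No rank ever holds more than NUM_ROWS / WORLD_SIZE rows of the table.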
Takeaway

The ranker is an embedding store with a small network on top. Architecture history (LR → FM → Wide&Deep → DLRM → DCN-v2) is the story of moving feature crosses from hand-engineered to learned. The interview signal: be able to name what each architecture commits to (LR commits to linearity; DLRM commits to explicit order-2 dots; DCN-v2 commits to a parameterized cross stack) and what that buys or costs.