The ranking model — LR to DLRM to DCN-v2
Retrieval handed you K₁ candidates. The ranker decides which dozen a human sees. Fifteen years of architecture history, compressed.
What the ranker actually eats
The shaping constraint: a production ranker takes (user, item, context) and must emit a calibrated score in O(100 µs). Inside the triple, features come in four shapes, each handled by a different sub-network. Confusing the shapes is the most common junior error.
| Feature kind | Examples | Cardinality / shape | Handled by |
|---|---|---|---|
| Sparse categorical | user_id, item_id, ad_id, query token, geo, device_model | 10⁶ – 10⁹ unique values, one ID per slot | Embedding table lookup. One row per ID (often after hashing). |
| Sparse multi-valued | last 100 clicked item_ids, queries in session, watched videos | Variable-length list of IDs per row | Lookup each ID, then pool (sum/mean) or attend (transformer over history). |
| Dense numerical | item age (days), price, ctr_30d, user account age | Scalars in ℝ | Standardize / bucket / log-transform, then feed an MLP. |
| Cross / context | query × ad cosine, time-of-day, hour-of-week, slot position | Mostly derived at scoring time | Either pre-computed feature, or implicit via the cross-network. |
An interviewer who asks "how do you featurize a user" is asking whether you know those four shapes exist and enter the network through different doors.
The embedding table is the model
A hashed user_id space of 10⁸ at embedding dim 64 is 6.4 × 10⁹ parameters. In one table. A production ranker has tens of such tables. The MLP on top is maybe a million parameters, roughly four orders of magnitude smaller than that single table and further behind once the tables are summed.
This drives the production-systems story. The MLP fits on one GPU; the embeddings don't fit on a rack. Hence: hashing tricks, model-parallel tables, sharded parameter servers, hot-row caching — covered in the system_ml folder. For this lesson, the imbalance is the headline fact.
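The imbalance is easy to check with back-of-envelope arithmetic. The sketch below uses the table size from the text; the MLP layer widths and the fp32 storage assumption are hypothetical, chosen only to land near "a million parameters".

```python
def mlp_param_count(layer_widths):
    """Weights plus biases for a fully connected stack."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths, layer_widths[1:]))

hashed_ids = 10**8        # hashed user_id space from the text
embedding_dim = 64
table_params = hashed_ids * embedding_dim

# Assumed top-MLP shape, for illustration only.
mlp_params = mlp_param_count([512, 1024, 512, 256, 1])

bytes_per_param = 4       # assuming fp32 storage
table_gb = table_params * bytes_per_param / 1e9

print(f"one embedding table: {table_params:,} params ({table_gb:.1f} GB)")
print(f"top MLP:             {mlp_params:,} params")
print(f"ratio:               {table_params / mlp_params:,.0f}x")
```

Under these assumptions a single table already needs tens of gigabytes before optimizer state, which is why the tables, not the MLP, drive the sharding design.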
The architecture timeline
Logistic regression (~2010)
One scalar weight per feature, one bias, sigmoid: p(click | x) = σ(w · x + b). "Features" are one-hot encodings of every sparse ID and bucketed dense values. Interactions — e.g., the (query="running shoes", ad_id=42) pair being high-CTR even though query and ad in isolation aren't — must be hand-engineered: query_token × ad_id becomes a new sparse field. The number of these crosses explodes combinatorially.
LR is interpretable, trivially parallelizable (FTRL, online learning), and almost free to serve. It defined the "feature-engineering era" of Big Tech ads ML from roughly 2008–2015 — engineers writing cross-feature recipes.
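A minimal sketch of what a hand-engineered cross looks like in practice, assuming one-hot features stored as strings. The weights and the bias are made-up numbers for illustration, not values from any real system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_score(active_features, weights, bias):
    """With one-hot inputs, w · x is just the sum of weights of active features."""
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in active_features))

# Hypothetical learned weights.
weights = {
    "query=running shoes": 0.1,
    "ad_id=42": -0.2,
    "query=running shoes|ad_id=42": 2.0,   # hand-engineered cross field
}
bias = -3.0

base = ["query=running shoes", "ad_id=42"]
crossed = base + ["|".join(base)]          # the cross becomes one more sparse ID

p_without_cross = lr_score(base, weights, bias)
p_with_cross = lr_score(crossed, weights, bias)
print(p_without_cross, p_with_cross)
```

Without the cross field the model can only add the two marginal effects; the pair-specific lift exists only because someone wrote the recipe that emits the joined ID.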
Factorization Machines (Rendle 2010) and FFM (Juan et al. 2016)
The first move toward learned crosses. FM keeps the linear term, then adds explicit pairwise interactions via latent vectors:
ŷ(x) = w₀ + Σᵢ wᵢ xᵢ + Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢ xⱼ
Each feature gets a k-dim latent; their interaction is the dot product. You don't enumerate O(d²) cross weights; you learn them through O(d k) parameters. FFM gives each feature a separate latent for every field it can interact with, so a query token can use one vector against ad features and another against user features. FM/FFM dominated the Criteo/Avazu Kaggle era.
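The pairwise term can also be evaluated in O(d k) time using the standard identity Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢ xⱼ = ½ Σ_f [(Σᵢ v_if xᵢ)² − Σᵢ (v_if xᵢ)²]. A small sketch that checks the identity numerically; the sizes and random values are arbitrary.

```python
import random

def fm_pairwise_naive(x, V):
    """Sum over i<j of <v_i, v_j> x_i x_j, enumerating every pair: O(d^2 k)."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity in O(d k) via the square-of-sum minus sum-of-squares identity."""
    total = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total

def fm_predict(x, w0, w, V):
    """Full FM output: bias + linear term + pairwise term."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x)) + fm_pairwise_fast(x, V)

rng = random.Random(0)
d, k = 6, 4
x = [1.0, 0.0, 1.0, 0.0, 0.5, 2.0]
V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]
w = [rng.gauss(0, 0.1) for _ in range(d)]

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
y_hat = fm_predict(x, 0.0, w, V)
print(naive, fast, y_hat)
```

With sparse one-hot inputs the sums run only over the nonzero features, so the cost per example scales with the number of active fields rather than with d.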
Wide & Deep (Cheng et al. 2016, Google Play)
The template every subsequent architecture inherits: a wide LR head plus a deep MLP head, summed at logit. p = σ(w_wide · x_wide + MLP_deep(x_deep) + b).
- Wide handles memorization: hand-crafted crosses, exact (user, item) co-occurrence.
- Deep handles generalization: embeddings → MLP, so similar users see similar items even without co-occurrence.
This decomposition — explicit crosses for memorization + neural net for generalization — is the pattern DLRM and DCN both inherit. The idea from Wide & Deep that survived is the hybrid itself.
DLRM (Naumov et al. 2019, Meta)
The production-canonical Meta-style ranker. One detail of the forward pass is easy to miss: the dense bottom MLP's output is treated as an additional "embedding" that participates in the pairwise dot products alongside the sparse-feature embeddings, and it also skip-connects directly to the top MLP.
Three pieces: bottom MLP projects dense features to embedding dim; pairwise dot products ⟨eᵢ, eⱼ⟩ compute explicit second-order interactions (the deep analogue of FM's pairwise term); top MLP eats the dot-product vector + dense bottom output and emits the logit.
DLRM's architectural commitment is that second-order interactions matter and should be computed explicitly. The top MLP can still learn higher-order effects, but order-2 is given for free.
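A toy forward pass makes the three pieces concrete. This is a pure-Python sketch under stated assumptions, not Meta's reference implementation: the table sizes, layer widths, random weights, and the helper names `linear_layer`, `mlp`, and `dlrm_forward` are all hypothetical.

```python
import math
import random

rng = random.Random(0)

def linear_layer(n_in, n_out):
    """Random weights for illustration; W has shape n_out x n_in, bias is zero."""
    W = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def mlp(x, layers):
    """Fully connected stack with ReLU after every layer except the last."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def dlrm_forward(dense, sparse_ids, tables, bottom, top):
    dense_vec = mlp(dense, bottom)                        # 1. dense -> embedding dim
    vectors = [dense_vec] + [t[i] for t, i in zip(tables, sparse_ids)]
    dots = [sum(a * b for a, b in zip(vectors[i], vectors[j]))   # 2. all pairs i<j
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    logit = mlp(dense_vec + dots, top)[0]                 # 3. skip-connect + dots -> top MLP
    return 1.0 / (1.0 + math.exp(-logit))

dim = 8
table_sizes = [1000, 500, 50]                             # assumed hashed vocab sizes
tables = [[[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n)]
          for n in table_sizes]
n_dense = 4
n_vectors = 1 + len(tables)                               # dense vector + one per table
n_pairs = n_vectors * (n_vectors - 1) // 2
bottom = [linear_layer(n_dense, 16), linear_layer(16, dim)]
top = [linear_layer(dim + n_pairs, 16), linear_layer(16, 1)]

p = dlrm_forward([0.3, 1.2, 0.0, 5.0], [17, 4, 3], tables, bottom, top)
print(p)
```

Note that the top MLP's input width grows with the number of pairs, which is quadratic in the number of embedded fields.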
DCN / DCN-v2 (Wang et al. 2017 / 2021)
DCN replaces "all pairwise dots" with a stack of "cross layers" that learn higher-order interactions efficiently. The DCN-v2 cross layer:
x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l
Each layer multiplies x₀ elementwise with a learned linear projection of x_l, then adds the residual. Stacking L layers expresses interactions up to order L+1 — with only L matrices of parameters, not O(d^{L+1}). DCN-v1 used a rank-1 vector form; v2's matrix form is more expressive and what production deploys.
DCN-v2 is usually paired with a parallel deep MLP — again echoing Wide & Deep: an explicit crosses tower next to an implicit MLP tower.
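The cross-layer formula is short enough to run by hand. The sketch below implements it in plain Python and uses a deliberately simple 2-d input with a swap matrix for W (an illustrative choice, not a trained weight) so the growth in interaction order is visible.

```python
def cross_layer(x0, xl, W, b):
    """x_{l+1} = x0 ⊙ (W · x_l + b) + x_l"""
    proj = [sum(w * v for w, v in zip(row, xl)) + bi for row, bi in zip(W, b)]
    return [a * p + r for a, p, r in zip(x0, proj, xl)]

def cross_network(x0, layers):
    """Stack of cross layers; every layer re-multiplies by the original input x0."""
    x = x0
    for W, b in layers:
        x = cross_layer(x0, x, W, b)
    return x

x0 = [2.0, 3.0]                      # call the two features a and b
swap = [[0.0, 1.0], [1.0, 0.0]]      # W that projects each slot onto the other feature
zero = [0.0, 0.0]

one_layer = cross_network(x0, [(swap, zero)])
two_layers = cross_network(x0, [(swap, zero), (swap, zero)])
print(one_layer)    # [a*b + a, a*b + b]
print(two_layers)   # first slot: a^2*b + 2*a*b + a
```

One layer produces the order-2 product a·b; the second layer multiplies by x₀ again and produces the order-3 term a²·b, matching the "order L+1 from L layers" claim.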
Why GBDTs lost
From roughly 2010–2016, gradient-boosted trees (XGBoost, LightGBM) were the strongest off-the-shelf CTR model on tabular data. Then they stopped winning, and the reason is structural, not fashion-driven.
| | GBDT | Deep ranker (DLRM / DCN) |
|---|---|---|
| Sparse high-cardinality features | Painful — can't split on 10⁹ user_ids; you bucket or target-encode and lose information. | Native — embedding tables give every ID a learned vector. |
| Parameter sharing across users | None — every leaf is its own decision. | Two users with similar histories share gradient signal through shared embeddings. |
| Multi-task heads | One tree per task; embeddings not shared. | Shared trunk + per-task heads; sample-efficient. |
| Sequence features (user history) | Flatten or hand-aggregate. | Pool / attend natively. |
| Strengths that remain | Small dense tabular problems, fast iteration, no GPU. | Anything with sparse-categorical + multi-task + sequence. |
The transitional architecture was Facebook's 2014 "Practical Lessons" paper: a GBDT extracts leaf-index features that feed an LR head. It was production for a few years. Once embedding tables made sparse IDs first-class, the GBDT step became dead weight.
What "feature engineering" means now
Feature engineering didn't go away with deep learning — it moved. You no longer write query_token × ad_id by hand; the network learns it. But you do decide:
- Vocabulary and hash sizes. A 10⁸ user_id space hashed into 10⁶ buckets puts about 100 IDs in each bucket on average, so every popular user shares an embedding row with strangers.
- Embedding dimensions per field. Geo with 200 values wants dim 8; user_id with 10⁹ values can carry dim 128. Uniform dims waste parameters.
- Which user history to surface. Clicks? Impressions? Watched-to-completion? Each is a different multi-valued feature.
- Bucketing dense features. Raw price is fine; log(1+price) in 32 quantiles is usually better, because MLPs learn monotonic responses faster from buckets.
- Position as a side input. Position is a feature at training time (it causes clicks). It is not a feature at inference (you're deciding it). Lesson 7.
- Defining the label. Most consequential of all — lesson 5.
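The dense-bucketing decision in the list above can be sketched in a few lines. This is one possible recipe, assuming quantile edges are fit offline on a training sample; the toy price list and the 4-bucket setting (instead of 32) are only there to keep the output readable.

```python
import bisect
import math

def fit_quantile_edges(values, n_buckets):
    """Interior bucket boundaries at evenly spaced quantiles of log(1 + v)."""
    logs = sorted(math.log1p(v) for v in values)
    return [logs[(len(logs) * q) // n_buckets] for q in range(1, n_buckets)]

def bucketize(value, edges):
    """Bucket index in [0, len(edges)]; usable as an embedding-table row."""
    return bisect.bisect_right(edges, math.log1p(value))

# Toy training sample of prices.
prices = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597]
edges = fit_quantile_edges(prices, n_buckets=4)
bucket_ids = [bucketize(p, edges) for p in prices]
print(bucket_ids)
```

Each bucket receives the same number of training examples, so no bucket's embedding row is starved of gradient signal even when the raw distribution is heavy-tailed.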
Trade-off table
| | Capacity | Train cost | Inference cost | Interpretability | Feature-eng load |
|---|---|---|---|---|---|
| LR | Linear only | Trivial | Trivial | High — per-feature weights | Very high — all crosses by hand |
| FM / FFM | Linear + 2nd order | Low | Low | Medium | Medium |
| Wide & Deep | LR + MLP | Medium | Medium | Low | Medium — wide side still hand-crossed |
| DLRM | Explicit 2nd order + MLP | High (sparse-heavy) | High — pairwise dots are O(K²) | Low | Low — model learns crosses |
| DCN-v2 | Learned order ≤ L+1 | High | Medium — L cross layers | Low | Low |
DCN-v2 vs DLRM is the live debate in current production. DCN-v2 often matches or modestly beats DLRM on the same feature set; gains are dataset-dependent and many production teams report parity. DLRM is older, simpler, easier to scale embedding-wise. Many teams run both — DCN cross-stack and DLRM dots, concatenated before the top MLP.
Interview prompts you should be ready for
- "Walk me through DLRM's forward pass." (Sparse → embeddings; dense → bottom MLP; all pairwise dot products; concat with dense bottom; top MLP → logit. Name the three sub-networks.)
- "Why do production rankers have an embedding table with billions of parameters but an MLP with maybe a million?" (Capacity-per-parameter argument: each sparse ID needs its own row to be distinguishable; the MLP is shared across all (user, item) pairs and gets reused on every example.)
- "Why use DCN-v2 instead of DLRM?" (Probes: DLRM hardcodes order-2 interactions via dot products; DCN-v2 cross layers learn order ≤ L+1 with a stack of L matrices. If higher-order effects matter, and they usually do, DCN-v2 models them explicitly in the cross stack, whereas DLRM leaves anything beyond order 2 for the top MLP to learn implicitly.)
- "Your model has 6B parameters and 99% of them are in one user_id table. How do you train this on 8 GPUs?" (Embedding model-parallelism: shard the table by ID across GPUs, all-to-all the looked-up rows, then data-parallel the dense top. This is the canonical "torchrec" / FBGEMM / "DLRM-style" answer.)
- "Embed all features into 64-dim, or use per-feature dim? Trade-offs?" (Uniform dim is simpler and lets you concat freely; per-feature dim saves parameters on low-cardinality fields and lets high-cardinality fields breathe. Production systems typically use per-feature dims.)
- "When would you keep an LR baseline alive in 2025?" (Three reasons: as a sanity-check during ranker rollouts; for cold-start segments where the deep model has no signal; for explainability in policy/legal contexts. Plus: LR is what you fall back to during a deep-model serving outage.)
- "You add 100 new sparse features to a DLRM. AUC barely moves. Why?" (Several real causes: the embedding tables are under-trained for the new fields, hashing collisions, embedding dim too small, pairwise dots now have 100 noisy terms swamping the signal, or the features are correlated with existing ones.)
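For the 8-GPU sharding prompt in the list above, a single-process toy can show the routing logic behind row-wise embedding sharding. This is a simulation only: the lists stand in for per-GPU memory, the final loop stands in for the all-to-all exchange, and the function names are hypothetical rather than taken from torchrec or FBGEMM.

```python
def build_shards(table, n_shards):
    """Row r lives on shard r % n_shards, at local index r // n_shards."""
    shards = [[] for _ in range(n_shards)]
    for row_id, row in enumerate(table):
        shards[row_id % n_shards].append(row)
    return shards

def sharded_lookup(ids, shards):
    n = len(shards)
    # Step 1: route each requested ID to the shard that owns it.
    requests = [[] for _ in range(n)]
    for pos, row_id in enumerate(ids):
        requests[row_id % n].append((pos, row_id // n))
    # Step 2: each shard gathers its own rows locally.
    # Step 3: rows return to the requesting position (stand-in for all-to-all).
    out = [None] * len(ids)
    for shard, reqs in zip(shards, requests):
        for pos, local_idx in reqs:
            out[pos] = shard[local_idx]
    return out

table = [[float(r), float(r)] for r in range(20)]   # toy 20-row, dim-2 table
shards = build_shards(table, n_shards=8)
batch_ids = [3, 17, 8, 3]
rows = sharded_lookup(batch_ids, shards)
print(rows)
```

Modulo placement spreads rows evenly, but it does nothing about hot IDs; a skewed batch can still send most requests to one shard, which is where hot-row caching comes in.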