Feature engineering
The unsexy part of ML that decides whether a model ships. The interview signal is knowing which engineering each model class needs, and recognising the leak patterns that look like wins.
Why this matters even with flexible models
The slogan "deep models learn features, we don't need feature engineering" is half right. A model that can in principle represent anything still needs the right inductive bias to learn from finite data with finite compute. Five well-engineered features routinely beat a hundred raw ones — every free parameter costs variance you pay out of a fixed training set. The interview signal is whether you can map model class to engineering effort:
| Model class | Engineering | Why |
|---|---|---|
| Linear / logistic | Heavy — scale, encode, interact, transform | Linear function. Every nonlinearity and interaction has to be in the feature. |
| GBDT (XGBoost / LightGBM) | Light — encode categoricals, handle missing | Scale-invariant; captures nonlinearity and interactions for free; native missing handling. |
| Deep tabular (DLRM, FT-Transformer) | Moderate — like linear: encode, scale, transform | Despite the hype, deep tabular benefits from the same hand-engineering as linear. Not magic. |
| Deep on text / image / audio | Minimal — let the net learn representations | Convolutions, attention, tokenisation are the feature engineering. |
Feature engineering shifts structure from parameters (estimated from finite data, with variance) into features (computed exactly, with none). Every fact you encode is one fewer parameter to learn — and one fewer direction your variance can leak through.
Categorical encoding — the four canonical choices
Every feature-engineering interview ends up here. Four practical encodings; pick by cardinality, model class, and leakage risk.
| Encoding | How | High cardinality | Leak risk | Best with |
|---|---|---|---|---|
| One-hot | k binary columns | Bad — 1M IDs → 1M cols | None directly; high-cardinality one-hot in trees can still memorize rare levels | LR / NN, low cardinality |
| Ordinal | Integers 0..k−1 | OK in storage, wrong semantically | None | Trees w/ meaningful order; almost never LR |
| Target encoding | E[y | cat], smoothed | Great | High — needs OOF | GBDT w/ careful CV; CatBoost native |
| Feature hashing | hash(cat) mod B | Excellent — constant memory | None (but collisions) | Billion-user ads/recsys, online learning |
Details that matter at senior level:
- Trees vs one-hot. GBDTs can split on a one-hot indicator, but LightGBM's native categorical handling (sort levels by target stat, split on the sorted axis) is meaningfully better for high-cardinality features. Use it if your library has it.
- Ordinal as a hidden bug. LabelEncoder on country names assigns Albania=0, Australia=1, Austria=2…. A linear model now believes Australia is between Albania and Austria. The fix: never label-encode without a defensible ordering.
- Hashing collisions. Two categories landing in the same bucket get averaged. With B ≫ k, collisions are rare and the noise is well-tolerated. The right trade above ~10⁵ levels.
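A minimal sketch of the hash(cat) mod B idea, using only the standard library. The helper names `hash_bucket` and `hashed_one_hot` and the choice of MD5 are assumptions for illustration, not a reference to any particular library's hasher.

```python
import hashlib

def hash_bucket(category: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    # hashlib is stable across processes; the built-in hash() is salted
    # per interpreter run for strings, so it is unsuitable for features.
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hashed_one_hot(category: str, num_buckets: int) -> list:
    """Dense indicator vector; a real system would keep this sparse."""
    vec = [0] * num_buckets
    vec[hash_bucket(category, num_buckets)] = 1
    return vec

countries = ["Albania", "Australia", "Austria", "Brazil", "Canada"]
buckets = {c: hash_bucket(c, num_buckets=16) for c in countries}
```

Memory depends on the bucket count, not on the number of distinct categories, and a category never seen in training still maps to a valid bucket. That is why the trick suits online learning.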
The target-encoding trap
Target encoding replaces category c with an estimate of E[y | c], often smoothed toward the global mean:
$$\mathrm{enc}(c) = \frac{n_c\,\bar{y}_c + m\,\bar{y}}{n_c + m}$$
where n_c is the count, ȳ_c the in-category mean, ȳ the global mean, m a smoothing prior. It collapses arbitrary cardinality into one numeric feature with the right shape.
The trap: compute ȳ_c on the full training set, and row i's feature contains row i's own label. The "feature" is a partial decoding of the target.
The fix is out-of-fold target encoding: split training into K folds; for each fold, compute the encoding from the other K−1 folds. CatBoost generalises this with ordered target statistics: process rows in a fixed permutation, and for row i use only rows 1..i−1.
If the pipeline contains "df.groupby(cat).target.mean()" computed on the full training set, the model is leaking and the offline numbers are fiction.
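A sketch of out-of-fold target encoding in plain Python. It assumes the common smoothing form (n_c · ȳ_c + m · ȳ) / (n_c + m). The function names and the `i % k` fold assignment are hypothetical choices made to keep the example short.

```python
from collections import defaultdict

def smoothed_means(cats, ys, m, global_mean):
    """Smoothed per-category mean: (n_c * mean_c + m * global) / (n_c + m)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y          # sums[c] equals n_c * mean_c
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

def oof_target_encode(cats, ys, k=5, m=10.0):
    """Encode each row using only rows from the other k-1 folds."""
    n = len(cats)
    encoded = [0.0] * n
    for fold in range(k):
        # Assumed fold assignment: row i belongs to fold i % k.
        # Shuffle the rows first if their order carries information.
        fit_idx = [i for i in range(n) if i % k != fold]
        fit_cats = [cats[i] for i in fit_idx]
        fit_ys = [ys[i] for i in fit_idx]
        prior = sum(fit_ys) / len(fit_ys)
        table = smoothed_means(fit_cats, fit_ys, m, prior)
        for i in range(fold, n, k):
            # Categories unseen in the fitting folds fall back to the prior.
            encoded[i] = table.get(cats[i], prior)
    return encoded

cats = ["rare", "a", "a", "a", "a", "a"]
ys = [1, 0, 0, 0, 0, 0]
oof = oof_target_encode(cats, ys, k=3, m=1.0)
leaky = smoothed_means(cats, ys, m=1.0, global_mean=sum(ys) / len(ys))
```

In this toy data the single "rare" row has label 1. The leaky table gives it an encoding above 0.5 because its own label went into the mean. The out-of-fold value is 0.0 because the row's label was never seen when its encoding was computed.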
Numerical features — when to scale
| Model | Scale? | Why |
|---|---|---|
| k-NN, k-means, SVM (RBF) | Yes | Distances are dominated by the largest-scale feature. Income in dollars eclipses age in years. |
| Neural networks | Yes | Gradient magnitudes scale with input; large inputs saturate activations and slow convergence. |
| L1 / L2 regularised linear | Yes | Penalty is uniform; unscaled features get under- or over-penalised. |
| Unregularised linear | Optimum: no. Optimiser: yes | OLS is scale-equivariant; gradient descent is faster on well-conditioned inputs. |
| Trees / GBDT / RF | No | Splits are scale-invariant. x > 17 partitions the same rows as 2x > 34. |
Four scalers worth knowing by name:
- Standardisation (x − μ)/σ. Mean 0, variance 1. Assumes roughly symmetric distribution; outliers push μ and inflate σ.
- Min-max (x − min)/(max − min). Maps to [0,1]. Sensitive to outliers — one extreme value compresses everyone else.
- Robust scaling (x − median)/IQR. Outlier-resistant. Default when you don't trust the tails.
- Log / Yeo-Johnson / Box-Cox. For positively skewed counts, prices, durations. Often the single biggest win for a linear model.
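The four scalers above can be compared on a small made-up income column with one extreme value. This is a standard-library sketch. The data and function names are illustrative, and `log1p` stands in for the log family (Yeo-Johnson and Box-Cox are not implemented here).

```python
import math
import statistics

def standardise(xs):
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust(xs):
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, med, q3 = statistics.quantiles(xs, n=4)
    return [(x - med) / (q3 - q1) for x in xs]

def log1p_transform(xs):
    # log(1 + x): defined at zero, compresses the right tail.
    return [math.log1p(x) for x in xs]

# Nine ordinary incomes and one outlier (made-up numbers).
incomes = [30_000, 35_000, 42_000, 48_000, 51_000,
           55_000, 58_000, 61_000, 65_000, 5_000_000]

mm = min_max(incomes)
rb = robust(incomes)
```

With min-max, the nine ordinary incomes all land below 0.01 because the outlier defines the range. With robust scaling they stay spread over more than one IQR unit. In a real pipeline the statistics (μ, σ, min, max, quartiles) would be fit on training rows only.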
Binning, missing values, and informative absence
Binning converts continuous x into k buckets. For linear models, this captures nonlinearity (each bucket gets its own coefficient) and absorbs outliers. The cost is within-bucket information loss. Quantile-based (equal-count) almost always beats fixed-width (equal-spacing). For trees, binning is unnecessary — the tree picks its own cuts.
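A small sketch of the quantile versus fixed-width contrast, assuming bin edges are learned from training values and then applied with a binary search. The helper names are hypothetical.

```python
import bisect
import statistics

def quantile_edges(train_values, k):
    """k - 1 interior cut points giving roughly equal-count buckets."""
    return statistics.quantiles(train_values, n=k)

def fixed_width_edges(train_values, k):
    """k - 1 interior cut points spaced evenly between min and max."""
    lo, hi = min(train_values), max(train_values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def to_bucket(x, edges):
    """Index of the bucket x falls into, in [0, len(edges)]."""
    return bisect.bisect_right(edges, x)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
q_bins = [to_bucket(v, quantile_edges(values, 4)) for v in values]
w_bins = [to_bucket(v, fixed_width_edges(values, 4)) for v in values]
```

On this skewed toy column the quantile bins hold 2, 3, 3 and 2 values. The fixed-width bins put nine values in the first bucket and leave two buckets empty, which wastes coefficients in a linear model.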
Missing values have four standard treatments:
| Treatment | Cost | When to use |
|---|---|---|
| Drop the row | Loses signal; biased if missingness isn't random | Almost never. Only if missingness is rare and uninformative. |
| Mean / median / mode | Underestimates variance | Default for prototypes. Always add an is_missing indicator. |
| Model-based (k-NN, MICE) | Expensive; leaky if done globally | When features correlate and compute allows. |
| Native (GBDT, CatBoost) | None (built in) | Usually best. The tree learns which branch to send missing values down at each split. An explicit is_missing indicator can still be useful alongside native handling: it exposes the fact of missingness as its own feature, though it cannot tell the model why a value is missing. |
Missingness is often a signal in its own right, and imputing without an is_missing column throws that signal away. The two-column pattern (impute + indicator) costs nothing.
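The two-column pattern can be sketched as follows, assuming `None` marks a missing value and the fill statistic is learned from training rows only. The function names and numbers are made up for illustration.

```python
import statistics

def fit_median(train_values):
    """Learn the fill value from training rows only (None marks missing)."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)

def impute_with_indicator(values, fill_value):
    """Return (imputed_value, is_missing) pairs."""
    return [(fill_value, 1) if v is None else (v, 0) for v in values]

train_income = [40.0, None, 55.0, 70.0, None]
test_income = [None, 90.0]

fill = fit_median(train_income)   # median of the observed training values
train_rows = impute_with_indicator(train_income, fill)
test_rows = impute_with_indicator(test_income, fill)
```

The test rows reuse the training median rather than computing their own, which keeps the imputer on the right side of the train/test split.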
Interaction features — only for linear-ish models
Logistic regression cannot represent x_i · x_j without you handing it the product. Three approaches:
- Manual crosses. Pick known interactions (device × country, query × ad_category); add the concatenation or product. The pre-deep-learning Google Ads recipe, still effective.
- Polynomial features. All pairs (degree 2), all triples (degree 3). Combinatorial explosion past degree 2; pair with strong L1.
- Hash-product crosses. Hash the tuple (c_i, c_j) into a bucket. Standard trick for billion-scale ads: 10⁶ × 10⁶ is hopeless one-hot but tractable hashed to 10⁷.
For GBDTs, don't bother. A split on x_i followed by a split on x_j in the subtree is the interaction. Manual crosses on a GBDT add noise.
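Two short illustrations of the linear-model case, both sketches under stated assumptions. The first shows that XOR, which no linear score over (x1, x2) can separate, becomes exactly linear once the product is supplied as a feature. The weights are hand-picked, not fitted. The second hashes a categorical pair into a bucket. The function name, the separator byte and the BLAKE2 hash are illustrative choices.

```python
import hashlib

def xor_score(x1: int, x2: int) -> int:
    # Hand-picked weights (1, 1, -2) on the features (x1, x2, x1 * x2).
    return x1 + x2 - 2 * (x1 * x2)

def cross_bucket(left: str, right: str, num_buckets: int) -> int:
    """Hash the pair (left, right) into one of num_buckets indices."""
    # The unit-separator byte keeps ("ab", "c") and ("a", "bc") as
    # different keys, assuming raw values never contain that byte.
    key = f"{left}\x1f{right}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets

xor_table = {(a, b): xor_score(a, b) for a in (0, 1) for b in (0, 1)}
bucket = cross_bucket("mobile", "DE", num_buckets=10**7)
```

The cross never materialises the full product space: only pairs that actually occur consume a bucket, and the bucket count caps memory.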
Time-based features and the leakage they invite
The standard kit:
- Lag features — value at t − 1, t − 7, t − 28.
- Rolling aggregates — 7-day mean, 30-day max, 24-hour stdev. Smooths noise; captures trend and volatility.
- Calendar features — day-of-week, hour, month, holiday indicator, days-since-launch.
- Time-since-last-event — for sequence data (seconds since the user's last click).
The rule junior candidates violate: only use information available at prediction time. A rolling mean centred on t uses data from after t — a leak. A "user lifetime value" feature computed on the full dataset includes future events — a leak. Compute every feature as if you were producing it online at t, with strict "no information from > t".
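The prediction-time rule can be made concrete with a small sketch. It assumes the series is the quantity being predicted at t, so the value at t itself is off limits. The function names and the sales numbers are made up.

```python
def lag(series, k):
    """Value at t - k; None where history is too short."""
    return [series[t - k] if t >= k else None for t in range(len(series))]

def trailing_mean(series, window):
    """Mean of the `window` values strictly before t (no value at or after t)."""
    out = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]
        out.append(sum(past) / len(past) if len(past) == window else None)
    return out

def centred_mean(series, window):
    """Anti-pattern: the window around t includes values after t."""
    half = window // 2
    out = []
    for t in range(len(series)):
        if t - half < 0 or t + half >= len(series):
            out.append(None)
        else:
            chunk = series[t - half:t + half + 1]
            out.append(sum(chunk) / len(chunk))
    return out

sales = [10, 12, 11, 15, 40, 14, 13]   # spike at t = 4
safe = trailing_mean(sales, 3)
leaky = centred_mean(sales, 3)
```

At t = 3 the trailing mean is 11.0, built from t = 0..2. The centred mean is 22.0 because it has already seen the spike at t = 4, which would not exist yet in production.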
Target leakage — the silent killer
If a model looks too good to be true, it usually is. Five common forms of leakage to scan for:
| Form | Example |
|---|---|
| Future information | "Total purchases in 30 days" computed after the conversion you're predicting. |
| Proxy of the label | "Transaction was reversed" used to predict fraud. Reversal happens because of fraud. |
| Preprocessing across split | Fitting scaler / imputer / encoder on the full dataset before splitting. Test statistics leak into training. |
| Group leakage | Same user_id in train and test in a churn task. The model memorises IDs. |
| Imputation leakage | Imputing with a global mean computed including test rows. Subtle, always wrong. |
The senior reflex: when an offline score looks suspiciously high, the first question is "what's the leak?", not "is the model good?". The second: "what would this feature look like computed at inference time, in production?" If it differs from the training-time feature, that's the leak.
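Two of the leaks in the table have mechanical fixes, sketched below with the standard library. `assign_split` is a hypothetical helper that hashes user_id so every row of a user lands on the same side. The scaler statistics are then fit on training values only. All data is made up.

```python
import hashlib
import statistics

def assign_split(user_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically send every row of a user to the same side."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return "test" if bucket < test_fraction else "train"

rows = [("u1", 3.0), ("u2", 5.0), ("u1", 4.0),
        ("u3", 100.0), ("u2", 6.0), ("u4", 7.0)]
train_users = {u for u, _ in rows if assign_split(u) == "train"}
test_users = {u for u, _ in rows if assign_split(u) == "test"}

# Preprocessing fit on train only, then applied unchanged to test.
train_x = [3.0, 5.0, 4.0, 6.0]
test_x = [100.0]
mu = statistics.fmean(train_x)
sigma = statistics.pstdev(train_x)
scaled_test = [(x - mu) / sigma for x in test_x]
```

Because the split is a pure function of user_id, no user can appear on both sides, however many rows they have. The test value 100.0 never influences μ or σ.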
Interactive · encoding showdown
Synthetic binary classification with a high-cardinality user_id (100 levels, target rate varies by user). Pick an encoding and a model. Three AUCs: train, validation (rows from users seen in training), and "production" (rows from new users never seen). The honesty gap exposes leakage and out-of-vocab failure.
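A much-reduced, static stand-in for that experiment, offered as a sketch rather than a reproduction. It covers one encoding (leaky per-user target mean) and no fitted model, with the encoded value used directly as the score. The row counts, rate range and seed are assumptions.

```python
import random
from collections import defaultdict

def auc(scores, labels):
    """Probability a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_rows(users, rate, rows_per_user, rng):
    return [(u, int(rng.random() < rate[u])) for u in users for _ in range(rows_per_user)]

rng = random.Random(0)
seen = [f"u{i}" for i in range(100)]
new = [f"new{i}" for i in range(100)]
rate = {u: rng.uniform(0.1, 0.9) for u in seen + new}

train = make_rows(seen, rate, 5, rng)
val = make_rows(seen, rate, 5, rng)    # same users, fresh rows
prod = make_rows(new, rate, 5, rng)    # users never seen in training

# Leaky encoding: per-user mean over all training rows, own label included.
sums, counts = defaultdict(int), defaultdict(int)
for u, y in train:
    sums[u] += y
    counts[u] += 1
global_mean = sum(y for _, y in train) / len(train)
encoding = {u: sums[u] / counts[u] for u in counts}

def score(rows):
    # Unseen users fall back to the global mean.
    return [encoding.get(u, global_mean) for u, _ in rows]

def labels(rows):
    return [y for _, y in rows]

train_auc = auc(score(train), labels(train))
val_auc = auc(score(val), labels(val))
prod_auc = auc(score(prod), labels(prod))
```

The production AUC is exactly 0.5 here, because every unseen user receives the same global-mean score. The train AUC is inflated by each row's own label, and one would expect the validation AUC to sit somewhere between the two.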
When NOT to engineer features
- GBDT on tabular features. Manual crosses, polynomial features, and scaling are wasted compute and noise. Spend the time on data quality and target definition.
- Deep model on text / image / audio. The representation layers are the feature engineering. Hand-crafted features (bag-of-words on top of a transformer, edge detectors on top of a CNN) generally hurt.
- No hypothesis. Random feature engineering is gambling and you will eventually find a leak that looks like a win. Targeted engineering with a stated mechanism is science. The signal: "I added this because X causes Y", not "I tried 200 features and these 12 helped."
Interview prompts you should be ready for
- "Walk through the target-encoding leakage failure mode." (Encoding on full train includes each row's own label. Train and val both see the leak; production sees new rows / categories and the leak vanishes. Fix: out-of-fold or ordered target statistics.)
- "Your XGBoost has AUC 0.95 on val, 0.55 in prod. Hypothesise." (Leak. Candidates: target-encoded feature without OOF; preprocessing fit on full data; group leakage; feature derived from future timestamp; imputation using global stats.)
- "You have a categorical with 10M levels. How do you encode?" (One-hot is out. Target encoding with OOF, or hashing into ~10⁵–10⁶ buckets for constant memory and online learning. For GBDT: library-native categorical handling.)
- "When do you scale features?" (Always for distance-based methods, NNs, regularised linear. Never for trees. For unregularised linear: optimum unchanged, optimiser converges faster on well-conditioned features.)
- "Missing-value imputation: strategies and trade-offs." (Drop almost never. Mean/median impute + is_missing indicator is the prototype default. Model-based when features correlate. GBDT/CatBoost handle natively, usually best.)
- "How do you feature-engineer time-series?" (Lags, rolling aggregates, calendar features, time-since-last-event. Rule: only information available at prediction time. Validate with a temporal split, never random k-fold.)
Add is_missing indicators for free signal. High-cardinality categoricals: out-of-fold target encoding or hashing, never the leaky groupby. When an offline number looks too good, find the leak before celebrating.