Feature engineering

The unsexy part of ML that decides whether a model ships. The interview signal is knowing which engineering each model class needs, and recognising the leak patterns that look like wins.

Why this matters even with flexible models

The slogan "deep models learn features, we don't need feature engineering" is half right. A model that can in principle represent anything still needs the right inductive bias to learn from finite data with finite compute. Five well-engineered features routinely beat a hundred raw ones — every free parameter costs variance you pay out of a fixed training set. The interview signal is whether you can map model class to engineering effort:

Model class	Engineering	Why
Linear / logistic	Heavy — scale, encode, interact, transform	Linear function. Every nonlinearity and interaction has to be in the feature.
GBDT (XGBoost / LightGBM)	Light — encode categoricals, handle missing	Scale-invariant; captures nonlinearity and interactions for free; native missing handling.
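Deep tabular (DLRM, FT-Transformer)	Moderate: encode, scale, transform, much as for linear	Despite the hype, deep tabular models benefit from much of the same preprocessing as linear ones. They learn interactions and nonlinearities, but still want clean, scaled, encoded inputs. Not magic.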
Deep on text / image / audio	Minimal — let the net learn representations	Convolutions, attention, tokenisation are the feature engineering.

Feature engineering shifts structure from parameters (estimated from finite data, with variance) into features (computed exactly, with none). Every fact you encode is one fewer parameter to learn — and one fewer direction your variance can leak through.

Categorical encoding — the four canonical choices

Every feature-engineering interview ends up here. Four practical encodings; pick by cardinality, model class, and leakage risk.

Encoding	How	High cardinality	Leak risk	Best with
One-hot	k binary columns	Bad — 1M IDs → 1M cols	None directly; high-cardinality one-hot in trees can still memorize rare levels	LR / NN, low cardinality
Ordinal	Integers 0..k−1	OK in storage, wrong semantically	None	Trees w/ meaningful order; almost never LR
Target encoding	E[y | cat], smoothed	Great	High; needs OOF	GBDT w/ careful CV; CatBoost native
Feature hashing	hash(cat) mod B	Excellent — constant memory	None (but collisions)	Billion-user ads/recsys, online learning

Details that matter at senior level:

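Trees vs one-hot. GBDTs can split on a one-hot indicator, but each such split isolates only one level. LightGBM's native categorical handling (sort levels by target stat, split on the sorted axis) is usually better than one-hot. With very many levels it can overfit, and LightGBM's own documentation suggests treating very high-cardinality categoricals as numeric or embedding them. Use native handling if your library has it, and validate it against the alternatives.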
Ordinal as a hidden bug. LabelEncoder on country names assigns integers in alphabetical order: Albania=0, Australia=1, Austria=2, and so on. A linear model now believes Australia is between Albania and Austria. The fix: never label-encode without a defensible ordering.
Hashing collisions. Two categories landing in the same bucket share one weight, so the model treats them as a single category and learns a blend of their effects. With B ≫ k, collisions are rare and the noise is well-tolerated. The right trade above ~10⁵ levels.
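The hashing trick and its collision behaviour can be sketched in a few lines. This is a minimal illustration, not a production hasher: the helper names (hash_bucket, collision_rate) and the user IDs are hypothetical, and md5 is chosen only because it gives a hash that is stable across processes.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int) -> int:
    """Map a category string to a bucket index in [0, n_buckets)."""
    # Python's built-in hash() is salted per process for strings, so a
    # stable digest is used to keep train-time and serve-time buckets equal.
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def collision_rate(categories, n_buckets):
    """Fraction of categories that share a bucket with at least one other."""
    buckets = {}
    for c in categories:
        buckets.setdefault(hash_bucket(c, n_buckets), []).append(c)
    collided = sum(len(v) for v in buckets.values() if len(v) > 1)
    return collided / len(categories)

user_ids = [f"user_{i}" for i in range(1000)]   # hypothetical IDs, k = 1000
small_b = collision_rate(user_ids, 256)         # B < k: collisions unavoidable
large_b = collision_rate(user_ids, 2**20)       # B >> k: collisions rare
print(f"B=256: {small_b:.1%} collided, B=2^20: {large_b:.1%} collided")
```

Memory is fixed by B regardless of how many new IDs arrive, which is why the approach suits online learning.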

The target-encoding trap

Target encoding replaces category c with an estimate of E[y | c], often smoothed toward the global mean:

enc(c) = (n_c · ȳ_c + m · ȳ) / (n_c + m)

where n_c is the count, ȳ_c the in-category mean, ȳ the global mean, m a smoothing prior. It collapses arbitrary cardinality into one numeric feature with the right shape.
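A worked instance of the formula shows what the smoothing prior does. The numbers below (global rate 0.10, m = 20, the two category sizes) are made up for illustration.

```python
def smoothed_encoding(n_c: int, mean_c: float, global_mean: float, m: float) -> float:
    """enc(c) = (n_c * mean_c + m * global_mean) / (n_c + m)."""
    return (n_c * mean_c + m * global_mean) / (n_c + m)

global_mean = 0.10   # hypothetical overall positive rate
m = 20               # hypothetical smoothing strength, in pseudo-rows

# Rare category: 2 rows, both positive. The raw mean of 1.0 is mostly noise,
# so the estimate is pulled strongly toward the global mean.
rare = smoothed_encoding(n_c=2, mean_c=1.0, global_mean=global_mean, m=m)

# Common category: 2000 rows at a 0.30 rate. The data dominates the prior.
common = smoothed_encoding(n_c=2000, mean_c=0.30, global_mean=global_mean, m=m)

print(f"rare: {rare:.4f}, common: {common:.4f}")
```

The rare category lands near 0.18 rather than 1.0, while the common one stays close to its own mean of 0.30.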

The trap: compute ȳ_c on the full training set, and row i's feature contains row i's own label. The "feature" is a partial decoding of the target.

train AUC       0.96  ← model can "see" the label through the feature
validation AUC  0.95  ← same leaky encoding reused
production AUC  0.55  ← new categories: leak vanishes, model collapses

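The fix is out-of-fold target encoding: split training into K folds; for each fold, compute the encoding from the other K−1 folds, so no row's feature is computed from its own label. Validation, test, and production rows are encoded from training labels only. CatBoost generalises this with ordered target statistics: process rows in a random permutation, and for row i use only the rows that precede it in that permutation (rows 1..i−1).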
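The out-of-fold scheme can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the toy data, and the choice of k = 5 and m = 1 are all hypothetical.

```python
import random
from collections import defaultdict

def fit_target_encoding(cats, ys, m=10.0):
    """Smoothed per-category means from the given rows, plus their global mean."""
    prior = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    # sums[c] equals n_c * mean_c, so this is the smoothed formula above.
    table = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return table, prior

def oof_target_encode(cats, ys, k=5, m=10.0, seed=0):
    """Encode each training row using labels from the other k-1 folds only."""
    idx = list(range(len(cats)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    encoded = [0.0] * len(cats)
    for held_out in folds:
        held = set(held_out)
        fit_idx = [i for i in idx if i not in held]
        table, prior = fit_target_encoding(
            [cats[i] for i in fit_idx], [ys[i] for i in fit_idx], m
        )
        for i in held_out:
            # Categories unseen in the other folds fall back to the global mean.
            encoded[i] = table.get(cats[i], prior)
    return encoded

# Hypothetical toy data: 100 users, two rows each, labels independent of user.
rng = random.Random(42)
cats = [f"user_{i % 100}" for i in range(200)]
ys = [rng.randint(0, 1) for _ in range(200)]

leaky_table, _ = fit_target_encoding(cats, ys, m=1.0)
leaky = [leaky_table[c] for c in cats]           # row i's own label is inside
oof = oof_target_encode(cats, ys, k=5, m=1.0)    # row i's own label is excluded
```

A quick check of the difference: flip one row's label and recompute. The leaky feature for that row moves; the out-of-fold feature does not. For validation or production rows, fit_target_encoding on the full training set and look the category up in the resulting table.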

Interview gotcha

Near-perfect offline scores with target-encoded high-cardinality features? First question: "how did you compute the encoding?" If the answer is "df.groupby(cat).target.mean()", the model is leaking and the offline numbers are fiction.

Numerical features — when to scale

Model	Scale?	Why
k-NN, k-means, SVM (RBF)	Yes	Distances are dominated by the largest-scale feature. Income in dollars eclipses age in years.
Neural networks	Yes	Gradient magnitudes scale with input; large inputs saturate activations and slow convergence.
L1 / L2 regularised linear	Yes	Penalty is uniform; unscaled features get under- or over-penalised.
Unregularised linear	Optimum: no. Optimiser: yes	OLS is scale-equivariant; gradient descent is faster on well-conditioned inputs.
Trees / GBDT / RF	No	Splits are scale-invariant. x > 17 partitions the same rows as 2x > 34.

Four scalers worth knowing by name:

Standardisation (x − μ)/σ. Mean 0, variance 1. Assumes roughly symmetric distribution; outliers push μ and inflate σ.
Min-max (x − min)/(max − min). Maps to [0,1]. Sensitive to outliers — one extreme value compresses everyone else.
Robust scaling (x − median)/IQR. Outlier-resistant. Default when you don't trust the tails.
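Log / Yeo-Johnson / Box-Cox. Strictly transforms rather than scalers: they change the shape of the distribution, not just its location and spread. For positively skewed counts, prices, durations. Log and Box-Cox need strictly positive inputs (use log(1 + x) for counts with zeros); Yeo-Johnson also handles zero and negative values. Often the single biggest win for a linear model.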
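To see how differently the four behave around an outlier, here is a small standard-library sketch. The income values are invented, and the quartile convention (statistics.quantiles with method="inclusive") is one choice among several; other libraries may give slightly different IQRs.

```python
import math
import statistics

def standardise(xs):
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    # Three cut points: first quartile, median, third quartile.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

def log_transform(xs):
    # log(1 + x) so that zero counts are allowed.
    return [math.log1p(x) for x in xs]

# Hypothetical incomes: five typical values and one extreme outlier.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 5_000_000]

for name, fn in [("standardise", standardise), ("min-max", min_max),
                 ("robust", robust_scale), ("log1p", log_transform)]:
    print(name, [round(v, 3) for v in fn(incomes)])
```

Under min-max the five typical incomes are squeezed into less than 1% of the [0, 1] range, while robust scaling keeps them spread across roughly [−1, 0.6] and pushes only the outlier far away.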

Binning, missing values, and informative absence

Binning converts continuous x into k buckets. For linear models, this captures nonlinearity (each bucket gets its own coefficient) and absorbs outliers. The cost is within-bucket information loss. Quantile-based (equal-count) bins usually beat fixed-width (equal-spacing) bins on skewed data, where fixed-width cuts leave most buckets nearly empty. For trees, binning is unnecessary: the tree picks its own cuts.
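The difference between the two binning schemes is easiest to see on skewed data. The sketch below uses invented durations and hypothetical helper names; in practice the edges would be fit on training data only and reused at inference.

```python
import bisect
import statistics

def equal_width_edges(xs, k):
    """k-1 interior edges that split [min, max] into k equal-width bins."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def quantile_edges(xs, k):
    """k-1 interior edges that put roughly equal counts in each bin."""
    return statistics.quantiles(xs, n=k, method="inclusive")

def bin_counts(xs, edges):
    """Number of values falling into each of the len(edges)+1 bins."""
    counts = [0] * (len(edges) + 1)
    for x in xs:
        counts[bisect.bisect_right(edges, x)] += 1
    return counts

# Hypothetical right-skewed session durations in seconds.
durations = [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30, 50, 100, 300, 1200]

width_counts = bin_counts(durations, equal_width_edges(durations, 4))
quant_counts = bin_counts(durations, quantile_edges(durations, 4))
print("equal-width:", width_counts)
print("quantile:   ", quant_counts)
```

With equal-width bins almost every value lands in the first bucket, so the linear model gets one useful coefficient out of four; the quantile bins spread the rows evenly.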

Missing values have four standard treatments:

Treatment	Cost	When to use
Drop the row	Loses signal; biased if missingness isn't random	Almost never. Only if missingness is rare and uninformative.
Mean / median / mode	Underestimates variance	Default for prototypes. Always add an `is_missing` indicator.
Model-based (k-NN, MICE)	Expensive; leaky if done globally	When features correlate and compute allows.
Native (XGBoost, LightGBM, CatBoost)	None; built in	Usually best. XGBoost and LightGBM learn a default "missing direction" per split; CatBoost by default treats a missing numeric value as smaller than every observed value. An is_missing indicator is largely redundant with native handling, but it is cheap and still exposes the fact of missingness as a feature the model can split on directly.

Missingness is a feature

In churn, fraud, and credit, the fact that a user didn't fill a field in is often more predictive than the value they would have filled in. Median-imputing without an is_missing column throws that signal away. The two-column pattern (impute + indicator) costs nothing.
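A minimal sketch of the two-column pattern, assuming missing values arrive as None and that the fill value is learned from training rows only. The function names and income figures are hypothetical.

```python
import statistics

def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)

def impute_with_indicator(values, fill):
    """Return (imputed_value, is_missing) pairs, one per input value."""
    return [(fill if v is None else v, 1 if v is None else 0) for v in values]

train_income = [40_000, None, 55_000, 62_000, None, 48_000]  # hypothetical
serve_income = [None, 51_000]                                # hypothetical

fill = fit_median(train_income)            # learned on train only
train_features = impute_with_indicator(train_income, fill)
serve_features = impute_with_indicator(serve_income, fill)  # same fill reused
print(train_features)
print(serve_features)
```

Reusing the training-time fill at serving time keeps the feature identical between training and production, and the indicator column preserves the signal that imputation would otherwise erase.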

Interaction features — only for linear-ish models

Logistic regression cannot represent x_i · x_j without you handing it the product. Three approaches:

Manual crosses. Pick known interactions (device × country, query × ad_category); add the concatenation or product. The pre-deep-learning Google Ads recipe, still effective.
Polynomial features. All pairs (degree 2), all triples (degree 3). Combinatorial explosion past degree 2; pair with strong L1.
Hash-product crosses. Hash the tuple (c_i, c_j) into a bucket. Standard trick for billion-scale ads: 10⁶ × 10⁶ is hopeless one-hot but tractable hashed to 10⁷.

For GBDTs, don't bother. A split on x_i followed by a split on x_j in the subtree is the interaction. Manual crosses on a GBDT add noise.
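Two of the approaches above lend themselves to a short sketch: counting how fast polynomial expansion grows, and hashing a categorical cross into a fixed number of buckets. The helper names, the separator character, and the example values are assumptions made for illustration; the term count includes squared and cubed terms as well as cross terms.

```python
import hashlib
import math

def n_polynomial_terms(n_features: int, degree: int) -> int:
    """Number of monomials of total degree 1..degree in n_features variables."""
    return math.comb(n_features + degree, degree) - 1

def cross_key(values):
    # Join with a separator unlikely to appear in real values, so that
    # ("ab", "c") and ("a", "bc") do not collapse to the same key.
    return "\x1f".join(values)

def hashed_cross(values, n_buckets: int) -> int:
    """Bucket index for a tuple of categorical values."""
    digest = hashlib.md5(cross_key(values).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

print(n_polynomial_terms(100, 2))   # degree-2 expansion of 100 features
print(n_polynomial_terms(100, 3))   # degree-3 expansion of 100 features

# Hypothetical device x country cross hashed into 10^7 buckets.
bucket = hashed_cross(("ios", "DE"), 10**7)
print(bucket)
```

The cross never materialises the full device-by-country vocabulary; only the bucket index is stored, so the feature width stays at the chosen bucket count.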

Time-based features and the leakage they invite

The standard kit:

Lag features — value at t − 1, t − 7, t − 28.
Rolling aggregates — 7-day mean, 30-day max, 24-hour stdev. Smooths noise; captures trend and volatility.
Calendar features — day-of-week, hour, month, holiday indicator, days-since-launch.
Time-since-last-event — for sequence data (seconds since the user's last click).

The rule junior candidates violate: only use information available at prediction time. A rolling mean centred on t uses data from after t — a leak. A "user lifetime value" feature computed on the full dataset includes future events — a leak. Compute every feature as if you were producing it online at t, with strict "no information from > t".
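The difference between a causal and a leaky rolling feature can be shown on a short series. This is a sketch with made-up daily sales and hypothetical helper names; it assumes the value at t is the thing being predicted, so features may use only values before t, and it assumes an odd window for the centred version.

```python
def lag(series, k):
    """Value k steps before each position; None where no history exists."""
    return [None] * k + series[: len(series) - k]

def trailing_mean(series, window):
    """Mean of the `window` values strictly before t, so t itself is excluded."""
    out = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]
        out.append(sum(past) / window if len(past) == window else None)
    return out

def centred_mean(series, window):
    """LEAKY contrast case: the window around t includes values after t."""
    half = window // 2
    out = []
    for t in range(len(series)):
        lo, hi = t - half, t + half + 1
        ok = lo >= 0 and hi <= len(series)
        out.append(sum(series[lo:hi]) / window if ok else None)
    return out

daily_sales = [10, 12, 11, 13, 50, 14, 12]   # hypothetical; spike on day 4

print("lag-1:   ", lag(daily_sales, 1))
print("trailing:", trailing_mean(daily_sales, 3))
print("centred: ", centred_mean(daily_sales, 3))
```

On day 3 the trailing mean still reflects the calm days before it, while the centred mean has already absorbed the day-4 spike, which is information no online system would have at that point.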

Target leakage — the silent killer

If a model looks too good to be true, it usually is. Five common forms of leakage to scan for:

Form	Example
Future information	"Total purchases in 30 days" computed after the conversion you're predicting.
Proxy of the label	"Transaction was reversed" used to predict fraud. Reversal happens because of fraud.
Preprocessing across split	Fitting scaler / imputer / encoder on the full dataset before splitting. Test statistics leak into training.
Group leakage	Same user_id in train and test in a churn task. The model memorises IDs.
Imputation leakage	Imputing with a global mean computed including test rows. Subtle, always wrong.

The senior reflex: when an offline score looks suspiciously high, the first question is "what's the leak?", not "is the model good?". The second: "what would this feature look like computed at inference time, in production?" If it differs from the training-time feature, that's the leak.
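Two of the leaks in the table, group leakage and preprocessing across the split, are avoided by the same discipline: split first, by group, then fit every statistic on the training side only. The sketch below assumes a hypothetical user_id/spend dataset and a hash-based split; the 20% test fraction is arbitrary.

```python
import hashlib
import statistics

def in_test_group(group_id, test_fraction=0.2):
    """Deterministically assign a whole group to test by hashing its ID."""
    digest = hashlib.md5(str(group_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 1000 < test_fraction * 1000

# Hypothetical rows: 50 users with 10 rows each.
rows = [{"user_id": f"u{i % 50}", "spend": float(i)} for i in range(500)]

# 1. Split by group, so no user appears on both sides.
train = [r for r in rows if not in_test_group(r["user_id"])]
test = [r for r in rows if in_test_group(r["user_id"])]

# 2. Fit preprocessing statistics on train only.
train_spend = [r["spend"] for r in train]
mu = statistics.fmean(train_spend)
sigma = statistics.pstdev(train_spend)

# 3. Apply the train-fitted statistics to both splits.
def scale(r):
    return (r["spend"] - mu) / sigma

train_scaled = [scale(r) for r in train]
test_scaled = [scale(r) for r in test]

overlap = {r["user_id"] for r in train} & {r["user_id"] for r in test}
print(f"train rows: {len(train)}, test rows: {len(test)}, shared users: {len(overlap)}")
```

Because assignment depends only on the hashed ID, a user seen later in production falls on the same side every time the split is recomputed.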

Interactive · encoding showdown

Synthetic binary classification with a high-cardinality user_id (100 levels, target rate varies by user). The demo lets you pick an encoding and a model, with controls for hash buckets (default 32) and noise (default 0.15), and reports three AUCs: train, validation (rows from users seen in training), and "production" (rows from new users never seen). The honesty gap exposes leakage and out-of-vocab failure.

Encoding vs model vs honesty

Try target-leak first (train and val look amazing, production collapses). Then switch to out-of-fold target encoding: all three are close. Then hashing: small drop in train AUC, no leak, generalises to new users.

[Interactive widget not reproduced here. Readouts: train AUC, val AUC (seen users), prod AUC (new users), honesty gap, reading.]

When NOT to engineer features

GBDT on tabular features. Manual crosses, polynomial features, and scaling are wasted compute and noise. Spend the time on data quality and target definition.
Deep model on text / image / audio. The representation layers are the feature engineering. Hand-crafted features (bag-of-words on top of a transformer, edge detectors on top of a CNN) generally hurt.
No hypothesis. Random feature engineering is gambling and you will eventually find a leak that looks like a win. Targeted engineering with a stated mechanism is science. The signal: "I added this because X causes Y", not "I tried 200 features and these 12 helped."

Interview prompts you should be ready for

"Walk through the target-encoding leakage failure mode." (Encoding on full train includes each row's own label. Train and val both see the leak; production sees new rows / categories and the leak vanishes. Fix: out-of-fold or ordered target statistics.)
"Your XGBoost has AUC 0.95 on val, 0.55 in prod. Hypothesise." (Leak. Candidates: target-encoded feature without OOF; preprocessing fit on full data; group leakage; feature derived from future timestamp; imputation using global stats.)
"You have a categorical with 10M levels. How do you encode?" (One-hot is out. Target encoding with OOF, or hashing into ~10⁵–10⁶ buckets for constant memory and online learning. For GBDT: library-native categorical handling.)
"When do you scale features?" (Always for distance-based methods, NNs, regularised linear. Never for trees. For unregularised linear: optimum unchanged, optimiser converges faster on well-conditioned features.)
"Missing-value imputation — strategies and trade-offs." (Drop almost never. Mean/median impute + is_missing indicator is the prototype default. Model-based when features correlate. GBDT/CatBoost handle natively, usually best.)
"How do you feature-engineer time-series?" (Lags, rolling aggregates, calendar features, time-since-last-event. Rule: only information available at prediction time. Validate with a temporal split, never random k-fold.)

Takeaway

Feature engineering is bias-variance engineering with prior knowledge: move structure from "things the model has to learn" into "things you already know". Pick encoding by cardinality and model class. Scale only when the model needs it. Add is_missing indicators for free signal. High-cardinality categoricals: out-of-fold target encoding or hashing — never the leaky groupby. When an offline number looks too good, find the leak before celebrating.