
Feature engineering

The unsexy part of ML that decides whether a model ships. The interview signal is knowing which engineering each model class needs, and recognising the leak patterns that look like wins.

Why this matters even with flexible models

The slogan "deep models learn features, we don't need feature engineering" is half right. A model that can in principle represent anything still needs the right inductive bias to learn from finite data with finite compute. Five well-engineered features routinely beat a hundred raw ones — every free parameter costs variance you pay out of a fixed training set. The interview signal is whether you can map model class to engineering effort:

| Model class | Engineering | Why |
| --- | --- | --- |
| Linear / logistic | Heavy: scale, encode, interact, transform | Linear function. Every nonlinearity and interaction has to be in the feature. |
| GBDT (XGBoost / LightGBM) | Light: encode categoricals, handle missing | Scale-invariant; captures nonlinearity and interactions for free; native missing handling. |
| Deep tabular (DLRM, FT-Transformer) | Moderate, like linear: encode, scale, transform | Despite the hype, deep tabular benefits from the same hand-engineering as linear. Not magic. |
| Deep on text / image / audio | Minimal: let the net learn representations | Convolutions, attention, tokenisation are the feature engineering. |

Feature engineering shifts structure from parameters (estimated from finite data, with variance) into features (computed exactly, with none). Every fact you encode is one fewer parameter to learn — and one fewer direction your variance can leak through.

Categorical encoding — the four canonical choices

Every feature-engineering interview ends up here. Four practical encodings; pick by cardinality, model class, and leakage risk.

| Encoding | How | High cardinality | Leak risk | Best with |
| --- | --- | --- | --- | --- |
| One-hot | k binary columns | Bad: 1M IDs → 1M cols | None directly; high-cardinality one-hot in trees can still memorize rare levels | LR / NN, low cardinality |
| Ordinal | Integers 0..k−1 | OK in storage, wrong semantically | None | Trees w/ meaningful order; almost never LR |
| Target encoding | E[y \| cat], smoothed | Great | High: needs OOF | GBDT w/ careful CV; CatBoost native |
| Feature hashing | hash(cat) mod B | Excellent: constant memory | None (but collisions) | Billion-user ads/recsys, online learning |
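The "hash(cat) mod B" row can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production hasher: the function name `hash_bucket`, the bucket count, and the choice of MD5 as a stable digest are all assumptions made for the example.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # Python's built-in hash() is salted per process for strings,
    # so a fixed digest is used to keep train-time and serve-time buckets identical.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

N_BUCKETS = 1024  # hypothetical B; memory is constant in the number of distinct IDs

seen = hash_bucket("user_12345", N_BUCKETS)
unseen = hash_bucket("user_never_seen_before", N_BUCKETS)  # new IDs still get a valid bucket
```

A new category never fails to encode; it simply lands in some bucket, possibly sharing it with existing IDs. That sharing is the collision cost noted in the table.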

Details that matter at senior level:

The target-encoding trap

Target encoding replaces category c with an estimate of E[y | c], often smoothed toward the global mean:

enc(c) = (n_c · ȳ_c + m · ȳ) / (n_c + m)

where n_c is the count, ȳ_c the in-category mean, ȳ the global mean, m a smoothing prior. It collapses arbitrary cardinality into one numeric feature with the right shape.
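To see what the smoothing prior does, here is a small worked sketch of the formula above. The numbers (a global rate of 0.10, m = 20, one rare and one common category) are made up for illustration.

```python
def smoothed_mean(n_c: int, cat_mean: float, global_mean: float, m: float) -> float:
    """enc(c) = (n_c * cat_mean + m * global_mean) / (n_c + m)."""
    return (n_c * cat_mean + m * global_mean) / (n_c + m)

global_mean = 0.10  # hypothetical overall positive rate
m = 20.0            # hypothetical smoothing prior

# Rare category: 2 rows, both positive. The raw mean of 1.0 is mostly noise.
rare = smoothed_mean(n_c=2, cat_mean=1.0, global_mean=global_mean, m=m)

# Common category: 2000 rows at a 0.30 rate. The raw mean is trustworthy.
common = smoothed_mean(n_c=2000, cat_mean=0.30, global_mean=global_mean, m=m)

print(round(rare, 3), round(common, 3))  # 0.182 0.298
```

The rare category is pulled most of the way back toward the global mean, while the common one barely moves. Smoothing reduces variance on rare levels; it does not by itself remove the leak described next.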

The trap: compute ȳ_c on the full training set, and row i's feature contains row i's own label. The "feature" is a partial decoding of the target.

train AUC       0.96  ← model can "see" the label through the feature
validation AUC  0.95  ← same leaky encoding reused
production AUC  0.55  ← new categories: leak vanishes, model collapses

The fix is out-of-fold target encoding: split training into K folds; for each fold, compute the encoding from the other K−1 folds. CatBoost generalises this with ordered target statistics: process rows in a fixed permutation, and for row i use only rows 1..i−1.
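The K-fold version described above can be sketched in plain Python. This is an illustrative implementation under simplifying assumptions: folds are assigned by row index modulo K (in practice you would shuffle first or use a library splitter), and the function name `oof_target_encode` and its parameters are hypothetical.

```python
from collections import defaultdict

def oof_target_encode(cats, ys, n_folds=5, m=10.0):
    """Out-of-fold smoothed target encoding.

    Each row is encoded from rows in the other folds only, so its own
    label never enters its own feature. Requires n_folds >= 2.
    """
    n = len(cats)
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic fold assignment
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, total_n = 0.0, 0
        for i in range(n):
            if fold_of[i] != fold:  # statistics come from the other K-1 folds
                sums[cats[i]] += ys[i]
                counts[cats[i]] += 1
                total += ys[i]
                total_n += 1
        prior = total / total_n  # global mean over the other folds
        for i in range(n):
            if fold_of[i] == fold:
                c = cats[i]
                # A category unseen in the other folds falls back to the prior.
                encoded[i] = (sums[c] + m * prior) / (counts[c] + m)
    return encoded

cats = ["a", "a", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = oof_target_encode(cats, ys, n_folds=5, m=10.0)
```

Category "b" appears once with label 1, so a full-data group mean would encode it as exactly its own label. Out of fold, that row sees no other "b" rows and gets the prior instead.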

Interview gotcha
Near-perfect offline scores with target-encoded high-cardinality features? First question: "how did you compute the encoding?" If the answer is "df.groupby(cat).target.mean()", the model is leaking and the offline numbers are fiction.

Numerical features — when to scale

| Model | Scale? | Why |
| --- | --- | --- |
| k-NN, k-means, SVM (RBF) | Yes | Distances are dominated by the largest-scale feature. Income in dollars eclipses age in years. |
| Neural networks | Yes | Gradient magnitudes scale with input; large inputs saturate activations and slow convergence. |
| L1 / L2 regularised linear | Yes | Penalty is uniform; unscaled features get under- or over-penalised. |
| Unregularised linear | Optimum: no. Optimiser: yes | OLS is scale-equivariant; gradient descent is faster on well-conditioned inputs. |
| Trees / GBDT / RF | No | Splits are scale-invariant. x > 17 partitions the same rows as 2x > 34. |
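The first row of the table is easy to check numerically. The sketch below uses three made-up rows of (income, age) and compares Euclidean distances before and after z-scoring; the helper names are hypothetical, and the statistics are computed on all three rows only to keep the example short.

```python
import math
import statistics

# Hypothetical rows: (income in dollars, age in years).
rows = [(50_000.0, 25.0), (51_000.0, 65.0), (80_000.0, 26.0)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Z-score each column. In a real pipeline, fit mean/std on the training split only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((x - mu) / sd for x, mu, sd in zip(r, means, stds)) for r in rows]

# Row 1 differs from row 0 by 40 years of age; row 2 differs by $30k of income.
raw_age_gap = euclidean(rows[0], rows[1])
raw_income_gap = euclidean(rows[0], rows[2])

z = standardize(rows)
z_age_gap = euclidean(z[0], z[1])
z_income_gap = euclidean(z[0], z[2])
```

On raw values the 40-year age gap is almost invisible next to the income gap; after standardising, the two gaps are of comparable size, so a distance-based model can use both features.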

Four scalers worth knowing by name:

Binning, missing values, and informative absence

Binning converts continuous x into k buckets. For linear models, this captures nonlinearity (each bucket gets its own coefficient) and absorbs outliers. The cost is within-bucket information loss. Quantile-based (equal-count) almost always beats fixed-width (equal-spacing). For trees, binning is unnecessary — the tree picks its own cuts.
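The difference between the two binning schemes shows up as soon as the data is skewed. A minimal sketch, assuming a made-up feature with one large outlier; `quantile_edges`, `fixed_width_edges`, and `assign_bin` are hypothetical helper names.

```python
import bisect
import statistics
from collections import Counter

def quantile_edges(values, k):
    """Interior cut points that split `values` into k roughly equal-count buckets."""
    return statistics.quantiles(values, n=k, method="inclusive")

def fixed_width_edges(values, k):
    """Interior cut points that split [min, max] into k equal-width buckets."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def assign_bin(x, edges):
    """Index of the bucket that x falls into, given sorted interior edges."""
    return bisect.bisect_right(edges, x)

def bin_counts(values, edges):
    counts = Counter(assign_bin(v, edges) for v in values)
    return [counts.get(b, 0) for b in range(len(edges) + 1)]

# Hypothetical skewed feature: mostly small values plus one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

quantile_counts = bin_counts(values, quantile_edges(values, 4))
fixed_counts = bin_counts(values, fixed_width_edges(values, 4))

print(quantile_counts)  # [3, 2, 2, 3]
print(fixed_counts)     # [9, 0, 0, 1]
```

Fixed-width bins spend three of four buckets on the gap between 9 and 1000, leaving the linear model one coefficient for nearly all the data. As with any fitted preprocessing step, the cut points should come from the training split.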

Missing values have four standard treatments:

| Treatment | Cost | When to use |
| --- | --- | --- |
| Drop the row | Loses signal; biased if missingness isn't random | Almost never. Only if missingness is rare and uninformative. |
| Mean / median / mode | Underestimates variance | Default for prototypes. Always add an is_missing indicator. |
| Model-based (k-NN, MICE) | Expensive; leaky if done globally | When features correlate and compute allows. |
| Native (GBDT, CatBoost) | None (built in) | Usually best. The tree learns a default direction for missing values at each split. An explicit is_missing indicator can still help: it exposes missingness as its own feature, so the model can split on the fact of absence directly. |

Missingness is a feature
In churn, fraud, and credit, the fact that a user didn't fill a field in is often more predictive than the value they would have filled in. Median-imputing without an is_missing column throws that signal away. The two-column pattern (impute + indicator) costs nothing.
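The two-column pattern can be sketched as follows. This is a minimal illustration in which `None` marks a missing entry and the column values are invented; the function name `impute_with_indicator` is hypothetical.

```python
import statistics

def impute_with_indicator(values):
    """Return (imputed_values, is_missing) for a column where None marks a missing entry."""
    observed = [v for v in values if v is not None]
    # In a real pipeline, compute the fill value on the training split only.
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return imputed, is_missing

income = [40_000, None, 55_000, 70_000, None]
income_filled, income_is_missing = impute_with_indicator(income)

print(income_filled)      # [40000, 55000, 55000, 70000, 55000]
print(income_is_missing)  # [0, 1, 0, 0, 1]
```

Without the second column, rows 1 and 4 are indistinguishable from a user who actually reported the median.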

Interaction features — only for linear-ish models

Logistic regression cannot represent x_i · x_j without you handing it the product. Three approaches:

For GBDTs, don't bother. A split on x_i followed by a split on x_j in the subtree is the interaction. Manual crosses on a GBDT add noise.
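For the linear case, handing the model the product can be as simple as appending pairwise crosses to each row. A minimal sketch with invented feature names; `add_pairwise_products` is a hypothetical helper, and in practice you would cross only the pairs that domain knowledge suggests, since the number of crosses grows quadratically.

```python
from itertools import combinations

def add_pairwise_products(row: dict) -> dict:
    """Return a copy of `row` with one extra feature per unordered pair of inputs."""
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

row = {"age": 30.0, "income": 2.0, "tenure": 4.0}
crossed = add_pairwise_products(row)
# crossed now also holds "age*income", "age*tenure", and "income*tenure".
```

Each product becomes one more column with its own coefficient, which is how a logistic regression gets access to an interaction it cannot form itself.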

Time-based features and the leakage they invite

The standard kit: lags, rolling aggregates, calendar features, and time-since-last-event.

The rule junior candidates violate: only use information available at prediction time. A rolling mean centred on t uses data from after t — a leak. A "user lifetime value" feature computed on the full dataset includes future events — a leak. Compute every feature as if you were producing it online at t, with strict "no information from > t".
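The "strictly before t" rule for rolling aggregates can be made concrete. The sketch below assumes a time-ordered, evenly spaced series of made-up values; `trailing_mean` is a hypothetical helper.

```python
def trailing_mean(values, window):
    """Mean of up to `window` observations strictly before each position.

    Position t only sees values[t-window:t], never values[t] or anything later.
    Returns None where there is no history yet.
    """
    out = []
    for t in range(len(values)):
        history = values[max(0, t - window):t]
        out.append(sum(history) / len(history) if history else None)
    return out

daily_spend = [10.0, 20.0, 30.0, 40.0, 50.0]  # ordered by time
feature = trailing_mean(daily_spend, window=3)

print(feature)  # [None, 10.0, 15.0, 20.0, 30.0]
```

A centred window at position 2 would have averaged 20, 30, and 40, which includes an observation from the future. With pandas, the same effect is usually achieved by shifting the series by one step before applying a rolling mean.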

Target leakage — the silent killer

If a model looks too good to be true, it usually is. Five common forms of leakage to scan for:

| Form | Example |
| --- | --- |
| Future information | "Total purchases in 30 days" computed after the conversion you're predicting. |
| Proxy of the label | "Transaction was reversed" used to predict fraud. Reversal happens because of fraud. |
| Preprocessing across split | Fitting scaler / imputer / encoder on the full dataset before splitting. Test statistics leak into training. |
| Group leakage | Same user_id in train and test in a churn task. The model memorises IDs. |
| Imputation leakage | Imputing with a global mean computed including test rows. Subtle, always wrong. |
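Two of these forms, group leakage and preprocessing across the split, can be avoided mechanically. The sketch below is one possible approach on synthetic rows, not a prescribed recipe: it assigns whole users to a split via a stable hash of the ID, then fits the scaling statistics on training rows only. The names `split_of`, `x`, and `x_scaled` are hypothetical.

```python
import hashlib
import statistics

def split_of(user_id: str, test_fraction: float = 0.2) -> str:
    """Assign every row of a user to the same split via a stable hash of the ID."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000
    return "test" if bucket < test_fraction * 1000 else "train"

# Synthetic data: 50 users, 10 rows each.
rows = [{"user_id": f"u{i % 50}", "x": float(i)} for i in range(500)]

train = [r for r in rows if split_of(r["user_id"]) == "train"]
test = [r for r in rows if split_of(r["user_id"]) == "test"]

# Fit the scaler on training rows only, then apply the same statistics to test.
train_x = [r["x"] for r in train]
mu = statistics.fmean(train_x)
sd = statistics.pstdev(train_x)
for r in train + test:
    r["x_scaled"] = (r["x"] - mu) / sd

train_users = {r["user_id"] for r in train}
test_users = {r["user_id"] for r in test}
```

Because the split is a function of user_id alone, no user can appear on both sides, and because mu and sd never see a test row, the test set stays an honest stand-in for production.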

The senior reflex: when an offline score looks suspiciously high, the first question is "what's the leak?", not "is the model good?". The second: "what would this feature look like computed at inference time, in production?" If it differs from the training-time feature, that's the leak.

Interactive · encoding showdown

Synthetic binary classification with a high-cardinality user_id (100 levels, target rate varies by user). Pick an encoding and a model. Three AUCs: train, validation (rows from users seen in training), and "production" (rows from new users never seen). The honesty gap exposes leakage and out-of-vocab failure.

Encoding vs model vs honesty
Try target-leak first (train and val look amazing, production collapses). Then switch to out-of-fold target encoding — all three close. Then hashing — small drop in train AUC, no leak, generalises to new users.

When NOT to engineer features

Interview prompts you should be ready for

  1. "Walk through the target-encoding leakage failure mode." (Encoding on full train includes each row's own label. Train and val both see the leak; production sees new rows / categories and the leak vanishes. Fix: out-of-fold or ordered target statistics.)
  2. "Your XGBoost has AUC 0.95 on val, 0.55 in prod. Hypothesise." (Leak. Candidates: target-encoded feature without OOF; preprocessing fit on full data; group leakage; feature derived from future timestamp; imputation using global stats.)
  3. "You have a categorical with 10M levels. How do you encode?" (One-hot is out. Target encoding with OOF, or hashing into ~10⁵–10⁶ buckets for constant memory and online learning. For GBDT: library-native categorical handling.)
  4. "When do you scale features?" (Always for distance-based methods, NNs, regularised linear. Never for trees. For unregularised linear: optimum unchanged, optimiser converges faster on well-conditioned features.)
  5. "Missing-value imputation — strategies and trade-offs." (Drop almost never. Mean/median impute + is_missing indicator is the prototype default. Model-based when features correlate. GBDT/CatBoost handle natively, usually best.)
  6. "How do you feature-engineer time-series?" (Lags, rolling aggregates, calendar features, time-since-last-event. Rule: only information available at prediction time. Validate with a temporal split, never random k-fold.)
Takeaway
Feature engineering is bias-variance engineering with prior knowledge: move structure from "things the model has to learn" into "things you already know". Pick encoding by cardinality and model class. Scale only when the model needs it. Add is_missing indicators for free signal. High-cardinality categoricals: out-of-fold target encoding or hashing — never the leaky groupby. When an offline number looks too good, find the leak before celebrating.