Traditional ML — MLE interview prep

A linearized read on the pre-deep-learning ML canon — trees, regression, classification, feature engineering, and the trade-offs between them. The interview territory that hasn't gone away.

Why "traditional" still matters in 2026

The instinct of a junior candidate is that deep learning subsumes everything. The instinct of a senior interviewer is the opposite: deep learning is the right tool for a narrow set of problems (perception, language, very-high-dimensional structured input). For most production ML — tabular fraud detection, churn prediction, pricing, ranking on small catalogues, demand forecasting — the right answer in 2026 is still gradient-boosted trees with thoughtful feature engineering, regularized linear models, or a hybrid.

The reason isn't nostalgia. It's that for tabular data with thousands of features and millions of rows, GBDTs and well-tuned linear models routinely beat deep models on accuracy, train in minutes instead of hours, are interpretable to non-ML stakeholders, and don't require GPUs. Most companies aren't Meta or Google — they're optimizing fraud, churn, or pricing on a tabular dataset, and the optimal model is almost always XGBoost or a calibrated logistic regression. Knowing why that's true, and where it stops being true, is what the interview tests.

What an interviewer is really testing

Four orthogonal skills. Strength in one doesn't predict the others.

Skill	What good looks like	What weak looks like
Reason from first principles	Asked "why does L2 regularization help?", you derive it from the bias-variance decomposition or from a Bayesian prior, not from "it's in the loss". You distinguish ridge from lasso by the geometry of their constraint sets.	You repeat that "regularization prevents overfitting" without being able to say what overfitting is or how penalizing weights addresses it.
Pick the right model for the problem	Given a tabular task, you can defend a choice between logistic regression, random forest, and XGBoost in 60 seconds — with reference to feature interactions, monotonicity, missing data, training time, and interpretability requirements.	You default to XGBoost for everything and can't name a case where logistic regression beats it.
Feature engineering instinct	You can look at a tabular schema and name 5 things the obvious model would miss: cross-features, target-encoding traps, missingness as signal, time-since-event, lag features. You know when NOT to engineer features (GBDTs handle interactions; deep models discover them).	You feed raw columns to the model and rely on hyperparameter tuning to fix the gap.
Quantitative trade-offs	You know which model trains in minutes vs hours, which scales to billions of rows vs millions, which handles 10⁴ features vs 10⁷, which gives calibrated probabilities out of the box. You ask "how big is the dataset?" before recommending an algorithm.	Qualitative pros and cons without numbers. No sense of the regimes where each algorithm dominates.

The lessons

The first lesson establishes the bias-variance frame that every subsequent lesson uses. Then we go: linear models → trees → ensembles → kernel methods → generative → unsupervised → feature engineering → evaluation → interpretability. Each lesson can be read standalone but they reinforce each other.

Bias-variance & generalization

The foundational decomposition. Why test error has two parts and how every model-selection choice trades them off. Learning curves, double descent, what regularization actually does.
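
For reference, the decomposition this lesson builds on can be written as follows. This is the standard squared-error form, assuming data generated as y = f(x) + ε with zero-mean noise of variance σ²; the expectation is over training sets and noise, and the symbols are notation chosen here rather than taken from the lesson itself.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Regularization, tree depth, and ensemble size all move a model along the first two terms; none of them can touch the third.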

Linear regression — OLS & regularization

The closed-form solution and where it breaks. Assumptions (Gauss-Markov), multicollinearity, ridge / lasso / elastic-net as both Bayesian priors and constraint geometries. When linear is enough.
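
To make the shrinkage effect concrete, here is a minimal sketch of the ridge closed form in the simplest possible setting: one feature, no intercept. The function name and the toy data are illustrative assumptions, not part of the lesson.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for a single feature with no intercept.

    Minimizes sum((y - w*x)^2) + lam * w^2, whose minimizer is
    w = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data lying exactly on y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_ols = ridge_1d(xs, ys, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_1d(xs, ys, lam=30.0)  # the penalty shrinks the weight toward 0

print(w_ols, w_ridge)
```

The penalty only appears in the denominator, so a larger lam always pulls the weight toward zero but never flips its sign. The multi-feature version replaces the scalar division with solving a linear system in (XᵀX + λI).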

Logistic regression & GLMs

MLE derivation, the logit link, why BCE is the natural loss. The GLM family (Poisson, gamma, Tweedie). Multinomial vs one-vs-rest. Why logistic regression is still the right choice surprisingly often.
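
One way to see why BCE pairs naturally with the logit link: the gradient of the loss with respect to the logit collapses to (prediction minus label). The sketch below checks that identity numerically for a single example; the helper names and the test point are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(z, y):
    """Binary cross-entropy for one example, as a function of the logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_grad(z, y):
    """Analytic derivative of bce with respect to z: sigmoid(z) - y."""
    return sigmoid(z) - y


# Compare the analytic gradient with a central finite difference.
z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2.0 * eps)
analytic = bce_grad(z, y)

print(numeric, analytic)
```

The same "residual times input" gradient shows up for every GLM with its canonical link, which is part of why the family is treated together.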

Decision trees

Splitting criteria (gini, entropy, MSE) — why they're nearly identical in practice. Greedy growth and pruning. Why a single tree is a weak learner but the right scaffolding for ensembles.
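
A quick way to see why the classification criteria behave so similarly is to evaluate them on a binary node. The sketch below assumes a two-class node summarized by its positive-class fraction p; the function names are illustrative.

```python
import math


def gini(p):
    """Gini impurity of a binary node: 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    return 2.0 * p * (1.0 - p)


def entropy(p):
    """Shannon entropy in bits of a binary node with positive fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a pure node has zero impurity
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both curves are zero at pure nodes, symmetric, concave, and peak at p = 0.5, so they tend to rank candidate splits in a similar order even though their absolute values differ.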

Bagging & random forests

Bootstrap aggregation, why averaging trees reduces variance, the feature-subsampling trick that decorrelates them. OOB error as free validation. When RF beats GBDT (and when it doesn't).
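
The variance argument can be put in one formula. For an average of B identically distributed predictors, each with variance σ² and pairwise correlation ρ, the variance of the average is ρσ² + (1 − ρ)σ²/B. The sketch below just evaluates that expression; the function name and the example numbers are assumptions for illustration.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees equally correlated predictors.

    sigma2: variance of a single tree's prediction
    rho:    pairwise correlation between trees
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Adding trees only shrinks the second term; the rho * sigma2 floor remains.
for rho in (0.0, 0.5, 0.9):
    print(rho, ensemble_variance(1.0, rho, n_trees=100))
```

This is why feature subsampling matters: more trees cannot push the variance below ρσ², so lowering ρ is the only way to lower the floor.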

Gradient boosting & GBDT

From AdaBoost to gradient boosting to XGBoost. Why boosting reduces bias where bagging reduces variance. XGBoost's regularization, LightGBM's histogram + leaf-wise growth, CatBoost's ordered boosting. The de-facto tabular winner.
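
As a rough illustration of the core loop, here is a minimal gradient-boosting sketch for squared error on a single feature, using regression stumps as the base learner. It is a toy under stated assumptions (one feature, exhaustive threshold search, no regularization) and is not how XGBoost, LightGBM, or CatBoost are implemented; all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit an additive model: constant baseline plus shrunken stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For 0.5 * (y - pred)^2 the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on ten points (illustrative only).
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

mse_0 = mse(boost(xs, ys, n_rounds=0), xs, ys)    # constant baseline
mse_1 = mse(boost(xs, ys, n_rounds=1), xs, ys)
mse_50 = mse(boost(xs, ys, n_rounds=50), xs, ys)
print(mse_0, mse_1, mse_50)
```

Each round fits a weak learner to what the current ensemble still gets wrong, which is the sense in which boosting attacks bias. Training error keeps falling with more rounds, so the number of rounds and the learning rate are the knobs that control overfitting.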

SVMs & kernels

The max-margin principle, the dual form, the kernel trick. Why SVMs were dominant in the 2000s and where they still win. Linear vs polynomial vs RBF kernels and what each implies geometrically.
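
The kernel trick is easiest to believe once you have checked it by hand. The sketch below takes the homogeneous degree-2 polynomial kernel on 2-D inputs and confirms that it equals an ordinary dot product in an explicit 3-D feature space. Function names and the sample points are illustrative assumptions.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    dot = sum(a * b for a, b in zip(x, z))
    return dot ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs that reproduces poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
z = (3.0, -1.0)

k_implicit = poly2_kernel(x, z)
k_explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))

print(k_implicit, k_explicit)
```

The kernel evaluates the inner product without ever building the expanded features. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.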

Naive Bayes & generative vs discriminative

Bayes rule, the conditional-independence assumption, and why NB works despite the assumption being false. The deeper question: generative models (NB, LDA, HMM) vs discriminative (logistic regression, trees). When each is the right choice.

Clustering & dimensionality reduction

K-means, Lloyd's algorithm, k-means++ initialization. GMMs as soft k-means. PCA as the eigendecomposition of the covariance matrix. t-SNE / UMAP for visualization: what they show, what they hide.

Feature engineering

Encoding (one-hot, target, hashing, frequency), scaling, binning, missing-as-signal, lag/window features, interactions, target leakage. Which model class needs which engineering — and when not to engineer at all.
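
As one example of the encoding trade-offs, here is a minimal sketch of target encoding with additive smoothing, which pulls rare categories toward the global mean. The smoothing form, the parameter m, and the toy data are assumptions chosen for illustration; other smoothing schemes exist.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to a smoothed mean of the target.

    encoding = (sum of targets in category + m * global_mean) / (count + m)
    Larger m pulls low-count categories harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}


# Toy data: "a" has 8 rows (raw mean 0.5), "b" has 2 rows (raw mean 1.0).
categories = ["a"] * 8 + ["b"] * 2
targets = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

encoding = smoothed_target_encoding(categories, targets, m=2.0)
print(encoding)
```

To avoid target leakage, the mapping should be fitted on training folds only and then applied to held-out rows. Fitting it on the same rows it encodes lets each row's own label leak into its feature.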

Evaluation & class imbalance

Beyond AUC: precision/recall/F1, ROC vs PR curves, log-loss, R² and its variants. Class imbalance: when to resample, when to reweight, when to threshold-tune, when to pick a different metric.
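
A small worked example shows why accuracy alone misleads under imbalance. The sketch below compares a classifier that always predicts the majority class with one that actually finds positives, on a synthetic 1% positive dataset. The data and function names are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Define the ratio as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic labels: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

always_negative = summarize(y_true, [0] * 1000)
# Catches 8 of 10 positives at the cost of 12 false alarms.
useful_model = summarize(y_true, [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978)

print(always_negative)
print(useful_model)
```

The do-nothing classifier scores higher on accuracy (0.99 vs 0.986) while finding no positives at all, which is why the choice of metric comes before any resampling or reweighting decision.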

Interpretability & feature importance

SHAP, LIME, permutation importance, partial-dependence plots, ICE. What each method assumes and what it hides. The senior question: when does "interpretable" mean explainable to a stakeholder, and when does it mean something stronger?

How to use this

Linearly. Lesson 1 establishes the bias-variance frame that lessons 2–6 use to explain regularization, tree depth, ensemble size, boosting iterations. Skipping it makes the rest harder.
Touch every widget. Each lesson has at least one interactive component — a slider, a calculator, a toggle. They exist to let you feel a trade-off that prose alone makes abstract.
Read the "interview prompts" boxes. Each lesson ends with 4–6 prompts that interviewers actually ask. The answers in the prose tell you what good looks like, not just the facts.
Don't skip the basics because they're "easy". Senior interviews probe basics with depth, not breadth. "Why does L2 work?" is harder than it looks. "When would you pick logistic regression over XGBoost?" separates good candidates from great ones.

Companion material in this repo

The search_ads_recsys folder covers ranking/retrieval at scale, where deep models dominate. The RL lessons cover RLHF/RLVR for LLMs. The system_ml folder covers GPU/distributed-training internals. This folder is the entrance to all of them: the bias-variance intuition and the feature-engineering instinct transfer directly.

A note on what's "traditional"

The line between traditional and deep ML is fuzzier than the binary suggests. Boosting (1995) predates the mainstream adoption of neural networks. Two-tower retrieval (2016) is "deep" but spiritually traditional. We use "traditional" to mean "the canon you can implement on a laptop with sklearn, that wins on most tabular problems, and that an interviewer will test you on regardless of your deep-learning resume."