Bias-variance & generalization
One of the most frequently asked concepts in ML interviews. Derive it once from first principles and most subsequent regularization, model-selection, and ensemble decisions follow as corollaries.
The setup, stated precisely
You have a true data-generating process y = f(x) + ε, where ε is irreducible noise with E[ε] = 0 and Var(ε) = σ². You don't see f; you see a training set D = {(x_i, y_i)}. You fit a model f̂_D(x) from it. The question of "how good is the model" is: at a fresh test point x_0, what is the expected squared error of your prediction, averaged over all possible training sets D?
Write it out:
Err(x_0) = E_{D,ε}[(y_0 − f̂_D(x_0))²]
where y_0 = f(x_0) + ε and the expectation is over both the noise ε at the test point and the randomness in D.
The decomposition
Add and subtract E_D[f̂_D(x_0)], the expected prediction averaged over training sets, then expand:
E_{D,ε}[(y_0 − f̂_D(x_0))²] = (f(x_0) − E_D[f̂_D(x_0)])² + E_D[(f̂_D(x_0) − E_D[f̂_D(x_0)])²] + σ²
Three terms, each with a name:
| Term | Name | What it measures |
|---|---|---|
| (f(x_0) − E_D[f̂_D(x_0)])² | Bias² | How far the average model (averaged over all training sets) is from the truth. Captures systematic error from a model class that's too simple to represent f. |
| E_D[(f̂_D(x_0) − E_D[f̂_D(x_0)])²] | Variance | How much the prediction wobbles around its mean as the training set changes. Captures sensitivity to which particular training sample you got. |
| σ² | Irreducible | The noise floor. No model — no matter how flexible, no matter how much data — can do better than σ². |
The proof: the test-point noise ε is independent of D and has E[ε] = 0, which kills the (f − f̂)·ε cross term (so the irreducible σ² separates cleanly), and E_D[f̂_D − E_D[f̂_D]] = 0 kills the bias-variance cross term.
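For reference, the two cancellations can be written out step by step. The shorthand f = f(x_0), f̂ = f̂_D(x_0) and f̄ = E_D[f̂_D(x_0)] is introduced here only for brevity; it is not notation from the text above.

```latex
\begin{align*}
\mathbb{E}_{D,\varepsilon}\big[(y_0 - \hat f)^2\big]
  &= \mathbb{E}_{D,\varepsilon}\big[(f + \varepsilon - \hat f)^2\big] \\
  &= \mathbb{E}_D\big[(f - \hat f)^2\big]
     + 2\,\mathbb{E}_{D,\varepsilon}\big[(f - \hat f)\,\varepsilon\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \mathbb{E}_D\big[(f - \hat f)^2\big] + \sigma^2
     \qquad (\varepsilon \text{ independent of } D,\ \mathbb{E}[\varepsilon] = 0) \\[6pt]
\mathbb{E}_D\big[(f - \hat f)^2\big]
  &= \mathbb{E}_D\big[(f - \bar f + \bar f - \hat f)^2\big] \\
  &= (f - \bar f)^2
     + 2\,(f - \bar f)\,\mathbb{E}_D\big[\bar f - \hat f\big]
     + \mathbb{E}_D\big[(\hat f - \bar f)^2\big] \\
  &= \underbrace{(f - \bar f)^2}_{\text{Bias}^2}
     + \underbrace{\mathbb{E}_D\big[(\hat f - \bar f)^2\big]}_{\text{Variance}}
     \qquad (\mathbb{E}_D[\bar f - \hat f] = 0)
\end{align*}
```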
Why "low bias, low variance" isn't free
If bias and variance could be reduced independently, you would simply drive both to zero. They can't. With a fixed amount of data, a learner that reduces bias by fitting the training data more tightly almost always raises variance (sensitivity to which training sample you got). The result is the classical U-shape of test error vs model complexity: at low complexity bias² dominates and test error is high, at high complexity variance dominates and test error rises again, and the minimum sits in between.
The interviewer's instinct: this is what's actually happening when you "regularize" or "tune hyperparameters". You're picking the capacity that minimises bias² + variance + σ², with the irreducible term unaffected. Knowing which lever moves which term is the entire job.
The classical regime — and where it breaks (double descent)
The U-shape is the classical story, valid when model capacity ≪ sample size. Modern deep models live in a different regime where capacity ≫ samples, and the curve is no longer U-shaped: it descends, rises to a peak at the interpolation threshold (where the model has just enough capacity to memorize the training set), then descends again. This is double descent (Belkin et al. 2019, "Reconciling modern machine-learning practice and the classical bias-variance trade-off"). Worth being able to name in an interview, because it helps explain why "overparameterized" deep models don't overfit catastrophically. For most of this folder's content (linear models, trees, GBDTs at sane depths), the classical U-shape holds and is the right intuition.
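A toy illustration of the shape, not of the deep-network setting: minimum-norm least squares fitted on the first p of D Gaussian features with n training points. Everything here is an assumption chosen for illustration (isotropic features, signal spread evenly over all D coordinates, the hypothetical function name `double_descent_curve`, and every constant). In setups of this kind, test error typically spikes near p = n and falls again for p ≫ n.

```python
import numpy as np

def double_descent_curve(ps, n=40, D=200, sigma=0.5, n_trials=20, seed=0):
    """Average test MSE of min-norm least squares using only the first p features.

    Hypothetical toy setup: y = x . beta + noise, with x ~ N(0, I_D).
    p < n is the classical regime, p = n the interpolation threshold,
    p > n the overparameterized regime.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(D) / np.sqrt(D)          # signal spread evenly, ||beta|| = 1
    errs = {p: [] for p in ps}
    for _ in range(n_trials):
        X = rng.normal(size=(n, D))
        y = X @ beta + rng.normal(0.0, sigma, n)
        X_te = rng.normal(size=(1000, D))
        y_te = X_te @ beta + rng.normal(0.0, sigma, 1000)
        for p in ps:
            # pinv gives the least-squares fit for p <= n and the
            # minimum-norm interpolating solution for p > n.
            w = np.linalg.pinv(X[:, :p]) @ y
            errs[p].append(np.mean((X_te[:, :p] @ w - y_te) ** 2))
    return {p: float(np.mean(v)) for p, v in errs.items()}

if __name__ == "__main__":
    for p, err in double_descent_curve([5, 20, 35, 40, 45, 80, 200]).items():
        print(f"p={p:4d}  test MSE={err:.3f}")
```

Printing the curve over a finer grid of p makes the spike around p = n easier to see.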
Interactive · feel the trade-off
Polynomial regression on a noisy sine wave. Slide the polynomial degree and the training-set size. The widget shows train error, test error, the bias² + variance decomposition (estimated via repeated resampling), and the fit itself.
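The same experiment can be approximated offline. The sketch below is not the widget's implementation. It assumes a fixed, evenly spaced design on [−1, 1], a target of sin(πx), and Gaussian noise, and it resamples only the noise to generate training sets. The function name `bias_variance_poly` and all default values are hypothetical.

```python
import numpy as np

def bias_variance_poly(degree, n_train=30, sigma=0.3, n_sets=200, seed=0):
    """Estimate bias^2 and variance of a polynomial fit by repeated resampling.

    Returns (bias2, variance, mse_to_f), each averaged over a test grid.
    mse_to_f is the mean squared error against the noiseless target f, so
    bias2 + variance equals mse_to_f, and adding sigma**2 gives the
    expected error against noisy labels.
    """
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)                 # assumed "true" function
    x_train = np.linspace(-1.0, 1.0, n_train)       # fixed design (assumption)
    x_test = np.linspace(-0.95, 0.95, 50)

    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        # A fresh training set: same inputs, new noise draw.
        y_train = f(x_train) + rng.normal(0.0, sigma, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)

    mean_pred = preds.mean(axis=0)                  # estimate of E_D[f_hat(x_0)]
    bias2 = float(np.mean((f(x_test) - mean_pred) ** 2))
    variance = float(np.mean(preds.var(axis=0)))    # population variance over sets
    mse_to_f = float(np.mean((preds - f(x_test)) ** 2))
    return bias2, variance, mse_to_f

if __name__ == "__main__":
    for degree in (1, 3, 5, 9):
        b2, var, mse = bias_variance_poly(degree)
        print(f"degree={degree}  bias^2={b2:.4f}  variance={var:.4f}  sum={mse:.4f}")
```

Because the variance is computed with the population formula over the same set of fits, bias² + variance matches the error against f exactly rather than approximately, which makes the decomposition easy to sanity-check.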
What the levers do — a cheat sheet you should memorize
| Lever | Bias | Variance | Why |
|---|---|---|---|
| More training data | ≈ | ↓ | Sample-average converges → less wobble across training sets. Doesn't change the model class's expressive limit, so bias is unchanged. |
| More features (relevant) | ↓ | ↑ | More signal to capture (bias ↓), more parameters to estimate (variance ↑). Net effect depends on signal-to-noise. |
| More features (noise) | ≈ | ↑ | No new signal; only new free parameters. Pure variance increase. |
| Stronger L2 regularization | ↑ | ↓ | Constrains weights toward 0 → less expressive (bias ↑), less sensitive to training noise (variance ↓). |
| Deeper trees | ↓ | ↑ | More splits → finer partitions → can match more shapes (bias ↓), but each partition has fewer points (variance ↑). |
| More bagging iterations | ≈ | ↓ | Bagging averages models trained on bootstrap samples. Each model has the same bias; averaging reduces variance. |
| More boosting iterations | ↓ | ↑ | Each iteration fits residuals → reduces bias. But each is a real model fit, so variance accumulates. (Hence early stopping in boosting.) |
| K in K-NN (smaller K) | ↓ | ↑ | K=1 reproduces the training labels exactly, so each prediction at an unseen x_0 is just the nearest neighbour's noisy label: bias is low, variance is huge. Larger K averages over more neighbours, so variance goes down and bias goes up. |
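One row of the table, the L2 lever, can be checked numerically. This is a minimal sketch under stated assumptions: a linear ground truth, a fixed Gaussian design, closed-form ridge regression, and Monte Carlo estimates over noise draws. `ridge_bias_variance` and its defaults are hypothetical names and values, not part of the original material.

```python
import numpy as np

def ridge_bias_variance(lam, n=40, d=5, sigma=1.0, n_sets=300, seed=0):
    """Monte Carlo bias^2 and variance of ridge predictions at penalty lam.

    The design matrix and test points are fixed; only the label noise is
    resampled, so every value of lam sees the same training sets.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X_test = rng.normal(size=(100, d))
    w_true = np.ones(d)                     # assumed true weights
    f_test = X_test @ w_true

    # Ridge solution operator: w_hat = (X^T X + lam I)^{-1} X^T y
    A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

    preds = np.empty((n_sets, X_test.shape[0]))
    for i in range(n_sets):
        y = X @ w_true + rng.normal(0.0, sigma, n)
        preds[i] = X_test @ (A @ y)

    bias2 = float(np.mean((f_test - preds.mean(axis=0)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance

if __name__ == "__main__":
    for lam in (0.0, 1.0, 10.0, 100.0):
        b2, var = ridge_bias_variance(lam)
        print(f"lambda={lam:6.1f}  bias^2={b2:.4f}  variance={var:.4f}")
```

As the penalty grows, the estimated bias² should rise and the variance should fall, which is the ↑ / ↓ pattern in the L2 row.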
Train / validation / test — the protocol that makes the decomposition operational
The decomposition is theoretical. You can't compute bias and variance in practice because you don't know f. What you do is estimate generalization error (the entire sum) on data the model didn't train on. The standard protocol:
- Train set — fit the model parameters.
- Validation set — pick hyperparameters (regularization strength, tree depth, etc.). The hyperparameter that minimises validation error is your estimated bias-variance sweet spot.
- Test set — final unbiased estimate of generalization error. Touched ONCE. If you tune anything on the test set, it's no longer a test set; it's another validation set.
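The three-way protocol above can be sketched on synthetic data. The data-generating process, the 60/20/20 split, the ridge model, and the candidate penalties are all assumptions made for the example; the helper names `ridge_fit` and `mse` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data: 3 informative features, 17 pure-noise features.
rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.normal(size=(n, d))
w_true = np.concatenate([np.ones(3), np.zeros(d - 3)])
y = X @ w_true + rng.normal(0.0, 1.0, n)

# One shuffled split into train / validation / test.
perm = rng.permutation(n)
train, val, test = perm[:180], perm[180:240], perm[240:]

# Fit on train for each candidate, score on validation only.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {
    lam: mse(X[val], y[val], ridge_fit(X[train], y[train], lam))
    for lam in candidates
}
best_lam = min(val_errors, key=val_errors.get)

# The test set is touched exactly once, after the choice is frozen.
w_final = ridge_fit(X[train], y[train], best_lam)
test_error = mse(X[test], y[test], w_final)
print(f"best lambda={best_lam}, val MSE={val_errors[best_lam]:.3f}, test MSE={test_error:.3f}")
```

A common variation refits on train plus validation with the chosen hyperparameter before the single test evaluation; the sketch keeps the simpler version.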
Two common variants:
- k-fold cross-validation — partition data into k folds, train on k−1, validate on the held-out fold, rotate. Reduces variance of the validation-error estimate at k× the compute cost. Standard for small datasets.
- Time-series CV — for temporal data, train on past, validate on future, never the reverse. A random k-fold split with a time column leaks the future into the past and produces wildly optimistic estimates.
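The two splitting schemes differ only in which indices are allowed into the training side. A minimal index-level sketch, with hypothetical helper names `kfold_indices` and `expanding_window_splits`; rows are assumed to be already shuffled for the k-fold case and already sorted by time for the time-series case.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n).

    Every index lands in exactly one validation fold. For i.i.d. data,
    shuffle the rows before calling this.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def expanding_window_splits(n, n_splits):
    """Yield (train_idx, val_idx) where training always precedes validation.

    The training window grows by one block per split. Indices left over
    after the last full block are not used.
    """
    block = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        val = list(range(i * block, (i + 1) * block))
        yield train, val

if __name__ == "__main__":
    for train, val in kfold_indices(10, 3):
        print("k-fold      train:", train, "val:", val)
    for train, val in expanding_window_splits(10, 3):
        print("time-series train:", train, "val:", val)
```

In the k-fold output, validation indices sit on both sides of the training indices, which is exactly the leak to avoid when rows are ordered by time.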
Learning curves — the diagnostic you should know how to read
Plot train error and validation error as the training-set size grows. The shape tells you which problem you have.
| Shape | Diagnosis | What to do |
|---|---|---|
| Train and val errors both high, both flat | High bias — the model class is too simple; more data won't help. | Add features, increase capacity (deeper trees, higher polynomial degree, less L2), pick a more flexible model class. |
| Train error low, val error high, gap is wide and has not closed yet | High variance: the model is overfitting, and additional samples are still helping val. | More regularization, simpler model, more data, feature selection, ensembles (bagging). |
| Train and val errors both low, gap is closing | Well-tuned. | Ship. |
| Train error climbs, val error climbs | Bug or leak. Train error rising with more data is normal, but val error should not rise with it. Usually means the val set is changing systematically (e.g., distribution shift) or a preprocessing leak that inflated earlier scores is now being fixed. | Investigate before tuning anything. |
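The first two rows of the table can be reproduced on synthetic data. This is a sketch under assumptions: the target is sin(πx) with Gaussian noise, the model is a polynomial fit, and errors are averaged over a few repeated draws. `learning_curve` here is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def learning_curve(degree, sizes, sigma=0.3, n_val=500, n_repeats=20, seed=0):
    """Return [(n, mean_train_mse, mean_val_mse), ...] for a polynomial fit."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)                     # assumed true function
    x_val = rng.uniform(-1.0, 1.0, n_val)               # one fixed validation set
    y_val = f(x_val) + rng.normal(0.0, sigma, n_val)

    rows = []
    for n in sizes:
        train_errs, val_errs = [], []
        for _ in range(n_repeats):
            x = rng.uniform(-1.0, 1.0, n)
            y = f(x) + rng.normal(0.0, sigma, n)
            coefs = np.polyfit(x, y, degree)
            train_errs.append(np.mean((np.polyval(coefs, x) - y) ** 2))
            val_errs.append(np.mean((np.polyval(coefs, x_val) - y_val) ** 2))
        rows.append((n, float(np.mean(train_errs)), float(np.mean(val_errs))))
    return rows

if __name__ == "__main__":
    for degree in (1, 9):
        print(f"degree {degree}")
        for n, tr, va in learning_curve(degree, [20, 40, 80, 160]):
            print(f"  n={n:4d}  train={tr:.3f}  val={va:.3f}  gap={va - tr:.3f}")
```

With these settings the degree-1 fit should show train and validation errors that are both high and close together (the high-bias row), while the degree-9 fit should start with a wide gap that narrows as n grows (the high-variance row).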
The senior question: when does the decomposition mislead?
Three regimes where the classical bias-variance frame is incomplete:
- Distribution shift. The decomposition assumes test data comes from the same distribution as train. If it doesn't (covariate shift, concept drift, label shift), a low held-out error measured at training time doesn't translate to low deployment error. Domain adaptation and continuous retraining are separate machinery.
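- Overparameterization (double descent). Discussed above. The U-shape gives way to a second descent at very high capacity. The variance term no longer grows monotonically with capacity once the model can interpolate: in the settings where this has been analysed, it peaks near the interpolation threshold and then shrinks.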
- Adversarial / out-of-distribution inputs. An attacker who can pick x_0 isn't sampling from the data distribution. Robustness is a different objective from average-case generalization.
Junior candidates often state "regularization reduces variance" as a slogan; senior candidates know there are well-defined regimes where the slogan is wrong or insufficient. Naming the regime is the test.
Interview prompts you should be ready for
- "Derive the bias-variance decomposition from E[(y − f̂)²]." (The proof in two lines: add-and-subtract E_D[f̂], expand, the cross term vanishes.)
- "You trained a random forest. Train error is 0.05, test error is 0.30. What do you do?" (High variance: the gap is the symptom. What helps: more data, shallower trees or a larger minimum leaf size, more aggressive feature subsampling (fewer candidate features per split). What doesn't: using fewer trees, or adding more trees once the ensemble is already large enough for its error to plateau.)
- "Why does L2 regularization reduce variance?" (Constraint on the parameter space → less sensitivity to training noise. Equivalently: MAP estimate under a Gaussian prior; equivalently: the constraint ‖w‖² ≤ t shrinks the model class's effective capacity.)
- "Your boss says 'add more features, the model will be more accurate.' Defend or refute." (Depends on signal-to-noise. Relevant features lower bias; noise features only raise variance. With finite data and high feature count, you can be in the regime where features hurt.)
- "What's the difference between bias and underfitting? Variance and overfitting?" (Tight but not synonymous: underfitting is observed train+val both high; that's a symptom of high bias. Overfitting is train low, val high; that's a symptom of high variance. Bias/variance are about the expected prediction; under/overfit are about a specific train/val pair.)
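- "Name a case where lower bias produces worse generalization." (Simplest case: an interpolating model has zero training error and low bias but high variance, so it can generalize worse than a smoother, more biased fit. Or: a flexible model in a low-data regime where bias was already low, so the extra flexibility only adds variance.)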