Bagging & random forests
Lesson 4's diagnosis: a single decision tree is low-bias, high-variance. This lesson is the fix. Bagging averages the variance away; random forests sharpen the trick by decorrelating the averaged trees.
Bagging, from first principles
You have a training set D of size n and a learner that gives noticeably different predictions at the same x for different draws of D (a tree, basically). Breiman 1996 — bootstrap aggregation:
- Draw B bootstrap samples D^{(1)}, …, D^{(B)}, each sampled with replacement from D, same size n.
- Fit f̂^{(b)} on each.
- Predict by averaging (regression) or majority vote (classification): f̂_{bag}(x) = (1/B) Σ_b f̂^{(b)}(x).
One line of algebra. If B models give i.i.d. predictions with variance σ², the average has variance σ²/B. Each model is unbiased; the average is unbiased. Bagging keeps bias the same and crushes variance proportionally to 1/B.
The correlation problem — and the random forest fix
The i.i.d. assumption above is a lie. Bootstrap samples of the same D overlap heavily; trees fit on them tend to make similar splits on the same dominant features. Their predictions are correlated. If pairwise correlation is ρ and individual variance is σ²:
The second term vanishes as B → ∞. The first does not. The variance of an infinite bag floors at ρ σ². If your bagged trees are 80% correlated, the most variance reduction bagging can give you is 20%, and doubling B past a few hundred buys nothing.
Breiman 2001's random forest attacks ρ directly. At every split, restrict the splitter to a random subset of m features out of p. Two trees on similar bootstrap samples will, by luck of the feature subsample, often split on different features at the top — structures diverge — predictions decorrelate. Lower ρ means the bag's average gets closer to σ²/B and keeps improving with more trees.
The two knobs that matter
| Knob | Effect on bias | Effect on variance | Default / guidance |
|---|---|---|---|
| B — number of trees | ≈ unchanged | ↓ then plateaus at ρ σ² | 100–1000. More is never worse for accuracy in expectation, given that all trees come from the same procedure — the LLN stabilizes the average of those marginal predictions, but each tree's own bias-variance is fixed by max-depth/mtry. Stop when OOB error stops improving. |
m — features per split (mtry, max_features) |
↑ as m shrinks (each tree is weaker) | ↓ as m shrinks (ρ falls) | Classical: m = √p for classification, m = p/3 for regression. Tune by OOB error if the dataset is small. |
Asymmetric roles. B is a compute knob with monotone returns; m is a bias-variance knob with a sweet spot. "RFs don't need tuning" really means they don't need much tuning of B — set it large and forget. m still rewards attention on smaller datasets.
Out-of-bag error — free cross-validation
Each bootstrap picks n rows with replacement from n originals. What fraction get omitted? P(row i not picked in one draw) = 1 − 1/n. P(not picked in n draws) = (1 − 1/n)^n. As n → ∞:
Each tree's bootstrap leaves out roughly 37% of training rows. Flip it: for each row, roughly 37% of the B trees never saw it. Score the row using only those trees and you have an honest held-out prediction. Average across rows for the OOB error. Built-in cross-validation at zero extra compute; empirically tracks test error within a percent or two.
Feature importance — two methods, both with caveats
| Method | What it measures | Bias / failure mode |
|---|---|---|
| Mean decrease in impurity (MDI / Gini importance) | Sum of impurity decrease attributable to splits on the feature, averaged across trees. | Favours high-cardinality features and continuous features — more candidate thresholds means more chances for a spurious good split. Computed from training data, so a feature that's only useful for memorisation looks important. |
| Permutation importance | Shuffle a feature's column on the OOB rows, measure the drop in OOB accuracy. Repeat per feature. | Computed on held-out data, so less biased toward overfitting. But correlated features split their importance: if x_1 and x_2 are duplicates, permuting either one alone changes little, and both look unimportant. |
Neither is ground truth. In an interview, naming both — and noting that SHAP (lesson 12) is the modern best-practice for credible attributions — is the right level of nuance.
Why a forest of overfitters generalizes
Random forests typically use unpruned, full-depth trees. Each tree has near-zero training error and is, alone, a poster child for overfitting. Yet the ensemble generalizes. Apply the decomposition: each tree has low bias and high variance. The average has the same bias (average of equal biases is that bias) and lower variance (by the formula). The forest inherits the trees' expressive power and discards their wobble. Pruning individuals would raise bias for already-controlled variance — net loss.
Extra-Trees, briefly
Geurts et al. 2006's Extremely Randomized Trees push decorrelation further: (1) skip bootstrapping — use the full training set; (2) at each split, after the random feature subset, pick a random threshold instead of the best one. Each tree is weaker (higher bias), cheaper to fit, less correlated with siblings (lower variance). Usually within a hair of RF on accuracy, much faster to train.
Interactive · feel the variance collapse
1D regression on a noisy sine. Thin grey curves are individual trees; bold blue is the ensemble average. Slide B up — wobble disappears. Cut max-depth — see the bias floor that no amount of averaging can fix.
RF vs GBDT — when you reach for which
| Situation | Pick | Why |
|---|---|---|
| Small, noisy dataset, limited tuning budget | RF | Robust to defaults; averaging is a strong regularizer with small n. GBDT overfits without careful learning-rate and tree-count tuning. |
| Large dataset, accuracy is the bottom line, time to tune | GBDT (lesson 6) | Sequential bias-reduction beats variance-reduction-of-overfitters on enough data. State of the art on tabular. |
| Want validation for free | RF | OOB built in; GBDT needs an explicit validation set for early stopping. |
| Want roughly-calibrated probabilities out of the box | RF | Averaging leaf-frequencies gives reasonable probabilities. GBDT log-loss outputs are moderately calibrated — better than hinge-loss models, worse than well-fitted logistic regression — and benefit from Platt or isotonic for production use. |
| Maximum training parallelism | RF / Extra-Trees | Trees are independent — embarrassingly parallel. GBDT is sequential (later trees depend on earlier residuals). |
The full landscape
| Single deep tree | Bagged trees | Random forest | Extra-Trees | GBDT | |
|---|---|---|---|---|---|
| Bias | low | low | low | slightly higher | very low (tunable) |
| Variance | very high | medium (floor at ρσ²) | low | lower | medium-high (tunable) |
| Sensitivity to defaults | moderate | low | low | low | high |
| OOB available | n/a | yes | yes | no (no bootstrap) | no |
| Parallel training | n/a | full | full | full | sequential |
| Probability calibration | poor | fair | fair | fair | moderate; benefits from Platt/isotonic for production |
| Typical accuracy on tabular | weak | good | strong | strong | strongest |
Interview prompts you should be ready for
- "Derive the OOB ~37% from first principles." (P(not picked in one draw)=1−1/n. P(not picked in n draws)=(1−1/n)^n → 1/e ≈ 0.368.)
- "Why does feature subsampling at each split help?" (Reduces inter-tree correlation ρ. Var of the average is ρσ² + (1−ρ)σ²/B; lowering ρ lowers the floor that bagging hits.)
- "Your RF has 100 trees, 0.05 train and 0.30 test error. What now?" (High variance. More trees won't help past the bagging plateau — lower ρ. Reduce
max_features, reduce depth, get more data, or audit dominant correlated features.) - "How do you get probabilities from RF, and are they calibrated?" (Average per-tree leaf class frequencies. Decent near 0.5, less so near 0/1 — averaging pulls toward centre. Use isotonic on held-out if calibration matters.)
- "RF vs GBDT — when do you pick which?" (RF for small/noisy data, no tuning budget, free OOB, rough calibration, parallelism. GBDT when accuracy matters most and you'll tune.)
- "Permutation vs Gini importance — when does each mislead?" (Gini biased toward high-cardinality/continuous features. Permutation is honest on OOB but collapses under feature correlation: duplicates each look unimportant because the signal flows through the other.)