traditional_ml / 05 · bagging & random forests lesson 5 / 12

Bagging & random forests

Lesson 4's diagnosis: a single decision tree is low-bias, high-variance. This lesson is the fix. Bagging averages the variance away; random forests sharpen the trick by decorrelating the averaged trees.

Bagging, from first principles

You have a training set D of size n and a learner that gives noticeably different predictions at the same x for different draws of D (a tree, basically). Breiman 1996 — bootstrap aggregation:

  1. Draw B bootstrap samples D^{(1)}, …, D^{(B)}, each sampled with replacement from D, same size n.
  2. Fit f̂^{(b)} on each.
  3. Predict by averaging (regression) or majority vote (classification): f̂_{bag}(x) = (1/B) Σ_b f̂^{(b)}(x).

One line of algebra. If B models give i.i.d. predictions with variance σ², the average has variance σ²/B. Each model is unbiased; the average is unbiased. Bagging keeps bias the same and crushes variance proportionally to 1/B.

Why bagging loves trees and ignores linear models
Bagging only helps when the base learner has variance to spare. Full-depth trees wobble dramatically with the training set — bagging is tailor-made for them. A linear regression on n >> p already has low variance; averaging bootstrapped fits gives nearly the same coefficients. Bagging is a variance-reduction operator, not a free accuracy boost — apply it where there's variance to reduce.

The correlation problem — and the random forest fix

The i.i.d. assumption above is a lie. Bootstrap samples of the same D overlap heavily; trees fit on them tend to make similar splits on the same dominant features. Their predictions are correlated. If pairwise correlation is ρ and individual variance is σ²:

Var( (1/B) Σ_b f̂^{(b)}(x) ) = ρσ² + (1−ρ)σ²/B

The second term vanishes as B → ∞. The first does not. The variance of an infinite bag floors at ρ σ². If your bagged trees are 80% correlated, the most variance reduction bagging can give you is 20%, and doubling B past a few hundred buys nothing.

Breiman 2001's random forest attacks ρ directly. At every split, restrict the splitter to a random subset of m features out of p. Two trees on similar bootstrap samples will, by luck of the feature subsample, often split on different features at the top — structures diverge — predictions decorrelate. Lower ρ means the bag's average gets closer to σ²/B and keeps improving with more trees.

variance of the average ▲ │ ● │ ●● │ ●●● bagging (high ρ) │ ●●●●●●●●●●●●●●●●●●●●●●●●● ← floor at ρσ² │ ○○ │ ○○○ │ ○○○ random forest (lower ρ) │ ○○○○○○○○○○○○○○○○○○○○ ← lower floor │ ────────────────────────────────► B

The two knobs that matter

KnobEffect on biasEffect on varianceDefault / guidance
B — number of trees ≈ unchanged ↓ then plateaus at ρ σ² 100–1000. More is never worse for accuracy in expectation, given that all trees come from the same procedure — the LLN stabilizes the average of those marginal predictions, but each tree's own bias-variance is fixed by max-depth/mtry. Stop when OOB error stops improving.
m — features per split (mtry, max_features) ↑ as m shrinks (each tree is weaker) ↓ as m shrinks (ρ falls) Classical: m = √p for classification, m = p/3 for regression. Tune by OOB error if the dataset is small.

Asymmetric roles. B is a compute knob with monotone returns; m is a bias-variance knob with a sweet spot. "RFs don't need tuning" really means they don't need much tuning of B — set it large and forget. m still rewards attention on smaller datasets.

Out-of-bag error — free cross-validation

Each bootstrap picks n rows with replacement from n originals. What fraction get omitted? P(row i not picked in one draw) = 1 − 1/n. P(not picked in n draws) = (1 − 1/n)^n. As n → ∞:

lim_{n → ∞} (1 − 1/n)^n = 1/e ≈ 0.368

Each tree's bootstrap leaves out roughly 37% of training rows. Flip it: for each row, roughly 37% of the B trees never saw it. Score the row using only those trees and you have an honest held-out prediction. Average across rows for the OOB error. Built-in cross-validation at zero extra compute; empirically tracks test error within a percent or two.

Where OOB lies to you
(1) If B is small, the per-row OOB committee is small and the estimate is noisy; OOB is reliable mostly when B ≥ 200. (2) OOB rows are still drawn from the training distribution. With covariate shift at deployment, OOB measures in-distribution generalization, not deployment performance.

Feature importance — two methods, both with caveats

MethodWhat it measuresBias / failure mode
Mean decrease in impurity (MDI / Gini importance) Sum of impurity decrease attributable to splits on the feature, averaged across trees. Favours high-cardinality features and continuous features — more candidate thresholds means more chances for a spurious good split. Computed from training data, so a feature that's only useful for memorisation looks important.
Permutation importance Shuffle a feature's column on the OOB rows, measure the drop in OOB accuracy. Repeat per feature. Computed on held-out data, so less biased toward overfitting. But correlated features split their importance: if x_1 and x_2 are duplicates, permuting either one alone changes little, and both look unimportant.

Neither is ground truth. In an interview, naming both — and noting that SHAP (lesson 12) is the modern best-practice for credible attributions — is the right level of nuance.

Why a forest of overfitters generalizes

Random forests typically use unpruned, full-depth trees. Each tree has near-zero training error and is, alone, a poster child for overfitting. Yet the ensemble generalizes. Apply the decomposition: each tree has low bias and high variance. The average has the same bias (average of equal biases is that bias) and lower variance (by the formula). The forest inherits the trees' expressive power and discards their wobble. Pruning individuals would raise bias for already-controlled variance — net loss.

Extra-Trees, briefly

Geurts et al. 2006's Extremely Randomized Trees push decorrelation further: (1) skip bootstrapping — use the full training set; (2) at each split, after the random feature subset, pick a random threshold instead of the best one. Each tree is weaker (higher bias), cheaper to fit, less correlated with siblings (lower variance). Usually within a hair of RF on accuracy, much faster to train.

Interactive · feel the variance collapse

1D regression on a noisy sine. Thin grey curves are individual trees; bold blue is the ensemble average. Slide B up — wobble disappears. Cut max-depth — see the bias floor that no amount of averaging can fix.

Bagging variance reducer
B=1, depth=8 — high variance, prediction wiggles. Push B to 30 — average smooths out. Drop depth to 2 — even with B=30, the average can't bend enough to match the sine. That's bias.
train MSE
test MSE
avg tree variance @ x
ensemble vs trees
Reading

RF vs GBDT — when you reach for which

SituationPickWhy
Small, noisy dataset, limited tuning budgetRFRobust to defaults; averaging is a strong regularizer with small n. GBDT overfits without careful learning-rate and tree-count tuning.
Large dataset, accuracy is the bottom line, time to tuneGBDT (lesson 6)Sequential bias-reduction beats variance-reduction-of-overfitters on enough data. State of the art on tabular.
Want validation for freeRFOOB built in; GBDT needs an explicit validation set for early stopping.
Want roughly-calibrated probabilities out of the boxRFAveraging leaf-frequencies gives reasonable probabilities. GBDT log-loss outputs are moderately calibrated — better than hinge-loss models, worse than well-fitted logistic regression — and benefit from Platt or isotonic for production use.
Maximum training parallelismRF / Extra-TreesTrees are independent — embarrassingly parallel. GBDT is sequential (later trees depend on earlier residuals).

The full landscape

Single deep treeBagged treesRandom forestExtra-TreesGBDT
Biaslowlowlowslightly highervery low (tunable)
Variancevery highmedium (floor at ρσ²)lowlowermedium-high (tunable)
Sensitivity to defaultsmoderatelowlowlowhigh
OOB availablen/ayesyesno (no bootstrap)no
Parallel trainingn/afullfullfullsequential
Probability calibrationpoorfairfairfairmoderate; benefits from Platt/isotonic for production
Typical accuracy on tabularweakgoodstrongstrongstrongest

Interview prompts you should be ready for

  1. "Derive the OOB ~37% from first principles." (P(not picked in one draw)=1−1/n. P(not picked in n draws)=(1−1/n)^n → 1/e ≈ 0.368.)
  2. "Why does feature subsampling at each split help?" (Reduces inter-tree correlation ρ. Var of the average is ρσ² + (1−ρ)σ²/B; lowering ρ lowers the floor that bagging hits.)
  3. "Your RF has 100 trees, 0.05 train and 0.30 test error. What now?" (High variance. More trees won't help past the bagging plateau — lower ρ. Reduce max_features, reduce depth, get more data, or audit dominant correlated features.)
  4. "How do you get probabilities from RF, and are they calibrated?" (Average per-tree leaf class frequencies. Decent near 0.5, less so near 0/1 — averaging pulls toward centre. Use isotonic on held-out if calibration matters.)
  5. "RF vs GBDT — when do you pick which?" (RF for small/noisy data, no tuning budget, free OOB, rough calibration, parallelism. GBDT when accuracy matters most and you'll tune.)
  6. "Permutation vs Gini importance — when does each mislead?" (Gini biased toward high-cardinality/continuous features. Permutation is honest on OOB but collapses under feature correlation: duplicates each look unimportant because the signal flows through the other.)
Takeaway
Bagging reduces variance by averaging; random forests reduce the correlation between averages so bagging keeps paying off. The whole package — bootstrap, feature subsampling, OOB validation — is one of the most robust defaults in tabular ML. Senior signal: know which knob targets which term of ρ σ² + (1−ρ)σ²/B, and know what RF cannot fix (bias from underpowered trees, distribution shift, GBDT-tier accuracy on tuned large problems).