traditional_ml / 06 · gradient boosting & GBDT lesson 6 / 12

Gradient boosting & GBDT

The crown jewel of tabular ML. When a senior MLE interview asks "what would you start with on this tabular problem?", the answer is almost always a GBDT — and you should be able to derive why from first principles, not from a benchmark table.

Bagging vs boosting — the fork in the road

Both random forests and GBDTs are tree ensembles. They take opposite routes from there.

Bagging (RF)Boosting (GBDT)
Training orderParallel — trees independentSequential — tree t+1 needs tree t's predictions
Each tree seesBootstrap sample of training rows (X, y)Pseudo-residuals of the current ensemble
Tree depthDeep — strong, low-bias, high-varianceShallow (3–8) — weak, high-bias, low-variance
AggregationAverage / voteF(x) = Σ η · h_t(x)
What it reducesVarianceBias
More treesDiminishing return; effectively safeItself a regularizer — too many overfits

Slogan: bagging averages strong learners to kill variance; boosting stacks weak learners to kill bias. Every regularization knob in GBDT follows — the residual loop drives bias down by construction, so engineering effort goes into containing variance.

AdaBoost — the original idea (skim)

Freund & Schapire (1995). Classification setting. Initialize example weights w_i = 1/N; for t = 1…T train a weak classifier h_t on reweighted data, compute weighted error err_t, set α_t = ½ log((1 − err_t)/err_t), update w_i ← w_i · exp(−α_t y_i h_t(x_i)) and renormalize. Final classifier: sign(Σ α_t h_t(x)).

Elegant and historically important, but specialized to binary classification with exponential loss. Friedman's reframe (next) is what runs in your XGBoost binary.

Gradient boosting — the generalization

Friedman (2001) noticed AdaBoost is doing functional gradient descent. Reframe the ensemble as a function F(x) = Σ γ_t h_t(x) being optimized to minimize L(y, F(x)). At each iteration: (1) compute the pseudo-residual — the negative gradient of the loss at the current prediction r_i = −∂L/∂F̂ evaluated at F̂ = F_{t−1}(x_i); (2) fit a regression tree h_t to those pseudo-residuals; (3) line-search for step size γ_t; (4) update F_t = F_{t−1} + η γ_t h_t with learning rate η < 1.

The shape of the pseudo-residual depends only on the loss:

LossNegative gradientReading
Squared ½(y − F̂)²y − F̂Literally the residual. "Fit a tree to leftover error" — the intuitive picture.
Log-loss on logitsy − σ(F̂)Label minus predicted probability. Each tree pushes logits in the right direction.
Pseudo-HuberClipped residualRobustness to outliers without losing differentiability.
Pairwise rank (LambdaRank)Per-pair lambdaSame machinery, ranking objective.
The reframe that makes everything click
Parameter-space SGD: θ ← θ − η ∇_θ L. Function-space SGD: F ← F − η ∇_F L. But ∇_F L isn't a function in our model class, so we project it by fitting a regression tree to it. The tree is the projection. Shrinkage = small step size; tree depth = capacity per step; subsampling = stochastic gradient — identical to plain SGD intuition.

Why "boosting reduces bias"

The single-tree base learner is deliberately weak — depth 3–8, maybe 32–256 leaves total. A weak learner has high bias and low variance. Each iteration adds a term that explicitly chases what the current ensemble got wrong, so the ensemble's capacity grows additively. Bias is driven toward zero; variance creeps up from "many small models added" toward "memorizing training noise" if you let it. The conceptual opposite of bagging, where you start with strong, near-zero-bias learners and average their variance away.

Each iteration fits the residuals of the previous ensemble. Bias falls; variance creeps in. true f(x) ensemble F_t(x) residual (y − F_t) Iteration 1 (stump) huge residuals — bias dominates Iteration 5 bias shrinking — residuals smaller Iteration 20 near-zero residuals — careful, overfit risk

The four regularization knobs

KnobEffectTypical range
Iterations T Each tree reduces bias on train; past a point memorizes noise. Validation loss has a clear minimum. 100–10000; tune by early stopping.
Learning rate η Scales each tree's contribution. The single most effective regularizer. Smaller η + more trees usually beats larger η + fewer. 0.01–0.1; halving η roughly doubles T needed.
Tree depth / num_leaves Caps interaction order each tree captures. Shallow → weaker per-tree → needs more iterations but generalizes better. depth 3–8; num_leaves 15–255.
Subsampling (rows & cols) Friedman's "stochastic gradient boosting": random fraction of rows/columns per tree. Noise that breaks overfit and slightly speeds training. 0.5–0.9 for both.

These knobs are not independent. Canonical tuning workflow: set depth 5–6, η = 0.05, subsample 0.8 / colsample 0.8; train with large max T (say 5000) and early-stop on validation loss with patience ~50 rounds. If the gap is too wide: shrink η, lower depth, raise min_child_weight or λ. If underfit: raise depth or η, drop regularization.

XGBoost — what's actually different from "plain" GBDT

Chen & Guestrin (2016). XGBoost is the implementation that took GBDT from "competitive on Kaggle" to "default tabular baseline". Three real ideas, all derivable on a napkin.

1. Second-order Taylor expansion

For the next tree f at iteration t, expand the loss change to second order:

L_t ≈ Σ_i [ g_i f(x_i) + ½ h_i f(x_i)² ] + Ω(f)

where g_i = ∂L/∂F̂ and h_i = ∂²L/∂F̂² at the current prediction. Plain GBDT uses only g; adding h lets you compute the optimal leaf value analytically — no line search.

A tree partitions inputs into leaves j; every x_i in leaf j gets the same value w_j. With G_j = Σ_{i∈j} g_i, H_j = Σ_{i∈j} h_i, and Ω(f) = γT + ½λ Σ w_j²:

L_t ≈ Σ_j [ G_j w_j + ½(H_j + λ) w_j² ] + γT

Differentiate per leaf, set to zero, substitute back:

w_j* = −G_j / (H_j + λ)     L_t* = −½ Σ_j G_j² / (H_j + λ) + γT

And the split gain — the scoring function inside the split finder:

Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ) ] − γ

γ is a per-split penalty: a leaf only splits if gain exceeds γ. Pruning is built into the criterion, not bolted on after.

2. Explicit regularization in the objective

Plain GBDT regularizes through depth, shrinkage, and subsampling — all external. XGBoost adds Ω(f) = γT + ½λ‖w‖² + α‖w‖₁ directly into the split criterion, so the regularizer affects which splits are even considered.

3. Sparsity-aware splits (default direction for missing values)

At each node, try sending all missing rows left vs right; pick the direction with higher gain; store it as the "default direction". Cleaner than imputation and much faster. Other engineering wins: block-wise column layout for cache and parallelism; out-of-core training; the approximate (percentile-based) split finder.

LightGBM — what's different from XGBoost

Ke et al. (2017). Microsoft's GBDT. Competitive on accuracy after tuning; the differences are about speed and growth policy.

CatBoost — what's different again

Prokhorenkova et al. (2018). Yandex's GBDT. The one with the strongest categorical story.

XGBoost vs LightGBM vs CatBoost — the senior signal

XGBoostLightGBMCatBoost
DifferentiatorSecond-order objective; the original modern GBDTHistogram + leaf-wise — fastestOrdered boosting + native categoricals
Growth policyLevel-wise (default)Leaf-wise (best-first)Symmetric (oblivious)
CategoricalsManual encodingBuilt-in but basicFirst-class, leak-free
Where it winsSafe default, mature ecosystem, distributedSpeed; very large datasetsCategorical-heavy data; fast inference
Where it losesSlower than LightGBM at default settingsLeaf-wise overfits small data without capsSlower training; smaller community

After tuning, all three are competitive on most benchmarks. Headline differences in default-settings benchmarks mostly come from hyperparameter defaults, not algorithmic capacity. The interview signal isn't picking one — it's knowing which differentiator each one is associated with.

Interactive · boosting in slow motion

1D regression on a piecewise nonlinear curve. Slide the number of trees, learning rate, and tree depth. Watch the ensemble's prediction sharpen, then start memorizing noise. The KPI box reports train and validation MSE; the diagnosis box names the regime.

Boosting a regression curve
Try (η=1.0, depth=6, T=50) for textbook overfit. Then (η=0.05, depth=3, T=200) for the canonical recipe. Then watch what happens if you push T past the validation minimum at the same η.
train MSE
validation MSE
total leaves
trees
Reading

GBDT vs RF vs deep NN on tabular

GBDTRandom ForestDeep NN
Training time on tabularFastComparableSlow, GPU-hungry
Hyperparameter sensitivityHigh — coupled knobs, needs early stoppingLow — near-out-of-the-boxHigh — arch, lr schedule, batch, init
Default tabular accuracyState of the artA few points behind GBDTUsually behind GBDT on heterogeneous data
CeilingThe benchmark winnerPlateausMatches GBDT on huge homogeneous data; rarely surpasses on small/medium
CalibrationGBDT log-loss outputs are moderately calibrated — better than hinge-loss models, worse than well-fitted logistic regression — and benefit from Platt or isotonic for production use.Reasonable; can be over-confidentPoor by default; needs temperature scaling
CategoricalsCatBoost native; XGB/LGBM need encodingEncoded inputsEmbeddings — first-class
Monotonicity constraintsBuilt-in per featureHard to enforcePenalty-based or architectural — fiddly
Wide sparse data (text TF-IDF)Weak — single-column axis-aligned splits on millions of rare TF-IDF features rarely beat linear or embedding models.Weak for the same reasonStrong with embeddings/attention
InterpretabilitySHAP, feature importanceFeature importance, proximityPost-hoc only
When NOT to reach for GBDT
Three regimes: (1) high-cardinality unstructured inputs (text, images, audio) where embeddings dominate; (2) calibration matters more than ranking and the dataset is small enough that logistic regression with hand-crafted features is interpretable and calibrated by default; (3) deployment can't pay GBDT's per-prediction tree-traversal cost (extreme low-latency edge inference where a tiny linear/MLP wins on cache footprint). Otherwise: start with GBDT.

The bias-variance picture, restated for GBDT

Each new tree fits the leftover gradient → bias falls. Learning rate, tree depth, and number of trees are coupled variance knobs:

validation loss ▲ │ ╲ ╱ │ ╲ high η, deep trees ╱ │ ╲ variance climbs fast ╱ │ ╲ ╱ │ ╲________________________________ ╱ ← early-stop here │ ╲ low η, shallow trees ╱ │ ╲ variance climbs slow ╱ │ ╲___________________________ ╱ │ (training loss keeps dropping no matter what) │ ──────────────────────────────────────────────► │ few trees T* many trees

Two configurations can hit the same minimum validation loss with very different (T, η, depth) triples. Smaller η and shallower trees needs more T but is more robust to validation noise — usually preferred. This is why "small η, many trees, early stop" is the production default.

Interview prompts you should be ready for

  1. "Derive XGBoost's optimal leaf value w* = −G/(H + λ) from the second-order objective." (Second-order Taylor expansion, group by leaf, differentiate the per-leaf quadratic, set to zero. The split-gain formula falls out by substitution.)
  2. "Why does GBDT use shallow trees while RF uses deep ones?" (Different ensembling strategy. RF averages strong, decorrelated learners — deep trees give low bias, averaging kills variance. GBDT stacks weak learners to drive bias down sequentially — shallow trees keep each step's variance small. Deep trees in boosting overfit fast.)
  3. "Your GBDT overfits at 500 trees. List five things you'd try." (Lower η, shorter trees, more row/column subsampling, raise λ and min_child_weight, tighter early stopping. Bonus: check for target leakage — apparent overfit can be a leak.)
  4. "What's the actual algorithmic difference between LightGBM and XGBoost?" (Two: leaf-wise vs level-wise growth, and LightGBM's earlier-adopted histogram split finder. Plus GOSS and EFB. XGBoost has histogram mode now; the core algorithm is the same Friedman gradient boosting.)
  5. "Boosting reduces bias. So why can it still overfit?" (Bias and variance are independent terms. Bias → 0 doesn't bound variance — and variance climbs with iterations because each tree is a real fit to a noisy signal. Hence early stopping, shrinkage, subsampling.)
  6. "When would you pick logistic regression over GBDT?" (Calibration matters and the dataset is small/medium; relationship is plausibly linear; coefficient-level interpretability required for audit/regulation; ultra-low-latency serving; you need a model the next person can update without retraining pipelines.)
Takeaway
GBDT is gradient descent in function space, with shallow regression trees as the projection step. Each tree fits the negative gradient of the loss at the current prediction, scaled by a learning rate. Bias falls; variance is the engineering problem. XGBoost adds second-order Taylor (closed-form leaf values and split gain), LightGBM adds histogram + leaf-wise growth (speed), CatBoost adds ordered target encoding (leak-free categoricals). For "what would you start with on this tabular problem?" the answer is GBDT — and you should be able to draw the residual loop, write the second-order objective, name the regularization knobs, and pick the right library.