Gradient boosting & GBDT
The crown jewel of tabular ML. When a senior MLE interview asks "what would you start with on this tabular problem?", the answer is almost always a GBDT — and you should be able to derive why from first principles, not from a benchmark table.
Bagging vs boosting — the fork in the road
Both random forests and GBDTs are tree ensembles. They take opposite routes from there.
| Bagging (RF) | Boosting (GBDT) | |
|---|---|---|
| Training order | Parallel — trees independent | Sequential — tree t+1 needs tree t's predictions |
| Each tree sees | Bootstrap sample of training rows (X, y) | Pseudo-residuals of the current ensemble |
| Tree depth | Deep — strong, low-bias, high-variance | Shallow (3–8) — weak, high-bias, low-variance |
| Aggregation | Average / vote | F(x) = Σ η · h_t(x) |
| What it reduces | Variance | Bias |
| More trees | Diminishing return; effectively safe | Itself a regularizer — too many overfits |
Slogan: bagging averages strong learners to kill variance; boosting stacks weak learners to kill bias. Every regularization knob in GBDT follows — the residual loop drives bias down by construction, so engineering effort goes into containing variance.
AdaBoost — the original idea (skim)
Freund & Schapire (1995). Classification setting. Initialize example weights w_i = 1/N; for t = 1…T train a weak classifier h_t on reweighted data, compute weighted error err_t, set α_t = ½ log((1 − err_t)/err_t), update w_i ← w_i · exp(−α_t y_i h_t(x_i)) and renormalize. Final classifier: sign(Σ α_t h_t(x)).
Elegant and historically important, but specialized to binary classification with exponential loss. Friedman's reframe (next) is what runs in your XGBoost binary.
Gradient boosting — the generalization
Friedman (2001) noticed AdaBoost is doing functional gradient descent. Reframe the ensemble as a function F(x) = Σ γ_t h_t(x) being optimized to minimize L(y, F(x)). At each iteration: (1) compute the pseudo-residual — the negative gradient of the loss at the current prediction r_i = −∂L/∂F̂ evaluated at F̂ = F_{t−1}(x_i); (2) fit a regression tree h_t to those pseudo-residuals; (3) line-search for step size γ_t; (4) update F_t = F_{t−1} + η γ_t h_t with learning rate η < 1.
The shape of the pseudo-residual depends only on the loss:
| Loss | Negative gradient | Reading |
|---|---|---|
| Squared ½(y − F̂)² | y − F̂ | Literally the residual. "Fit a tree to leftover error" — the intuitive picture. |
| Log-loss on logits | y − σ(F̂) | Label minus predicted probability. Each tree pushes logits in the right direction. |
| Pseudo-Huber | Clipped residual | Robustness to outliers without losing differentiability. |
| Pairwise rank (LambdaRank) | Per-pair lambda | Same machinery, ranking objective. |
Why "boosting reduces bias"
The single-tree base learner is deliberately weak — depth 3–8, maybe 32–256 leaves total. A weak learner has high bias and low variance. Each iteration adds a term that explicitly chases what the current ensemble got wrong, so the ensemble's capacity grows additively. Bias is driven toward zero; variance creeps up from "many small models added" toward "memorizing training noise" if you let it. The conceptual opposite of bagging, where you start with strong, near-zero-bias learners and average their variance away.
The four regularization knobs
| Knob | Effect | Typical range |
|---|---|---|
| Iterations T | Each tree reduces bias on train; past a point memorizes noise. Validation loss has a clear minimum. | 100–10000; tune by early stopping. |
| Learning rate η | Scales each tree's contribution. The single most effective regularizer. Smaller η + more trees usually beats larger η + fewer. | 0.01–0.1; halving η roughly doubles T needed. |
| Tree depth / num_leaves | Caps interaction order each tree captures. Shallow → weaker per-tree → needs more iterations but generalizes better. | depth 3–8; num_leaves 15–255. |
| Subsampling (rows & cols) | Friedman's "stochastic gradient boosting": random fraction of rows/columns per tree. Noise that breaks overfit and slightly speeds training. | 0.5–0.9 for both. |
These knobs are not independent. Canonical tuning workflow: set depth 5–6, η = 0.05, subsample 0.8 / colsample 0.8; train with large max T (say 5000) and early-stop on validation loss with patience ~50 rounds. If the gap is too wide: shrink η, lower depth, raise min_child_weight or λ. If underfit: raise depth or η, drop regularization.
XGBoost — what's actually different from "plain" GBDT
Chen & Guestrin (2016). XGBoost is the implementation that took GBDT from "competitive on Kaggle" to "default tabular baseline". Three real ideas, all derivable on a napkin.
1. Second-order Taylor expansion
For the next tree f at iteration t, expand the loss change to second order:
where g_i = ∂L/∂F̂ and h_i = ∂²L/∂F̂² at the current prediction. Plain GBDT uses only g; adding h lets you compute the optimal leaf value analytically — no line search.
A tree partitions inputs into leaves j; every x_i in leaf j gets the same value w_j. With G_j = Σ_{i∈j} g_i, H_j = Σ_{i∈j} h_i, and Ω(f) = γT + ½λ Σ w_j²:
Differentiate per leaf, set to zero, substitute back:
And the split gain — the scoring function inside the split finder:
γ is a per-split penalty: a leaf only splits if gain exceeds γ. Pruning is built into the criterion, not bolted on after.
2. Explicit regularization in the objective
Plain GBDT regularizes through depth, shrinkage, and subsampling — all external. XGBoost adds Ω(f) = γT + ½λ‖w‖² + α‖w‖₁ directly into the split criterion, so the regularizer affects which splits are even considered.
3. Sparsity-aware splits (default direction for missing values)
At each node, try sending all missing rows left vs right; pick the direction with higher gain; store it as the "default direction". Cleaner than imputation and much faster. Other engineering wins: block-wise column layout for cache and parallelism; out-of-core training; the approximate (percentile-based) split finder.
LightGBM — what's different from XGBoost
Ke et al. (2017). Microsoft's GBDT. Competitive on accuracy after tuning; the differences are about speed and growth policy.
- Histogram-based split finding. Pre-bin every continuous feature into ~255 buckets at the start; split candidates are bucket boundaries. Orders of magnitude fewer candidates than every distinct value. XGBoost later added tree_method=hist to match; both libraries now ship histogram methods.
- Leaf-wise (best-first) growth. XGBoost's default is level-wise — expand all leaves at depth d before going deeper. LightGBM repeatedly splits the leaf with highest gain. Often reaches lower training loss with fewer leaves but produces skewed trees and overfits small data — hence num_leaves and max_depth caps.
- GOSS (Gradient-based One-Sided Sampling): keep all large-gradient examples, sample the rest, with the sampled small-gradient subset re-weighted by (1−a)/b — where a is the top-gradient fraction kept and b is the sample fraction of the remainder — so the gain estimate stays unbiased. Optional.
- EFB (Exclusive Feature Bundling): merge sparse features that rarely co-occur. Optional.
CatBoost — what's different again
Prokhorenkova et al. (2018). Yandex's GBDT. The one with the strongest categorical story.
- Ordered target statistics. Standard target encoding leaks: the encoding of row i's category depends on row i's label. CatBoost picks a random row ordering and computes each row's target statistic using only rows BEFORE it in the ordering. Leak-free; small memory cost.
- Ordered boosting. The analogous trick for the gradient estimation step itself: at each iteration, the model used to compute gradients for row i is trained only on rows BEFORE i, preventing target leakage in the gradient direction.
- Native categorical handling. Pass column names directly; CatBoost does one-hot for low-cardinality and ordered target-stats for high-cardinality.
- Symmetric (oblivious) trees. Every node at depth d uses the same split. Slightly restricts the tree class, but inference becomes a single bitwise lookup per row — extremely fast at serving time.
XGBoost vs LightGBM vs CatBoost — the senior signal
| XGBoost | LightGBM | CatBoost | |
|---|---|---|---|
| Differentiator | Second-order objective; the original modern GBDT | Histogram + leaf-wise — fastest | Ordered boosting + native categoricals |
| Growth policy | Level-wise (default) | Leaf-wise (best-first) | Symmetric (oblivious) |
| Categoricals | Manual encoding | Built-in but basic | First-class, leak-free |
| Where it wins | Safe default, mature ecosystem, distributed | Speed; very large datasets | Categorical-heavy data; fast inference |
| Where it loses | Slower than LightGBM at default settings | Leaf-wise overfits small data without caps | Slower training; smaller community |
After tuning, all three are competitive on most benchmarks. Headline differences in default-settings benchmarks mostly come from hyperparameter defaults, not algorithmic capacity. The interview signal isn't picking one — it's knowing which differentiator each one is associated with.
Interactive · boosting in slow motion
1D regression on a piecewise nonlinear curve. Slide the number of trees, learning rate, and tree depth. Watch the ensemble's prediction sharpen, then start memorizing noise. The KPI box reports train and validation MSE; the diagnosis box names the regime.
GBDT vs RF vs deep NN on tabular
| GBDT | Random Forest | Deep NN | |
|---|---|---|---|
| Training time on tabular | Fast | Comparable | Slow, GPU-hungry |
| Hyperparameter sensitivity | High — coupled knobs, needs early stopping | Low — near-out-of-the-box | High — arch, lr schedule, batch, init |
| Default tabular accuracy | State of the art | A few points behind GBDT | Usually behind GBDT on heterogeneous data |
| Ceiling | The benchmark winner | Plateaus | Matches GBDT on huge homogeneous data; rarely surpasses on small/medium |
| Calibration | GBDT log-loss outputs are moderately calibrated — better than hinge-loss models, worse than well-fitted logistic regression — and benefit from Platt or isotonic for production use. | Reasonable; can be over-confident | Poor by default; needs temperature scaling |
| Categoricals | CatBoost native; XGB/LGBM need encoding | Encoded inputs | Embeddings — first-class |
| Monotonicity constraints | Built-in per feature | Hard to enforce | Penalty-based or architectural — fiddly |
| Wide sparse data (text TF-IDF) | Weak — single-column axis-aligned splits on millions of rare TF-IDF features rarely beat linear or embedding models. | Weak for the same reason | Strong with embeddings/attention |
| Interpretability | SHAP, feature importance | Feature importance, proximity | Post-hoc only |
The bias-variance picture, restated for GBDT
Each new tree fits the leftover gradient → bias falls. Learning rate, tree depth, and number of trees are coupled variance knobs:
Two configurations can hit the same minimum validation loss with very different (T, η, depth) triples. Smaller η and shallower trees needs more T but is more robust to validation noise — usually preferred. This is why "small η, many trees, early stop" is the production default.
Interview prompts you should be ready for
- "Derive XGBoost's optimal leaf value w* = −G/(H + λ) from the second-order objective." (Second-order Taylor expansion, group by leaf, differentiate the per-leaf quadratic, set to zero. The split-gain formula falls out by substitution.)
- "Why does GBDT use shallow trees while RF uses deep ones?" (Different ensembling strategy. RF averages strong, decorrelated learners — deep trees give low bias, averaging kills variance. GBDT stacks weak learners to drive bias down sequentially — shallow trees keep each step's variance small. Deep trees in boosting overfit fast.)
- "Your GBDT overfits at 500 trees. List five things you'd try." (Lower η, shorter trees, more row/column subsampling, raise λ and min_child_weight, tighter early stopping. Bonus: check for target leakage — apparent overfit can be a leak.)
- "What's the actual algorithmic difference between LightGBM and XGBoost?" (Two: leaf-wise vs level-wise growth, and LightGBM's earlier-adopted histogram split finder. Plus GOSS and EFB. XGBoost has histogram mode now; the core algorithm is the same Friedman gradient boosting.)
- "Boosting reduces bias. So why can it still overfit?" (Bias and variance are independent terms. Bias → 0 doesn't bound variance — and variance climbs with iterations because each tree is a real fit to a noisy signal. Hence early stopping, shrinkage, subsampling.)
- "When would you pick logistic regression over GBDT?" (Calibration matters and the dataset is small/medium; relationship is plausibly linear; coefficient-level interpretability required for audit/regulation; ultra-low-latency serving; you need a model the next person can update without retraining pipelines.)