Gradient boosting & GBDT

The crown jewel of tabular ML. When a senior MLE interview asks "what would you start with on this tabular problem?", the answer is almost always a GBDT — and you should be able to derive why from first principles, not from a benchmark table.

Bagging vs boosting — the fork in the road

Both random forests and GBDTs are tree ensembles. They take opposite routes from there.

	Bagging (RF)	Boosting (GBDT)
Training order	Parallel — trees independent	Sequential — tree t+1 needs tree t's predictions
Each tree sees	Bootstrap sample of training rows (X, y)	Pseudo-residuals of the current ensemble
Tree depth	Deep — strong, low-bias, high-variance	Shallow (3–8) — weak, high-bias, low-variance
Aggregation	Average / vote	F(x) = Σ η · h_t(x)
What it reduces	Variance	Bias
More trees	Diminishing return; effectively safe	Itself a regularizer — too many overfits

Slogan: bagging averages strong learners to kill variance; boosting stacks weak learners to kill bias. Every regularization knob in GBDT follows — the residual loop drives bias down by construction, so engineering effort goes into containing variance.

AdaBoost — the original idea (skim)

Freund & Schapire (1995). Classification setting. Initialize example weights w_i = 1/N; for t = 1…T train a weak classifier h_t on reweighted data, compute weighted error err_t, set α_t = ½ log((1 − err_t)/err_t), update w_i ← w_i · exp(−α_t y_i h_t(x_i)) and renormalize. Final classifier: sign(Σ α_t h_t(x)).

Elegant and historically important, but specialized to binary classification with exponential loss. Friedman's reframe (next) is what runs in your XGBoost binary.

Gradient boosting — the generalization

Friedman (2001) noticed AdaBoost is doing functional gradient descent. Reframe the ensemble as a function F(x) = Σ γ_t h_t(x) being optimized to minimize L(y, F(x)). At each iteration: (1) compute the pseudo-residual — the negative gradient of the loss at the current prediction r_i = −∂L/∂F̂ evaluated at F̂ = F_{t−1}(x_i); (2) fit a regression tree h_t to those pseudo-residuals; (3) line-search for step size γ_t; (4) update F_t = F_{t−1} + η γ_t h_t with learning rate η < 1.

The shape of the pseudo-residual depends only on the loss:

Loss	Negative gradient	Reading
Squared ½(y − F̂)²	y − F̂	Literally the residual. "Fit a tree to leftover error" — the intuitive picture.
Log-loss on logits	y − σ(F̂)	Label minus predicted probability. Each tree pushes logits in the right direction.
Pseudo-Huber	Clipped residual	Robustness to outliers without losing differentiability.
Pairwise rank (LambdaRank)	Per-pair lambda	Same machinery, ranking objective.

The reframe that makes everything click

Parameter-space SGD: θ ← θ − η ∇_θ L. Function-space SGD: F ← F − η ∇_F L. But ∇_F L isn't a function in our model class, so we project it by fitting a regression tree to it. The tree is the projection. Shrinkage = small step size; tree depth = capacity per step; subsampling = stochastic gradient — identical to plain SGD intuition.

Why "boosting reduces bias"

The single-tree base learner is deliberately weak — depth 3–8, maybe 32–256 leaves total. A weak learner has high bias and low variance. Each iteration adds a term that explicitly chases what the current ensemble got wrong, so the ensemble's capacity grows additively. Bias is driven toward zero; variance creeps up from "many small models added" toward "memorizing training noise" if you let it. The conceptual opposite of bagging, where you start with strong, near-zero-bias learners and average their variance away.

The four regularization knobs

Knob	Effect	Typical range
Iterations T	Each tree reduces bias on train; past a point memorizes noise. Validation loss has a clear minimum.	100–10000; tune by early stopping.
Learning rate η	Scales each tree's contribution. The single most effective regularizer. Smaller η + more trees usually beats larger η + fewer.	0.01–0.1; halving η roughly doubles T needed.
Tree depth / num_leaves	Caps interaction order each tree captures. Shallow → weaker per-tree → needs more iterations but generalizes better.	depth 3–8; num_leaves 15–255.
Subsampling (rows & cols)	Friedman's "stochastic gradient boosting": random fraction of rows/columns per tree. Noise that breaks overfit and slightly speeds training.	0.5–0.9 for both.

These knobs are not independent. Canonical tuning workflow: set depth 5–6, η = 0.05, subsample 0.8 / colsample 0.8; train with large max T (say 5000) and early-stop on validation loss with patience ~50 rounds. If the gap is too wide: shrink η, lower depth, raise min_child_weight or λ. If underfit: raise depth or η, drop regularization.

XGBoost — what's actually different from "plain" GBDT

Chen & Guestrin (2016). XGBoost is the implementation that took GBDT from "competitive on Kaggle" to "default tabular baseline". Three real ideas, all derivable on a napkin.

1. Second-order Taylor expansion

For the next tree f at iteration t, expand the loss change to second order:

L_t ≈ Σ_i [ g_i f(x_i) + ½ h_i f(x_i)² ] + Ω(f)

where g_i = ∂L/∂F̂ and h_i = ∂²L/∂F̂² at the current prediction. Plain GBDT uses only g; adding h lets you compute the optimal leaf value analytically — no line search.

A tree partitions inputs into leaves j; every x_i in leaf j gets the same value w_j. With G_j = Σ_{i∈j} g_i, H_j = Σ_{i∈j} h_i, and Ω(f) = γT + ½λ Σ w_j²:

L_t ≈ Σ_j [ G_j w_j + ½(H_j + λ) w_j² ] + γT

Differentiate per leaf, set to zero, substitute back:

w_j* = −G_j / (H_j + λ) L_t* = −½ Σ_j G_j² / (H_j + λ) + γT

And the split gain — the scoring function inside the split finder:

Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ) ] − γ

γ is a per-split penalty: a leaf only splits if gain exceeds γ. Pruning is built into the criterion, not bolted on after.

2. Explicit regularization in the objective

Plain GBDT regularizes through depth, shrinkage, and subsampling — all external. XGBoost adds Ω(f) = γT + ½λ‖w‖² + α‖w‖₁ directly into the split criterion, so the regularizer affects which splits are even considered.

3. Sparsity-aware splits (default direction for missing values)

At each node, try sending all missing rows left vs right; pick the direction with higher gain; store it as the "default direction". Cleaner than imputation and much faster. Other engineering wins: block-wise column layout for cache and parallelism; out-of-core training; the approximate (percentile-based) split finder.

LightGBM — what's different from XGBoost

Ke et al. (2017). Microsoft's GBDT. Competitive on accuracy after tuning; the differences are about speed and growth policy.

Histogram-based split finding. Pre-bin every continuous feature into ~255 buckets at the start; split candidates are bucket boundaries. Orders of magnitude fewer candidates than every distinct value. XGBoost later added tree_method=hist to match; both libraries now ship histogram methods.
Leaf-wise (best-first) growth. XGBoost's default is level-wise — expand all leaves at depth d before going deeper. LightGBM repeatedly splits the leaf with highest gain. Often reaches lower training loss with fewer leaves but produces skewed trees and overfits small data — hence num_leaves and max_depth caps.
GOSS (Gradient-based One-Sided Sampling): keep all large-gradient examples, sample the rest, with the sampled small-gradient subset re-weighted by (1−a)/b — where a is the top-gradient fraction kept and b is the sample fraction of the remainder — so the gain estimate stays unbiased. Optional.
EFB (Exclusive Feature Bundling): merge sparse features that rarely co-occur. Optional.

CatBoost — what's different again

Prokhorenkova et al. (2018). Yandex's GBDT. The one with the strongest categorical story.

Ordered target statistics. Standard target encoding leaks: the encoding of row i's category depends on row i's label. CatBoost picks a random row ordering and computes each row's target statistic using only rows BEFORE it in the ordering. Leak-free; small memory cost.
Ordered boosting. The analogous trick for the gradient estimation step itself: at each iteration, the model used to compute gradients for row i is trained only on rows BEFORE i, preventing target leakage in the gradient direction.
Native categorical handling. Pass column names directly; CatBoost does one-hot for low-cardinality and ordered target-stats for high-cardinality.
Symmetric (oblivious) trees. Every node at depth d uses the same split. Slightly restricts the tree class, but inference becomes a single bitwise lookup per row — extremely fast at serving time.

XGBoost vs LightGBM vs CatBoost — the senior signal

	XGBoost	LightGBM	CatBoost
Differentiator	Second-order objective; the original modern GBDT	Histogram + leaf-wise — fastest	Ordered boosting + native categoricals
Growth policy	Level-wise (default)	Leaf-wise (best-first)	Symmetric (oblivious)
Categoricals	Manual encoding	Built-in but basic	First-class, leak-free
Where it wins	Safe default, mature ecosystem, distributed	Speed; very large datasets	Categorical-heavy data; fast inference
Where it loses	Slower than LightGBM at default settings	Leaf-wise overfits small data without caps	Slower training; smaller community

After tuning, all three are competitive on most benchmarks. Headline differences in default-settings benchmarks mostly come from hyperparameter defaults, not algorithmic capacity. The interview signal isn't picking one — it's knowing which differentiator each one is associated with.

Interactive · boosting in slow motion

1D regression on a piecewise nonlinear curve. Slide the number of trees, learning rate, and tree depth. Watch the ensemble's prediction sharpen, then start memorizing noise. The KPI box reports train and validation MSE; the diagnosis box names the regime.

GBDT vs RF vs deep NN on tabular

	GBDT	Random Forest	Deep NN
Training time on tabular	Fast	Comparable	Slow, GPU-hungry
Hyperparameter sensitivity	High — coupled knobs, needs early stopping	Low — near-out-of-the-box	High — arch, lr schedule, batch, init
Default tabular accuracy	State of the art	A few points behind GBDT	Usually behind GBDT on heterogeneous data
Ceiling	The benchmark winner	Plateaus	Matches GBDT on huge homogeneous data; rarely surpasses on small/medium
Calibration	GBDT log-loss outputs are moderately calibrated — better than hinge-loss models, worse than well-fitted logistic regression — and benefit from Platt or isotonic for production use.	Reasonable; can be over-confident	Poor by default; needs temperature scaling
Categoricals	CatBoost native; XGB/LGBM need encoding	Encoded inputs	Embeddings — first-class
Monotonicity constraints	Built-in per feature	Hard to enforce	Penalty-based or architectural — fiddly
Wide sparse data (text TF-IDF)	Weak — single-column axis-aligned splits on millions of rare TF-IDF features rarely beat linear or embedding models.	Weak for the same reason	Strong with embeddings/attention
Interpretability	SHAP, feature importance	Feature importance, proximity	Post-hoc only

When NOT to reach for GBDT

Three regimes: (1) high-cardinality unstructured inputs (text, images, audio) where embeddings dominate; (2) calibration matters more than ranking and the dataset is small enough that logistic regression with hand-crafted features is interpretable and calibrated by default; (3) deployment can't pay GBDT's per-prediction tree-traversal cost (extreme low-latency edge inference where a tiny linear/MLP wins on cache footprint). Otherwise: start with GBDT.

The bias-variance picture, restated for GBDT

Each new tree fits the leftover gradient → bias falls. Learning rate, tree depth, and number of trees are coupled variance knobs:

validation loss ▲ │ ╲ ╱ │ ╲ high η, deep trees ╱ │ ╲ variance climbs fast ╱ │ ╲ ╱ │ ╲________________________________ ╱ ← early-stop here │ ╲ low η, shallow trees ╱ │ ╲ variance climbs slow ╱ │ ╲___________________________ ╱ │ (training loss keeps dropping no matter what) │ ──────────────────────────────────────────────► │ few trees T* many trees

Two configurations can hit the same minimum validation loss with very different (T, η, depth) triples. Smaller η and shallower trees needs more T but is more robust to validation noise — usually preferred. This is why "small η, many trees, early stop" is the production default.

Interview prompts you should be ready for

"Derive XGBoost's optimal leaf value w* = −G/(H + λ) from the second-order objective." (Second-order Taylor expansion, group by leaf, differentiate the per-leaf quadratic, set to zero. The split-gain formula falls out by substitution.)
"Why does GBDT use shallow trees while RF uses deep ones?" (Different ensembling strategy. RF averages strong, decorrelated learners — deep trees give low bias, averaging kills variance. GBDT stacks weak learners to drive bias down sequentially — shallow trees keep each step's variance small. Deep trees in boosting overfit fast.)
"Your GBDT overfits at 500 trees. List five things you'd try." (Lower η, shorter trees, more row/column subsampling, raise λ and min_child_weight, tighter early stopping. Bonus: check for target leakage — apparent overfit can be a leak.)
"What's the actual algorithmic difference between LightGBM and XGBoost?" (Two: leaf-wise vs level-wise growth, and LightGBM's earlier-adopted histogram split finder. Plus GOSS and EFB. XGBoost has histogram mode now; the core algorithm is the same Friedman gradient boosting.)
"Boosting reduces bias. So why can it still overfit?" (Bias and variance are independent terms. Bias → 0 doesn't bound variance — and variance climbs with iterations because each tree is a real fit to a noisy signal. Hence early stopping, shrinkage, subsampling.)
"When would you pick logistic regression over GBDT?" (Calibration matters and the dataset is small/medium; relationship is plausibly linear; coefficient-level interpretability required for audit/regulation; ultra-low-latency serving; you need a model the next person can update without retraining pipelines.)

Takeaway

GBDT is gradient descent in function space, with shallow regression trees as the projection step. Each tree fits the negative gradient of the loss at the current prediction, scaled by a learning rate. Bias falls; variance is the engineering problem. XGBoost adds second-order Taylor (closed-form leaf values and split gain), LightGBM adds histogram + leaf-wise growth (speed), CatBoost adds ordered target encoding (leak-free categoricals). For "what would you start with on this tabular problem?" the answer is GBDT — and you should be able to draw the residual loop, write the second-order objective, name the regularization knobs, and pick the right library.