Decision trees

The single most important model class for tabular data. Not because one tree is great — it isn't — but because every ensemble that crushes tabular benchmarks (random forests, GBDTs) is a pile of trees. Understand the tree, and lessons 5 and 6 collapse to "stack them, but cleverly".

What a tree is, geometrically

A decision tree is a recursive, axis-aligned partition of feature space. Every internal node tests x_j ≤ t and routes the sample left or right. Leaves are constants — the mean of training labels in that leaf for regression, the majority class (or class-frequency vector for probabilities) for classification. The whole model is

f̂(x) = Σ_ℓ c_ℓ · 1{x ∈ R_ℓ}

where {R_ℓ} are the rectangles (leaves) and c_ℓ the leaf constants. Every prediction is a lookup: route x down the tree, return the leaf value. There is no continuous function being fitted, just a piecewise-constant tessellation.

x2 │ ────┼───────────────── split: x1 ≤ 4 │ │ │ R3 │ R4 │ (mean=0.71) │ (mean=0.12) │ │ ────┼──────────────────┼──── split: x2 ≤ 2 (right side only) │ R1 │ R2 │ (mean=0.05) │ (mean=0.88) │ │ └──────────────────┴──────────► x1 4

Why this representation is powerful

Non-linear by construction. Any measurable function on a compact set can be approximated arbitrarily well by enough rectangles. Capacity is unbounded.
Interactions for free. A split on x_j followed by a split on x_k down one branch is the interaction x_j · x_k. No feature crosses to engineer.
Scale-invariant. Splits depend on order, not magnitude. Standardising, log-transforming, Box-Cox — none of it changes the tree.
Mixed types. Numeric and categorical features coexist (with subtleties in the categorical section below).
Missing values are first-class. Surrogate splits (CART) or learned default directions (XGBoost, LightGBM) route NaNs without imputation.
Robust to irrelevant features. A useless column is simply never chosen as a split.
Interpretable at small depths. A depth-3 tree is a flowchart you can read aloud.

Linear models need you to engineer the non-linearity; neural nets need a lot of data and careful normalisation. A tree just works.

Greedy splitting — and why we live with suboptimality

Finding the globally-optimal tree is NP-hard (Hyafil & Rivest 1976). The partition space is combinatorial in features and thresholds per feature. No one searches it exhaustively.

Instead, trees are grown greedily, top-down: at each node, scan every (feature, threshold) pair and pick the one that maximally reduces a chosen impurity measure. Recurse on each child. Never backtrack. With presorted features the cost at each node is O(d · n log n); total build cost is roughly O(d · n log n · depth).

Why everyone uses greedy despite suboptimality

The gap from greedy to optimal is empirically small, and the boosting frame (lesson 6) erases what's left by fitting subsequent trees to the residuals of the greedy ones. Industry has effectively decided that greedy + ensemble dominates exact + single. Optimal-tree research (branch-and-bound, MIP) is interesting but rarely competitive at scale.

Splitting criteria — classification

For K classes with node proportions p_1, …, p_K:

Impurity	Formula	Interpretation
Gini	G = Σ_k p_k (1 − p_k) = 1 − Σ_k p_k²	Probability that a random pair of items in the node have different labels. Zero iff pure.
Entropy	H = − Σ_k p_k log p_k	Bits/sample to encode the label given the node. Zero iff pure, maximised at uniform.
Classification error	1 − max_k p_k	Used for pruning, rarely for growing — too coarse, ignores minority-class mass.

Split quality is the information gain: parent impurity minus the weighted sum of child impurities.

Δ = I(parent) − (n_L / n) · I(left) − (n_R / n) · I(right)

Gini and entropy are both concave, maximised at p_k = 1/K, zero at one-hot proportions, and produce nearly identical trees on real data. Gini is slightly cheaper (no log). CART defaults to Gini; ID3/C4.5 use entropy. The choice is more religion than science — pretending one is materially better is a tell.

Splitting criterion — regression

The default impurity is within-node variance:

I(node) = (1/n) Σ_i (y_i − ȳ)²

Maximising gain is equivalently minimising within-child squared error, which is why regression-tree leaves predict the mean. Alternatives: MAE (median leaves, outlier-robust, slower), Poisson deviance, Huber. Most libraries default to MSE — fast to update incrementally as the threshold slides, and pairs cleanly with GBDT's residual-fitting interpretation.

The CART recipe

Breiman et al. 1984: grow greedily to maximum size, then prune back using cost-complexity pruning:

L_α(T) = err(T) + α · |T|

where |T| is the leaf count and α ≥ 0 is the complexity penalty. Sweep α from zero upwards; at each value an entire subtree collapses to its root, generating a nested sequence of candidate trees. Pick α by cross-validation.

Modern libraries (scikit-learn, XGBoost, LightGBM) usually skip explicit pruning and regularize by not growing: max_depth, min_samples_leaf, min_impurity_decrease, max_leaf_nodes. Roughly equivalent, avoids the second pass. CCP is still in scikit-learn under ccp_alpha, but rarely the chosen knob in production.

The bias-variance lever — and why one tree is a weak learner

Depth	Bias	Variance	What happens
Shallow (1–3)	High	Low	Too few rectangles to represent the function; train and val errors both high.
Moderate	Mid	Mid	Sweet spot if you must ship a single tree. Rare — you'd ensemble.
Deep (≈1 sample/leaf)	Zero on train	Huge	Memorises the training set; train error 0, val error wild. Boundaries hug individual points.

A fully-grown tree is a high-variance model: greedy split-search is unstable, so a small training-set perturbation can flip an early split and every downstream split changes too. This instability is exactly what makes bagging (lesson 5) and boosting (lesson 6) effective — averaging many trees whose errors are weakly correlated drives variance down while bias stays put.

Interactive · grow a tree on a 2D dataset

A small CART implementation in vanilla JS. Pick a dataset shape, then move the depth and min-leaf sliders. The canvas shows the data and the tree's decision boundary as a union of axis-aligned rectangles.

Decision boundary vs depth

Try depth=1 on the rings (one line; tree can't carve a circle). Then depth=4 (passable approximation, lots of stair-steps). Then depth=10 with min-leaf=1 (overfits — every training point becomes its own rectangle). Switch to XOR to see why linear models fail on it.

dataset: max depth: 4 min samples / leaf: 3

train accuracy

—

leaves

—

actual depth

—

fit time (ms)

—

Reading

—

Pros and cons, in interview-grade detail

Pros.

No feature scaling. Splits are scale-invariant; removes a preprocessing layer and a class of bugs.
Mixed types. Numerical and categorical features coexist; LightGBM and CatBoost handle categoricals natively (more below).
Non-linearity and interactions automatically. No manual x_j · x_k or polynomials.
Native missing-value handling. Surrogate splits (CART) or learned default directions (XGBoost, LightGBM).
Interpretable at small depths. Each root-to-leaf path is an auditable conjunction of rules.
Robust to irrelevant features. Noise columns are never chosen; cf. linear regression where every feature contributes (modulo shrinkage).

Cons.

High variance. A few changed training points can flip an early split, and the entire subtree below it changes. Single trees rarely ship.
No smoothness. Predictions jump discontinuously across split boundaries. For physical regressions or other smooth targets, you'll see staircase artefacts. Ensembles partially smooth this by averaging many staircases that disagree on the exact boundary.
Axis-aligned splits only. A diagonal boundary requires many staircase splits. Oblique trees (split on linear combinations) exist but are rare — they sacrifice the scale-invariance and missing-value properties that make trees nice.
Cannot extrapolate. A regression tree trained on house prices in [100k, 500k] will never predict 600k — outside training range, the prediction is the boundary leaf's mean. Linear models and NNs can extrapolate (sometimes wrongly, but they can). For trended time series, raw trees are a known footgun.
Bias toward high-cardinality features. A categorical with many levels has many candidate splits, so even random chance produces a "good-looking" gain. Mitigations: hash or aggregate, target-encode with smoothing, or use a library with per-feature gain penalties (LightGBM's cat_smooth, min_data_per_group).

The user_id-as-best-feature trap

Interviewer scenario: "Your tree's root split is on user_id, which has 1M distinct values. Train accuracy is 99%. What's wrong?" Answer: a high-cardinality categorical can shatter the training set — each user_id ends up in its own leaf, memorising its label. The model has zero generalisation. The fix is to remove the feature, hash it to a small bucket count, or replace it with an aggregate (target-encoded mean with smoothing, or learned embedding for use in a different model class).

Categorical features — the subtle part

Encoding	What the tree sees	When to use
One-hot	One binary column per level; each split is "level = X vs not X". Loses information about which subset of levels groups together.	Low-cardinality, no native support. Bloats feature count at high cardinality.
Ordinal (label) encode	One integer column; tree splits x ≤ t, imposing an arbitrary order.	Only with a meaningful order (`small < medium < large`). Otherwise capacity wasted learning around a fake order.
Native (LightGBM, CatBoost)	Tree considers partitions of the level set. Fisher 1958: the optimal partition for binary classification is found by sorting levels by mean target and scanning.	Whenever the library supports it. Optimal-by-construction for binary, sample-efficient even at high cardinality.

A categorical with k levels has 2^(k−1) − 1 non-trivial partitions — infeasible to enumerate. The Fisher trick reduces this to O(k log k). CatBoost adds ordered target statistics to avoid leakage on training data.

Trade-offs vs neighbouring models

Property	Decision tree	Linear regression	k-NN	Neural net
Captures non-linearity	Yes	No (needs features)	Yes	Yes
Captures interactions	Yes (free)	No (needs crosses)	Yes (local)	Yes
Scaling required	No	Yes	Yes	Yes
Interpretable	Yes (shallow)	Yes	Partial	No
Extrapolates	No	Yes	No	Sometimes
Train time	Fast	Very fast	None	Slow
Inference time	Very fast	Very fast	Slow	Medium
Variance (single model)	High	Low	Medium	Med–high
Sample efficiency	Medium	High	Low	Low

The cells where trees win — non-linearity, interactions, no scaling, mixed types, missing values, irrelevant-feature robustness — are the ones that matter for tabular. The cells where they lose — extrapolation, smoothness, high single-model variance — get addressed by ensembling.

Interview prompts you should be ready for

"Gini vs entropy — which do you pick?" (Either; on real data trees built with the two are nearly identical. Gini is slightly cheaper. CART picks Gini; ID3/C4.5 pick entropy. Saying one is materially better is wrong.)
"Why are decision trees high-variance?" (Greedy split selection is unstable near close-to-tied candidates; flipping the root split flips everything below. Deep trees additionally have leaves with very few samples, so leaf predictions are noisy means. This is what bagging fixes.)
"Walk me through how a tree handles a missing value in a feature it splits on." (Three options used in practice: surrogate splits — CART learns a backup feature whose split most agrees with the primary; default direction — XGBoost/LightGBM pick left or right by which assignment minimises training loss; imputation — fill before training, simplest but loses missingness information, which is often itself signal.)
"Your tree's first split is on user_id with cardinality 1M. What's going wrong?" (High-cardinality categorical bias: many possible splits → easy to find a good-looking one by chance. The split memorises rather than generalises. Fix: hash or aggregate the feature, use a library that penalises split count per categorical, or drop the feature.)
"A regression tree predicts house prices. What happens at a query point with feature values outside the training range?" (Tree returns the boundary leaf's mean — a constant. No extrapolation. For trended series detrend first, or use a model class that can extrapolate.)
"Why don't we use the optimal tree?" (NP-hard. Greedy is empirically close — within a few percent — and ensembling (lessons 5, 6) closes the rest of the gap. Industry has decided greedy + ensemble dominates exact + single. Optimal-tree research exists but is rarely competitive at production scale.)
"Tree boundaries are axis-aligned. What does that imply for a problem with a 45° decision boundary?" (Tree approximates with a staircase. Capacity is spent on depth that buys you the diagonal. Either rotate features (PCA, learned projections, oblique trees) or accept that you'll need a deeper tree / more trees. Linear or kernel models handle diagonals natively.)

Takeaway

A single decision tree is the scaffolding, not the building. Geometrically a tree is a piecewise-constant function on axis-aligned rectangles, built greedily because the optimal version is intractable, regularized by depth/leaf-size because the unrestricted version memorises. The reason it dominates tabular ML isn't that it's a great single model — it isn't — but that it has the right inductive biases (no scaling, mixed types, interactions, missing values, irrelevant-feature robustness) and averages beautifully into ensembles whose variance shrinks while bias stays put. The next two lessons (random forests, GBDT) are about how to do that averaging well.