traditional_ml / 04 · decision trees lesson 4 / 12

Decision trees

The single most important model class for tabular data. Not because one tree is great — it isn't — but because every ensemble that crushes tabular benchmarks (random forests, GBDTs) is a pile of trees. Understand the tree, and lessons 5 and 6 collapse to "stack them, but cleverly".

What a tree is, geometrically

A decision tree is a recursive, axis-aligned partition of feature space. Every internal node tests x_j ≤ t and routes the sample left or right. Leaves are constants — the mean of training labels in that leaf for regression, the majority class (or class-frequency vector for probabilities) for classification. The whole model is

f̂(x) = Σ_ℓ c_ℓ · 1{x ∈ R_ℓ}

where {R_ℓ} are the rectangles (leaves) and c_ℓ the leaf constants. Every prediction is a lookup: route x down the tree, return the leaf value. There is no continuous function being fitted, just a piecewise-constant tessellation.

x2 │ ────┼───────────────── split: x1 ≤ 4 │ │ │ R3 │ R4 │ (mean=0.71) │ (mean=0.12) │ │ ────┼──────────────────┼──── split: x2 ≤ 2 (right side only) │ R1 │ R2 │ (mean=0.05) │ (mean=0.88) │ │ └──────────────────┴──────────► x1 4

Why this representation is powerful

Linear models need you to engineer the non-linearity; neural nets need a lot of data and careful normalisation. A tree just works.

Greedy splitting — and why we live with suboptimality

Finding the globally-optimal tree is NP-hard (Hyafil & Rivest 1976). The partition space is combinatorial in features and thresholds per feature. No one searches it exhaustively.

Instead, trees are grown greedily, top-down: at each node, scan every (feature, threshold) pair and pick the one that maximally reduces a chosen impurity measure. Recurse on each child. Never backtrack. With presorted features the cost at each node is O(d · n log n); total build cost is roughly O(d · n log n · depth).

Why everyone uses greedy despite suboptimality
The gap from greedy to optimal is empirically small, and the boosting frame (lesson 6) erases what's left by fitting subsequent trees to the residuals of the greedy ones. Industry has effectively decided that greedy + ensemble dominates exact + single. Optimal-tree research (branch-and-bound, MIP) is interesting but rarely competitive at scale.

Splitting criteria — classification

For K classes with node proportions p_1, …, p_K:

ImpurityFormulaInterpretation
Gini G = Σ_k p_k (1 − p_k) = 1 − Σ_k p_k² Probability that a random pair of items in the node have different labels. Zero iff pure.
Entropy H = − Σ_k p_k log p_k Bits/sample to encode the label given the node. Zero iff pure, maximised at uniform.
Classification error 1 − max_k p_k Used for pruning, rarely for growing — too coarse, ignores minority-class mass.

Split quality is the information gain: parent impurity minus the weighted sum of child impurities.

Δ = I(parent) − (n_L / n) · I(left) − (n_R / n) · I(right)

Gini and entropy are both concave, maximised at p_k = 1/K, zero at one-hot proportions, and produce nearly identical trees on real data. Gini is slightly cheaper (no log). CART defaults to Gini; ID3/C4.5 use entropy. The choice is more religion than science — pretending one is materially better is a tell.

Splitting criterion — regression

The default impurity is within-node variance:

I(node) = (1/n) Σ_i (y_i − ȳ)²

Maximising gain is equivalently minimising within-child squared error, which is why regression-tree leaves predict the mean. Alternatives: MAE (median leaves, outlier-robust, slower), Poisson deviance, Huber. Most libraries default to MSE — fast to update incrementally as the threshold slides, and pairs cleanly with GBDT's residual-fitting interpretation.

The CART recipe

Breiman et al. 1984: grow greedily to maximum size, then prune back using cost-complexity pruning:

L_α(T) = err(T) + α · |T|

where |T| is the leaf count and α ≥ 0 is the complexity penalty. Sweep α from zero upwards; at each value an entire subtree collapses to its root, generating a nested sequence of candidate trees. Pick α by cross-validation.

Modern libraries (scikit-learn, XGBoost, LightGBM) usually skip explicit pruning and regularize by not growing: max_depth, min_samples_leaf, min_impurity_decrease, max_leaf_nodes. Roughly equivalent, avoids the second pass. CCP is still in scikit-learn under ccp_alpha, but rarely the chosen knob in production.

The bias-variance lever — and why one tree is a weak learner

DepthBiasVarianceWhat happens
Shallow (1–3)HighLowToo few rectangles to represent the function; train and val errors both high.
ModerateMidMidSweet spot if you must ship a single tree. Rare — you'd ensemble.
Deep (≈1 sample/leaf)Zero on trainHugeMemorises the training set; train error 0, val error wild. Boundaries hug individual points.

A fully-grown tree is a high-variance model: greedy split-search is unstable, so a small training-set perturbation can flip an early split and every downstream split changes too. This instability is exactly what makes bagging (lesson 5) and boosting (lesson 6) effective — averaging many trees whose errors are weakly correlated drives variance down while bias stays put.

Interactive · grow a tree on a 2D dataset

A small CART implementation in vanilla JS. Pick a dataset shape, then move the depth and min-leaf sliders. The canvas shows the data and the tree's decision boundary as a union of axis-aligned rectangles.

Decision boundary vs depth
Try depth=1 on the rings (one line; tree can't carve a circle). Then depth=4 (passable approximation, lots of stair-steps). Then depth=10 with min-leaf=1 (overfits — every training point becomes its own rectangle). Switch to XOR to see why linear models fail on it.
train accuracy
leaves
actual depth
fit time (ms)
Reading

Pros and cons, in interview-grade detail

Pros.

Cons.

The user_id-as-best-feature trap
Interviewer scenario: "Your tree's root split is on user_id, which has 1M distinct values. Train accuracy is 99%. What's wrong?" Answer: a high-cardinality categorical can shatter the training set — each user_id ends up in its own leaf, memorising its label. The model has zero generalisation. The fix is to remove the feature, hash it to a small bucket count, or replace it with an aggregate (target-encoded mean with smoothing, or learned embedding for use in a different model class).

Categorical features — the subtle part

EncodingWhat the tree seesWhen to use
One-hot One binary column per level; each split is "level = X vs not X". Loses information about which subset of levels groups together. Low-cardinality, no native support. Bloats feature count at high cardinality.
Ordinal (label) encode One integer column; tree splits x ≤ t, imposing an arbitrary order. Only with a meaningful order (small < medium < large). Otherwise capacity wasted learning around a fake order.
Native (LightGBM, CatBoost) Tree considers partitions of the level set. Fisher 1958: the optimal partition for binary classification is found by sorting levels by mean target and scanning. Whenever the library supports it. Optimal-by-construction for binary, sample-efficient even at high cardinality.

A categorical with k levels has 2^(k−1) − 1 non-trivial partitions — infeasible to enumerate. The Fisher trick reduces this to O(k log k). CatBoost adds ordered target statistics to avoid leakage on training data.

Trade-offs vs neighbouring models

PropertyDecision treeLinear regressionk-NNNeural net
Captures non-linearityYesNo (needs features)YesYes
Captures interactionsYes (free)No (needs crosses)Yes (local)Yes
Scaling requiredNoYesYesYes
InterpretableYes (shallow)YesPartialNo
ExtrapolatesNoYesNoSometimes
Train timeFastVery fastNoneSlow
Inference timeVery fastVery fastSlowMedium
Variance (single model)HighLowMediumMed–high
Sample efficiencyMediumHighLowLow

The cells where trees win — non-linearity, interactions, no scaling, mixed types, missing values, irrelevant-feature robustness — are the ones that matter for tabular. The cells where they lose — extrapolation, smoothness, high single-model variance — get addressed by ensembling.

Interview prompts you should be ready for

  1. "Gini vs entropy — which do you pick?" (Either; on real data trees built with the two are nearly identical. Gini is slightly cheaper. CART picks Gini; ID3/C4.5 pick entropy. Saying one is materially better is wrong.)
  2. "Why are decision trees high-variance?" (Greedy split selection is unstable near close-to-tied candidates; flipping the root split flips everything below. Deep trees additionally have leaves with very few samples, so leaf predictions are noisy means. This is what bagging fixes.)
  3. "Walk me through how a tree handles a missing value in a feature it splits on." (Three options used in practice: surrogate splits — CART learns a backup feature whose split most agrees with the primary; default direction — XGBoost/LightGBM pick left or right by which assignment minimises training loss; imputation — fill before training, simplest but loses missingness information, which is often itself signal.)
  4. "Your tree's first split is on user_id with cardinality 1M. What's going wrong?" (High-cardinality categorical bias: many possible splits → easy to find a good-looking one by chance. The split memorises rather than generalises. Fix: hash or aggregate the feature, use a library that penalises split count per categorical, or drop the feature.)
  5. "A regression tree predicts house prices. What happens at a query point with feature values outside the training range?" (Tree returns the boundary leaf's mean — a constant. No extrapolation. For trended series detrend first, or use a model class that can extrapolate.)
  6. "Why don't we use the optimal tree?" (NP-hard. Greedy is empirically close — within a few percent — and ensembling (lessons 5, 6) closes the rest of the gap. Industry has decided greedy + ensemble dominates exact + single. Optimal-tree research exists but is rarely competitive at production scale.)
  7. "Tree boundaries are axis-aligned. What does that imply for a problem with a 45° decision boundary?" (Tree approximates with a staircase. Capacity is spent on depth that buys you the diagonal. Either rotate features (PCA, learned projections, oblique trees) or accept that you'll need a deeper tree / more trees. Linear or kernel models handle diagonals natively.)
Takeaway
A single decision tree is the scaffolding, not the building. Geometrically a tree is a piecewise-constant function on axis-aligned rectangles, built greedily because the optimal version is intractable, regularized by depth/leaf-size because the unrestricted version memorises. The reason it dominates tabular ML isn't that it's a great single model — it isn't — but that it has the right inductive biases (no scaling, mixed types, interactions, missing values, irrelevant-feature robustness) and averages beautifully into ensembles whose variance shrinks while bias stays put. The next two lessons (random forests, GBDT) are about how to do that averaging well.