Decision trees
The single most important model class for tabular data. Not because one tree is great — it isn't — but because every ensemble that crushes tabular benchmarks (random forests, GBDTs) is a pile of trees. Understand the tree, and lessons 5 and 6 collapse to "stack them, but cleverly".
What a tree is, geometrically
A decision tree is a recursive, axis-aligned partition of feature space. Every internal node tests x_j ≤ t and routes the sample left or right. Leaves are constants — the mean of training labels in that leaf for regression, the majority class (or class-frequency vector for probabilities) for classification. The whole model is
where {R_ℓ} are the rectangles (leaves) and c_ℓ the leaf constants. Every prediction is a lookup: route x down the tree, return the leaf value. There is no continuous function being fitted, just a piecewise-constant tessellation.
Why this representation is powerful
- Non-linear by construction. Any measurable function on a compact set can be approximated arbitrarily well by enough rectangles. Capacity is unbounded.
- Interactions for free. A split on x_j followed by a split on x_k down one branch is the interaction x_j · x_k. No feature crosses to engineer.
- Scale-invariant. Splits depend on order, not magnitude. Standardising, log-transforming, Box-Cox — none of it changes the tree.
- Mixed types. Numeric and categorical features coexist (with subtleties in the categorical section below).
- Missing values are first-class. Surrogate splits (CART) or learned default directions (XGBoost, LightGBM) route NaNs without imputation.
- Robust to irrelevant features. A useless column is simply never chosen as a split.
- Interpretable at small depths. A depth-3 tree is a flowchart you can read aloud.
Linear models need you to engineer the non-linearity; neural nets need a lot of data and careful normalisation. A tree just works.
Greedy splitting — and why we live with suboptimality
Finding the globally-optimal tree is NP-hard (Hyafil & Rivest 1976). The partition space is combinatorial in features and thresholds per feature. No one searches it exhaustively.
Instead, trees are grown greedily, top-down: at each node, scan every (feature, threshold) pair and pick the one that maximally reduces a chosen impurity measure. Recurse on each child. Never backtrack. With presorted features the cost at each node is O(d · n log n); total build cost is roughly O(d · n log n · depth).
Splitting criteria — classification
For K classes with node proportions p_1, …, p_K:
| Impurity | Formula | Interpretation |
|---|---|---|
| Gini | G = Σ_k p_k (1 − p_k) = 1 − Σ_k p_k² | Probability that a random pair of items in the node have different labels. Zero iff pure. |
| Entropy | H = − Σ_k p_k log p_k | Bits/sample to encode the label given the node. Zero iff pure, maximised at uniform. |
| Classification error | 1 − max_k p_k | Used for pruning, rarely for growing — too coarse, ignores minority-class mass. |
Split quality is the information gain: parent impurity minus the weighted sum of child impurities.
Gini and entropy are both concave, maximised at p_k = 1/K, zero at one-hot proportions, and produce nearly identical trees on real data. Gini is slightly cheaper (no log). CART defaults to Gini; ID3/C4.5 use entropy. The choice is more religion than science — pretending one is materially better is a tell.
Splitting criterion — regression
The default impurity is within-node variance:
Maximising gain is equivalently minimising within-child squared error, which is why regression-tree leaves predict the mean. Alternatives: MAE (median leaves, outlier-robust, slower), Poisson deviance, Huber. Most libraries default to MSE — fast to update incrementally as the threshold slides, and pairs cleanly with GBDT's residual-fitting interpretation.
The CART recipe
Breiman et al. 1984: grow greedily to maximum size, then prune back using cost-complexity pruning:
where |T| is the leaf count and α ≥ 0 is the complexity penalty. Sweep α from zero upwards; at each value an entire subtree collapses to its root, generating a nested sequence of candidate trees. Pick α by cross-validation.
Modern libraries (scikit-learn, XGBoost, LightGBM) usually skip explicit pruning and regularize by not growing: max_depth, min_samples_leaf, min_impurity_decrease, max_leaf_nodes. Roughly equivalent, avoids the second pass. CCP is still in scikit-learn under ccp_alpha, but rarely the chosen knob in production.
The bias-variance lever — and why one tree is a weak learner
| Depth | Bias | Variance | What happens |
|---|---|---|---|
| Shallow (1–3) | High | Low | Too few rectangles to represent the function; train and val errors both high. |
| Moderate | Mid | Mid | Sweet spot if you must ship a single tree. Rare — you'd ensemble. |
| Deep (≈1 sample/leaf) | Zero on train | Huge | Memorises the training set; train error 0, val error wild. Boundaries hug individual points. |
A fully-grown tree is a high-variance model: greedy split-search is unstable, so a small training-set perturbation can flip an early split and every downstream split changes too. This instability is exactly what makes bagging (lesson 5) and boosting (lesson 6) effective — averaging many trees whose errors are weakly correlated drives variance down while bias stays put.
Interactive · grow a tree on a 2D dataset
A small CART implementation in vanilla JS. Pick a dataset shape, then move the depth and min-leaf sliders. The canvas shows the data and the tree's decision boundary as a union of axis-aligned rectangles.
Pros and cons, in interview-grade detail
Pros.
- No feature scaling. Splits are scale-invariant; removes a preprocessing layer and a class of bugs.
- Mixed types. Numerical and categorical features coexist; LightGBM and CatBoost handle categoricals natively (more below).
- Non-linearity and interactions automatically. No manual x_j · x_k or polynomials.
- Native missing-value handling. Surrogate splits (CART) or learned default directions (XGBoost, LightGBM).
- Interpretable at small depths. Each root-to-leaf path is an auditable conjunction of rules.
- Robust to irrelevant features. Noise columns are never chosen; cf. linear regression where every feature contributes (modulo shrinkage).
Cons.
- High variance. A few changed training points can flip an early split, and the entire subtree below it changes. Single trees rarely ship.
- No smoothness. Predictions jump discontinuously across split boundaries. For physical regressions or other smooth targets, you'll see staircase artefacts. Ensembles partially smooth this by averaging many staircases that disagree on the exact boundary.
- Axis-aligned splits only. A diagonal boundary requires many staircase splits. Oblique trees (split on linear combinations) exist but are rare — they sacrifice the scale-invariance and missing-value properties that make trees nice.
- Cannot extrapolate. A regression tree trained on house prices in [100k, 500k] will never predict 600k — outside training range, the prediction is the boundary leaf's mean. Linear models and NNs can extrapolate (sometimes wrongly, but they can). For trended time series, raw trees are a known footgun.
- Bias toward high-cardinality features. A categorical with many levels has many candidate splits, so even random chance produces a "good-looking" gain. Mitigations: hash or aggregate, target-encode with smoothing, or use a library with per-feature gain penalties (LightGBM's
cat_smooth,min_data_per_group).
user_id, which has 1M distinct values. Train accuracy is 99%. What's wrong?" Answer: a high-cardinality categorical can shatter the training set — each user_id ends up in its own leaf, memorising its label. The model has zero generalisation. The fix is to remove the feature, hash it to a small bucket count, or replace it with an aggregate (target-encoded mean with smoothing, or learned embedding for use in a different model class).
Categorical features — the subtle part
| Encoding | What the tree sees | When to use |
|---|---|---|
| One-hot | One binary column per level; each split is "level = X vs not X". Loses information about which subset of levels groups together. | Low-cardinality, no native support. Bloats feature count at high cardinality. |
| Ordinal (label) encode | One integer column; tree splits x ≤ t, imposing an arbitrary order. | Only with a meaningful order (small < medium < large). Otherwise capacity wasted learning around a fake order. |
| Native (LightGBM, CatBoost) | Tree considers partitions of the level set. Fisher 1958: the optimal partition for binary classification is found by sorting levels by mean target and scanning. | Whenever the library supports it. Optimal-by-construction for binary, sample-efficient even at high cardinality. |
A categorical with k levels has 2^(k−1) − 1 non-trivial partitions — infeasible to enumerate. The Fisher trick reduces this to O(k log k). CatBoost adds ordered target statistics to avoid leakage on training data.
Trade-offs vs neighbouring models
| Property | Decision tree | Linear regression | k-NN | Neural net |
|---|---|---|---|---|
| Captures non-linearity | Yes | No (needs features) | Yes | Yes |
| Captures interactions | Yes (free) | No (needs crosses) | Yes (local) | Yes |
| Scaling required | No | Yes | Yes | Yes |
| Interpretable | Yes (shallow) | Yes | Partial | No |
| Extrapolates | No | Yes | No | Sometimes |
| Train time | Fast | Very fast | None | Slow |
| Inference time | Very fast | Very fast | Slow | Medium |
| Variance (single model) | High | Low | Medium | Med–high |
| Sample efficiency | Medium | High | Low | Low |
The cells where trees win — non-linearity, interactions, no scaling, mixed types, missing values, irrelevant-feature robustness — are the ones that matter for tabular. The cells where they lose — extrapolation, smoothness, high single-model variance — get addressed by ensembling.
Interview prompts you should be ready for
- "Gini vs entropy — which do you pick?" (Either; on real data trees built with the two are nearly identical. Gini is slightly cheaper. CART picks Gini; ID3/C4.5 pick entropy. Saying one is materially better is wrong.)
- "Why are decision trees high-variance?" (Greedy split selection is unstable near close-to-tied candidates; flipping the root split flips everything below. Deep trees additionally have leaves with very few samples, so leaf predictions are noisy means. This is what bagging fixes.)
- "Walk me through how a tree handles a missing value in a feature it splits on." (Three options used in practice: surrogate splits — CART learns a backup feature whose split most agrees with the primary; default direction — XGBoost/LightGBM pick left or right by which assignment minimises training loss; imputation — fill before training, simplest but loses missingness information, which is often itself signal.)
- "Your tree's first split is on
user_idwith cardinality 1M. What's going wrong?" (High-cardinality categorical bias: many possible splits → easy to find a good-looking one by chance. The split memorises rather than generalises. Fix: hash or aggregate the feature, use a library that penalises split count per categorical, or drop the feature.) - "A regression tree predicts house prices. What happens at a query point with feature values outside the training range?" (Tree returns the boundary leaf's mean — a constant. No extrapolation. For trended series detrend first, or use a model class that can extrapolate.)
- "Why don't we use the optimal tree?" (NP-hard. Greedy is empirically close — within a few percent — and ensembling (lessons 5, 6) closes the rest of the gap. Industry has decided greedy + ensemble dominates exact + single. Optimal-tree research exists but is rarely competitive at production scale.)
- "Tree boundaries are axis-aligned. What does that imply for a problem with a 45° decision boundary?" (Tree approximates with a staircase. Capacity is spent on depth that buys you the diagonal. Either rotate features (PCA, learned projections, oblique trees) or accept that you'll need a deeper tree / more trees. Linear or kernel models handle diagonals natively.)