Interpretability & feature importance
The capstone. Regulated industries still ship GBDTs and logistic regression because "explain this decision" remains a hard requirement. The senior signal is knowing which question you're answering: global or local, faithful or stable, correlational or causal.
Two definitions of "interpretable"
Before any method, separate two questions that stakeholders conflate:
| Flavor | Question | Easy for | Hard for |
|---|---|---|---|
| Global | How does the model work overall? Which features does it use, and how are they combined? | Linear/logistic regression, shallow trees, GAMs. | Ensembles (RF, GBDT), neural nets, anything with deep interactions. |
| Local | Why did the model output this prediction for this row? | Any model — via post-hoc attribution (SHAP, LIME, ICE). | Answers are approximate; the local linearization may not hold. |
Stakeholders usually want one, not both. A lender issuing an adverse-action notice to a denied applicant needs local. A risk officer asking "what is this model keying on?" wants global. The methods are different.
Intrinsically interpretable models
The cheapest path to interpretability is to pick a model whose structure is the explanation.
- Linear / logistic regression. Coefficient β_j tells you "a one-unit increase in x_j changes the output by β_j, others fixed" (for logistic regression the output is the log-odds, not the probability). With standardized features, magnitudes are directly comparable; with raw features they aren't.
- Shallow decision tree. Depth 3–5 is literally a flowchart. Each leaf is a rule. The trade-off is accuracy: a shallow tree usually loses several points of AUC to a GBDT on tabular data.
- Generalized Additive Models (GAMs). y = β_0 + Σ_j f_j(x_j). Each f_j is a 1-D function you can plot; no interactions, so the per-feature shape is the explanation. Microsoft's EBM is a GBDT-flavored GAM with optional explicit pairwise interactions f_{jk}(x_j, x_k).
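The point about raw versus standardized coefficients can be sketched in a few lines. This is a minimal illustration, assuming an unregularized linear model; the feature names, coefficients, and data are hypothetical. Multiplying a raw coefficient by its feature's standard deviation gives the coefficient you would get after standardizing that feature.

```python
import statistics

def standardized_coefs(coefs, columns):
    """Rescale raw coefficients to 'per one standard deviation' units."""
    return [b * statistics.pstdev(col) for b, col in zip(coefs, columns)]

# Hypothetical fitted model: y = 0.5 * age_years + 0.0001 * income_dollars
raw = [0.5, 0.0001]
age = [25, 35, 45, 55, 65]
income = [20_000, 40_000, 60_000, 80_000, 100_000]

std = standardized_coefs(raw, [age, income])
for name, r, s in zip(["age", "income"], raw, std):
    print(f"{name}: raw={r}, per-std={s:.3f}")
```

The raw coefficients differ by a factor of 5000, which mostly reflects units. On the per-standard-deviation scale the two features are within a factor of about 2.5 of each other.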
Post-hoc methods — what they all share
For models that aren't intrinsically interpretable (RF, GBDT, NN), you query the trained model after the fact. The two questions are the same as above, global and local, but there are many methods for answering them.
Two warnings: (1) different methods rank features differently — the disagreement problem, below; (2) every post-hoc method queries the model on points that may not exist in the training distribution.
Permutation feature importance
The simplest model-agnostic global method. For each feature x_j: score on a held-out set (s_0), shuffle column j across rows, score again (s_j), and report s_0 − s_j (for a higher-is-better score such as AUC; flip the sign for a loss such as MSE). Shuffling preserves the marginal distribution of x_j but destroys its joint distribution with the target and the other features. Average over several shuffles.
| Pro | Con |
|---|---|
| Model-agnostic; reuses your eval pipeline. | Correlated features mask each other. If x_1 ≈ x_2, shuffling x_1 barely hurts because x_2 still carries the signal. |
| Tied to a real metric (AUC, MSE) you already care about. | Shuffling creates out-of-distribution inputs (shuffled-height + real-weight pairs that never exist). |
| Cheap: O(features × eval cost). | Conditional permutation fixes the correlation issue but is expensive and finicky. |
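The procedure and the correlated-feature masking effect can be sketched as follows. This is a minimal stdlib sketch, not a library implementation. The two "models" are hypothetical hand-written functions, and because the metric is a loss (MSE), importance is reported as the increase in loss.

```python
import random

def mse(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled across rows."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            increases.append(mse(model, X_perm, y) - base)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical data: x1 is an exact copy of x0, x2 is pure noise, y = 2 * x0.
rng = random.Random(42)
X = []
for _ in range(300):
    a = rng.gauss(0, 1)
    X.append([a, a, rng.gauss(0, 1)])
y = [2.0 * row[0] for row in X]

def model_single(row):   # puts all the weight on x0
    return 2.0 * row[0]

def model_split(row):    # same predictions on real data, weight split across the copies
    return row[0] + row[1]

imp_single = permutation_importance(model_single, X, y)
imp_split = permutation_importance(model_split, X, y)
print("single:", [round(v, 3) for v in imp_single])
print("split: ", [round(v, 3) for v in imp_split])
```

Both models make identical predictions on the real rows, yet the split model reports a much smaller importance for x0 because the copy still carries half the signal when x0 is shuffled.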
Built-in tree importance (Gini / split-gain)
Every GBDT/RF library exposes a "feature_importances_" computed during training: sum the impurity reduction (Gini for classification, MSE for regression) across all nodes that split on the feature, weighted by samples reaching the node.
| Pro | Con |
|---|---|
| Free: computed as a side-effect of training. | Biased toward high-cardinality features. A feature with 1000 unique values has 999 candidate split points; one with 2 has 1. With that many tries, a high-cardinality feature can win a split by luck. |
| Captures the model's use of the feature in training. | Measures impurity reduction on the training set, not the magnitude of the resulting prediction change. A feature can be split on often yet contribute only small moves. |
| No extra compute. | Inconsistent: changing the model so that it relies more on feature A can decrease A's reported importance. TreeSHAP was partly motivated by fixing this (Lundberg et al. 2018). |
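The high-cardinality bias can be illustrated with a small simulation. This is a sketch under stated assumptions: labels and both features are pure noise, only the single best root split is scored, and the helper names are hypothetical rather than taken from any library.

```python
import random

def gini(pos, n):
    """Gini impurity of a binary node with `pos` positives out of `n` rows."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_gini_gain(x, y):
    """Largest impurity reduction over all thresholds between distinct values of x."""
    pairs = sorted(zip(x, y))
    n, total_pos = len(pairs), sum(y)
    parent = gini(total_pos, n)
    best, left_pos = 0.0, 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        left_n, right_n = i, n - i
        child = (left_n * gini(left_pos, left_n)
                 + right_n * gini(total_pos - left_pos, right_n)) / n
        best = max(best, parent - child)
    return best

rng = random.Random(0)
n_rows, n_trials = 100, 200
binary_gain = continuous_gain = 0.0
for _ in range(n_trials):
    y = [rng.randint(0, 1) for _ in range(n_rows)]        # labels unrelated to any feature
    x_binary = [rng.randint(0, 1) for _ in range(n_rows)]  # 1 candidate split
    x_continuous = [rng.random() for _ in range(n_rows)]   # up to 99 candidate splits
    binary_gain += best_gini_gain(x_binary, y) / n_trials
    continuous_gain += best_gini_gain(x_continuous, y) / n_trials

print(f"mean best gain, binary noise:     {binary_gain:.4f}")
print(f"mean best gain, continuous noise: {continuous_gain:.4f}")
```

Neither feature carries any signal, but the continuous one gets many more chances to find a lucky threshold, so its best gain is larger on average.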
Partial Dependence Plots (PDP) and ICE
PDP answers "what is the average prediction as x_j varies, marginalizing over everything else?" For a grid of values v:
PD_j(v) = (1/n) Σ_i f̂(x_i with x_ij set to v)
Replace column j with the constant v, score every row, average, and plot against v. This has the same OOD problem as permutation: the row (age=80, income=20k) becomes (age=20, income=20k), which may not exist.
Failure mode. If x_j interacts with x_k — positive for small x_k, negative for large — averaging hides this and PDP shows a flat line. You'd conclude "feature doesn't matter" when in fact it matters for everyone, in opposite directions.
ICE (Individual Conditional Expectation) fixes this: one line per row instead of averaging. If lines slope the same way, the effect is monotone; if they fan out, the feature interacts. ICE = PDP without averaging.
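The failure mode and the ICE fix can be reproduced on a toy case. This is a minimal sketch with a hypothetical hand-written model whose only term is an interaction. The data are chosen so that the second feature is balanced around zero.

```python
def model(row):
    # Hypothetical fitted model: the effect of x0 flips sign with x1.
    x0, x1 = row
    return x0 * x1

def ice_curves(model, X, j, grid):
    """One curve per row: prediction as feature j is forced to each grid value."""
    curves = []
    for row in X:
        curve = []
        for v in grid:
            modified = list(row)
            modified[j] = v
            curve.append(model(modified))
        curves.append(curve)
    return curves

def partial_dependence(model, X, j, grid):
    """PDP is the pointwise average of the ICE curves."""
    curves = ice_curves(model, X, j, grid)
    return [sum(c[k] for c in curves) / len(curves) for k in range(len(grid))]

X = [[0.2, -1.0], [0.8, -1.0], [0.5, 1.0], [0.1, 1.0]]
grid = [0.0, 0.5, 1.0]

pdp = partial_dependence(model, X, 0, grid)
ice = ice_curves(model, X, 0, grid)
print("PDP:", pdp)
for row, curve in zip(X, ice):
    print("ICE for row", row, "->", curve)
```

The PDP is flat at zero, while the ICE curves slope down for rows with x1 = −1 and up for rows with x1 = 1, which is exactly the interaction the average hides.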
LIME — local surrogate models
Ribeiro, Singh & Guestrin 2016. For a specific prediction f̂(x_0): sample perturbed inputs around x_0, score them with the black-box, fit a simple model (sparse linear or shallow tree) on the perturbed pairs weighted by proximity to x_0. The simple model's coefficients are the local explanation.
Intuition: a smooth model is approximately linear in a small enough neighborhood. LIME does not discover that neighborhood; you define it through the proximity kernel, and LIME fits a weighted line inside it.
Pros: model-agnostic; works for tabular, text (drop tokens), image (occlude superpixels). Cons: unstable — change the perturbation seed, get a different explanation. The kernel/neighborhood is itself a hyperparameter. In regulated settings the instability is disqualifying.
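The sample, score, weight, and fit loop can be sketched without any library. This is a simplified illustration of the idea, not the `lime` package's algorithm: it uses Gaussian perturbations, an exponential kernel, and an unregularized weighted least-squares fit, and the black-box function, `scale`, and `kernel_width` values are hypothetical.

```python
import math
import random

def black_box(x):
    # Hypothetical nonlinear model standing in for the black box.
    return x[0] ** 2 + 3.0 * x[1]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lime_explain(f, x0, n_samples=500, scale=0.5, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around x0; return its slopes."""
    rng = random.Random(seed)
    d = len(x0)
    # Accumulate the weighted normal equations for regressors z = [1, x - x0].
    A = [[0.0] * (d + 1) for _ in range(d + 1)]
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        y = f([x0[j] + delta[j] for j in range(d)])
        w = math.exp(-sum(t * t for t in delta) / kernel_width ** 2)
        z = [1.0] + delta
        for i in range(d + 1):
            b[i] += w * z[i] * y
            for j in range(d + 1):
                A[i][j] += w * z[i] * z[j]
    return solve(A, b)[1:]  # drop the intercept

slopes_a = lime_explain(black_box, [1.0, 0.0], seed=0)
slopes_b = lime_explain(black_box, [1.0, 0.0], seed=1)
print("seed 0:", [round(s, 3) for s in slopes_a])
print("seed 1:", [round(s, 3) for s in slopes_b])
```

Both runs land near the true local gradient (2, 3) at x0 = (1, 0), but not on the same numbers. Changing `kernel_width` or `scale` moves the answer as well, which is the hyperparameter sensitivity described above.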
SHAP — the dominant method
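Lundberg & Lee 2017. SHAP computes each feature's average marginal contribution to the prediction across all possible orderings of features being added to a "coalition". In subset form: φ_j(x) = Σ_{S ⊆ F∖{j}} [ |S|! (d − |S| − 1)! / d! ] · [ v_x(S ∪ {j}) − v_x(S) ], where F is the set of all d features and v_x(S) is the expected model output when only the features in S are fixed to their values in x.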
The reason SHAP became dominant is the axioms. Shapley values are the unique attribution satisfying:
- Efficiency. Σ_j φ_j(x) = f̂(x) − E[f̂(X)]. Attributions sum to the prediction's deviation from baseline.
- Symmetry. Two features contributing identically to every coalition get equal attribution.
- Dummy. A feature that never changes the prediction gets zero.
- Additivity. SHAP of an ensemble = sum of SHAPs of its components.
No other scheme satisfies all four. TreeSHAP (Lundberg et al. 2018) put SHAP in production: exact Shapley values for tree ensembles in O(TLD²) time instead of O(TL · 2^d), where T is the number of trees, L the maximum number of leaves per tree, D the maximum depth, and d the number of features. This is why every GBDT library ships a SHAP integration. For neural networks you fall back to DeepSHAP or KernelSHAP, both of which are approximate.
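For a handful of features, Shapley values can be computed by brute force, which makes the efficiency axiom easy to check. This is a minimal sketch, assuming an interventional value function (features outside the coalition are filled in from a background dataset). The model, background rows, and helper names are hypothetical, and the cost grows as 2^d, so it is only practical for tiny d.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical model with an interaction between features 0 and 1.
    return x[0] * x[1] + 2.0 * x[2]

def coalition_value(model, x, background, S):
    """v(S): mean prediction with features in S taken from x, the rest from background rows."""
    total = 0.0
    for b in background:
        mixed = [x[j] if j in S else b[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)

def shapley_values(model, x, background):
    d = len(x)
    phi = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        contrib = 0.0
        for size in range(d):
            # Weight of a coalition of this size that excludes feature j.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                S = set(S)
                gain = (coalition_value(model, x, background, S | {j})
                        - coalition_value(model, x, background, S))
                contrib += weight * gain
        phi.append(contrib)
    return phi

background = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.5], [1.0, 1.0, 0.5]]
x = [3.0, 2.0, 1.0]

phi = shapley_values(model, x, background)
baseline = sum(model(b) for b in background) / len(background)
print("phi:", [round(p, 4) for p in phi])
print("sum(phi):", round(sum(phi), 4), " f(x) - baseline:", model(x) - baseline)
```

The attributions sum to the prediction minus the background mean, as efficiency requires. The purely additive feature 2 gets 2 · (x_2 − mean of its background values), regardless of the interaction between the other two.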
| Pro | Con |
|---|---|
| Principled: unique under four reasonable axioms. | Still queries the model on points that may not exist, depending on the variant. TreeSHAP has two: tree_path_dependent (uses the tree's own training distribution at each split; stays on-manifold but conflates correlation with attribution) and interventional (uses a background dataset; can go off-manifold but preserves all four Shapley axioms cleanly). Most production tools default to tree_path_dependent. |
| Local and global from one machinery: mean \|φ_j\| across rows = global importance, consistent with per-row. | Fast for trees only; other models slow or approximate. |
| Sign matters: positive φ pushed the prediction up, negative pushed it down. | Easy to misinterpret as causal. SHAP attributes a prediction, not an outcome. |
SHAP for a linear model
For a linear model, φ_j(x) = β_j · (x_j − E[x_j]) is the exact Shapley value under feature independence. For correlated features there are two defensible SHAP variants, conditional and interventional, and they can differ. On a small 4-feature regression, plotting one row's φ_j as a waterfall shows how the prediction is built up from the baseline. Adding a correlated copy of feature 1 is a useful stress test: permutation importance gets fooled while SHAP stays sane.
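The closed form for linear models is short enough to compute by hand. This is a minimal sketch, assuming feature independence; the intercept, coefficients, and data are hypothetical.

```python
def linear_shap(coefs, x, feature_means):
    """phi_j = beta_j * (x_j - E[x_j]); exact when features are treated as independent."""
    return [b * (xj - m) for b, xj, m in zip(coefs, x, feature_means)]

# Hypothetical 4-feature linear model and dataset.
intercept = 1.0
coefs = [2.0, -1.0, 0.5, 0.0]
X = [[1.0, 2.0, 0.0, 5.0],
     [3.0, 0.0, 4.0, 1.0],
     [2.0, 4.0, 2.0, 3.0]]

def predict(row):
    return intercept + sum(b * v for b, v in zip(coefs, row))

means = [sum(col) / len(X) for col in zip(*X)]
baseline = sum(predict(r) for r in X) / len(X)   # E[f(X)] over the dataset

row = X[0]
phi = linear_shap(coefs, row, means)

# Text version of a waterfall: start at the baseline, add one attribution at a time.
running = baseline
print(f"baseline      {running:+.2f}")
for j, p in enumerate(phi):
    running += p
    print(f"feature {j}: {p:+.2f} -> {running:+.2f}")
print(f"prediction    {predict(row):+.2f}")
```

The running total ends at the model's prediction for the row, and the feature with a zero coefficient receives zero attribution, matching the efficiency and dummy axioms.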
The disagreement problem
Run permutation, TreeSHAP, and built-in split-gain on the same GBDT and the feature rankings will not match. Not a bug — each method answers a slightly different question.
| Method | Question it actually answers |
|---|---|
| Permutation importance | "How much worse does my eval loss get if I destroy this feature's signal?" |
| Built-in split-gain | "How much impurity did this feature reduce during training, summed across splits?" |
| Mean \|SHAP\| | "How much does this feature move the prediction from baseline, on average per row?" |
| PDP range | "How much does the average prediction swing as this feature varies?" |
| LIME coefficient (averaged) | "How much does a local linear surrogate weight this feature, averaged over rows?" |
Senior answer to "which one should I report?" — pick the method whose question matches what the stakeholder is asking. Loss when a feature pipeline breaks → permutation. Explaining a single decision → SHAP. Debugging tree structure → split-gain. "What should I change?" → none of them.
Interpretability ≠ causal inference
SHAP says "feature X contributed +0.3 to this prediction." It does not say "if you changed X in the world, the outcome would change by 0.3." Two reasons:
- The model is correlational. If x_j is a proxy for an unobserved confounder, SHAP credits the proxy. Intervening on x_j doesn't move the confounder.
- SHAP attributes the trained model, not the world. SHAP for "shoe size" predicting reading ability in children will be positive — older kids read better and have bigger feet. Buying bigger shoes does not improve reading.
Causal questions need experimental data or causal-inference machinery (DAGs, do-calculus, IV, matching). The senior signal is calling this out unprompted whenever someone asks "what should we change to flip the decision?"
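The shoe-size example can be simulated to show the gap between association and intervention. This is a sketch with an entirely hypothetical data-generating process in which age drives both shoe size and reading score, and shoe size has no effect on reading.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

rng = random.Random(0)
age = [rng.uniform(5, 12) for _ in range(2000)]
# Hypothetical mechanism: age is the common cause of both variables.
shoe = [0.8 * a + rng.gauss(0, 0.5) for a in age]
reading = [10.0 * a + rng.gauss(0, 5.0) for a in age]

# What a model trained on observational data sees.
observed_slope = ols_slope(shoe, reading)

# Intervention: hand out shoe sizes at random; age, and therefore reading, is untouched.
shoe_do = [rng.uniform(4, 10) for _ in age]
interventional_slope = ols_slope(shoe_do, reading)

print(f"observational slope:  {observed_slope:.2f}")
print(f"interventional slope: {interventional_slope:.2f}")
```

A model fit on the observational data would rely on shoe size, and any attribution method would credit it, yet the slope under intervention is near zero.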
When does the stakeholder actually need interpretability?
| Driver | Flavor | Method |
|---|---|---|
| Regulation (GDPR Art. 22, ECOA, FCRA adverse-action notices) | Local | SHAP per row, or coefficients if linear. Stable + reproducible matters more than minimum-variance. |
| Debugging ("the model is doing something weird") | Global | Permutation + mean \|SHAP\| + PDP/ALE on suspected features. Look for features that shouldn't matter and do. |
| Trust / adoption | Global | Ship the simplest model that meets accuracy. EBM/GAM is often right. A small AUC tax buys a year of deployment velocity. |
| Fairness audit | Both | SHAP segmented by demographic + group metrics (TPR parity, calibration parity). Detection, not remediation. |
| Causal action | Neither | Push back. A/B test the intervention or commission a causal study. |
The trade-off table you should have memorized
| | Linear coef | Shallow tree | TreeSHAP / GBDT | LIME | Permutation |
|---|---|---|---|---|---|
| Faithfulness | Exact | Exact | Exact for trees | Local approximation | Real metric, corrupts joint distribution |
| Model-agnostic | No | No | Trees only | Yes | Yes |
| Local / global | Both | Both | Both | Local only | Global only |
| Scalability | Trivial | Trivial | Polynomial in tree size | Slow (sample + fit per row) | O(features × eval) |
| Stability | High | High | High (deterministic) | Low (random perturbations) | Medium (MC over shuffles) |
| Handles correlation | OK if regularized | OK | Splits credit between correlated | Poor | Poor — features mask each other |
| Causal? | No | No | No | No | No |
Interview prompts you should be ready for
- "Walk me through SHAP. Why Shapley values?" (The four axioms. Efficiency: attributions sum to prediction − baseline. Symmetry, dummy, additivity. Shapley is the unique attribution that satisfies all four. TreeSHAP made it tractable for the models people actually ship.)
- "Permutation importance vs SHAP — when do they disagree?" (Permutation answers a loss-based question, SHAP a prediction-based one. They disagree when a feature moves predictions a lot but those moves don't help held-out loss, e.g., a feature the model has overfit to, which gets a large mean |SHAP| but near-zero permutation importance. They also disagree on correlated features: under permutation the features mask each other, so each looks less important than the shared signal really is, while SHAP splits the credit between them.)
- "Your stakeholders want to know 'what to change to flip the decision.' What's your concern with answering from SHAP?" (SHAP is correlational. A feature with high SHAP may be a proxy for an unobserved confounder, and intervening on it does nothing. For counterfactual advice you need a verified causal structure or an experiment.)
- "Why doesn't a tree's built-in feature importance match permutation importance?" (Three reasons: built-in is biased toward high-cardinality features; it measures training-set impurity reduction, not the size of the resulting prediction change; and permutation is computed on held-out data while built-in is computed on training data. They're answering different questions.)
- "You have a GBDT at AUC 0.83 and a logistic regression at AUC 0.81. Pick one for a regulated lending product." (LR. The 2-point gap rarely outweighs the cost of building adverse-action notices, monitoring, and regulator review for a black box. The exception is when those 2 points represent millions in NPV — in which case ship the GBDT with TreeSHAP and budget for the explanation infrastructure.)
- "Your SHAP for feature X is +0.3 on this row. What does that actually mean?" ("Holding the data distribution fixed, this feature's value contributed +0.3 to this prediction relative to the baseline E[ŷ]. It does not mean changing X by one unit changes the prediction by 0.3, and it does not mean changing X in the real world changes the outcome.")
- "How do you handle the disagreement between three different feature-importance methods?" (Don't pick the one with the prettiest answer. Identify which question the stakeholder is asking — loss impact, prediction impact, or training behavior — and use the method that answers it. Report the others as sensitivity checks, not as ground truth.)