Interpretability & feature importance

The capstone. Regulated industries still ship GBDTs and logistic regression because "explain this decision" remains a hard requirement. The senior signal is knowing which question you're answering: global or local, faithful or stable, correlational or causal.

Two definitions of "interpretable"

Before any method, separate two questions that stakeholders conflate:

Flavor	Question	Easy for	Hard for
Global	How does the model work overall? Which features does it use, and how are they combined?	Linear/logistic regression, shallow trees, GAMs.	Ensembles (RF, GBDT), neural nets, anything with deep interactions.
Local	Why did the model output this prediction for this row?	Any model — via post-hoc attribution (SHAP, LIME, ICE).	Answers are approximate; the local linearization may not hold.

Stakeholders usually want one, not both. A lender issuing the adverse-action notice a regulator requires for a denied applicant wants local. A risk officer asking "what is this model keying on?" wants global. The methods are different.

Intrinsically interpretable models

The cheapest path to interpretability is to pick a model whose structure is the explanation.

Linear / logistic regression. Coefficient β_j tells you "a one-unit increase in x_j changes the output by β_j, others fixed"; for logistic regression the output is the log-odds, so the odds multiply by exp(β_j). With standardized features, magnitudes are directly comparable; with raw features they aren't.
Shallow decision tree. Depth 3–5 is literally a flowchart. Each leaf is a rule. The trade-off is accuracy: a shallow tree usually loses several points of AUC to a GBDT on tabular data.
Generalized Additive Models (GAMs). y = β_0 + Σ_j f_j(x_j) (for classification, the left side is the log-odds). Each f_j is a 1-D function you can plot; with no interactions, the per-feature shape is the explanation. Microsoft's EBM (Explainable Boosting Machine) is a GBDT-flavored GAM with optional explicit pairwise interactions f_{jk}(x_j, x_k).
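The coefficient reading above can be checked numerically. A minimal sketch, assuming a hypothetical fitted logistic regression with two standardized features (names and numbers are made up): a one-unit bump moves the log-odds by exactly β_j and multiplies the odds by exp(β_j), while the change in probability depends on where the row starts.

```python
import math

# Hypothetical fitted logistic regression on two standardized features (made-up numbers).
INTERCEPT = -1.5
COEFS = {"income_std": 0.8, "debt_ratio_std": -1.2}


def log_odds(row):
    """Linear predictor: intercept + sum of coefficient * feature value."""
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)


def probability(row):
    """Predicted probability: the sigmoid of the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(row)))


def odds(row):
    """Odds p / (1 - p) of the predicted probability."""
    p = probability(row)
    return p / (1.0 - p)


def plus_one(row, name):
    """Copy of row with one feature increased by one unit, all others fixed."""
    return {**row, name: row[name] + 1.0}


base = {"income_std": 0.3, "debt_ratio_std": 0.5}
far = {"income_std": 2.0, "debt_ratio_std": 0.0}

delta_log_odds = log_odds(plus_one(base, "income_std")) - log_odds(base)    # equals beta = 0.8
odds_ratio = odds(plus_one(base, "income_std")) / odds(base)                # equals exp(0.8)
delta_prob = probability(plus_one(base, "income_std")) - probability(base)
delta_prob_far = probability(plus_one(far, "income_std")) - probability(far)  # differs from delta_prob
```

This is why "β_j per unit" is exact on the log-odds scale but only approximate, and row-dependent, on the probability scale.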

The first sentence of every interpretability answer

"If interpretability is a hard requirement, pick a model that is interpretable. Don't ship a black box and then try to explain it." Juniors immediately reach for SHAP; seniors first ask whether the accuracy gap justifies the explanation tax.

Post-hoc methods — what they all share

For models that aren't intrinsically interpretable (RF, GBDT, NN), you query the trained model after the fact. There are two questions and many methods:

Global: "which features matter overall?"	Local: "what drove this prediction?"
Permutation importance	LIME
Built-in tree split-gain	SHAP (per row)
Mean |SHAP| across rows	ICE (one row's curve)
Partial dependence (PDP)	Gradient × input (for NNs)
ICE (averaged)	Counterfactuals

Two warnings: (1) different methods rank features differently — the disagreement problem, below; (2) every post-hoc method queries the model on points that may not exist in the training distribution.

Permutation feature importance

The simplest model-agnostic global method. For each feature x_j: score the model on a held-out set (s_0), shuffle column j across rows, score again (s_j), and report s_0 − s_j (for a higher-is-better score such as AUC; flip the sign for a loss such as MSE). Shuffling preserves the marginal distribution of x_j but destroys its joint distribution with the target and with the other features. Average over several shuffles.
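The procedure above fits in a few lines. A minimal sketch in plain Python, assuming rows are lists of numbers, `model` is any callable returning a prediction, and the score is negative MSE so that higher is better and s_0 − s_j comes out positive for a useful feature. The toy model and data are hypothetical; scikit-learn's `sklearn.inspection.permutation_importance` packages the same idea for fitted estimators.

```python
import random
from statistics import mean


def neg_mse(model, X, y):
    """Higher-is-better score: negative mean squared error."""
    return -mean((model(row) - target) ** 2 for row, target in zip(X, y))


def permutation_importance(model, X, y, score, n_repeats=5, seed=0):
    """For each column j: s_0 - s_j, averaged over n_repeats shuffles of column j."""
    rng = random.Random(seed)
    s0 = score(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # keeps x_j's marginal, breaks its link to y and other columns
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(s0 - score(model, X_perm, y))
        importances.append(mean(drops))
    return importances


def toy_model(row):
    """Stands in for a trained model: uses column 0 only, ignores column 1."""
    return 3.0 * row[0]


data_rng = random.Random(42)
X_val = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
y_val = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X_val]
importances = permutation_importance(toy_model, X_val, y_val, neg_mse)
```

Column 1 scores exactly zero because the toy model never reads it; column 0 loses roughly all of its signal once shuffled.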

Pro	Con
Model-agnostic; reuses your eval pipeline.	Correlated features mask each other. If x_1 ≈ x_2, shuffling x_1 barely hurts because x_2 still carries the signal.
Tied to a real metric (AUC, MSE) you already care about.	Shuffling creates out-of-distribution inputs (shuffled-height + real-weight pairs that never exist).
Cheap: O(features × repeats × eval cost).	Conditional permutation addresses the correlation issue but is expensive and finicky.

Built-in tree importance (Gini / split-gain)

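Most tree-ensemble libraries expose a built-in importance computed during training (scikit-learn calls it feature_importances_): sum the impurity reduction across all nodes that split on the feature, weighted by the fraction of samples reaching the node. The impurity is Gini (or entropy) for classification trees and squared error or a loss-based gain for regression trees, including the trees inside a GBDT.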
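A minimal sketch of the bookkeeping for binary labels, assuming Gini impurity and the "weighted by samples reaching the node" convention described above; the helper names are hypothetical, not any library's API. The second half runs a small noise experiment: neither feature carries any signal, yet the many-valued feature usually finds a larger best-split gain simply because it has more thresholds to try.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def weighted_gain(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of training rows reaching the node."""
    n = len(parent)
    child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - child_impurity)


def best_split_gain(values, labels):
    """Largest gain over every threshold between consecutive distinct values of one feature."""
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        best = max(best, weighted_gain(labels, left, right, len(labels)))
    return best


def noise_trial(rng, n=100):
    """Both features are pure noise; only their number of distinct values differs."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    many_values = [rng.random() for _ in range(n)]      # ~n distinct values (ID-like)
    two_values = [rng.randint(0, 1) for _ in range(n)]  # binary
    return best_split_gain(many_values, labels), best_split_gain(two_values, labels)


rng = random.Random(0)
trials = [noise_trial(rng) for _ in range(20)]
high_cardinality_wins = sum(many > two for many, two in trials)
avg_many = sum(m for m, _ in trials) / len(trials)
avg_two = sum(t for _, t in trials) / len(trials)
```

A real tree would then pick the noisy ID-like feature at that node and credit it with the gain, which is how the bias leaks into feature_importances_.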

Pro	Con
Free: computed as a side-effect of training.	Biased toward high-cardinality features. A feature with 1000 unique values offers 999 candidate thresholds; a binary feature offers 1. More candidates mean more chances to find a high-gain split by luck.
Captures the model's use of the feature in training.	Measures training-set impurity reduction (or, in split-count variants, just the number of splits), not the magnitude of prediction change on new data. A feature can be split often yet contribute small moves.
No extra compute.	Inconsistent: changing a model so it relies more on feature A can lower A's gain-based importance. TreeSHAP was partly motivated by fixing this (Lundberg et al. 2018).

Partial Dependence Plots (PDP) and ICE

PDP answers "what is the average prediction as x_j varies, marginalizing over everything else?" For a grid of values v:

PDP_j(v) = (1/N) · Σ_i f̂(x_i with x_{i,j} := v)

Replace column j with constant v, score every row, average, plot vs v. Same OOD problem as permutation — the row (age=80, income=20k) becomes (age=20, income=20k), which may not exist.

Failure mode. If x_j interacts with x_k (pushing the prediction up for small x_k and down for large x_k), averaging can cancel the two effects and the PDP shows a flat line. You'd conclude "feature doesn't matter" when in fact it matters for everyone, in opposite directions.

ICE (Individual Conditional Expectation) fixes this: one line per row instead of averaging. If lines slope the same way, the effect is monotone; if they fan out, the feature interacts. ICE = PDP without averaging.
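The flat-PDP failure mode is easy to reproduce. A minimal sketch, assuming a hypothetical toy model f(x_j, x_k) = x_j · x_k and four made-up rows where x_k is +1 or −1; the function names are illustrative. Each ICE curve is a straight line of slope ±1, and they cancel exactly when averaged into the PDP.

```python
def toy_model(x):
    """Hypothetical model with a pure interaction: x_j helps when x_k > 0 and hurts when x_k < 0."""
    x_j, x_k = x
    return x_j * x_k


def ice_curves(f, X, j, grid):
    """One curve per row: overwrite column j with each grid value, keep the rest of that row."""
    return [[f(row[:j] + [v] + row[j + 1:]) for v in grid] for row in X]


def partial_dependence(f, X, j, grid):
    """PDP_j(v) = (1/N) * sum_i f(x_i with x_ij := v), i.e. the pointwise mean of the ICE curves."""
    curves = ice_curves(f, X, j, grid)
    return [sum(curve[g] for curve in curves) / len(curves) for g in range(len(grid))]


# Made-up rows: x_k is +1 for half of them and -1 for the other half.
X = [[0.2, 1.0], [0.7, 1.0], [0.4, -1.0], [0.9, -1.0]]
grid = [0.0, 0.5, 1.0]

ice = ice_curves(toy_model, X, 0, grid)
pdp = partial_dependence(toy_model, X, 0, grid)
```

The PDP is [0, 0, 0], yet every ICE curve has slope +1 or −1: the "matters for everyone, in opposite directions" case.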

PDP/ICE gotcha

Both marginalize by substitution and hallucinate impossible rows (age=8, retired=yes). Accumulated Local Effects (ALE) plots (Apley & Zhu 2020) instead average small local changes in the prediction using only rows whose x_j falls in each interval, then accumulate them, so the model is queried only near occupied regions. ALE is the short answer when asked "what's wrong with PDP?"
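A minimal sketch of first-order ALE, assuming numeric columns, quantile bins, and no centering step (published implementations usually center the curve); `ale_1d` and the toy model are hypothetical. Each bin only uses rows whose own x_j lies in that bin and only nudges x_j across that bin, so the model is never asked about far-off combinations.

```python
import random


def ale_1d(f, X, j, n_bins=4):
    """First-order ALE for numeric column j, uncentered (the curve starts at 0)."""
    values = sorted(row[j] for row in X)
    n = len(values)
    # Bin edges at empirical quantiles, so every bin contains real rows.
    edges = sorted({values[round(q * (n - 1) / n_bins)] for q in range(n_bins + 1)})
    ale = [0.0]
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        # Only rows whose own x_j lies in this bin (the first bin also keeps its left edge).
        in_bin = [row for row in X if (row[j] >= lo if k == 0 else row[j] > lo) and row[j] <= hi]
        # Local effect: move x_j across this bin only, keeping each row's other features.
        diffs = [f(row[:j] + [hi] + row[j + 1:]) - f(row[:j] + [lo] + row[j + 1:]) for row in in_bin]
        ale.append(ale[-1] + sum(diffs) / len(diffs))
    return edges, ale


def toy_model(x):
    """Hypothetical model with an interaction between x_0 and x_1."""
    return x[0] * x[1]


rng = random.Random(1)
X = []
for _ in range(400):
    a = rng.random()
    X.append([a, a + rng.gauss(0, 0.05)])  # x_1 tracks x_0, so (small x_0, large x_1) never occurs

edges, ale = ale_1d(toy_model, X, 0)
first_step = ale[1] - ale[0]
last_step = ale[-1] - ale[-2]
```

A PDP here would pair every x_0 with every observed x_1 and draw a straight line of slope ≈ mean(x_1) ≈ 0.5; the ALE steps instead grow with x_0 because they follow the local slope ∂f/∂x_0 = x_1 along the data.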

LIME — local surrogate models

Ribeiro, Singh & Guestrin 2016. For a specific prediction f̂(x_0): sample perturbed inputs around x_0, score them with the black-box, fit a simple model (sparse linear or shallow tree) on the perturbed pairs weighted by proximity to x_0. The simple model's coefficients are the local explanation.

Intuition: a smooth model is approximately linear in a small enough neighborhood. LIME defines that neighborhood (through its proximity kernel) and fits a line to the model's outputs inside it.
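A minimal sketch of the LIME recipe for tabular inputs, assuming Gaussian perturbations around x_0, an exponential proximity kernel, and a plain weighted least-squares surrogate rather than the sparse, feature-selected surrogate the paper describes. `lime_explain`, its defaults, and the black-box function are hypothetical. For a smooth black box the surrogate's coefficients should land near the local gradient.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def lime_explain(f, x0, n_samples=500, sigma=0.1, kernel_width=0.25, seed=0):
    """Local surrogate: weighted linear fit of f on Gaussian perturbations around x0."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0, sigma) for xi in x0]
        offsets = [zi - xi for zi, xi in zip(z, x0)]
        rows.append([1.0] + offsets)  # intercept column + offsets from x0
        targets.append(f(z))          # query the black box
        dist2 = sum(o * o for o in offsets)
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # closer samples count more
    p = len(rows[0])
    # Weighted least squares via the normal equations: (Z^T W Z) beta = Z^T W y.
    A = [[sum(w * r[a] * r[c] for w, r in zip(weights, rows)) for c in range(p)] for a in range(p)]
    rhs = [sum(w * r[a] * t for w, r, t in zip(weights, rows, targets)) for a in range(p)]
    beta = solve(A, rhs)
    return beta[1:]  # one local coefficient per feature (beta[0] is the intercept)


def black_box(x):
    """Hypothetical smooth model; its gradient at (0, 1) is (cos 0, 2 * 1) = (1, 2)."""
    return math.sin(x[0]) + x[1] ** 2


coefs = lime_explain(black_box, [0.0, 1.0])
```

Changing `seed` moves the coefficients only slightly on this smooth toy; on a rugged model, or with a wider kernel, the shift can be large enough to reorder features.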

Pros: model-agnostic; works for tabular, text (drop tokens), image (occlude superpixels). Cons: unstable — change the perturbation seed, get a different explanation. The kernel/neighborhood is itself a hyperparameter. In regulated settings the instability is disqualifying.

SHAP — the dominant method

Lundberg & Lee 2017. SHAP computes each feature's average marginal contribution to the prediction across all possible orderings of features being added to a "coalition":

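φ_j(x) = Σ_{S ⊆ F\{j}} [ |S|! · (|F|−|S|−1)! / |F|! ] · [ f_{S∪{j}}(x) − f_S(x) ]

where F is the set of all features and f_S(x) is the model's expected output when only the features in S are fixed to x's values. How the remaining features are marginalized (conditionally or interventionally) is where SHAP variants differ.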

The reason SHAP became dominant is the axioms. Shapley values are the unique attribution satisfying:

Efficiency. Σ_j φ_j(x) = f̂(x) − E[f̂(X)]. Attributions sum to the prediction's deviation from baseline.
Symmetry. Two features contributing identically to every coalition get equal attribution.
Dummy. A feature that never changes the prediction gets zero.
Additivity. For a model that is a sum of sub-models (e.g., the trees of a boosted ensemble), its SHAP values are the sum of the sub-models' SHAP values.

No other scheme satisfies all four. TreeSHAP (Lundberg et al. 2018) put SHAP in production: exact Shapley values for tree ensembles in O(TLD²) time instead of O(TL · 2^M), where T is the number of trees, L the maximum number of leaves, D the maximum depth, and M the number of features. This is why every major GBDT library ships a SHAP integration. For neural networks you fall back to DeepSHAP or KernelSHAP, both approximate.
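For a handful of features the Shapley formula can be evaluated by brute force. A minimal sketch, assuming the interventional value function (f_S(x) averages the model over background rows, with the features in S fixed to x's values) and a hypothetical toy model; the function names are illustrative. The loop enumerates every coalition, so its cost grows as 2^|F|, which is the exponential cost TreeSHAP avoids for trees. The final lines check efficiency and dummy.

```python
from itertools import combinations
from math import factorial


def coalition_value(f, x, background, S):
    """f_S(x): mean prediction with features in S fixed to x and the others taken from each
    background row (the interventional choice of value function)."""
    total = 0.0
    for b in background:
        z = [x[i] if i in S else b[i] for i in range(len(x))]
        total += f(z)
    return total / len(background)


def exact_shapley(f, x, background):
    """Brute-force Shapley values: sums weighted marginal contributions over all coalitions."""
    n = len(x)
    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                S = set(subset)
                gain = coalition_value(f, x, background, S | {j}) - coalition_value(f, x, background, S)
                total += weight * gain
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical model: additive in z[0], interaction between z[1] and z[2], ignores z[3]."""
    return 2 * z[0] + z[1] * z[2]


background = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0, 5.0]
phi = exact_shapley(toy_model, x, background)
baseline = sum(toy_model(b) for b in background) / len(background)
```

The attributions come out as 1, 2.5, 3 and 0: the additive term's credit goes entirely to feature 0, the interaction's credit is shared between features 1 and 2, the unused feature 3 gets exactly zero, and the four sum to f̂(x) − baseline = 6.5.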

Pro	Con
Principled: unique under four reasonable axioms.	Still queries the model on points that may not exist. TreeSHAP has two variants: interventional (marginalizes over a background dataset, so it can create off-manifold rows) and tree_path_dependent (uses the training-sample counts recorded at each split; needs no background data and stays closer to the data, but approximates a conditional expectation, so attributions partly reflect feature correlation rather than model structure alone). Most production tools default to tree_path_dependent.
Local and global from one machinery: mean |φ_j| across rows = global importance, consistent with per-row.	Fast for trees only; other models slow or approximate.
Sign matters: positive φ pushed the prediction up, negative pushed it down.	Easy to misinterpret as causal. SHAP attributes a prediction, not an outcome.

SHAP for a linear model

For a linear model f̂(x) = β_0 + Σ_j β_j x_j, the interventional Shapley value has the closed form φ_j(x) = β_j · (x_j − E[x_j]). The conditional variant gives the same answer only when features are independent; with correlated features the two SHAP variants can differ.
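The linear closed form is easy to verify against the efficiency axiom. A minimal sketch with hypothetical coefficients and background rows: because a linear model satisfies E[f̂(X)] = f̂(E[X]), the per-feature terms β_j · (x_j − E[x_j]) sum exactly to f̂(x) − E[f̂(X)], and a zero coefficient gives a zero attribution (dummy).

```python
from statistics import mean

# Hypothetical linear model: f(x) = B0 + sum_j BETA[j] * x[j].
B0 = 0.5
BETA = [2.0, -1.0, 0.0, 3.0]


def linear_model(x):
    return B0 + sum(b * xi for b, xi in zip(BETA, x))


def linear_shap(x, feature_means):
    """Interventional SHAP for a linear model: phi_j = beta_j * (x_j - E[x_j])."""
    return [b * (xi - m) for b, xi, m in zip(BETA, x, feature_means)]


# Hypothetical background data used to estimate E[x_j] and the baseline E[f(X)].
background = [[1.0, 0.0, 4.0, 2.0], [3.0, 2.0, 0.0, 0.0], [2.0, 1.0, 2.0, 1.0]]
means = [mean(col) for col in zip(*background)]
row = [4.0, 1.0, 7.0, 0.5]

phi = linear_shap(row, means)
baseline = mean(linear_model(b) for b in background)
```

The attributions for this row are [4.0, 0.0, 0.0, −1.5]; feature 1 sits exactly at its mean and feature 2 has β = 0, so both get zero for different reasons.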

The disagreement problem

Run permutation, TreeSHAP, and built-in split-gain on the same GBDT and the feature rankings will usually not match. Not a bug: each method answers a slightly different question.

Method	Question it actually answers
Permutation importance	"How much worse does my eval loss get if I destroy this feature's signal?"
Built-in split-gain	"How much impurity did this feature reduce during training, summed across splits?"
Mean |SHAP|	"How much does this feature move the prediction from baseline, on average per row?"
PDP range	"How much does the average prediction swing as this feature varies?"
LIME coefficient (averaged)	"How much does a local linear surrogate weight this feature, averaged over rows?"

Senior answer to "which one should I report?": pick the method whose question matches what the stakeholder is asking. How much loss we'd take if a feature pipeline broke → permutation. Explaining a single decision → SHAP. Debugging tree structure → split-gain. "What should I change?" → none of them.

Interpretability ≠ causal inference

SHAP says "feature X contributed +0.3 to this prediction." It does not say "if you changed X in the world, the outcome would change by 0.3." Two reasons:

The model is correlational. If x_j is a proxy for an unobserved confounder, SHAP credits the proxy. Intervening on x_j doesn't move the confounder.
SHAP attributes the trained model, not the world. SHAP for "shoe size" predicting reading ability in children will be positive — older kids read better and have bigger feet. Buying bigger shoes does not improve reading.
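The shoe-size example can be simulated. A minimal sketch under a made-up data-generating process: age drives both shoe size and reading, and reading's structural equation ignores shoe size. A model that sees only shoe size learns a large positive slope (so any attribution method applied to it would credit shoe size), yet intervening on shoe size in the simulated world leaves reading unchanged. All names and numbers are hypothetical.

```python
import random
from statistics import mean


def reading_score(age, shoe_size, noise):
    """Hypothetical structural equation: reading depends on age only.
    shoe_size is accepted so an intervention on it can be expressed, but it is never used."""
    return 10.0 * age + noise


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on a single feature xs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


rng = random.Random(7)
children = []
for _ in range(1000):
    age = rng.uniform(5, 12)
    shoe = 0.8 * age + rng.gauss(0, 0.5)  # age causes shoe size
    children.append((age, shoe, rng.gauss(0, 5)))

shoes = [shoe for _, shoe, _ in children]
reading = [reading_score(age, shoe, e) for age, shoe, e in children]

# A model that sees only shoe size learns a large positive slope, so it credits shoe size.
slope = ols_slope(shoes, reading)
predicted_gain = 2 * slope

# Intervene in the simulated world: every child gets shoes two sizes bigger.
reading_after = [reading_score(age, shoe + 2, e) for age, shoe, e in children]
actual_gain = mean(reading_after) - mean(reading)
```

The model predicts a gain of roughly 20 points from bigger shoes; the simulated world delivers exactly zero.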

Causal questions need experimental data or causal-inference machinery (DAGs, do-calculus, IV, matching). The senior signal is calling this out unprompted whenever someone asks "what should we change to flip the decision?"

The counterfactual trap

"SHAP says feature X has a +0.3 effect on the denial. Let's tell the applicant to fix X." This may be correct, useless, or actively misleading — depending on whether X is causal or a proxy. Surface SHAP as a hypothesis, then validate causally. Most regulators only require the model's reasons, not actionable advice; conflating the two is how you end up advising someone to change something they cannot.

When does the stakeholder actually need interpretability?

Driver	Flavor	Method
Regulation (GDPR Art. 22, ECOA, FCRA adverse-action notices)	Local	SHAP per row, or coefficients if linear. Explanations must be stable and reproducible above all.
Debugging ("the model is doing something weird")	Global	Permutation + mean |SHAP| + PDP/ALE on suspected features. Look for features that shouldn't matter and do.
Trust / adoption	Global	Ship the simplest model that meets accuracy. EBM/GAM is often right. A small AUC tax buys a year of deployment velocity.
Fairness audit	Both	SHAP segmented by demographic + group metrics (TPR parity, calibration parity). Detection, not remediation.
Causal action	Neither	Push back. A/B test the intervention or commission a causal study.

The trade-off table you should have memorized

	Linear coef	Shallow tree	TreeSHAP / GBDT	LIME	Permutation
Faithfulness	Exact	Exact	Exact for trees	Local approximation	Real metric, corrupts joint distribution
Model-agnostic	No	No	Trees only	Yes	Yes
Local / global	Both	Both	Both	Local only	Global only
Scalability	Trivial	Trivial	Polynomial in tree size	Slow (sample + fit per row)	O(features × eval)
Stability	High	High	High (deterministic)	Low (random perturbations)	Medium (MC over shuffles)
Handles correlation	OK if regularized	OK	Splits credit between correlated	Poor	Poor — features mask each other
Causal?	No	No	No	No	No

Interview prompts you should be ready for

"Walk me through SHAP. Why Shapley values?" (The four axioms. Efficiency: attributions sum to prediction − baseline. Symmetry, dummy, additivity. Shapley is the unique attribution that satisfies all four. TreeSHAP made it tractable for the models people actually ship.)
"Permutation importance vs SHAP: when do they disagree?" (Permutation answers a loss-based question, SHAP a prediction-based one. They disagree when a feature moves predictions a lot but those moves don't help held-out loss, e.g., a feature the model overfit to: high mean |SHAP|, near-zero permutation importance. They also disagree on correlated features: permutation can understate both because each partly masks the other, while SHAP splits the credit according to how the model actually uses each.)
"Your stakeholders want to know 'what to change to flip the decision.' What's your concern with answering from SHAP?" (SHAP is correlational. A feature with high SHAP may be a proxy for an unobserved confounder, and intervening on it does nothing. For counterfactual advice you need a verified causal structure or an experiment.)
"Why doesn't a tree's built-in feature importance match permutation importance?" (Three reasons: built-in is biased toward high-cardinality features; it reflects training-time impurity reduction or raw split counts, not how much predictions move; and permutation is computed on held-out data while built-in is computed on training data. They're answering different questions.)
"You have a GBDT at AUC 0.83 and a logistic regression at AUC 0.81. Pick one for a regulated lending product." (LR. The 2-point gap rarely outweighs the cost of building adverse-action notices, monitoring, and regulator review for a black box. The exception is when those 2 points represent millions in NPV — in which case ship the GBDT with TreeSHAP and budget for the explanation infrastructure.)
"Your SHAP for feature X is +0.3 on this row. What does that actually mean?" ("Holding the data distribution fixed, this feature's value contributed +0.3 to this prediction relative to the baseline E[ŷ], in the model's output units (often log-odds for a GBDT classifier). It does not mean changing X by one unit changes the prediction by 0.3, and it does not mean changing X in the real world changes the outcome.")
"How do you handle the disagreement between three different feature-importance methods?" (Don't pick the one with the prettiest answer. Identify which question the stakeholder is asking — loss impact, prediction impact, or training behavior — and use the method that answers it. Report the others as sensitivity checks, not as ground truth.)

Takeaway — and end of folder

Interpretability is not a property of a model; it's a property of a question. The senior move, in order: ask which question the stakeholder is solving; pick an intrinsically interpretable model if the accuracy budget allows; otherwise pick the post-hoc method whose question matches; and never claim a causal answer from a correlational tool. The whole folder in one paragraph: bias-variance gave you the trade-off frame, the model lessons the levers, evaluation taught you to measure what you optimize, and this lesson tells you how to talk about what you shipped.