Support vector machines & the kernel trick
The 1990s classifier that taught the field two ideas worth keeping forever: maximize the margin, and never compute the feature map you're separating in.
The max-margin principle
Infinitely many hyperplanes separate two linearly separable classes. Pick the one maximizing the perpendicular distance to the nearest training point — the margin. PAC-style bounds (Vapnik 1995 (margin bound); Bartlett 1998) show generalization error scales with 1/margin², not with input dimension. This is why SVMs work in high dimensions where logistic regression would need careful regularization.
Hard-margin primal
The geometric margin of a hyperplane wᵀx + b = 0 is 2/‖w‖. Maximizing it = minimizing ‖w‖:
A convex QP: one global optimum. Every point lies on the correct side of its own margin band. If even one point violates this, the problem is infeasible — hence soft margin.
Soft margin — the practical version
Introduce slack ξ_i ≥ 0 measuring how much each point violates its margin, and pay for it:
C is the soft-margin regularization knob — the most important SVM hyperparameter after the kernel.
| C | Effect on margin | Effect on fit | Bias / variance |
|---|---|---|---|
| Large (e.g. 10³) | Narrow | Forces near-zero training error; tolerates few violations. | Bias ↓, variance ↑ |
| Small (e.g. 10⁻²) | Wide | Accepts many slack violations; smoother decision boundary. | Bias ↑, variance ↓ |
Hinge-loss reformulation — what SVM really is
At the soft-margin optimum, the slack must be ξ_i = max(0, 1 − y_i(wᵀx_i + b)) — the smallest value that satisfies both ξ_i ≥ 0 and ξ_i ≥ 1 − y_i(wᵀx_i + b). Substituting this back gives:
This is just a linear classifier with hinge loss + L2 regularization. The "max-margin" framing is one view; the "specific loss" framing is the other. It tells you exactly how SVM differs from logistic regression: same model class, same regularizer, different loss.
| Loss | Formula | Behaviour |
|---|---|---|
| Hinge (SVM) | max(0, 1 − y·f(x)) | Zero loss once margin ≥ 1. Only margin violators contribute gradient → sparse support vectors. |
| Logistic | log(1 + exp(−y·f(x))) | Loss is positive everywhere → every point contributes. Output is a calibrated probability. |
| 0/1 (ideal) | 1[y·f(x) < 0] | Non-convex, non-differentiable. Hinge is its tightest convex surrogate. |
The dual — and why it matters
Lagrangian + KKT eliminates w, b, ξ:
Recover w = Σ α_i y_i x_i, and the decision function is f(x) = Σ α_i y_i x_iᵀx + b. Two facts make the dual the workhorse:
- Sparsity. KKT forces α_i = 0 for every point strictly outside its margin. Only points on or inside the margin get α_i > 0 — the support vectors. Throw away everything else; the model is unchanged.
- Inner products only. Both objective and decision function depend on training data only through x_iᵀx_j. Replace that with a kernel and you've moved to a different feature space without rewriting any code.
The kernel trick
A kernel K(x, x') = φ(x)ᵀφ(x') is an inner product in some implicit feature space defined by φ. If you only ever need inner products, you never materialize φ(x). Mercer's theorem: any symmetric PSD K corresponds to a valid φ. Pick K, not φ, and SVM solves a linear problem in a feature space of arbitrary (possibly infinite) dimension while only ever touching n × n kernel values.
| Kernel | Formula | What it does | Hyperparameters |
|---|---|---|---|
| Linear | xᵀx' | Standard linear SVM. Use when d ≫ n (text, sparse high-dim). | None |
| Polynomial | (xᵀx' + c)^d | Captures feature interactions up to degree d. Numerically twitchy at d ≥ 4. | d, c |
| RBF (Gaussian) | exp(−γ‖x − x'‖²) | Universal approximator. Each support vector contributes a Gaussian bump; γ sets the bump width. | γ (and C) |
| Sigmoid | tanh(α xᵀx' + c) | Only conditionally PSD. Rarely beats RBF; historical interest only. | α, c |
γ for RBF — the bandwidth knob
| γ | Bump width | Boundary | Bias / variance |
|---|---|---|---|
| Small | Wide | Smooth, almost linear; each SV influences far-away points. | Bias ↑, variance ↓ |
| Large | Narrow | Highly local; boundary can memorize training set. | Bias ↓, variance ↑ |
C and γ interact. Tune jointly on a log-grid with CV — e.g. C ∈ {10⁻², …, 10³}, γ ∈ {10⁻³, …, 10¹}. No closed form.
Interactive · kernel SVM playground
Pick a 2D dataset, a kernel, and tune C and (for RBF) γ. The widget runs a small SMO-lite solver, draws the decision boundary, and highlights support vectors.
Computational cost — why SVMs don't scale
| Phase | Cost | Practical limit |
|---|---|---|
| Kernel-SVM train (LIBSVM/SMO) | O(n²)–O(n³) | ≈ 10⁵ rows before training becomes hours. |
| Linear-SVM train (LIBLINEAR/SGD) | O(n·d) | Millions fine. SGDClassifier(loss="hinge"). |
| Prediction | O(n_sv·d) | Slow if n_sv is a large fraction of n. |
The O(n²) Gram matrix is the wall. Nyström or random Fourier features approximate the kernel with finite-dim features (linear model on top), but at that point a different model would have been simpler.
When SVMs still win — and when they don't
| Setting | Pick | Why |
|---|---|---|
| Small/medium tabular (n < 10⁵), non-linear structure | RBF-SVM | Often beats untuned tree ensembles; only C, γ to tune. |
| Text with bag-of-words / TF-IDF | Linear SVM | SOTA in the 2000s, still a strong baseline. High-dim sparse — linear is enough. |
| Bio/chem/graphs with structured kernels | Kernel SVM | Kernel encodes domain knowledge no off-the-shelf model uses. |
| Tabular at scale (n ≥ 10⁶) | GBDT | Scales linearly, native categoricals, missing values. |
| Images, audio, sequences | Deep nets | Learned features dominate fixed kernels by orders of magnitude. |
| Need calibrated probabilities | Logistic / GBDT | SVM outputs are signed distances. |
The decline isn't because SVMs got worse — it's that GBDTs ate medium-tabular and deep nets ate structured-input. SVMs occupy a shrinking middle.
Calibration caveat
An SVM's output f(x) = wᵀφ(x) + b is a signed distance, not P(y=1 | x). Squashing it through a sigmoid yields nonsense unless you fit the sigmoid's parameters. Standard fix: Platt scaling — fit P(y=1 | x) = 1 / (1 + exp(A·f(x) + B)) on held-out data. Isotonic regression for larger calibration sets. sklearn.svm.SVC(probability=True) does Platt internally.
Trade-off table — pick your classifier in 2026
| Linear SVM | RBF SVM | Logistic | GBDT | NN | |
|---|---|---|---|---|---|
| Training time | Fast | Slow O(n²–n³) | Fast | Medium | Slow |
| Scales n ≥ 10⁶ | Yes (SGD) | No | Yes | Yes | Yes |
| Non-linearity | No | Yes | Via features | Yes | Yes |
| Probabilities | Platt | Platt | Native | Often recalibrated | Overconfident |
| Interpretability | Coeffs | Low | Coeffs | SHAP, importances | Low |
| HP sensitivity | C | C, γ | λ | Several | Many |
Interview prompts you should be ready for
- "Why does the SVM dual only depend on inner products?" (Sub w = Σ α_i y_i x_i into ½‖w‖² and into the constraints; both reduce to x_iᵀx_j. Primal has explicit w; the dual eliminates it.)
- "Walk me through the kernel trick. What is RBF doing?" (Kernel = inner product in some feature space. RBF's feature map is infinite-dim (Taylor expansion of the Gaussian). Decision function = weighted sum of Gaussian bumps at the support vectors.)
- "Effect of C? Of γ?" (Large C → narrow margin, variance up. Large γ → local bumps, variance up. Tune jointly on a log grid.)
- "Your RBF-SVM has 95% support vectors. Diagnose." (C too large (no slack tolerated) or γ too large (everything looks dissimilar, so all points sit near the margin), usually both. Big train-test gap. Lower C and/or γ.)
- "Linear SVM vs logistic regression — what's different?" (Same model class, same L2, different loss. Hinge → sparse SVs; logistic → calibrated probabilities. Performance usually within a percent.)
- "When would you pick SVM over GBDT in 2026?" (Small/medium data with non-linear structure where you don't want to tune trees; TF-IDF text with linear kernel; domains with a known structured kernel. Else GBDT.)
- "How do you get probabilities out of an SVM?" (Platt scaling on held-out data, or isotonic. Don't sigmoid the raw decision values.)