Support vector machines & the kernel trick

The 1990s classifier that taught the field two ideas worth keeping forever: maximize the margin, and never compute the feature map you're separating in.

The max-margin principle

Infinitely many hyperplanes separate two linearly separable classes. Pick the one maximizing the perpendicular distance to the nearest training point — the margin. PAC-style bounds (Vapnik 1995 (margin bound); Bartlett 1998) show generalization error scales with 1/margin², not with input dimension. This is why SVMs work in high dimensions where logistic regression would need careful regularization.

x₂ ▲ + + │ + ← class +1 │ + ┃ margin │ ┃ ╱─ support vector │ ─ ─ ─ ─ ┃ ─ ─ ─ ─ ─ wᵀx + b = 0 │ ┃ │ o ┃ o ← class −1 │ o o └──────────────────────► x₁

Hard-margin primal

The geometric margin of a hyperplane wᵀx + b = 0 is 2/‖w‖. Maximizing it = minimizing ‖w‖:

min_w,b ½‖w‖² s.t. y_i(wᵀx_i + b) ≥ 1 ∀ i

A convex QP: one global optimum. Every point lies on the correct side of its own margin band. If even one point violates this, the problem is infeasible — hence soft margin.

Soft margin — the practical version

Introduce slack ξ_i ≥ 0 measuring how much each point violates its margin, and pay for it:

min_w,b,ξ ½‖w‖² + C Σ ξ_i s.t. y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0

C is the soft-margin regularization knob — the most important SVM hyperparameter after the kernel.

C	Effect on margin	Effect on fit	Bias / variance
Large (e.g. 10³)	Narrow	Forces near-zero training error; tolerates few violations.	Bias ↓, variance ↑
Small (e.g. 10⁻²)	Wide	Accepts many slack violations; smoother decision boundary.	Bias ↑, variance ↓

Hinge-loss reformulation — what SVM really is

At the soft-margin optimum, the slack must be ξ_i = max(0, 1 − y_i(wᵀx_i + b)) — the smallest value that satisfies both ξ_i ≥ 0 and ξ_i ≥ 1 − y_i(wᵀx_i + b). Substituting this back gives:

min_w,b Σ max(0, 1 − y_i(wᵀx_i + b)) + (1/2C) ‖w‖²

This is just a linear classifier with hinge loss + L2 regularization. The "max-margin" framing is one view; the "specific loss" framing is the other. It tells you exactly how SVM differs from logistic regression: same model class, same regularizer, different loss.

Loss	Formula	Behaviour
Hinge (SVM)	max(0, 1 − y·f(x))	Zero loss once margin ≥ 1. Only margin violators contribute gradient → sparse support vectors.
Logistic	log(1 + exp(−y·f(x)))	Loss is positive everywhere → every point contributes. Output is a calibrated probability.
0/1 (ideal)	1[y·f(x) < 0]	Non-convex, non-differentiable. Hinge is its tightest convex surrogate.

The dual — and why it matters

Lagrangian + KKT eliminates w, b, ξ:

max_α Σ α_i − ½ Σ_i,j α_i α_j y_i y_j x_iᵀx_j s.t. 0 ≤ α_i ≤ C, Σ α_i y_i = 0

Recover w = Σ α_i y_i x_i, and the decision function is f(x) = Σ α_i y_i x_iᵀx + b. Two facts make the dual the workhorse:

Sparsity. KKT forces α_i = 0 for every point strictly outside its margin. Only points on or inside the margin get α_i > 0 — the support vectors. Throw away everything else; the model is unchanged.
Inner products only. Both objective and decision function depend on training data only through x_iᵀx_j. Replace that with a kernel and you've moved to a different feature space without rewriting any code.

The kernel trick

A kernel K(x, x') = φ(x)ᵀφ(x') is an inner product in some implicit feature space defined by φ. If you only ever need inner products, you never materialize φ(x). Mercer's theorem: any symmetric PSD K corresponds to a valid φ. Pick K, not φ, and SVM solves a linear problem in a feature space of arbitrary (possibly infinite) dimension while only ever touching n × n kernel values.

Kernel	Formula	What it does	Hyperparameters
Linear	xᵀx'	Standard linear SVM. Use when d ≫ n (text, sparse high-dim).	None
Polynomial	(xᵀx' + c)^d	Captures feature interactions up to degree d. Numerically twitchy at d ≥ 4.	d, c
RBF (Gaussian)	exp(−γ‖x − x'‖²)	Universal approximator. Each support vector contributes a Gaussian bump; γ sets the bump width.	γ (and C)
Sigmoid	tanh(α xᵀx' + c)	Only conditionally PSD. Rarely beats RBF; historical interest only.	α, c

RBF as weighted k-NN

An RBF-SVM's decision function is f(x) = Σ α_i y_i exp(−γ‖x − x_i‖²) + b — a weighted similarity vote among the support vectors. Each support vector is a "prototype"; the kernel measures how similar the test point is to it; α_i y_i is the signed weight. SVM = k-NN with learned weights and learned prototypes. Saying this in an interview is a good signal.

γ for RBF — the bandwidth knob

γ	Bump width	Boundary	Bias / variance
Small	Wide	Smooth, almost linear; each SV influences far-away points.	Bias ↑, variance ↓
Large	Narrow	Highly local; boundary can memorize training set.	Bias ↓, variance ↑

C and γ interact. Tune jointly on a log-grid with CV — e.g. C ∈ {10⁻², …, 10³}, γ ∈ {10⁻³, …, 10¹}. No closed form.

Interactive · kernel SVM playground

Pick a 2D dataset, a kernel, and tune C and (for RBF) γ. The widget runs a small SMO-lite solver, draws the decision boundary, and highlights support vectors.

Decision boundary, kernel, support vectors

Try: XOR with linear kernel (impossible). Circles with RBF, γ=2 (clean ring). Moons with RBF, γ=20, C=1000 (overfit: boundary curls around individual points).

dataset: kernel: log₁₀(C): 1.0 γ (RBF) / d (poly): 2.0

support vectors

—

training accuracy

—

margin width (rel)

—

‖w‖² (feat space)

—

Reading

—

Computational cost — why SVMs don't scale

Phase	Cost	Practical limit
Kernel-SVM train (LIBSVM/SMO)	O(n²)–O(n³)	≈ 10⁵ rows before training becomes hours.
Linear-SVM train (LIBLINEAR/SGD)	O(n·d)	Millions fine. `SGDClassifier(loss="hinge")`.
Prediction	O(n_sv·d)	Slow if n_sv is a large fraction of n.

The O(n²) Gram matrix is the wall. Nyström or random Fourier features approximate the kernel with finite-dim features (linear model on top), but at that point a different model would have been simpler.

When SVMs still win — and when they don't

Setting	Pick	Why
Small/medium tabular (n < 10⁵), non-linear structure	RBF-SVM	Often beats untuned tree ensembles; only C, γ to tune.
Text with bag-of-words / TF-IDF	Linear SVM	SOTA in the 2000s, still a strong baseline. High-dim sparse — linear is enough.
Bio/chem/graphs with structured kernels	Kernel SVM	Kernel encodes domain knowledge no off-the-shelf model uses.
Tabular at scale (n ≥ 10⁶)	GBDT	Scales linearly, native categoricals, missing values.
Images, audio, sequences	Deep nets	Learned features dominate fixed kernels by orders of magnitude.
Need calibrated probabilities	Logistic / GBDT	SVM outputs are signed distances.

The decline isn't because SVMs got worse — it's that GBDTs ate medium-tabular and deep nets ate structured-input. SVMs occupy a shrinking middle.

Calibration caveat

An SVM's output f(x) = wᵀφ(x) + b is a signed distance, not P(y=1 | x). Squashing it through a sigmoid yields nonsense unless you fit the sigmoid's parameters. Standard fix: Platt scaling — fit P(y=1 | x) = 1 / (1 + exp(A·f(x) + B)) on held-out data. Isotonic regression for larger calibration sets. sklearn.svm.SVC(probability=True) does Platt internally.

Trade-off table — pick your classifier in 2026

	Linear SVM	RBF SVM	Logistic	GBDT	NN
Training time	Fast	Slow O(n²–n³)	Fast	Medium	Slow
Scales n ≥ 10⁶	Yes (SGD)	No	Yes	Yes	Yes
Non-linearity	No	Yes	Via features	Yes	Yes
Probabilities	Platt	Platt	Native	Often recalibrated	Overconfident
Interpretability	Coeffs	Low	Coeffs	SHAP, importances	Low
HP sensitivity	C	C, γ	λ	Several	Many

Two interview traps

(1) "SVMs work in infinite dimensions because they're so flexible." Wrong framing — they generalize because of the margin, not despite the dimensionality. PAC bounds scale with margin, not d. (2) "SVMs don't overfit." They absolutely do — large C with large γ produces a model with 90%+ of points as support vectors and atrocious test error. The hyperparameters are the regularizers.

Interview prompts you should be ready for

"Why does the SVM dual only depend on inner products?" (Sub w = Σ α_i y_i x_i into ½‖w‖² and into the constraints; both reduce to x_iᵀx_j. Primal has explicit w; the dual eliminates it.)
"Walk me through the kernel trick. What is RBF doing?" (Kernel = inner product in some feature space. RBF's feature map is infinite-dim (Taylor expansion of the Gaussian). Decision function = weighted sum of Gaussian bumps at the support vectors.)
"Effect of C? Of γ?" (Large C → narrow margin, variance up. Large γ → local bumps, variance up. Tune jointly on a log grid.)
"Your RBF-SVM has 95% support vectors. Diagnose." (C too large (no slack tolerated) or γ too large (everything looks dissimilar, so all points sit near the margin), usually both. Big train-test gap. Lower C and/or γ.)
"Linear SVM vs logistic regression — what's different?" (Same model class, same L2, different loss. Hinge → sparse SVs; logistic → calibrated probabilities. Performance usually within a percent.)
"When would you pick SVM over GBDT in 2026?" (Small/medium data with non-linear structure where you don't want to tune trees; TF-IDF text with linear kernel; domains with a known structured kernel. Else GBDT.)
"How do you get probabilities out of an SVM?" (Platt scaling on held-out data, or isotonic. Don't sigmoid the raw decision values.)

Takeaway

Two ideas to keep: maximize the margin (it controls generalization), and the kernel trick (work in a richer space without ever computing it). SVMs themselves are a shrinking middle — GBDTs took the tabular high ground, deep nets took the structured-input high ground. But the framing they introduced (hinge loss + L2, dual representations, kernels) shows up across modern ML, so the lesson is worth more than the model.