traditional_ml / 07 · SVMs & kernels lesson 7 / 12

Support vector machines & the kernel trick

The 1990s classifier that taught the field two ideas worth keeping forever: maximize the margin, and never compute the feature map you're separating in.

The max-margin principle

Infinitely many hyperplanes separate two linearly separable classes. Pick the one maximizing the perpendicular distance to the nearest training point — the margin. PAC-style bounds (Vapnik 1995 (margin bound); Bartlett 1998) show generalization error scales with 1/margin², not with input dimension. This is why SVMs work in high dimensions where logistic regression would need careful regularization.

x₂ ▲ + + │ + ← class +1 │ + ┃ margin │ ┃ ╱─ support vector │ ─ ─ ─ ─ ┃ ─ ─ ─ ─ ─ wᵀx + b = 0 │ ┃ │ o ┃ o ← class −1 │ o o └──────────────────────► x₁

Hard-margin primal

The geometric margin of a hyperplane wᵀx + b = 0 is 2/‖w‖. Maximizing it = minimizing ‖w‖:

minw,b   ½‖w‖²   s.t.   y_i(wᵀx_i + b) ≥ 1   ∀ i

A convex QP: one global optimum. Every point lies on the correct side of its own margin band. If even one point violates this, the problem is infeasible — hence soft margin.

Soft margin — the practical version

Introduce slack ξ_i ≥ 0 measuring how much each point violates its margin, and pay for it:

minw,b,ξ   ½‖w‖² + C Σ ξ_i   s.t.   y_i(wᵀx_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

C is the soft-margin regularization knob — the most important SVM hyperparameter after the kernel.

CEffect on marginEffect on fitBias / variance
Large (e.g. 10³)NarrowForces near-zero training error; tolerates few violations.Bias ↓, variance ↑
Small (e.g. 10⁻²)WideAccepts many slack violations; smoother decision boundary.Bias ↑, variance ↓

Hinge-loss reformulation — what SVM really is

At the soft-margin optimum, the slack must be ξ_i = max(0, 1 − y_i(wᵀx_i + b)) — the smallest value that satisfies both ξ_i ≥ 0 and ξ_i ≥ 1 − y_i(wᵀx_i + b). Substituting this back gives:

minw,b   Σ max(0, 1 − y_i(wᵀx_i + b))   +   (1/2C) ‖w‖²

This is just a linear classifier with hinge loss + L2 regularization. The "max-margin" framing is one view; the "specific loss" framing is the other. It tells you exactly how SVM differs from logistic regression: same model class, same regularizer, different loss.

LossFormulaBehaviour
Hinge (SVM)max(0, 1 − y·f(x))Zero loss once margin ≥ 1. Only margin violators contribute gradient → sparse support vectors.
Logisticlog(1 + exp(−y·f(x)))Loss is positive everywhere → every point contributes. Output is a calibrated probability.
0/1 (ideal)1[y·f(x) < 0]Non-convex, non-differentiable. Hinge is its tightest convex surrogate.

The dual — and why it matters

Lagrangian + KKT eliminates w, b, ξ:

maxα   Σ α_i − ½ Σi,j α_i α_j y_i y_j x_iᵀx_j   s.t.   0 ≤ α_i ≤ C,   Σ α_i y_i = 0

Recover w = Σ α_i y_i x_i, and the decision function is f(x) = Σ α_i y_i x_iᵀx + b. Two facts make the dual the workhorse:

The kernel trick

A kernel K(x, x') = φ(x)ᵀφ(x') is an inner product in some implicit feature space defined by φ. If you only ever need inner products, you never materialize φ(x). Mercer's theorem: any symmetric PSD K corresponds to a valid φ. Pick K, not φ, and SVM solves a linear problem in a feature space of arbitrary (possibly infinite) dimension while only ever touching n × n kernel values.

KernelFormulaWhat it doesHyperparameters
Linearxᵀx'Standard linear SVM. Use when d ≫ n (text, sparse high-dim).None
Polynomial(xᵀx' + c)^dCaptures feature interactions up to degree d. Numerically twitchy at d ≥ 4.d, c
RBF (Gaussian)exp(−γ‖x − x'‖²)Universal approximator. Each support vector contributes a Gaussian bump; γ sets the bump width.γ (and C)
Sigmoidtanh(α xᵀx' + c)Only conditionally PSD. Rarely beats RBF; historical interest only.α, c
RBF as weighted k-NN
An RBF-SVM's decision function is f(x) = Σ α_i y_i exp(−γ‖x − x_i‖²) + b — a weighted similarity vote among the support vectors. Each support vector is a "prototype"; the kernel measures how similar the test point is to it; α_i y_i is the signed weight. SVM = k-NN with learned weights and learned prototypes. Saying this in an interview is a good signal.

γ for RBF — the bandwidth knob

γBump widthBoundaryBias / variance
SmallWideSmooth, almost linear; each SV influences far-away points.Bias ↑, variance ↓
LargeNarrowHighly local; boundary can memorize training set.Bias ↓, variance ↑

C and γ interact. Tune jointly on a log-grid with CV — e.g. C ∈ {10⁻², …, 10³}, γ ∈ {10⁻³, …, 10¹}. No closed form.

Interactive · kernel SVM playground

Pick a 2D dataset, a kernel, and tune C and (for RBF) γ. The widget runs a small SMO-lite solver, draws the decision boundary, and highlights support vectors.

Decision boundary, kernel, support vectors
Try: XOR with linear kernel (impossible). Circles with RBF, γ=2 (clean ring). Moons with RBF, γ=20, C=1000 (overfit: boundary curls around individual points).
support vectors
training accuracy
margin width (rel)
‖w‖² (feat space)
Reading

Computational cost — why SVMs don't scale

PhaseCostPractical limit
Kernel-SVM train (LIBSVM/SMO)O(n²)–O(n³)≈ 10⁵ rows before training becomes hours.
Linear-SVM train (LIBLINEAR/SGD)O(n·d)Millions fine. SGDClassifier(loss="hinge").
PredictionO(n_sv·d)Slow if n_sv is a large fraction of n.

The O(n²) Gram matrix is the wall. Nyström or random Fourier features approximate the kernel with finite-dim features (linear model on top), but at that point a different model would have been simpler.

When SVMs still win — and when they don't

SettingPickWhy
Small/medium tabular (n < 10⁵), non-linear structureRBF-SVMOften beats untuned tree ensembles; only C, γ to tune.
Text with bag-of-words / TF-IDFLinear SVMSOTA in the 2000s, still a strong baseline. High-dim sparse — linear is enough.
Bio/chem/graphs with structured kernelsKernel SVMKernel encodes domain knowledge no off-the-shelf model uses.
Tabular at scale (n ≥ 10⁶)GBDTScales linearly, native categoricals, missing values.
Images, audio, sequencesDeep netsLearned features dominate fixed kernels by orders of magnitude.
Need calibrated probabilitiesLogistic / GBDTSVM outputs are signed distances.

The decline isn't because SVMs got worse — it's that GBDTs ate medium-tabular and deep nets ate structured-input. SVMs occupy a shrinking middle.

Calibration caveat

An SVM's output f(x) = wᵀφ(x) + b is a signed distance, not P(y=1 | x). Squashing it through a sigmoid yields nonsense unless you fit the sigmoid's parameters. Standard fix: Platt scaling — fit P(y=1 | x) = 1 / (1 + exp(A·f(x) + B)) on held-out data. Isotonic regression for larger calibration sets. sklearn.svm.SVC(probability=True) does Platt internally.

Trade-off table — pick your classifier in 2026

Linear SVMRBF SVMLogisticGBDTNN
Training timeFastSlow O(n²–n³)FastMediumSlow
Scales n ≥ 10⁶Yes (SGD)NoYesYesYes
Non-linearityNoYesVia featuresYesYes
ProbabilitiesPlattPlattNativeOften recalibratedOverconfident
InterpretabilityCoeffsLowCoeffsSHAP, importancesLow
HP sensitivityCC, γλSeveralMany
Two interview traps
(1) "SVMs work in infinite dimensions because they're so flexible." Wrong framing — they generalize because of the margin, not despite the dimensionality. PAC bounds scale with margin, not d. (2) "SVMs don't overfit." They absolutely do — large C with large γ produces a model with 90%+ of points as support vectors and atrocious test error. The hyperparameters are the regularizers.

Interview prompts you should be ready for

  1. "Why does the SVM dual only depend on inner products?" (Sub w = Σ α_i y_i x_i into ½‖w‖² and into the constraints; both reduce to x_iᵀx_j. Primal has explicit w; the dual eliminates it.)
  2. "Walk me through the kernel trick. What is RBF doing?" (Kernel = inner product in some feature space. RBF's feature map is infinite-dim (Taylor expansion of the Gaussian). Decision function = weighted sum of Gaussian bumps at the support vectors.)
  3. "Effect of C? Of γ?" (Large C → narrow margin, variance up. Large γ → local bumps, variance up. Tune jointly on a log grid.)
  4. "Your RBF-SVM has 95% support vectors. Diagnose." (C too large (no slack tolerated) or γ too large (everything looks dissimilar, so all points sit near the margin), usually both. Big train-test gap. Lower C and/or γ.)
  5. "Linear SVM vs logistic regression — what's different?" (Same model class, same L2, different loss. Hinge → sparse SVs; logistic → calibrated probabilities. Performance usually within a percent.)
  6. "When would you pick SVM over GBDT in 2026?" (Small/medium data with non-linear structure where you don't want to tune trees; TF-IDF text with linear kernel; domains with a known structured kernel. Else GBDT.)
  7. "How do you get probabilities out of an SVM?" (Platt scaling on held-out data, or isotonic. Don't sigmoid the raw decision values.)
Takeaway
Two ideas to keep: maximize the margin (it controls generalization), and the kernel trick (work in a richer space without ever computing it). SVMs themselves are a shrinking middle — GBDTs took the tabular high ground, deep nets took the structured-input high ground. But the framing they introduced (hinge loss + L2, dual representations, kernels) shows up across modern ML, so the lesson is worth more than the model.