traditional_ml / 09 · clustering & dimensionality reduction lesson 9 / 12

Clustering & dimensionality reduction

Unsupervised learning is two questions: "are there groups in my data" and "can I represent it in fewer dimensions". Both are usually preprocessing or exploration — but the failure modes are subtle and beloved by interviewers.

Part A — Clustering

K-means: Lloyd's algorithm from first principles

Given n points x_1, …, x_n ∈ ℝ^d and a target number of clusters K, k-means minimises the within-cluster sum of squared distances:

J = Σ_i ‖x_i − μ_{c_i}‖²

where c_i ∈ {1, …, K} is the cluster assignment of point i and μ_k is the centroid of cluster k. The objective is non-convex in (c, μ) jointly. Lloyd's algorithm (1957/1982) is alternating minimisation:

  1. Assign. Fix μ. For each i, set c_i = arg min_k ‖x_i − μ_k‖². (Each x_i goes to its nearest centroid; this minimises J in c with μ fixed.)
  2. Update. Fix c. For each k, set μ_k = (1/|S_k|) Σ_{i ∈ S_k} x_i, the mean of points assigned to cluster k. (This minimises J in μ with c fixed — the mean is the unique minimiser of squared distance.)
  3. Repeat until assignments stop changing.

Each step is coordinate descent on a bounded objective, so J is monotone non-increasing. There are finitely many assignments, so the algorithm terminates — but only at a local minimum (the global problem is NP-hard). Run multiple random restarts and keep the lowest-J solution.

Assumptions baked into k-means — and where they break

Three assumptions the squared Euclidean loss imposes implicitly:

Two practical consequences:

Choosing K

MethodIdeaHonest weakness
ElbowPlot inertia (J) vs K; pick the kink.Often ambiguous or absent — you're eyeballing curvature on a monotonically-decreasing curve.
SilhouettePer point, s = (b − a) / max(a, b) with a = mean intra-cluster distance, b = mean distance to the nearest other cluster. Average; pick K maximising it. Range [-1, 1].Biased toward convex, well-separated clusters; misleads on overlapping or non-convex shapes.
Gap statisticTibshirani et al. 2001. Compare inertia to inertia on uniformly-random null data; pick the K with largest gap.Expensive (k-means on many bootstrap nulls). Sensitive to the null distribution.
Downstream taskPick K by how well clusters serve the actual use case.Requires a downstream metric — but if you have one, this is the only honest answer.

The senior reading: K is a modelling choice, not a discoverable truth. Heuristics like elbow and silhouette are sanity checks, not oracles.

Gaussian Mixture Models — k-means with covariance

A GMM models the data as a mixture of K Gaussians:

p(x) = Σ_k π_k N(x | μ_k, Σ_k)

Fit by EM. E-step: soft responsibility γ_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j). M-step: re-estimate π_k, μ_k, Σ_k as the responsibility-weighted statistics. EM monotonically increases the data log-likelihood, with the same local-optimum caveat as Lloyd's.

The textbook connection: k-means is the limit of GMM with isotropic equal-variance covariances and hard assignments. Letting the covariances be full and the assignments be soft gives elliptical clusters of unequal size and a posterior probability of cluster membership.

Hierarchical clustering and DBSCAN

MethodHow it worksWhat it's good at
AgglomerativeEach point its own cluster; merge the closest pair under a linkage (single = min pairwise distance, complete = max, average = mean, Ward = minimise within-cluster variance). Produces a dendrogram you cut at a chosen height.No need to pre-specify K. Different linkages capture different shapes. Cost O(n²) to O(n³) depending on linkage and implementation (SLINK/CLINK O(n²); heap-based generic O(n² log n); naive O(n³)) — doesn't scale past ~10⁴ points.
DBSCANEster et al. 1996. A point is core if at least min_pts points lie within ε. Clusters are maximal connected sets of core points (plus boundary). Non-core / non-boundary points are noise.Arbitrary cluster shapes (moons, rings). Doesn't require K. Native outlier handling. Brittle (ε, min_pts) tuning; struggles with very different cluster densities (HDBSCAN extends).

Clustering trade-off table

K-meansGMMHierarchicalDBSCAN
Specify K?YesYesNo (cut dendrogram)No (density-driven)
Non-spherical clusters?NoElliptical onlyDepends on linkageArbitrary shapes
Unequal cluster sizes?PoorlyOK (π_k absorbs it)OKOK
Soft assignment?NoYesNoNo (plus a noise label)
Scales to 10⁶+ points?Yes (with mini-batch)Yes (slower than k-means)NoYes (with spatial index)
Outlier handling?None (every point in a cluster)Implicit (low-likelihood points)NoneExplicit noise label
Main knobsK, initK, covariance typeLinkage, cut heightε, min_pts

Part B — Dimensionality reduction

PCA from first principles

Centre the data: X̃ = X − x̄. PCA seeks an orthonormal basis {v_1, …, v_d} such that projecting onto v_1 maximises variance, v_2 maximises variance subject to being orthogonal to v_1, and so on.

Variance of the projection onto v is v^T Σ v with Σ = X̃ᵀX̃ / (n − 1). Maximising v^T Σ v s.t. ‖v‖ = 1 via Lagrangian v^T Σ v − λ(v^T v − 1) yields Σ v = λ v. The principal components are eigenvectors of the covariance matrix; the eigenvalues are the projected variances. Computationally use the SVD X̃ = U S Vᵀ — columns of V are the PCs, s_i² / (n−1) the eigenvalues. SVD is numerically more stable than forming X̃ᵀX̃ explicitly.

x_2 v_2 ↑ ▲ ╲ │ · · · ╲ · · · · │ · · · · · ╲· · · · · · │ · · · · · · → · · · · · · → v_1 (max variance) │ · · · · · · · · · · ·╱ │ · · · · · · · · ╱ │ ╱ └──────────► x_1 ╱ raw axes (correlated) rotated axes (decorrelated)

What PCA does and doesn't do

PCA is good atPCA breaks on
Linearly-correlated features. Compressing redundancy. Denoising (drop low-variance components). Visualisation when data is roughly linear.Non-linear manifolds — a circle in 2D collapses to a line in PC1. Categorical or sparse data. Features with very different scales (standardise first).

One trap: "PC1 explains 90% of variance, so drop the rest." Variance is not task-relevant information. The discriminative direction for your downstream classifier may lie in a tiny-variance component. PCA is feature compression, not feature selection.

t-SNE and UMAP — what they are, what they are not

t-SNE (van der Maaten & Hinton 2008) defines pairwise similarities in high-D via a Gaussian kernel and in low-D via a heavy-tailed Student-t kernel, then minimises the KL divergence between them. The heavy tail in low-D lets distant points repel without exploding gradients — preserving local structure at the price of global structure.

UMAP (McInnes et al. 2018) uses a different theoretical framing (fuzzy simplicial sets from Riemannian geometry) but the same operational shape: build a high-D neighbour graph, find a low-D layout that preserves it. Faster, scales better, preserves more global structure — though "more" is not "all".

The senior-vs-junior moment
Both t-SNE and UMAP are visualisation tools. In a t-SNE or UMAP plot: "The t-SNE plot shows class A is far from class B" is a junior mistake. The plot shows they form distinguishable groups; the layout-distance is an optimisation artefact, not a property of the data.

Dimensionality-reduction trade-off table

PCAt-SNEUMAP
Linear or non-linear?LinearNon-linearNon-linear
Preserves global structure?Yes for linearly-correlated structure; no for non-linear manifolds (e.g. a Swiss roll collapses)NoPartially
Preserves local structure?Only if linearYes (its objective)Yes
Distances meaningful?Yes (in PC space)NoNo (better than t-SNE but still no)
Deterministic?YesNo (stochastic init)No (stochastic init)
Scales to 10⁶ points?Yes (randomised SVD)Tight (Barnes-Hut/FIt-SNE help)Yes
Use as preprocessing?OftenAlmost neverSometimes
Use for visualisation?If linear worksCommonModern default
Interpretable axes?Yes (loadings)NoNo

Interactive · clustering playground

Pick a dataset and algorithm. The point is to feel where k-means breaks (moons, anisotropic) and where DBSCAN earns its keep. Silhouette and inertia are reported so you can sanity-check metrics on configurations that visibly fail.

Algorithm × dataset
Try k-means on blobs (works), k-means on moons (fails — sliced down the middle), DBSCAN on moons (works), GMM on anisotropic (handles the elongation k-means can't).
clusters found
silhouette
inertia
noise points
Reading

Interview prompts you should be ready for

  1. "Show me k-means is guaranteed to converge." (Each step minimises the objective in one coordinate with the other fixed, so J is monotone non-increasing. There are finitely many assignments, so the sequence of objectives is bounded below and eventually stable. Converges to a local minimum — not the global one.)
  2. "Why do you standardise features before k-means?" (Squared Euclidean distance is not scale-invariant. A feature in dollars (range 10⁶) dwarfs a feature in years (range 10²), so the clustering reduces to "group by the dollar feature". Z-score puts all features on a comparable scale; standardisation is the default unless you have a domain reason to weight features differently.)
  3. "What's the relationship between k-means and GMM?" (K-means is the hard-assignment, isotropic-equal-variance limit of GMM-with-EM. The E-step in GMM gives soft responsibilities; replacing them with arg-max assignments and the covariance with σ²I gives Lloyd's algorithm exactly.)
  4. "You ran t-SNE on your embeddings and got beautiful clusters. Can you trust the distances between them?" (No. t-SNE optimises a divergence between local neighbour distributions; distant points in the plot have essentially unconstrained positions. Cluster identity is meaningful, inter-cluster distance is not. Same for cluster sizes and within-cluster density.)
  5. "Your PCA's first component explains 90% of variance — should you drop everything else?" (Not automatically. Variance is not task-relevance: the discriminative axis for your downstream model may live in a low-variance component. Check downstream performance with and without the trailing components before discarding.)
  6. "When would you pick UMAP over t-SNE? Over PCA?" (Over t-SNE: speed at scale, better global-structure preservation, deterministic-ish given a seed. Over PCA: when the manifold is non-linear and a linear projection collapses meaningful structure. PCA when you need interpretable axes, when you'll feed the embedding into a downstream model, or when you specifically need distances to remain meaningful.)
Takeaway
Clustering is a modelling choice (which algorithm, which K) constrained by the geometry of your data and the use of the output. K-means is the default; reach for GMM when ellipsoidal, DBSCAN when non-convex or noisy, hierarchical when you want a dendrogram. Dimensionality reduction splits cleanly: PCA when you need linear, interpretable, reusable axes; t-SNE/UMAP when you need to look at the data. The senior signal is knowing what each method's plot is and isn't telling you.