Clustering & dimensionality reduction
Unsupervised learning is two questions: "are there groups in my data" and "can I represent it in fewer dimensions". Both are usually preprocessing or exploration — but the failure modes are subtle and beloved by interviewers.
Part A — Clustering
K-means: Lloyd's algorithm from first principles
Given n points x_1, …, x_n ∈ ℝ^d and a target number of clusters K, k-means minimises the within-cluster sum of squared distances:
where c_i ∈ {1, …, K} is the cluster assignment of point i and μ_k is the centroid of cluster k. The objective is non-convex in (c, μ) jointly. Lloyd's algorithm (1957/1982) is alternating minimisation:
- Assign. Fix μ. For each i, set c_i = arg min_k ‖x_i − μ_k‖². (Each x_i goes to its nearest centroid; this minimises J in c with μ fixed.)
- Update. Fix c. For each k, set μ_k = (1/|S_k|) Σ_{i ∈ S_k} x_i, the mean of points assigned to cluster k. (This minimises J in μ with c fixed — the mean is the unique minimiser of squared distance.)
- Repeat until assignments stop changing.
Each step is coordinate descent on a bounded objective, so J is monotone non-increasing. There are finitely many assignments, so the algorithm terminates — but only at a local minimum (the global problem is NP-hard). Run multiple random restarts and keep the lowest-J solution.
Assumptions baked into k-means — and where they break
Three assumptions the squared Euclidean loss imposes implicitly:
- Spherical clusters. Distance is isotropic, so k-means draws Voronoi boundaries — straight lines between centroids. Elongated or curved clusters get sliced apart.
- Equal variance. Each cluster contributes the same loss-per-point, so a high-variance cluster gets cannibalised by tighter neighbours.
- Equal size. The Voronoi partition assigns by nearest centroid regardless of density, so dense small clusters get absorbed into sparse big ones.
Two practical consequences:
- Standardise first. Squared distance is not scale-invariant; a feature in dollars (range 10⁶) dwarfs a feature in years (range 10²) and the clustering reduces to "group by the dollar feature".
- Initialisation matters. Use k-means++ (Arthur & Vassilvitskii 2007): pick the first centroid uniformly, then each next centroid with probability proportional to D(x)² where D(x) is distance to the nearest already-picked centroid. Logarithmic competitive bound on J in expectation (O(log K) in expectation; Arthur & Vassilvitskii 2007) — the default everywhere now.
Choosing K
| Method | Idea | Honest weakness |
|---|---|---|
| Elbow | Plot inertia (J) vs K; pick the kink. | Often ambiguous or absent — you're eyeballing curvature on a monotonically-decreasing curve. |
| Silhouette | Per point, s = (b − a) / max(a, b) with a = mean intra-cluster distance, b = mean distance to the nearest other cluster. Average; pick K maximising it. Range [-1, 1]. | Biased toward convex, well-separated clusters; misleads on overlapping or non-convex shapes. |
| Gap statistic | Tibshirani et al. 2001. Compare inertia to inertia on uniformly-random null data; pick the K with largest gap. | Expensive (k-means on many bootstrap nulls). Sensitive to the null distribution. |
| Downstream task | Pick K by how well clusters serve the actual use case. | Requires a downstream metric — but if you have one, this is the only honest answer. |
The senior reading: K is a modelling choice, not a discoverable truth. Heuristics like elbow and silhouette are sanity checks, not oracles.
Gaussian Mixture Models — k-means with covariance
A GMM models the data as a mixture of K Gaussians:
Fit by EM. E-step: soft responsibility γ_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j). M-step: re-estimate π_k, μ_k, Σ_k as the responsibility-weighted statistics. EM monotonically increases the data log-likelihood, with the same local-optimum caveat as Lloyd's.
The textbook connection: k-means is the limit of GMM with isotropic equal-variance covariances and hard assignments. Letting the covariances be full and the assignments be soft gives elliptical clusters of unequal size and a posterior probability of cluster membership.
Hierarchical clustering and DBSCAN
| Method | How it works | What it's good at |
|---|---|---|
| Agglomerative | Each point its own cluster; merge the closest pair under a linkage (single = min pairwise distance, complete = max, average = mean, Ward = minimise within-cluster variance). Produces a dendrogram you cut at a chosen height. | No need to pre-specify K. Different linkages capture different shapes. Cost O(n²) to O(n³) depending on linkage and implementation (SLINK/CLINK O(n²); heap-based generic O(n² log n); naive O(n³)) — doesn't scale past ~10⁴ points. |
| DBSCAN | Ester et al. 1996. A point is core if at least min_pts points lie within ε. Clusters are maximal connected sets of core points (plus boundary). Non-core / non-boundary points are noise. | Arbitrary cluster shapes (moons, rings). Doesn't require K. Native outlier handling. Brittle (ε, min_pts) tuning; struggles with very different cluster densities (HDBSCAN extends). |
Clustering trade-off table
| K-means | GMM | Hierarchical | DBSCAN | |
|---|---|---|---|---|
| Specify K? | Yes | Yes | No (cut dendrogram) | No (density-driven) |
| Non-spherical clusters? | No | Elliptical only | Depends on linkage | Arbitrary shapes |
| Unequal cluster sizes? | Poorly | OK (π_k absorbs it) | OK | OK |
| Soft assignment? | No | Yes | No | No (plus a noise label) |
| Scales to 10⁶+ points? | Yes (with mini-batch) | Yes (slower than k-means) | No | Yes (with spatial index) |
| Outlier handling? | None (every point in a cluster) | Implicit (low-likelihood points) | None | Explicit noise label |
| Main knobs | K, init | K, covariance type | Linkage, cut height | ε, min_pts |
Part B — Dimensionality reduction
PCA from first principles
Centre the data: X̃ = X − x̄. PCA seeks an orthonormal basis {v_1, …, v_d} such that projecting onto v_1 maximises variance, v_2 maximises variance subject to being orthogonal to v_1, and so on.
Variance of the projection onto v is v^T Σ v with Σ = X̃ᵀX̃ / (n − 1). Maximising v^T Σ v s.t. ‖v‖ = 1 via Lagrangian v^T Σ v − λ(v^T v − 1) yields Σ v = λ v. The principal components are eigenvectors of the covariance matrix; the eigenvalues are the projected variances. Computationally use the SVD X̃ = U S Vᵀ — columns of V are the PCs, s_i² / (n−1) the eigenvalues. SVD is numerically more stable than forming X̃ᵀX̃ explicitly.
What PCA does and doesn't do
| PCA is good at | PCA breaks on |
|---|---|
| Linearly-correlated features. Compressing redundancy. Denoising (drop low-variance components). Visualisation when data is roughly linear. | Non-linear manifolds — a circle in 2D collapses to a line in PC1. Categorical or sparse data. Features with very different scales (standardise first). |
One trap: "PC1 explains 90% of variance, so drop the rest." Variance is not task-relevant information. The discriminative direction for your downstream classifier may lie in a tiny-variance component. PCA is feature compression, not feature selection.
t-SNE and UMAP — what they are, what they are not
t-SNE (van der Maaten & Hinton 2008) defines pairwise similarities in high-D via a Gaussian kernel and in low-D via a heavy-tailed Student-t kernel, then minimises the KL divergence between them. The heavy tail in low-D lets distant points repel without exploding gradients — preserving local structure at the price of global structure.
UMAP (McInnes et al. 2018) uses a different theoretical framing (fuzzy simplicial sets from Riemannian geometry) but the same operational shape: build a high-D neighbour graph, find a low-D layout that preserves it. Faster, scales better, preserves more global structure — though "more" is not "all".
- Cluster identification is meaningful (points grouped together were neighbours in high-D).
- Distances between clusters are not meaningful — two clusters far apart on a t-SNE plot may have been adjacent in the original space.
- Cluster sizes are not meaningful — perplexity controls effective neighbourhood, not visual diameter.
- Density within a cluster is not meaningful — t-SNE equalises density.
Dimensionality-reduction trade-off table
| PCA | t-SNE | UMAP | |
|---|---|---|---|
| Linear or non-linear? | Linear | Non-linear | Non-linear |
| Preserves global structure? | Yes for linearly-correlated structure; no for non-linear manifolds (e.g. a Swiss roll collapses) | No | Partially |
| Preserves local structure? | Only if linear | Yes (its objective) | Yes |
| Distances meaningful? | Yes (in PC space) | No | No (better than t-SNE but still no) |
| Deterministic? | Yes | No (stochastic init) | No (stochastic init) |
| Scales to 10⁶ points? | Yes (randomised SVD) | Tight (Barnes-Hut/FIt-SNE help) | Yes |
| Use as preprocessing? | Often | Almost never | Sometimes |
| Use for visualisation? | If linear works | Common | Modern default |
| Interpretable axes? | Yes (loadings) | No | No |
Interactive · clustering playground
Pick a dataset and algorithm. The point is to feel where k-means breaks (moons, anisotropic) and where DBSCAN earns its keep. Silhouette and inertia are reported so you can sanity-check metrics on configurations that visibly fail.
Interview prompts you should be ready for
- "Show me k-means is guaranteed to converge." (Each step minimises the objective in one coordinate with the other fixed, so J is monotone non-increasing. There are finitely many assignments, so the sequence of objectives is bounded below and eventually stable. Converges to a local minimum — not the global one.)
- "Why do you standardise features before k-means?" (Squared Euclidean distance is not scale-invariant. A feature in dollars (range 10⁶) dwarfs a feature in years (range 10²), so the clustering reduces to "group by the dollar feature". Z-score puts all features on a comparable scale; standardisation is the default unless you have a domain reason to weight features differently.)
- "What's the relationship between k-means and GMM?" (K-means is the hard-assignment, isotropic-equal-variance limit of GMM-with-EM. The E-step in GMM gives soft responsibilities; replacing them with arg-max assignments and the covariance with σ²I gives Lloyd's algorithm exactly.)
- "You ran t-SNE on your embeddings and got beautiful clusters. Can you trust the distances between them?" (No. t-SNE optimises a divergence between local neighbour distributions; distant points in the plot have essentially unconstrained positions. Cluster identity is meaningful, inter-cluster distance is not. Same for cluster sizes and within-cluster density.)
- "Your PCA's first component explains 90% of variance — should you drop everything else?" (Not automatically. Variance is not task-relevance: the discriminative axis for your downstream model may live in a low-variance component. Check downstream performance with and without the trailing components before discarding.)
- "When would you pick UMAP over t-SNE? Over PCA?" (Over t-SNE: speed at scale, better global-structure preservation, deterministic-ish given a seed. Over PCA: when the manifold is non-linear and a linear projection collapses meaningful structure. PCA when you need interpretable axes, when you'll feed the embedding into a downstream model, or when you specifically need distances to remain meaningful.)