Clustering & dimensionality reduction

Unsupervised learning is two questions: "are there groups in my data" and "can I represent it in fewer dimensions". Both are usually preprocessing or exploration — but the failure modes are subtle and beloved by interviewers.

Part A — Clustering

K-means: Lloyd's algorithm from first principles

Given n points x_1, …, x_n ∈ ℝ^d and a target number of clusters K, k-means minimises the within-cluster sum of squared distances:

J = Σ_i ‖x_i − μ_{c_i}‖²

where c_i ∈ {1, …, K} is the cluster assignment of point i and μ_k is the centroid of cluster k. The objective is non-convex in (c, μ) jointly. Lloyd's algorithm (1957/1982) is alternating minimisation:

Assign. Fix μ. For each i, set c_i = arg min_k ‖x_i − μ_k‖². (Each x_i goes to its nearest centroid; this minimises J in c with μ fixed.)
Update. Fix c. For each k, set μ_k = (1/|S_k|) Σ_{i ∈ S_k} x_i, the mean of points assigned to cluster k. (This minimises J in μ with c fixed — the mean is the unique minimiser of squared distance.)
Repeat until assignments stop changing.

Each step is coordinate descent on a bounded objective, so J is monotone non-increasing. There are finitely many assignments, so the algorithm terminates — but only at a local minimum (the global problem is NP-hard). Run multiple random restarts and keep the lowest-J solution.

Assumptions baked into k-means — and where they break

Three assumptions the squared Euclidean loss imposes implicitly:

Spherical clusters. Distance is isotropic, so k-means draws Voronoi boundaries — straight lines between centroids. Elongated or curved clusters get sliced apart.
Equal variance. Each cluster contributes the same loss-per-point, so a high-variance cluster gets cannibalised by tighter neighbours.
Equal size. The Voronoi partition assigns by nearest centroid regardless of density, so dense small clusters get absorbed into sparse big ones.

Two practical consequences:

Standardise first. Squared distance is not scale-invariant; a feature in dollars (range 10⁶) dwarfs a feature in years (range 10²) and the clustering reduces to "group by the dollar feature".
Initialisation matters. Use k-means++ (Arthur & Vassilvitskii 2007): pick the first centroid uniformly, then each next centroid with probability proportional to D(x)² where D(x) is distance to the nearest already-picked centroid. Logarithmic competitive bound on J in expectation (O(log K) in expectation; Arthur & Vassilvitskii 2007) — the default everywhere now.

Choosing K

Method	Idea	Honest weakness
Elbow	Plot inertia (J) vs K; pick the kink.	Often ambiguous or absent — you're eyeballing curvature on a monotonically-decreasing curve.
Silhouette	Per point, s = (b − a) / max(a, b) with a = mean intra-cluster distance, b = mean distance to the nearest other cluster. Average; pick K maximising it. Range [-1, 1].	Biased toward convex, well-separated clusters; misleads on overlapping or non-convex shapes.
Gap statistic	Tibshirani et al. 2001. Compare inertia to inertia on uniformly-random null data; pick the K with largest gap.	Expensive (k-means on many bootstrap nulls). Sensitive to the null distribution.
Downstream task	Pick K by how well clusters serve the actual use case.	Requires a downstream metric — but if you have one, this is the only honest answer.

The senior reading: K is a modelling choice, not a discoverable truth. Heuristics like elbow and silhouette are sanity checks, not oracles.

Gaussian Mixture Models — k-means with covariance

A GMM models the data as a mixture of K Gaussians:

p(x) = Σ_k π_k N(x | μ_k, Σ_k)

Fit by EM. E-step: soft responsibility γ_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j). M-step: re-estimate π_k, μ_k, Σ_k as the responsibility-weighted statistics. EM monotonically increases the data log-likelihood, with the same local-optimum caveat as Lloyd's.

The textbook connection: k-means is the limit of GMM with isotropic equal-variance covariances and hard assignments. Letting the covariances be full and the assignments be soft gives elliptical clusters of unequal size and a posterior probability of cluster membership.

Hierarchical clustering and DBSCAN

Method	How it works	What it's good at
Agglomerative	Each point its own cluster; merge the closest pair under a linkage (single = min pairwise distance, complete = max, average = mean, Ward = minimise within-cluster variance). Produces a dendrogram you cut at a chosen height.	No need to pre-specify K. Different linkages capture different shapes. Cost O(n²) to O(n³) depending on linkage and implementation (SLINK/CLINK O(n²); heap-based generic O(n² log n); naive O(n³)) — doesn't scale past ~10⁴ points.
DBSCAN	Ester et al. 1996. A point is core if at least min_pts points lie within ε. Clusters are maximal connected sets of core points (plus boundary). Non-core / non-boundary points are noise.	Arbitrary cluster shapes (moons, rings). Doesn't require K. Native outlier handling. Brittle (ε, min_pts) tuning; struggles with very different cluster densities (HDBSCAN extends).

Clustering trade-off table

	K-means	GMM	Hierarchical	DBSCAN
Specify K?	Yes	Yes	No (cut dendrogram)	No (density-driven)
Non-spherical clusters?	No	Elliptical only	Depends on linkage	Arbitrary shapes
Unequal cluster sizes?	Poorly	OK (π_k absorbs it)	OK	OK
Soft assignment?	No	Yes	No	No (plus a noise label)
Scales to 10⁶+ points?	Yes (with mini-batch)	Yes (slower than k-means)	No	Yes (with spatial index)
Outlier handling?	None (every point in a cluster)	Implicit (low-likelihood points)	None	Explicit noise label
Main knobs	K, init	K, covariance type	Linkage, cut height	ε, min_pts

Part B — Dimensionality reduction

PCA from first principles

Centre the data: X̃ = X − x̄. PCA seeks an orthonormal basis {v_1, …, v_d} such that projecting onto v_1 maximises variance, v_2 maximises variance subject to being orthogonal to v_1, and so on.

Variance of the projection onto v is v^T Σ v with Σ = X̃ᵀX̃ / (n − 1). Maximising v^T Σ v s.t. ‖v‖ = 1 via Lagrangian v^T Σ v − λ(v^T v − 1) yields Σ v = λ v. The principal components are eigenvectors of the covariance matrix; the eigenvalues are the projected variances. Computationally use the SVD X̃ = U S Vᵀ — columns of V are the PCs, s_i² / (n−1) the eigenvalues. SVD is numerically more stable than forming X̃ᵀX̃ explicitly.

x_2 v_2 ↑ ▲ ╲ │ · · · ╲ · · · · │ · · · · · ╲· · · · · · │ · · · · · · → · · · · · · → v_1 (max variance) │ · · · · · · · · · · ·╱ │ · · · · · · · · ╱ │ ╱ └──────────► x_1 ╱ raw axes (correlated) rotated axes (decorrelated)

What PCA does and doesn't do

PCA is good at	PCA breaks on
Linearly-correlated features. Compressing redundancy. Denoising (drop low-variance components). Visualisation when data is roughly linear.	Non-linear manifolds — a circle in 2D collapses to a line in PC1. Categorical or sparse data. Features with very different scales (standardise first).

One trap: "PC1 explains 90% of variance, so drop the rest." Variance is not task-relevant information. The discriminative direction for your downstream classifier may lie in a tiny-variance component. PCA is feature compression, not feature selection.

t-SNE and UMAP — what they are, what they are not

t-SNE (van der Maaten & Hinton 2008) defines pairwise similarities in high-D via a Gaussian kernel and in low-D via a heavy-tailed Student-t kernel, then minimises the KL divergence between them. The heavy tail in low-D lets distant points repel without exploding gradients — preserving local structure at the price of global structure.

UMAP (McInnes et al. 2018) uses a different theoretical framing (fuzzy simplicial sets from Riemannian geometry) but the same operational shape: build a high-D neighbour graph, find a low-D layout that preserves it. Faster, scales better, preserves more global structure — though "more" is not "all".

The senior-vs-junior moment

Both t-SNE and UMAP are visualisation tools. In a t-SNE or UMAP plot:

Cluster identification is meaningful (points grouped together were neighbours in high-D).
Distances between clusters are not meaningful — two clusters far apart on a t-SNE plot may have been adjacent in the original space.
Cluster sizes are not meaningful — perplexity controls effective neighbourhood, not visual diameter.
Density within a cluster is not meaningful — t-SNE equalises density.

"The t-SNE plot shows class A is far from class B" is a junior mistake. The plot shows they form distinguishable groups; the layout-distance is an optimisation artefact, not a property of the data.

Dimensionality-reduction trade-off table

	PCA	t-SNE	UMAP
Linear or non-linear?	Linear	Non-linear	Non-linear
Preserves global structure?	Yes for linearly-correlated structure; no for non-linear manifolds (e.g. a Swiss roll collapses)	No	Partially
Preserves local structure?	Only if linear	Yes (its objective)	Yes
Distances meaningful?	Yes (in PC space)	No	No (better than t-SNE but still no)
Deterministic?	Yes	No (stochastic init)	No (stochastic init)
Scales to 10⁶ points?	Yes (randomised SVD)	Tight (Barnes-Hut/FIt-SNE help)	Yes
Use as preprocessing?	Often	Almost never	Sometimes
Use for visualisation?	If linear works	Common	Modern default
Interpretable axes?	Yes (loadings)	No	No

Interactive · clustering playground

Pick a dataset and algorithm. The point is to feel where k-means breaks (moons, anisotropic) and where DBSCAN earns its keep. Silhouette and inertia are reported so you can sanity-check metrics on configurations that visibly fail.

Algorithm × dataset

Try k-means on blobs (works), k-means on moons (fails — sliced down the middle), DBSCAN on moons (works), GMM on anisotropic (handles the elongation k-means can't).

dataset: algorithm: K (k-means / GMM): 3 ε (DBSCAN): 0.12 min_pts (DBSCAN): 5

clusters found

—

silhouette

—

inertia

—

noise points

—

Reading

—

Interview prompts you should be ready for

"Show me k-means is guaranteed to converge." (Each step minimises the objective in one coordinate with the other fixed, so J is monotone non-increasing. There are finitely many assignments, so the sequence of objectives is bounded below and eventually stable. Converges to a local minimum — not the global one.)
"Why do you standardise features before k-means?" (Squared Euclidean distance is not scale-invariant. A feature in dollars (range 10⁶) dwarfs a feature in years (range 10²), so the clustering reduces to "group by the dollar feature". Z-score puts all features on a comparable scale; standardisation is the default unless you have a domain reason to weight features differently.)
"What's the relationship between k-means and GMM?" (K-means is the hard-assignment, isotropic-equal-variance limit of GMM-with-EM. The E-step in GMM gives soft responsibilities; replacing them with arg-max assignments and the covariance with σ²I gives Lloyd's algorithm exactly.)
"You ran t-SNE on your embeddings and got beautiful clusters. Can you trust the distances between them?" (No. t-SNE optimises a divergence between local neighbour distributions; distant points in the plot have essentially unconstrained positions. Cluster identity is meaningful, inter-cluster distance is not. Same for cluster sizes and within-cluster density.)
"Your PCA's first component explains 90% of variance — should you drop everything else?" (Not automatically. Variance is not task-relevance: the discriminative axis for your downstream model may live in a low-variance component. Check downstream performance with and without the trailing components before discarding.)
"When would you pick UMAP over t-SNE? Over PCA?" (Over t-SNE: speed at scale, better global-structure preservation, deterministic-ish given a seed. Over PCA: when the manifold is non-linear and a linear projection collapses meaningful structure. PCA when you need interpretable axes, when you'll feed the embedding into a downstream model, or when you specifically need distances to remain meaningful.)

Takeaway

Clustering is a modelling choice (which algorithm, which K) constrained by the geometry of your data and the use of the output. K-means is the default; reach for GMM when ellipsoidal, DBSCAN when non-convex or noisy, hierarchical when you want a dendrogram. Dimensionality reduction splits cleanly: PCA when you need linear, interpretable, reusable axes; t-SNE/UMAP when you need to look at the data. The senior signal is knowing what each method's plot is and isn't telling you.