Embeddings and approximate nearest neighbour

The two-tower model gives you vectors. This lesson is about the geometry those vectors live in, and the index that turns a vector lookup into a sub-millisecond operation at billion scale.

Similarity functions, from first principles

Retrieval reduces "find the most relevant items" to "find vectors closest to the query." But closest isn't one thing. The three choices interviewers expect you to reason about:

Similarity	Definition	Property
Dot product	⟨u, v⟩ = Σᵢ uᵢ vᵢ	Sensitive to norm. ‖v‖ raises the score of v against every query. Not a metric — no triangle inequality, unbounded.
Cosine	⟨u, v⟩ / (‖u‖ · ‖v‖)	Normalises away norm. Only the angle matters. Bounded in [−1, 1]. Still not a true metric (cos similarity is a similarity, not a distance), but its complement 1 − cos is well-behaved.
L2 distance	‖u − v‖₂	A true metric (triangle inequality holds). On the unit sphere, ‖u − v‖² = 2 − 2⟨u, v⟩, so nearest-by-L2 and nearest-by-cosine are identical orderings.

The interview question is "when do you pick which?" The answer flows from one observation: norm carries information. A two-tower model trained with unconstrained dot product can make a broadly popular item's vector longer — a free knob to say "score me higher against any query." That can be exactly what you want (popularity as a useful prior) or exactly what you don't (you want the geometry to encode specific affinity, not popularity). Use dot product when norm-as-confidence is desired; cosine when items should sit on equal geometric footing and popularity is handled separately; L2 when you've already normalised and want a true metric for indices that assume one.

The train/serve geometry trap

If your two-tower model is trained with dot product (no normalisation in the loss) and you deploy with cosine (normalising at query time), you have silently changed the objective. Items the trainer learned to favour via long norm collapse to angle-only scoring, and recall against the offline metric drops. This is one of the most common production bugs in retrieval. Either normalise in training, or serve with the same similarity you trained on.

MIPS is not a metric problem

Maximum Inner Product Search — "find argmax_v ⟨q, v⟩" — looks like nearest-neighbour search but isn't. Dot product violates the triangle inequality and isn't bounded; the textbook nearest-neighbour data structures (kd-tree, ball-tree, cover-tree) rely on metric pruning and don't apply directly.

Two practical fixes. Norm augmentation (Bachrach et al. 2014, the "extra coordinate" trick): append a coordinate so all augmented item vectors share the same norm M ≥ max‖v‖ — replace v with [v ; √(M² − ‖v‖²)] and q with [q ; 0]. Augmented vectors live on a sphere of radius M; MIPS over the originals becomes nearest-neighbour over the augmented set. Or train to the sphere: add an L2 normalisation layer at the top of both towers and use cosine end-to-end. Most modern systems pick the second route — the loss of norm-as-popularity is reclaimed with an explicit popularity feature in the ranker.

ANN index families — the three you must know

Linear scan costs O(Nd) per query. At N = 10⁹, d = 128, that's ~10¹¹ multiply-adds per query — minutes, not milliseconds. You need an index whose query cost is sub-linear in N.

Family	How it prunes	Knobs	Where it lives in practice
Tree-based (kd-tree, annoy)	Recursively partition space; at query time, descend the tree, backtrack into siblings whose bounding region could contain a closer point.	tree depth, num_trees	Works only in low d (≲ 20). The curse of dimensionality makes pruning ineffective: in high d, almost every sibling subtree could contain a closer point, so you visit nearly everything.
Inverted-file (IVF)	Cluster the index with k-means into nlist cells. Each item lives in its nearest cell. At query time, find the nprobe nearest cell centroids and scan only those.	nlist (build), nprobe (query)	Bread and butter at billion scale, usually as IVF-PQ (next section). FAISS's standard billion-scale recipe is IVF-PQ.
HNSW	Layered proximity graph. Top layer is sparse (long-range edges); descend layer by layer with greedy best-first search. Each node has up to M neighbours on upper layers, 2M on layer 0.	M, ef_construction (build), ef_search (query)	State of the art recall@K vs latency in the regime where the graph fits in RAM. Used by Pinecone, Weaviate, FAISS-HNSW, pgvector's hnsw.

LSH (locality-sensitive hashing) is the historical alternative — random-projection families that make close vectors collide. First sub-linear theory results came from here. In practice HNSW and IVF dominate the recall/latency Pareto. Worth knowing, rarely worth deploying.

HNSW's greedy descent in one paragraph

Each item gets a random max level; higher levels are exponentially rarer. Within a level, connect to M nearest existing nodes. At query time, start at the entry point on the top (sparsest) layer, greedily walk toward the query, drop a level when no neighbour is closer. On the bottom layer, run a beam search of width ef_search and return the top K. Larger M ⇒ denser graph, better recall, more RAM. Larger ef_search ⇒ wider beam, better recall, more compute per query.

Product Quantization — an orthogonal axis

The above indices reduce the number of vectors scanned. Product quantization (Jégou et al. 2011) reduces the cost per vector. Split each d-dim vector into m sub-vectors, quantize each sub-vector to one of k centroids (typically k = 256, one byte). Raw size: d × 4 B fp32 = 512 B for d = 128. PQ size: m bytes. RAM reduction: 8–32×. Scoring uses an asymmetric distance table — precompute m × k sub-distances between query and centroids once per query, each item's distance is m table lookups summed.

The cost is quantization error: items sharing centroids in every sub-quantizer become indistinguishable. PQ is usually combined with residual re-ranking (top candidates by PQ score, re-score with raw vectors) and almost always with IVF — IVF-PQ is the standard billion-scale recipe (FAISS's IndexIVFPQ). HNSW-PQ exists but is more fragile: HNSW's greedy descent depends on accurate distance comparisons, and quantization noise can mis-route the walk. "PQ is for IVF" is the rule-of-thumb in practice.

The three-axis trade-off

The single frame that holds the rest of the lesson together: recall, latency, RAM. Pick two.

Want max recall + low latency? HNSW with high M, ef_search — pay in RAM.
Want max recall + low RAM? IVF-PQ with high nprobe and re-ranking — pay in latency.
Want low latency + low RAM? IVF-PQ with low nprobe — pay in recall.

	HNSW	IVF (flat)	IVF-PQ
Recall@10 @ matched latency	highest	middle	lower (PQ error)
RAM cost	~1.3–2× raw vectors (graph overhead)	~1× raw vectors	~0.05–0.2× raw vectors
Build time	slow (graph construction)	fast (k-means)	fast + PQ training pass
Streaming inserts	supported but expensive; historically deletes were awkward (orphaned edges → tombstones + periodic rebuild), though hnswlib's markDelete-with-replacement and Lucene HNSW now handle deletes with reasonable efficiency — rebuilds remain the standard at scale	straightforward — assign to cell, append	straightforward inserts; PQ codebook drifts over time, periodic retrain
Knobs to tune	3 (M, ef_construction, ef_search)	2 (nlist, nprobe)	4+ (nlist, nprobe, m, k, residual-rerank K)
Sweet spot	10⁶–10⁸ vectors, RAM-rich, latency-critical	medium scale baseline	10⁸–10¹⁰ vectors, RAM-constrained

Interactive · ANN trade-off explorer

Slide the index size and target recall; switch algorithms. The outputs are rough scaling approximations, not measurements — they exist to make the three-axis trade-off feel concrete. Real numbers depend on hardware, dimension, batch size, and a dozen other things; the curves below get the shape right and the order of magnitude approximately right.

Sharding and updates

At a billion vectors, no single machine holds the index. The standard pattern is:

Shard by hash or partition key. Per-shard ANN index. A coordinator scatters the query, each shard returns local top-K, the coordinator gathers and re-ranks.
Replicate each shard for QPS and HA. ANN indices are read-heavy, so replication is cheap.
Append-only with tombstones. Inserts are easy. Deletes are hard — HNSW in particular can't cleanly remove a node without orphaning neighbours' edges. Production answer: tombstone the doc, filter at query time, rebuild the shard nightly or weekly.
Two-tier freshness. Small online index (recent items, minutes-old) plus large offline index (rebuilt nightly). Queries hit both; results merge. Avoids rebuilding a billion vectors every upload.

Why nightly rebuilds, not streaming

HNSW's graph degrades as you insert into it without periodically rebalancing — early inserts get high-quality neighbour lists, later inserts may find suboptimal entry points. IVF's cluster assignments drift as the data distribution changes. PQ codebooks become miscalibrated as the residual distribution shifts. All three are addressed by periodic rebuilds; none is eliminated by streaming inserts. Senior interview signal: knowing that "we just stream everything in" is a junior answer.

When you should NOT use ANN

ANN is a good answer when (a) you have many items, (b) you need sub-linear search, and (c) approximate recall is acceptable. Each of those can fail:

Small catalogues. Under ~10⁵ items, a GPU/SIMD linear scan beats ANN on latency and gives exact recall.
Heavy filtering. If queries filter to a small eligible set (ads matching this query within budget), post-filter recall can be terrible — ANN returned 1000 candidates but only 3 pass the filter. Build indices over filtered partitions, or use pre-filtered ANN (FAISS's id_filter, ScaNN's masked search), accepting some recall hit.
Exact-match required. ANN is approximate by design. Use exact search where correctness is mandatory.
Wrong-shape similarity. If similarity depends on cross features (user × item, attention over context), no static index helps — that's ranking's job, not retrieval's.

Interview prompts you should be ready for

"Your two-tower model is trained with dot product but you deploy with cosine. What's the silent bug?" (Probes the train/serve geometry mismatch. The norm-as-popularity signal vanishes at serve time; offline recall drops because the items the trainer favoured by norm now tie on angle.)
"Walk me through HNSW's search procedure end-to-end." (Layered graph, greedy descent from entry point, beam search of width ef_search on the bottom layer. Probes whether you can describe an algorithm clearly under pressure.)
"You have a 1B-item index that needs sub-50 ms p99. RAM is constrained to 500 GB. What index do you build?" (Forces the three-axis frame. 1B × ~512 B = 512 GB raw — HNSW's graph overhead doesn't fit. IVF-PQ is the realistic answer; explain compression ratio and the recall hit you accept.)
"Recall@10 dropped from 92% to 78% after adding 100M new items. Diagnose." (Probes: was the index rebuilt or streamed-inserted? Did the cluster centroids drift? PQ codebook stale? Catalogue distribution shifted away from training? Should have rebalanced/rebuilt.)
"Why is product quantization compatible with IVF but more fragile with HNSW?" (Trick — it's compatible with both. The honest answer: HNSW's greedy descent assumes distances are accurate enough to navigate; quantization noise breaks navigation more than it breaks IVF's cell selection.)
"When would you NOT use ANN?" (Small catalogue, heavy filter, exact-match required, or the similarity isn't expressible as a static vector dot product. Probes whether you reach for ANN reflexively or by argument.)

Takeaway

Embeddings choose a geometry; the ANN index navigates it. The geometry decision (dot vs cosine vs L2) must match between training and serving or you have a silent bug. The index decision (HNSW vs IVF vs IVF-PQ) is forced by where you sit on the recall/latency/RAM triangle, and at billion scale only one corner of that triangle is realistic at a time. Senior signal: when given a scale and a budget, you can name the index and defend the trade-off in two sentences.