Embeddings and approximate nearest neighbour
The two-tower model gives you vectors. This lesson is about the geometry those vectors live in, and the index that turns a vector lookup into a sub-millisecond operation at billion scale.
Similarity functions, from first principles
Retrieval reduces "find the most relevant items" to "find vectors closest to the query." But closest isn't one thing. The three choices interviewers expect you to reason about:
| Similarity | Definition | Property |
|---|---|---|
| Dot product | ⟨u, v⟩ = Σᵢ uᵢ vᵢ | Sensitive to norm. ‖v‖ raises the score of v against every query. Not a metric — no triangle inequality, unbounded. |
| Cosine | ⟨u, v⟩ / (‖u‖ · ‖v‖) | Normalises away norm. Only the angle matters. Bounded in [−1, 1]. Still not a true metric (cos similarity is a similarity, not a distance), but its complement 1 − cos is well-behaved. |
| L2 distance | ‖u − v‖₂ | A true metric (triangle inequality holds). On the unit sphere, ‖u − v‖² = 2 − 2⟨u, v⟩, so nearest-by-L2 and nearest-by-cosine are identical orderings. |
The interview question is "when do you pick which?" The answer flows from one observation: norm carries information. A two-tower model trained with unconstrained dot product can make a broadly popular item's vector longer — a free knob to say "score me higher against any query." That can be exactly what you want (popularity as a useful prior) or exactly what you don't (you want the geometry to encode specific affinity, not popularity). Use dot product when norm-as-confidence is desired; cosine when items should sit on equal geometric footing and popularity is handled separately; L2 when you've already normalised and want a true metric for indices that assume one.
MIPS is not a metric problem
Maximum Inner Product Search — "find argmax_v ⟨q, v⟩" — looks like nearest-neighbour search but isn't. Dot product violates the triangle inequality and isn't bounded; the textbook nearest-neighbour data structures (kd-tree, ball-tree, cover-tree) rely on metric pruning and don't apply directly.
Two practical fixes. Norm augmentation (Bachrach et al. 2014, the "extra coordinate" trick): append a coordinate so all augmented item vectors share the same norm M ≥ max‖v‖ — replace v with [v ; √(M² − ‖v‖²)] and q with [q ; 0]. Augmented vectors live on a sphere of radius M; MIPS over the originals becomes nearest-neighbour over the augmented set. Or train to the sphere: add an L2 normalisation layer at the top of both towers and use cosine end-to-end. Most modern systems pick the second route — the loss of norm-as-popularity is reclaimed with an explicit popularity feature in the ranker.
ANN index families — the three you must know
Linear scan costs O(Nd) per query. At N = 10⁹, d = 128, that's ~10¹¹ multiply-adds per query — minutes, not milliseconds. You need an index whose query cost is sub-linear in N.
| Family | How it prunes | Knobs | Where it lives in practice |
|---|---|---|---|
| Tree-based (kd-tree, annoy) |
Recursively partition space; at query time, descend the tree, backtrack into siblings whose bounding region could contain a closer point. | tree depth, num_trees | Works only in low d (≲ 20). The curse of dimensionality makes pruning ineffective: in high d, almost every sibling subtree could contain a closer point, so you visit nearly everything. |
| Inverted-file (IVF) | Cluster the index with k-means into nlist cells. Each item lives in its nearest cell. At query time, find the nprobe nearest cell centroids and scan only those. | nlist (build), nprobe (query) | Bread and butter at billion scale, usually as IVF-PQ (next section). FAISS's standard billion-scale recipe is IVF-PQ. |
| HNSW | Layered proximity graph. Top layer is sparse (long-range edges); descend layer by layer with greedy best-first search. Each node has up to M neighbours on upper layers, 2M on layer 0. | M, ef_construction (build), ef_search (query) | State of the art recall@K vs latency in the regime where the graph fits in RAM. Used by Pinecone, Weaviate, FAISS-HNSW, pgvector's hnsw. |
LSH (locality-sensitive hashing) is the historical alternative — random-projection families that make close vectors collide. First sub-linear theory results came from here. In practice HNSW and IVF dominate the recall/latency Pareto. Worth knowing, rarely worth deploying.
HNSW's greedy descent in one paragraph
Each item gets a random max level; higher levels are exponentially rarer. Within a level, connect to M nearest existing nodes. At query time, start at the entry point on the top (sparsest) layer, greedily walk toward the query, drop a level when no neighbour is closer. On the bottom layer, run a beam search of width ef_search and return the top K. Larger M ⇒ denser graph, better recall, more RAM. Larger ef_search ⇒ wider beam, better recall, more compute per query.
Product Quantization — an orthogonal axis
The above indices reduce the number of vectors scanned. Product quantization (Jégou et al. 2011) reduces the cost per vector. Split each d-dim vector into m sub-vectors, quantize each sub-vector to one of k centroids (typically k = 256, one byte). Raw size: d × 4 B fp32 = 512 B for d = 128. PQ size: m bytes. RAM reduction: 8–32×. Scoring uses an asymmetric distance table — precompute m × k sub-distances between query and centroids once per query, each item's distance is m table lookups summed.
The cost is quantization error: items sharing centroids in every sub-quantizer become indistinguishable. PQ is usually combined with residual re-ranking (top candidates by PQ score, re-score with raw vectors) and almost always with IVF — IVF-PQ is the standard billion-scale recipe (FAISS's IndexIVFPQ). HNSW-PQ exists but is more fragile: HNSW's greedy descent depends on accurate distance comparisons, and quantization noise can mis-route the walk. "PQ is for IVF" is the rule-of-thumb in practice.
The three-axis trade-off
The single frame that holds the rest of the lesson together: recall, latency, RAM. Pick two.
- Want max recall + low latency? HNSW with high M, ef_search — pay in RAM.
- Want max recall + low RAM? IVF-PQ with high nprobe and re-ranking — pay in latency.
- Want low latency + low RAM? IVF-PQ with low nprobe — pay in recall.
| HNSW | IVF (flat) | IVF-PQ | |
|---|---|---|---|
| Recall@10 @ matched latency | highest | middle | lower (PQ error) |
| RAM cost | ~1.3–2× raw vectors (graph overhead) | ~1× raw vectors | ~0.05–0.2× raw vectors |
| Build time | slow (graph construction) | fast (k-means) | fast + PQ training pass |
| Streaming inserts | supported but expensive; historically deletes were awkward (orphaned edges → tombstones + periodic rebuild), though hnswlib's markDelete-with-replacement and Lucene HNSW now handle deletes with reasonable efficiency — rebuilds remain the standard at scale | straightforward — assign to cell, append | straightforward inserts; PQ codebook drifts over time, periodic retrain |
| Knobs to tune | 3 (M, ef_construction, ef_search) | 2 (nlist, nprobe) | 4+ (nlist, nprobe, m, k, residual-rerank K) |
| Sweet spot | 10⁶–10⁸ vectors, RAM-rich, latency-critical | medium scale baseline | 10⁸–10¹⁰ vectors, RAM-constrained |
Interactive · ANN trade-off explorer
Slide the index size and target recall; switch algorithms. The outputs are rough scaling approximations, not measurements — they exist to make the three-axis trade-off feel concrete. Real numbers depend on hardware, dimension, batch size, and a dozen other things; the curves below get the shape right and the order of magnitude approximately right.
Sharding and updates
At a billion vectors, no single machine holds the index. The standard pattern is:
- Shard by hash or partition key. Per-shard ANN index. A coordinator scatters the query, each shard returns local top-K, the coordinator gathers and re-ranks.
- Replicate each shard for QPS and HA. ANN indices are read-heavy, so replication is cheap.
- Append-only with tombstones. Inserts are easy. Deletes are hard — HNSW in particular can't cleanly remove a node without orphaning neighbours' edges. Production answer: tombstone the doc, filter at query time, rebuild the shard nightly or weekly.
- Two-tier freshness. Small online index (recent items, minutes-old) plus large offline index (rebuilt nightly). Queries hit both; results merge. Avoids rebuilding a billion vectors every upload.
When you should NOT use ANN
ANN is a good answer when (a) you have many items, (b) you need sub-linear search, and (c) approximate recall is acceptable. Each of those can fail:
- Small catalogues. Under ~10⁵ items, a GPU/SIMD linear scan beats ANN on latency and gives exact recall.
- Heavy filtering. If queries filter to a small eligible set (ads matching this query within budget), post-filter recall can be terrible — ANN returned 1000 candidates but only 3 pass the filter. Build indices over filtered partitions, or use pre-filtered ANN (FAISS's
id_filter, ScaNN's masked search), accepting some recall hit. - Exact-match required. ANN is approximate by design. Use exact search where correctness is mandatory.
- Wrong-shape similarity. If similarity depends on cross features (user × item, attention over context), no static index helps — that's ranking's job, not retrieval's.
Interview prompts you should be ready for
- "Your two-tower model is trained with dot product but you deploy with cosine. What's the silent bug?" (Probes the train/serve geometry mismatch. The norm-as-popularity signal vanishes at serve time; offline recall drops because the items the trainer favoured by norm now tie on angle.)
- "Walk me through HNSW's search procedure end-to-end." (Layered graph, greedy descent from entry point, beam search of width ef_search on the bottom layer. Probes whether you can describe an algorithm clearly under pressure.)
- "You have a 1B-item index that needs sub-50 ms p99. RAM is constrained to 500 GB. What index do you build?" (Forces the three-axis frame. 1B × ~512 B = 512 GB raw — HNSW's graph overhead doesn't fit. IVF-PQ is the realistic answer; explain compression ratio and the recall hit you accept.)
- "Recall@10 dropped from 92% to 78% after adding 100M new items. Diagnose." (Probes: was the index rebuilt or streamed-inserted? Did the cluster centroids drift? PQ codebook stale? Catalogue distribution shifted away from training? Should have rebalanced/rebuilt.)
- "Why is product quantization compatible with IVF but more fragile with HNSW?" (Trick — it's compatible with both. The honest answer: HNSW's greedy descent assumes distances are accurate enough to navigate; quantization noise breaks navigation more than it breaks IVF's cell selection.)
- "When would you NOT use ANN?" (Small catalogue, heavy filter, exact-match required, or the similarity isn't expressible as a static vector dot product. Probes whether you reach for ANN reflexively or by argument.)