VQ-VAE, VQGAN, and FSQ
The tokenizer is the single most important component in any discrete generative pipeline. Three generations of designs, and what changed at each step.
The job
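Given a raw image $x \in \mathbb{R}^{3 \times H \times W}$, produce a small discrete spatial grid of token indices, plus a decoder that reverses the map. Formally we want an encoder $E: \mathbb{R}^{3 \times H \times W} \to \{1, \dots, K\}^{h \times w}$ and a decoder $D: \{1, \dots, K\}^{h \times w} \to \mathbb{R}^{3 \times H \times W}$ such that $D(E(x)) \approx x$.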
Typical dimensions for a modern tokenizer on 256×256 images: h = w = 16 (so 256 tokens per image), vocabulary K ∈ [4k, 64k], embedding dim per code d ∈ [4, 32]. Compression ratio, counting bits on both sides with 8-bit color channels and K = 16k (14 bits per token): (3 · 256 · 256 · 8) / (256 · log2(16k)) ≈ 439×.
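The ratio is easy to recompute for other grid sizes and vocabularies. A minimal sketch, assuming 8-bit RGB input and an ideal fixed-length code of log2(K) bits per token; the function name `compression_ratio` is made up for this example.

```python
import math

def compression_ratio(height, width, grid_h, grid_w, vocab_size, bits_per_channel=8):
    """Raw image bits divided by token-grid bits (assumes 3 color channels)."""
    raw_bits = 3 * height * width * bits_per_channel
    token_bits = grid_h * grid_w * math.log2(vocab_size)
    return raw_bits / token_bits

# 256x256 image, 16x16 token grid, K = 16,384 (14 bits per token).
ratio = compression_ratio(256, 256, 16, 16, vocab_size=16_384)
print(f"{ratio:.1f}x")
```

Doubling the vocabulary adds only one bit per token, so the ratio is far more sensitive to the grid size than to K.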
VQ-VAE (van den Oord et al. 2017) — the original
Three components:
- Encoder $E_\theta: x \to z_e \in \mathbb{R}^{d \times h \times w}$, a conv net producing a continuous feature map.
- Codebook $\{e_k\}_{k=1}^{K}$, a learned embedding table.
- Quantizer: for each spatial position, pick the codebook entry nearest the encoder feature: $z_q[i, j] = e_{k^*}$ where $k^* = \arg\min_k \lVert z_e[i, j] - e_k \rVert^2$.
The decoder $D_\theta$ consumes the quantized $z_q$ and outputs pixels.
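The nearest-neighbour lookup is only a few lines of array code. The sketch below uses NumPy with toy sizes (K = 512, d = 8, both assumptions for illustration) and random arrays standing in for a real encoder output and a trained codebook. Indices are 0-based here, whereas the text counts codes from 1.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbour quantizer.

    z_e:      (d, h, w) continuous encoder features
    codebook: (K, d) embedding table
    returns:  indices of shape (h, w) and z_q of shape (d, h, w)
    """
    d, h, w = z_e.shape
    flat = z_e.reshape(d, h * w).T                    # (h*w, d), one row per position
    # Squared distances via ||a||^2 - 2 a.b + ||b||^2, shape (h*w, K).
    dists = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = dists.argmin(axis=1)                    # (h*w,)
    z_q = codebook[indices].T.reshape(d, h, w)        # look up and restore layout
    return indices.reshape(h, w), z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))                  # stand-in for a learned table
z_e = rng.normal(size=(8, 16, 16))                    # stand-in for encoder output
indices, z_q = quantize(z_e, codebook)
print(indices.shape, z_q.shape)
```

The expanded distance formula avoids materializing an (h·w, K, d) tensor, which matters once K reaches the tens of thousands.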
The straight-through gradient — the trick that makes it train
The quantizer (argmin → discrete index) has zero gradient almost everywhere, so you can’t backpropagate through it. The straight-through estimator works around this by treating the quantizer as the identity in the backward pass:
In the forward pass $z_q = e_{k^*}$ (the decoder sees the codebook entry). In the backward pass $\partial z_q / \partial z_e = I$ (the gradient flows through unchanged). The resulting gradient is biased, but it works astonishingly well in practice.
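One common way to write both behaviours as a single expression uses a stop-gradient operator sg(·), which is the identity in the forward pass and contributes zero gradient in the backward pass:

```latex
z_q \;=\; z_e + \operatorname{sg}\!\left(e_{k^*} - z_e\right),
\qquad
\frac{\partial z_q}{\partial z_e} = I .
```

In the forward pass the two $z_e$ terms cancel and the value is $e_{k^*}$; in the backward pass the sg(·) term vanishes and only the identity path through $z_e$ remains. In autodiff frameworks this is typically a one-liner built on the framework's detach or stop-gradient primitive.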
Three loss terms:
- Reconstruction trains the encoder + decoder.
- Codebook trains the codebook entries to match the encoder outputs they’re standing in for.
- Commitment trains the encoder to commit to chosen codebook entries (otherwise it can drift far from any codebook entry, making the quantizer error arbitrarily large). β = 0.25 is the canonical value.
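Written out, and assuming a squared-error reconstruction term (the original paper states it as a log-likelihood), the three terms combine as:

```latex
\mathcal{L}
= \underbrace{\lVert x - D_\theta(z_q) \rVert^2}_{\text{reconstruction}}
+ \underbrace{\lVert \operatorname{sg}(z_e) - e_{k^*} \rVert^2}_{\text{codebook}}
+ \beta \, \underbrace{\lVert z_e - \operatorname{sg}(e_{k^*}) \rVert^2}_{\text{commitment}}
```

Here sg(·) is the stop-gradient operator and $e_{k^*}$ is the selected codebook entry at each spatial position; the codebook and commitment terms are summed (or averaged) over positions.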
Why the stop-gradient placement matters
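Here sg(·) is the stop-gradient operator: identity in the forward pass, zero gradient in the backward pass. The codebook loss $\lVert \operatorname{sg}(z_e) - e_{k^*} \rVert^2$ updates only the codebook, not the encoder. The commitment loss $\lVert z_e - \operatorname{sg}(e_{k^*}) \rVert^2$ updates only the encoder, not the codebook.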
If you drop the stop-gradients you don’t lose magnitude: the two losses are quadratic in the same difference, so they add to $(1 + \beta)\lVert z_e - e_{k^*} \rVert^2$. What you lose is the asymmetric routing: the encoder and codebook can no longer pursue their own objectives at separate rates. Both endpoints get pulled toward each other at the joint rate set by $(1 + \beta)$, and you can’t use β to make the encoder commit faster than the codebook moves (or vice versa). The stop-grads aren’t there to keep the loss alive; they’re there to keep the two halves of training tunable independently.
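The routing argument can be checked by differentiating the two auxiliary terms directly. Writing $\delta = z_e - e_{k^*}$ and ignoring the reconstruction gradient that reaches the encoder through the straight-through path:

```latex
\begin{aligned}
\text{with sg:}\quad
  &\frac{\partial \mathcal{L}_{\text{aux}}}{\partial e_{k^*}} = -2\,\delta,
  &&\frac{\partial \mathcal{L}_{\text{aux}}}{\partial z_e} = 2\beta\,\delta, \\[4pt]
\text{without sg:}\quad
  &\frac{\partial \mathcal{L}_{\text{aux}}}{\partial e_{k^*}} = -2(1+\beta)\,\delta,
  &&\frac{\partial \mathcal{L}_{\text{aux}}}{\partial z_e} = 2(1+\beta)\,\delta .
\end{aligned}
```

With the stop-gradients in place, β scales only the encoder's pull toward the code. Without them, the same factor $(1 + \beta)$ appears on both sides, so β no longer separates the two rates.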
The killer failure mode: codebook collapse
Train naively and what you get is “codebook collapse”: a small fraction of the K entries get used; most are dead (never the nearest neighbor to any encoder output). At convergence you might be using ~10% of a 16k codebook, effectively a 1.6k vocabulary. Two reasons:
- Initialization gap. If a codebook entry is far from any encoder output, it’s never selected, gets no gradient, stays where it is, never gets selected, etc. A self-reinforcing dead zone.
- EMA vs. SGD mismatch. Codebook entries should track the moving mean of the encoder outputs that selected them. SGD on the codebook loss does this badly; EMA (exponential moving average) does it much better.
Standard fixes:
| Fix | What it does | Used in |
|---|---|---|
| EMA codebook | update $e_k$ as moving mean of assigned $z_e$ | VQ-VAE2, VQGAN, almost everything since |
| Restart dead codes | periodically replace unused entries with encoder outputs sampled from the current batch | SoundStream, EnCodec, common audio recipes |
| Linear projection to/from codebook space | quantize in low dim (8 or even 4), decode to full width | improved-VQGAN, MaskGIT, MAGVIT |
| L2-normalize codes and inputs | quantize on the sphere; same dead-code issue, but easier to reset | improved-VQGAN, ViT-VQGAN |
| Add KL term to encourage uniform usage | regularize the empirical distribution of selected codes | some text-codec work |
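The first two rows of the table can be sketched together in NumPy. This is an illustrative version, not any particular library's implementation: the function names, the decay of 0.99, and the restart threshold are assumptions, and random 2D points stand in for encoder outputs. Two of the four codes are deliberately initialized far from the data so that they go dead.

```python
import numpy as np

def ema_codebook_step(codebook, cluster_size, embed_sum, z_e, indices, decay=0.99):
    """One EMA update, in place. z_e: (N, d) outputs, indices: (N,) assigned codes."""
    K = codebook.shape[0]
    counts = np.bincount(indices, minlength=K).astype(float)   # hits per code
    sums = np.zeros_like(codebook)
    np.add.at(sums, indices, z_e)                               # sum of assigned outputs
    cluster_size[:] = decay * cluster_size + (1 - decay) * counts
    embed_sum[:] = decay * embed_sum + (1 - decay) * sums
    # Each code becomes the running mean of the outputs that selected it.
    codebook[:] = embed_sum / np.maximum(cluster_size, 1e-12)[:, None]

def restart_dead_codes(codebook, cluster_size, embed_sum, z_e, rng, threshold=1.0):
    """Re-seed codes whose running hit count fell below `threshold` (hypothetical rule)."""
    dead = np.flatnonzero(cluster_size < threshold)
    if dead.size:
        picks = rng.choice(len(z_e), size=dead.size)
        codebook[dead] = z_e[picks]
        embed_sum[dead] = z_e[picks]
        cluster_size[dead] = 1.0
    return dead

rng = np.random.default_rng(0)
z_e = rng.normal(size=(256, 2))                                 # toy encoder outputs
codebook = np.array([[0.0, 0.0], [0.5, 0.5], [50.0, 50.0], [-50.0, 60.0]])
cluster_size = np.ones(len(codebook))
embed_sum = codebook.copy()

for _ in range(200):
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ema_codebook_step(codebook, cluster_size, embed_sum, z_e, dists.argmin(axis=1))

dead = restart_dead_codes(codebook, cluster_size, embed_sum, z_e, rng)
print("restarted codes:", dead.tolist())
```

The far-away codes never win the argmin, so their running counts decay toward zero while their positions stay frozen, which is exactly the self-reinforcing dead zone described above. The restart step moves them back onto the data.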
VQGAN (Esser et al. 2020) — sharper reconstructions
VQ-VAE with L2 reconstruction produces blurry samples (same blur as a vanilla autoencoder). VQGAN swaps in perceptual + adversarial losses:
- Perceptual loss: L2 in the feature space of a frozen VGG/CLIP rather than in pixel space. Encourages high-frequency texture.
- Adversarial loss: a small patch-discriminator trained against the decoder, weighted by an adaptive λ that balances against the perceptual loss.
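As recalled from the VQGAN paper, the adaptive weight is a ratio of gradient magnitudes taken at the last decoder layer $G_L$, with a small δ for numerical stability (the paper uses $10^{-6}$). Treat the norm notation below as one common reading of that definition:

```latex
\lambda \;=\;
\frac{\lVert \nabla_{G_L} \mathcal{L}_{\text{rec}} \rVert}
     {\lVert \nabla_{G_L} \mathcal{L}_{\text{GAN}} \rVert + \delta}
```

Here $\mathcal{L}_{\text{rec}}$ is the perceptual reconstruction loss. When the adversarial gradient is small relative to the reconstruction gradient, λ grows and the GAN term gets more say; when it dominates, λ shrinks.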
Result: at the same bottleneck (e.g. 16×16 tokens for a 256×256 image), VQGAN reconstructions are visually crisp where VQ-VAE’s were soft. This is what made the “tokenize images then model with a transformer” pipeline practical.
FSQ (Mentzer et al. 2023) — the simplification
Finite Scalar Quantization throws out the learned codebook entirely. Instead:
- Force the encoder’s output to be a low-dimensional vector $z_e \in \mathbb{R}^d$, with d tiny (e.g. 5–8).
- Apply tanh to each entry to bound it in [−1, 1].
- Quantize each entry independently to one of L levels (e.g. L ∈ {5, 7, 8} per dim).
The “codebook” is then implicit: the product grid of all per-dim levels, of size $L^d$. With d = 6, L = 8 that’s $8^6 = 262{,}144$ (262k) codes, well above the 4k–64k range of typical learned codebooks, and with no codebook collapse possible, because the grid is uniform by construction.
Gradients pass through the rounding step via the straight-through estimator. That’s the entire quantizer: no EMA, no restart, no codebook loss. Empirically, FSQ matches VQ-VAE reconstruction quality and often beats it for downstream modeling.
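The three steps fit in a short NumPy function. This sketch uses a simplified level mapping (rescale tanh output to [0, L−1] and round), which works for any L ≥ 2; the FSQ paper's exact formulation treats even L with a half-step offset, so take this as an illustration of the idea rather than a reproduction. The name `fsq_quantize` is made up, and NumPy has no autograd, so the straight-through gradient is not shown.

```python
import numpy as np

def fsq_quantize(z, levels):
    """z: (..., d) raw encoder outputs; levels: d ints, each >= 2.

    Returns per-dim integer codes in [0, L_i - 1], one flat token index
    per vector, and the quantized values mapped back into [-1, 1].
    """
    levels = np.asarray(levels)
    bounded = np.tanh(z)                                    # each entry in (-1, 1)
    # Rescale (-1, 1) to (0, L-1) and snap to the nearest integer level.
    codes = np.round((bounded + 1.0) / 2.0 * (levels - 1)).astype(np.int64)
    # Mixed-radix index: dimension i has place value prod(levels[:i]).
    place = np.concatenate(([1], np.cumprod(levels[:-1])))
    index = (codes * place).sum(axis=-1)
    z_q = 2.0 * codes / (levels - 1) - 1.0                  # what the decoder would see
    return codes, index, z_q

levels = [8] * 6                                            # d = 6, L = 8
vocab = int(np.prod(levels))                                # implicit codebook size
rng = np.random.default_rng(0)
codes, index, z_q = fsq_quantize(rng.normal(size=(16, 16, 6)), levels)
print(vocab, index.shape)
```

There is no table to learn or look up: the token index is computed directly from the rounded coordinates, and every index in [0, vocab) corresponds to a valid grid point.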
| Method | Codebook | Collapse risk | Effective vocab | Auxiliary losses |
|---|---|---|---|---|
| VQ-VAE | learned table, K entries | high (use ~10% of K) | K (e.g. 16k) | commit + codebook |
| VQ-VAE + EMA | learned table, EMA updates | medium | K (e.g. 16k, ~50% used) | commit only |
| VQGAN | EMA learned table | medium | K | perceptual + GAN |
| FSQ | none (implicit grid) | impossible by construction | $L^d$ (e.g. 262k) | none |
Interactive · 2D codebook trainer
Below: a tiny 2D “data” cloud (the two moons again), a learned codebook of K 2D points, and a live training loop that uses the VQ-VAE update rule (EMA or SGD, your choice). Watch the codebook entries crawl onto the data manifold. Try a small K and observe collapse; try the FSQ option to see what the uniform-grid alternative looks like.
What goes wrong without the codebook
You could just stop here — train a continuous autoencoder, no quantizer. Why not?
- You can’t use a standard next-token transformer. An autoregressive transformer with a softmax output head needs discrete tokens. With continuous latents, you have to train a continuous generative model in latent space (lesson 15 covers this hybrid; SD3 / Imagen 3 do exactly this). It works, but you’ve given up the unified-token-with-text move.
- You can’t edit at the token level. A discrete vocabulary lets you swap one token for another (“replace token 472 with token 891”), which is what enables MaskGIT, inpainting, and editing-style operations.
- You can’t interleave with language tokens. Same reason — an LLM has one big embedding table; image tokens just get more rows in it. With continuous latents you need a separate adapter to glue text to image.
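The "more rows in the embedding table" point usually comes down to an index offset. A minimal sketch in plain Python, where the vocabulary sizes and the token ids are made-up values for illustration:

```python
TEXT_VOCAB_SIZE = 32_000      # hypothetical LM vocabulary size
IMAGE_VOCAB_SIZE = 16_384     # hypothetical image codebook size K

def image_to_lm_ids(image_tokens):
    """Shift image token indices past the text range so both share one table."""
    return [TEXT_VOCAB_SIZE + t for t in image_tokens]

def is_image_id(token_id):
    """True if the id falls in the rows reserved for image tokens."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE

text_ids = [17, 942, 5]                      # made-up text token ids
image_ids = image_to_lm_ids([472, 891])      # image tokens from the tokenizer
sequence = text_ids + image_ids              # one interleaved sequence
print(sequence)
```

The model then needs an embedding table with TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE rows and nothing else changes in the architecture.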
Practical recipes (2026)
| Use case | Tokenizer | Why |
|---|---|---|
| Image-only generation, max quality | VQGAN + EMA, K = 16k | standard; the “Parti tokenizer” recipe |
| Native multimodal (text + image) | FSQ with d = 6–8, L = 5–8 | no collapse, big vocab, plays nicely with LM tokenizer |
| Video / long sequences | MAGVIT-v2 (causal 3D VQGAN with LFQ: lookup-free quantization, a close cousin of FSQ that uses {−1, +1} levels per dim with an implicit codebook of $2^d$) | temporal causality preserved; LFQ’s {−1, +1} binarization scales to very large implicit vocabularies (≥ $2^{18}$) |
| Audio | RVQ (residual VQ, ~8 layers) on EnCodec / SoundStream | captures multi-scale structure via residual chain |
| Speech (small vocab) | EnCodec or DAC, RVQ with restart | aggressive bitrate compression; restart prevents quiet-codebook collapse |
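The LFQ row is easy to make concrete: the quantizer is a sign function per dimension, and the token index is the resulting bit pattern read as an integer. The sketch below is a minimal NumPy illustration with d = 3; the function name and the choice to map exactly-zero inputs to −1 are assumptions, and the training-time losses used in MAGVIT-v2 are not shown.

```python
import numpy as np

def lfq_quantize(z):
    """z: (..., d) -> z_q with entries in {-1, +1} and an index in [0, 2**d)."""
    bits = (z > 0).astype(np.int64)              # 1 where positive, else 0
    z_q = 2.0 * bits - 1.0                       # map {0, 1} to {-1, +1}
    place = 2 ** np.arange(z.shape[-1])          # bit i has place value 2**i
    index = (bits * place).sum(axis=-1)
    return z_q, index

z = np.array([[0.3, -1.2, 0.7],
              [-0.1, -0.4, 2.0]])
z_q, index = lfq_quantize(z)
print(index.tolist())   # [5, 4]
```

With d = 18 the same function yields indices in [0, 262144), which is how the vocabulary grows without any table to store or search.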