
VQ-VAE, VQGAN, and FSQ

The tokenizer is the single most important component in any discrete generative pipeline. Three generations of designs, and what changed at each step.

The job

Given a raw image x ∈ ℝ^(3×H×W), produce a small discrete spatial grid of token indices, plus a decoder that reverses the map. Formally we want an encoder E: ℝ^(3×H×W) → {1, …, K}^(h×w) and a decoder D: {1, …, K}^(h×w) → ℝ^(3×H×W) such that D(E(x)) ≈ x.

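Typical dimensions for a modern tokenizer on 256×256 images: h = w = 16 (so 256 tokens per image), vocabulary K ∈ [4k, 64k], embedding dim per code d ∈ [4, 32]. Compression ratio for K = 16k: 3 · 256 · 256 input values against 256 · log2(16k) ≈ 3.6k token bits, i.e. about 55 input values per token bit, or roughly 440× in bits at 8 bits per channel.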
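A quick way to check the compression arithmetic is to count bits on both sides. The sketch below assumes that "16k" means 2^14 = 16,384 codes and that the raw image uses 8 bits per channel; both are assumptions, and `compression_ratio` is a hypothetical helper, not part of any library.

```python
import math

def compression_ratio(height, width, grid_h, grid_w, vocab_size,
                      channels=3, bits_per_channel=8):
    """Ratio of raw image bits to the bits needed to store the token grid."""
    raw_bits = channels * height * width * bits_per_channel
    token_bits = grid_h * grid_w * math.log2(vocab_size)
    return raw_bits / token_bits

# Assumed setting: 256x256 RGB image, 16x16 tokens, K = 2**14.
ratio = compression_ratio(256, 256, 16, 16, vocab_size=2**14)
values_per_bit = ratio / 8  # input values (not bits) per token bit
print(f"{ratio:.0f}x in bits, {values_per_bit:.1f} input values per token bit")
```

The ratio moves slowly with K because the vocabulary only enters through log2(K); the grid size h × w dominates.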

VQ-VAE (van den Oord et al. 2017) — the original

Three components:

  1. Encoder E_θ: x → z_e ∈ ℝ^(d×h×w), a conv net producing a continuous feature map.
  2. Codebook {e_k}, k = 1..K: a learned embedding table.
  3. Quantizer: for each spatial position, pick the codebook entry nearest the encoder feature: z_q[i, j] = e_k* where k* = argmin_k ‖z_e[i, j] − e_k‖².

The decoder D_θ consumes the quantized z_q and outputs pixels.
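The quantizer step can be sketched in a few lines of plain Python. This is a minimal illustration on a hand-written 2×2 feature map with a toy codebook (K = 4, d = 2); the names `quantize` and `quantize_grid` are hypothetical, and indices are 0-based here rather than 1..K.

```python
def quantize(z_e, codebook):
    """Return (k*, e_k*) for the codebook entry nearest to the feature z_e."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    k_star = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return k_star, codebook[k_star]

def quantize_grid(feature_map, codebook):
    """Map an h x w grid of d-dim features to an h x w grid of token indices."""
    return [[quantize(z, codebook)[0] for z in row] for row in feature_map]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K = 4, d = 2
feature_map = [[[0.1, -0.2], [0.9, 0.2]],
               [[0.2, 0.8], [0.7, 0.9]]]                       # h = w = 2
tokens = quantize_grid(feature_map, codebook)
print(tokens)  # [[0, 1], [2, 3]]
```

A real tokenizer would do the same nearest-neighbour lookup batched on an accelerator; the logic per spatial position is unchanged.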

The straight-through gradient — the trick that makes it train

The quantizer (argmin → discrete index) has gradient zero almost everywhere, so you can’t backpropagate through it. The straight-through estimator fixes this by pretending the quantizer was the identity for the backward pass:

z_q = z_e + stop_grad( e_k* − z_e )

In the forward pass z_q = e_k* (we use the codebook entry). In the backward pass ∂z_q/∂z_e = I (the gradient flows through unchanged). The resulting gradient estimate is biased, but it works astonishingly well in practice.
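To make the forward/backward asymmetry concrete, here is a hand-derived scalar sketch with no autodiff library. It assumes a toy linear "decoder" D(z) = w · z and a squared-error loss; those choices, and the function names, are illustrative assumptions only.

```python
def ste_quantize(z_e, codebook):
    """Forward value of z_q = z_e + stop_grad(e_k* - z_e); numerically equals e_k*."""
    e_star = min(codebook, key=lambda e: (z_e - e) ** 2)
    return z_e + (e_star - z_e)  # the bracket is treated as a constant in backward

def loss_and_encoder_grads(x, z_e, codebook, w):
    """Loss (x - w * z_q)^2 and two candidate gradients w.r.t. z_e."""
    z_q = ste_quantize(z_e, codebook)
    loss = (x - w * z_q) ** 2
    dloss_dzq = -2.0 * w * (x - w * z_q)
    grad_ste = dloss_dzq * 1.0   # straight-through: pretend dz_q/dz_e = 1
    grad_true = dloss_dzq * 0.0  # exact derivative of a piecewise-constant map
    return loss, grad_ste, grad_true

loss, grad_ste, grad_true = loss_and_encoder_grads(
    x=2.0, z_e=0.3, codebook=[0.0, 0.5, 1.0], w=2.0)
print(loss, grad_ste, grad_true)
```

The exact gradient gives the encoder nothing to learn from, while the straight-through version hands it the decoder's gradient at z_q unchanged. In an autodiff framework the same effect usually comes from a stop-gradient or detach call around the bracketed term.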

Three loss terms:

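L = ‖x − D(z_q)‖² + ‖sg(z_e) − e_k*‖² + β · ‖z_e − sg(e_k*)‖²

Here sg(·) is the stop-gradient operator (stop_grad above). The terms are, in order, the reconstruction loss, the codebook loss, and the commitment loss weighted by β.

Why the stop-gradient placement matters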

The codebook loss ‖sg(z_e) − e_k*‖² updates only the codebook, not the encoder. The commitment loss ‖z_e − sg(e_k*)‖² updates only the encoder, not the codebook.

If you drop the stop-gradients you don’t lose magnitude: the two losses are quadratic in the same difference, so they add to (1 + β)‖z_e − e_k*‖². What you lose is the asymmetric routing: the encoder and codebook can no longer pursue their own objectives at separate rates. Both endpoints get pulled toward each other at the joint rate set by (1 + β), and you can’t use β to make the encoder commit faster than the codebook moves (or vice versa). The stop-grads aren’t there to keep the loss alive; they’re there to keep the two halves of training tunable independently.
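The routing argument can be checked by writing out the gradients of the two auxiliary terms by hand. The sketch below is an analytic calculation for a single (z_e, e_k*) pair, not an autodiff trace; `vq_aux_grads` is a hypothetical helper and β = 0.25 is just an illustrative value.

```python
def vq_aux_grads(z_e, e_k, beta, use_stop_grad=True):
    """Gradients of codebook + commitment terms w.r.t. (z_e, e_k), derived by hand."""
    diff = [z - e for z, e in zip(z_e, e_k)]
    if use_stop_grad:
        # ||sg(z_e) - e_k||^2 sends gradient only to the codebook entry.
        grad_e = [-2.0 * d for d in diff]
        # beta * ||z_e - sg(e_k)||^2 sends gradient only to the encoder output.
        grad_z = [2.0 * beta * d for d in diff]
    else:
        # Without sg the terms merge into (1 + beta) * ||z_e - e_k||^2.
        grad_z = [2.0 * (1.0 + beta) * d for d in diff]
        grad_e = [-2.0 * (1.0 + beta) * d for d in diff]
    return grad_z, grad_e

z_e, e_k, beta = [1.0, 0.0], [0.0, 0.0], 0.25
with_sg = vq_aux_grads(z_e, e_k, beta, use_stop_grad=True)
without_sg = vq_aux_grads(z_e, e_k, beta, use_stop_grad=False)
print("with sg:   ", with_sg)
print("without sg:", without_sg)
```

With the stop-gradients the encoder is pulled at rate 2β and the codebook at rate 2; without them both sides see the same 2(1 + β) factor, so β no longer separates the two rates.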

The killer failure mode: codebook collapse

Train naively and what you get is “codebook collapse”: a small fraction of the K entries get used; most are dead (never the nearest neighbor to any encoder output). At convergence you might be using ~10% of a 16k codebook, effectively a 1.6k vocabulary. Two reasons:

  1. Initialization gap. If a codebook entry is far from any encoder output, it’s never selected, gets no gradient, stays where it is, never gets selected, etc. A self-reinforcing dead zone.
  2. EMA vs. SGD mismatch. Codebook entries should track the moving mean of the encoder outputs that selected them. SGD on the codebook loss does this badly; EMA (exponential moving average) does it much better.
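The initialization gap is easy to reproduce in a toy setting. The sketch below holds the "encoder outputs" fixed (a bounded 2D cloud, which is a simplifying assumption) and runs plain SGD on the codebook loss; entries that start far from the data are never selected and never move. All names and constants are illustrative.

```python
import random

def nearest(z, codebook):
    """Index of the codebook entry closest to z (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))

def train_codebook_sgd(data, codebook, epochs=20, lr=0.05):
    """SGD on ||z - e_k*||^2; only the selected entry receives a gradient."""
    counts = [0] * len(codebook)
    for _ in range(epochs):
        for z in data:
            k = nearest(z, codebook)
            counts[k] += 1
            # d/de ||z - e||^2 = -2 (z - e), so descent moves e toward z.
            codebook[k] = [e + lr * 2.0 * (zi - e) for e, zi in zip(codebook[k], z)]
    return counts

rng = random.Random(0)  # fixed seed, illustrative only
data = [[rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)] for _ in range(50)]
# Two entries start inside the data cloud, six start far away from it.
far_init = [[5.0 + i, 5.0] for i in range(6)]
codebook = [[0.1, 0.0], [-0.1, 0.0]] + [list(e) for e in far_init]
counts = train_codebook_sgd(data, codebook)
active = sum(c > 0 for c in counts)
print(f"active codes: {active} / {len(codebook)}")
print("far entries untouched:", codebook[2:] == far_init)
```

Counting `active` (entries with at least one assignment) over a held-out batch is a cheap collapse monitor in real training runs as well.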

Standard fixes:

| Fix | What it does | Used in |
| --- | --- | --- |
| EMA codebook | update e_k as moving mean of assigned z_e | VQ-VAE-2, VQGAN, almost everything since |
| Restart dead codes | periodically replace unused entries with high-variance encoder outputs | SoundStream, EnCodec, common audio recipes |
| Linear projection to/from codebook space | quantize in low dim (8 or even 4), decode to full width | improved-VQGAN, MaskGIT, MAGVIT |
| L2-normalize codes and inputs | quantize on the sphere; same dead-code issue, but easier to reset | improved-VQGAN, ViT-VQGAN |
| Add KL term to encourage uniform usage | regularize the empirical distribution of selected codes | some text-codec work |
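The first two fixes can be sketched together. This is a minimal version under stated assumptions: the EMA keeps a running count and a running sum per entry and sets each entry to their ratio, and the restart step re-seeds from a random batch element (a simpler stand-in for the "high-variance encoder outputs" heuristic in the table). Function names, the decay, and the threshold are all hypothetical.

```python
import random

def ema_update(codebook, cluster_size, embed_sum, batch, assignments, decay=0.99):
    """One EMA step: each entry tracks the moving mean of the z_e assigned to it."""
    dim = len(codebook[0])
    for k in range(len(codebook)):
        assigned = [z for z, a in zip(batch, assignments) if a == k]
        batch_sum = [sum(z[j] for z in assigned) for j in range(dim)]
        cluster_size[k] = decay * cluster_size[k] + (1 - decay) * len(assigned)
        embed_sum[k] = [decay * m + (1 - decay) * s
                        for m, s in zip(embed_sum[k], batch_sum)]
        codebook[k] = [m / max(cluster_size[k], 1e-8) for m in embed_sum[k]]

def restart_dead_codes(codebook, cluster_size, embed_sum, batch, threshold, rng):
    """Re-seed entries whose EMA usage fell below `threshold` from the batch."""
    restarted = []
    for k in range(len(codebook)):
        if cluster_size[k] < threshold:
            codebook[k] = list(rng.choice(batch))
            embed_sum[k] = list(codebook[k])
            cluster_size[k] = 1.0
            restarted.append(k)
    return restarted

codebook = [[0.0], [10.0]]               # entry 1 sits far from the data
cluster_size = [1.0, 1.0]
embed_sum = [list(e) for e in codebook]
batch = [[1.0], [1.0]]                   # both samples are assigned to entry 0
ema_update(codebook, cluster_size, embed_sum, batch, assignments=[0, 0], decay=0.5)
print(codebook)
restarted = restart_dead_codes(codebook, cluster_size, embed_sum, batch,
                               threshold=0.6, rng=random.Random(0))
print(restarted, codebook[1])
```

Note that an unused entry does not move under the EMA (its count and sum decay together), which is why the restart step is still useful alongside it.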

VQGAN (Esser et al. 2020) — sharper reconstructions

VQ-VAE with L2 reconstruction produces blurry samples (same blur as a vanilla autoencoder). VQGAN swaps in perceptual + adversarial losses:

L_VQGAN = L_recon-perceptual + β · L_commit + λ · L_GAN

Result: at the same bottleneck (e.g. 16×16 tokens for a 256×256 image), VQGAN reconstructions are visually crisp where VQ-VAE’s were soft. This is what made the “tokenize images then model with a transformer” pipeline practical.

FSQ (Mentzer et al. 2023) — the simplification

Finite Scalar Quantization throws out the learned codebook entirely. Instead:

  1. Force the encoder’s output to be a low-dimensional vector z_e ∈ ℝ^d, with d tiny (e.g. 5–8).
  2. Apply tanh to each entry to bound it in [−1, 1].
  3. Quantize each entry independently to one of L levels (e.g. L ∈ {5, 7, 8} per dim).

The “codebook” is then implicit: the product grid of all per-dim levels, of size L^d. With d = 6, L = 8 that’s 8^6 ≈ 262k codes, far more than the 4k to 64k entries of a typical learned codebook, and with no codebook collapse possible, because the grid is uniform by construction.

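z_q[i] = round( z_e[i] · (L−1)/2 ) · 2/(L−1)

Here z_e[i] is the tanh-bounded entry from step 2. As written, the formula yields exactly L levels only for odd L; for even L the rounding needs a half-step offset to land on L distinct values.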

Gradients pass through the round via the straight-through estimator. That’s the entire quantizer: no EMA, no restart, no codebook loss. Empirically it matches VQ-VAE reconstruction quality and often beats it for downstream modeling.
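The three FSQ steps fit in one short function. The sketch below covers the forward pass only, restricts itself to an odd number of levels (where the rounding formula gives exactly L values), and packs the per-dim levels into a single token index in base L. The name `fsq_quantize` and the digit ordering are assumptions for illustration.

```python
import math

def fsq_quantize(z_e, levels):
    """Bound with tanh, round each entry to one of `levels` values in [-1, 1],
    and return (z_q, index into the implicit codebook of size levels ** d)."""
    assert levels % 2 == 1, "this sketch handles odd level counts only"
    half = (levels - 1) / 2
    z_q, digits = [], []
    for z in z_e:
        step = round(math.tanh(z) * half)   # integer in {-half, ..., +half}
        z_q.append(step / half)
        digits.append(int(step + half))     # shifted to {0, ..., levels - 1}
    index = 0
    for digit in digits:                    # base-`levels` positional code
        index = index * levels + digit
    return z_q, index

z_q, index = fsq_quantize([0.0, 2.0, -0.3], levels=5)
print(z_q, index, "of", 5 ** 3)
print(8 ** 6)  # implicit codebook size for d = 6, L = 8
```

Training would wrap the `round` in a straight-through estimator, as for VQ-VAE; nothing else in the quantizer has parameters.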

| Method | Codebook | Collapse risk | Effective vocab | Auxiliary losses |
| --- | --- | --- | --- | --- |
| VQ-VAE | learned table, K entries | high (use ~10% of K) | K (e.g. 16k) | commit + codebook |
| VQ-VAE + EMA | learned table, EMA updates | medium | K (e.g. 16k, ~50% used) | commit only |
| VQGAN | EMA learned table | medium | K, with perceptual+GAN losses on top | perceptual + GAN |
| FSQ | none (implicit grid) | impossible by construction | L^d (e.g. 262k) | none |

Interactive · 2D codebook trainer

Below: a tiny 2D “data” cloud (the two moons again), a learned codebook of K 2D points, and a live training loop that uses the VQ-VAE update rule (EMA or SGD, your choice). Watch the codebook entries crawl onto the data manifold. Try a small K and observe collapse; try the FSQ option to see what the uniform-grid alternative looks like.

[Interactive demo: 2D codebook training, VQ vs. FSQ. Green = data points, orange = codebook entries; lines show which entry each data point is assigned to. Readouts: step, active codes (used > 0), recon MSE, effective vocab.]

Try SGD with K=4 then K=32: you’ll see collapse, where a few codes carry all the assignments and the rest sit unused. EMA fixes most of it. FSQ trivially has all L² codes of its 2D grid available whether they’re “needed” or not.

What goes wrong without the codebook

You could just stop here — train a continuous autoencoder, no quantizer. Why not?

Practical recipes (2026)

| Use case | Tokenizer | Why |
| --- | --- | --- |
| Image-only generation, max quality | VQGAN + EMA, K = 16k | standard; the “Parti tokenizer” recipe |
| Native multimodal (text + image) | FSQ with d = 6–8, L = 5–8 | no collapse, big vocab, plays nicely with LM tokenizer |
| Video / long sequences | MAGVIT-v2 (causal 3D VQGAN with LFQ, lookup-free quantization: a close cousin of FSQ that uses {−1, +1} levels per dim with an implicit codebook of 2^d) | temporal causality preserved; LFQ’s {−1, +1} binarization scales to very large implicit vocabularies (≥ 2^18) |
| Audio | RVQ (residual VQ, ~8 layers) on EnCodec / SoundStream | captures multi-scale structure via residual chain |
| Speech (small vocab) | EnCodec or DAC, RVQ with restart | aggressive bitrate compression; restart prevents quiet-codebook collapse |
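The LFQ entry in the table is simple enough to sketch directly: each latent dimension is binarized by its sign, and the signs are read as the bits of the token index. This is a forward-pass illustration only, with a hypothetical function name; treating zero as −1 and using dimension i as bit i are assumptions of this sketch.

```python
def lfq_quantize(z_e):
    """Sign-quantize each entry to {-1, +1} and pack the signs into an index."""
    z_q = [1.0 if z > 0 else -1.0 for z in z_e]
    # Dimension i contributes bit i, so the implicit codebook has 2 ** d entries.
    index = sum(1 << i for i, q in enumerate(z_q) if q > 0)
    return z_q, index

z_q, index = lfq_quantize([0.3, -1.2, 0.8, -0.1])
print(z_q, index, "of", 2 ** 4)
print(2 ** 18)  # implicit vocabulary size for d = 18
```

Because there is no table to look up, growing the vocabulary only means adding latent dimensions.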

Codebook size is not the whole story

A larger vocabulary doesn’t automatically buy you more capacity: if your tokenizer is information-bottlenecked elsewhere (e.g. spatial dim is too small), extra codes just give you more ways to spell the same thing. Reconstruction quality at fixed bottleneck dim, not vocab size, is what matters.
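One way to see why vocabulary size is a weak lever is to count the maximum number of bits a token grid can carry, h · w · log2(K). The numbers below are only an upper bound on capacity under assumed settings (a 16×16 grid and K = 2^14 as the baseline); they say nothing about how much of that budget a trained tokenizer actually uses.

```python
import math

def max_bits_per_image(grid_h, grid_w, vocab_size):
    """Upper bound on information in a token grid: h * w * log2(K) bits."""
    return grid_h * grid_w * math.log2(vocab_size)

base = max_bits_per_image(16, 16, 2 ** 14)
bigger_vocab = max_bits_per_image(16, 16, 2 ** 16)   # 4x the codes
bigger_grid = max_bits_per_image(32, 32, 2 ** 14)    # 4x the tokens
print(base, bigger_vocab, bigger_grid)
print(f"4x vocab: +{100 * (bigger_vocab / base - 1):.0f}% bits")
print(f"4x tokens: +{100 * (bigger_grid / base - 1):.0f}% bits")
```

Quadrupling the codebook adds two bits per token, while quadrupling the number of tokens quadruples the budget.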

Punchline

A VQ tokenizer is a learned dictionary that turns images into short discrete sequences. The straight-through gradient lets you train through the discrete bottleneck; EMA on the codebook mitigates collapse; FSQ replaces the whole codebook with a fixed grid and removes the collapse failure mode entirely. The downstream generative model then operates on these tokens. That’s the entire stage-1 of every modern text-to-image transformer.