VQ-VAE, VQGAN, and FSQ

The tokenizer is the single most important component in any discrete generative pipeline. Three generations of designs, and what changed at each step.

The job

Given a raw image x ∈ ℝ^3×H×W, produce a small discrete spatial grid of token indices, plus a decoder that reverses the map. Formally we want an encoder E: ℝ^3×H×W → {1, …, K}^h×w and a decoder D: {1, …, K}^h×w → ℝ^3×H×W such that D(E(x)) ≈ x.

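Typical dimensions for a modern tokenizer on 256×256 images: h = w = 16 (so 256 tokens per image), vocabulary K ∈ [4k, 64k], embedding dim per code d ∈ [4, 32]. Compression ratio in bits, counting 8 bits per channel value and log₂ K bits per token with K = 16k: (3 · 256 · 256 · 8) / (256 · log₂(16k)) = 1,572,864 / 3,584 ≈ 439×.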

VQ-VAE (van den Oord et al. 2017) — the original

Three components:

Encoder E_θ: x → z_e ∈ ℝ^d×h×w — a conv net producing a continuous feature map.
Codebook {e_k}_k=1..K — a learned embedding table.
Quantizer: for each spatial position, pick the codebook entry nearest the encoder feature: z_q[i, j] = e_k* where k* = argmin_k ‖z_e[i, j] − e_k‖².

The decoder D_θ consumes the quantized z_q and outputs pixels.
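The nearest-neighbor lookup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `quantize`, the shapes, and the random inputs are assumptions made for the example, and indices run 0..K−1 rather than 1..K.

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each spatial feature to its nearest codebook entry.

    z_e:      (d, h, w) continuous encoder output
    codebook: (K, d) embedding table
    returns:  indices of shape (h, w) and z_q of shape (d, h, w)
    """
    d, h, w = z_e.shape
    flat = z_e.reshape(d, h * w).T                       # (h*w, d), one row per position
    # Squared distance from every feature to every code: (h*w, K)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)                           # k* for each position
    z_q = codebook[idx].T.reshape(d, h, w)               # look up e_k*
    return idx.reshape(h, w), z_q

rng = np.random.default_rng(0)
z_e = rng.normal(size=(4, 16, 16))       # hypothetical d=4, h=w=16
codebook = rng.normal(size=(512, 4))     # hypothetical K=512
indices, z_q = quantize(z_e, codebook)
print(indices.shape, z_q.shape)          # (16, 16) (4, 16, 16)
```

The grid `indices` is what a downstream model would consume; `z_q` is what the decoder sees.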

The straight-through gradient — the trick that makes it train

The quantizer (argmin → discrete index) has zero gradient almost everywhere, so you can’t backpropagate through it. The straight-through estimator fixes this by pretending the quantizer was the identity on the backward pass:

z_q = z_e + stop_grad( e_k* − z_e )

Read the formula as two faces of the same line. Forward, the stop_grad wrapper is invisible to the value, so the two z_e terms cancel and you are left with z_q = e_k* — the decoder sees the real, snapped codebook entry. Backward, the wrapper has zero derivative, so the only thing the gradient can flow through is the bare z_e out front: ∂z_q/∂z_e = I. So the snap is solid on the way in but transparent on the way out — the encoder gets told “the decoder’s complaint about z_q is your complaint about z_e,” as if no rounding had happened. This is biased — we are routing a gradient around a step it doesn’t actually pass through — but as long as e_k* sits close to z_e the lie is small, and it works astonishingly well.
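To make the "solid in, transparent out" picture concrete, here is a small NumPy sketch with hand-written gradients. It assumes a toy linear decoder `W` and random inputs (all hypothetical), compares the true finite-difference gradient through the snap with the straight-through substitute, and is not tied to any particular framework. In an autodiff library the same effect would come from the one-line stop-gradient formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))   # K=8 codes in d=2 (toy sizes)
W = rng.normal(size=(3, 2))          # toy linear "decoder"
x = rng.normal(size=3)               # reconstruction target
z_e = rng.normal(size=2)             # one encoder output

def snap(z):
    """Return the codebook entry nearest to z."""
    k = ((codebook - z) ** 2).sum(axis=1).argmin()
    return codebook[k]

def loss_from_z_e(z):
    """Reconstruction loss as a function of the pre-quantization vector."""
    return ((W @ snap(z) - x) ** 2).sum()

# True gradient w.r.t. z_e by central differences. The snap is piecewise
# constant, so this is zero unless z_e sits exactly on a cell boundary.
eps = 1e-6
basis = np.eye(2)
true_grad = np.array([
    (loss_from_z_e(z_e + eps * basis[i]) - loss_from_z_e(z_e - eps * basis[i])) / (2 * eps)
    for i in range(2)
])

# Straight-through: compute dL/dz_q at the snapped value, then hand it
# to z_e unchanged, as if dz_q/dz_e were the identity.
z_q = snap(z_e)
grad_z_q = 2 * W.T @ (W @ z_q - x)
ste_grad_z_e = grad_z_q

print("true gradient:", true_grad)
print("straight-through gradient:", ste_grad_z_e)
```

The first line printed is the "wall" (no signal); the second is the signal the encoder actually receives under the straight-through rule.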

Intuition · linear unpacking

Claim: the straight-through estimator is legitimate because the snap moves the vector only a little, so pretending it didn’t happen is a small, controllable lie.

The wall. The quantizer is argmin — pick the nearest code. Nudge the encoder a hair and the chosen index almost always stays the same, so the output doesn’t budge; cross a boundary and it jumps. Flat with occasional cliffs means the true gradient is zero almost everywhere. Backprop through it learns nothing.
The dodge. We need some signal to reach the encoder. So on the backward pass we pretend the snap was the identity map and copy the decoder’s gradient straight onto z_e, unchanged.
Why that’s allowed. The snap only ever moves z_e to its nearest code, and the commitment term in the loss keeps the encoder output near that code, so e_k* and z_e stay close. “Improve e_k*” and “improve z_e” therefore point in nearly the same direction: substituting one gradient for the other is a small error, not a wrong one.
The bookkeeping trick. Writing z_q = z_e + sg(e_k* − z_e) is just a way to get the autodiff engine to do this for free: the value equals e_k*, the derivative equals that of z_e. One line buys both behaviours.

Central point. A hard, non-differentiable lookup is made trainable by sending the gradient around it as if it were a wire — honest enough precisely because the codebook entry it skipped over was the closest one available.

Three loss terms:

L = ‖x − D(z_q)‖² + ‖sg(z_e) − e_k*‖² + β · ‖z_e − sg(e_k*)‖²

Reconstruction trains the encoder + decoder.
Codebook trains the codebook entries to match the encoder outputs they’re standing in for.
Commitment trains the encoder to commit to chosen codebook entries (otherwise it can drift far from any codebook entry, making the quantizer error arbitrarily large). β = 0.25 is the canonical value.
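As a rough sketch of how the three terms evaluate numerically, the helper below computes them in NumPy. The function name, argument names, and toy shapes are assumptions for illustration. Since plain NumPy has no autodiff, the stop-gradients do not appear: they change which parameters receive gradient, not the value of the loss.

```python
import numpy as np

def vq_vae_loss_terms(x, x_hat, z_e, e_sel, beta=0.25):
    """Return (reconstruction, codebook, commitment) loss values.

    x, x_hat: image and reconstruction D(z_q), same shape
    z_e:      encoder outputs, shape (N, d)
    e_sel:    selected codebook entries e_k*, shape (N, d)
    """
    recon = ((x - x_hat) ** 2).sum()
    gap = ((z_e - e_sel) ** 2).sum()
    # Both auxiliary terms measure the same gap; sg only decides who moves.
    codebook_term = gap
    commitment_term = beta * gap
    return recon, codebook_term, commitment_term

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
x_hat = x + 0.1 * rng.normal(size=x.shape)        # stand-in for D(z_q)
z_e = rng.normal(size=(4, 2))
e_sel = z_e + 0.05 * rng.normal(size=z_e.shape)   # stand-in for nearby codes
recon, cb, commit = vq_vae_loss_terms(x, x_hat, z_e, e_sel)
total = recon + cb + commit
print(recon, cb, commit, total)
```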

Intuition · linear unpacking

Claim: the codebook term and the commitment term are the same squared gap ‖z_e − e_k*‖² read twice — once pulling the code, once pulling the encoder — and the only job of sg is to say which side moves.

The snap leaves a gap. Quantization replaces z_e with the nearest code e_k*. The distance between them is exactly the error the decoder is forced to swallow, so training wants to shrink it.
Two ways to close a gap. You can move the code toward the encoder output, or move the encoder output toward the code. Both reduce the same ‖z_e − e_k*‖² — they just spend the adjustment on different parameters.
What sg picks. In the codebook term ‖sg(z_e) − e_k*‖², the encoder side is frozen, so only e_k* gets pulled — the code chases the encoder. In the commitment term ‖z_e − sg(e_k*)‖², the code is frozen, so only z_e gets pulled — the encoder stops wandering off and stays near a code it can actually be rounded to.
What β is for. Weighting only the commitment term lets the encoder commit at its own rate, separate from how fast the codebook moves. Drop the stop-gradients and the two terms collapse into one shared pull (1 + β)‖z_e − e_k*‖² — you keep the magnitude but lose the ability to tune the two halves independently.

Central point. One gap, two grips: the codebook term drags the dictionary toward the data, the commitment term drags the data toward the dictionary, and the stop-gradients are what keep those two pulls on separate leashes.

Why the stop-gradient placement matters

The codebook loss ‖sg(z_e) − e_k*‖² updates only the codebook, not the encoder. The commitment loss ‖z_e − sg(e_k*)‖² updates only the encoder, not the codebook.

If you drop the stop-gradients you don’t lose magnitude — the two losses are quadratic in the same difference, so they add to (1 + β)‖z_e − e_k*‖². What you lose is the asymmetric routing: the encoder and codebook can no longer pursue their own objectives at separate rates. Both endpoints get pulled toward each other at the joint rate set by (1 + β), and you can’t use β to make the encoder commit faster than the codebook moves (or vice versa). The stop-grads aren’t there to keep the loss alive — they’re there to keep the two halves of training tunable independently.
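The routing argument can be checked by writing the gradients of the two auxiliary terms by hand. The sketch below uses one made-up 2D pair (z_e, e_k*) and covers only the codebook and commitment terms; the reconstruction gradient that reaches the encoder through the straight-through path is left out.

```python
import numpy as np

beta = 0.25
z_e = np.array([1.0, 2.0])   # encoder output (toy values)
e = np.array([0.5, 2.5])     # selected code e_k* (toy values)
diff = z_e - e

# With stop-gradients, each term reaches exactly one endpoint.
grad_e_with_sg = -2 * diff            # from ||sg(z_e) - e||^2
grad_z_with_sg = 2 * beta * diff      # from beta * ||z_e - sg(e)||^2

# Without them, the loss is (1 + beta) * ||z_e - e||^2 and both
# endpoints receive the full, equal-and-opposite gradient.
grad_e_no_sg = -2 * (1 + beta) * diff
grad_z_no_sg = 2 * (1 + beta) * diff

ratio_with = np.linalg.norm(grad_z_with_sg) / np.linalg.norm(grad_e_with_sg)
ratio_without = np.linalg.norm(grad_z_no_sg) / np.linalg.norm(grad_e_no_sg)
print(ratio_with, ratio_without)      # 0.25 1.0
```

With the stop-gradients in place, the encoder-to-codebook pull ratio equals β and can be tuned; without them it is pinned at 1 regardless of β.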

The killer failure mode: codebook collapse

Train naively and what you get is “codebook collapse”: a small fraction of the K entries get used; most are dead (never the nearest neighbor to any encoder output). At convergence you might be using ~10% of a 16k codebook, effectively a 1.6k vocabulary. Two reasons:

Initialization gap. If a codebook entry is far from any encoder output, it’s never selected, gets no gradient, stays where it is, never gets selected, etc. A self-reinforcing dead zone.
EMA vs. SGD mismatch. Codebook entries should track the moving mean of the encoder outputs that selected them. SGD on the codebook loss does this badly; EMA (exponential moving average) does it much better.

Standard fixes:

Fix	What it does	Used in
EMA codebook	update e_k as moving mean of assigned z_e	VQ-VAE2, VQGAN, almost everything since
Restart dead codes	periodically replace unused entries with high-variance encoder outputs	SoundStream, EnCodec, common audio recipes
Linear projection to/from codebook space	quantize in low dim (8 or even 4), decode to full width	improved-VQGAN, MaskGIT, MAGVIT
L2-normalize codes and inputs	quantize on the sphere; same dead-code issue, but easier to reset	ViT-VQGAN (the “improved VQGAN” paper)
Add KL term to encourage uniform usage	regularize the empirical distribution of selected codes	some text-codec work
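The first two rows of the table can be sketched as follows. This is one simple way to write an EMA codebook update plus a dead-code restart in NumPy, under stated assumptions: the state is initialized with `ema_count = 1` and `ema_sum = codebook`, the decay and threshold values are illustrative, and the restart draws replacement codes uniformly from the batch rather than by any variance criterion. Function names are hypothetical.

```python
import numpy as np

def ema_update(codebook, ema_count, ema_sum, z_e, idx, decay=0.99):
    """Move each code toward the running mean of the z_e assigned to it."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[idx]                  # (N, K) assignment matrix
    counts = one_hot.sum(axis=0)              # assignments per code this batch
    sums = one_hot.T @ z_e                    # sum of assigned z_e per code
    ema_count = decay * ema_count + (1 - decay) * counts
    ema_sum = decay * ema_sum + (1 - decay) * sums
    codebook = ema_sum / np.maximum(ema_count, 1e-12)[:, None]
    return codebook, ema_count, ema_sum

def restart_dead(codebook, ema_count, ema_sum, z_e, rng, threshold=0.5):
    """Replace codes whose running usage fell below threshold with batch samples."""
    dead = np.flatnonzero(ema_count < threshold)
    picks = rng.integers(0, len(z_e), size=len(dead))
    codebook, ema_count, ema_sum = codebook.copy(), ema_count.copy(), ema_sum.copy()
    codebook[dead] = z_e[picks]
    ema_sum[dead] = z_e[picks]
    ema_count[dead] = 1.0
    return codebook, ema_count, ema_sum, dead

rng = np.random.default_rng(0)
K, d = 4, 2
codebook = rng.normal(size=(K, d))
ema_count, ema_sum = np.ones(K), codebook.copy()
z_e = rng.normal(loc=3.0, size=(64, d))
idx = np.zeros(64, dtype=int)                 # pretend every output picked code 0
before = codebook.copy()
for _ in range(200):
    codebook, ema_count, ema_sum = ema_update(codebook, ema_count, ema_sum, z_e, idx)
after_ema = codebook.copy()
codebook, ema_count, ema_sum, dead = restart_dead(codebook, ema_count, ema_sum, z_e, rng)
print("restarted codes:", dead)
```

In this toy run code 0 converges to the mean of its assigned outputs, the never-selected codes stay where they were until the restart moves them onto the data.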

VQGAN (Esser et al. 2020) — sharper reconstructions

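VQ-VAE with L2 reconstruction produces blurry reconstructions (the same blur as a vanilla autoencoder), because pixel-wise L2 rewards the average of all plausible details rather than any one sharp version. VQGAN swaps in perceptual + adversarial losses: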

L_VQGAN = L_{recon-perceptual} + β · L_commit + λ · L_GAN

Perceptual loss: L2 in the feature space of a frozen VGG/CLIP rather than in pixel space. Encourages high-frequency texture.
Adversarial loss: a small patch-discriminator trained against the decoder, weighted by an adaptive λ that balances against the perceptual loss.

Result: at the same bottleneck (e.g. 16×16 tokens for a 256×256 image), VQGAN reconstructions are visually crisp where VQ-VAE’s were soft. This is what made the “tokenize images then model with a transformer” pipeline practical.
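The adaptive λ is usually described as a ratio of gradient magnitudes measured at the decoder's last layer: how hard the reconstruction loss pulls there versus how hard the GAN loss pulls. A minimal sketch, assuming the two gradients are already available as arrays (in practice they come from the autodiff framework); the function name and the `delta` and `max_weight` defaults are illustrative assumptions:

```python
import numpy as np

def adaptive_gan_weight(grad_rec, grad_gan, delta=1e-6, max_weight=1e4):
    """Ratio of gradient norms at the decoder's last layer.

    grad_rec: gradient of the perceptual reconstruction loss w.r.t. the
              last decoder layer's weights
    grad_gan: gradient of the GAN (generator) loss w.r.t. the same weights
    delta:    small constant for numerical stability (assumed value)
    """
    lam = np.linalg.norm(grad_rec) / (np.linalg.norm(grad_gan) + delta)
    return float(np.clip(lam, 0.0, max_weight))

# Toy gradients: reconstruction pulls with norm 5, GAN with norm 0.5
lam = adaptive_gan_weight(np.array([3.0, 4.0]), np.array([0.0, 0.5]))
print(lam)   # close to 10: the GAN term is scaled up to match
```

The effect is that neither loss dominates the decoder update just because its raw gradients happen to be larger.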

FSQ (Mentzer et al. 2023) — the simplification

Finite Scalar Quantization throws out the learned codebook entirely. Instead:

Force the encoder’s output to be a low-dimensional vector z_e ∈ ℝ^d, with d tiny (e.g. 5–8).
Apply tanh to each entry to bound it in [−1, 1].
Quantize each entry independently to one of L levels (e.g. L ∈ {5, 7, 8} per dim).

The “codebook” is then implicit: the product grid of all per-dim levels, of size L^d. With d = 6, L = 8 that’s 8⁶ = 262,144 (≈ 262k) codes, well beyond the 4k–64k learned vocabularies quoted above. Classic codebook collapse cannot happen, because there is no learned table: the grid is fixed and uniform by construction, so no entry can drift out of the encoder’s reach and die.

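z_q[i] = round( tanh(z_e[i]) · (L−1)/2 ) · 2/(L−1)

Gradients pass through the round with the straight-through estimator. As written, the formula yields exactly L levels when L is odd; for even L (such as L = 8) the rounding grid needs a half-step offset to land on L points, as in the FSQ reference code. That’s the entire quantizer. No EMA, no restart, no codebook loss. Empirically it roughly matches VQ-VAE reconstruction quality and is reported to be competitive, sometimes better, for downstream modeling.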
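A minimal NumPy sketch of the quantizer, restricted to odd L so the rounding grid has exactly L points per dimension. The function name, the shapes, and the base-L index layout are assumptions for illustration; only the forward pass is shown.

```python
import numpy as np

def fsq_quantize(z_e, L):
    """Finite scalar quantization for an odd number of levels L.

    z_e: (..., d) raw encoder output
    returns z_q in {-1, ..., 1} with L values per dim, and a token index
    into the implicit codebook of size L**d.
    """
    assert L % 2 == 1, "this sketch covers odd L only"
    half = (L - 1) / 2
    bounded = np.tanh(z_e)                    # squash into (-1, 1)
    levels = np.round(bounded * half)         # integers in [-half, half]
    z_q = levels / half                       # back to [-1, 1]
    digits = (levels + half).astype(int)      # per-dim level index 0..L-1
    d = z_e.shape[-1]
    index = (digits * (L ** np.arange(d))).sum(axis=-1)   # base-L number
    return z_q, index

rng = np.random.default_rng(0)
z_e = 2.0 * rng.normal(size=(16, 16, 6))      # hypothetical h=w=16, d=6
z_q, index = fsq_quantize(z_e, L=5)
print(np.unique(z_q))                         # the 5 levels per dimension
print(index.shape, index.max() < 5 ** 6)      # (16, 16) True
```

For training, the backward pass would treat the round as identity, for example by writing the output as bounded + stop_grad(z_q − bounded) in an autodiff framework.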

Method	Codebook	Collapse risk	Nominal vocab (typical usage)	Auxiliary losses
VQ-VAE	learned table, K entries	high	K (e.g. 16k, ~10% used)	commit + codebook
VQ-VAE + EMA	learned table, EMA updates	medium	K (e.g. 16k, ~50% used)	commit only
VQGAN	EMA learned table	medium	K (e.g. 16k)	commit + perceptual + GAN
FSQ	none (implicit grid)	impossible by construction	L^d (e.g. 262k)	none

Interactive · 2D codebook trainer

Below: a tiny 2D “data” cloud (the two moons again), a learned codebook of K 2D points, and a live training loop that uses the VQ-VAE update rule (EMA or SGD, your choice). Watch the codebook entries crawl onto the data manifold. Try a small K and observe collapse; try the FSQ option to see what the uniform-grid alternative looks like.

[Interactive demo: 2D codebook training, VQ vs. FSQ. Green = data points, orange = codebook entries; lines show which entry each data point is assigned to. Controls: K (codebook size), method, FSQ levels L. Readouts: step, active codes (used > 0), recon MSE, effective vocab.]

Try SGD with K=4 then K=32 — you’ll see collapse: a few codes carry all the assignments, the rest sit unused. EMA fixes most of it. FSQ trivially uses L² codes whether they’re “needed” or not.
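For readers without the live demo, the three readouts can be computed offline. The sketch below is an assumption about what they mean, not the demo's actual code: "active codes" counts entries that win at least one point, "recon MSE" is the mean squared distance from each point to its assigned code, and "effective vocab" is taken to be the perplexity of the usage distribution. The two-moons data is generated by hand, and half the codes are deliberately parked far away to imitate collapse.

```python
import numpy as np

def codebook_stats(data, codebook):
    """Return (active codes, mean squared quantization error, usage perplexity)."""
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = dists.argmin(axis=1)
    counts = np.bincount(idx, minlength=len(codebook))
    active = int((counts > 0).sum())
    recon_mse = float(dists[np.arange(len(data)), idx].mean())
    p = counts[counts > 0] / counts.sum()
    perplexity = float(np.exp(-(p * np.log(p)).sum()))
    return active, recon_mse, perplexity

rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, size=200)
upper = np.stack([np.cos(t), np.sin(t)], axis=1)
lower = np.stack([1 - np.cos(t), 0.5 - np.sin(t)], axis=1)
data = np.concatenate([upper, lower]) + 0.05 * rng.normal(size=(400, 2))

codebook = rng.normal(size=(16, 2))
codebook[8:] += 100.0        # these 8 codes are too far away to ever be selected
active, mse, eff = codebook_stats(data, codebook)
print(active, mse, eff)
```

A perplexity well below K is the numerical signature of the picture described above: a few codes carry the assignments while the rest sit unused.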

What goes wrong without the codebook

You could just stop here — train a continuous autoencoder, no quantizer. Why not?

You can’t use a token-predicting transformer. Next-token (or masked-token) prediction with a softmax over a vocabulary needs discrete tokens. With continuous latents, you have to train a continuous generative model in latent space (lesson 15 covers this hybrid; SD3 / Imagen 3 do exactly this). It works, but you’ve given up the unified-token-with-text move.
You can’t edit at the token level. A discrete vocabulary lets you swap one token for another (“replace token 472 with token 891”), which is what enables MaskGIT, inpainting, and editing-style operations.
You can’t interleave with language tokens. Same reason — an LLM has one big embedding table; image tokens just get more rows in it. With continuous latents you need a separate adapter to glue text to image.
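The "more rows in the embedding table" idea and the token-swap edit both reduce to integer bookkeeping. A small sketch, where the vocabulary sizes, token ids, and function name are all made-up values for illustration:

```python
TEXT_VOCAB = 32_000     # hypothetical LM vocabulary size
IMAGE_VOCAB = 16_384    # hypothetical tokenizer codebook size K

def image_to_lm_ids(image_tokens):
    """Shift image token indices past the text rows of a shared embedding table."""
    return [TEXT_VOCAB + t for t in image_tokens]

text_ids = [17, 942, 5]             # made-up text token ids
image_tokens = [472, 891, 3]        # made-up tokenizer output
sequence = text_ids + image_to_lm_ids(image_tokens)
print(sequence)                     # [17, 942, 5, 32472, 32891, 32003]

# Token-level edit: replace image token 472 with token 891
edited = [891 if t == 472 else t for t in image_tokens]
print(edited)                       # [891, 891, 3]

embedding_rows = TEXT_VOCAB + IMAGE_VOCAB
print(embedding_rows)               # 48384
```

Neither operation has an obvious counterpart when the latent is a continuous vector.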

Practical recipes (2026)

Use case	Tokenizer	Why
Image-only generation, max quality	VQGAN + EMA, K = 16k	standard; the “Parti tokenizer” recipe
Native multimodal (text + image)	FSQ with d = 6–8, L = 5–8	no collapse, big vocab, plays nicely with LM tokenizer
Video / long sequences	MAGVIT-v2 (causal 3D VQGAN with LFQ — lookup-free quantization, a close cousin of FSQ that uses {−1, +1} levels per dim with implicit codebook of 2^d)	temporal causality preserved, LFQ’s {−1, +1} binarization scales to very large implicit vocabularies (≥ 2¹⁸)
Audio	RVQ (residual VQ, ~8 layers) on EnCodec / SoundStream	captures multi-scale structure via residual chain
Speech (small vocab)	EnCodec or DAC, RVQ with restart	aggressive bitrate compression; restart prevents quiet-codebook collapse
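The "residual chain" in the RVQ rows works like this: each layer quantizes whatever the previous layers failed to capture. A minimal NumPy sketch with random, untrained codebooks (sizes and scales are arbitrary assumptions), so it only demonstrates the mechanics, not reconstruction quality:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x layer by layer; each layer codes the remaining residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        k = int(((cb - residual) ** 2).sum(axis=1).argmin())
        indices.append(k)
        residual = residual - cb[k]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Sum the selected entry from every layer."""
    return sum(cb[k] for cb, k in zip(codebooks, indices))

rng = np.random.default_rng(0)
# 8 layers of 64 codes in 8 dims, with shrinking scale for later layers
codebooks = [rng.normal(size=(64, 8)) * 0.5 ** i for i in range(8)]
x = rng.normal(size=8)
indices, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
print(indices)
print(np.allclose(x_hat + residual, x))   # True: decode plus leftover recovers x
```

One frame therefore costs one index per layer, and dropping the later layers at decode time trades quality for bitrate.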

Codebook size is not the whole story

A larger vocabulary doesn’t automatically buy you more capacity — if your tokenizer is information-bottlenecked elsewhere (e.g. spatial dim is too small), extra codes just give you more ways to spell the same thing. Reconstruction quality at fixed bottleneck dim — not vocab size — is what matters.
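A quick back-of-the-envelope check of that claim: the nominal capacity of a token grid is h · w · log₂ K bits, an upper bound that assumes every code is used equally often. The grid and vocabulary sizes below are illustrative.

```python
import math

def bits_per_image(h, w, K):
    """Upper bound on information carried by an h x w grid of K-way tokens."""
    return h * w * math.log2(K)

base = bits_per_image(16, 16, 16_384)          # 256 tokens * 14 bits
bigger_vocab = bits_per_image(16, 16, 65_536)  # 4x the codes
bigger_grid = bits_per_image(32, 32, 16_384)   # 4x the tokens

print(base, bigger_vocab, bigger_grid)         # 3584.0 4096.0 14336.0
print(bigger_vocab / base, bigger_grid / base)
```

Quadrupling the vocabulary adds about 14% to the nominal bit budget, while quadrupling the number of tokens quadruples it, which is why vocabulary size alone is a weak lever.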

Punchline

A VQ tokenizer is a learned dictionary that turns images into short discrete sequences. The straight-through gradient lets you train through the discrete bottleneck; EMA updates on the codebook mitigate collapse; FSQ replaces the whole codebook with a fixed grid and removes the collapse failure mode entirely. The downstream generative model then operates on these tokens. That’s the entire stage 1 of a discrete-token text-to-image transformer.