Why discrete?

When tokens beat pixels, and the four-way menu of discrete generation methods that follow from that choice.

The frame shift

Part A treated the data as continuous — x ∈ ℝ^D — and learned a vector field or denoiser that operated on real numbers. That’s the right move for raw pixels at high resolution. But there’s a different option that turns out to be very productive: quantize the data into a sequence of discrete symbols from a finite vocabulary, then model the symbol sequence with a transformer.

Before we sell the upside, it’s worth seeing why this is even a separate problem. The whole engine of Part A — add a little Gaussian noise, predict it, repeat — quietly assumes the data lives somewhere you can nudge: between any two points there’s a meaningful midpoint, and a tiny step in any direction lands you somewhere still valid. Discrete symbols break both assumptions. There is no point “halfway between the word cat and the word dog,” and there is no “slightly noisier” version of token #4,072 — you either have that token or you have a different one. So the continuous machinery doesn’t transfer for free; the methods in this part exist precisely to define corruption and gradients on a set that has no in-between.

Intuition · linear unpacking

Claim: you cannot just run Gaussian diffusion on discrete tokens — the noise has nowhere sensible to go.

What Gaussian noise needs. DDPM works by adding a small real-valued perturbation: x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε. That only means something if the data lives in a continuous space where “a bit of x₀ plus a bit of noise” is itself a legal data point.
Tokens are labels, not coordinates. A token is an index into a vocabulary — symbol #17, #18, #19. The numbers are names, not positions. #18 is not “between” #17 and #19 in any content sense, so averaging two tokens is meaningless, and adding 0.3 of a Gaussian to an index lands on nothing at all.
Embeddings don’t rescue it. You could map each token to a vector and add noise there — but the noised vector almost never sits on top of a real embedding. It floats in the gaps, and snapping it back to the nearest token is a hard, discontinuous decision the smooth diffusion math was built to avoid.

Central point. Gaussian diffusion assumes a space you can take small steps in; a finite vocabulary has no “small step,” so discrete generation needs its own notion of corruption (mask a token, swap a token) rather than “add a little noise.”
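To make the "embeddings don't rescue it" step concrete, here is a minimal sketch. The toy vocabulary, the random embedding table, and the helper names (`noised`, `nearest_token`) are all assumptions made for illustration, not part of any real tokenizer. It applies the DDPM-style forward step to one token's embedding and measures how far the result sits from every real embedding.

```python
import math
import random

random.seed(0)

VOCAB, DIM = 8, 4  # hypothetical toy sizes
# Hypothetical embedding table: one random vector per token id.
embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def nearest_token(vec):
    """Snap a vector to the id of the closest embedding (a hard, discontinuous choice)."""
    return min(range(VOCAB), key=lambda k: math.dist(vec, embed[k]))

def noised(token_id, alpha_bar):
    """Forward step on an embedding: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    x0 = embed[token_id]
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * random.gauss(0, 1)
            for x in x0]

x_t = noised(3, alpha_bar=0.9)
gap = min(math.dist(x_t, e) for e in embed)

# The noised vector floats in the gaps: its distance to every real embedding is nonzero.
print("distance to the closest real embedding:", round(gap, 3))
# Getting a token back requires the hard snap, which may or may not return token 3.
print("snapped back to token:", nearest_token(x_t))
```

The snap in `nearest_token` is a step function of its input, which is exactly the kind of operation the smooth diffusion math has no way to differentiate through.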

Why bother? Three reasons:

Compression. A 256×256 RGB image is 196,608 floats (256 × 256 × 3). A typical VQ tokenizer turns it into 256–1024 tokens drawn from a vocabulary of ~16k entries, a sequence roughly 200–800× shorter. Because self-attention cost grows quadratically with sequence length, a transformer attending over 256 tokens instead of 196k positions saves far more than that ratio in attention compute.
Compatibility with language models. An LLM is already a sequence-of-tokens model. If you tokenize images into the same alphabet (or a parallel alphabet), one transformer can jointly reason over text and image tokens. This is the architecture under Chameleon, Gemini 2.5 Flash Image (Nano Banana), GPT-Image-1/2 and friends. We get to this in lessons 13–14.
Discrete objectives are easier to interpret. Categorical cross-entropy gives you per-token loss, per-token sample probabilities, per-token attention — the same telemetry an LLM gives you. Continuous diffusion makes you reason about score fields and KL bounds.
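The compression arithmetic is easy to check by hand. The sketch below assumes an RGB image (3 channels) and reads "~16k" as exactly 2^14 = 16,384 entries; both are assumptions for the sake of round numbers. It reports how much shorter the token sequence is and, since attention compares every position with every other, how the number of attention pairs shrinks.

```python
import math

H, W, CHANNELS = 256, 256, 3
VOCAB_SIZE = 16_384  # assumption: "~16k" taken as 2**14

n_floats = H * W * CHANNELS            # values in the raw image
bits_per_token = math.log2(VOCAB_SIZE)  # information carried by one token index

def ratios(n_tokens):
    """Return (sequence-length ratio, attention-pair ratio) versus raw pixels."""
    length_ratio = n_floats / n_tokens
    return length_ratio, length_ratio ** 2  # attention pairs scale with length squared

print("raw values:", n_floats, "| bits per token:", bits_per_token)
for n_tokens in (256, 1024):
    length_ratio, attn_ratio = ratios(n_tokens)
    print(f"{n_tokens:5d} tokens: {length_ratio:.0f}x shorter, "
          f"~{attn_ratio:,.0f}x fewer attention pairs")
```

The quadratic column only counts attention pairs; the feed-forward layers scale linearly with length, so the end-to-end saving of a real model lands somewhere between the two ratios.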

The price

Quantization loses information. Whatever doesn’t fit in a finite codebook is gone forever — you can only generate things that are compositions of codebook entries. With a good codebook (lesson 11) the loss is imperceptible at typical compression ratios; with a bad one, samples look like clipart. The tokenizer is the single most load-bearing component in a discrete pipeline.
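A one-dimensional toy shows the trade-off. This is a sketch using a hypothetical scalar codebook of evenly spaced levels, not a real VQ-VAE: encode snaps each value to its nearest codebook entry, decode looks the entry back up, and whatever fell between entries is lost.

```python
def make_codebook(k):
    """k evenly spaced levels in [0, 1] (a hypothetical scalar codebook)."""
    return [(i + 0.5) / k for i in range(k)]

def encode(x, codebook):
    """Index of the nearest codebook entry: this is the 'token'."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def decode(idx, codebook):
    """Recover a value from the token; only codebook entries can ever come back."""
    return codebook[idx]

def worst_error(k, n_points=1000):
    """Largest round-trip error over a dense sweep of inputs in [0, 1]."""
    cb = make_codebook(k)
    signal = [i / (n_points - 1) for i in range(n_points)]
    return max(abs(x - decode(encode(x, cb), cb)) for x in signal)

for k in (4, 64):
    print(f"codebook size {k:3d}: worst round-trip error {worst_error(k):.4f}")
```

With evenly spaced levels the worst-case error is half the spacing, 1/(2k), so a larger codebook shrinks the loss but never removes it. Real image tokenizers quantize learned feature vectors rather than raw scalars, but the round trip has the same shape.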

The two-stage pattern

Almost every discrete generative system follows the same shape:

image → encoder → discrete tokens → sequence model → generated tokens → decoder → pixels.

Stage 1 is a tokenizer (lesson 11): a VAE-style encoder that produces a small spatial grid of discrete indices, plus a decoder that reverses the map. The tokenizer is trained separately on raw images, then frozen.

Stage 2 is a sequence model over the discrete tokens. Four flavors:

Method	How it samples	Steps	Quality vs. AR	Used by
Autoregressive (next-token)	one token at a time, left to right	N (= seq length)	highest fidelity, slowest	Chameleon, GPT-Image (text-to-image branch), DALL-E 1, ImageGPT
MaskGIT / parallel decoding	start fully masked, unmask top-K confident per round	~10–20	nearly AR-quality at 10× speed	MaskGIT, Muse, MAGVIT-v2
Discrete diffusion (D3PM, MD4)	iteratively edit a sequence with mask/swap transitions	~20–100	matches AR on some benchmarks	D3PM, MD4, SEDD (Lou et al. 2024)
Hybrid (token model + diffusion decoder)	generate discrete tokens with AR or parallel decoding, then run a continuous diffusion decoder on top → pixels	(N or ~10) + ~50	state of the art at the cost of two-model engineering	several 2025–2026 “pro” tiers compose this way (the discrete model handles structure/text; diffusion handles texture). Note: Parti by itself is pure-AR-with-VQ-decoder (no diffusion); Imagen by itself is cascaded pixel diffusion (no AR)

Autoregression — the simplest baseline

Given tokens z₁, …, z_N, factorize the joint by the chain rule:

p_θ(z₁, …, z_N) = ∏_{i=1}^{N} p_θ(z_i | z_{<i})

Train with cross-entropy; sample left-to-right. This is literally GPT applied to image tokens. The pros are obvious: transformers are well understood, the per-token loss is ordinary cross-entropy (convex in the logits, though not in the network weights), and training parallelizes trivially across positions with teacher forcing. The con is sampling cost: if a 256×256 image is tokenized into a 16×16 grid of 256 tokens, generation takes 256 sequential forward passes through the network.
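The sampling loop and the chain-rule likelihood can be sketched in a few lines. Here `toy_next_token_probs` is a hypothetical stand-in for the transformer (it just favors repeating the previous token); the vocabulary size and sequence length are arbitrary toy values.

```python
import math
import random

rng = random.Random(0)
VOCAB, N = 5, 16  # hypothetical toy sizes

def toy_next_token_probs(prefix):
    """Stand-in for p_theta(z_i | z_<i): a distribution over VOCAB given the prefix."""
    weights = [1.0] * VOCAB
    if prefix:
        weights[prefix[-1]] += 4.0  # favor repeating the previous token
    total = sum(weights)
    return [w / total for w in weights]

def sample_autoregressive(n):
    """Left-to-right sampling: one model call per generated token."""
    tokens, calls = [], 0
    for _ in range(n):
        probs = toy_next_token_probs(tokens)
        calls += 1
        tokens.append(rng.choices(range(VOCAB), weights=probs, k=1)[0])
    return tokens, calls

def sequence_nll(tokens):
    """Chain rule in log space: -sum_i log p(z_i | z_<i), the cross-entropy training loss."""
    return -sum(math.log(toy_next_token_probs(tokens[:i])[z])
                for i, z in enumerate(tokens))

tokens, calls = sample_autoregressive(N)
print("sampled:", tokens)
print("model calls:", calls, "| negative log-likelihood:", round(sequence_nll(tokens), 3))
```

Note the asymmetry: `sequence_nll` could evaluate every position in one batched pass because all the prefixes are known in advance, while `sample_autoregressive` cannot start step i until step i−1 has committed to a token.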

Notice the quiet reason we train with cross-entropy on the distribution rather than on the chosen token: turning a categorical into an actual symbol — argmax or a sample — is a wall that gradients cannot cross. This is the second thing that makes discrete generation its own discipline, and it shapes every method on the menu.

Intuition · linear unpacking

Claim: picking a token (argmax or sampling a categorical) is non-differentiable, so you cannot backprop through a generated symbol.

What the network actually outputs. Not a token — a vector of probabilities over the whole vocabulary, smooth in the network’s weights. You can tweak a weight and watch that probability move continuously.
Where it breaks. To get a usable token you must collapse that smooth vector to one winner: argmax, or draw a sample. The output is now a flat index — #4,072. Nudge a weight a little and the winner doesn’t shift a little; it stays #4,072, then suddenly flips to #4,073. A staircase, not a ramp.
Why that kills the gradient. Backprop needs “how much does the output move when I move this weight?” On a staircase that slope is zero almost everywhere (flat steps) and undefined at the jumps. There is no useful gradient to pass back through the act of choosing.
The two ways out. Train on the distribution and never sample during training — cross-entropy compares the smooth probability vector to the true token, so gradients flow (this is what autoregression does). Or, where you truly must push gradients through a discrete choice, replace the hard pick with a smooth stand-in (a temperature-softened, “continuous-relaxation” sample) that becomes the hard pick only in the limit.

Central point. The discrete commitment step has no slope, so discrete models are trained against the predicted probabilities, not the sampled token — sampling is pushed to inference time, where no gradient needs to survive it.
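The staircase is easy to see numerically. The sketch below uses a two-token vocabulary and sweeps one logit `w` (a hypothetical stand-in for "a weight"): the argmax stays put and then flips, while the softmax probability moves smoothly the whole way. The last lines show a temperature-softened softmax as one simple continuous stand-in for the hard pick; sampled relaxations such as Gumbel-softmax additionally add noise to the logits, which is omitted here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """The hard commitment step: index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Sweep one logit across the decision boundary at w = 1.0.
for w in (0.0, 0.5, 0.99, 1.01, 1.5):
    logits = [1.0, w]
    print(f"w={w:4.2f}  argmax={argmax(logits)}  p(token 1)={softmax(logits)[1]:.3f}")

# Lowering the temperature pushes the smooth vector toward the hard one-hot pick.
for temperature in (1.0, 0.1, 0.01):
    probs = softmax([1.0, 2.0], temperature)
    print(f"temperature={temperature:<4}  probs={[round(p, 4) for p in probs]}")
```

Between w = 0.99 and w = 1.01 the probability barely changes but the argmax jumps from token 0 to token 1: that jump is the step with no slope. Cross-entropy only ever touches the probability column, which is why gradients survive.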

Parallel decoding (MaskGIT) — the practical win

Treat the sequence as a partially-filled grid; train the model to predict the masked positions given the unmasked ones. At inference, start with everything masked, predict everything at once, keep only the top-K most confident, mask the rest, repeat. After ~10 rounds the whole grid is filled.

This is much faster (1.6 seconds vs. 30 seconds for autoregressive sampling on a typical text-to-image task at matched resolution), and quality is competitive with autoregression on standard benchmarks. Lesson 12 walks through the iterative-unmasking algorithm with an interactive widget. The key insight: generation order matters less for images than for text, so filling many positions in parallel costs little quality.
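The unmasking loop described above can be sketched as follows. Everything model-related is a stand-in: `toy_predict` is a hypothetical predictor that ignores context and simply prefers token `i % VOCAB` at position `i`. Two simplifications are assumed for brevity: tokens are chosen by argmax rather than sampled, and the number of positions revealed per round follows a linear schedule, whereas the published MaskGIT recipe uses a cosine schedule.

```python
import math
import random

rng = random.Random(0)
MASK = -1
VOCAB, N, ROUNDS = 4, 16, 4  # hypothetical toy sizes

def toy_predict(grid):
    """Stand-in model: a distribution over VOCAB for every position.
    Position i prefers token i % VOCAB with a random peak probability in [0.4, 0.9)."""
    out = []
    for i in range(len(grid)):
        peak = 0.4 + 0.5 * rng.random()
        rest = (1 - peak) / (VOCAB - 1)
        out.append([peak if k == i % VOCAB else rest for k in range(VOCAB)])
    return out

def maskgit_decode(n, rounds):
    grid = [MASK] * n                      # start fully masked
    for r in range(rounds):
        probs = toy_predict(grid)          # one forward pass predicts every position
        masked = [i for i in range(n) if grid[i] == MASK]
        # Best candidate token for each still-masked position.
        cand = {i: max(range(VOCAB), key=lambda k: probs[i][k]) for i in masked}
        # Linear schedule (an assumption): total positions filled after this round.
        target_filled = math.ceil(n * (r + 1) / rounds)
        keep = target_filled - (n - len(masked))
        # Commit only the most confident predictions; the rest stay masked.
        ranked = sorted(masked, key=lambda i: probs[i][cand[i]], reverse=True)
        for i in ranked[:keep]:
            grid[i] = cand[i]
        print(f"round {r + 1}: {grid}")
    return grid

grid = maskgit_decode(N, ROUNDS)
```

The cost is `ROUNDS` forward passes regardless of sequence length, compared with `N` passes for left-to-right sampling of the same grid.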

Discrete diffusion

The natural generalization of MaskGIT: instead of just “masked or revealed,” allow the corruption process to edit tokens (swap one symbol for another) at each step. Then train the model to predict the clean tokens given the corrupted ones, exactly as DDPM predicts clean data given noisy data. D3PM (Austin et al. 2021) was the first big paper here; SEDD (Lou et al. 2024) made it competitive with autoregression on text.

For images, masked-only diffusion (= MaskGIT with a longer schedule) usually wins; for text and code, swap-based diffusion is more interesting because tokens have natural “neighbors” (synonyms, edit-distance-1 swaps).
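The two corruption styles differ by a single line. The sketch below is a simplified forward process in which each token is corrupted independently with probability `t`; the function name `corrupt`, the `mode` strings, and the toy vocabulary are assumptions for illustration. In "mask" mode a corrupted token becomes the mask symbol; in "swap" mode it is replaced by a uniformly random vocabulary entry (which can occasionally land on the original token).

```python
import random

MASK = -1

def corrupt(tokens, t, vocab_size, mode, rng):
    """Corrupt each token independently with probability t (0 = clean, 1 = fully corrupted)."""
    out = []
    for z in tokens:
        if rng.random() < t:
            # The only difference between the two families of corruption:
            out.append(MASK if mode == "mask" else rng.randrange(vocab_size))
        else:
            out.append(z)
    return out

rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical token ids from a 10-entry vocabulary

for t in (0.0, 0.5, 1.0):
    print(f"t={t}: mask -> {corrupt(clean, t, 10, 'mask', rng)}")
    print(f"t={t}: swap -> {corrupt(clean, t, 10, 'swap', rng)}")
```

A training pair is then `(corrupt(clean, t, ...), clean)`: the model sees the corrupted sequence and is scored with cross-entropy against the clean tokens. With masking the model knows exactly which positions need filling; with swaps it also has to work out which tokens are wrong.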

Interactive · the four samplers, in miniature

Below, four samplers fill a 16-token grid (think: one flattened row of patches). The “target” is a synthetic distribution where each cell prefers a specific color. Watch how each method fills the grid step-by-step. Autoregression is strictly left-to-right; MaskGIT fills from confident regions outward; discrete diffusion edits everywhere at once; the hybrid does AR for token IDs then refines via a smoothing pass.

How to pick

Situation	Pick	Why
Need joint text+image reasoning	autoregressive over unified tokens	simplest path to chain-of-thought + image-as-output (lessons 13–14)
Need fastest image generation at known resolution	MaskGIT / Muse	~10 forward passes; competitive quality
Want absolute best quality, throughput is secondary	discrete token model + frozen diffusion decoder on the tokens	standard recipe for several 2026 “pro” tiers (Parti and Imagen are not a single pipeline — this row describes the architectural pattern, not a specific product)
Domain has natural token neighbors (text, code)	SEDD / discrete diffusion	edits-during-corruption matches what text wants to do
Audio codec output	autoregressive or RVQ + diffusion	autoregressive over EnCodec tokens has been the dominant baseline

What this part of the series is doing

Lesson 11 covers the tokenizer in depth (VQ-VAE, VQGAN, FSQ — the modern simplification). Lesson 12 covers discrete diffusion / MaskGIT mechanics. Lesson 13 zooms out to unified-token transformers (Chameleon, JanusFlow). Lesson 14 covers the “chain-of-thought before drawing” pattern that gives Nano Banana Pro and the GPT-Image-2 family their reasoning. Lesson 15 covers hybrid pipelines (LLM-conditioned diffusion: DALL-E 3, SD3+T5, Imagen 3) for when full autoregression isn’t the right answer.

Punchline

Tokenize the image into ~256 discrete symbols, model the symbol sequence with a transformer, recover pixels with a learned decoder. The model is now just an LLM, with all the architectural and reasoning machinery LLMs have. That’s the prerequisite for the integrated systems we cover later in the series.