Why discrete?

When tokens beat pixels, and the four-way menu of discrete generation methods that follow from that choice.

The frame shift

Part A treated the data as continuous, $x \in \mathbb{R}^D$, and learned a vector field or denoiser that operated on real numbers. That’s the right move for raw pixels at high resolution. But there’s a different option that turns out to be very productive: quantize the data into a sequence of discrete symbols from a finite vocabulary, then model the symbol sequence with a transformer.

Why bother? Three reasons:

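  1. Compression. A 256×256 RGB image is 196,608 floats. A typical VQ tokenizer turns it into 256–1024 tokens drawn from a vocabulary of ~16k entries, a sequence two to three orders of magnitude shorter. Because self-attention cost grows quadratically with sequence length, the attention savings are larger still: 256 tokens versus 196k positions is a 768× shorter sequence and roughly 768² ≈ 590,000× fewer pairwise attention scores.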
  2. Compatibility with language models. An LLM is already a sequence-of-tokens model. If you tokenize images into the same alphabet (or a parallel alphabet), one transformer can jointly reason over text and image tokens. This is the architecture under Chameleon, Gemini 2.5 Flash Image (Nano Banana), GPT-Image-1/2 and friends. We get to these in lessons 13–14.
  3. Discrete objectives are easier to interpret. Categorical cross-entropy gives you per-token loss, per-token sample probabilities, per-token attention — the same telemetry an LLM gives you. Continuous diffusion makes you reason about score fields and KL bounds.
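
The compression argument in the first point can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 16×16 token grid and the exact vocabulary size of 16,384 are assumptions chosen to match the "256 tokens, ~16k entries" figures above, and the attention ratio counts pairwise scores only, not the full cost of a transformer layer.

```python
import math

# Raw image size: one value per pixel per channel.
height, width, channels = 256, 256, 3
num_floats = height * width * channels

# Assumed tokenizer output: a 16x16 grid of indices into a 16,384-entry codebook.
grid_side = 16
vocab_size = 16_384
num_tokens = grid_side * grid_side

# Sequence-length ratio, and the quadratic ratio for pairwise attention scores.
length_ratio = num_floats / num_tokens
attention_ratio = length_ratio ** 2

# Information carried by the token sequence, in bits.
bits_per_token = math.log2(vocab_size)
bits_per_image = num_tokens * bits_per_token

print(f"raw values per image : {num_floats}")
print(f"tokens per image     : {num_tokens}")
print(f"sequence is shorter by {length_ratio:.0f}x")
print(f"attention scores drop by {attention_ratio:.0f}x")
print(f"token sequence carries {bits_per_image:.0f} bits")
```

With a 1024-token grid instead of 256, the same calculation gives a 192× shorter sequence, which is why the savings are usually quoted as a range.
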
The price
Quantization loses information. Whatever doesn’t fit in a finite codebook is gone forever — you can only generate things that are compositions of codebook entries. With a good codebook (lesson 11) the loss is imperceptible at typical compression ratios; with a bad one, samples look like clipart. The tokenizer is the single most load-bearing component in a discrete pipeline.

The two-stage pattern

Almost every discrete generative system follows the same shape:

[Diagram: raw image (B, 3, 256, 256) → VQ encoder → token grid (B, 16, 16), vocab ~16k → flatten → token sequence (256 tokens) → model π(z), categorical. Stage 1: tokenize (frozen) · Stage 2: model the token sequence. Generation reverses: sample token sequence → VQ decoder → pixels.]

Stage 1 is a tokenizer (lesson 11): a VAE-style encoder that produces a small spatial grid of discrete indices, plus a decoder that reverses the map. The tokenizer is trained separately on raw images, then frozen.
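
The quantization step at the heart of Stage 1 can be sketched as a nearest-neighbour lookup. This is a toy illustration, not a real tokenizer: the codebook and latents here are random, the sizes are tiny, and the learned encoder and decoder networks are left out entirely. The names `quantize` and `dequantize` are hypothetical.

```python
import random

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in latents
    ]

def dequantize(indices, codebook):
    """Decoder-side lookup: replace each index with its codebook vector."""
    return [codebook[k] for k in indices]

rng = random.Random(0)
dim, vocab_size, grid_side = 4, 32, 4   # toy sizes, far smaller than a real tokenizer

# Random stand-ins for a learned codebook and for the encoder's output grid.
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(grid_side * grid_side)]

tokens = quantize(latents, codebook)      # flattened token sequence for Stage 2
recon = dequantize(tokens, codebook)      # vectors a decoder network would consume

print(tokens)
```

Stage 2 only ever sees `tokens`. Everything the decoder can reconstruct has to come from the codebook vectors those indices select.
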

Stage 2 is a sequence model over the discrete tokens. Four flavors:

| Method | How it samples | Steps | Quality vs. AR | Used by |
| --- | --- | --- | --- | --- |
| Autoregressive (next-token) | one token at a time, left to right | N (= seq length) | highest fidelity, slowest | Chameleon, GPT-Image (text-to-image branch), DALL-E 1, ImageGPT |
| MaskGIT / parallel decoding | start fully masked, unmask top-K confident per round | ~10–20 | nearly AR-quality at 10× speed | MaskGIT, Muse, MAGVIT-v2 |
| Discrete diffusion (D3PM, MD4) | iteratively edit a sequence with mask/swap transitions | ~20–100 | matches AR on some benchmarks | D3PM, MD4, SEDD (Lou et al. 2024) |
| Hybrid (token model + diffusion decoder) | generate discrete tokens with AR or parallel decoding, then run a continuous diffusion decoder on top → pixels | (N or ~10) + ~50 | state of the art at the cost of two-model engineering | several 2025–2026 “pro” tiers compose this way (the discrete model handles structure/text; diffusion handles texture). Note: Parti by itself is pure-AR-with-VQ-decoder (no diffusion); Imagen by itself is cascaded pixel diffusion (no AR) |
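
To make the Steps column concrete, the sketch below tallies forward passes for a 256-token image. The specific step counts (10 parallel rounds, 50 discrete-diffusion steps, 50 decoder steps) are assumed mid-range values picked from the table, and `forward_pass_budget` is a hypothetical helper. The counts are not directly comparable in wall-clock terms, since the cost of one pass differs between methods.

```python
def forward_pass_budget(seq_len=256, parallel_rounds=10,
                        diffusion_steps=50, decoder_steps=50):
    """Rough forward-pass counts per generated image, one entry per method."""
    return {
        "autoregressive": seq_len,                      # one pass per token
        "maskgit": parallel_rounds,                     # one pass per unmasking round
        "discrete_diffusion": diffusion_steps,          # one pass per denoising step
        "hybrid (AR tokens)": seq_len + decoder_steps,  # token model, then diffusion decoder
        "hybrid (parallel tokens)": parallel_rounds + decoder_steps,
    }

for method, passes in forward_pass_budget().items():
    print(f"{method:26s} {passes:4d} forward passes")
```
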

Autoregression — the simplest baseline

Given tokens $z_1, \dots, z_N$, factorize the joint by the chain rule:

$$p_\theta(z_1, \dots, z_N) = \prod_{i=1}^{N} p_\theta(z_i \mid z_{<i})$$

Train with cross-entropy; sample left-to-right. This is literally GPT applied to image tokens. The pros are obvious: transformers are well-understood, the per-token cross-entropy is convex in the logits, and teacher forcing makes parallel training trivial. The con is sampling cost: with a 16×16 token grid, a 256×256 image is 256 tokens, so generation takes 256 sequential forward passes through the network.
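
A minimal sketch of both halves, sampling and the teacher-forced loss, is below. The function `toy_logits` is a hypothetical stand-in for a transformer: it simply favours "previous token + 1" so that the example runs without any trained model. The forward-pass counter shows where the sequential cost comes from.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(prefix, vocab_size):
    """Hypothetical model: uniform for the first token, then favours (last + 1)."""
    logits = [0.0] * vocab_size
    if prefix:
        logits[(prefix[-1] + 1) % vocab_size] = 3.0
    return logits

def sample_autoregressive(seq_len, vocab_size, rng):
    """Sample left to right; each new token costs one forward pass."""
    tokens, forward_passes = [], 0
    for _ in range(seq_len):
        probs = softmax(toy_logits(tokens, vocab_size))
        forward_passes += 1
        tokens.append(rng.choices(range(vocab_size), weights=probs)[0])
    return tokens, forward_passes

def sequence_nll(tokens, vocab_size):
    """Teacher-forced cross-entropy: sum over i of -log p(z_i | z_<i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = softmax(toy_logits(tokens[:i], vocab_size))
        total -= math.log(probs[tok])
    return total

tokens, passes = sample_autoregressive(seq_len=256, vocab_size=16, rng=random.Random(0))
print(f"{passes} forward passes, NLL = {sequence_nll(tokens, 16):.1f} nats")
```

In `sequence_nll` every term depends only on the known prefix, so with a real transformer and a causal mask all 256 terms come out of one forward pass. Sampling has no such shortcut.
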

Parallel decoding (MaskGIT) — the practical win

Treat the sequence as a partially-filled grid; train the model to predict the masked positions given the unmasked ones. At inference, start with everything masked, predict everything at once, keep only the top-K most confident, mask the rest, repeat. After ~10 rounds the whole grid is filled.

This is much faster (1.6 seconds vs. 30 seconds on a typical text-to-image task at matched resolution) and quality is competitive with autoregression on standard benchmarks. Lesson 12 walks through the iterative-unmasking algorithm with an interactive widget. The key insight: order matters less for images than for text, so parallel decoding doesn’t hurt much.
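
The unmasking loop described above can be sketched as follows. Several things here are assumptions made for illustration: `toy_predict` is a hypothetical stand-in for the masked transformer (it ignores the revealed context and returns random confidences), and the cosine schedule for how many positions stay masked after each round is one common choice rather than the only one.

```python
import math
import random

MASK = -1

def toy_predict(tokens, target, vocab_size, rng):
    """Hypothetical model: one call returns a (guess, confidence) pair per position.
    A real model would condition on the revealed entries of `tokens`; this toy does not."""
    preds = []
    for want in target:
        confidence = rng.uniform(0.5, 1.0)
        guess = want if rng.random() < confidence else rng.randrange(vocab_size)
        preds.append((guess, confidence))
    return preds

def sample_parallel(target, num_rounds, rng, vocab_size=8):
    """Start fully masked; each round reveals the most confident masked positions."""
    n = len(target)
    tokens = [MASK] * n
    forward_passes = 0
    for r in range(1, num_rounds + 1):
        preds = toy_predict(tokens, target, vocab_size, rng)  # predicts every position at once
        forward_passes += 1
        # Cosine schedule: how many positions should still be masked after this round.
        keep_masked = int(n * math.cos(math.pi / 2 * r / num_rounds))
        masked = [i for i in range(n) if tokens[i] == MASK]
        num_to_reveal = max(len(masked) - keep_masked, 0)
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_reveal]:
            tokens[i] = preds[i][0]   # revealed tokens stay fixed in later rounds
    return tokens, forward_passes

target = [i % 8 for i in range(16)]
tokens, passes = sample_parallel(target, num_rounds=10, rng=random.Random(0))
print(tokens, f"({passes} forward passes)")
```

The number of forward passes equals the number of rounds, independent of sequence length, which is the source of the speedup over left-to-right sampling.
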

Discrete diffusion

The natural generalization of MaskGIT: instead of just “masked or revealed,” allow the corruption process to edit tokens (swap one symbol for another) at each step. Then train the model to predict the clean tokens given the corrupted ones, much as a DDPM-style denoiser recovers clean data from noisy data. D3PM (Austin et al. 2021) was the first big paper here; SEDD (Lou et al. 2024) made it competitive with autoregression on text.

For images, masked-only diffusion (= MaskGIT with a longer schedule) usually wins; for text and code, swap-based diffusion is more interesting because tokens have natural “neighbors” (synonyms, edit-distance-1 swaps).
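
The two corruption styles contrasted above can be sketched as simple per-token noise functions. This is a simplified illustration, assuming independent per-token corruption at a single noise level. The function names are hypothetical, and a uniform resample stands in for "swap" even though published methods define their transition matrices more carefully.

```python
import random

MASK = -1

def corrupt_absorbing(tokens, mask_prob, rng):
    """Masked (absorbing-state) corruption: each token independently becomes MASK."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def corrupt_uniform(tokens, swap_prob, vocab_size, rng):
    """Swap-style corruption: each token is independently resampled from the vocabulary."""
    return [rng.randrange(vocab_size) if rng.random() < swap_prob else t for t in tokens]

rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 2, 6, 5]

# A training example is a (corrupted, clean) pair; the model learns to recover `clean`.
print("masked :", corrupt_absorbing(clean, mask_prob=0.5, rng=rng))
print("swapped:", corrupt_uniform(clean, swap_prob=0.5, vocab_size=8, rng=rng))
```

With masking, the model always knows which positions were corrupted. With swaps it has to work that out as well, since a corrupted token looks like any other token.
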

Interactive · the four samplers, in miniature

Below, four samplers fill a 16-token grid (think: one flattened row of patches). The “target” is a synthetic distribution where each cell prefers a specific color. Watch how each method fills the grid step-by-step. Autoregression is strictly left-to-right; MaskGIT fills from confident regions outward; discrete diffusion edits everywhere at once; the hybrid does AR for token IDs then refines via a smoothing pass.

Four samplers, one 16-token grid
Hit run-all to start all four sweeps in parallel. Each cell has a true preferred color (the “data”); the sampler should converge to that color. Watch how many forward passes each one spends.

How to pick

| Situation | Pick | Why |
| --- | --- | --- |
| Need joint text+image reasoning | autoregressive over unified tokens | simplest path to chain-of-thought + image-as-output (lessons 13–14) |
| Need fastest image generation at known resolution | MaskGIT / Muse | ~10 forward passes; competitive quality |
| Want absolute best quality, throughput is secondary | discrete token model + frozen diffusion decoder on the tokens | standard recipe for several 2026 “pro” tiers (Parti and Imagen are not a single pipeline; this row describes the architectural pattern, not a specific product) |
| Domain has natural token neighbors (text, code) | SEDD / discrete diffusion | edits-during-corruption matches what text wants to do |
| Audio codec output | autoregressive or RVQ + diffusion | autoregressive over EnCodec tokens has been the dominant baseline |

What this part of the series is doing

Lesson 11 covers the tokenizer in depth (VQ-VAE, VQGAN, FSQ — the modern simplification). Lesson 12 covers discrete diffusion / MaskGIT mechanics. Lesson 13 zooms out to unified-token transformers (Chameleon, JanusFlow). Lesson 14 covers the “chain-of-thought before drawing” pattern that gives Nano Banana Pro and the GPT-Image-2 family their reasoning. Lesson 15 covers hybrid pipelines (LLM-conditioned diffusion: DALL-E 3, SD3+T5, Imagen 3) for when full autoregression isn’t the right answer.

Punchline
Tokenize the image into ~256 discrete symbols, model the symbol sequence with a transformer, recover pixels with a learned decoder. The model is now just an LLM, with all the architectural and reasoning machinery LLMs have. That’s the prerequisite for the integrated systems we cover later in the series.