
Unified-token transformers

One transformer, one embedding table, one big vocabulary. Text and image tokens interleaved in a single sequence. The architectural backbone of every “native multimodal” flagship in 2026.

The architectural move

Before 2024-ish, multimodal models came in two flavors:

  1. Two-tower / adapter: a vision encoder produces features, an LLM consumes them via cross-attention or learned adapters (BLIP-2, LLaVA, Flamingo). The vision and language paths are different networks; they meet only at an interface layer.
  2. Separate generators: a text LLM rewrites the prompt; a frozen text-to-image diffusion model consumes the rewritten prompt (DALL-E 3 with GPT-4’s help). The generative path is again two systems glued at a string interface.

The unified-token move (Chameleon, Meta 2024; JanusFlow; LWM; the architectural family widely believed to underlie Gemini 2.5 Flash Image / “Nano Banana”, GPT-Image-2, etc.) collapses this:

[Diagram: image → VQ encoder → indices; text → BPE → indices; both streams merge into one interleaved sequence [txt, txt, <BOI>, img, img, img, <EOI>, txt, txt …], which feeds one transformer (causal mask, unified vocab ~100k–200k) with output π(z).]

Image and text are both just rows in the same embedding table. One forward pass; no adapter, no cross-attention to a separate vision tower.
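To make the "rows in the same embedding table" idea concrete, here is a minimal sketch of how the two index streams could be merged into one id space. The vocabulary sizes, the layout (text ids first, image ids shifted after them, markers last) and the helper names `image_to_unified`, `interleave` and `modality_of` are all assumptions for illustration, not the layout of any particular model.

```python
# Hypothetical sizes and layout, chosen only for illustration.
TEXT_VOCAB = 32_000                 # BPE ids occupy [0, TEXT_VOCAB)
IMAGE_CODES = 8_192                 # VQ codebook ids are shifted to sit after the text ids
BOI = TEXT_VOCAB + IMAGE_CODES      # begin-of-image marker
EOI = BOI + 1                       # end-of-image marker
UNIFIED_VOCAB = EOI + 1             # total number of embedding rows


def image_to_unified(vq_indices):
    """Shift raw VQ codebook indices into the image block of the unified vocab."""
    for idx in vq_indices:
        if not 0 <= idx < IMAGE_CODES:
            raise ValueError(f"VQ index out of range: {idx}")
    return [TEXT_VOCAB + idx for idx in vq_indices]


def interleave(segments):
    """Flatten [("text", ids), ("image", vq_ids), ...] into one id sequence."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            seq.append(BOI)
            seq.extend(image_to_unified(ids))
            seq.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq


def modality_of(token_id):
    """Recover the modality of a unified id from the range it falls in."""
    if not 0 <= token_id < UNIFIED_VOCAB:
        raise ValueError(f"id outside unified vocab: {token_id}")
    if token_id < TEXT_VOCAB:
        return "text"
    if token_id < BOI:
        return "image"
    return "marker"


seq = interleave([("text", [5, 17]), ("image", [0, 3, 8191]), ("text", [42])])
print(seq)
print([modality_of(t) for t in seq])
```

Because modality is recoverable from the id range alone, the transformer itself needs no modality-specific input path: every id is just a lookup into one table.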

What changes vs. text-only LM

  1. Embedding table grows. A text LM has vocab ~32k–100k. Adding image tokens means appending another ~8k–16k rows (one per VQ codebook entry). Audio tokens add another few thousand. For models that start from a large text vocabulary, the unified vocab ends up around ~100k–200k; embedding-table memory grows in proportion to the added rows, but the rest of the model is unchanged.
  2. Modality markers. Special tokens <BOI> / <EOI> (begin / end of image) wrap image-token spans, so the model knows when it’s reading or writing one. Some implementations also add per-modality positional encodings.
  3. Output head, sometimes split. The vanilla design uses one big softmax over the unified vocab. Some designs (JanusFlow) use a separate head for each modality, on the grounds that text and image have very different output entropies and a shared head can be hard to balance.
  4. Training data must be interleaved. The whole point is to train on sequences like “caption <BOI> image <EOI> more caption <BOI> image…” so the model learns the joint distribution. Pure (text, image) pairs work but limit what the model can do.
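Two of the points above lend themselves to a quick sketch: the memory cost of the extra embedding rows (point 1), and the difference between one shared softmax and a per-modality output (point 3). The numbers below (100k text rows, 8,192 image rows, model width 4096, 2 bytes per parameter) are assumed for illustration. Restricting a shared head to one modality's id range, as `restrict_logits` does, is a simple stand-in for a separate head, not a description of any specific model.

```python
import math


def embedding_bytes(rows, d_model, bytes_per_param=2):
    """Memory of an embedding table with `rows` entries of width `d_model`."""
    return rows * d_model * bytes_per_param


def restrict_logits(logits, lo, hi):
    """Keep logits for ids in [lo, hi); set every other id to -inf."""
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; ids at -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Point 1: cost of appending image rows (assumed sizes, 2-byte parameters).
text_rows, image_rows, d_model = 100_000, 8_192, 4_096
added = embedding_bytes(image_rows, d_model)
print(added / 2**20, "MiB added")          # 64.0 MiB added
print(image_rows / text_rows, "relative growth in rows")

# Point 3: toy 6-id vocab where ids 3 and 4 are the "image" block.
toy_logits = [1.0, 2.0, 3.0, 0.5, 0.5, 4.0]
image_probs = softmax(restrict_logits(toy_logits, 3, 5))
print(image_probs)                         # [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
```

With these assumed sizes the image rows add about 8% to the table, which is why the growth is usually treated as cheap relative to the transformer body.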

The decoding-mode switch

An autoregressive unified-token LM, naively used, generates image tokens one at a time — 256+ sequential forward passes for one image. That’s why most production systems use a hybrid per-modality decoding strategy:

| Span | Decoder | Why |
| --- | --- | --- |
| Text tokens (CoT, answer) | autoregressive, one-at-a-time | strict left-to-right ordering; chain-of-thought needs sequential semantics |
| Image tokens (between <BOI> and <EOI>) | parallel / MaskGIT-style | ~10 forwards instead of 256; bidirectional context within the image |
| Audio tokens | parallel or RVQ-aware AR | residual VQ adds a layer dimension; usually autoregress along the layer axis, parallel within |

The transformer doesn’t know — the decoder strategy is a sampling-time choice. The same trained weights serve both modes. (You do train the model to handle the masking pattern that parallel decoding uses: masked image tokens at variable rates appear in the training mix.)
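The "~10 forwards instead of 256" figure can be sketched with a toy parallel decoder. This is a simplified MaskGIT-style loop under stated assumptions: `toy_predict` is a hypothetical stand-in for one transformer forward pass (it returns random tokens and confidences), the cosine schedule is one common choice, and committed tokens are never re-masked, which real implementations may do differently.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a not-yet-decoded image position


def toy_predict(tokens, rng):
    """Stand-in for one forward pass: a (token, confidence) guess per masked position."""
    return {
        i: (rng.randrange(8192), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def parallel_decode(num_tokens=256, steps=10, seed=0):
    """Fill all image positions in `steps` forward passes instead of `num_tokens`."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    forwards = 0
    for step in range(1, steps + 1):
        preds = toy_predict(tokens, rng)
        forwards += 1
        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        if step == steps:
            keep_masked = 0  # the final pass commits everything that is left
        n_commit = max(1, len(preds) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
        if MASK not in tokens:
            break
    return tokens, forwards


tokens, forwards = parallel_decode()
print(forwards, "forward passes for", len(tokens), "image tokens")
```

The schedule commits only a few tokens in the early passes, when little context exists, and most of them in the later passes. A fully autoregressive decoder would need one pass per token for the same span.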

[Interactive widget: interleaved sequence builder + attention mask viewer. An editable interleaved sequence of text and image tokens; selecting a token highlights the positions it attends to.]

Under the fully causal mask (vanilla AR), the attention pattern is the lower-triangular region of the matrix: every token attends to itself and all earlier positions, regardless of modality. Under parallel-image masking (used with MaskGIT-style image decoding), tokens inside an image span also get bidirectional context within that span.
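The two masking modes can be written down directly as boolean matrices. This is a minimal sketch: the modality labels and the choice to leave the <BOI> / <EOI> markers strictly causal are assumptions for illustration. `mask[i][j]` is True when position i may attend to position j.

```python
def causal_mask(n):
    """Lower-triangular mask: position i sees positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def image_spans(modalities):
    """Return (start, end) pairs, end exclusive, for maximal runs of "img"."""
    spans, start = [], None
    for i, m in enumerate(modalities):
        if m == "img" and start is None:
            start = i
        elif m != "img" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(modalities)))
    return spans


def parallel_image_mask(modalities):
    """Causal everywhere, plus full bidirectional attention inside each image span."""
    mask = causal_mask(len(modalities))
    for start, end in image_spans(modalities):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


modalities = ["txt", "txt", "boi", "img", "img", "img", "eoi", "txt"]
causal = causal_mask(len(modalities))
parallel = parallel_image_mask(modalities)

print(causal[3][5], parallel[3][5])   # first image token looking at the last one
print(parallel[3][6])                 # image tokens still cannot see the later <EOI>
print(parallel[7][3])                 # later text sees the whole image, as in the causal case
```

The only difference between the two masks is the square block over each image span, so text positions behave identically in both modes.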

The Chameleon / JanusFlow design choices, in detail

| Choice | Chameleon (Meta 2024) | JanusFlow (DeepSeek 2024) | Comment |
| --- | --- | --- | --- |
| Tokenizer | VQGAN, 8k codes | separate encoders for understanding vs generation | JanusFlow argues understanding wants different features than generation; two encoders, one transformer |
| Image vocab vs. text vocab | fully merged into one | separate output heads, shared embedding | shared head is simplest, separate heads ease entropy balancing |
| Decoding for image | fully AR over image tokens | parallel (iterative rectified-flow steps over all image positions) | parallel wins on throughput; Chameleon’s AR choice made interleaving simpler at training time |
| Training data | interleaved web docs + image-text pairs | interleaved + reasoning-augmented data | interleaving is what enables “text → reason → image → text” chains at inference |
| Continuous head? | no, pure discrete | yes: FM-style head for generation, AR head for text | JanusFlow’s split is the modern compromise: text discrete (LM), image continuous (flow) within one transformer body |
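For readers unfamiliar with what an "FM-style head" does at sampling time, here is a toy sketch of generic flow-matching sampling: start from noise and integrate a velocity field with Euler steps. `oracle_velocity` is a hypothetical stand-in for the trained head. It uses the known target to return the exact straight-line velocity, which a real model would have to predict. The step count and vector size are arbitrary, and none of this reflects JanusFlow's actual head or conditioning.

```python
import random


def oracle_velocity(x_t, t, target):
    """Exact velocity for the straight path x_t = (1 - t) * x0 + t * x1.

    A trained head would predict this from (x_t, t) and the text context;
    here the target x1 is handed in directly, purely for illustration.
    """
    return [(x1 - x) / (1.0 - t) for x, x1 in zip(x_t, target)]


def euler_sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (Gaussian noise) to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                              # t stays below 1, so 1 - t > 0
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_latent = [0.25, -1.0, 0.5, 2.0]          # hypothetical continuous image latent
sample = euler_sample(target_latent)
print([round(s, 6) for s in sample])
```

Each Euler step updates every latent coordinate at once, so the number of forward passes is the number of integration steps rather than the number of image positions. The text side of such a model still decodes discrete tokens autoregressively.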

Why this beats two-tower

Trade-offs of full unification

When unified hurts

Where this sits in the series

This lesson built the substrate: a transformer that thinks in text and image tokens at once. Lesson 14 builds the capability on top: the chain-of-thought style reasoning that Nano Banana Pro and the GPT-Image-2 family ship with, and why that’s a qualitative change from prompt-rewriting tricks. Lesson 15 covers the alternative architecture — LLM-conditioned diffusion — that you’d pick if you don’t want to commit to the unified-token route.

Punchline
Tokenize the image, append it to the LM’s embedding table, train one transformer on interleaved [text, image] sequences. The same network now reads, reasons about, and generates both modalities. Per-modality decoder choices (AR for text, parallel for image) are a sampling-time switch. That’s the architectural family every “native multimodal” flagship belongs to.