Unified-token transformers
One transformer, one embedding table, one big vocabulary. Text and image tokens interleaved in a single sequence. The architectural backbone of every “native multimodal” flagship in 2026.
The architectural move
Before 2024-ish, multimodal models came in two flavors:
- Two-tower / adapter: a vision encoder produces features, an LLM consumes them via cross-attention or learned adapters (BLIP-2, LLaVA, Flamingo). The vision and language paths are different networks; they meet only at an interface layer.
- Separate generators: a text LLM rewrites the prompt; a frozen text-to-image diffusion model consumes the rewritten prompt (DALL-E 3 with GPT-4’s help). The generative path is again two systems glued at a string interface.
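The unified-token move (Chameleon, Meta 2024; JanusFlow; LWM; the architectural family that underlies Gemini 2.5 Flash Image / “Nano Banana”, GPT-Image-2, etc.) collapses this into a single network: one transformer, one embedding table, and one vocabulary, with text and image tokens interleaved in a single sequence.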
What changes vs. text-only LM
- Embedding table grows. A text LM has vocab ~32k–100k. Adding image tokens means appending another ~8k–16k rows (one per VQ codebook entry). Audio tokens add another few thousand. Vocab grows to ~100k–200k, embedding-table memory grows proportionally, but the rest of the model is unchanged.
- Modality markers. Special tokens <BOI>/<EOI> (begin / end of image) wrap image-token spans, so the model knows when it’s reading or writing one. Some implementations also add per-modality positional encodings.
- Output head, sometimes split. The vanilla design uses one big softmax over the unified vocab. Some designs (JanusFlow) use a separate head for each modality, on the grounds that text and image have very different output entropies and a shared head can be hard to balance.
- Training data must be interleaved. The whole point is to train on sequences like “caption <BOI> image <EOI> more caption <BOI> image…” so the model learns the joint distribution. Pure (text, image) pairs work but limit what the model can do.
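The changes listed above can be made concrete with a small sketch of a unified id space. All sizes, the id layout (text ids first, then image ids, then markers), and the helper names `image_span`, `modality`, and `interleave` are assumptions for illustration, not taken from any particular model.

```python
# Hypothetical sizes, chosen only to sit inside the rough ranges quoted above.
TEXT_VOCAB = 65_536      # assumed text vocabulary size
IMAGE_CODES = 8_192      # assumed VQ codebook size
D_MODEL = 4_096          # assumed embedding width

# Assumed unified id layout: [text ids | image ids | special markers].
IMAGE_OFFSET = TEXT_VOCAB
BOI = TEXT_VOCAB + IMAGE_CODES   # begin-of-image marker id
EOI = BOI + 1                    # end-of-image marker id
UNIFIED_VOCAB = EOI + 1


def image_span(codes):
    """Map VQ codebook indices to unified ids and wrap them in <BOI>/<EOI>."""
    return [BOI] + [IMAGE_OFFSET + c for c in codes] + [EOI]


def modality(token_id):
    """Recover the modality of a unified-vocab id from its id range."""
    if token_id in (BOI, EOI):
        return "marker"
    return "text" if token_id < IMAGE_OFFSET else "image"


def interleave(segments):
    """Flatten [("text", ids), ("image", codes), ...] into one token sequence."""
    seq = []
    for kind, ids in segments:
        seq.extend(image_span(ids) if kind == "image" else ids)
    return seq


sequence = interleave([
    ("text", [17, 942, 5]),        # stand-in caption token ids
    ("image", [3, 8191, 20, 7]),   # stand-in VQ codes (a real image uses 256+)
    ("text", [88, 12]),
])
print([modality(t) for t in sequence])

# Extra embedding rows relative to the text-only table, and their parameter cost.
extra_rows = UNIFIED_VOCAB - TEXT_VOCAB
print(f"extra rows: {extra_rows}, extra embedding params: {extra_rows * D_MODEL:,}")
```

Because modality is recoverable from the id range alone, the transformer body needs no modality-specific code path; only the tokenizers and the sampler care which range an id falls in.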
The decoding-mode switch
An autoregressive unified-token LM, naively used, generates image tokens one at a time — 256+ sequential forward passes for one image. That’s why most production systems use a hybrid per-modality decoding strategy:
| Span | Decoder | Why |
|---|---|---|
| Text tokens (CoT, answer) | autoregressive, one-at-a-time | strict left-to-right ordering; chain-of-thought needs sequential semantics |
| Image tokens (between <BOI> and <EOI>) | parallel / MaskGIT-style | ~10 forwards instead of 256; bidirectional context within the image |
| Audio tokens | parallel or RVQ-aware AR | residual VQ adds a layer dimension; usually autoregress along the layer axis, parallel within |
The transformer doesn’t know — the decoder strategy is a sampling-time choice. The same trained weights serve both modes. (You do train the model to handle the masking pattern that parallel decoding uses: masked image tokens at variable rates appear in the training mix.)
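To make the "about 10 forwards instead of 256" figure concrete, here is a minimal sketch of MaskGIT-style parallel decoding with a cosine unmasking schedule. The function `toy_predict` is a hypothetical stand-in for a transformer forward pass (it returns random guesses), and the codebook size, step count, and schedule are assumptions for illustration.

```python
import math
import random

MASK = -1            # placeholder id for a not-yet-decided image position
NUM_CODES = 8_192    # assumed VQ codebook size


def toy_predict(tokens, rng):
    """Hypothetical stand-in for one transformer forward pass.

    Returns a (token, confidence) guess for every position. A real model
    would condition on the already-committed tokens; this one is random.
    """
    return [(rng.randrange(NUM_CODES), rng.random()) for _ in tokens]


def parallel_decode(num_tokens=256, steps=10, seed=0):
    """Commit the most confident masked positions at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    forwards = 0
    for step in range(1, steps + 1):
        guesses = toy_predict(tokens, rng)
        forwards += 1
        # Cosine schedule: how many positions remain masked after this step.
        ratio = math.cos(math.pi / 2 * step / steps)
        keep_masked = max(0, math.floor(num_tokens * ratio))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_commit = len(masked) - keep_masked
        # Highest-confidence masked positions are committed; the rest wait.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens, forwards


tokens, forwards = parallel_decode()
print(f"{forwards} forward passes for {len(tokens)} image tokens")
```

The schedule commits only a few tokens in early steps, when the model has little context, and many in late steps. A one-token-at-a-time decoder would instead call the model once per image token.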
Interactive · causal mask over an interleaved sequence
(Interactive widget, not reproduced here: an editable interleaved sequence of text and image tokens.) The causal mask is the lower-triangular region: every token attends to itself and all earlier positions, regardless of modality.
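The causal mask described here can be sketched in a few lines. The example sequence and the `causal_mask` helper are illustrative assumptions; the point is only that the mask depends on position, never on modality.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Assumed toy sequence: two text tokens, an image span with markers, one text token.
modalities = ["text", "text", "marker", "image", "image", "marker", "text"]
mask = causal_mask(len(modalities))

# Print the lower-triangular pattern, one row per query position.
for kind, row in zip(modalities, mask):
    print(f"{kind:>6} ", "".join("x" if allowed else "." for allowed in row))
```

An image token at position 3 can read the text at positions 0 and 1, and the trailing text token can read the whole image span; nothing can read ahead.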
The Chameleon / JanusFlow design choices, in detail
| Choice | Chameleon (Meta 2024) | JanusFlow (DeepSeek 2024) | Comment |
|---|---|---|---|
| Tokenizer | VQGAN, 8k codes | separate encoders for understanding vs generation | JanusFlow argues understanding wants different features than generation; two tokenizers, one transformer |
| Image vocab vs. text vocab | fully merged into one | separate output heads, shared embedding | shared head is simplest, separate heads ease entropy balancing |
| Decoding for image | fully AR over image tokens | rectified-flow sampling: iterative parallel updates of a continuous latent | parallel wins on throughput; Chameleon’s AR choice made interleaving simpler at training time |
| Training data | interleaved web docs + image-text pairs | interleaved + reasoning-augmented data | interleaving is what enables “text → reason → image → text” chains at inference |
| Continuous head? | no — pure discrete | yes — FM-style head for generation, AR head for text | JanusFlow’s split is the modern compromise: text discrete (LM), image continuous (flow) within one transformer body |
Why this beats two-tower
- Reasoning can use image content. The model can produce a text chain-of-thought that references what it just “saw” (image tokens earlier in the context) and what it’s about to draw (image tokens later). Two-tower models can’t do this naturally — the vision tower has no read-write access to the LM’s reasoning state.
- Editing is in-band. “Take this image, replace the cat with a dog” becomes a single sequence: image tokens, instruction text, new image tokens. The model just generates the latter from the former; no separate editing pipeline.
- Self-supervision composes. Image-to-text and text-to-image are the same objective (next-token prediction) on different orderings of the same data. No need for two heads, two losses, two optimizers.
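The "same objective, different orderings" point can be illustrated with a toy (caption, image) pair. The token strings and the helper names below are hypothetical; a real pipeline would operate on integer ids from the unified vocabulary.

```python
BOI, EOI = "<BOI>", "<EOI>"


def captioning_order(caption, image_tokens):
    """Image first, then text: the model learns image-to-text."""
    return [BOI, *image_tokens, EOI, *caption]


def generation_order(caption, image_tokens):
    """Text first, then image: the model learns text-to-image."""
    return [*caption, BOI, *image_tokens, EOI]


def training_examples(sequence):
    """Next-token prediction: every prefix predicts the token that follows it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


caption = ["a", "red", "cat"]                  # stand-in text tokens
image_tokens = ["img_12", "img_907", "img_3"]  # stand-in VQ tokens

for build in (captioning_order, generation_order):
    seq = build(caption, image_tokens)
    prefix, target = training_examples(seq)[-1]
    print(f"{build.__name__}: final target {target!r} given {len(prefix)} prior tokens")
```

Both orderings pass through the same `training_examples` function, which is the sense in which one loss covers both directions.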
Trade-offs of full unification
- Output quality vs. specialized diffusion. The best specialized diffusion models (Imagen 3, MJ v7, “Flux.2 Pro”) still beat unified-token AR/parallel decoding on raw image fidelity at matched compute; you trade some quality for the joint-reasoning capability. Pro tiers often hybridize: the unified-token model produces a plan and rough latents, and a frozen diffusion decoder refines them.
- Vocab waste. The image rows of the vocabulary are dead weight whenever a session produces little or no image output: they still occupy embedding-table memory, and a shared output head still computes logits over them on every text-only turn.
- Training cost. Interleaved data is harder to curate than pure text or pure image-caption pairs. Most companies spend significant infra on the interleaved-data pipeline.
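A rough way to size the vocab-waste overhead, assuming a dense shared output head whose per-token cost scales with the number of vocabulary rows. The vocabulary sizes, model width, and the `head_cost` helper are assumptions for illustration only.

```python
def head_cost(text_vocab, image_vocab, d_model, special=2):
    """Per-token multiply-adds of a dense output head, split by vocab slice."""
    total_rows = text_vocab + image_vocab + special
    return {
        "total_macs": total_rows * d_model,
        "image_macs": image_vocab * d_model,
        "image_fraction": image_vocab / total_rows,
    }


# Assumed sizes: 65,536 text tokens, 8,192 image codes, width 4,096.
cost = head_cost(text_vocab=65_536, image_vocab=8_192, d_model=4_096)
print(f"image rows: {cost['image_fraction']:.1%} of shared-head logits")
print(f"wasted multiply-adds per text token: {cost['image_macs']:,}")
```

Under these assumptions the overhead is a fixed fraction of the head, not of the whole model, so it matters most for small models where the head is a large share of total compute.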
Where this sits in the series
This lesson built the substrate: a transformer that thinks in text and image tokens at once. Lesson 14 builds the capability on top: the chain-of-thought style reasoning that Nano Banana Pro and the GPT-Image-2 family ship with, and why that’s a qualitative change from prompt-rewriting tricks. Lesson 15 covers the alternative architecture — LLM-conditioned diffusion — that you’d pick if you don’t want to commit to the unified-token route.