Unified-token transformers

One transformer, one embedding table, one big vocabulary. Text and image tokens interleaved in a single sequence. The architectural backbone of every “native multimodal” flagship in 2026.

The architectural move

Before 2024-ish, multimodal models came in two flavors:

Two-tower / adapter: a vision encoder produces features, an LLM consumes them via cross-attention or learned adapters (BLIP-2, LLaVA, Flamingo). The vision and language paths are different networks; they meet only at an interface layer.
Separate generators: a text LLM rewrites the prompt; a frozen text-to-image diffusion model consumes the rewritten prompt (DALL-E 3 with GPT-4’s help). The generative path is again two systems glued at a string interface.

The unified-token move (Chameleon, Meta 2024; JanusFlow; LWM; the architectural family that underlies Gemini 2.5 Flash Image / “Nano Banana”, GPT-Image-2, etc.) collapses this:

Text tokens and image tokens are both just rows in the same embedding table. One forward pass; no adapter, no cross-attention to a separate vision tower.

Intuition · linear unpacking

Claim: an image can be made into “words” that live in the same sentence as text, so one ordinary language model handles both.

A token is just an integer. To a transformer, “text” is never letters — it is a list of integers, each one an index into a big lookup table of learned vectors. The model only ever sees integers and the vectors they point to.
An image can be turned into integers too. A VQ encoder (lesson 11) chops a picture into a grid of patches and replaces each patch with the index of the closest entry in a fixed codebook of ~8k visual “words.” So a picture becomes, say, 256 integers — exactly the same kind of thing a sentence is.
Give them non-overlapping integers. Text words use indices 0 to 99,999; image words get fresh indices starting at 100,000 and running up to roughly 108k. Now one lookup table has a row for every word of both kinds, and a single sequence can freely mix them.
The model can’t tell the difference, and doesn’t need to. It just predicts the next integer given the earlier ones. Predicting the next text word and predicting the next image patch are the same operation on different rows of the same table.

Central point. “Put the image in the embedding table” means: convert the image to integers from a reserved range, and next-token prediction now spans both modalities for free — no second network, no adapter.
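The steps above can be sketched in a few lines. This is a minimal illustration, not any particular model's tokenizer: the codebook is random rather than learned, the patch features are made up, and the sizes (TEXT_VOCAB, CODEBOOK_SIZE, 16-dimensional patches) are assumptions chosen to match the round numbers used in the text.

```python
import numpy as np

TEXT_VOCAB = 100_000        # assumed text vocabulary size (ids 0..99_999)
CODEBOOK_SIZE = 8_192       # assumed VQ codebook size ("~8k visual words")
IMAGE_OFFSET = TEXT_VOCAB   # first id of the reserved image range


def quantize_patches(patches: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each patch vector, the index of its nearest codebook entry."""
    # Squared distances via ||p||^2 - 2 p.c + ||c||^2, shape (num_patches, codebook_size).
    d2 = (
        (patches ** 2).sum(axis=1, keepdims=True)
        - 2.0 * patches @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


def to_unified_ids(codes: np.ndarray) -> np.ndarray:
    """Shift codebook indices into the reserved image range of the shared table."""
    return codes + IMAGE_OFFSET


rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, 16))   # toy stand-in for a learned codebook
patches = rng.normal(size=(256, 16))              # a 16x16 grid of toy patch features

image_ids = to_unified_ids(quantize_patches(patches, codebook))
text_ids = np.array([17, 4021, 99_999])           # made-up text token ids
sequence = np.concatenate([text_ids, image_ids])  # one flat list of integers
```

After this, `sequence` is a single integer array in which the first three entries index text rows and the remaining 256 index image rows; nothing downstream needs to know which is which.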

What changes vs. text-only LM

Embedding table grows. A text LM has vocab ~32k–100k. Adding image tokens means appending another ~8k–16k rows (one per VQ codebook entry). Audio tokens add another few thousand. With these figures the unified vocab lands at roughly 40k–120k; embedding-table and output-projection memory grow proportionally, but the rest of the model is unchanged.
Modality markers. Special tokens <BOI> / <EOI> (begin / end of image) wrap image-token spans, so the model knows when it’s reading or writing one. Some implementations also add per-modality positional encodings.
Output head, sometimes split. The vanilla design uses one big softmax over the unified vocab. Some designs (JanusFlow) use a separate head for each modality, on the grounds that text and image have very different output entropies and a shared head can be hard to balance.
Training data must be interleaved. The whole point is to train on sequences like “caption <BOI> image <EOI> more caption <BOI> image…” so the model learns the joint distribution. Pure (text, image) pairs work but limit what the model can do.
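To make the bookkeeping concrete, here is a hedged sketch of one possible vocabulary layout. The class name VocabLayout, the placement of <BOI> / <EOI> after the image range, and the model width of 4096 are all illustrative assumptions, not details of any specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabLayout:
    """Hypothetical unified-vocab layout: [text rows | image rows | special markers]."""
    text: int = 100_000
    image: int = 8_192
    special: tuple = ("<BOI>", "<EOI>")

    @property
    def image_start(self) -> int:
        return self.text

    @property
    def special_start(self) -> int:
        return self.text + self.image

    @property
    def total(self) -> int:
        return self.text + self.image + len(self.special)

    def token_id(self, name: str) -> int:
        """Id of a special marker such as "<BOI>"."""
        return self.special_start + self.special.index(name)

    def modality(self, token_id: int) -> str:
        """Classify an id purely by which range it falls in."""
        if not 0 <= token_id < self.total:
            raise ValueError(f"id {token_id} is outside the vocabulary")
        if token_id < self.image_start:
            return "text"
        if token_id < self.special_start:
            return "image"
        return "special"


def embedding_bytes(rows: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory for `rows` embedding vectors (2 bytes per parameter assumes bf16/fp16)."""
    return rows * d_model * bytes_per_param


layout = VocabLayout()
boi, eoi = layout.token_id("<BOI>"), layout.token_id("<EOI>")
# caption tokens, then a two-token "image", then more caption
sequence = [12, 57, boi, layout.image_start + 5, layout.image_start + 900, eoi, 88]
tags = [layout.modality(t) for t in sequence]
extra = embedding_bytes(layout.total - layout.text, d_model=4096)
```

Under these assumed sizes the added rows cost `extra` bytes of embedding memory, a small fraction of what the text rows already take, which is the sense in which only the table grows.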

The decoding-mode switch

An autoregressive unified-token LM, naively used, generates image tokens one at a time — 256+ sequential forward passes for one image. That’s why most production systems use a hybrid per-modality decoding strategy:

Span	Decoder	Why
Text tokens (CoT, answer)	autoregressive, one-at-a-time	strict left-to-right ordering; chain-of-thought needs sequential semantics
Image tokens (between `<BOI>` and `<EOI>`)	parallel / MaskGIT-style	~10 forwards instead of 256; bidirectional context within the image
Audio tokens	parallel or RVQ-aware AR	residual VQ adds a layer dimension; usually autoregress along the layer axis, parallel within

Here is the part that surprises people: the network itself doesn’t change between these modes. A trained transformer is just a function that, given a sequence, scores every possible next token. How you use those scores — reveal one token then re-run, or fill in a whole masked block at once — is a choice you make at sampling time, not a property baked into the weights. So the same weights run fully-sequential for text and block-parallel for image. (One caveat: you do show the model the masked-token pattern during training, so it has practiced predicting image tokens that have other image tokens still hidden around them.)

Intuition · linear unpacking

Claim: the same trained model can decode text one token at a time and images many-at-once, because decoding order is a runtime decision, not part of the weights.

What training buys you is a scorer. The transformer learns to answer one question: given the tokens I can see, how likely is each possible next token? It is a function from context to scores — nothing in it commits to when you ask.
Text must go in order. The next word genuinely depends on the previous word you just committed to, so you reveal one token, feed it back, ask again. Slow but necessary — chain-of-thought reasoning only makes sense left-to-right.
Image patches don’t. Within one image the patches are mutually consistent rather than strictly ordered, so you can guess a confident subset of them at once, fill those in, and re-ask for the rest — about ten rounds instead of 256.
You just have to warn it. To use the fast mode you train with image tokens randomly hidden, so the model has practiced predicting a patch while its neighbors are still blanks — the exact situation parallel decoding creates.

Central point. One set of weights, two decoding speeds: the model only ever scores next tokens, and you choose at inference whether to cash those scores in one-at-a-time (text) or a block-at-a-time (image).
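The difference between the two modes is easiest to see as two loops around the same scorer. The sketch below is an assumption-laden toy: CountingScorer returns random logits in place of a trained transformer, MASK = -1 is an arbitrary placeholder id, and the cosine unmasking schedule is one common MaskGIT-style choice rather than something the text prescribes.

```python
import numpy as np

MASK = -1  # placeholder id for a still-hidden image position (assumption)


class CountingScorer:
    """Toy stand-in for the trained transformer; counts forward passes."""

    def __init__(self, vocab: int, seed: int = 0):
        self.vocab = vocab
        self.rng = np.random.default_rng(seed)
        self.calls = 0

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        self.calls += 1
        # A real model would condition on `tokens`; random logits keep this runnable.
        return self.rng.normal(size=(len(tokens), self.vocab))


def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def parallel_decode(scorer, n_tokens: int, steps: int = 10) -> np.ndarray:
    """Fill a fully masked block in `steps` rounds, most confident positions first."""
    tokens = np.full(n_tokens, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = softmax(scorer(tokens))
        guess, conf = probs.argmax(axis=1), probs.max(axis=1)
        # Cosine schedule: how many positions should remain hidden after this round.
        keep_masked = int(np.floor(np.cos(np.pi / 2 * (step + 1) / steps) * n_tokens))
        n_reveal = max(1, masked.size - keep_masked)
        best = masked[np.argsort(-conf[masked])[:n_reveal]]
        tokens[best] = guess[best]
    return tokens


def sequential_decode(scorer, n_tokens: int) -> np.ndarray:
    """Reveal one position per forward pass, left to right."""
    tokens = np.full(n_tokens, MASK)
    for pos in range(n_tokens):
        tokens[pos] = softmax(scorer(tokens))[pos].argmax()
    return tokens


fast, slow = CountingScorer(vocab=512), CountingScorer(vocab=512)
image_fast = parallel_decode(fast, n_tokens=256, steps=10)
image_slow = sequential_decode(slow, n_tokens=256)
```

Comparing `fast.calls` with `slow.calls` shows the gap in forward passes for a 256-token image. Both loops call the scorer the same way; only the policy for committing tokens differs.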

Attention masks over an interleaved sequence

In an interleaved sequence of text and image tokens, the causal mask is the lower-triangular region of the attention matrix: every token attends to itself and all earlier positions, regardless of modality. Two mask modes are worth comparing: fully causal (vanilla AR) and parallel-image masking, where tokens inside one image span can also see each other in both directions (the mode MaskGIT-style image decoding relies on).
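The two attention modes can be written down directly as boolean matrices. This is a minimal sketch under stated assumptions: the per-position modality labels are supplied by hand, entry [i, j] means "position i may attend to position j", and the parallel-image variant simply opens up each contiguous run of image tokens, which is one plausible reading of parallel-image masking rather than a documented implementation.

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular mask: each position sees itself and everything earlier."""
    return np.tril(np.ones((n, n), dtype=bool))


def parallel_image_mask(modality: list) -> np.ndarray:
    """Causal everywhere, plus full bidirectional attention inside each image span."""
    n = len(modality)
    mask = causal_mask(n)
    start = None
    # The sentinel closes an image span that runs to the end of the sequence.
    for i, m in enumerate(list(modality) + ["<end>"]):
        if m == "image" and start is None:
            start = i
        elif m != "image" and start is not None:
            mask[start:i, start:i] = True
            start = None
    return mask


modality = ["text", "text", "image", "image", "image", "text"]
causal = causal_mask(len(modality))
parallel = parallel_image_mask(modality)
```

In `parallel`, image position 2 can see image position 4 (a later token in the same span), while text position 1 still cannot see any image token that follows it; in `causal` neither is allowed.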

The Chameleon / JanusFlow design choices, in detail

Choice	Chameleon (Meta 2024)	JanusFlow (DeepSeek 2024)	Comment
Tokenizer	VQGAN, 8k codes	separate encoders for understanding vs generation	JanusFlow argues understanding wants different features than generation; two tokenizers, one transformer
Image vocab vs. text vocab	fully merged into one	separate output heads, shared embedding	shared head is simplest, separate heads ease entropy balancing
Decoding for image	fully AR over image tokens	iterative rectified-flow sampling over the whole image latent, not token by token	parallel wins on throughput; Chameleon’s AR choice made interleaving simpler at training time
Training data	interleaved web docs + image-text pairs	interleaved + reasoning-augmented data	interleaving is what enables “text → reason → image → text” chains at inference
Continuous head?	no — pure discrete	yes — FM-style head for generation, AR head for text	JanusFlow’s split is the modern compromise: text discrete (LM), image continuous (flow) within one transformer body
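The shared-head versus split-head row can be illustrated with a toy output layer. This is a generic sketch over discrete tokens with random weights and made-up sizes; it is not JanusFlow's or Chameleon's actual head, and the function names are hypothetical.

```python
import numpy as np

D_MODEL, TEXT_VOCAB, IMAGE_VOCAB = 64, 1000, 256   # toy sizes (assumptions)

rng = np.random.default_rng(0)
W_text = rng.normal(scale=0.02, size=(TEXT_VOCAB, D_MODEL))    # text output rows
W_image = rng.normal(scale=0.02, size=(IMAGE_VOCAB, D_MODEL))  # image output rows


def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()


def shared_head(h: np.ndarray) -> np.ndarray:
    """One softmax over the unified vocab: text and image rows compete for mass."""
    return softmax(np.concatenate([W_text, W_image]) @ h)


def split_head(h: np.ndarray, modality: str) -> np.ndarray:
    """Per-modality softmax: only the active modality's rows are scored."""
    W = W_text if modality == "text" else W_image
    return softmax(W @ h)


def sample_unified_id(h: np.ndarray, modality: str, rng: np.random.Generator) -> int:
    """Sample from the split head, then map back into the unified id range."""
    probs = split_head(h, modality)
    local = rng.choice(len(probs), p=probs)
    return int(local) + (TEXT_VOCAB if modality == "image" else 0)


h = rng.normal(size=D_MODEL)             # hidden state from the shared transformer body
p_shared = shared_head(h)
p_image = split_head(h, "image")
next_id = sample_unified_id(h, "image", rng)
```

With the split head, the image distribution is normalized over image rows only, so its entropy is never traded off against text logits; the price is that something outside the softmax (here the `modality` argument) has to decide which head is active.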

Why this beats two-tower

Reasoning can use image content. The model can produce a text chain-of-thought that references what it just “saw” (image tokens earlier in the context) and what it’s about to draw (image tokens later). Two-tower models can’t do this naturally — the vision tower has no read-write access to the LM’s reasoning state.
Editing is in-band. “Take this image, replace the cat with a dog” becomes a single sequence: image tokens, instruction text, new image tokens. The model just generates the latter from the former; no separate editing pipeline.
Self-supervision composes. Image-to-text and text-to-image are the same objective (next-token prediction) on different orderings of the same data. No need for two heads, two losses, two optimizers.
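The three points above reduce to how one sequence is laid out. The sketch below is illustrative only: the marker ids BOI and EOI and all token values are made up, and the helper names are hypothetical. It shows that captioning, generation, and editing differ in ordering while sharing the same shift-by-one training targets.

```python
BOI, EOI = 108_192, 108_193   # assumed ids for the <BOI> / <EOI> markers


def captioning_order(caption: list, image: list) -> list:
    """Image first, then text: the model learns image -> caption."""
    return [BOI, *image, EOI, *caption]


def generation_order(caption: list, image: list) -> list:
    """Text first, then image: the model learns caption -> image."""
    return [*caption, BOI, *image, EOI]


def edit_order(source: list, instruction: list, target: list) -> list:
    """In-band editing: source image, instruction text, then the edited image."""
    return [BOI, *source, EOI, *instruction, BOI, *target, EOI]


def next_token_pairs(sequence: list) -> tuple:
    """Inputs and targets for next-token prediction: the same shift for every layout."""
    return sequence[:-1], sequence[1:]


caption = [11, 42, 7]                 # made-up text token ids
image = [100_003, 100_250, 104_999]   # made-up image token ids
edited = [100_003, 101_777, 104_999]  # made-up ids for the edited image

for seq in (
    captioning_order(caption, image),
    generation_order(caption, image),
    edit_order(image, caption, edited),
):
    inputs, targets = next_token_pairs(seq)
    assert len(inputs) == len(targets) == len(seq) - 1
```

Every layout goes through the same `next_token_pairs` call, which is the sense in which one objective covers all three tasks.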

Trade-offs of full unification

When unified hurts

Output quality vs. specialized diffusion. Best-in-class specialized diffusion models (Imagen 3, Midjourney v7, “Flux.2 Pro”) still beat unified-token AR/parallel decoding on raw image fidelity at matched compute: you trade some quality for the joint-reasoning capability. Pro tiers often hybridize: the unified-token model produces a plan and rough latents, and a frozen diffusion decoder refines them.
Vocab waste. The image vocab is much larger than necessary if you only ever generate one image per session; on every text-only turn you still hold the image rows in memory and still compute the output softmax over the full unified vocab.
Training cost. Interleaved data is harder to curate than pure text or pure image-caption pairs. Most companies invest significant infrastructure in the interleaved-data pipeline.

Where this sits in the series

This lesson built the substrate: a transformer that thinks in text and image tokens at once. Lesson 14 builds the capability on top: the chain-of-thought style reasoning that Nano Banana Pro and the GPT-Image-2 family ship with, and why that’s a qualitative change from prompt-rewriting tricks. Lesson 15 covers the alternative architecture — LLM-conditioned diffusion — that you’d pick if you don’t want to commit to the unified-token route.

Punchline

Tokenize the image, append it to the LM’s embedding table, train one transformer on interleaved [text, image] sequences. The same network now reads, reasons about, and generates both modalities. Per-modality decoder choices (AR for text, parallel for image) are a sampling-time switch. That’s the architectural family every “native multimodal” flagship belongs to.