Reasoning + image generation

The qualitative shift in late-2025 / 2026 image models: chain-of-thought before drawing, with Nano Banana Pro and the GPT-Image-2 family as the archetypal examples.

What’s public vs. inferred

This lesson focuses on the architectural pattern. The detailed weights, training data, and pipeline of Nano Banana Pro (Google), GPT-Image-2 (OpenAI), and similar production systems are proprietary. What’s public from research papers and product reporting: they are built on unified-token transformers (lesson 13), they use chain-of-thought-style reasoning over image generation tasks, and they hybridize discrete and continuous decoders. Anything more specific is inferred from outputs and reasonable architecture priors.

What changed

The first wave of text-to-image models (DALL-E 1, SD 1.x, Midjourney v3) treated the prompt as a fixed string. You wrote a sentence, it produced an image. If the result wasn’t what you wanted, you iterated on the prompt — manually.

The reasoning-augmented wave (Nano Banana Pro, the GPT-Image-2 family, and reasoning-style image generation in Gemini-class native multimodal models) treats the prompt as a problem to solve. The model:

Reads the prompt.
Writes a chain-of-thought in text tokens: parses the request, lists constraints, sketches what the image should contain, considers ambiguities, picks a layout.
(Optionally) writes intermediate planning artifacts: a textual scene graph, bounding boxes for major elements, a color palette, a style description.
Then emits image tokens conditioned on all of the above.
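The four steps above can be pictured as one flat sequence with labeled segments. The following is a minimal sketch, not a description of any production system: the `Segment` type, the segment names, and the token strings are all hypothetical and chosen only to show that the image tokens come last and condition on everything before them.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str      # "prompt", "cot", "plan", or "image" (hypothetical labels)
    tokens: list


def build_sequence(prompt, cot, plan, image):
    """Concatenate the stages in the order they appear in context."""
    segments = [Segment("prompt", prompt), Segment("cot", cot)]
    if plan:  # planning artifacts are optional
        segments.append(Segment("plan", plan))
    segments.append(Segment("image", image))
    return segments


def context_for_image(segments):
    """Collect every token that precedes the image segment."""
    context = []
    for segment in segments:
        if segment.kind == "image":
            break
        context.extend(segment.tokens)
    return context


segments = build_sequence(
    prompt=["three", "apples", "on", "a", "table"],
    cot=["count=3", "layout=triangle"],
    plan=["box:apple:front-left", "box:apple:front-right", "box:apple:center-back"],
    image=["<BOI>", "img_17", "img_42", "<EOI>"],
)
print([segment.kind for segment in segments])
print(len(context_for_image(segments)), "tokens visible before the first image token")
```

Dropping the `plan` argument (passing an empty list) gives the three-segment variant where the model goes straight from chain-of-thought to image tokens.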

Why chain-of-thought helps generation

For tasks that are decomposable, planning beats pattern-matching:

Text rendering. Spelling words correctly in an image is hard without a plan; if the model can write “the sign should say STOP in white letters on red” in its CoT, then emit image tokens that try to honor that string, the error rate drops. Text rendering is the headline capability where reasoning models most visibly outperform the prior generation.
Counting. “Three apples on a table” was historically unreliable for image models: they would often produce two or four apples instead of three. A model that can write “three apples, one front-left, one front-right, one center-back” before drawing gets to three much more reliably.
Spatial composition. “A red cube on top of a blue cylinder” requires understanding the geometric relation. A CoT can spell out the bounding boxes explicitly before the image tokens commit.
Constraint negotiation. “A photorealistic medieval knight holding a smartphone” has an internal tension (medieval / smartphone). A CoT can explicitly resolve: “the smartphone is the focal anomaly, render it crisp; the rest is period-appropriate.”
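To make the counting and spatial-composition cases concrete, here is a small sketch of what an explicit plan buys you: once the layout is written down as boxes, the constraints become checkable before any image token is committed. The `Box` type, the coordinate convention (y grows downward, as in image coordinates), and the specific numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x0: int
    y0: int
    x1: int
    y1: int


def count_label(plan, label):
    """How many planned boxes carry this label."""
    return sum(1 for box in plan if box.label == label)


def is_on_top_of(upper, lower):
    """True if `upper` sits at or above the top edge of `lower` and overlaps it horizontally."""
    overlaps_horizontally = upper.x0 < lower.x1 and lower.x0 < upper.x1
    return upper.y1 <= lower.y0 and overlaps_horizontally


# "Three apples on a table": one box per apple, as the CoT would spell out.
apples_plan = [
    Box("apple", 10, 60, 30, 80),   # front-left
    Box("apple", 70, 60, 90, 80),   # front-right
    Box("apple", 40, 30, 60, 50),   # center-back
]

# "A red cube on top of a blue cylinder".
stack_plan = [
    Box("red cube", 40, 20, 60, 40),
    Box("blue cylinder", 35, 40, 65, 80),
]

print(count_label(apples_plan, "apple") == 3)
print(is_on_top_of(stack_plan[0], stack_plan[1]))
```

The point is not that a production model runs checks like these literally, but that a written plan turns a fuzzy request into a handful of locally easy decisions.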

Editing as token surgery

Unified-token transformers make editing into a one-shot prompt:

# Conceptual sequence at inference time
<input image tokens>          # the original
<text: "remove the lamp">     # the instruction
<BOI> ... <EOI>               # what the model emits — the edited image

Here is the trick that makes editing cheap. The model is just predicting the next token, and the source image is sitting right there in its context as a sequence of image tokens. So when it emits the edited image, the easiest thing it can do for any region the instruction didn’t mention is to re-emit the same token it already sees in the source — an almost-free copy. It only has to do real work where the instruction asks for change, producing new tokens there that diverge from the source. The reason this is possible at all is that the source and the output speak the same token vocabulary: a token isn’t a description of a patch of image, it is the patch, so “leave this alone” reduces to “copy this symbol.”
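A toy version of the copy-versus-rewrite behavior, under a loud simplification: a real model has no explicit edit mask and decides token by token, whereas this sketch hands it the set of touched positions directly. The token ids, the 4×4 grid, and the `generate_token` callback are hypothetical.

```python
def apply_edit(source_tokens, touched, generate_token):
    """Re-emit source tokens verbatim except at positions the instruction touches."""
    output = []
    for position, token in enumerate(source_tokens):
        if position in touched:
            output.append(generate_token(position))  # real work: a new token
        else:
            output.append(token)                     # near-free copy from context
    return output


# A 4x4 "image" flattened to 16 token ids (hypothetical codebook indices):
# 7 stands for wall texture, 42 for the lamp.
source = [7, 7, 7, 7,
          7, 42, 42, 7,
          7, 42, 42, 7,
          7, 7, 7, 7]
lamp_region = {5, 6, 9, 10}  # positions touched by "remove the lamp"

# Fill the touched region with the wall token.
edited = apply_edit(source, lamp_region, lambda position: 7)

copied = sum(s == e for s, e in zip(source, edited))
print(edited)
print(f"{copied}/{len(source)} tokens copied unchanged")  # 12/16 tokens copied unchanged
```

Only the four lamp positions required a new decision; the other twelve are exact copies, which is why the untouched regions come out lossless.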

Intuition · linear unpacking

Claim: a unified-token model can edit by copying most of the image and rewriting only the part the instruction touches.

Same alphabet. The source image and the edited image are written in the identical vocabulary of image tokens. A token is a concrete piece of picture, not a sentence about it.
The model only does one thing. Generating the edit is just “predict the next token”, over and over, with the source image visible in context the whole time.
Copying is the path of least resistance. For any region the instruction never mentioned, the token that best fits is the one already there in the source. The model can re-emit it verbatim — cheap, lossless, no guessing.
Effort goes only where asked. Where the instruction does ask for change, the model emits different tokens, so the output diverges from the source exactly at the edited region and nowhere else.

Central point. Editing isn’t a special mode — it’s ordinary token prediction where “preserve this” happens to mean “copy this symbol,” which is why a unified-token model can leave most of an image untouched while surgically changing one part.

Reasoning helps editing in the same way it helps generation. The CoT spells out: “the lamp is the brass object near the top-left; replace its tokens with a continuation of the wall texture; preserve everything else.”

Interactive · “think then draw” side by side

The lesson’s interactive demo runs a mock model on the same prompt under two policies. Left: no CoT, the model just emits image tokens. Right: with CoT, the model emits planning tokens first, then image tokens. The “image” is a synthetic 8×8 token grid; the planning step modifies the probabilities the model uses for token decisions.
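The same comparison can be approximated offline with a few lines of Python. This is a sketch of the idea, not the demo's actual code: the per-cell probabilities (`p_hit`, `p_miss`), the planned cell positions, and the trial count are all assumptions picked for illustration.

```python
import random

GRID = 8  # 8x8 token grid, matching the demo description


def sample_grid(probs, rng):
    """Sample each cell independently: 1 = object token, 0 = background token."""
    return [[1 if rng.random() < probs[r][c] else 0 for c in range(GRID)]
            for r in range(GRID)]


def no_plan_probs(n_objects):
    """No CoT: the model only knows roughly how much 'object' should appear overall."""
    p = n_objects / (GRID * GRID)
    return [[p] * GRID for _ in range(GRID)]


def planned_probs(cells, p_hit=0.95, p_miss=0.001):
    """With CoT: the plan sharpens the probabilities at the chosen cells."""
    probs = [[p_miss] * GRID for _ in range(GRID)]
    for r, c in cells:
        probs[r][c] = p_hit
    return probs


def exact_count_rate(probs, target, trials, seed):
    """Fraction of sampled grids that contain exactly `target` object tokens."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        grid = sample_grid(probs, rng)
        if sum(map(sum, grid)) == target:
            hits += 1
    return hits / trials


plan_cells = [(5, 1), (5, 6), (2, 3)]  # front-left, front-right, center-back
rate_without = exact_count_rate(no_plan_probs(3), target=3, trials=500, seed=0)
rate_with = exact_count_rate(planned_probs(plan_cells), target=3, trials=500, seed=0)
print(f"exactly three objects, no CoT:   {rate_without:.2f}")
print(f"exactly three objects, with CoT: {rate_with:.2f}")
```

Both policies aim for three objects on average; only the planned one concentrates probability on three specific cells, so it lands on exactly three far more often.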

Reasoning over multiple input images

The unified-token architecture handles “here are three reference images, combine them in this way” with no special pipeline:

<image_1 tokens>
<text: "use the woman from image 1, the background from image 2,">
<text: "and apply the lighting from image 3.">
<image_2 tokens>
<image_3 tokens>
<text: "produce a composite that ...">
<BOI> ... <EOI>

The transformer attends across all input image spans and the text in one pass; the output image tokens condition on everything. Reasoning helps the model decompose the constraints (whose face, whose background, whose lighting) before committing to the composite.
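One way to see why no special pipeline is needed: the context is just a concatenation, and each reference image is a contiguous span inside it. The sketch below builds such a context and records the spans; the part names and token strings are hypothetical, and the two instruction lines of the sequence above are merged into one part for brevity.

```python
def build_context(parts):
    """Flatten (name, tokens) parts into one sequence and record each part's span."""
    tokens, spans = [], {}
    for name, part_tokens in parts:
        start = len(tokens)
        tokens.extend(part_tokens)
        spans[name] = (start, len(tokens))  # half-open interval [start, end)
    return tokens, spans


parts = [
    ("image_1", ["i1_a", "i1_b", "i1_c"]),
    ("instruction_1", ["use", "woman", "from", "image", "1"]),
    ("image_2", ["i2_a", "i2_b"]),
    ("image_3", ["i3_a", "i3_b"]),
    ("instruction_2", ["produce", "a", "composite"]),
]
tokens, spans = build_context(parts)
output_start = len(tokens)  # the first output image token is emitted here

print(spans)
# Every input span ends at or before the output position, so the output
# can attend to all three reference images and both instructions.
print(all(end <= output_start for _, end in spans.values()))
```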

What this enables that prompt-rewriting didn’t

DALL-E 3 famously uses GPT-4 to rewrite the user’s prompt into a more detailed caption that the underlying diffusion model understands better. That’s a useful trick, but it’s strictly weaker than in-band CoT:

Capability	Prompt rewriting	In-band CoT
Plan responds to user intent	yes (LLM rewrites)	yes
Generator can attend to plan during emission	only via the rewritten prompt’s text embedding	directly via attention to plan tokens
Plan can reference image tokens of input images	only verbally	directly (tokens in same vocabulary)
Plan can iterate on a partial generation	no (one-shot)	yes (model can emit some tokens, “reconsider”, emit more)
Same model for understanding and generation	no (two models)	yes (one transformer)

The last row is the structural one. With two models, the rewriter never sees the generator’s output, and the generator never sees the rewriter’s reasoning. With one transformer, every token can attend to every prior token; reasoning and emission share state.
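The structural difference shows up as a question of what is in reach of attention when the last image token is emitted. The sketch below is a deliberate simplification: it treats the handoff generator as if it were also a causal sequence model (a diffusion generator is not), and the segment labels and lengths are hypothetical.

```python
def visible_kinds(labels, position):
    """Segment kinds that the token at `position` can attend to under a causal mask."""
    return set(labels[: position + 1])


# One transformer: prompt, reasoning, and image tokens share a single sequence.
one_model = ["prompt"] * 3 + ["reasoning"] * 4 + ["image"] * 5

# Two-model handoff: the generator's context holds only the rewritten prompt
# and its own output; the rewriter's reasoning never enters it.
handoff_generator = ["rewritten_prompt"] * 3 + ["image"] * 5

last_in_band = visible_kinds(one_model, len(one_model) - 1)
last_handoff = visible_kinds(handoff_generator, len(handoff_generator) - 1)

print(sorted(last_in_band))   # ['image', 'prompt', 'reasoning']
print(sorted(last_handoff))   # ['image', 'rewritten_prompt']
```

The `reasoning` label appears only in the first set: that is the whole difference the table's last row is pointing at.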

Intuition · linear unpacking

Claim: in-band CoT beats prompt-rewriting because the plan is still there when the image tokens are being drawn, not summarized away into one embedding.

Two models, two rooms. In prompt-rewriting, an LLM writes a better prompt, hands it off, and leaves. The generator only ever receives the final rewritten sentence — encoded into a fixed set of text embeddings. All the LLM’s intermediate reasoning is gone.
An embedding is a summary, not a transcript. The generator can’t look back at why the prompt says what it says, or at the original images the planner saw. It gets the conclusion, never the work.
One model keeps the work on the table. With a single transformer, the plan is written out as actual tokens that stay in the context window. When the model starts emitting image tokens, attention can reach back and read those plan tokens directly — “I said three apples, I’ve drawn two, one to go.”
Shared state enables mid-course correction. Because reasoning and emission live in the same sequence, the model can emit some image tokens, notice they violate the plan, and adjust — impossible across a one-shot handoff between two separate models.

Central point. The plan helps most when the generator can keep reading it while drawing; prompt-rewriting throws the reasoning away and passes on only the final prompt before drawing starts, while in-band CoT leaves every reasoning token in reach of attention until the last pixel is committed.

Trade-offs and failure modes

Where reasoning models still lose

Raw aesthetic quality, specific styles. The best diffusion specialists at the time of writing (Midjourney v7, Flux v2 Pro, etc.) still produce more visually striking images at matched user effort. Reasoning models close the gap on “the user got what they asked for” but not yet on “the image is beautiful on its own merits”.
Latency. Emitting hundreds of CoT tokens before any image token shows up adds latency. Worth it for tasks that need planning; overhead for simple stylization.
Compute. Joint training is expensive; you can’t hot-swap a better text model into a unified-token system without retraining.
Inheritance of LLM failure modes. Reasoning models hallucinate plans, then commit to them. A diffusion model would hallucinate pixels; a reasoning model hallucinates pixels that match a confidently-wrong plan, which can look more like a deliberate error and be harder to debug.
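The latency cost in particular is easy to put rough numbers on. The function below is back-of-the-envelope arithmetic only; the token count, decode rate, and image time in the usage line are made-up illustrative values, not measurements of any real system.

```python
def cot_overhead(n_cot_tokens, text_tokens_per_second, image_seconds):
    """Return (seconds spent on CoT, fraction of total latency spent on CoT)."""
    cot_seconds = n_cot_tokens / text_tokens_per_second
    total_seconds = cot_seconds + image_seconds
    return cot_seconds, cot_seconds / total_seconds


# Hypothetical: 300 planning tokens at 100 tokens/s, then 6 s to emit the image.
seconds, fraction = cot_overhead(300, 100.0, 6.0)
print(f"CoT adds {seconds:.1f} s, {fraction:.0%} of total latency")
```

For a task that needs planning, that share buys correctness; for a simple stylization it is pure overhead.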

The pattern, abstracted

Stage	What happens	Why
Parse	model writes text identifying entities, attributes, relations in prompt	makes structure explicit; lets later stages reference
Plan	model writes layout, color palette, style, maybe per-region captions	turns one image into a series of locally-easy decisions
Reason about constraints	model resolves conflicts, fills gaps, chooses defaults	this is where “think harder” pays off — counting, text rendering, composition
(optional) intermediate sketch	some systems emit a low-res image first, evaluate, refine	multi-pass refinement; expensive but improves consistency
Emit	parallel-decode the final image tokens	committed plan now controls token sampling
(optional) diffusion polish	frozen diffusion decoder takes the image tokens and refines pixels	last-mile quality boost; the “Pro” tier in many product names
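The table reads naturally as a pipeline with two optional stages. A minimal sketch of that control flow, with placeholder stages that only record that they ran; the `Stage` type, the stage names, and the `enable` parameter are all hypothetical, and a real system would emit tokens at each stage rather than append a name to a list.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    optional: bool
    run: Callable[[list], list]


def make_stage(name, optional=False):
    def run(trace):
        # Placeholder work: record that the stage ran.
        return trace + [name]
    return Stage(name, optional, run)


PIPELINE = [
    make_stage("parse"),
    make_stage("plan"),
    make_stage("reason_about_constraints"),
    make_stage("intermediate_sketch", optional=True),
    make_stage("emit"),
    make_stage("diffusion_polish", optional=True),
]


def run_pipeline(stages, enable=frozenset()):
    """Run required stages in order; run optional ones only if named in `enable`."""
    trace = []
    for stage in stages:
        if stage.optional and stage.name not in enable:
            continue
        trace = stage.run(trace)
    return trace


print(run_pipeline(PIPELINE))
print(run_pipeline(PIPELINE, enable={"intermediate_sketch", "diffusion_polish"}))
```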

Punchline

Reasoning + generation = the same unified-token transformer that thinks in language tokens, then writes image tokens, with every token able to attend to everything before it. The model gets to use its language capabilities (parse, plan, count, spell) in service of generation. The dramatic improvements in text rendering, counting, and instruction-following in late-2025 / 2026 image models all flow from this one architectural choice.