Reasoning + image generation
The qualitative shift in late-2025 / 2026 image models: chain-of-thought before drawing. Nano Banana Pro and the GPT-Image-2 family as the archetypal examples.
What changed
The first wave of text-to-image models (DALL-E 1, SD 1.x, Midjourney v3) treated the prompt as a fixed string. You wrote a sentence, it produced an image. If the result wasn’t what you wanted, you iterated on the prompt — manually.
The reasoning-augmented wave (Nano Banana Pro, the GPT-Image-2 family, and reasoning-style image generation in Gemini and other natively multimodal frontier models) treats the prompt as a problem to solve. The model:
- Reads the prompt.
- Writes a chain-of-thought in text tokens: parses the request, lists constraints, sketches what the image should contain, considers ambiguities, picks a layout.
- (Optionally) writes intermediate planning artifacts: a textual scene graph, bounding boxes for major elements, a color palette, a style description.
- Then emits image tokens conditioned on all of the above.
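The ordering of these steps can be sketched as a data layout. The snippet below is a minimal illustration, not any vendor's actual format: the `Segment` class, the segment names, and the placeholder token strings are all assumptions made up for this example. It only shows that the image span comes last, so everything before it is available as context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str          # "prompt", "cot", "plan", or "image" (hypothetical labels)
    tokens: List[str]  # placeholder strings, not a real vocabulary


def build_sequence(prompt: str, cot: str, plan: str, n_image_tokens: int) -> List[Segment]:
    """Lay out one generation as ordered segments: prompt, CoT, plan, image."""
    image = ["<BOI>"] + [f"<img_{i}>" for i in range(n_image_tokens)] + ["<EOI>"]
    return [
        Segment("prompt", prompt.split()),
        Segment("cot", cot.split()),
        Segment("plan", plan.split()),
        Segment("image", image),
    ]


def context_for(segments: List[Segment], kind: str) -> List[str]:
    """Return the kinds of all segments that precede the first segment of `kind`."""
    kinds = [s.kind for s in segments]
    return kinds[: kinds.index(kind)]


seq = build_sequence(
    prompt="three apples on a table",
    cot="the user wants exactly three apples; place them so none overlap",
    plan="apple front-left; apple front-right; apple center-back",
    n_image_tokens=4,
)
print(context_for(seq, "image"))  # ['prompt', 'cot', 'plan']
```

In a real system the CoT and plan would be sampled by the model rather than passed in as strings; they are arguments here only to keep the sketch self-contained.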
Why chain-of-thought helps generation
For tasks that are decomposable, planning beats pattern-matching:
- Text rendering. Spelling words correctly in an image is hard without a plan. If the model first writes “the sign should say STOP in white letters on red” in its CoT and then emits image tokens that try to honor that string, the error rate drops sharply. Text rendering is the headline benchmark where reasoning models stomp the prior generation.
- Counting. “Three apples on a table” was historically unreliable for image models: they would produce two, three, or four apples with little preference for the right count. A model that can write “three apples, one front-left, one front-right, one center-back” before drawing gets to three much more reliably.
- Spatial composition. “A red cube on top of a blue cylinder” requires understanding the geometric relation. A CoT can spell out the bounding boxes explicitly before the image tokens commit.
- Constraint negotiation. “A photorealistic medieval knight holding a smartphone” has an internal tension (medieval / smartphone). A CoT can explicitly resolve: “the smartphone is the focal anomaly, render it crisp; the rest is period-appropriate.”
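One way to picture why an explicit plan helps with counting and spatial composition is that a plan can be checked before any image token is committed. The sketch below assumes a hypothetical plan format (a list of labeled bounding boxes in normalized image coordinates); the `Box` class and both helper functions are invented for illustration and do not describe how any particular model represents its plan.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    label: str
    x0: float
    y0: float  # top edge; y grows downward, as in image coordinates
    x1: float
    y1: float  # bottom edge


def count_label(plan: List[Box], label: str) -> int:
    """Number of planned objects carrying this label."""
    return sum(1 for b in plan if b.label == label)


def is_on_top_of(upper: Box, lower: Box, tol: float = 0.02) -> bool:
    """True if `upper` rests on `lower`: bottom edge meets top edge, with horizontal overlap."""
    touches = abs(upper.y1 - lower.y0) <= tol
    overlaps = min(upper.x1, lower.x1) > max(upper.x0, lower.x0)
    return touches and overlaps


# "A red cube on top of a blue cylinder", written out as boxes.
stack = [
    Box("red cube", 0.40, 0.30, 0.60, 0.50),
    Box("blue cylinder", 0.35, 0.50, 0.65, 0.90),
]

# "Three apples on a table", written out as three separate boxes.
apples = [
    Box("apple", 0.10, 0.60, 0.25, 0.75),
    Box("apple", 0.70, 0.60, 0.85, 0.75),
    Box("apple", 0.40, 0.45, 0.55, 0.60),
]

print(is_on_top_of(stack[0], stack[1]))  # True
print(count_label(apples, "apple"))      # 3
```

The point of the sketch is only that a count or a geometric relation becomes a simple check once it is written down as structure, which is much harder to guarantee when it is implicit in sampled pixels.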
Editing as token surgery
Unified-token transformers turn editing into a one-shot prompt:

```text
# Conceptual sequence at inference time
<input image tokens>      # the original
<text: "remove the lamp"> # the instruction
<BOI> ... <EOI>           # what the model emits: the edited image
```
The model has the source image and the instruction in context, and it learned during training on interleaved data to honor such instructions. It produces new image tokens that selectively diverge from the source where the instruction asks for change. Because both source and target are in the same vocabulary, the model can literally copy unchanged regions token-for-token.
Reasoning helps editing in the same way it helps generation. The CoT spells out: “the lamp is the brass object near the top-left; replace its pixels with a continuation of the wall texture; preserve everything else.”
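The copy-versus-diverge behavior can be illustrated on a toy token grid. This is a sketch under strong assumptions: the token ids are made up, the edit region is given as a rectangle, and the replacement is a constant fill, whereas a real model would sample new tokens for the edited region conditioned on the instruction.

```python
from typing import List, Tuple

Grid = List[List[int]]


def apply_edit(source: Grid, region: Tuple[int, int, int, int], fill_token: int) -> Grid:
    """Return a new grid: tokens inside `region` are replaced, all others are copied."""
    r0, c0, r1, c1 = region  # half-open: rows r0..r1-1, cols c0..c1-1
    return [
        [fill_token if (r0 <= r < r1 and c0 <= c < c1) else tok
         for c, tok in enumerate(row)]
        for r, row in enumerate(source)
    ]


def changed_fraction(source: Grid, edited: Grid) -> float:
    """Fraction of token positions where the edited grid differs from the source."""
    total = sum(len(row) for row in source)
    diff = sum(a != b for ra, rb in zip(source, edited) for a, b in zip(ra, rb))
    return diff / total


WALL, LAMP = 7, 42                    # made-up token ids
source = [[WALL] * 4 for _ in range(4)]
source[0][0] = source[0][1] = LAMP    # the "lamp" sits in the top-left

edited = apply_edit(source, region=(0, 0, 1, 2), fill_token=WALL)
print(changed_fraction(source, edited))  # 0.125
```

Only 2 of the 16 positions change; the other 14 are copied unchanged, which is the property the shared vocabulary makes possible.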
“Think then draw” side by side
Consider a mock model running the same prompt under two policies. Without CoT, it just emits image tokens. With CoT, it emits planning tokens first, then image tokens. The “image” is a synthetic 8×8 token grid; the planning step modifies the probabilities the model uses for token decisions.
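A rough offline version of this comparison can be written in a few lines. Everything here is an assumption chosen for illustration: the four-token vocabulary, the "sky over ground" target, and the `strength` parameter that controls how strongly the plan shifts sampling probabilities. It is a toy, not a model of any real system.

```python
import random

VOCAB = [0, 1, 2, 3]  # four made-up image-token ids
SIZE = 8


def target_grid():
    """Toy 'intended image': top half token 1 (sky), bottom half token 2 (ground)."""
    return [[1 if r < SIZE // 2 else 2 for _ in range(SIZE)] for r in range(SIZE)]


def sample_grid(rng, plan=None, strength=0.85):
    """Sample a grid; when a plan is given, it shifts probability mass toward planned tokens."""
    grid = []
    for r in range(SIZE):
        row = []
        for c in range(SIZE):
            if plan is None:
                weights = [1.0] * len(VOCAB)  # no plan: uniform over the vocabulary
            else:
                rest = (1.0 - strength) / (len(VOCAB) - 1)
                weights = [strength if t == plan[r][c] else rest for t in VOCAB]
            row.append(rng.choices(VOCAB, weights=weights)[0])
        grid.append(row)
    return grid


def accuracy(grid, target):
    """Fraction of cells whose sampled token matches the target token."""
    hits = sum(g == t for rg, rt in zip(grid, target) for g, t in zip(rg, rt))
    return hits / (SIZE * SIZE)


rng = random.Random(0)
target = target_grid()
no_cot = accuracy(sample_grid(rng), target)
with_cot = accuracy(sample_grid(rng, plan=target), target)
print(f"no plan: {no_cot:.2f}  with plan: {with_cot:.2f}")
```

With a uniform policy the expected match rate is 25%, and with the plan applied it is about 85%, so the planned grid should match the target far more often. The exact numbers depend on the seed.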
Reasoning over multiple input images
The unified-token architecture handles “here are three reference images, combine them in this way” with no special pipeline:
```text
<image_1 tokens>
<text: "use the woman from image 1, the background from image 2,">
<text: "and apply the lighting from image 3.">
<image_2 tokens>
<image_3 tokens>
<text: "produce a composite that ...">
<BOI> ... <EOI>
```
The transformer attends across all input image spans and the text in one pass; the output image tokens condition on everything. Reasoning helps the model decompose the constraints (whose face, whose background, whose lighting) before committing to the composite.
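To make "no special pipeline" concrete, the sketch below flattens an interleaved list of images and text into a single token list and records where each image span lands. The item format, the `flatten` helper, and the token strings are hypothetical, and the final instruction is shortened because the original is elided.

```python
def flatten(items):
    """Flatten interleaved ("image", name, n_tokens) / ("text", string) items.

    Returns the token list and a dict mapping each image name to its (start, end) span.
    """
    tokens, spans = [], {}
    for item in items:
        if item[0] == "image":
            _, name, n = item
            start = len(tokens)
            tokens.extend(f"<{name}:{i}>" for i in range(n))
            spans[name] = (start, len(tokens))
        else:
            tokens.extend(item[1].split())  # whitespace split stands in for a tokenizer
    return tokens, spans


items = [
    ("image", "image_1", 4),
    ("text", "use the woman from image 1, the background from image 2,"),
    ("text", "and apply the lighting from image 3."),
    ("image", "image_2", 4),
    ("image", "image_3", 4),
    ("text", "produce a composite"),
]

tokens, spans = flatten(items)
output_start = len(tokens)  # position where <BOI> would be emitted

# Every input image span ends before the output begins, so the emitted
# image tokens can attend to all three references.
print(spans["image_1"], output_start)
print(all(end <= output_start for _, end in spans.values()))  # True
```

Nothing in the layout is specific to three images; adding a fourth reference is just another item in the list.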
What this enables that prompt-rewriting didn’t
DALL-E 3 famously uses GPT-4 to rewrite the user’s prompt into something the underlying diffusion model understands better. That’s a useful trick, but it’s strictly weaker than in-band CoT:
| Capability | Prompt rewriting | In-band CoT |
|---|---|---|
| Plan responds to user intent | yes (LLM rewrites) | yes |
| Generator can attend to plan during emission | only via the rewritten prompt’s text embedding | directly via attention to plan tokens |
| Plan can reference image tokens of input images | only verbally | directly (tokens in same vocabulary) |
| Plan can iterate on a partial generation | no (one-shot) | yes (model can emit some tokens, “reconsider”, emit more) |
| Same model for understanding and generation | no (two models) | yes (one transformer) |
The last row is the structural one. With two models, the rewriter never sees the generator’s output, and the generator never sees the rewriter’s reasoning, only the rewritten prompt. With one transformer, every token can attend to every prior token; reasoning and emission share state.
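The shared-state claim follows directly from causal attention over a single sequence, which a small mask makes visible. The segment sizes below are arbitrary and chosen only for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy layout: 3 prompt tokens, 4 plan tokens, 5 image tokens in one sequence.
kinds = ["prompt"] * 3 + ["plan"] * 4 + ["image"] * 5
mask = causal_mask(len(kinds))


def visible_kinds(position):
    """Which segment kinds a given position can attend to."""
    return sorted({kinds[j] for j, ok in enumerate(mask[position]) if ok})


first_image = kinds.index("image")
last_plan = first_image - 1

print(visible_kinds(first_image))  # ['image', 'plan', 'prompt']
print(visible_kinds(last_plan))    # ['plan', 'prompt']
```

Every image position sees the full plan, while no plan position sees any image token that comes after it. In a two-model setup the plan tokens are not in the generator's sequence at all, so there is no corresponding row in its mask.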
Trade-offs and failure modes
- Raw aesthetic quality, specific styles. Best-in-class diffusion specialists (Midjourney v7, Flux v2 Pro, etc.) still produce more visually striking images at matched user effort. Reasoning models close the gap on “the user got what they asked for” but not yet on “the image is beautiful on its own merits”.
- Latency. Emitting hundreds of CoT tokens before the first image token adds latency. That is worth it for tasks that need planning and pure overhead for simple stylization.
- Compute. Joint training is expensive; you can’t hot-swap a better text model into a unified-token system without retraining.
- Inheritance of LLM failure modes. Reasoning models hallucinate plans, then commit to them. A diffusion model would hallucinate pixels; a reasoning model hallucinates pixels that match a confidently-wrong plan, which can look more like a deliberate error and be harder to debug.
The pattern, abstracted
| Stage | What happens | Why |
|---|---|---|
| Parse | model writes text identifying entities, attributes, relations in prompt | makes structure explicit; lets later stages reference |
| Plan | model writes layout, color palette, style, maybe per-region captions | turns one image into a series of locally-easy decisions |
| Reason about constraints | model resolves conflicts, fills gaps, chooses defaults | this is where “think harder” pays off — counting, text rendering, composition |
| (optional) intermediate sketch | some systems emit a low-res image first, evaluate, refine | multi-pass refinement; expensive but improves consistency |
| Emit | parallel-decode the final image tokens | committed plan now controls token sampling |
| (optional) diffusion polish | frozen diffusion decoder takes the image tokens and refines pixels | last-mile quality boost; the “Pro” tier in many product names |