Reasoning + image generation

The qualitative shift in late-2025 / 2026 image models: chain-of-thought before drawing. Nano Banana Pro and the GPT-Image-2 family as the archetypal examples.

What’s public vs. inferred
This lesson focuses on the architectural pattern. The detailed weights, training data, and pipeline of Nano Banana Pro (Google), GPT-Image-2 (OpenAI), and similar production systems are proprietary. What’s public from research papers and product reporting: they are built on unified-token transformers (lesson 13), they use chain-of-thought-style reasoning over image generation tasks, and they hybridize discrete and continuous decoders. Anything more specific is inferred from outputs and reasonable architecture priors.

What changed

The first wave of text-to-image models (DALL-E 1, SD 1.x, Midjourney v3) treated the prompt as a fixed string. You wrote a sentence, it produced an image. If the result wasn’t what you wanted, you iterated on the prompt — manually.

The reasoning-augmented wave (Nano Banana Pro, the GPT-Image-2 family, and reasoning-style image generation in Gemini-class native multimodal models) treats the prompt as a problem to solve. The model:

  1. Reads the prompt.
  2. Writes a chain-of-thought in text tokens: parses the request, lists constraints, sketches what the image should contain, considers ambiguities, picks a layout.
  3. (Optionally) writes intermediate planning artifacts: a textual scene graph, bounding boxes for major elements, a color palette, a style description.
  4. Then emits image tokens conditioned on all of the above.

[Figure: pipeline diagram. User prompt (“a chess scene”) → text CoT (parse, plan, constraints; checks for ambiguities, picks layout and palette) → scene plan in text (bboxes, captions per region) → image tokens emitted (parallel-decoded, ~12 rounds) → optional diffusion decoder (refines tokens → pixels) → image. All boxes except the diffusion decoder are inside one transformer.]
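The four steps can be pictured as one growing token sequence. The sketch below is only an illustration of that ordering, not any production system's format: the `Segment` type, the segment names, the whitespace "tokenizer", and the example reasoning and plan strings are all assumptions made up for this example. The `<BOI>` / `<EOI>` markers follow the conceptual sequences used elsewhere in this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    kind: str      # "prompt", "cot", "plan", or "image" (hypothetical labels)
    tokens: tuple  # placeholder string tokens; real systems use integer ids


def build_sequence(prompt, cot, plan, image_tokens):
    """Lay out prompt, reasoning, plan, and image tokens in emission order."""
    segments = [
        Segment("prompt", tuple(prompt.split())),
        Segment("cot", tuple(cot.split())),
    ]
    if plan:  # step 3 is optional
        segments.append(Segment("plan", tuple(plan.split())))
    segments.append(Segment("image", ("<BOI>", *image_tokens, "<EOI>")))
    return segments


def context_for(segments, kind):
    """Return every token that precedes the first segment of `kind`."""
    context = []
    for seg in segments:
        if seg.kind == kind:
            break
        context.extend(seg.tokens)
    return context


segments = build_sequence(
    prompt="a chess scene",
    cot="board is 8x8 ; pieces mid-game ; camera from the side",
    plan="bbox board 0.1 0.2 0.9 0.9 ; palette walnut ivory",
    image_tokens=[f"img_{i}" for i in range(4)],
)
image_context = context_for(segments, "image")
print(image_context[:3], "bbox" in image_context)
```

The point of `context_for` is that the image segment is conditioned on the prompt, the reasoning, and the plan together, because they all sit earlier in the same sequence.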

Why chain-of-thought helps generation

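For decomposable tasks such as counting, text rendering, and multi-object composition, planning beats pattern-matching: the plan turns one hard global decision into a series of locally easy ones.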

Editing as token surgery

Unified-token transformers turn editing into a one-shot prompt:

# Conceptual sequence at inference time
<input image tokens>          # the original
<text: "remove the lamp">     # the instruction
<BOI> ... <EOI>               # what the model emits — the edited image

The model sees the source image and the instruction, and it learned during training on interleaved data to honor instructions. It produces new image tokens that diverge from the source only where the instruction asks for change. Because source and target share one vocabulary, the model can literally copy unchanged regions token-for-token.

Reasoning helps editing in the same way it helps generation. The CoT spells out: “the lamp is the brass object near the top-left; replace its pixels with a continuation of the wall texture; preserve everything else.”
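A toy version of "token surgery" makes the copy-versus-change split concrete. This is a sketch under loud assumptions: the token ids `WALL` and `LAMP`, the explicit bounding-box `region`, and the `new_tokens` callback are hypothetical stand-ins. In a real model there is no explicit box; the transformer samples every output token and simply tends to reproduce the source where nothing should change.

```python
def edit_tokens(source, region, new_tokens):
    """Copy `source` token-for-token, replacing only cells inside `region`.

    source     : list of rows, each a list of integer token ids
    region     : (top, left, bottom, right), bottom/right exclusive
    new_tokens : callable (row, col) -> token id for cells inside the region
    """
    top, left, bottom, right = region
    edited = [row[:] for row in source]  # unchanged cells are plain copies
    for r in range(top, bottom):
        for c in range(left, right):
            edited[r][c] = new_tokens(r, c)
    return edited


def changed_fraction(a, b):
    """Fraction of grid cells whose token id differs between `a` and `b`."""
    total = sum(len(row) for row in a)
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / total


WALL, LAMP = 7, 42  # hypothetical token ids
source = [[WALL] * 8 for _ in range(8)]
for r in range(2):
    for c in range(2):
        source[r][c] = LAMP  # the "lamp" occupies the top-left 2x2 cells

# "remove the lamp": continue the wall texture inside the lamp's region
edited = edit_tokens(source, region=(0, 0, 2, 2), new_tokens=lambda r, c: WALL)
print(changed_fraction(source, edited))  # 4 of 64 cells differ -> 0.0625
```

Only the four lamp cells change; the other 60 tokens are identical to the source, which is the sense in which unchanged regions are copied rather than regenerated.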

Interactive · “think then draw” side by side

The demo for this section runs a mock model on the same prompt under two policies. Left: no CoT, the model just emits image tokens. Right: with CoT, the model emits planning tokens first, then image tokens. The “image” is a synthetic 8×8 token grid; the planning step modifies the probabilities the model uses for token decisions.

Prompts that show the difference include counting requests (“three blue cells in a row”) and text rendering (“letters: STOP”). The CoT side spends extra tokens planning, then commits to image cells with that plan as conditioning.

The CoT side is mock-deterministic: a simple planner parses the prompt for counts and target colors and biases the decoder accordingly. Real CoT in production models is an emergent capability of the joint transformer.
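The two policies can be approximated in a few lines. This is a rough reconstruction of the idea, not the demo's actual code: the word lists, the function names, and the placement rule (target cells start at the left of the middle row) are assumptions. It also simplifies "biasing the decoder" to its extreme, where the plan fully determines the planned cells.

```python
import random

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
COLORS = ["red", "green", "blue", "yellow"]
BACKGROUND = "white"
GRID = 8


def plan_from_prompt(prompt):
    """Toy 'CoT': pull a count and a target color out of the prompt text."""
    words = prompt.lower().split()
    count = next((NUMBER_WORDS[w] for w in words if w in NUMBER_WORDS), None)
    color = next((w for w in words if w in COLORS), None)
    return {"count": count, "color": color}


def emit_without_cot(prompt, seed=0):
    """No planning: each cell is sampled independently; the prompt is not parsed."""
    rng = random.Random(seed)
    return [[rng.choice(COLORS + [BACKGROUND]) for _ in range(GRID)]
            for _ in range(GRID)]


def emit_with_cot(prompt):
    """Plan first, then commit cells with the plan as conditioning."""
    plan = plan_from_prompt(prompt)
    grid = [[BACKGROUND] * GRID for _ in range(GRID)]
    if plan["count"] and plan["color"]:
        row = GRID // 2
        for col in range(plan["count"]):
            grid[row][col] = plan["color"]
    return plan, grid


def count_color(grid, color):
    """Number of cells in the grid that hold `color`."""
    return sum(cell == color for row in grid for cell in row)


prompt = "three blue cells in a row"
unplanned = emit_without_cot(prompt)
plan, planned = emit_with_cot(prompt)
print("plan:", plan)
print("blue cells without CoT:", count_color(unplanned, "blue"))
print("blue cells with CoT:", count_color(planned, "blue"))
```

The planned grid always contains exactly the requested number of target cells, while the unplanned grid contains however many the sampler happens to produce.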

Reasoning over multiple input images

The unified-token architecture handles “here are three reference images, combine them in this way” with no special pipeline:

<image_1 tokens>
<text: "use the woman from image 1, the background from image 2,">
<text: "and apply the lighting from image 3.">
<image_2 tokens>
<image_3 tokens>
<text: "produce a composite that ...">
<BOI> ... <EOI>

The transformer attends across all input image spans and the text in one pass; the output image tokens condition on everything. Reasoning helps the model decompose the constraints (whose face, whose background, whose lighting) before committing to the composite.
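To see why no special pipeline is needed, it helps to track where each span sits in the sequence. The sketch below assumes causal attention (each token sees everything before it) and invents the span names and lengths: 64 tokens per image and the text lengths are placeholders, not figures from any real tokenizer.

```python
def layout(spans):
    """Assign [start, end) positions to named spans laid out left to right."""
    positions, cursor = {}, 0
    for name, length in spans:
        positions[name] = (cursor, cursor + length)
        cursor += length
    return positions, cursor


def visible_spans(positions, query_pos):
    """Names of spans fully visible to a token at `query_pos` under causal attention."""
    return [name for name, (start, end) in positions.items() if end <= query_pos]


# Same order as the conceptual sequence above; lengths are hypothetical.
spans = [
    ("image_1", 64),
    ("text_a", 20),        # "use the woman from image 1, ..."
    ("image_2", 64),
    ("image_3", 64),
    ("text_b", 8),         # "produce a composite that ..."
    ("output_image", 64),  # <BOI> ... <EOI>
]
positions, total = layout(spans)
first_output_pos = positions["output_image"][0]
print(total, visible_spans(positions, first_output_pos))
```

The first output image token already has all three reference images and both text spans in its context, so combining them is a matter of attention rather than of a separate compositing stage.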

What this enables that prompt-rewriting didn’t

DALL-E 3 famously uses GPT-4 to rewrite the user’s prompt into a more detailed description that the underlying diffusion model follows better. That is a useful trick, but it is strictly weaker than in-band CoT:

Capability | Prompt rewriting | In-band CoT
Plan responds to user intent | yes (LLM rewrites) | yes
Generator can attend to plan during emission | only via the rewritten prompt’s text embedding | directly via attention to plan tokens
Plan can reference image tokens of input images | only verbally | directly (tokens in same vocabulary)
Plan can iterate on a partial generation | no (one-shot) | yes (model can emit some tokens, “reconsider”, emit more)
Same model for understanding and generation | no (two models) | yes (one transformer)

The last row is the structural one. With two models, the rewriter never sees the generator’s output, and the generator never sees the rewriter’s reasoning. With one transformer, every token can attend to every prior token, so reasoning and emission share state.

Trade-offs and failure modes

Where reasoning models still lose

The pattern, abstracted

Stage | What happens | Why
Parse | model writes text identifying entities, attributes, relations in prompt | makes structure explicit; lets later stages reference it
Plan | model writes layout, color palette, style, maybe per-region captions | turns one image into a series of locally-easy decisions
Reason about constraints | model resolves conflicts, fills gaps, chooses defaults | this is where “think harder” pays off: counting, text rendering, composition
(optional) intermediate sketch | some systems emit a low-res image first, evaluate, refine | multi-pass refinement; expensive but improves consistency
Emit | parallel-decode the final image tokens | committed plan now controls token sampling
(optional) diffusion polish | frozen diffusion decoder takes the image tokens and refines pixels | last-mile quality boost; the “Pro” tier in many product names
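The Emit stage says image tokens are parallel-decoded, and the pipeline figure mentions roughly 12 rounds, but the lesson does not say which scheme is used. One well-known possibility is confidence-based iterative unmasking with a cosine schedule (in the style of MaskGIT); the sketch below assumes that scheme. The `mock_predict` function, the 1024-entry vocabulary, and the random confidences are placeholders for a real model's predictions.

```python
import math
import random


def masked_after_round(n_tokens, round_idx, n_rounds):
    """Cosine schedule: tokens still masked after 0-based round `round_idx`."""
    if round_idx == n_rounds - 1:
        return 0  # the final round commits everything that is left
    ratio = math.cos(math.pi / 2 * (round_idx + 1) / n_rounds)
    return int(n_tokens * ratio)


def parallel_decode(n_tokens, n_rounds, predict, seed=0):
    """Fill all positions in `n_rounds` passes, most confident proposals first."""
    rng = random.Random(seed)
    tokens = [None] * n_tokens  # None marks a still-masked position
    history = []                # masked count after each round
    for r in range(n_rounds):
        masked = [i for i, t in enumerate(tokens) if t is None]
        # One "forward pass": propose (token, confidence) for every masked slot.
        proposals = {i: predict(i, tokens, rng) for i in masked}
        keep_masked = masked_after_round(n_tokens, r, n_rounds)
        n_commit = len(masked) - keep_masked
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
        history.append(keep_masked)
    return tokens, history


def mock_predict(position, tokens, rng):
    """Stand-in for the transformer: random token id and random confidence."""
    return rng.randrange(1024), rng.random()


tokens, history = parallel_decode(n_tokens=64, n_rounds=12, predict=mock_predict)
print(history)
```

Under this schedule the early rounds commit only a few high-confidence tokens and later rounds commit many at once, so an 8×8 grid is finished in 12 passes instead of 64 sequential steps.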
Punchline
Reasoning + generation = one unified-token transformer that thinks in language tokens, then writes image tokens, with every token able to attend to everything before it. The model gets to use its language capabilities (parse, plan, count, spell) in service of generation. As far as can be inferred from public information, the dramatic improvement in text rendering, counting, and instruction-following in late-2025 / 2026 image models flows from this one architectural choice.