Hybrid pipelines

When full unification isn’t the right answer: an LLM produces conditioning, a continuous diffusion model produces pixels. SD3 + T5, DALL-E 3 + GPT-4, Imagen 3 + PaLM-style encoder, and where each design lives or dies.

The frame

Lessons 13–14 made the case for unified-token transformers: one model, joint reasoning + generation, in-band CoT. That’s the right architecture if you’re building a multimodal flagship from scratch with enough compute. But many production systems don’t commit to it — instead they keep an LLM and a diffusion model as separate components glued by a conditioning interface.

Why? Three reasons:

Quality of the specialist. The state-of-the-art continuous diffusion stack (DiT-XL, flow matching with v-prediction, latent space from a deep VAE) is genuinely hard to beat for raw image quality at matched compute. If you don’t want to rebuild that quality from scratch in a unified-token regime, you keep it.
Independent iteration. The LLM team and the diffusion team can ship improvements independently. Two-tower lets you swap your LLM for a newer one without retraining the generator (and vice versa).
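Compute economics. A pure-text LLM is much cheaper per token than a unified-token LM with image tokens in vocab. If 95% of your traffic is text-only, a unified model still pays for the image-vocab embeddings and the wider output softmax on every turn; a hybrid pays the image cost only when an image is actually requested.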

The interface: how an LLM conditions a diffusion model

This is the crux of every hybrid. The diffusion model expects a conditioning vector (or sequence of vectors); the LLM produces text, hidden states, a structured plan, or tool calls. How you bridge the two is the design choice.

Style	What the LLM produces	How the diffusion model consumes it	Used by
String rewrite	rewritten prompt (text)	frozen text encoder (CLIP / T5) → cross-attn	DALL-E 3 (GPT-4 → SD-class diffuser), early Midjourney + LLM
Continuous text embedding	last-layer hidden states of an LLM at the prompt	cross-attn from diffusion U-Net / DiT, with adapter	Imagen-class (T5-XXL → diffusion), eDiff-I, GLIDE
Multi-encoder ensemble	(CLIP-text-emb, T5-text-emb, optional LLM-emb)	concatenated / cross-attn per encoder	SD3 (CLIP-L + CLIP-G + T5-XXL); Flux 1 (CLIP-L + T5-XXL, the two-encoder variant of the same family)
Plan + reference	structured plan: bboxes + per-region captions, plus latent reference image	cross-attn for text + ControlNet-style spatial conditioning	RPG-DiffusionMaster, several layout-aware pro pipelines
LLM-as-controller	tool calls (generate this, edit that, mask here)	diffusion model is called per tool invocation	agentic image pipelines, “visual ChatGPT”-style systems
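
The rows of the table can be read as different payload types crossing the same boundary. The sketch below is only an illustration of that idea: every class and function name is hypothetical, the multi-encoder row is treated as several `TextEmbedding` payloads side by side, and the returned strings stand in for real model calls.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# All names below are hypothetical and exist only to illustrate the table.

@dataclass
class RewrittenPrompt:            # "String rewrite" row
    text: str

@dataclass
class TextEmbedding:              # "Continuous text embedding" row
    hidden_states: List[List[float]]  # [tokens][channels]

@dataclass
class LayoutPlan:                 # "Plan + reference" row
    regions: List[Tuple[Tuple[int, int, int, int], str]]  # (bbox, caption)

@dataclass
class ToolCall:                   # "LLM-as-controller" row
    tool: str
    args: dict

Conditioning = Union[RewrittenPrompt, TextEmbedding, LayoutPlan, ToolCall]

def describe_consumption(c: Conditioning) -> str:
    """Return how the diffusion side would consume each kind of LLM output."""
    if isinstance(c, RewrittenPrompt):
        return "re-encode text with a frozen text encoder, then cross-attend"
    if isinstance(c, TextEmbedding):
        return f"cross-attend to {len(c.hidden_states)} token vectors via an adapter"
    if isinstance(c, LayoutPlan):
        return f"cross-attend to captions plus spatial conditioning for {len(c.regions)} regions"
    if isinstance(c, ToolCall):
        return f"invoke the '{c.tool}' diffusion tool once"
    raise TypeError(f"unknown conditioning type: {type(c).__name__}")

print(describe_consumption(RewrittenPrompt("a pencil sketch of a cat")))
print(describe_consumption(ToolCall("inpaint", {"mask": "hat"})))
```

The point of the type split is that the string-rewrite payload is the only one that lets you swap either side with no retraining; the others tie the diffusion model to the shape of what the LLM emits.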

The Stable Diffusion 3 / Flux recipe in detail

SD3 (Esser et al. 2024) is the canonical example. Three text encoders, all frozen, concatenated for joint conditioning:

CLIP-L: small text encoder trained with an image-text contrastive objective. Captures lexical / coarse concept matching.
CLIP-G: bigger CLIP variant. Same modality alignment, more capacity.
T5-XXL: the encoder half of the 11B-parameter T5-XXL language model (roughly 4.7B parameters on its own). Captures syntax, compositionality, longer-range relations.
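
To make "concatenated" concrete, here is a minimal shape-level sketch of one way the three feature sequences can be combined: the two CLIP sequences are joined per token along the channel axis, zero-padded to the T5 width, and the T5 tokens are appended along the sequence axis. The widths (768, 1280, 4096) and the 77-token lengths follow my reading of the SD3 paper and should be treated as assumptions; the function names are hypothetical and the zero-filled matrices stand in for real encoder outputs.

```python
from typing import List

Matrix = List[List[float]]  # [tokens][channels]

def concat_channels(a: Matrix, b: Matrix) -> Matrix:
    """Join two token sequences of equal length along the channel axis."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of tokens")
    return [row_a + row_b for row_a, row_b in zip(a, b)]

def zero_pad_channels(m: Matrix, width: int) -> Matrix:
    """Right-pad every token vector with zeros up to `width` channels."""
    padded = []
    for row in m:
        if len(row) > width:
            raise ValueError("row is already wider than the target width")
        padded.append(row + [0.0] * (width - len(row)))
    return padded

def build_context(clip_l: Matrix, clip_g: Matrix, t5: Matrix) -> Matrix:
    """Fuse CLIP features per token, pad to T5 width, then append the T5 tokens."""
    clip = concat_channels(clip_l, clip_g)
    t5_width = len(t5[0])
    return zero_pad_channels(clip, t5_width) + t5

def zeros(tokens: int, channels: int) -> Matrix:
    """Placeholder for real encoder output (assumed shapes only)."""
    return [[0.0] * channels for _ in range(tokens)]

ctx = build_context(zeros(77, 768), zeros(77, 1280), zeros(77, 4096))
print(len(ctx), len(ctx[0]))  # 154 4096
```

Under these assumed shapes the diffusion model sees 154 text positions of width 4096. Dropping CLIP-G, as in the two-encoder variant, changes only the first argument chain, which is part of why the interface is easy to slim down.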

Flux 1 (Black Forest Labs 2024) is a close relative but uses two encoders: CLIP-L + T5-XXL (no CLIP-G). The likely argument: CLIP-L gives a contrastive-aligned summary; T5 carries the compositional / syntactic load; CLIP-G adds capacity at meaningful cost without a clear win. The backbone is MM-DiT-style (joint-attention blocks, which Flux follows with single-stream blocks); the conditioning interface is leaner.

The diffusion backbone is a flow-matching DiT (lessons 7–8). Here is the move that trips people up. The three encoders each turn the prompt into a short run of feature vectors; you lay all of those out in one line, then lay the noisy image patches out right after them, so the transformer sees one long sequence of [text features, then image patches]. Inside the attention layer every position can look at every other position: image patches read the text, text reads the image patches, image patches read each other, all in the same pool, in both directions. That is what “joint attention” means, and SD3 calls this block the MM-DiT. The one asymmetry: at the end, only the image positions get cleaned up (denoised). The text positions are along for the ride; they are there to be read from, not to be turned into output. So the prompt steers the picture through attention, but the prompt itself is never “generated.”

Intuition · linear unpacking

Claim: in an MM-DiT the text tokens shape the image without ever being an output, because attention is a reading relationship, not a writing one.

1. One sequence, two kinds of tokens. Glue the prompt’s feature vectors and the noisy image patches into a single list. Inside the attention step they all sit in one pool of positions, so everyone can look at everyone; that is the part that matters here. (MM-DiT does keep separate weights per modality elsewhere in the block; it’s the shared attention pool that lets text and image talk.)
2. Attention is “who do I get to look at.” Every position pulls information from every other position. So an image patch can pull from the words, and from its neighbouring patches, in the same step. That is the only channel through which the prompt reaches the pixels.
3. Only the image positions are scored. The training loss asks one thing: did you denoise the image patches correctly? It never asks the text positions to predict anything. So the network only ever learns to clean up the image half.
4. Therefore the text is read-only. The words contribute by being looked at (step 2), not by being produced (step 3). They condition the picture without themselves being generated, which is exactly why this is a conditioning interface, not a unified-token model.

Central point. The prompt does not become part of the image; it sits in the same attention pool so the image can keep glancing at it while it denoises.
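
The reading-versus-writing distinction can be sketched in a few lines of plain Python. This is a toy, single-head version written for illustration, not the SD3 implementation: the weights are random, the dimensions are tiny, and names such as `joint_attention` and `image_only_loss` are hypothetical. What it does show is the structure described above: per-modality projections, one shared attention pool, and a loss that touches image positions only.

```python
import math
import random
from typing import Dict, List, Tuple

Mat = List[List[float]]  # [tokens][channels]

def project(x: Mat, w: Mat) -> Mat:
    """Apply a linear map w (d_in x d_out) to every token vector in x."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) for j in range(len(w[0]))]
            for row in x]

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_params(rng: random.Random, d: int) -> Dict[str, Mat]:
    """Separate q/k/v weights per modality (hypothetical names, random values)."""
    names = ["text_q", "text_k", "text_v", "img_q", "img_k", "img_v"]
    return {n: [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(d)] for n in names}

def joint_attention(text: Mat, image: Mat, p: Dict[str, Mat]) -> Tuple[Mat, Mat]:
    """Single-head joint attention over the sequence [text tokens, image tokens]."""
    # Each modality uses its own projections, then everything joins one pool.
    q = project(text, p["text_q"]) + project(image, p["img_q"])
    k = project(text, p["text_k"]) + project(image, p["img_k"])
    v = project(text, p["text_v"]) + project(image, p["img_v"])
    d = len(q[0])
    out = []
    for qi in q:  # every position attends to every position, text and image alike
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    n_text = len(text)
    return out[:n_text], out[n_text:]

def image_only_loss(image_out: Mat, target: Mat) -> float:
    """Mean squared error over image positions; text positions are never scored."""
    diffs = [(a - b) ** 2 for ro, rt in zip(image_out, target) for a, b in zip(ro, rt)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
d = 4
params = make_params(rng, d)
text = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]   # 3 prompt tokens
image = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # 5 noisy patches
target = [[0.0] * d for _ in range(5)]                            # toy denoising target

text_out, image_out = joint_attention(text, image, params)
loss = image_only_loss(image_out, target)
print(len(text_out), len(image_out))  # 3 5
```

Note that `text_out` is computed and would be passed to the next block, but `image_only_loss` never sees it. Changing the prompt tokens still changes `image_out`, because the image queries attend to the text keys and values.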

Why three encoders?

Different text encoders have different inductive biases:

CLIP encoders were trained against images and learn vocabulary that maps to visual concepts — they understand “photorealistic,” “watercolor,” “cyberpunk.”
T5 was trained on text alone, with no images involved, and understands syntax, negation, compositional relations.

Concatenating both gives the diffusion model access to both flavors of conditioning. Empirically, adding T5 is what makes “a photo of A but not B” type prompts noticeably more reliable; CLIP alone tends to confuse negation.

Intuition · linear unpacking

Claim: you stack multiple encoders because each one was taught to listen for a different thing, and the prompt needs all of them heard at once.

CLIP learned words-to-look. It was trained by matching captions to images, so it is fluent in “what does this look like” — styles, textures, named visual concepts. But it was never trained to track sentence structure, so “A but not B” and “A and B” look nearly the same to it.
T5 learned grammar. It was trained only on language, so it tracks word order, negation, and which adjective attaches to which noun — the things CLIP drops.
The bottleneck is real. Whatever an encoder did not learn to hear is simply gone by the time the features reach the diffusion model; there is no later stage that can recover it.
So you run them in parallel. Feed the same prompt through both, lay their features side by side, and the diffusion model gets the visual vocabulary from CLIP and the sentence logic from T5 in one conditioning signal.

Central point. Multiple encoders are not redundancy — each catches a part of the prompt the others are deaf to, and concatenating them is how you avoid losing it.

Where each design fails

Hybrid failure modes

Prompt-rewrite drift. The LLM rewrites the user’s “simple sketch in pencil” into “a hyper-detailed cinematic photograph” because that’s what its training thinks “looks good.” The diffusion model honors the rewrite. The user gets the opposite of what they asked for.
Encoder-bottleneck information loss. CLIP-L’s text encoder is ~123M parameters with a 77-token context window; push a 2000-word art-direction prompt through it and most of the text is truncated before it is even encoded. T5-XXL helps but the bottleneck is real.
No introspection. The diffusion model can’t ask “wait, what color hat?”. If the prompt is ambiguous, the model just commits to one interpretation. A unified-token reasoning model could include a clarification round (or at least produce a CoT making the choice explicit).
Editing is a separate pipeline. “Take this image, change one thing” isn’t native; you need a separate inpainting / ControlNet / InstructPix2Pix model and a router that picks the right one. Unified-token models handle this in band.
Latency stacks. LLM rewrites prompt (1s), CLIP encodes (50ms), T5 encodes (200ms), diffusion runs 25 steps (3s). Each component adds latency; the unified-token route can in principle short-circuit the easy cases.
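
The latency figures quoted in the last item add up as follows. The numbers are the illustrative ones from the text, expressed in milliseconds; the assumption that the two text encoders can run concurrently is mine and depends on your serving setup.

```python
# Illustrative stage latencies in milliseconds, taken from the figures quoted above.
STAGES_MS = {
    "llm_rewrite": 1000,
    "clip_encode": 50,
    "t5_encode": 200,
    "diffusion_25_steps": 3000,
}

def sequential_latency_ms(stages: dict) -> int:
    """Every stage waits for the previous one."""
    return sum(stages.values())

def parallel_encoders_latency_ms(stages: dict) -> int:
    """Assumes both text encoders run concurrently, so only the slower one counts."""
    encoders = max(stages["clip_encode"], stages["t5_encode"])
    return stages["llm_rewrite"] + encoders + stages["diffusion_25_steps"]

print(sequential_latency_ms(STAGES_MS))         # 4250
print(parallel_encoders_latency_ms(STAGES_MS))  # 4200
```

Running the encoders in parallel saves only 50 ms here. With these figures the diffusion loop and the LLM rewrite dominate, so those are the stages worth shortening or skipping for easy prompts.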

When hybrid still wins (in 2026)

Situation	Why hybrid
You want the absolute best raw image quality	frozen-best diffusion specialist + good text conditioning is still SOTA at fixed compute
You need to swap models often	two-tower lets you swap LLM or diffusion independently
Most queries are text-only	don’t pay image-vocab embedding cost on every turn
You have no joint training data	can use any LLM, any diffusion model; no interleaved data needed
Tight latency + simple prompts	cached text embedding + small diffusion = fastest per-query
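
The "cached text embedding" entry in the last row works because a frozen encoder is deterministic: the same prompt always yields the same features. A minimal sketch using the standard-library `functools.lru_cache` is below; `encode_prompt` is a hypothetical stand-in that returns a toy tuple rather than calling a real encoder, and `generate` is likewise a placeholder.

```python
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=1024)
def encode_prompt(prompt: str) -> Tuple[float, ...]:
    """Stand-in for a frozen text encoder; returns a hashable, immutable embedding."""
    # Toy features only: a real system would call CLIP / T5 here.
    return tuple(float(ord(ch) % 7) for ch in prompt[:8])

def generate(prompt: str) -> str:
    """Placeholder for the diffusion call; only the embedding lookup is real."""
    emb = encode_prompt(prompt)
    return f"image conditioned on {len(emb)}-dim embedding"

generate("a red bicycle")
generate("a red bicycle")  # second call reuses the cached embedding
info = encode_prompt.cache_info()
print(info.hits, info.misses)  # 1 1
```

This only pays off when prompts repeat exactly, for example templated product shots. An LLM rewrite stage in front of the encoder would have to be deterministic as well, or the cache key should be the rewritten string.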

The pattern, unified

The whole series has been about where you put the boundary between language-shaped and pixel-shaped computation.

Architecture	Language-shaped layer	Pixel-shaped layer	Glue
Pure diffusion (Part A)	text encoder (CLIP/T5)	DiT / UNet on continuous latents	cross-attention
Unified-token (lessons 13-14)	one transformer over text+image tokens	VQ decoder (small)	vocab and embedding table
Reasoning-augmented (lesson 14)	same as above, with CoT	parallel-decoded image tokens + optional diffusion polish	shared transformer + optional decoder hand-off
Hybrid (lesson 15)	separate LLM	separate diffusion model	prompt string or learned conditioning vector

The four rows correspond to four points on a spectrum: how much do you commit to a shared representation between language and image computation? The full series has visited each.

Where to go from here

Audio. The same toolkit (VQ → token transformer + parallel decoder, or diffusion in latent space) is what underlies modern speech and music synthesis. Lesson 11’s RVQ note is the entry point.
Video. Add a temporal axis. Two distinct lineages here, not one: Sora-style models do continuous diffusion on spacetime latent patches (DiT operating over (H, W, T) tokens from a learned VAE); MAGVIT-v2-style models tokenize the video into discrete codes (causal 3D VQGAN with LFQ) and then run an LM / parallel decoder over the resulting token grid. Same end product, opposite stage-1 choices.
Agentic generation. The reasoning models in lesson 14 are the precursor to image-generation agents that decompose tasks, generate intermediate artifacts, evaluate, retry. The boundary between “reasoning model” and “agent” will keep blurring.
3D / scenes. NeRF-style and 3DGS-style generation are converging on diffusion-in-latent-space (DreamFusion, LRM, Genie 3-style). Same math, different output space.

Punchline

Hybrid pipelines are not obsolete — they’re the right answer when you want to keep specialist quality and independent iteration. The unified-token route wins for joint reasoning, in-band editing, and shared-state generation. Pick by use case, not by hype.

End of Part B — onward to the production stack

Parts A and B took the two architectural forks: continuous diffusion (A) and discrete/multimodal tokens (B). Part C returns to continuous diffusion and builds the actual production text-to-image stack — the score/SDE theory, the fast samplers, guidance, the VAE latent space, CLIP conditioning, evaluation, the Stable Diffusion lineage, and the adapter ecosystem. It branches off Part A, so the next lesson reconnects to lessons 03–07.