Hybrid pipelines
When full unification isn’t the right answer: an LLM produces conditioning, a continuous diffusion model produces pixels. SD3 + T5, DALL-E 3 + GPT-4, Imagen 3 + PaLM-style encoder, and where each design lives or dies.
The frame
Lessons 13–14 made the case for unified-token transformers: one model, joint reasoning + generation, in-band CoT. That’s the right architecture if you’re building a multimodal flagship from scratch with enough compute. But many production systems don’t commit to it — instead they keep an LLM and a diffusion model as separate components glued by a conditioning interface.
Why? Three reasons:
- Quality of the specialist. The state-of-the-art continuous diffusion stack (DiT-XL, flow matching with velocity prediction, latent space from a deep VAE) is genuinely hard to beat for raw image quality at matched compute. If you don’t want to re-earn that quality from scratch in a unified-token regime, you keep it.
- Independent iteration. The LLM team and the diffusion team can ship improvements independently. Two-tower lets you swap your LLM for a newer one without retraining the generator (and vice versa).
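- Compute economics. A pure-text LLM is much cheaper per token than a unified-token LM with image tokens in vocab. If 95% of your traffic is text-only, a unified-token model makes you pay for the image-vocab embeddings on every turn; a hybrid only invokes the image stack when an image is actually requested.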
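To make the compute-economics point concrete, here is a back-of-the-envelope sketch of the vocabulary-related overhead only. The hidden width and vocabulary sizes are hypothetical, chosen to keep the arithmetic readable. The sketch assumes an untied output projection and does not model any difference in overall model size.

```python
# Hypothetical sizes, chosen only to make the arithmetic concrete.
D_MODEL = 4096         # hidden width of the LM
TEXT_VOCAB = 128_000   # text tokens
IMAGE_VOCAB = 16_384   # extra VQ image codes in a unified-token model


def vocab_overhead(d_model: int, text_vocab: int, image_vocab: int) -> dict:
    """Extra cost of carrying image codes in the vocabulary.

    Counts one embedding row and one (untied) output-projection row per
    image code, plus the extra logit FLOPs per generated token
    (2 FLOPs per weight: one multiply, one add).
    """
    extra_params = 2 * d_model * image_vocab        # embedding + unembedding
    extra_logit_flops = 2 * d_model * image_vocab   # per generated token
    base_logit_flops = 2 * d_model * text_vocab
    return {
        "extra_params": extra_params,
        "extra_logit_flops_per_token": extra_logit_flops,
        "logit_flops_increase": extra_logit_flops / base_logit_flops,
    }


if __name__ == "__main__":
    for key, value in vocab_overhead(D_MODEL, TEXT_VOCAB, IMAGE_VOCAB).items():
        print(f"{key}: {value}")
```

With these assumed sizes the output layer grows by about 13% per generated token, and that cost is paid on text-only turns too.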
The interface: how an LLM conditions a diffusion model
This is the crux of every hybrid. The diffusion model expects a conditioning vector (or sequence of vectors); the LLM produces something. Bridging them is the design choice.
| Style | What the LLM produces | How the diffusion model consumes it | Used by |
|---|---|---|---|
| String rewrite | rewritten prompt (text) | frozen text encoder (CLIP / T5) → cross-attn | DALL-E 3 (GPT-4 rewrites the prompt for a separately trained diffusion model), early Midjourney + LLM |
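| Continuous text embedding | last-layer hidden states of a language model over the prompt tokens | cross-attn from diffusion U-Net / DiT, with adapter | Imagen-class (frozen T5-XXL → diffusion), eDiff-I (T5 + CLIP embeddings), GLIDE (text transformer trained jointly with the diffusion model) |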
| Multi-encoder ensemble | (CLIP-text-emb, T5-text-emb, optional LLM-emb) | concatenated into one context sequence (joint attention in SD3) or cross-attn per encoder | SD3 (CLIP-L + CLIP-G + T5-XXL); Flux 1 (CLIP-L + T5-XXL, the two-encoder variant of the same family) |
| Plan + reference | structured plan: bboxes + per-region captions, plus latent reference image | cross-attn for text + ControlNet-style spatial conditioning | RPG-DiffusionMaster, several layout-aware pro pipelines |
| LLM-as-controller | tool calls (generate this, edit that, mask here) | diffusion model is called per tool invocation | agentic image pipelines, “visual ChatGPT”-style systems |
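One way to keep the five interface styles straight is to treat each as a distinct payload type that crosses the LLM/diffusion boundary. The sketch below is illustrative only: every class and function name is hypothetical, and the consumption strings simply restate the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Vector = List[float]
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized


@dataclass
class RewrittenPrompt:      # row: String rewrite
    text: str


@dataclass
class TokenEmbeddings:      # row: Continuous text embedding
    vectors: List[Vector]   # one hidden state per prompt token


@dataclass
class EncoderEnsemble:      # row: Multi-encoder ensemble
    per_encoder: Dict[str, TokenEmbeddings]


@dataclass
class RegionPlan:           # row: Plan + reference
    regions: List[Tuple[BBox, str]]  # (bounding box, per-region caption)


@dataclass
class ToolCall:             # row: LLM-as-controller
    name: str               # e.g. "generate", "edit", "mask"
    arguments: Dict[str, str] = field(default_factory=dict)


Conditioning = Union[
    RewrittenPrompt, TokenEmbeddings, EncoderEnsemble, RegionPlan, ToolCall
]

# How the diffusion side consumes each payload, restated from the table.
_PATHS = {
    RewrittenPrompt: "frozen text encoder -> cross-attention",
    TokenEmbeddings: "adapter -> cross-attention",
    EncoderEnsemble: "concatenate encoder outputs -> attention",
    RegionPlan: "cross-attention for text + spatial conditioning",
    ToolCall: "one diffusion call per tool invocation",
}


def consumption_path(payload: Conditioning) -> str:
    """Return the consumption route for a given conditioning payload."""
    return _PATHS[type(payload)]
```

The useful property of writing it this way is that the boundary becomes explicit: swapping the LLM or the generator only requires both sides to agree on one of these payload types.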
The Stable Diffusion 3 / Flux recipe in detail
SD3 (Esser et al. 2024) is the canonical example. Three text encoders, all frozen, concatenated for joint conditioning:
- CLIP-L: small text encoder trained with an image-text contrastive objective. Captures lexical / coarse concept matching.
- CLIP-G: bigger CLIP variant. Same modality alignment, more capacity.
- T5-XXL: the encoder half of the 11B-parameter T5-XXL text-to-text model (the encoder alone is roughly 4.7B parameters). Captures syntax, compositionality, longer-range relations.
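How the three encoder outputs are combined is easiest to see as shape bookkeeping. The sketch below assumes the layout described in the SD3 paper (77 tokens per encoder, CLIP widths of 768 and 1280, T5 width of 4096); the helper names are hypothetical and only shapes are tracked, no tensors.

```python
from typing import Tuple

Shape = Tuple[int, int]  # (tokens, channels)


def concat_channels(a: Shape, b: Shape) -> Shape:
    """Concatenate per-token features side by side."""
    assert a[0] == b[0], "channel concat needs equal token counts"
    return (a[0], a[1] + b[1])


def pad_channels(a: Shape, width: int) -> Shape:
    """Zero-pad the channel axis up to `width`."""
    assert width >= a[1], "cannot pad to a smaller width"
    return (a[0], width)


def concat_sequence(a: Shape, b: Shape) -> Shape:
    """Append one token sequence after another."""
    assert a[1] == b[1], "sequence concat needs equal channel widths"
    return (a[0] + b[0], a[1])


def sd3_context_shape(
    clip_l: Shape = (77, 768),
    clip_g: Shape = (77, 1280),
    t5: Shape = (77, 4096),
) -> Shape:
    """Shape of the text context sequence under the assumed SD3 layout."""
    clip = concat_channels(clip_l, clip_g)  # (77, 2048)
    clip = pad_channels(clip, t5[1])        # (77, 4096), zero-padded
    return concat_sequence(clip, t5)        # (154, 4096)


if __name__ == "__main__":
    print(sd3_context_shape())
```

Under these assumptions the DiT sees a 154-token text context: 77 positions carrying the two CLIP views stacked channel-wise, followed by 77 T5 positions.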
Flux 1 (Black Forest Labs 2024) is a close relative but uses two encoders: CLIP-L + T5-XXL (no CLIP-G). The likely argument: CLIP-L gives a contrastive-aligned summary; T5 carries the compositional / syntactic load; CLIP-G adds capacity at meaningful cost without a clear win. Same MM-DiT-style backbone, leaner conditioning interface.
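The diffusion backbone is a flow-matching DiT (lessons 7–8). In SD3, the per-token features from the two CLIP encoders are concatenated along the channel axis and zero-padded to T5’s width, then concatenated with the T5 features along the sequence axis; the pooled CLIP vectors additionally modulate the blocks together with the timestep embedding. The resulting text sequence is fed into the DiT via joint attention: text and image tokens sit in the same attention pool and attend to each other bidirectionally. (This is the “MM-DiT” block in SD3; each modality keeps its own weights, and the two streams meet only in the attention operation.) Only the image-token positions are read out as the denoising prediction. The text tokens are updated layer by layer and shape the attention, but their final outputs are discarded.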
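A toy version of joint attention makes the “same attention pool, image-only readout” idea concrete. This is a deliberately simplified sketch, not SD3’s block: it is single-head, uses identity projections (queries, keys and values are the tokens themselves), and has no modulation or MLP. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text: List[Vector], image: List[Vector]) -> List[Vector]:
    """Bidirectional attention over the concatenated [text; image] sequence.

    Toy simplification: identity projections. A real MM-DiT block applies
    separate learned projections to the text and image streams first.
    """
    tokens = text + image
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        updated.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return updated


def denoiser_readout(text: List[Vector], image: List[Vector]) -> List[Vector]:
    """Run joint attention, then keep only the image-token positions."""
    updated = joint_attention(text, image)
    return updated[len(text):]


if __name__ == "__main__":
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]
    image_tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]
    for row in denoiser_readout(text_tokens, image_tokens):
        print(row)
```

Every image token’s update is a weighted mix of both text and image tokens, which is how the text reaches the pixels even though no text position is ever read out.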
Why three encoders?
Different text encoders have different inductive biases:
- CLIP encoders were trained against images and learn vocabulary that maps to visual concepts — they understand “photorealistic,” “watercolor,” “cyberpunk.”
- T5 was trained on text alone and understands syntax, negation, compositional relations.
Concatenating both gives the diffusion model access to both flavors of conditioning. T5 is what gives “a photo of A but not B” type prompts a chance of working: CLIP alone tends to treat the prompt as a bag of concepts and confuses negation. (In the SD3 paper’s own ablation, dropping T5 at inference mainly hurt rendered text and highly detailed prompts.)
Interactive · pick a hybrid architecture
[Interactive widget: a decision-tree explorer that recommends a hybrid architecture based on choices about your use case.]
Where each design fails
- Prompt-rewrite drift. The LLM rewrites the user’s “simple sketch in pencil” into “a hyper-detailed cinematic photograph” because that’s what its training thinks “looks good.” The diffusion model honors the rewrite. The user gets the opposite of what they asked for.
- Encoder-bottleneck information loss. CLIP-L’s text encoder is small (roughly 120M parameters) and reads at most 77 tokens; push a 2000-word art-direction prompt through it and most of the structure is truncated or compressed away. T5-XXL helps, but the bottleneck is real.
- No introspection. The diffusion model can’t ask “wait, what color hat?”. If the prompt is ambiguous, the model just commits to one interpretation. A unified-token reasoning model could include a clarification round (or at least produce a CoT making the choice explicit).
- Editing is a separate pipeline. “Take this image, change one thing” isn’t native; you need a separate inpainting / ControlNet / InstructPix2Pix model and a router that picks the right one. Unified-token models handle this in band.
- Latency stacks. With illustrative numbers: the LLM rewrites the prompt (1 s), CLIP encodes (50 ms), T5 encodes (200 ms), diffusion runs 25 steps (3 s), for about 4.25 s end to end. Each component adds latency; the unified-token route can in principle short-circuit the easy cases.
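The latency stack in the last bullet is simple enough to budget explicitly. The sketch below reuses the illustrative per-stage figures from that bullet (they are not measurements) and shows what skipping a stage buys; the stage names are hypothetical labels.

```python
from typing import Dict, Iterable

# Illustrative per-stage latencies in milliseconds, taken from the bullet above.
STAGES_MS: Dict[str, int] = {
    "llm_rewrite": 1000,
    "clip_encode": 50,
    "t5_encode": 200,
    "diffusion_25_steps": 3000,
}


def total_latency_ms(stages: Dict[str, int], skip: Iterable[str] = ()) -> int:
    """Sum stage latencies for a strictly sequential pipeline."""
    skipped = set(skip)
    return sum(ms for name, ms in stages.items() if name not in skipped)


def share(stages: Dict[str, int], name: str) -> float:
    """Fraction of end-to-end latency spent in one stage."""
    return stages[name] / total_latency_ms(stages)


if __name__ == "__main__":
    print("end to end:", total_latency_ms(STAGES_MS), "ms")
    print("without LLM rewrite:", total_latency_ms(STAGES_MS, ["llm_rewrite"]), "ms")
    print("diffusion share:", round(share(STAGES_MS, "diffusion_25_steps"), 3))
```

With these figures the sampler accounts for roughly 70% of the wall-clock time, so skipping the rewrite on simple prompts saves a second, but fewer or cheaper diffusion steps are where most of the budget sits.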
When hybrid still wins (in 2026)
| Situation | Why hybrid |
|---|---|
| You want the absolute best raw image quality | a frozen, best-in-class diffusion specialist + good text conditioning is still hard to beat at fixed compute |
| You need to swap models often | two-tower lets you swap LLM or diffusion independently |
| Most queries are text-only | don’t pay image-vocab embedding cost on every turn |
| You have no joint training data | can use any LLM, any diffusion model; no interleaved data needed |
| Tight latency + simple prompts | cached text embedding + small diffusion = fastest per-query |
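The “cached text embedding” row works because the text encoders are frozen: the same prompt always yields the same conditioning, so it can be memoized. A minimal sketch, assuming a hypothetical `encode_prompt_uncached` stand-in for the real encoder forward pass:

```python
from functools import lru_cache
from typing import Tuple


def encode_prompt_uncached(prompt: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for a frozen text encoder forward pass.

    Returns a toy "embedding" built from character statistics, just so
    there is a deterministic value to cache.
    """
    return (float(len(prompt)), float(sum(map(ord, prompt)) % 997))


@lru_cache(maxsize=10_000)
def encode_prompt(prompt: str) -> Tuple[float, ...]:
    """Memoized encoder call; valid only while the encoder stays frozen."""
    return encode_prompt_uncached(prompt)


def conditioning_for(prompt: str) -> Tuple[float, ...]:
    """Normalize whitespace and case so trivially different prompts share an entry.

    Lower-casing is an assumption that suits this toy encoder; a
    case-sensitive tokenizer would need a gentler normalization.
    """
    return encode_prompt(" ".join(prompt.lower().split()))


if __name__ == "__main__":
    conditioning_for("A red   hat")
    conditioning_for("a red hat")
    print(encode_prompt.cache_info())
```

The cache has to be invalidated whenever an encoder is swapped, which is the flip side of the independent-iteration benefit.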
The pattern, unified
The whole series has been about where you put the boundary between language-shaped and pixel-shaped computation.
| Architecture | Language-shaped layer | Pixel-shaped layer | Glue |
|---|---|---|---|
| Pure diffusion (Part A) | text encoder (CLIP/T5) | DiT / U-Net on continuous latents | cross-attention |
| Unified-token (lessons 13–14) | one transformer over text+image tokens | VQ decoder (small) | vocab and embedding table |
| Reasoning-augmented (lesson 14) | same as above, with CoT | parallel-decoded image tokens + optional diffusion polish | shared transformer + optional decoder hand-off |
| Hybrid (lesson 15) | separate LLM | separate diffusion model | prompt string or learned conditioning vector |
The four rows correspond to four points on a spectrum: how much do you commit to a shared representation between language and image computation? The full series has visited each.
Where to go from here
- Audio. The same toolkit (VQ → token transformer + parallel decoder, or diffusion in latent space) is what underlies modern speech and music synthesis. Lesson 11’s RVQ note is the entry point.
- Video. Add a temporal axis. Two distinct lineages here, not one: Sora-style models do continuous diffusion on spacetime latent patches (DiT operating over (H, W, T) tokens from a learned VAE); MAGVIT-v2-style models tokenize the video into discrete codes (causal 3D VQGAN with LFQ) and then run an LM / parallel decoder over the resulting token grid. Same end product, opposite stage-1 choices.
- Agentic generation. The reasoning models in lesson 14 are the precursor to image-generation agents that decompose tasks, generate intermediate artifacts, evaluate, retry. The boundary between “reasoning model” and “agent” will keep blurring.
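- 3D / scenes. NeRF-style and 3DGS-style generation increasingly borrows from the same toolkit, though by different routes: DreamFusion optimizes a NeRF against a 2D diffusion prior (score distillation), LRM regresses a 3D representation with a feed-forward transformer, and Genie 3-style world models generate interactive scenes directly. Much of the math carries over; the output space differs.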
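For the video bullet above, the two stage-1 choices can be contrasted in a few lines: counting spacetime patch tokens for the continuous route, and turning a latent vector into a discrete code with lookup-free quantization (LFQ) for the token route. The downsampling factors and patch sizes are hypothetical, and the floor division ignores the special handling of the first frame that causal tokenizers use.

```python
from typing import Sequence


def spacetime_tokens(
    frames: int,
    height: int,
    width: int,
    t_down: int,
    s_down: int,
    patch_t: int = 1,
    patch_hw: int = 2,
) -> int:
    """Number of DiT tokens after VAE downsampling and spacetime patching."""
    lat_t = frames // t_down
    lat_h = height // s_down
    lat_w = width // s_down
    return (lat_t // patch_t) * (lat_h // patch_hw) * (lat_w // patch_hw)


def lfq_index(latent: Sequence[float]) -> int:
    """Lookup-free quantization: each dimension contributes one sign bit.

    Dimension i (0-based) adds 2**i to the token index when it is positive,
    so a d-dimensional latent maps into a vocabulary of 2**d codes.
    """
    return sum(2 ** i for i, value in enumerate(latent) if value > 0)


if __name__ == "__main__":
    # Hypothetical clip: 16 frames at 256x256, 4x temporal / 8x spatial VAE.
    print("continuous route tokens:", spacetime_tokens(16, 256, 256, 4, 8))
    print("discrete code for one latent:", lfq_index([0.3, -1.2, 0.8]))
```

In the continuous route each token is a real-valued patch the DiT denoises; in the discrete route each latent position collapses to an integer that a language model can predict.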
generative_continuous/ is the Part-A implementation; everything in Part B is the architectural family on top of it.