Hybrid pipelines

When full unification isn’t the right answer: an LLM produces conditioning, a continuous diffusion model produces pixels. SD3 + T5, DALL-E 3 + GPT-4, Imagen 3 + PaLM-style encoder, and where each design lives or dies.

The frame

Lessons 13–14 made the case for unified-token transformers: one model, joint reasoning + generation, in-band CoT. That’s the right architecture if you’re building a multimodal flagship from scratch with enough compute. But many production systems don’t commit to it — instead they keep an LLM and a diffusion model as separate components glued by a conditioning interface.

Why? Three reasons:

  1. Quality of the specialist. The state-of-the-art continuous diffusion stack (a DiT-XL-scale backbone, flow matching (FM) with v-prediction, a latent space from a deep VAE) is genuinely hard to beat for raw image quality at matched compute. If you don’t want to re-earn that quality bar from scratch in a unified-token regime, you keep the specialist.
  2. Independent iteration. The LLM team and the diffusion team can ship improvements independently. A two-tower design lets you swap your LLM for a newer one without retraining the generator (and vice versa).
  3. Compute economics. A pure-text LLM is much cheaper per token than a unified-token LM with image tokens in its vocabulary. If 95% of your traffic is text-only, a unified model still pays for the image-vocab embeddings on every turn; a hybrid only pays for image generation when a request asks for it.

The interface: how an LLM conditions a diffusion model

This is the crux of every hybrid. The diffusion model expects a conditioning vector (or a sequence of vectors); the LLM produces text, hidden states, or tool calls. How you bridge the two is the central design choice.

| Style | What the LLM produces | How the diffusion model consumes it | Used by |
| --- | --- | --- | --- |
| String rewrite | rewritten prompt (text) | frozen text encoder (CLIP / T5) → cross-attn | DALL-E 3 (GPT-4 → SD-class diffuser), early Midjourney + LLM |
| Continuous text embedding | last-layer hidden states of an LLM at the prompt | cross-attn from diffusion U-Net / DiT, with adapter | Imagen-class (T5-XXL → diffusion), eDiff-I, GLIDE |
| Multi-encoder ensemble | (CLIP-text-emb, T5-text-emb, optional LLM-emb) | concatenated / cross-attn per encoder | SD3 (CLIP-L + CLIP-G + T5-XXL); Flux 1 (CLIP-L + T5-XXL, the two-encoder variant of the same family) |
| Plan + reference | structured plan: bboxes + per-region captions, plus latent reference image | cross-attn for text + ControlNet-style spatial conditioning | RPG-DiffusionMaster, several layout-aware pro pipelines |
| LLM-as-controller | tool calls (generate this, edit that, mask here) | diffusion model is called per tool invocation | agentic image pipelines, “visual ChatGPT”-style systems |
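
The five interface styles in the table can be read as five payload types crossing the LLM/diffusion boundary. The following is a minimal sketch of that view, not an API from any of the systems named above: every class name, field name, and the `consumption_path` helper are hypothetical, chosen only to mirror the table's rows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Tokens = List[List[float]]                # [sequence, channels]
Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class RewrittenPrompt:       # row 1: string rewrite
    text: str


@dataclass
class TextEmbedding:         # row 2: continuous text embedding
    hidden_states: Tokens


@dataclass
class EncoderEnsemble:       # row 3: multi-encoder ensemble
    per_encoder: Dict[str, Tokens]


@dataclass
class RegionPlan:            # row 4: plan + reference
    regions: List[Tuple[Box, str]]                      # (bbox, per-region caption)
    reference_latent: Tokens = field(default_factory=list)


@dataclass
class ToolCall:              # row 5: LLM-as-controller
    name: str
    arguments: Dict[str, str]


Conditioning = Union[RewrittenPrompt, TextEmbedding, EncoderEnsemble, RegionPlan, ToolCall]


def consumption_path(payload: Conditioning) -> str:
    """Describe how the diffusion side consumes each payload, per the table."""
    if isinstance(payload, RewrittenPrompt):
        return "frozen text encoder -> cross-attention"
    if isinstance(payload, TextEmbedding):
        return "adapter -> cross-attention"
    if isinstance(payload, EncoderEnsemble):
        return "concatenate or cross-attend per encoder"
    if isinstance(payload, RegionPlan):
        return "cross-attention for captions + spatial conditioning for boxes"
    if isinstance(payload, ToolCall):
        return "one diffusion call per tool invocation"
    raise TypeError(f"unknown conditioning payload: {type(payload).__name__}")


plan = RegionPlan(regions=[((0.0, 0.0, 0.5, 1.0), "a red bicycle")])
print(consumption_path(plan))
```

The point of the sketch is that the rows differ in how much structure crosses the boundary: a plain string at one end, a sequence of tool invocations at the other.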

The Stable Diffusion 3 / Flux recipe in detail

SD3 (Esser et al. 2024) is the canonical example. It uses three text encoders, all frozen, whose outputs are combined for joint conditioning: CLIP-L, CLIP-G, and T5-XXL.

Flux 1 (Black Forest Labs 2024) is a close relative but uses two encoders: CLIP-L + T5-XXL (no CLIP-G). The likely reasoning: CLIP-L gives a contrastive-aligned summary; T5 carries the compositional / syntactic load; CLIP-G adds capacity at meaningful cost without a clear win. The backbone stays in the MM-DiT family; the conditioning interface is leaner.

SD3’s diffusion backbone is a flow-matching DiT (lessons 7–8). The two CLIP token sequences are concatenated along the channel axis and zero-padded to T5’s width; the result is concatenated with the T5 sequence along the sequence axis and fed into the DiT via joint attention: text and image tokens sit in the same attention pool and attend to each other bidirectionally. (SD3 calls this the “MM-DiT” block.) Only the image-token positions are denoised; the text positions produce no output of their own, but they take part in attention in every block.
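
To make the shapes concrete, here is a minimal pure-Python sketch of how the three encoder outputs could be assembled into one context sequence. It follows the layout described in the SD3 paper as I understand it (CLIP-L and CLIP-G joined along channels, zero-padded to T5's width, then joined with T5 along the sequence axis); the dimensions are assumptions taken from that reading, and `fake_features` is a hypothetical stand-in for the frozen encoders.

```python
from typing import List

Tokens = List[List[float]]  # [sequence, channels]

# Assumed shapes (my reading of the SD3 paper, not verified against code).
SEQ_LEN = 77
CLIP_L_DIM, CLIP_G_DIM, T5_DIM = 768, 1280, 4096


def fake_features(seq_len: int, dim: int, fill: float) -> Tokens:
    """Hypothetical stand-in for a frozen encoder's per-token output."""
    return [[fill] * dim for _ in range(seq_len)]


def concat_channels(a: Tokens, b: Tokens) -> Tokens:
    """Join two token sequences of equal length along the channel axis."""
    if len(a) != len(b):
        raise ValueError("channel concat needs equal sequence lengths")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def pad_channels(x: Tokens, width: int) -> Tokens:
    """Zero-pad every token on the right up to `width` channels."""
    if x and len(x[0]) > width:
        raise ValueError("tokens are already wider than the target width")
    return [row + [0.0] * (width - len(row)) for row in x]


def build_text_context(clip_l: Tokens, clip_g: Tokens, t5: Tokens) -> Tokens:
    clip = concat_channels(clip_l, clip_g)   # [77, 768 + 1280] = [77, 2048]
    clip = pad_channels(clip, len(t5[0]))    # [77, 4096]
    return clip + t5                         # sequence concat: [154, 4096]


context = build_text_context(
    fake_features(SEQ_LEN, CLIP_L_DIM, 1.0),
    fake_features(SEQ_LEN, CLIP_G_DIM, 2.0),
    fake_features(SEQ_LEN, T5_DIM, 3.0),
)
print(len(context), len(context[0]))
```

The zero-padding is what lets encoders of different widths share one sequence without a learned projection per encoder.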

Figure: prompt (text) → CLIP-L, CLIP-G, T5-XXL (all frozen) → MM-DiT (flow matching; joint attention over [text features, noisy image latents]) → image latents → VAE decode → pixels
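
The joint attention step can be sketched in a few lines. This is a simplified, single-head illustration in pure Python, not SD3's implementation: it assumes each modality has its own query/key/value projections (as the SD3 paper describes, to my understanding) and omits multi-head splitting, timestep modulation, output projections, and MLPs. All function names and the tiny dimensions are made up for the example.

```python
import math
import random
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """(n, d) @ (d, k) -> (n, k)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention with no mask, so every token sees every token."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        weights = softmax([scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k])
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def joint_attention(text: Matrix, image: Matrix,
                    w_text: Dict[str, Matrix],
                    w_image: Dict[str, Matrix]) -> Tuple[Matrix, Matrix]:
    # Project each modality with its own weights, then pool the sequences.
    q = matmul(text, w_text["q"]) + matmul(image, w_image["q"])
    k = matmul(text, w_text["k"]) + matmul(image, w_image["k"])
    v = matmul(text, w_text["v"]) + matmul(image, w_image["v"])
    mixed = attention(q, k, v)
    # Split back into the two streams; only the image stream is denoised downstream.
    return mixed[: len(text)], mixed[len(text):]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D = 4  # toy width
w_text = {name: random_matrix(D, D, rng) for name in "qkv"}
w_image = {name: random_matrix(D, D, rng) for name in "qkv"}
text = random_matrix(3, D, rng)    # 3 text tokens
image = random_matrix(5, D, rng)   # 5 noisy image-latent tokens
text_out, image_out = joint_attention(text, image, w_text, w_image)
print(len(text_out), len(image_out))
```

Because there is no mask, changing the text tokens changes the image-stream output and vice versa, which is the bidirectional behavior that separates this from plain cross-attention.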

Why three encoders?

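Different text encoders have different inductive biases. The CLIP encoders, trained contrastively against images, give an image-aligned summary of the prompt; T5, trained on text alone, carries the compositional and syntactic load.

Concatenating both gives the diffusion model access to both flavors of conditioning. Empirically, adding T5 is what made “a photo of A but not B” type prompts work far more reliably; CLIP alone tends to confuse negation.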

Where each design fails

Hybrid failure modes

When hybrid still wins (in 2026)

| Situation | Why hybrid |
| --- | --- |
| You want the absolute best raw image quality | frozen-best diffusion specialist + good text conditioning is still SOTA at fixed compute |
| You need to swap models often | two-tower lets you swap LLM or diffusion independently |
| Most queries are text-only | don’t pay image-vocab embedding cost on every turn |
| You have no joint training data | can use any LLM, any diffusion model; no interleaved data needed |
| Tight latency + simple prompts | cached text embedding + small diffusion = fastest per-query |

The pattern, unified

The whole series has been about where you put the boundary between language-shaped and pixel-shaped computation.

| Architecture | Language-shaped layer | Pixel-shaped layer | Glue |
| --- | --- | --- | --- |
| Pure diffusion (Part A) | text encoder (CLIP/T5) | DiT / UNet on continuous latents | cross-attention |
| Unified-token (lessons 13-14) | one transformer over text+image tokens | VQ decoder (small) | vocab and embedding table |
| Reasoning-augmented (lesson 14) | same as above, with CoT | parallel-decoded image tokens + optional diffusion polish | shared transformer + optional decoder hand-off |
| Hybrid (lesson 15) | separate LLM | separate diffusion model | prompt string or learned conditioning vector |

The four rows correspond to four points on a spectrum: how much do you commit to a shared representation between language and image computation? The full series has visited each.

Where to go from here

Punchline
Hybrid pipelines are not obsolete — they’re the right answer when you want to keep specialist quality and independent iteration. The unified-token route wins for joint reasoning, in-band editing, and shared-state generation. Pick by use case, not by hype.
End of series
Fifteen lessons. From DDPM’s closed-form marginal to the reasoning systems that compose on top. The code in generative_continuous/ is the Part-A implementation; everything in Part B is the architectural family on top of it.