Computer vision — detection to image generation
Vision is the home turf of backpropagation: convolutions, pixels, gradients flowing everywhere. So why is RL here at all? Because two of vision's most useful moves are not differentiable. Choosing where to look on an image is a discrete decision; scoring a generated image as "beautiful" or "faithful to the prompt" is a verdict you cannot differentiate through. Both gaps have the same shape, and we have already built the tool for that shape twice.
- Policy gradient — lesson 10. The proven estimator ∇J = 𝔼[ G · ∇log πθ ]. Whenever a decision is discrete or a reward is a black box, you cannot backprop through it — but you can sample it and weight its log-probability by the reward. That is the whole trick, applied here to "where the eye looks" and to "which image got generated."
- GANs ↔ RL / GAIL — lesson 17. A GAN is an adversarial game: a generator versus a discriminator. Lesson 17 made the bridge explicit — GAIL's discriminator is a learned reward feeding a policy-gradient learner. We return to that bridge to align modern image generators.
The unifying defect: non-differentiable decisions
Standard vision training is one long differentiable pipeline. Pixels go in, a loss comes out, and ∂loss/∂θ is defined at every step, so gradient descent works. RL appears precisely where that chain breaks:
| Where it breaks | The non-differentiable step | The RL fix |
|---|---|---|
| Hard attention | pick a discrete pixel location to glimpse next | REINFORCE on the location policy (L07) |
| Sequential detection | emit / stop emitting object boxes one at a time | policy gradient over the box sequence (L07) |
| Image generation alignment | a reward (aesthetics, prompt-match) is a black box | treat denoising as a trajectory, reward the final image (L07) |
| Adversarial generation | generator vs discriminator is a game, not a loss | the GAN ↔ RL view (L14) |
Keep one sentence in mind for the rest of the lesson: if you cannot differentiate the decision, sample it and weight its log-probability by the reward. Everything below is that sentence in a different costume.
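That sentence can be made concrete with a minimal sketch. The setup is an assumption for illustration only: a four-way discrete choice, a softmax policy over four logits, and a hypothetical `black_box_reward` that returns a number but no gradient. The update samples actions, scores them, and weights the gradient of each log-probability by the reward minus a batch-mean baseline.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_reward(action):
    # Hypothetical scorer: only action 2 pays off, and it exposes no gradient.
    return 1.0 if action == 2 else 0.0


def reinforce_step(logits, rng, lr=0.5, batch=64):
    """One score-function update: sample, score, weight grad log pi by reward."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=batch)
    rewards = [black_box_reward(a) for a in actions]
    baseline = sum(rewards) / batch  # batch mean as a simple baseline
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        for k in range(len(logits)):
            # d/d logit_k of log softmax(logits)[a] = 1[k == a] - probs[k]
            indicator = 1.0 if k == a else 0.0
            grad[k] += (r - baseline) * (indicator - probs[k]) / batch
    return [z + lr * g for z, g in zip(logits, grad)]


def train(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())  # probabilities after training; most mass should sit on action 2
```

Nothing in `reinforce_step` ever differentiates `black_box_reward`; the reward enters only as a scalar weight, which is why the same update applies to a glimpse location or a generated image.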
Hard attention: an agent that chooses where to look
A high-resolution image is expensive. Biology's answer is the fovea: the eye sees one small spot sharply and saccades — jumps — to the next informative spot, integrating a handful of glimpses into a percept. Hard attention (the Recurrent Attention Model, Mnih et al. 2014) copies this. The network never sees the full image. Instead, on each of N steps it extracts a small glimpse centred at a location it chose, updates a recurrent memory, and after its glimpse budget is spent it classifies.
This is an MDP, term for term (strictly a partially observed one, since the agent only ever sees its glimpses):
```
for t = 1 .. N:                          # N = the GLIMPSE BUDGET
    glimpse g_t = crop(image, ℓ_t)       # see a patch around chosen location ℓ_t
    h_t = RNN(h_{t-1}, g_t)              # integrate into memory (the state)
    ℓ_{t+1} ~ π_θ(· | h_t)               # CHOOSE where to look next (the action)
classify ŷ = head(h_N)                   # final decision after budget spent
reward r = 1 if ŷ == y else 0            # terminal, non-differentiable
```
The classifier head is trained by ordinary cross-entropy; that part is differentiable. The problem is the location policy πθ(ℓ | h): choosing where to look is a discrete sampling step. There is no gradient from "I looked here" back into "looking here was wise," because the reward (did the final guess turn out right?) arrives much later and depends on a chain of sampled locations. This is exactly the wall lesson 10 climbed. So we train the locations with REINFORCE:

$$\nabla_\theta J = \mathbb{E}\Big[(R - b)\sum_{t} \nabla_\theta \log \pi_\theta(\ell_{t+1} \mid h_t)\Big]$$
Read it with lesson-07 eyes: R is the terminal reward (1 for a correct class, 0 otherwise), b is a baseline that subtracts the average so below-average glimpse-sequences get pushed down, and the sum runs over the location decisions because the trajectory's log-probability factorizes over its steps. Glimpse policies that led to correct classifications get their locations made more likely; the rest get suppressed. The fork from the orientation is visible: the location head is the policy; the baseline is a tiny value estimate critiquing it.
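The update described above (terminal reward R, baseline b, a sum of log-probability gradients over the location decisions) can be sketched on a toy problem. Everything concrete here is an assumption for illustration, not the RAM architecture: a six-cell strip stands in for the image, the "memory" is just the last location visited, the policy is a table of softmax logits per memory state, the first location is also sampled from the policy, and finding the target stands in for a correct classification.

```python
import math
import random

N_CELLS = 6                           # hypothetical 1-D "image" with six locations
BUDGET = 2                            # glimpses per episode (N in the text)
TARGET_WEIGHTS = [1, 1, 1, 1, 8, 8]   # assumption: targets favour cells 4 and 5


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng):
    """Sample one glimpse sequence; return its steps and the terminal reward."""
    target = rng.choices(range(N_CELLS), weights=TARGET_WEIGHTS)[0]
    h = N_CELLS                       # memory state N_CELLS means "nothing seen yet"
    steps, found = [], False
    for _ in range(BUDGET):
        probs = softmax(theta[h])
        loc = rng.choices(range(N_CELLS), weights=probs)[0]
        steps.append((h, loc, probs))
        found = found or loc == target
        h = loc                       # toy memory: remember only the last location
    return steps, (1.0 if found else 0.0)


def train(episodes=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_CELLS for _ in range(N_CELLS + 1)]
    baseline = 0.0                    # running average of the terminal reward
    for _ in range(episodes):
        steps, reward = run_episode(theta, rng)
        advantage = reward - baseline           # (R - b)
        for h, loc, probs in steps:             # sum over location decisions
            for k in range(N_CELLS):
                grad_logp = (1.0 if k == loc else 0.0) - probs[k]
                theta[h][k] += lr * advantage * grad_logp
        baseline += 0.05 * (reward - baseline)
    return theta


def evaluate(theta, episodes=2000, seed=1):
    """Fraction of sampled episodes in which some glimpse landed on the target."""
    rng = random.Random(seed)
    return sum(run_episode(theta, rng)[1] for _ in range(episodes)) / episodes


if __name__ == "__main__":
    untrained = [[0.0] * N_CELLS for _ in range(N_CELLS + 1)]
    print("untrained:", evaluate(untrained))
    print("trained:  ", evaluate(train()))
```

The running-average baseline plays the supporting role the text describes: it never picks a location, it only recentres the reward so that below-average glimpse sequences are pushed down.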
Interactive · the foveating classifier
Below is a tiny hard-attention agent. The "image" is a grid: one bright cell is the target that decides the true class (which quadrant it sits in), and the rest is distractor texture. The agent has a glimpse budget: each glimpse reveals a small neighbourhood, and after the budget is spent it guesses the class from what it saw. A random glimpse policy scatters its budget and usually misses the target; a learned policy concentrates glimpses where the evidence has been, finding the target with far fewer looks. Turn the budget down with the learned policy off — watch accuracy collapse. That is the bug, and it is the lesson.
The widget compresses the real RAM training loop into a hand-built belief update so it runs in your browser, but the moral is faithful: a policy that decides where to look, a terminal reward (right class or not), and a budget that makes the decision matter. With the learned policy on, raising the budget buys accuracy fast; with it off, even a generous budget barely beats chance because random looks rarely land on the one cell that carries the signal.
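A rough back-of-the-envelope model shows why the random policy stays near chance. The model is an assumption, not the widget's actual code: glimpses are placed uniformly at random and independently, each covers a square of `glimpse_side` by `glimpse_side` cells (edge clipping ignored), the classes are equally likely, and the classifier guesses uniformly whenever the target was never seen.

```python
def random_glimpse_accuracy(grid_side, glimpse_side, budget, n_classes=4):
    """Expected accuracy of uniformly random glimpses under the stated assumptions."""
    # Chance that a single random glimpse covers the target cell.
    hit = min(1.0, glimpse_side ** 2 / grid_side ** 2)
    # Chance that at least one of `budget` independent glimpses covers it.
    p_found = 1.0 - (1.0 - hit) ** budget
    # Seen target: correct. Unseen target: uniform guess over the classes.
    return p_found + (1.0 - p_found) / n_classes


if __name__ == "__main__":
    for budget in (1, 4, 16, 64):
        print(budget, round(random_glimpse_accuracy(16, 1, budget), 3))
```

Under these assumptions the accuracy only creeps above `1 / n_classes` as the budget grows, because each extra random look adds a small, independent chance of hitting the one informative cell.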
Sequential object detection
The same idea scales past classification. Instead of one label, a detector must emit a set of bounding boxes — and "emit another box or stop" is a discrete, sequential decision. Treat detection as a trajectory: the agent attends to a region, decides a box, attends again, and decides when to halt. The reward is mean average precision (mAP) over the emitted set — a metric you cannot differentiate, computed only after the whole sequence is done. Policy gradient handles the discrete emit/stop and where-to-attend choices; a differentiable head still regresses the exact box coordinates. It is hard attention with a richer action space and the same L07 backbone.
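The emit-or-stop decision can be sketched in isolation. This is a toy under stated assumptions, not a detector: the candidate boxes are taken as already ranked (the hypothetical `IS_OBJECT` list), the policy is one emit-logit per step, and an F1 score over the emitted set stands in for mAP as the set-level reward that is only available once the sequence ends.

```python
import math
import random

# Hypothetical ranked candidates: the first three are real objects.
IS_OBJECT = [True, True, True, False, False, False]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def f1_reward(n_emitted):
    """Set-level score of the emitted prefix (a stand-in for mAP)."""
    tp = sum(IS_OBJECT[:n_emitted])
    if tp == 0:
        return 0.0
    precision = tp / n_emitted
    recall = tp / sum(IS_OBJECT)
    return 2 * precision * recall / (precision + recall)


def sample_episode(theta, rng):
    """Emit boxes one at a time until a stop is sampled or candidates run out."""
    decisions = []                    # (step, action, emit probability)
    for t in range(len(IS_OBJECT)):
        p_emit = sigmoid(theta[t])
        emit = rng.random() < p_emit
        decisions.append((t, 1.0 if emit else 0.0, p_emit))
        if not emit:
            break
    n_emitted = sum(int(a) for _, a, _ in decisions)
    return decisions, f1_reward(n_emitted)


def train(episodes=4000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(IS_OBJECT)
    baseline = 0.0
    for _ in range(episodes):
        decisions, reward = sample_episode(theta, rng)
        advantage = reward - baseline
        for t, action, p_emit in decisions:
            # d/d theta_t of log Bernoulli(action; sigmoid(theta_t)) = action - p_emit
            theta[t] += lr * advantage * (action - p_emit)
        baseline += 0.05 * (reward - baseline)
    return theta


if __name__ == "__main__":
    print([round(sigmoid(z), 2) for z in train()])  # learned emit probabilities
```

In a real detector the box coordinates would still come from a differentiable regression head, as the text notes; only the discrete emit/stop choices need the sampled update.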
Image generation: aligning a generator with a reward
Now the second gap. A diffusion model generates an image by starting from pure noise and denoising over T steps. Train it the usual way and it learns to match a data distribution, but it has no notion of "this image is more beautiful" or "this image is more faithful to the prompt." Those are exactly the judgments we want, and they come from a black box: a learned aesthetic scorer, a CLIP prompt-similarity score, a human preference. You cannot backprop a clean gradient through "a human liked it."
So make the move from lesson 16, but for pixels. View the T denoising steps as a trajectory, the partially-denoised latent at each step as the state, the denoiser's stochastic update as the action, and the reward as a single scalar handed back after the final image is scored. Then the diffusion model is a policy and we optimize it with the lesson-07 gradient. This is DDPO (Denoising Diffusion Policy Optimization), or RLHF-for-diffusion:

$$\nabla_\theta J = \mathbb{E}\Big[\big(r(x_0) - b\big)\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t)\Big]$$
where x₀ is the finished image and r(x₀) is the black-box reward. Term for term this is the RAM gradient above and the BLEU gradient from lesson 62: a terminal, non-differentiable reward, a trajectory whose log-probability factorizes over steps, a baseline to cut variance. The exact same machinery aligns a text generator (RLHF), a translator (BLEU), and an image generator (DDPO). And it inherits the same hazard, reward hacking: optimize an aesthetic scorer too hard and you get over-saturated, uncanny images that score high and look wrong, which is why a KL anchor to the original model (the lesson-13 trick) earns its keep here too.
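The trajectory view and the KL anchor can be sketched on a scalar toy. This is a sketch under loud assumptions, not the DDPO implementation: the "image" is a single number, each denoising step is a Gaussian with mean `DECAY * x + theta` and fixed standard deviation `SIGMA`, the hypothetical `black_box_reward` stands in for an aesthetic or prompt-match scorer, and the KL term is computed analytically against a frozen reference parameter `theta_ref` rather than estimated from samples.

```python
import random

T = 5          # denoising steps
SIGMA = 0.5    # fixed std of each stochastic denoising update
DECAY = 0.5    # hypothetical: each step keeps half of the current latent


def black_box_reward(x0):
    # Stand-in scorer; only its value is used, never its gradient.
    return -(x0 - 2.0) ** 2


def rollout(theta, rng):
    """Denoise from x_T ~ N(0, 1); return x_0 and sum_t d/dtheta log pi(x_{t-1} | x_t)."""
    x = rng.gauss(0.0, 1.0)
    score = 0.0
    for _ in range(T):
        mean = DECAY * x + theta
        x_next = rng.gauss(mean, SIGMA)
        # Gaussian log-density gradient with respect to theta (d mean / d theta = 1).
        score += (x_next - mean) / SIGMA ** 2
        x = x_next
    return x, score


def train(kl_coef=0.0, theta_ref=0.0, iters=300, batch=32, lr=0.01, seed=0):
    rng = random.Random(seed)
    theta = theta_ref
    for _ in range(iters):
        samples = [rollout(theta, rng) for _ in range(batch)]
        rewards = [black_box_reward(x0) for x0, _ in samples]
        baseline = sum(rewards) / batch
        grad = sum((r - baseline) * s for r, (_, s) in zip(rewards, samples)) / batch
        # Per-step KL between equal-variance Gaussians is (theta - theta_ref)^2 / (2 sigma^2);
        # summed over T steps, its gradient is T * (theta - theta_ref) / sigma^2.
        kl_grad = T * (theta - theta_ref) / SIGMA ** 2
        theta += lr * (grad - kl_coef * kl_grad)
    return theta


if __name__ == "__main__":
    print("no anchor:  ", train())
    print("KL anchored:", train(kl_coef=0.5))
```

With `kl_coef` at zero the parameter drifts wherever the scorer pays best; a positive `kl_coef` pulls it back toward `theta_ref`, which is the toy version of keeping the aligned generator close to the original model.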
Map back to the spine
Both halves of computer-vision RL are one idea wearing two hats:
| Vision task | Non-differentiable thing | Spine method |
|---|---|---|
| Hard attention / RAM | discrete "where to look" choice | policy + tiny value baseline — REINFORCE (L03/07) |
| Sequential detection | discrete emit/stop over a box sequence | policy gradient on a trajectory (L07) |
| Diffusion alignment (DDPO) | black-box reward on the final image | RLHF for pixels — trajectory PG + KL anchor (L07/13) |
| GAN generation | adversarial game, not a loss | discriminator as reward — the GAN ↔ RL bridge (L14) |
Plot it on the value / policy / model map from the orientation. There is no model of pixel dynamics being learned and no environment to plan in, so the model corner stays empty. Everything lives on the policy side of the fork: vision's RL is policy gradient applied wherever the decision is discrete (where to look, what to emit) or the reward is a black box (is this image good?). The value appears only in its supporting role — the baseline that tames REINFORCE's variance. Same fork, same variance villain, now pointed at pixels.