Reinforcement learning · Lesson 64 of 87

Computer vision — detection to image generation

Vision is the home turf of backpropagation: convolutions, pixels, gradients flowing everywhere. So why is RL here at all? Because two of vision's most useful moves are not differentiable. Choosing where to look on an image is a discrete decision; scoring a generated image as "beautiful" or "faithful to the prompt" is a verdict you cannot differentiate through. Both gaps have the same shape, and we have already built the tool for that shape twice.


The unifying defect: non-differentiable decisions

Standard vision training is one long differentiable pipeline. Pixels go in, a loss comes out, and ∂loss/∂θ is defined at every step, so gradient descent works. RL appears precisely where that chain breaks:

| Where it breaks | The non-differentiable step | The RL fix |
| --- | --- | --- |
| Hard attention | pick a discrete pixel location to glimpse next | REINFORCE on the location policy (L07) |
| Sequential detection | emit / stop emitting object boxes one at a time | policy gradient over the box sequence (L07) |
| Image generation alignment | a reward (aesthetics, prompt-match) is a black box | treat denoising as a trajectory, reward the final image (L07) |
| Adversarial generation | generator vs discriminator is a game, not a loss | the GAN ↔ RL view (L14) |

Keep one sentence in mind for the rest of the lesson: if you cannot differentiate the decision, sample it and weight its log-probability by the reward. Everything below is that sentence in a different costume.
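To make that sentence concrete, here is a minimal sketch on the smallest possible case: one discrete choice among four options, scored by a reward that can only be queried. The four options, `blackBoxReward`, the step sizes, and the `mulberry32` random-number helper are all illustrative assumptions, not code from an earlier lesson.

```javascript
// Score-function (REINFORCE) update for a single discrete, non-differentiable choice.
function mulberry32(seed) {          // small seeded PRNG so runs are repeatable
  let a = seed | 0;
  return () => {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function softmax(logits) {
  const m = Math.max(...logits);
  const e = logits.map(v => Math.exp(v - m));
  const z = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / z);
}

function sampleIndex(probs, rng) {
  let u = rng();
  for (let i = 0; i < probs.length; i++) {
    u -= probs[i];
    if (u < 0) return i;
  }
  return probs.length - 1;           // guard against rounding
}

// Hypothetical verdict: we can call it, but we cannot differentiate through it.
const blackBoxReward = k => (k === 2 ? 1 : 0);

function trainChoice(steps = 5000, lr = 0.1, seed = 1) {
  const rng = mulberry32(seed);
  const logits = [0, 0, 0, 0];       // policy parameters
  let baseline = 0;                  // running mean of past rewards
  for (let s = 0; s < steps; s++) {
    const p = softmax(logits);
    const k = sampleIndex(p, rng);   // sample the decision
    const r = blackBoxReward(k);     // observe the reward
    const advantage = r - baseline;
    // For a softmax policy, d log p[k] / d logit[j] = 1[j == k] - p[j].
    for (let j = 0; j < logits.length; j++) {
      logits[j] += lr * advantage * ((j === k ? 1 : 0) - p[j]);
    }
    baseline += 0.05 * (r - baseline);
  }
  return softmax(logits);
}

console.log(trainChoice().map(v => v.toFixed(3)));
```

With these settings most of the probability mass should end up on option 2; the exact numbers depend on the seed. No gradient of `blackBoxReward` is ever taken, only the gradient of the log-probability of the sampled choice.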

Hard attention: an agent that chooses where to look

A high-resolution image is expensive. Biology's answer is the fovea: the eye sees one small spot sharply and saccades — jumps — to the next informative spot, integrating a handful of glimpses into a percept. Hard attention (the Recurrent Attention Model, Mnih et al. 2014) copies this. The network never sees the full image. Instead, on each of N steps it extracts a small glimpse centred at a location it chose, updates a recurrent memory, and after its glimpse budget is spent it classifies.

This is an MDP, term for term:

for t = 1 .. N:                       # N = the GLIMPSE BUDGET
    glimpse  g_t  = crop(image, ℓ_t)  # see a patch around chosen location ℓ_t
    h_t          = RNN(h_{t-1}, g_t)  # integrate into memory  (the state)
    ℓ_{t+1} ~ π_θ(· | h_t)            # CHOOSE where to look next  (the action)
classify  ŷ = head(h_N)              # final decision after budget spent
reward    r = 1 if ŷ == y else 0     # terminal, non-differentiable

The classifier head is trained by ordinary cross-entropy — that part is differentiable. The problem is the location policy πθ(ℓ | h): choosing where to look is a discrete sampling step. There is no gradient from "I looked here" back into "looking here was wise," because the reward (did the final guess turn out right?) arrives much later and depends on a chain of sampled locations. This is exactly the wall lesson 10 climbed. So we train the locations with REINFORCE:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[ (R - b) \cdot \sum_{t=1}^{N} \nabla_\theta \log \pi_\theta(\ell_t \mid h_t) \right]$$

Read it with lesson-07 eyes: R is the terminal reward (1 for a correct class, 0 otherwise), b is a baseline that subtracts the average so below-average glimpse-sequences get pushed down, and the sum runs over the location decisions because the trajectory's log-probability factorizes over its steps. Glimpse policies that led to correct classifications get their locations made more likely; the rest get suppressed. The fork from the orientation is visible: the location head is the policy; the baseline is a tiny value estimate critiquing it.
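One step worth spelling out is why subtracting b is allowed at all. A short derivation, assuming b does not depend on the sampled location ℓ_t (it may depend on the memory h_t) and treating the location as a discrete choice as this lesson does:

```latex
\begin{aligned}
\mathbb{E}_{\ell_t \sim \pi_\theta(\cdot \mid h_t)}\!\left[\, b \,\nabla_\theta \log \pi_\theta(\ell_t \mid h_t) \right]
  &= b \sum_{\ell} \pi_\theta(\ell \mid h_t)\,
     \frac{\nabla_\theta \pi_\theta(\ell \mid h_t)}{\pi_\theta(\ell \mid h_t)} \\
  &= b \,\nabla_\theta \sum_{\ell} \pi_\theta(\ell \mid h_t) \\
  &= b \,\nabla_\theta 1 \;=\; 0 .
\end{aligned}
```

So weighting by (R − b) has the same expected gradient as weighting by R alone; the baseline only changes the variance of the estimate.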

Why not just look everywhere?
Because the point is a budget. Hard attention buys you sub-linear cost in image size — a 4-megapixel image classified from six 24×24 glimpses — and an interpretable trace of where the model attended. The cost is that the "where to look" gradient is a high-variance REINFORCE estimate, with all the variance pathologies of lesson 10. Spend the budget badly and you classify from noise.
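The arithmetic behind that example can be checked in a few lines. This is a rough count of pixels read only: it assumes "4-megapixel" means 2000 × 2000, ignores any overlap between glimpses, and says nothing about the compute inside the glimpse network. `glimpseCost` is a name made up for this sketch.

```javascript
// Pixels read by a glimpse budget versus one full pass over the image.
function glimpseCost(imageWidth, imageHeight, numGlimpses, glimpseSide) {
  const fullPixels = imageWidth * imageHeight;
  const glimpsedPixels = numGlimpses * glimpseSide * glimpseSide;
  return { fullPixels, glimpsedPixels, fraction: glimpsedPixels / fullPixels };
}

const cost = glimpseCost(2000, 2000, 6, 24);
console.log(cost.glimpsedPixels);                      // 3456
console.log((100 * cost.fraction).toFixed(4) + "%");   // 0.0864%

// Doubling the image side leaves the number of glimpsed pixels unchanged.
console.log(glimpseCost(4000, 4000, 6, 24).glimpsedPixels);  // 3456
```

Under these assumptions the six glimpses touch under a tenth of a percent of the image, and that count stays fixed as the image grows, which is what makes the budget attractive.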

Interactive · the foveating classifier

Below is a tiny hard-attention agent. The "image" is a grid: one bright cell is the target that decides the true class (which quadrant it sits in), and the rest is distractor texture. The agent has a glimpse budget: each glimpse reveals a small neighbourhood, and after the budget is spent it guesses the class from what it saw. A random glimpse policy scatters its budget and usually misses the target; a learned policy concentrates glimpses where the evidence has been, finding the target with far fewer looks. Turn the budget down with the learned policy off — watch accuracy collapse. That is the bug, and it is the lesson.

Hard attention: spend your glimpses well
Each trial hides one target cell on a 12×12 grid. The agent takes budget glimpses (yellow rings = where it looked), then guesses the quadrant. Toggle the learned policy and slide the budget; the right panel tracks classification accuracy over many trials.
The glimpse-policy core:
// Learned policy: a saliency belief over cells, sharpened by what each
// glimpse reveals. This stands in for the REINFORCE-trained location head:
// look where evidence has been, not uniformly at random.
// Helpers (randCell, sampleByWeight, neighbourhood, proximity, consistency,
// normalize) are assumed to be defined elsewhere in the widget.
function chooseLocation(belief, learned) {
  if (!learned) return randCell();          // random budget = the bug
  // sample proportional to current belief (explore + exploit)
  return sampleByWeight(belief);
}
function glimpse(loc, target) {
  // reveal a 3x3 neighbourhood; report whether target is inside,
  // and a noisy "warmer/colder" distance cue for nearby cells.
  const seen = neighbourhood(loc);
  const hit = seen.includes(target);
  // on a hit the glimpse exposes the target cell itself, so pass it along
  return { hit, found: hit ? target : null, cue: proximity(loc, target) };
}
function updateBelief(belief, loc, obs) {
  // belief is an array indexed by cell; it is updated in place
  if (obs.hit) { belief.fill(0); belief[obs.found] = 1; return; }  // certain
  // push belief mass toward cells consistent with the proximity cue
  for (let c = 0; c < belief.length; c++) belief[c] *= consistency(c, loc, obs.cue);
  normalize(belief);
}
// after `budget` glimpses, classify = quadrant of argmax(belief)

The widget compresses the real RAM training loop into a hand-built belief update so it runs in your browser, but the moral is faithful: a policy that decides where to look, a terminal reward (right class or not), and a budget that makes the decision matter. With the learned policy on, raising the budget buys accuracy fast; with it off, even a generous budget barely beats chance because random looks rarely land on the one cell that carries the signal.
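For readers without the widget, the comparison can be approximated offline. The sketch below is a self-contained re-creation, not the widget's source. The Chebyshev distance cue with ±1 noise, the 0.05 down-weighting of inconsistent cells, the seeded `mulberry32` helper, and the choice to let the random baseline ignore the cue entirely are assumptions made here to isolate the effect of choosing where to look.

```javascript
// Toy re-creation of the foveating classifier: belief-guided vs random glimpses.
const N = 12;                       // grid side, as in the widget caption
const CELLS = N * N;

function mulberry32(seed) {         // small seeded PRNG so runs are repeatable
  let a = seed | 0;
  return () => {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Chebyshev distance between cell indices; a 3x3 glimpse covers distance <= 1.
function dist(a, b) {
  const dr = Math.abs(Math.floor(a / N) - Math.floor(b / N));
  const dc = Math.abs((a % N) - (b % N));
  return Math.max(dr, dc);
}

const quadrant = c => (Math.floor(c / N) < N / 2 ? 0 : 2) + (c % N < N / 2 ? 0 : 1);

function sampleByWeight(weights, rng) {
  const total = weights.reduce((s, w) => s + w, 0);
  let u = rng() * total;
  let last = 0;
  for (let i = 0; i < weights.length; i++) {
    if (weights[i] <= 0) continue;
    last = i;
    u -= weights[i];
    if (u < 0) return i;
  }
  return last;                      // rounding guard: last cell with positive weight
}

function glimpse(loc, target, rng) {
  const d = dist(loc, target);
  const noise = Math.floor(rng() * 3) - 1;            // -1, 0 or +1
  return { found: d <= 1 ? target : -1, cue: d + noise };
}

function updateBelief(belief, loc, obs, useCue) {
  for (let c = 0; c < CELLS; c++) {
    if (obs.found >= 0) belief[c] = c === obs.found ? 1 : 0;   // target seen: certain
    else if (dist(c, loc) <= 1) belief[c] = 0;                 // looked here, nothing there
    else if (useCue && Math.abs(dist(c, loc) - obs.cue) > 1) belief[c] *= 0.05;
  }
}

function runTrials(learned, budget, trials, seed) {
  const rng = mulberry32(seed);
  let correct = 0, foundCount = 0;
  for (let n = 0; n < trials; n++) {
    const target = Math.floor(rng() * CELLS);
    const belief = new Array(CELLS).fill(1);
    let found = false;
    for (let g = 0; g < budget && !found; g++) {
      const loc = learned ? sampleByWeight(belief, rng) : Math.floor(rng() * CELLS);
      const obs = glimpse(loc, target, rng);
      updateBelief(belief, loc, obs, learned);
      found = obs.found >= 0;
    }
    const guess = quadrant(belief.indexOf(Math.max(...belief)));
    if (guess === quadrant(target)) correct++;
    if (found) foundCount++;
  }
  return { accuracy: correct / trials, foundRate: foundCount / trials };
}

for (const learned of [false, true]) {
  const { accuracy, foundRate } = runTrials(learned, 6, 2000, 7);
  console.log(learned ? "belief-guided" : "random", accuracy.toFixed(3), foundRate.toFixed(3));
}
```

Under these assumptions the belief-guided policy should find the target, and therefore name the right quadrant, far more often than the random one at the same budget. As in the widget, the "learned" policy here is a hand-built belief update standing in for a trained location head.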

Sequential object detection

The same idea scales past classification. Instead of one label, a detector must emit a set of bounding boxes — and "emit another box or stop" is a discrete, sequential decision. Treat detection as a trajectory: the agent attends to a region, decides a box, attends again, and decides when to halt. The reward is mean average precision (mAP) over the emitted set — a metric you cannot differentiate, computed only after the whole sequence is done. Policy gradient handles the discrete emit/stop and where-to-attend choices; a differentiable head still regresses the exact box coordinates. It is hard attention with a richer action space and the same L07 backbone.
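The emit/stop part of that trajectory can be sketched on its own. The code below assumes a hypothetical policy that outputs one stop logit per step and a detection score computed elsewhere (the 0.8 reward and 0.5 baseline are made-up numbers); the where-to-attend choices and the differentiable box regression are left out.

```javascript
// Log-probability of an emit/stop sequence under a per-step Bernoulli stop policy,
// and the REINFORCE gradient with respect to each step's stop logit.
const sigmoid = z => 1 / (1 + Math.exp(-z));

function emitStopScore(stopLogits, actions) {
  // actions[t] is "emit" or "stop"; the sequence ends at the first "stop".
  let logProb = 0;
  const gradLogProb = new Array(stopLogits.length).fill(0);
  for (let t = 0; t < actions.length; t++) {
    const pStop = sigmoid(stopLogits[t]);
    if (actions[t] === "stop") {
      logProb += Math.log(pStop);
      gradLogProb[t] = 1 - pStop;      // d/dz log sigmoid(z)
      break;
    }
    logProb += Math.log(1 - pStop);
    gradLogProb[t] = -pStop;           // d/dz log(1 - sigmoid(z))
  }
  return { logProb, gradLogProb };
}

function reinforceGradient(stopLogits, actions, reward, baseline) {
  const { gradLogProb } = emitStopScore(stopLogits, actions);
  // Weight the whole sequence's score by the terminal, non-differentiable reward.
  return gradLogProb.map(g => (reward - baseline) * g);
}

const grad = reinforceGradient([0, 0, 0, 0], ["emit", "emit", "stop"], 0.8, 0.5);
console.log(grad);
```

Because the reward beats the baseline here, the ascent direction lowers the stop logit at the two steps where the agent emitted and raises it at the step where it stopped; the unused fourth step gets no gradient.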

Image generation: aligning a generator with a reward

Now the second gap. A diffusion model generates an image by starting from pure noise and denoising over T steps. Train it the usual way and it learns to match a data distribution — but it has no notion of "this image is more beautiful" or "this image is more faithful to the prompt." Those are exactly the judgments we want, and they come from a black box: a learned aesthetic scorer, a CLIP prompt-similarity, a human preference. You cannot backprop a clean gradient through "a human liked it."

So make the move from lesson 16, but for pixels. View the T denoising steps as a trajectory, the partially-denoised latent at each step as the state, the denoiser's stochastic update as the action, and the reward as a single scalar handed back after the final image is scored. Then the diffusion model is a policy and we optimize it with the lesson-07 gradient — this is DDPO (Denoising Diffusion Policy Optimization) / RLHF-for-diffusion:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[ r(x_0) \cdot \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t) \right]$$

where x0 is the finished image and r(x0) is the black-box reward. Term for term this is the RAM gradient above and the BLEU gradient from lesson 62: a terminal, non-differentiable reward, a trajectory whose log-probability factorizes over steps, a baseline to cut variance. The exact same machinery aligns a text generator (RLHF), a translator (BLEU), and an image generator (DDPO). And it inherits the same hazard — reward hacking: optimize an aesthetic scorer too hard and you get over-saturated, uncanny images that score high and look wrong, which is why a KL anchor to the original model (the lesson-13 trick) earns its keep here too.
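The denoising-as-trajectory gradient can be tried on a one-dimensional toy. This is a sketch under heavy assumptions, not the DDPO implementation: the "image" is a single number, each denoising step is x_{t−1} ~ N(x_t + θ, σ²) with one learnable shift θ, the black-box reward is a made-up quadratic, the baseline is the batch-mean reward, and the KL anchor is omitted. All names and constants are hypothetical.

```javascript
// Toy 1-D stand-in for "denoising as a trajectory" trained with the score-function gradient.
function mulberry32(seed) {          // small seeded PRNG so runs are repeatable
  let a = seed | 0;
  return () => {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function trainToyDenoiser({ T = 5, sigma = 0.5, target = 3, lr = 0.002,
                            batch = 32, iters = 300, seed = 3 } = {}) {
  const rng = mulberry32(seed);
  // Box-Muller transform for standard normal samples.
  const gauss = () => Math.sqrt(-2 * Math.log(1 - rng())) * Math.cos(2 * Math.PI * rng());
  // Black-box reward on the finished sample: evaluated, never differentiated.
  const reward = x0 => -((x0 - target) ** 2);
  let theta = 0;                                    // the only policy parameter
  for (let it = 0; it < iters; it++) {
    const rewards = [];
    const scores = [];
    for (let b = 0; b < batch; b++) {
      let x = gauss();                              // x_T: pure noise
      let score = 0;                                // sum over t of d/dtheta log pi(x_{t-1} | x_t)
      for (let t = T; t >= 1; t--) {
        const next = x + theta + sigma * gauss();   // sample x_{t-1} ~ N(x_t + theta, sigma^2)
        score += (next - x - theta) / (sigma * sigma);
        x = next;
      }
      rewards.push(reward(x));                      // x is now x_0
      scores.push(score);
    }
    const baseline = rewards.reduce((s, r) => s + r, 0) / batch;
    let grad = 0;
    for (let b = 0; b < batch; b++) grad += (rewards[b] - baseline) * scores[b];
    theta += lr * grad / batch;                     // gradient ascent on expected reward
  }
  return theta;
}

const theta = trainToyDenoiser();
console.log("shift per step:", theta.toFixed(3), "mean of x_0:", (5 * theta).toFixed(3));
```

With the default settings the mean of x_0 should drift toward the rewarded value of 3, driven only by reward evaluations and log-probability gradients of the sampled steps. In a real diffusion model the shift would be the denoiser network's output, and the same unregularized update is what makes a KL anchor to the original model necessary.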

The GAN ↔ RL bridge, returned to
Lesson 17 showed GAIL as a GAN whose discriminator is a learned reward feeding a policy-gradient learner. A plain GAN is the same shape: a generator (the policy producing images) plays against a discriminator (a learned critic scoring real-vs-fake) — an adversarial game, not a fixed loss. Reading the discriminator's score as the generator's reward turns "make realistic images" into "maximize a learned reward," exactly the policy-improvement half of the fork. DDPO simply swaps the adversarial discriminator for an explicit reward model (aesthetics, prompt-match), and uses a real trajectory — the denoising chain — instead of a single generation step.

Map back to the spine

Both halves of computer-vision RL are one idea wearing two hats:

| Vision task | Non-differentiable thing | Spine method |
| --- | --- | --- |
| Hard attention / RAM | discrete "where to look" choice | policy + tiny value baseline: REINFORCE (L03/07) |
| Sequential detection | discrete emit/stop over a box sequence | policy gradient on a trajectory (L07) |
| Diffusion alignment (DDPO) | black-box reward on the final image | RLHF for pixels: trajectory PG + KL anchor (L07/13) |
| GAN generation | adversarial game, not a loss | discriminator as reward: the GAN ↔ RL bridge (L14) |

Plot it on the value / policy / model map from the orientation. There is no model of pixel dynamics being learned and no environment to plan in, so the model corner stays empty. Everything lives on the policy side of the fork: vision's RL is policy gradient applied wherever the decision is discrete (where to look, what to emit) or the reward is a black box (is this image good?). The value appears only in its supporting role — the baseline that tames REINFORCE's variance. Same fork, same variance villain, now pointed at pixels.

Takeaway
RL enters computer vision exactly where backpropagation cannot: a discrete decision (where to foveate, which box to emit) or a non-differentiable reward (is this generated image beautiful, is it on-prompt). The cure is the lesson-07 reflex — sample the decision, weight its log-probability by the reward, subtract a baseline — whether the policy is a glimpse network (hard attention) or a denoising diffusion model (DDPO). Generation alignment is just RLHF for pixels, and the adversarial generator-vs-discriminator game is the same GAN ↔ RL bridge we met with GAIL in lesson 17.