Imitation learning & inverse RL
Every method so far — value-based, policy-based, RLHF — assumed a reward signal exists: a scalar the environment hands back, or a model that produces one. But what if all you have is a stack of expert demonstrations and no reward function at all? This lesson covers the two answers: copy the actions (imitation), or recover the reward behind them (inverse RL).
The new setup: demonstrations, no reward
Drop the reward R(s,a) from the MDP. What remains is a set of expert trajectories
$$\mathcal{D} = \{\tau_1, \dots, \tau_N\}, \qquad \tau_i = (s_0, a_0, s_1, a_1, \dots)$$
generated by some expert policy πE whose underlying reward we cannot observe. This is the common case in the real world: a human drives a car well, a surgeon moves an instrument well, a pilot lands a plane well, but none of them can write down the R(s,a) they are implicitly optimizing. We only have a record of what they did.
There are two families of answers, and they map exactly onto the fork from orientation:
- Imitation learning — copy the policy directly. "In states like this, the expert did a, so I will too." (Behavioral cloning, DAgger.)
- Inverse RL (IRL) — recover the reward R the expert seems to optimize, then run ordinary forward RL on it. Don't copy the moves; infer the intent, then re-solve.
Behavioral cloning = supervised learning of the policy
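The simplest idea: treat each (s, a) pair from the demonstrations as a labeled example and fit a policy by supervised learning. The state is the input, the expert's action is the label, and the policy parameters θ maximize the log-likelihood of the expert's actions over the demonstrated pairs:
$$\max_\theta \sum_{(s,a) \in \mathcal{D}} \log \pi_\theta(a \mid s)$$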
That is it — behavioral cloning (BC) is plain maximum-likelihood classification (discrete actions) or regression (continuous actions). No environment, no rollouts, no reward. It is fast, stable, and often the first thing anyone tries. It is also the thing that quietly breaks the moment you deploy it.
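As a minimal sketch of the continuous-action case, assume a hypothetical 1-D expert that steers back toward zero with a = −k·s, and a linear policy a = w·s. Under a fixed-variance Gaussian policy, maximum likelihood reduces to least squares, so the fit has a closed form. The names `expert_action`, `collect_demos`, `fit_bc_linear` and the gain 0.8 are illustrative assumptions, not part of the lesson.

```python
import random

K_EXPERT = 0.8  # assumed expert gain (illustrative)

def expert_action(s):
    """Hypothetical expert: proportional controller steering back to center."""
    return -K_EXPERT * s

def collect_demos(n=200, band=0.1, seed=0):
    """Sample (state, action) pairs from the narrow band of states the expert visits."""
    rng = random.Random(seed)
    states = [rng.uniform(-band, band) for _ in range(n)]
    return [(s, expert_action(s)) for s in states]

def fit_bc_linear(demos):
    """Least-squares fit of a = w * s, i.e. the MLE for a fixed-variance Gaussian policy."""
    sxx = sum(s * s for s, _ in demos)
    sxy = sum(s * a for s, a in demos)
    return sxy / sxx

demos = collect_demos()
w = fit_bc_linear(demos)
print(f"fitted gain w = {w:.3f}")  # should be close to -K_EXPERT
```

Note that nothing here touches an environment: the fit only ever sees states inside the demonstrated band.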
The failure: compounding error / covariate shift
BC is trained only on states the expert visited — the thin ribbon of "good" states down the middle of the road. Suppose your cloned policy is almost perfect, making a small error with probability ε at each step. One small error nudges you slightly off the expert's path, into a state the expert never demonstrated. There, your policy has no training signal — it is extrapolating — so it errs again, pushing you further out, into an even stranger state. Errors don't average out; they compound.
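The classic analysis: a supervised learner with per-step error ε on the expert's distribution, run for a horizon T, accumulates a total cost that grows like
$$O(\varepsilon T^2),$$
not the ε·T it would pay if every mistake were made from a state the expert had demonstrated.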
That extra factor of T is the whole story. The mismatch between the state distribution the policy was trained on (πE's) and the one it induces at test time (πθ's) is called covariate shift. It is the central pathology of imitation — and, foreshadowing lesson 16, the same distribution-shift demon that haunts offline RL.
expert ribbon:  ····●●●●●●●●●●●●●●····   stays in demonstrated states
BC, one slip:   ····●●●●●○               slip → unseen state → no signal →
                          ╲○             bigger slip → stranger state → …
                            ╲○___        off the road (compounding error)
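The quadratic growth can be seen in a worst-case toy model. The sketch below assumes, purely for illustration, that the learner slips with probability ε per step while on the expert's ribbon, and that once it has slipped it never recovers and pays cost 1 at every remaining step. `expected_cost` is a hypothetical helper, not something from the lesson.

```python
def expected_cost(eps, T):
    """Expected total cost over T steps in the no-recovery toy model.

    The cost at step t is 1 exactly when a slip happened at or before step t,
    which has probability 1 - (1 - eps)**t.
    """
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))

eps = 1e-4
for T in (50, 100, 200):
    cost = expected_cost(eps, T)
    print(f"T={T:4d}  cost={cost:.4f}  eps*T={eps * T:.4f}  eps*T^2/2={eps * T * T / 2:.4f}")
```

For small ε·T the cost tracks ε·T²/2, so doubling the horizon roughly quadruples it. A learner that always landed back on the ribbon after a slip would pay only about ε·T.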
DAgger: ask the expert about your states
The diagnosis points straight at the cure. The problem is that BC only ever sees the expert's states; the fix is to get labels on the states the learner actually visits. DAgger (Dataset Aggregation) does exactly this, in a loop:
D ← expert demonstrations
repeat:
    train π_θ on D                       # behavioral cloning on current data
    roll out π_θ in the environment      # visit the LEARNER's own states s
    for each visited state s:
        a* ← π^E(s)                      # ASK THE EXPERT what it would do here
        D ← D ∪ {(s, a*)}                # aggregate the correction
The key line is "ask the expert what it would do here": the expert labels the learner's mistakes, in the off-distribution states the learner drifts into. After a few iterations the training set covers not just the ideal ribbon but also the recovery states ("you're drifting right, steer left"), so the policy learns to come back. DAgger provably converts the ε·T² blow-up back into a benign ε·T. The price: you need an expert you can query online on arbitrary states, which is exactly what you don't have if your expert is a fixed log or a busy human.
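The loop above can be sketched in a few lines on a 1-D toy. Everything specific here is an assumption made for illustration: the dynamics s' = s + a + noise, the expert gain, the noise levels, and above all the learner, which is a tabular policy that averages expert actions per state bin and outputs 0 in bins it has never seen (a stand-in for a model with no training signal off-distribution).

```python
import random

K = 0.5  # assumed expert gain (illustrative)

def expert(s):
    """Hypothetical queryable expert: steer back toward the lane center."""
    return -K * s

def fit(data, width=0.25):
    """'Train' a tabular policy: the mean expert action per state bin.

    Assumption of this sketch: bins with no data produce action 0.
    """
    sums = {}
    for s, a in data:
        b = round(s / width)
        total, n = sums.get(b, (0.0, 0))
        sums[b] = (total + a, n + 1)
    table = {b: total / n for b, (total, n) in sums.items()}
    return lambda s: table.get(round(s / width), 0.0)

def rollout(policy, rng, noise, T=50):
    """Run s' = s + policy(s) + noise from the center; return the visited states."""
    s, visited = 0.0, []
    for _ in range(T):
        visited.append(s)
        s = s + policy(s) + rng.gauss(0.0, noise)
    return visited

def dagger(n_iters, rng, demo_noise=0.05, test_noise=0.3):
    # D <- expert demonstrations, collected on the expert's own narrow ribbon
    data = [(s, expert(s)) for s in rollout(expert, rng, demo_noise)]
    policy = fit(data)  # with n_iters = 0 this is plain behavioral cloning
    for _ in range(n_iters):
        states = rollout(policy, rng, test_noise)  # visit the learner's own states
        data += [(s, expert(s)) for s in states]   # ask the expert, aggregate
        policy = fit(data)                         # retrain on the aggregated set
    return policy

def mean_abs_offset(policy, rng, noise=0.3, episodes=200):
    """Average distance from the lane center over many noisy test rollouts."""
    states = [s for _ in range(episodes) for s in rollout(policy, rng, noise)]
    return sum(abs(s) for s in states) / len(states)

rng = random.Random(0)
bc_err = mean_abs_offset(dagger(0, rng), rng)
dagger_err = mean_abs_offset(dagger(10, rng), rng)
print(f"BC mean |offset| = {bc_err:.2f}, DAgger mean |offset| = {dagger_err:.2f}")
```

The only difference between the two policies is which states were labeled: the expert function is identical, but DAgger queries it on states the learner itself reached.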
Interactive · BC vs DAgger on a lane-tracking toy
A car drives left-to-right down a 1-D lane; the only state that matters is its vertical offset s from the lane center, and the action is a steering correction a. The expert is a clean proportional controller, a = −k·s (steer back toward center). We collect demonstrations along the expert's path, then learn a controller two ways and let each drive under a little process noise.
With 0 expert corrections, the learner is pure BC: it only ever saw near-center states, so its fitted controller is brittle off-center. Push the noise up and watch it drift, fail to correct, and compound its way off the road — the bug is the lesson. Each DAgger correction relabels one of the learner's own drifted states with the expert's action; add a few and the same learner suddenly stays glued to the center.
Why does BC fail here even though its fit looks fine? Least-squares on a narrow band of near-zero states is poorly conditioned — the data barely constrains the slope, so the recovered controller is too weak (or wrong-signed) far from center. DAgger's off-center samples widen the training spread, pin down the correct slope, and the learner recovers like the expert does.
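The conditioning argument can be checked numerically. This sketch assumes the recorded expert actions carry a small amount of label noise (an assumption added here so that the slope is not determined exactly) and compares how much the least-squares slope varies across repeated datasets drawn from a narrow band versus a wide band of states. `fitted_slope` and `slope_spread` are hypothetical helpers.

```python
import random
import statistics

K = 0.5             # assumed expert gain (illustrative)
LABEL_NOISE = 0.05  # assumed noise on recorded expert actions

def fitted_slope(band, rng, n=30):
    """Least-squares slope of a = w * s from n noisy demos with |s| <= band."""
    states = [rng.uniform(-band, band) for _ in range(n)]
    actions = [-K * s + rng.gauss(0.0, LABEL_NOISE) for s in states]
    return sum(s * a for s, a in zip(states, actions)) / sum(s * s for s in states)

def slope_spread(band, trials=300, seed=0):
    """Standard deviation of the fitted slope across repeated datasets."""
    rng = random.Random(seed)
    return statistics.stdev([fitted_slope(band, rng) for _ in range(trials)])

narrow = slope_spread(band=0.05)  # BC-like: only near-center states
wide = slope_spread(band=1.0)     # DAgger-like: off-center states included
print(f"slope std, narrow band: {narrow:.3f}")
print(f"slope std, wide band:   {wide:.4f}")
```

The slope's standard deviation scales like the label noise divided by the root of the summed squared states, so widening the band by a factor of 20 should shrink the spread by roughly the same factor.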
Inverse RL: recover the reward, not the moves
Imitation copies actions. But actions are brittle: they don't transfer to a new car, a new track, or slightly different dynamics. The expert's reward ("stay near the center, smoothly") is far more portable. Inverse RL inverts the usual problem:
Find a reward function R under which the expert's demonstrations look optimal; then run forward RL (any method from lessons 02–13) on that recovered R. Because you re-solve the MDP, the resulting policy generalizes: it can recover from states the expert never showed, because it is optimizing the same intent rather than parroting the same moves.
Maximum-entropy IRL: pick the least-committal explanation
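Recovering a reward from behavior is ill-posed: many reward functions make the same demonstrations look optimal (a constant reward, for instance, makes every policy optimal). MaxEnt IRL resolves this ambiguity with a clean principle: among all reward functions (and the trajectory distributions they induce) consistent with the expert's behavior, prefer the one that is maximally uncertain, assuming the least beyond what the data forces. Model the expert as choosing trajectories with probability exponential in their return under a reward Rψ with parameters ψ:
$$p_\psi(\tau) \propto \exp\big(R_\psi(\tau)\big), \qquad R_\psi(\tau) = \sum_t R_\psi(s_t, a_t)$$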
High-return trajectories are exponentially more likely, but every trajectory keeps nonzero probability, so the model gracefully tolerates a slightly suboptimal, noisy human. Fitting the reward parameters ψ by maximum likelihood gives a single least-committal reward (for rewards linear in trajectory features the likelihood is concave, so the optimum is well defined). (This exponential-of-return form is the same soft/entropy view that returns in lesson 15's SAC.)
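A tiny enumerable example makes the maximum-likelihood fit concrete. The sketch assumes a reward that is linear in trajectory features, R_ψ(τ) = ψ·f(τ), and a made-up set of four candidate trajectories with made-up expert counts. With those assumptions the gradient of the mean log-likelihood is the expert's average features minus the model's expected features. All names and numbers are hypothetical.

```python
import math

# Hypothetical toy: four candidate trajectories, each summarized by two
# binary features (stays near center, steers smoothly).
FEATURES = {
    "centered_smooth": (1.0, 1.0),
    "centered_jerky":  (1.0, 0.0),
    "drifting_smooth": (0.0, 1.0),
    "drifting_jerky":  (0.0, 0.0),
}
# Assumed expert demonstration counts (noisy, mostly good behavior).
EXPERT_COUNTS = {"centered_smooth": 6, "centered_jerky": 3, "drifting_smooth": 1}

def trajectory_probs(psi):
    """p_psi(tau) proportional to exp(R_psi(tau)), with R_psi(tau) = psi . f(tau)."""
    scores = {t: math.exp(sum(p * f for p, f in zip(psi, feats)))
              for t, feats in FEATURES.items()}
    z = sum(scores.values())
    return {t: v / z for t, v in scores.items()}

def expected_features(weights):
    """Feature expectation under a distribution (or counts) over trajectories."""
    total = sum(weights.values())
    return [sum(w * FEATURES[t][i] for t, w in weights.items()) / total
            for i in range(2)]

def fit_maxent(lr=0.5, steps=2000):
    """Gradient ascent on the mean log-likelihood of the expert trajectories."""
    psi = [0.0, 0.0]
    expert_mean = expected_features(EXPERT_COUNTS)
    for _ in range(steps):
        model_mean = expected_features(trajectory_probs(psi))
        # Gradient: expert feature mean minus model feature expectation.
        psi = [p + lr * (e - m) for p, e, m in zip(psi, expert_mean, model_mean)]
    return psi

psi = fit_maxent()
probs = trajectory_probs(psi)
print("psi =", [round(p, 3) for p in psi])
print({t: round(p, 3) for t, p in probs.items()})
```

At convergence the model's expected features match the expert's, and the never-demonstrated "drifting_jerky" trajectory still receives a small nonzero probability, which is the noise tolerance described above. Real MaxEnt IRL cannot enumerate trajectories like this, which is where the expensive inner loop comes from.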
GAIL: an IRL that is secretly a GAN
Classic MaxEnt IRL has an expensive inner loop: every reward update requires (approximately) solving the forward RL problem to compare the expert's trajectory distribution against the current policy's. GAIL (Generative Adversarial Imitation Learning) skips ever writing down an explicit reward and casts imitation as a two-player game — structurally identical to a GAN:
| GAN | GAIL |
|---|---|
| Generator G makes fake images | Policy πθ generates state–action pairs |
| Discriminator D tells real from fake | Discriminator Dw(s,a) tells expert from learner |
| G trained to fool D | π trained to fool D — via policy gradient |
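The discriminator Dw(s,a) ∈ (0,1) learns to score "how expert-like is this state–action?", and that score is the learned reward. The policy is then trained to maximize it with exactly the policy gradient from lessons 03 and 07, using a reward built from the discriminator (for example r(s,a) = log Dw(s,a)) in place of the environment's:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a)\big], \qquad Q(s,a) = \text{expected return under } r(s,a) = \log D_w(s,a)$$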
So GAIL is the discriminator as reward (the IRL half) feeding a policy-gradient learner (the forward-RL half) — the same fork, reunified once more. As the policy improves, the expert and learner distributions become indistinguishable, D → ½, and the game reaches equilibrium. We'll meet this same GANs↔RL bridge again in the computer-vision lesson (29), where it aligns image generators.
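One round of this game can be sketched on a 1-D toy: a discriminator step followed by a policy-gradient estimate. The assumptions are heavy and deliberate: a logistic-regression discriminator on a single hand-picked feature (a real Dw would be a neural network on (s, a)), the reward log Dw as one common choice, a Gaussian policy a ~ N(w·s, σ²), and each state-action pair treated as a one-step episode. All names and constants are hypothetical.

```python
import math
import random

rng = random.Random(0)
K, W_LEARNER, SIGMA = 0.5, 0.3, 0.2  # assumed expert gain, learner gain, policy noise

def features(s, a):
    """Tiny hand-picked discriminator features (a real D_w would be a network)."""
    return (s * a, 1.0)

def discriminator(theta, s, a):
    """D_w(s, a) in (0, 1): estimated probability that (s, a) came from the expert."""
    z = sum(t * f for t, f in zip(theta, features(s, a)))
    return 1.0 / (1.0 + math.exp(-z))

def sample_pairs(gain, noise, n):
    """State-action pairs from the linear policy a = gain * s + Gaussian noise."""
    states = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(s, gain * s + rng.gauss(0.0, noise)) for s in states]

expert_pairs = sample_pairs(-K, 0.0, 500)
learner_pairs = sample_pairs(W_LEARNER, SIGMA, 500)

# Discriminator step: logistic regression, expert labeled 1 and learner labeled 0.
theta = [0.0, 0.0]
labeled = [(p, 1.0) for p in expert_pairs] + [(p, 0.0) for p in learner_pairs]
for _ in range(300):
    grad = [0.0, 0.0]
    for (s, a), y in labeled:
        err = y - discriminator(theta, s, a)
        for i, f in enumerate(features(s, a)):
            grad[i] += err * f / len(labeled)
    theta = [t + 2.0 * g for t, g in zip(theta, grad)]

def reward(s, a):
    """Surrogate reward handed to the policy; log D is one common choice."""
    return math.log(discriminator(theta, s, a))

# Policy step: score-function (REINFORCE) gradient for the Gaussian policy
# a ~ N(w * s, SIGMA^2), using the mean reward as a baseline.
rewards = [reward(s, a) for s, a in learner_pairs]
baseline = sum(rewards) / len(rewards)
grad_w = sum((a - W_LEARNER * s) * s / SIGMA**2 * (r - baseline)
             for (s, a), r in zip(learner_pairs, rewards)) / len(rewards)
print(f"theta = {[round(t, 2) for t in theta]}, grad_w = {grad_w:.3f}")
```

In this setup the gradient estimate should come out negative, pushing the learner's gain from +0.3 toward the expert's negative gain. A full GAIL run would alternate these two steps many times, resampling learner pairs after every policy update.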
The map of this lesson
| Method | What it learns | Fixes / needs |
|---|---|---|
| Behavioral cloning | policy, by supervised MLE | simple; breaks on covariate shift (ε·T²) |
| DAgger | policy, on learner's own states | fixes shift; needs an online expert |
| MaxEnt IRL | reward function | generalizes; expensive inner forward-RL loop |
| GAIL | reward (= discriminator) + policy | GAN view; reuses policy gradient (L03/07) |