
Imitation learning & inverse RL

Every method so far — value-based, policy-based, RLHF — assumed a reward signal exists: a scalar the environment hands back, or a model that produces one. But what if all you have is a stack of expert demonstrations and no reward function at all? This lesson covers the two answers: copy the actions (imitation), or recover the reward behind them (inverse RL).

The new setup: demonstrations, no reward

Drop the reward R(s,a) from the MDP. What remains is a set of expert trajectories

D = { τ₁, τ₂, … },   τ = (s₀, a₀, s₁, a₁, … )

generated by some expert policy πE we cannot query for the reward it was chasing. This is the common case in the real world: a human drives a car well, a surgeon moves an instrument well, a pilot lands a plane well — but none of them can write down the R(s,a) they are implicitly optimizing. We only have a record of what they did.

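There are two families of answers, and they map onto the fork from orientation: copy the expert's actions directly (imitation), or recover the reward behind them and then optimize it (inverse RL).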

Behavioral cloning = supervised learning of the policy

The simplest idea: treat each (s, a) pair from the demonstrations as a labeled example and fit a policy by supervised learning. The state is the input, the expert's action is the label:

min_θ  𝔼_{(s,a)∼D} [ −log π_θ(a | s) ]

That is it — behavioral cloning (BC) is plain maximum-likelihood classification (discrete actions) or regression (continuous actions). No environment, no rollouts, no reward. It is fast, stable, and often the first thing anyone tries. It is also the thing that quietly breaks the moment you deploy it.
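For a small discrete problem the objective above can be solved in closed form. The sketch below assumes states and actions are plain strings and the policy is a lookup table, in which case the minimizer of the negative log-likelihood is simply the empirical action frequency in each state. The names `flatten`, `fitTabularBC`, `nll` and the toy demonstrations are illustrative and not part of the lesson's code.

```javascript
// Behavioral cloning for a tabular policy (hypothetical toy data).
// Minimizing E[-log pi(a|s)] over a lookup table gives
// pi(a|s) = count(s, a) / count(s): the empirical action frequency.

// Flatten trajectories [[s0,a0],[s1,a1],...] into one list of [s, a] pairs.
function flatten(trajectories) {
  return trajectories.flat();
}

function fitTabularBC(pairs) {
  const counts = new Map();               // state -> Map(action -> count)
  for (const [s, a] of pairs) {
    if (!counts.has(s)) counts.set(s, new Map());
    const row = counts.get(s);
    row.set(a, (row.get(a) || 0) + 1);
  }
  const policy = new Map();               // state -> Map(action -> probability)
  for (const [s, row] of counts) {
    let total = 0;
    for (const c of row.values()) total += c;
    const probs = new Map();
    for (const [a, c] of row) probs.set(a, c / total);
    policy.set(s, probs);
  }
  return policy;
}

// Average negative log-likelihood of the demonstrations under a policy.
function nll(policy, pairs) {
  let sum = 0;
  for (const [s, a] of pairs) sum -= Math.log(policy.get(s).get(a));
  return sum / pairs.length;
}

// Two short demonstrations; each step is a [state, action] pair.
const demos = [
  [["center", "straight"], ["left", "steer_right"], ["center", "straight"]],
  [["center", "straight"], ["right", "steer_left"], ["center", "steer_left"]],
];
const pairs = flatten(demos);
const bcPolicy = fitTabularBC(pairs);
console.log(bcPolicy.get("center").get("straight")); // 0.75
```

Note that `bcPolicy.get(s)` is `undefined` for any state that never appears in the demonstrations: the table simply has nothing to say there.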

The hidden assumption
Supervised learning assumes the training and test inputs are drawn from the same distribution (i.i.d.). But a policy generates its own inputs: the states you see at test time are the states your actions take you to — not the states the expert visited. The moment your action differs from the expert's, you have violated the one assumption that made the fit valid.

The failure: compounding error / covariate shift

BC is trained only on states the expert visited — the thin ribbon of "good" states down the middle of the road. Suppose your cloned policy is almost perfect, making a small error with probability ε at each step. One small error nudges you slightly off the expert's path, into a state the expert never demonstrated. There, your policy has no training signal — it is extrapolating — so it errs again, pushing you further out, into an even stranger state. Errors don't average out; they compound.

The classic analysis: a supervised learner with per-step error ε on the expert's distribution, run for a horizon T, accumulates a total cost that in the worst case grows like

cost(BC) ∝ ε · T²   (vs.  ε · T  if states were i.i.d.)

That extra factor of T is the whole story. The mismatch between the state distribution the policy was trained on (πE's) and the one it induces at test time (πθ's) is called covariate shift. It is the central pathology of imitation — and, foreshadowing lesson 16, the same distribution-shift demon that haunts offline RL.

expert ribbon:   ····●●●●●●●●●●●●●●····      stays in demonstrated states
BC, one slip:    ····●●●●●○                   slip → unseen state → no signal →
                          ╲○                  bigger slip → stranger state → …
                            ╲○___ off the road (compounding error)
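To see where the extra factor of T can come from, here is a deliberately pessimistic toy model. It is an assumption made for illustration, not the formal analysis: the learner slips with probability ε per step while it is still on the ribbon, and once it has slipped it never recovers and pays cost 1 on every remaining step. The i.i.d. baseline pays ε per step, independently.

```javascript
// Toy worst-case model of compounding error (illustrative assumption):
// - on the expert ribbon, the learner slips with probability eps per step;
// - after the first slip it never recovers and pays cost 1 on every later step.
function expectedCostCompounding(eps, T) {
  let cost = 0;
  let stillOnRibbon = 1;                  // P(no slip so far)
  for (let t = 1; t <= T; t++) {
    stillOnRibbon *= 1 - eps;             // survive step t without slipping
    cost += 1 - stillOnRibbon;            // P(already off the ribbon at step t)
  }
  return cost;
}

// Baseline: every step is an independent draw from the training
// distribution, so each one costs eps in expectation.
function expectedCostIid(eps, T) {
  return eps * T;
}

for (const T of [10, 100, 1000]) {
  const eps = 1e-4;
  const ratio = expectedCostCompounding(eps, T) / expectedCostIid(eps, T);
  console.log(T, ratio.toFixed(1));       // ratio grows roughly like T / 2
}
```

While ε·T is small, the compounding cost is close to ε·T(T+1)/2, so doubling the horizon roughly quadruples it.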

DAgger: ask the expert about your states

The diagnosis points straight at the cure. The problem is that BC only ever sees the expert's states; the fix is to get labels on the states the learner actually visits. DAgger (Dataset Aggregation) does exactly this, in a loop:

D ← expert demonstrations
repeat:
    train π_θ on D                       # behavioral cloning on current data
    roll out π_θ in the environment      # visit the LEARNER's own states s
    for each visited state s:
        a* ← π^E(s)                      # ASK THE EXPERT what it would do here
        D ← D ∪ {(s, a*)}                # aggregate the correction

The key line is "ask the expert what it would do here": the expert labels the learner's mistakes, in the off-distribution states the learner drifts into. After a few iterations the training set covers not just the ideal ribbon but also the recovery states ("you're drifting right, steer left"), so the policy learns to come back. DAgger provably converts the ε·T² blow-up back into a benign ε·T. The price: you need an expert you can query online on arbitrary states, which is exactly what you don't have if your expert is a fixed log or a busy human.
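The loop can be made runnable on a very small toy. Everything in the sketch below is an assumption chosen for illustration: the state is an integer offset from center, wind pushes the car by −1, 0 or +1 each step, the learner is a lookup table that does nothing (action 0) in states it has never seen, and the demonstrations were recorded in calm conditions so the expert never left the center. `makeRand` is a small seeded generator so runs are repeatable.

```javascript
// DAgger on an integer lane (toy environment, all details are assumptions).
// State s = integer offset from center; |s| > LIMIT means off the road.
const LIMIT = 3;
const expertAction = s => (s > 0 ? -1 : s < 0 ? 1 : 0);  // step back toward center

// Small seeded LCG so runs are repeatable.
function makeRand(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Wind pushes the car by -1, +1 or 0 each step.
function wind(rand) {
  const u = rand();
  return u < 0.3 ? -1 : u < 0.6 ? 1 : 0;
}

// Tabular learner: memorize the labels; unseen states get action 0.
const train = D => new Map(D);
const act = (policy, s) => (policy.has(s) ? policy.get(s) : 0);

function rollout(policy, rand, T) {
  const visited = [];
  let s = 0;
  for (let t = 0; t < T; t++) {
    visited.push(s);
    s = s + act(policy, s) + wind(rand);
    if (Math.abs(s) > LIMIT) return { visited, offRoad: true };
  }
  return { visited, offRoad: false };
}

function dagger(D0, iterations, rand, T) {
  const D = [...D0];
  let policy = train(D);                                   // plain BC to start
  for (let i = 0; i < iterations; i++) {
    const { visited } = rollout(policy, rand, T);          // learner's own states
    for (const s of visited) D.push([s, expertAction(s)]); // ask the expert
    policy = train(D);                                     // retrain on the aggregate
  }
  return policy;
}

// Demonstrations recorded in calm conditions: the expert never leaves s = 0.
const calmDemos = [[0, 0]];
const bc = train(calmDemos);
const dag = dagger(calmDemos, 5, makeRand(1), 500);

console.log("BC off-road:    ", rollout(bc, makeRand(2), 500).offRoad);
console.log("DAgger off-road:", rollout(dag, makeRand(2), 500).offRoad);
```

In this toy the pure-BC table knows only `s = 0`, so under wind it performs a random walk. Once DAgger has collected expert labels for −1, 0 and +1, the learner matches the expert everywhere it can reach and stays within one step of center.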

Interactive · BC vs DAgger on a lane-tracking toy

A car drives left-to-right down a 1-D lane; the only state that matters is its vertical offset s from the lane center, and the action is a steering correction a. The expert is a clean proportional controller, a = −k·s (steer back toward center). We collect demonstrations along the expert's path, then learn a controller two ways and let each drive under a little process noise.

With 0 expert corrections, the learner is pure BC: it only ever saw near-center states, so its fitted controller is brittle off-center. Push the noise up and watch it drift, fail to correct, and compound its way off the road — the bug is the lesson. Each DAgger correction relabels one of the learner's own drifted states with the expert's action; add a few and the same learner suddenly stays glued to the center.

[Interactive demo: Lane tracking, behavioral cloning vs DAgger. Blue = expert. Orange = BC/DAgger learner. With 0 corrections the learner is pure BC and drifts off-distribution; each correction teaches it one recovery state. Turn noise up to stress it. Readouts: mode, final |offset|, off-road?, training spread.]

Core JS of the demo (≈24 lines):
// Excerpt: K, T, expertStates, drift, fitLinear and randn are defined elsewhere in the demo.
// Expert: proportional controller, steer back to center.
const expert = s => -K * s;

// Build the training set. BC sees only near-center (expert) states.
// Each DAgger correction adds an OFF-CENTER state, labeled by the expert.
function trainingSet(nCorr){
  const X = [], Y = [];
  for (const s of expertStates) { X.push(s); Y.push(expert(s)); }     // expert ribbon
  for (let i=0; i<nCorr; i++){
    const s = drift(i);                  // a state the LEARNER drifts into
    X.push(s); Y.push(expert(s));        // ASK THE EXPERT here
  }
  return fitLinear(X, Y);                // least-squares slope a = w*s
}

// Drive under noise using the learned slope w.
function drive(w, sigma){
  let s = 0.05;                          // tiny initial offset
  for (let t=0; t<T; t++){
    const a = w * s;                     // learner's steering
    s = s + a + sigma*randn();           // dynamics + process noise
    if (Math.abs(s) > 1) return 'off-road';  // compounding error
  }
  return s;
}

Why does BC fail here even though its fit looks fine? Least-squares on a narrow band of near-zero states is poorly conditioned: the data barely constrains the slope, so any noise or misfit in the labels is amplified and the recovered controller can be too weak (or wrong-signed) far from center. DAgger's off-center samples widen the training spread, pin down the correct slope, and the learner recovers like the expert does.
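One way to make the conditioning argument concrete, assuming a fit through the origin and labels that carry a little independent noise. The demo's actual `fitLinear` and noise model are not shown, so `fitSlope`, `slopeStdError`, the state lists and `sigma` below are all stand-ins. For this model the standard error of the fitted slope is σ / √(Σ s²), so it depends directly on how far the training states spread from zero.

```javascript
// Least-squares slope through the origin: w = sum(s*a) / sum(s*s).
// Hypothetical stand-in for the demo's fitLinear, which is not shown.
function fitSlope(states, actions) {
  let sxy = 0, sxx = 0;
  for (let i = 0; i < states.length; i++) {
    sxy += states[i] * actions[i];
    sxx += states[i] * states[i];
  }
  return sxy / sxx;
}

// If each label carries independent noise of standard deviation sigma,
// the fitted slope has standard error sigma / sqrt(sum(s*s)).
function slopeStdError(states, sigma) {
  const sxx = states.reduce((acc, s) => acc + s * s, 0);
  return sigma / Math.sqrt(sxx);
}

const K = 0.5;                                        // assumed expert gain
const expert = s => -K * s;
const narrow = [-0.05, -0.02, 0.01, 0.03, 0.05];      // expert ribbon only
const wide = [...narrow, -0.8, -0.4, 0.4, 0.8];       // plus drifted states

// With noise-free labels both fits recover -K (up to rounding) ...
console.log(fitSlope(narrow, narrow.map(expert)), fitSlope(wide, wide.map(expert)));

// ... but the same label noise moves the narrow fit far more.
const sigma = 0.02;                                   // assumed label noise
console.log(slopeStdError(narrow, sigma), slopeStdError(wide, sigma));
```

With these numbers the narrow fit's slope uncertainty is about half the size of K itself, while the widened set cuts it by more than a factor of ten.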

Inverse RL: recover the reward, not the moves

Imitation copies actions. But actions are brittle: they don't transfer to a new car, a new track, a slightly different dynamics. The expert's reward — "stay near the center, smoothly" — is far more portable. Inverse RL inverts the usual problem:

forward RL:  R  →  π*        inverse RL:  πE (via D)  →  R

Find a reward function R under which the expert's demonstrations look optimal; then run forward RL (any method from lessons 02–13) on that recovered R. Because you re-solve the MDP, the resulting policy generalizes: it can recover from states the expert never showed, because it is optimizing the same intent rather than parroting the same moves.

The ambiguity that makes IRL hard
IRL is badly ill-posed: infinitely many reward functions explain the same demonstrations (the constant reward R ≡ 0 makes every policy optimal, so it "explains" anything). We need a principle to pick one.
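The ambiguity is easy to reproduce on a tiny MDP. The sketch below assumes a 5-state chain with deterministic left/right moves (a made-up example) and solves it by value iteration under four rewards: a base reward, a positive rescaling of it, a potential-shaped version R + γΦ(s′) − Φ(s) with an arbitrary Φ, and R ≡ 0. The first three yield the same greedy policy, and the zero reward makes every action tie.

```javascript
// Reward ambiguity on a 5-state chain (toy MDP, an illustrative assumption).
// Actions: 0 = left, 1 = right; moves are deterministic and clipped at the ends.
const N = 5, GAMMA = 0.9;
const step = (s, a) => Math.min(N - 1, Math.max(0, s + (a === 1 ? 1 : -1)));

// Value iteration; returns the Q-table for a reward function R(s, a, s2).
function solveQ(R, sweeps = 500) {
  let V = new Array(N).fill(0);
  let Q = [];
  for (let k = 0; k < sweeps; k++) {
    Q = [];
    for (let s = 0; s < N; s++) {
      Q.push([0, 1].map(a => {
        const s2 = step(s, a);
        return R(s, a, s2) + GAMMA * V[s2];
      }));
    }
    V = Q.map(q => Math.max(...q));
  }
  return Q;
}

// Greedy action per state (ties resolve to 0 = left).
const greedy = Q => Q.map(q => (q[1] > q[0] ? 1 : 0));

const base = (s, a, s2) => (s2 === N - 1 ? 1 : 0);       // reach the right end
const scaled = (s, a, s2) => 2 * base(s, a, s2);          // positive rescaling
const PHI = [3, -1, 4, 1, -5];                            // arbitrary potential
const shaped = (s, a, s2) => base(s, a, s2) + GAMMA * PHI[s2] - PHI[s];
const zero = () => 0;                                     // "explains" anything

console.log(greedy(solveQ(base)));    // [1, 1, 1, 1, 1]: always go right
console.log(greedy(solveQ(scaled)));  // same policy
console.log(greedy(solveQ(shaped)));  // same policy
console.log(solveQ(zero));            // all zeros: every action ties everywhere
```

Demonstrations of "always go right" therefore cannot distinguish between the first three rewards, which is why IRL needs an extra selection principle.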

Maximum-entropy IRL: pick the least-committal explanation

MaxEnt IRL resolves the ambiguity with a clean principle: among all reward functions (and the trajectory distributions they induce) consistent with the expert's behavior, prefer the one that is maximally uncertain — assumes the least beyond what the data forces. Model the expert as choosing trajectories with probability exponential in their return:

p(τ) ∝ exp( R_ψ(τ) ),   R_ψ(τ) = ∑_t R_ψ(s_t, a_t)

High-return trajectories are exponentially more likely, but every trajectory keeps nonzero probability, so the model gracefully tolerates a slightly suboptimal, noisy human. Fitting ψ by maximum likelihood gives a well-defined, least-committal reward: for rewards that are linear in features the log-likelihood is concave, so the fit has no spurious local optima. (This exponential-of-return form is the same soft/entropy view that returns in lesson 15's SAC.)
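A minimal sketch of the maximum-likelihood fit, under strong simplifying assumptions: the trajectory set is tiny and fully enumerated (four trajectories), and the reward is linear in a two-dimensional feature vector, R_ψ(τ) = ψ · f(τ). The features and expert counts are made up. In this setting the gradient of the average log-likelihood is the expert's mean feature vector minus the model's, so plain gradient ascent drives the two to match.

```javascript
// MaxEnt IRL over a tiny, fully enumerated trajectory set (toy assumption).
// Linear reward: R_psi(tau) = psi . f(tau), with p(tau) proportional to exp(R_psi(tau)).
const features = [[1, 0], [0, 1], [0, 0], [1, 1]];   // f(tau) for 4 trajectories
const expertCounts = [6, 1, 1, 2];                    // how often the expert chose each

const dot = (u, v) => u.reduce((acc, ui, i) => acc + ui * v[i], 0);

// p(tau) = exp(R(tau)) / Z
function trajectoryProbs(psi) {
  const w = features.map(f => Math.exp(dot(psi, f)));
  const Z = w.reduce((a, b) => a + b, 0);
  return w.map(x => x / Z);
}

// Expected feature vector under a distribution over the trajectories.
function featureMean(probs) {
  const mean = [0, 0];
  probs.forEach((p, i) => features[i].forEach((f, j) => { mean[j] += p * f; }));
  return mean;
}

// Gradient ascent on the average log-likelihood; the gradient is
// (expert feature mean) - (model feature mean).
function fitMaxEnt(steps = 2000, lr = 1.0) {
  const total = expertCounts.reduce((a, b) => a + b, 0);
  const expertMean = featureMean(expertCounts.map(c => c / total));
  let psi = [0, 0];
  for (let k = 0; k < steps; k++) {
    const modelMean = featureMean(trajectoryProbs(psi));
    psi = psi.map((w, j) => w + lr * (expertMean[j] - modelMean[j]));
  }
  return psi;
}

const psi = fitMaxEnt();
console.log(psi);                                  // close to [ln 4, ln(3/7)]
console.log(featureMean(trajectoryProbs(psi)));    // close to the expert mean [0.8, 0.3]
```

The fitted model reproduces the expert's feature averages but not the exact empirical frequencies of the four trajectories. That gap is the "least-committal" part: it commits only to what the features force.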

GAIL: an IRL that is secretly a GAN

Classic MaxEnt IRL has an expensive inner loop: every reward update requires (approximately) solving the forward RL problem to compare the expert's trajectory distribution against the current policy's. GAIL (Generative Adversarial Imitation Learning) skips ever writing down an explicit reward and casts imitation as a two-player game — structurally identical to a GAN:

GAN                                    GAIL
Generator G makes fake images          Policy πθ generates state–action pairs
Discriminator D tells real from fake   Discriminator Dw(s,a) tells expert from learner
G trained to fool D                    π trained to fool D, via policy gradient

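The discriminator Dw(s,a) ∈ (0,1) learns to score "how expert-like is this state–action?", and a monotone transform of that score is the learned reward. The policy is then trained to maximize it with exactly the policy gradient from lessons 03 and 07, with the advantage A(s,a) estimated from the GAIL reward in place of an environment reward: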

r_GAIL(s,a) = −log( 1 − D_w(s,a) ),   ∇_θ J = 𝔼_{π_θ} [ ∇_θ log π_θ(a|s) · A(s,a) ]

So GAIL is the discriminator as reward (the IRL half) feeding a policy-gradient learner (the forward-RL half) — the same fork, reunified once more. As the policy improves, the expert and learner distributions become indistinguishable, D → ½, and the game reaches equilibrium. We'll meet this same GANs↔RL bridge again in the computer-vision lesson (29), where it aligns image generators.
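The discriminator half can be sketched in a few lines. This is a toy under stated assumptions, not a full GAIL loop: each state–action pair is summarized by a single made-up scalar feature `x`, the discriminator is logistic regression D(x) = σ(w·x + b), and the policy-gradient half is omitted. It shows the reward −log(1 − D) favoring expert-like pairs, and D settling at ½ once learner and expert data coincide.

```javascript
// The discriminator half of GAIL on a 1-D toy (all data here is made up).
// x stands for a scalar feature of a state-action pair (an assumption);
// D(x) = sigmoid(w*x + b) estimates P(pair came from the expert).
const sigmoid = z => 1 / (1 + Math.exp(-z));

function trainDiscriminator(expertX, learnerX, steps = 2000, lr = 1.0) {
  let w = 0, b = 0;
  const n = expertX.length + learnerX.length;
  for (let k = 0; k < steps; k++) {
    let gw = 0, gb = 0;
    // Ascend the log-likelihood: label 1 for expert, 0 for learner.
    for (const x of expertX) {
      const d = sigmoid(w * x + b);
      gw += (1 - d) * x;
      gb += 1 - d;
    }
    for (const x of learnerX) {
      const d = sigmoid(w * x + b);
      gw -= d * x;
      gb -= d;
    }
    w += lr * gw / n;
    b += lr * gb / n;
  }
  return x => sigmoid(w * x + b);
}

// r_GAIL = -log(1 - D): large where the discriminator says "expert".
const gailReward = D => x => -Math.log(1 - D(x));

const expertX = [-1.2, -1.0, -0.8, -1.0];
const learnerX = [0.2, 0.0, -0.2, 0.0];

const D = trainDiscriminator(expertX, learnerX);
const r = gailReward(D);
console.log(r(-1.0) > r(0.0));           // expert-like pairs earn more reward

// Once the learner matches the expert, D cannot tell them apart.
const Deq = trainDiscriminator(expertX, expertX);
console.log(Deq(-1.0).toFixed(3));       // 0.500
```

In a full implementation the learner samples would be regenerated from the current policy after every policy update, and the two updates would alternate.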

RLHF's reward model is IRL-flavored
Look back at lesson 13. The Bradley–Terry reward model trained on human preference pairs is doing exactly the IRL move: it does not copy human answers (that would be SFT / behavioral cloning); it recovers a reward function that explains which answers humans prefer, then forward RL (PPO) optimizes it. RLHF = "IRL from comparisons" + forward RL. SFT, by contrast, is behavioral cloning — which is why SFT alone suffers its own mild covariate shift, and why the RL stage helps.
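As a reminder of what "a reward that explains preferences" means, here is a small sketch of a Bradley–Terry fit. It assumes one free scalar reward per answer and a handful of made-up comparisons; a real reward model would be a network over text, trained on the same log-likelihood.

```javascript
// Bradley-Terry preference model: P(A preferred over B) = sigmoid(r(A) - r(B)).
// Fitting scalar rewards to comparisons is "IRL from preferences".
// The items and comparison data below are made up for illustration.
const sigmoid = z => 1 / (1 + Math.exp(-z));

function fitRewards(numItems, comparisons, steps = 2000, lr = 0.5) {
  const r = new Array(numItems).fill(0);
  for (let k = 0; k < steps; k++) {
    const grad = new Array(numItems).fill(0);
    for (const [winner, loser] of comparisons) {
      const p = sigmoid(r[winner] - r[loser]);   // model's P(winner beats loser)
      grad[winner] += 1 - p;                     // gradient of log p w.r.t. r[winner]
      grad[loser] -= 1 - p;                      // and w.r.t. r[loser]
    }
    for (let i = 0; i < numItems; i++) r[i] += lr * grad[i] / comparisons.length;
  }
  return r;
}

// Each pair is [preferred, rejected]: answer 0 usually beats 1, 1 usually beats 2.
const comparisons = [[0, 1], [0, 1], [1, 0], [1, 2], [1, 2], [2, 1], [0, 2]];
const rewards = fitRewards(3, comparisons);
console.log(rewards);   // recovered ordering: rewards[0] > rewards[1] > rewards[2]
```

Only reward differences enter the model, so the fitted values are determined up to an additive constant, which is a small instance of the reward ambiguity discussed above.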

The map of this lesson

Method              What it learns                     Fixes / needs
Behavioral cloning  policy, by supervised MLE          simple; breaks on covariate shift (ε·T²)
DAgger              policy, on learner's own states    fixes shift; needs an online expert
MaxEnt IRL          reward function                    generalizes; expensive inner forward-RL loop
GAIL                reward (= discriminator) + policy  GAN view; reuses policy gradient (lessons 03/07)

Takeaway
When no reward signal exists, you have two moves. Imitate: behavioral cloning is supervised learning of π, but it compounds error under covariate shift — one slip lands you in states the expert never showed — and DAgger fixes that by labeling the learner's own states. Or invert: recover the reward the expert optimizes (MaxEnt IRL resolves the which-reward ambiguity; GAIL realizes it as a GAN whose discriminator is the reward and whose generator is a policy-gradient learner), then run forward RL — which generalizes. RLHF's reward model is this same IRL move, learned from preferences instead of demonstrations.