
Robot control (part 2): sim-to-real & sample efficiency

Lesson 56 trained a continuous-control policy in a simulator and ended on a cliffhanger: it works in sim, then fails on the real arm. This lesson is about closing that gap — making a simulator-trained policy survive contact with reality, and squeezing the most out of the few real samples you can afford.

Which core lessons this reuses
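This is an applications lesson; it invents almost no new algorithm. It stands on three earlier ones: continuous control with DDPG / TD3 / SAC (lesson 18), model-based RL and planning (lesson 07), and offline RL (lesson 19).

The job of this lesson is to wire those three together against one enemy: real data is scarce and the simulator lies.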

What broke: the sim-to-real gap

A simulator is a fast, free, safe MDP. You can run a million SAC episodes overnight and the policy converges to near-perfect performance — in the simulator. Then you load those weights onto the physical robot and it flails.

The reason is precise. RL maximizes return under the transition dynamics it was trained on:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{s' \sim P_{\text{sim}}(\cdot \mid s,a),\; a \sim \pi_{\theta}} \left[\, G \,\right]$$

But the real robot runs under P_real, not P_sim. Real friction is higher than your guess; the real motors have latency the sim ignored; the real link mass is 12% off the CAD model; the camera is a few milliseconds behind. The policy overfit to the exact physics of one simulator, and a policy that is optimal for P_sim can be terrible for P_real:

$$J_{\text{real}}(\theta^{*}) \;\ll\; J_{\text{sim}}(\theta^{*})$$

The trap
This looks exactly like ordinary overfitting, but the "test set" is a different physics, not just held-out data. You cannot fix it by collecting more sim data — more sim data just makes you more confidently optimal for the wrong dynamics. The fix has to change what you train on, not how much.
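
The gap can be reproduced in a few lines. The sketch below is a made-up toy, not the lesson's robot: a damped point mass is pushed toward x = 1 by a constant force, the force is tuned by grid search under the simulator's damping, and the same force is then evaluated under a higher "real" damping. The function names, damping values, horizon, and grid are all assumptions chosen for illustration.

```javascript
// Toy 1-D reach: push a damped point mass toward x = 1 with a constant force.
// All numbers here (damping values, horizon, grid) are illustrative assumptions.
function rollout(force, damping, steps = 100, dt = 0.05) {
  let x = 0;
  let v = 0;
  for (let k = 0; k < steps; k++) {
    v += dt * (force - damping * v); // viscous friction opposes motion
    x += dt * v;
  }
  return -Math.abs(x - 1); // return G: negative final distance to the target
}

// "Training": grid-search the single policy parameter under the simulator's damping only.
function trainInSim(simDamping) {
  let best = { force: 0, ret: -Infinity };
  for (let i = 0; i <= 200; i++) {
    const force = i / 100;
    const ret = rollout(force, simDamping);
    if (ret > best.ret) best = { force, ret };
  }
  return best;
}

const SIM_DAMPING = 1.0;  // what the simulator assumes
const REAL_DAMPING = 1.5; // hypothetical "real" value, higher than the guess

const trained = trainInSim(SIM_DAMPING);
const jSim = trained.ret;                            // return under the training physics
const jReal = rollout(trained.force, REAL_DAMPING);  // same policy, shifted physics
console.log({ force: trained.force, jSim, jReal });
```

Running the grid search for longer or on a finer grid only sharpens the force for the simulator's damping; it does nothing for the error under the other damping value, which is the point of the trap.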

Fix #1 — domain randomization: treat reality as just another variation

If a policy overfits to one setting of the physics, train it on many settings. Domain randomization samples the simulator's physical parameters at the start of every episode: link masses, friction coefficients, motor gains, sensor latency, even gravity and visual textures. Let ξ be that bundle of parameters, drawn from a distribution p(ξ) you choose to be wide enough to bracket reality:

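$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{\xi \sim p(\xi)} \; \mathbb{E}_{s' \sim P_{\xi}(\cdot \mid s,a),\; a \sim \pi_{\theta}} \left[\, G \,\right]$$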

Now the policy is no longer optimal for one physics; it is optimal on average across a whole family of physics. The bet (and it is a remarkably good bet) is that if the real robot's parameters lie somewhere inside the support of p(ξ), so that P_real is one of the P_ξ, then to the trained policy reality is just one more sample of ξ. It has already seen robots that were heavier, stickier, and laggier than the real one, so the real one is unremarkable.
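
The outer expectation over ξ is usually estimated by drawing a fresh ξ at the start of each episode. A minimal sketch of that sampling step follows, assuming uniform ranges; the parameter names, the ranges, and the `runEpisode(policy, xi)` callback are hypothetical stand-ins for whatever the simulator actually exposes.

```javascript
// Small seeded PRNG (mulberry32) so episodes are reproducible; any RNG would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical randomization ranges [lo, hi]; chosen wide enough to bracket reality.
const RANGES = {
  linkMassScale: [0.8, 1.2], // brackets a 12% CAD-model error
  friction: [0.5, 1.5],
  motorGain: [0.9, 1.1],
  sensorLatencyMs: [0, 20],
};

// Draw one bundle xi ~ p(xi), here an independent uniform per parameter.
function sampleXi(rng, ranges = RANGES) {
  const xi = {};
  for (const [name, [lo, hi]] of Object.entries(ranges)) {
    xi[name] = lo + (hi - lo) * rng();
  }
  return xi;
}

// Monte Carlo estimate of the randomized objective: average return over sampled physics.
function randomizedReturn(policy, runEpisode, rng, episodes = 100) {
  let total = 0;
  for (let e = 0; e < episodes; e++) {
    const xi = sampleXi(rng); // new physics at the start of every episode
    total += runEpisode(policy, xi);
  }
  return total / episodes;
}

// Usage with a stand-in episode function (a real one would step the simulator).
const toyEpisode = (policy, xi) => -Math.abs(policy.gain - xi.friction);
console.log(randomizedReturn({ gain: 1.0 }, toyEpisode, mulberry32(0)));
```

Uniform ranges are only one choice for p(ξ); what matters for the argument above is that the real parameters fall inside whatever ranges are used.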

The cost is honest: averaging over many physics means the policy is no longer the razor-sharp optimum for any single one. You give up a little peak sim performance to buy robustness. That trade — slightly lower sim success, dramatically higher transfer — is exactly what the widget below lets you feel.

Why this is the same idea as a baseline / regularizer
Training over 𝔼_ξ is a regularizer in disguise: it forbids the policy from exploiting any feature of the dynamics that is not stable across the family. A solution that needs friction to be exactly 0.5 to work is punished by every episode where friction isn't. What survives is a policy that uses only the robust structure of the task.

Fix #2 — sample efficiency: real data is the expensive resource

Domain randomization shrinks the gap but does not eliminate it, and eventually you must touch the real robot. Here the economics flip. In sim, samples are free; on hardware, every episode costs minutes of wall-clock, supervision, wear on the joints, and the risk of a crash. A model-free SAC run that needs ten million transitions is a non-starter when every transition costs one real control step: even at an assumed 50 Hz control rate, that is more than 55 hours of nonstop motion before counting resets. So robotics leans hard on the two sample-saving ideas we already built.

Model-based RL — plan in your head (lesson 07)

Lesson 07 made the point: if you know the model you can plan instead of sampling. Here we don't know P_real, so we learn it. Fit a dynamics model P̂_φ(s'|s,a) (and a reward model) to the logged transitions, then generate imagined rollouts inside P̂_φ and train the policy on those, or plan a short horizon at decision time (MPC), the model-predictive cousin of the MCTS lookahead from lesson 07:

collect a few REAL transitions  (s, a, r, s')      ← expensive, do sparingly
        │
        ▼
fit dynamics model  P̂_φ(s'|s,a)                    ← supervised learning
        │
        ▼
imagine many rollouts inside P̂  ─────────────►  improve π_θ on imagined data  (free)
        │                                                    │
        └──────────── act on real robot ◄───────────────────┘

One real transition now trains the policy many times over, because the learned model manufactures synthetic experience. This is the planning/model-based half of the spine doing sample-efficiency duty. The catch is model bias: imagined rollouts that wander where P̂_φ was never fit are fiction, which is the same out-of-distribution disease we name next.
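
The collect / fit / imagine loop can be sketched on a scalar system. Everything below is an illustrative assumption: the "real" dynamics `realStep`, the linear model form s' ≈ a·s + b·u, the probe actions, and the grid planner that holds one action over a short horizon. It shows the shape of the idea, not a recommended planner.

```javascript
// Hypothetical "real" system, used only to generate a handful of logged transitions.
const realStep = (s, u) => 0.9 * s + 0.5 * u;

// Step 1: collect a few real transitions (s, u, s'). This is the expensive part.
const log = [];
let s = 1.0;
for (const u of [0.5, -1.0, 0.2, 0.8, -0.3]) {
  const next = realStep(s, u);
  log.push({ s, u, next });
  s = next;
}

// Step 2: fit s' ~ a*s + b*u by least squares (2x2 normal equations).
function fitLinearModel(data) {
  let ss = 0, su = 0, uu = 0, sy = 0, uy = 0;
  for (const { s, u, next } of data) {
    ss += s * s;
    su += s * u;
    uu += u * u;
    sy += s * next;
    uy += u * next;
  }
  const det = ss * uu - su * su;
  if (Math.abs(det) < 1e-12) throw new Error("data does not identify the model");
  return { a: (sy * uu - uy * su) / det, b: (uy * ss - sy * su) / det };
}

// Step 3: plan with imagined rollouts inside the fitted model (no robot involved).
function planAction(model, s0, target, horizon = 5) {
  let best = { u: 0, cost: Infinity };
  for (let i = -20; i <= 20; i++) {
    const u = i / 10; // candidate action, held constant over the horizon
    let sim = s0;
    let cost = 0;
    for (let t = 0; t < horizon; t++) {
      sim = model.a * sim + model.b * u; // imagined step
      cost += (sim - target) ** 2;
    }
    if (cost < best.cost) best = { u, cost };
  }
  return best.u;
}

const model = fitLinearModel(log);
const action = planAction(model, s, 0.0); // only this action is sent to the robot
console.log(model, action);
```

Five real transitions feed 41 × 5 imagined steps per decision here. The model is exact only because the toy system happens to be linear and noise-free; with a mismatched model class, rollouts far from the logged states would show the model bias described above.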

Offline RL — reuse the logs you already have (lesson 19)

Every real run leaves a trail of (s,a,r,s') tuples. Offline RL (lesson 19) learns a better policy from that fixed dataset with no new interaction — which on a robot means no new risk. The catch is the one lesson 19 diagnosed: the Bellman backup queries Q at out-of-distribution actions the logs never contain, and the error compounds. So robotics uses the conservative variants (BCQ / CQL, lessons 20–21) that keep the policy on the data manifold. Domain-randomized sim data and real logs can even be pooled into one offline dataset.
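
To make the out-of-distribution point concrete, here is a tabular sketch in the spirit of the conservative methods, not an implementation of BCQ or CQL: Q-learning over a fixed log where the Bellman backup takes its max only over actions the log actually contains at the next state. The dataset, the function name `offlineQ`, and the hyperparameters are invented for illustration.

```javascript
// Fixed log of (s, a, r, s2, done) tuples; states and actions are small integers.
// The dataset is made up. A third action (2) exists but never appears in the log.
const dataset = [
  { s: 0, a: 0, r: 0, s2: 1, done: false },
  { s: 1, a: 0, r: 0, s2: 2, done: false },
  { s: 2, a: 0, r: 1, s2: 3, done: true },
  { s: 1, a: 1, r: 0, s2: 0, done: false },
];

function offlineQ(data, { gamma = 0.9, sweeps = 200, lr = 0.5 } = {}) {
  const key = (s, a) => `${s}|${a}`;
  const q = new Map();
  const seen = new Map(); // state -> set of actions observed there
  for (const { s, a } of data) {
    if (!seen.has(s)) seen.set(s, new Set());
    seen.get(s).add(a);
    q.set(key(s, a), 0);
  }
  for (let i = 0; i < sweeps; i++) {
    for (const { s, a, r, s2, done } of data) {
      let bootstrap = 0; // states with no logged actions default to zero value (an assumption)
      if (!done && seen.has(s2)) {
        // Max only over actions logged at s2: Q is never queried out of distribution.
        bootstrap = Math.max(...[...seen.get(s2)].map((a2) => q.get(key(s2, a2))));
      }
      const old = q.get(key(s, a));
      q.set(key(s, a), old + lr * (r + gamma * bootstrap - old));
    }
  }
  // Greedy policy, again restricted to logged actions.
  const policy = new Map();
  for (const [state, actions] of seen) {
    let bestA = null;
    let bestQ = -Infinity;
    for (const a of actions) {
      const v = q.get(key(state, a));
      if (v > bestQ) {
        bestQ = v;
        bestA = a;
      }
    }
    policy.set(state, bestA);
  }
  return { q, policy };
}

const { q, policy } = offlineQ(dataset);
console.log([...policy.entries()]);
```

Because the unlogged action never gets a Q entry, it can neither inflate a backup target nor be chosen by the resulting policy, which is the tabular analogue of staying on the data manifold.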

Fix #3 — safe exploration: the robot can break itself

One assumption every earlier lesson made for free is fatal here: that exploration is harmless. A simulated agent that drives off a cliff just resets. A real arm that swings to a joint limit at full torque snaps a tendon; a quadruped that explores a backflip lands on its sensors. Exploration on hardware has a physical downside that does not reset.

So real-robot RL constrains exploration rather than maximizing it, treating the task as a constrained MDP in which exploration carries a physical cost.
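
One simple form such a constraint can take is a filter between the policy and the motor. The sketch below is an assumption about what that might look like for a single joint, not a method from this course: it clamps the requested torque and refuses to push further into a margin near the joint stops. The limit values, the `safeTorque` name, and the convention that positive torque increases the angle are all hypothetical, and a real safety layer would also account for velocity.

```javascript
// Hypothetical limits for one joint; real values come from the robot's datasheet.
const LIMITS = { maxTorque: 2.0, minAngle: -1.5, maxAngle: 1.5, margin: 0.1 };

// Clamp the requested torque, then forbid pushing further into a joint-limit margin.
// Assumes positive torque increases the joint angle.
function safeTorque(requested, angle, limits = LIMITS) {
  const { maxTorque, minAngle, maxAngle, margin } = limits;
  let torque = Math.max(-maxTorque, Math.min(maxTorque, requested));
  if (angle >= maxAngle - margin && torque > 0) torque = 0; // near the upper stop
  if (angle <= minAngle + margin && torque < 0) torque = 0; // near the lower stop
  return torque;
}

// Exploration noise is added first and filtered second, so noise cannot bypass the limits.
function exploreSafely(policyTorque, angle, noise) {
  return safeTorque(policyTorque + noise, angle);
}

console.log(exploreSafely(1.5, 0.0, 1.0));  // large request in mid-range gets clamped
console.log(exploreSafely(0.5, 1.45, 0.3)); // pushing into the upper stop is blocked
```

The policy still explores, but only inside the set of commands the filter lets through.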

The three fixes are one loop
Domain randomization gets you a policy safe enough to deploy; safe exploration governs the few real episodes you collect; model-based and offline RL squeeze maximum learning out of those episodes. Each real-world deployment widens or re-centers p(ξ) and grows the offline log — and you go around again.

Interactive · domain randomization vs the reality gap

Train a toy reaching policy two ways and then test it under shifted physics. With randomization OFF, training pins the policy to one nominal physics (friction = 1.0): sim success is high, but slide the test-physics shift away from nominal and success collapses — that collapse is the lesson. Turn randomization ON and training averages over a band of physics: sim success dips a little, but the success curve stays high across a wide range of shifts. The dashed line marks the real robot's (unknown, off-nominal) physics.

Sim-to-real: overfit one physics vs randomize over many
Toggle domain randomization, then drag the test-physics shift to move the "real robot" away from the simulator's nominal setting. Watch what each training regime does at the real-robot line.
[Interactive widget. Readouts: mode (randomization OFF = overfit, ON = randomized), sim success (nominal), real success (shifted), sim→real drop.]
The JS that runs this widget:
// success(shift) = how well the trained policy does at a given physics offset.
// OFF: a sharp Gaussian bump centered at nominal (shift 0) — great at 0, dead away from it.
// ON : a plateau of half-width w, with a slightly lower peak but flat across the randomized band.
function success(shift, randomize, w){
  if (!randomize){
    const peak = 0.97, sigma = 0.22;          // sharp, overfit to nominal
    return peak * Math.exp(-(shift*shift)/(2*sigma*sigma));
  } else {
    const peak = 0.88;                          // give up a little peak...
    const edge = w;                             // ...to hold a plateau of half-width w
    if (Math.abs(shift) <= edge) return peak;   // flat inside the trained band
    const d = Math.abs(shift) - edge;           // gentle falloff just past it
    return peak * Math.exp(-(d*d)/(2*0.18*0.18));
  }
}
// "sim success" = success at shift 0 (nominal). "real success" = success at the slider's shift.

Notice the failure mode you can dial in: randomization OFF, shift around +0.55, and the policy that scored 0.97 in the simulator scores about 0.04 on the "real" robot, a textbook sim-to-real collapse. Flip randomization ON and, as long as the shift sits inside or just past the trained band, the same shift barely dents performance, at the price of a sim-success number that started a little lower. That is the entire trade in one picture.
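
Those numbers can be checked directly from the widget's `success` function, repeated here so the snippet runs on its own. The band half-width `w = 0.6` is an assumed slider setting (the widget's actual default is not shown), chosen so that the +0.55 shift falls inside the randomized band.

```javascript
// Same model as the widget: sharp Gaussian when overfit, plateau when randomized.
function success(shift, randomize, w) {
  if (!randomize) {
    const peak = 0.97, sigma = 0.22;
    return peak * Math.exp(-(shift * shift) / (2 * sigma * sigma));
  }
  const peak = 0.88;
  if (Math.abs(shift) <= w) return peak;
  const d = Math.abs(shift) - w;
  return peak * Math.exp(-(d * d) / (2 * 0.18 * 0.18));
}

const shift = 0.55; // the "real robot" offset from nominal
const w = 0.6;      // assumed half-width of the randomized band

const report = {};
for (const randomize of [false, true]) {
  const sim = success(0, randomize, w);      // sim success at nominal physics
  const real = success(shift, randomize, w); // real success at the shifted physics
  report[randomize ? "on" : "off"] = { sim, real, drop: sim - real };
}
console.log(report);
```

With randomization OFF the drop is most of the 0.97 the policy started with; with it ON and the shift inside the band, the drop is zero and the only cost is the lower 0.88 peak.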

Map back to the value / policy / model spine

Nothing here is outside the course; sim-to-real is a re-combination of three pieces of the spine pointed at one problem:

| Sim-to-real ingredient | Spine piece it reuses | Branch |
| --- | --- | --- |
| The policy being transferred (continuous torques) | DDPG / TD3 / SAC: continuous control, lesson 18 | policy (actor) + value (critic) |
| Domain randomization | Robustness as regularized policy optimization over 𝔼_ξ, a wrapper on lesson 18's objective | policy |
| Learn P̂_φ, imagine rollouts, plan | Model-based RL & planning, lesson 07 (MCTS/DP lookahead) | model |
| Reuse logged real episodes | Offline / conservative RL, lessons 19–21 (BCQ, CQL) | value (pessimistic critic) |
| Safe exploration | Constrained MDP: the exploration/exploitation tradeoff (lesson 08) with a physical cost | both |

The robot makes the spine's three axes tangible: the policy is the thing you ship, the model is how you stop wasting real time, the value (made pessimistic) is how you stay safe on logged data. Lesson 58 takes the same toolkit to autonomous driving, where the stakes of unsafe exploration and the role of imitation + offline RL get sharper still.

Takeaway
A policy that is optimal for one simulator is overfit to one physics; it fails on the real robot because P_real ≠ P_sim. Domain randomization turns reality into just-another-sampled-physics by training over a family p(ξ), trading a little peak sim performance for transfer. Because real samples are precious, model-based RL (lesson 07) imagines rollouts from a learned P̂_φ and offline RL (lesson 19) reuses logged episodes, while safe exploration keeps the robot from breaking itself. Sim-to-real is not a new algorithm: it is lessons 07, 18, and 19 aimed together at the gap between a simulator and the world.