Robot control (Part 2): sim-to-real & sample efficiency
Lesson 56 trained a continuous-control policy in a simulator and ended on a cliffhanger: it works in sim, then fails on the real arm. This lesson is about closing that gap — making a simulator-trained policy survive contact with reality, and squeezing the most out of the few real samples you can afford.
- Continuous control (lesson 18) — DDPG / TD3 / SAC give us the policy πθ that outputs continuous torques. That is what we are transferring.
- Model-based RL & planning (lesson 07) — if we learn the dynamics P̂(s'|s,a), we can imagine rollouts and plan in our head, spending compute instead of real robot time.
- Offline RL (lesson 19) — every real-robot episode is logged. Offline RL lets us re-learn from that fixed dataset without risking new interaction.
What broke: the sim-to-real gap
A simulator is a fast, free, safe MDP. You can run a million SAC episodes overnight and the policy converges to near-perfect performance — in the simulator. Then you load those weights onto the physical robot and it flails.
The reason is precise. RL maximizes return under the transition dynamics it was trained on:
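Written out in standard notation (the trajectory τ, discount γ, and horizon T are assumptions here, since this lesson does not define them), training in the simulator solves roughly:

$$
\pi^{*}_{\text{sim}} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim (\pi,\, P_{\text{sim}})}\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
$$

The expectation is taken over trajectories generated by the policy acting under the simulator's transition dynamics P_sim, and nothing else.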
But the real robot runs under P_real, not P_sim. Real friction is higher than your guess; the real motors have latency the sim ignored; the real link mass is 12% off the CAD model; the camera is a few milliseconds behind. The policy overfit to the exact physics of one simulator, and a policy that is optimal for P_sim can be terrible for P_real:
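One way to state the gap, writing J_P(π) for the expected discounted return of π under dynamics P (a shorthand assumed here, not defined in the lesson):

$$
J_{P}(\pi) \;=\; \mathbb{E}_{\tau \sim (\pi,\, P)}\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right],
\qquad
J_{P_{\text{real}}}\!\left(\pi^{*}_{\text{sim}}\right) \;\ll\; \max_{\pi}\, J_{P_{\text{real}}}(\pi)
$$

Here π*_sim denotes the policy that maximizes J under P_sim. The inequality is not guaranteed in general; it describes the failure case in which the two dynamics differ in ways the policy is sensitive to.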
Fix #1 — domain randomization: treat reality as just another variation
If a policy overfits to one setting of the physics, train it on many settings. Domain randomization samples the simulator's physical parameters at the start of every episode: link masses, friction coefficients, motor gains, sensor latency, even gravity and visual textures. Let ξ be that bundle of parameters, drawn from a distribution p(ξ) you choose to be wide enough to bracket reality:
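In the usual notation (P_ξ for the simulator dynamics under parameters ξ, plus the trajectory τ and discount γ, are assumed symbols), the randomized training objective can be sketched as:

$$
\pi^{*}_{\text{DR}} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\xi \sim p(\xi)}\; \mathbb{E}_{\tau \sim (\pi,\, P_{\xi})}\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
$$

The only change from single-simulator training is the outer expectation over ξ: one fresh draw of the physics per episode.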
Now the policy is no longer optimal for one physics; it is optimal on average across a whole family of physics. The bet (and it is a remarkably good one) is that if P_real lies inside the support of p(ξ), then to the trained policy reality is just one more sample of ξ. It has already seen robots that were heavier, stickier, and laggier than the real one, so the real one is unremarkable.
The cost is honest: averaging over many physics means the policy is no longer the razor-sharp optimum for any single one. You give up a little peak sim performance to buy robustness. That trade — slightly lower sim success, dramatically higher transfer — is exactly what the widget below lets you feel.
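The trade can also be reproduced in a few lines. The sketch below is a toy stand-in, not the course's widget and not a real RL run: the "robot" is a 1-D reaching task, the policy is a single feedback gain `k`, the physics parameter ξ is a hypothetical actuator gain (nominal 1.0), and "training" is an exhaustive search over `k`. All names and numbers are illustrative assumptions.

```python
import numpy as np

T = 30        # control steps per episode
TOL = 0.05    # an episode "succeeds" if the final |error| is below this

def rollout(k, gain):
    """One reaching episode with policy a = -k * error on physics `gain`.

    Returns (episode return, final absolute error)."""
    error, ret = 1.0, 0.0
    for _ in range(T):
        ret -= abs(error)                 # reward: negative distance to target
        error = error - gain * k * error  # actuator gain scales the command
    return ret, abs(error)

def train(gains, k_grid):
    """Stand-in for RL training: pick the k with the best mean return
    over the physics settings seen during training."""
    scores = [np.mean([rollout(k, g)[0] for g in gains]) for k in k_grid]
    return float(k_grid[int(np.argmax(scores))])

rng = np.random.default_rng(0)
k_grid = np.linspace(0.05, 2.0, 40)

k_nominal = train([1.0], k_grid)        # randomization OFF: one physics
xi = rng.uniform(0.5, 2.5, size=256)    # randomization ON: one xi per episode
k_random = train(xi, k_grid)

for name, k in [("nominal", k_nominal), ("randomized", k_random)]:
    sim_return, _ = rollout(k, 1.0)     # evaluate in the nominal simulator
    _, real_err = rollout(k, 2.2)       # evaluate on shifted "real" physics
    print(f"{name:10s} k={k:.2f} sim_return={sim_return:.2f} "
          f"real_success={real_err < TOL}")
```

In this toy, the nominal policy picks the most aggressive gain that is perfect at ξ = 1.0 and becomes unstable once the actuator gain exceeds 2, while the randomized policy settles on a smaller gain that scores slightly worse in the nominal simulator but still converges on the shifted physics.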
Fix #2 — sample efficiency: real data is the expensive resource
Domain randomization shrinks the gap but does not eliminate it, and eventually you must touch the real robot. Here the economics flip. In sim, samples are free; on hardware, every episode costs minutes of wall-clock time, supervision, wear on the joints, and the risk of a crash. A model-free SAC run that needs ten million transitions is a non-starter when each transition takes one real control step: at an illustrative 20 Hz control rate, that is roughly 140 hours of continuous robot motion before counting resets. So robotics leans hard on the two sample-saving ideas we already built.
Model-based RL — plan in your head (lesson 07)
Lesson 07 made the point: if you know the model, you can plan instead of sampling. Here we don't know P_real, so we learn it. Fit a dynamics model P̂_φ(s'|s,a) (and a reward model) to the logged transitions, then either generate imagined rollouts inside P̂ and train the policy on those, or plan over a short horizon at decision time (model-predictive control, MPC), the cousin of the MCTS lookahead from lesson 07:
    collect a few REAL transitions (s, a, r, s')      ← expensive, do sparingly
                    │
                    ▼
    fit dynamics model P̂_φ(s'|s,a)                    ← supervised learning
                    │
                    ▼
    imagine many rollouts inside P̂ ──────────► improve π_θ on imagined data (free)
                    │                                      │
                    └────────── act on real robot ◄────────┘
One real transition now trains the policy many times over, because the learned model manufactures synthetic experience. This is the planning/model-based half of the spine doing sample-efficiency duty. The catch is model bias: imagined rollouts that wander where P̂ was never fit are fiction — which is the same out-of-distribution disease we name next.
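A minimal sketch of that loop, under loud assumptions: the "real robot" is a hypothetical 1-D linear system (`A_TRUE`, `B_TRUE` are made up), the learned model is linear and fit by least squares, and the planner is random-shooting MPC rather than policy training on imagined data. It illustrates the collect, fit, imagine, act cycle, not any particular published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE = 1.1, 0.5    # hypothetical "real" dynamics, hidden from the agent

def real_step(s, a):
    """One expensive step on the (simulated stand-in for the) real robot."""
    return A_TRUE * s + B_TRUE * a + 0.01 * rng.normal()

# 1) Collect a few real transitions with random actions.
S = rng.uniform(-1.0, 1.0, size=40)
A = rng.uniform(-1.0, 1.0, size=40)
S_next = np.array([real_step(s, a) for s, a in zip(S, A)])

# 2) Fit the dynamics model s' ~ a_hat * s + b_hat * a by least squares.
X = np.stack([S, A], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(X, S_next, rcond=None)

def plan(s0, horizon=3, n_candidates=512):
    """3) Random-shooting MPC: imagine rollouts inside the learned model
    and return the first action of the best imagined sequence."""
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    s = np.full(n_candidates, s0)
    total = np.zeros(n_candidates)
    for t in range(horizon):
        s = a_hat * s + b_hat * seqs[:, t]   # imagined step, costs no robot time
        total -= s ** 2                      # reward: stay near the origin
    return seqs[np.argmax(total), 0]

# 4) Act on the real system, replanning at every step.
s = 1.0
for _ in range(15):
    s = real_step(s, plan(s))
final_state = s
print(f"a_hat={a_hat:.3f} b_hat={b_hat:.3f} final |s|={abs(final_state):.3f}")
```

Forty real transitions buy 512 imagined rollouts per decision here. The model-bias caveat applies directly: the model was fit on states and actions in [-1, 1], so imagined rollouts far outside that range would not be trustworthy.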
Offline RL — reuse the logs you already have (lesson 19)
Every real run leaves a trail of (s,a,r,s') tuples. Offline RL (lesson 19) learns a better policy from that fixed dataset with no new interaction — which on a robot means no new risk. The catch is the one lesson 19 diagnosed: the Bellman backup queries Q at out-of-distribution actions the logs never contain, and the error compounds. So robotics uses the conservative variants (BCQ / CQL, lessons 20–21) that keep the policy on the data manifold. Domain-randomized sim data and real logs can even be pooled into one offline dataset.
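To make "conservative" concrete, here is a sketch of a CQL-style regularizer in a tabular, discrete-action setting. This is a simplification chosen for readability (robot control is continuous, and lessons 20 and 21 may present it differently); the function names and the tiny Q-table are illustrative assumptions, and terminal states are ignored.

```python
import numpy as np

def cql_penalty(Q, states, actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_logged) over the logged pairs.

    Large when actions absent from the logs carry high Q-values."""
    q_s = Q[states]                                   # shape (N, n_actions)
    m = q_s.max(axis=1, keepdims=True)                # for numerical stability
    lse = (m + np.log(np.exp(q_s - m).sum(axis=1, keepdims=True)))[:, 0]
    return float(np.mean(lse - Q[states, actions]))

def cql_loss(Q, batch, gamma=0.99, alpha=1.0):
    """Squared TD error on the logged transitions plus the conservative term."""
    s, a, r, s_next = batch
    target = r + gamma * Q[s_next].max(axis=1)        # max may hit unseen actions
    td = 0.5 * np.mean((Q[s, a] - target) ** 2)
    return float(td + alpha * cql_penalty(Q, s, a))

# Logs contain only action 0 in every state; action 1 was never tried.
states = np.array([0, 1, 2])
actions = np.array([0, 0, 0])

Q_flat = np.zeros((3, 2))
Q_inflated = Q_flat.copy()
Q_inflated[:, 1] = 5.0        # optimistic values on the unseen action

penalty_flat = cql_penalty(Q_flat, states, actions)          # log(2)
penalty_inflated = cql_penalty(Q_inflated, states, actions)  # just above 5
print(penalty_flat, penalty_inflated)
```

Minimizing the penalty pushes down Q-values on actions the logs never contain relative to the logged ones, which is how the critic is kept pessimistic about out-of-distribution actions.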
Fix #3 — safe exploration: the robot can break itself
One assumption every earlier lesson made for free is fatal here: that exploration is harmless. A simulated agent that drives off a cliff just resets. A real arm that swings to a joint limit at full torque snaps a tendon; a quadruped that explores a backflip lands on its sensors. Exploration on hardware has a physical downside that does not reset.
So real-robot RL constrains exploration rather than maximizing it:
- Action / torque limits and safety layers that clamp or veto any command leaving a safe envelope (a constrained MDP — maximize return subject to a cost budget 𝔼[C] ≤ d).
- Learn the dangerous part in sim, where falling is free, and only fine-tune the last mile on hardware — which is precisely why domain randomization matters: it makes the sim-trained policy safe enough to even start on the real robot.
- Offline / conservative updates so the deployed policy never strays to untested actions between data-collection rounds.
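The first item, a safety layer that clamps or vetoes commands, might look like the sketch below. It is a deliberately crude illustration: `safety_layer`, its parameters, and the one-step position prediction are all assumptions, and it presumes positive torque drives a joint toward `q_max`. A real robot would use its vendor's limits and a proper dynamics model.

```python
import numpy as np

def safety_layer(torque, q, qd, *, tau_max, q_min, q_max, dt):
    """Clamp torques to the actuator limit, then veto any torque that
    pushes a joint further past a position limit it is about to reach.

    torque, q, qd: per-joint command, position, and velocity arrays.
    """
    tau = np.clip(torque, -tau_max, tau_max)          # clamp to safe envelope
    q_pred = q + qd * dt                              # crude one-step prediction
    pushing_out = ((q_pred >= q_max) & (tau > 0)) | ((q_pred <= q_min) & (tau < 0))
    return np.where(pushing_out, 0.0, tau)            # veto by zeroing the torque

raw = np.array([5.0, -0.2, 0.3])        # policy output for three joints
q = np.array([0.0, 0.999, 0.999])       # joints 1 and 2 sit near the upper limit
qd = np.array([0.0, 1.0, 1.0])
safe = safety_layer(raw, q, qd, tau_max=1.0, q_min=-1.0, q_max=1.0, dt=0.02)
print(safe)   # joint 0 clamped, joint 1 allowed (pulling back), joint 2 vetoed
```

The policy still proposes whatever it likes; the layer sits between policy and motors, so exploration noise cannot leave the envelope regardless of what the learner does.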
Interactive · domain randomization vs the reality gap
Train a toy reaching policy two ways and then test it under shifted physics. With randomization OFF, training pins the policy to one nominal physics (friction = 1.0): sim success is high, but slide the test-physics shift away from nominal and success collapses — that collapse is the lesson. Turn randomization ON and training averages over a band of physics: sim success dips a little, but the success curve stays high across a wide range of shifts. The dashed line marks the real robot's (unknown, off-nominal) physics.
Notice the failure mode you can dial in: randomization OFF, shift around +0.55, and the policy that scored 0.97 in the simulator scores almost nothing on the "real" robot — a textbook sim-to-real collapse. Flip randomization ON and the same shift barely dents performance, at the price of a sim-success number that started a little lower. That is the entire trade in one picture.
Map back to the value / policy / model spine
Nothing here is outside the course; sim-to-real is a re-combination of three pieces of the spine pointed at one problem:
| Sim-to-real ingredient | Spine piece it reuses | Branch |
|---|---|---|
| The policy being transferred (continuous torques) | DDPG / TD3 / SAC — continuous control, lesson 18 | policy (actor) + value (critic) |
| Domain randomization | Robustness as regularized policy optimization over 𝔼_ξ (a wrapper on lesson 18's objective) | policy |
| Learn P̂, imagine rollouts, plan | Model-based RL & planning, lesson 07 (MCTS/DP lookahead) | model |
| Reuse logged real episodes | Offline / conservative RL, lessons 19–21 (BCQ, CQL) | value (pessimistic critic) |
| Safe exploration | Constrained MDP: the exploration/exploitation tradeoff (lesson 08) with a physical cost | policy + value |
The robot makes the spine's three axes tangible: the policy is the thing you ship, the model is how you stop wasting real time, the value (made pessimistic) is how you stay safe on logged data. Lesson 58 takes the same toolkit to autonomous driving, where the stakes of unsafe exploration and the role of imitation + offline RL get sharper still.