Robot control (part 1): manipulators
A robot arm reaching for a cup is lesson 18 embodied. The state is joint angles and velocities, the action is a torque vector in ℝ^d, and there is no argmax_a to take. The new lesson is not the algorithm; it is the reward: how you write it decides whether the arm learns to reach, refuses to learn, or learns the wrong thing entirely.
The manipulation MDP
Take the canonical task: a two-link planar arm must move its fingertip to a target. Write it as an MDP exactly as in lesson 01.
| MDP piece | For a reaching arm |
|---|---|
| State s | joint angles (q_1, q_2), joint velocities (q̇_1, q̇_2), and the target position (x_g, y_g). |
| Action a | a torque applied at each joint: a = (τ_1, τ_2) ∈ ℝ^2. Real arms typically have 6 or 7 joints, so a ∈ ℝ^6 or ℝ^7. |
| Transition P(s' ∣ s, a) | rigid-body dynamics: the torque accelerates the links, and a physics simulator integrates one timestep. |
| Reward R(s,a) | some function of "how close is the fingertip to the target?" — and writing this is the hard part. |
| Discount γ | just under 1: the arm should care about reaching soon, not eventually. |
The control loop runs at, say, 20–100 Hz: every 10–50 ms the policy reads s and emits a fresh torque vector. The episode ends when the fingertip is within some tolerance of the target, or after a time limit.
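To make the table concrete, here is a minimal Python sketch of the state, the fingertip position, and the episode-termination test. The link lengths, tolerance, step limit, and every name in it (`ArmState`, `fingertip`, `is_done`) are illustrative assumptions rather than values from this lesson, as is the convention that the elbow angle is measured relative to the first link.

```python
import math
from dataclasses import dataclass

# Illustrative constants (assumed, not taken from the lesson).
LINK1 = 1.0        # length of the first link
LINK2 = 1.0        # length of the second link
TOLERANCE = 0.05   # radius of the "close enough" ball around the target
MAX_STEPS = 200    # time limit per episode


@dataclass
class ArmState:
    q1: float    # shoulder angle, radians
    q2: float    # elbow angle, radians, relative to the first link
    dq1: float   # shoulder angular velocity
    dq2: float   # elbow angular velocity
    xg: float    # target x
    yg: float    # target y


def fingertip(s):
    """Forward kinematics of a planar two-link arm anchored at the origin."""
    x = LINK1 * math.cos(s.q1) + LINK2 * math.cos(s.q1 + s.q2)
    y = LINK1 * math.sin(s.q1) + LINK2 * math.sin(s.q1 + s.q2)
    return x, y


def distance_to_target(s):
    x, y = fingertip(s)
    return math.hypot(x - s.xg, y - s.yg)


def is_done(s, step):
    """Episode ends on reaching the target or on hitting the time limit."""
    return distance_to_target(s) <= TOLERANCE or step >= MAX_STEPS
```

The dynamics (how a torque changes `q1`, `q2` and their velocities) are deliberately left out: in practice that step belongs to the physics simulator.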
Why this forces lesson 18
The action is a continuous torque vector, and that single fact knocks out both branches of the fork from the orientation lesson in exactly the way lesson 18 described.
Concretely, a robot lab reaches for SAC or TD3 from lesson 18. SAC's max-entropy objective is popular because the entropy term keeps the arm exploring a spread of torques instead of collapsing onto one twitchy motion, and its off-policy replay buffer means every expensive interaction is reused many times, which matters enormously when each "step" is a real motor moving real metal. Nothing in the algorithm is new. The thing the algorithm cannot fix for you is the reward.
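The reuse that makes off-policy methods attractive here comes from the replay buffer. The following is a minimal sketch of one, assuming uniform sampling and a fixed capacity; the class and method names are hypothetical, and real SAC or TD3 implementations typically store transitions in preallocated arrays instead.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement inside one batch; the same
        # transition can still appear in many different batches over time.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

Each environment step is added once but may be drawn into many gradient updates, which is the sense in which one expensive interaction is "reused many times".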
The reward is the whole game: shaping
The honest reward for "reach the target" is sparse: r = 1 on the step the fingertip first comes within a tolerance δ of the target, 0 everywhere else.
This is correct: it rewards exactly the thing we want and nothing else. It is also nearly untrainable from scratch. A flailing arm lands inside the tolerance ball of radius δ by chance maybe once in thousands of episodes, so the critic sees r = 0 almost always and there is no gradient pointing toward the goal. This is the credit-assignment problem from lesson 01 at its worst: the one informative reward is buried under a sea of zeros.
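The fix is reward shaping: hand the agent a denser signal that increases as it gets warmer. The standard choice is the negative fingertip-to-target distance on every step: r = −‖p_tip − p_goal‖, where p_tip and p_goal are the fingertip and target positions.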
Now every step carries information: move the fingertip 1 cm closer and the reward rises by that same 1 cm. The gradient points at the goal from the very first episode, and training that took millions of sparse samples now succeeds in a fraction of them. The widget below makes the gap visible: flip from sparse to shaped and watch the reach time collapse.
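The two rewards can be put side by side in a few lines. This is a sketch under assumptions: positions are 2-D tuples in meters, the tolerance default of 0.05 is made up, and the function names are hypothetical.

```python
import math


def tip_distance(tip, goal):
    """Euclidean distance between fingertip and target, both (x, y) tuples."""
    return math.hypot(tip[0] - goal[0], tip[1] - goal[1])


def sparse_reward(tip, goal, tol=0.05):
    """1 only inside the tolerance ball, 0 everywhere else."""
    return 1.0 if tip_distance(tip, goal) <= tol else 0.0


def shaped_reward(tip, goal):
    """Negative distance: less negative as the fingertip gets closer."""
    return -tip_distance(tip, goal)


goal = (0.0, 0.0)
far, nearer = (0.50, 0.0), (0.49, 0.0)   # two misses, 1 cm apart

# The sparse reward cannot tell the two misses apart.
print(sparse_reward(far, goal), sparse_reward(nearer, goal))
# The shaped reward ranks the nearer miss higher.
print(shaped_reward(far, goal), shaped_reward(nearer, goal))
```

Both misses score 0.0 under the sparse reward, while the shaped reward separates them by about 0.01, which is the learning signal the sparse version withholds.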
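Shaping has a price: a carelessly shaped reward can change which behavior is optimal, so the agent ends up maximizing the proxy instead of the goal. The clean fix, when you can afford it, is potential-based shaping: add F = γΦ(s') − Φ(s) to the original reward for some potential Φ (e.g. Φ = −distance). Ng's theorem (Ng, Harada, and Russell, 1999) says this form leaves the optimal policy unchanged while still densifying the signal, so you get the fast training of shaping with a guarantee that you have not changed which behavior is best. The widget's "mis-shaped" mode is what you get when you ignore that discipline.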
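One way to see why the potential-based form is safe: the discounted sum of F along a trajectory of T steps telescopes to γ^T Φ(s_T) − Φ(s_0), so it depends only on the start state, the end state, and the number of steps, not on the route taken between them. The sketch below checks that numerically. It illustrates the intuition and is not a proof of the theorem; the function names and the sample paths are made up for the example.

```python
import math


def potential(tip, goal):
    """Phi = -distance, the example potential from the text."""
    return -math.hypot(tip[0] - goal[0], tip[1] - goal[1])


def shaping_bonus(tip, next_tip, goal, gamma):
    """F = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_tip, goal) - potential(tip, goal)


def discounted_shaping_sum(path, goal, gamma):
    """Sum of gamma^t * F_t over consecutive fingertip positions in path."""
    return sum(
        gamma ** t * shaping_bonus(path[t], path[t + 1], goal, gamma)
        for t in range(len(path) - 1)
    )


goal, gamma = (0.0, 0.0), 0.99
direct = [(1.0, 0.0), (0.5, 0.5), (0.0, 0.2)]   # sensible route
detour = [(1.0, 0.0), (2.0, 2.0), (0.0, 0.2)]   # same endpoints, wild middle

# Closed form predicted by the telescoping argument (T = 2 steps).
closed_form = gamma ** 2 * potential(direct[-1], goal) - potential(direct[0], goal)

assert abs(discounted_shaping_sum(direct, goal, gamma) - closed_form) < 1e-9
assert abs(discounted_shaping_sum(detour, goal, gamma) - closed_form) < 1e-9
```

Because the shaping total cannot be inflated by wandering, the agent has no incentive to farm the bonus; a hand-written bonus without this structure offers no such protection.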
Interactive · reach the cup, and pick your reward
A two-link arm (anchored at the left) is driven by a small continuous policy: it reads the angle-to-target error and emits a torque on each joint, exactly the μ_φ(s) idea from lesson 18. The policy is trained live by a tiny hill-climbing update on whatever reward you select. Watch how fast, or whether, it learns to touch the green target.
Three things to try. (1) Reset, choose sparse, train — the reach rate crawls; the arm almost never stumbles onto the reward. (2) Reset, choose shaped, train — reach rate climbs fast and final distance drops toward zero. (3) Reset, choose mis-shaped, train — watch max tip speed shoot up while the arm blows past the target: it optimized the proxy you wrote, not the goal you meant.
The first crack: the simulation gap
Everything above quietly assumed we can take millions of steps. On a real arm you cannot — each step is slow, and a flailing policy breaks hardware. So almost all robot RL trains in a physics simulator, where steps are cheap and crashes are free, then ships the learned policy to the real arm.
The problem is that the simulator is wrong in a thousand small ways: friction, motor latency, link masses, sensor noise, contact dynamics. A policy that exploited the sim's exact numbers — and lesson 18's off-policy optimizers are very good at exploiting exact numbers — can fall apart on the real arm. Flip the widget's sim gap toggle: the same trained policy now runs on perturbed dynamics and its reach rate drops. That degradation is the sim-to-real gap, and closing it (domain randomization, model-based RL, safe exploration) is the entire subject of lesson 57.
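Of the remedies listed, domain randomization is the simplest to sketch: rather than training against one set of simulator constants, draw a perturbed set for each episode so the policy cannot rely on exact numbers. The example below is a minimal illustration under assumptions. The parameter names, nominal values, and the uniform ±20% spread are all invented for the example, and it does not call any real simulator API.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float          # joint friction coefficient
    link_mass: float         # mass of each link, kg
    motor_latency_s: float   # delay between command and torque, seconds
    sensor_noise_std: float  # std of noise added to joint readings


# Made-up nominal values standing in for a calibrated simulator.
NOMINAL = SimParams(
    friction=0.10, link_mass=1.0, motor_latency_s=0.010, sensor_noise_std=0.01
)


def randomize(nominal, spread=0.2, rng=random):
    """Return a copy with every parameter scaled by a factor in [1-spread, 1+spread]."""

    def jitter(value):
        return value * rng.uniform(1.0 - spread, 1.0 + spread)

    return SimParams(
        friction=jitter(nominal.friction),
        link_mass=jitter(nominal.link_mass),
        motor_latency_s=jitter(nominal.motor_latency_s),
        sensor_noise_std=jitter(nominal.sensor_noise_std),
    )


# One fresh draw per training episode (the training loop itself is omitted).
rng = random.Random(0)
episode_params = [randomize(NOMINAL, spread=0.2, rng=rng) for _ in range(3)]
```

In a real setup each parameter would usually get its own range, chosen from how uncertain that quantity is on the physical arm.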
Map back to the spine
Place this lesson on the value / policy / model map from orientation:
- It is the policy branch reunited with value: actor–critic. The arm is driven by an actor μ_φ(s) emitting a continuous torque, trained by a critic, exactly the continuous control machinery of lesson 18 (DDPG → TD3 → SAC). Manipulation did not need a new algorithm; it needed the one we already had.
- The new content is reward, not method. Sparse vs shaped reward is the signal question from orientation, made physical: the agent only ever optimizes the scalar you hand it, so a mis-specified shape produces a confidently wrong policy. That is reward hacking, the same disease that afflicts RLHF (lesson 16), here without a KL leash to restrain it.
- It opens the next crack. Continuous control assumed cheap interaction; real robots make interaction expensive and dangerous, so we train in sim and pay the sim-to-real gap — the bridge into lesson 57.