Robot control (part 1): manipulators

A robot arm reaching for a cup is lesson 18 embodied. The state is joint angles and velocities, the action is a torque vector in ℝ^d, and there is no argmax_a to take. The new lesson is not the algorithm — it is the reward: how you write it decides whether the arm learns to reach, refuses to learn, or learns the wrong thing entirely.

What this reuses

This is an applications lesson. It reuses lesson 18 — continuous control (DDPG / TD3 / SAC) almost wholesale: a deterministic-or-stochastic actor μ_φ(s) emits the action, a critic supplies the gradient, an off-policy replay buffer recycles experience. We do not re-derive any of that. What is new here is everything around the algorithm: turning a physical task into an MDP, shaping the reward, and the first crack in the foundation — the simulation gap (lesson 57).

The manipulation MDP

Take the canonical task: a two-link planar arm must move its fingertip to a target. Write it as an MDP exactly as in lesson 01.

MDP piece	For a reaching arm
State s	joint angles (q₁, q₂), joint velocities (q̇₁, q̇₂), and the target position (x_g, y_g).
Action a	a torque applied at each joint: a = (τ₁, τ₂) ∈ ℝ². Real arms typically have 6 or 7 joints, so a ∈ ℝ⁶ or ℝ⁷.
Transition P(s'|s,a)	rigid-body dynamics: the torque accelerates the links, and a physics simulator integrates one timestep.
Reward R(s,a)	some function of "how close is the fingertip to the target?" — and writing this is the hard part.
Discount γ	just under 1: the arm should care about reaching soon, not eventually.

The control loop runs at, say, 20–100 Hz: every 10–50 ms the policy reads s and emits a fresh torque vector. The episode ends when the fingertip is within some tolerance of the target, or after a time limit.
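A minimal sketch of the pieces above, assuming unit link lengths, a 50 Hz control rate, a 5 cm tolerance, and a 200-step time limit (all illustrative values, not taken from the lesson's widget). `fingertip` is the standard forward kinematics of a two-link planar arm; `makeState` and `isDone` are hypothetical helper names.

```javascript
// Illustrative constants (assumptions, not the widget's values).
const LINK1 = 1.0, LINK2 = 1.0; // link lengths
const DT = 1 / 50;              // 50 Hz control -> 20 ms per step
const GOAL_TOL = 0.05;          // tolerance radius around the target
const MAX_STEPS = 200;          // time limit per episode

// Forward kinematics of a two-link planar arm: joint angles -> fingertip (x, y).
function fingertip(q1, q2) {
  return {
    x: LINK1 * Math.cos(q1) + LINK2 * Math.cos(q1 + q2),
    y: LINK1 * Math.sin(q1) + LINK2 * Math.sin(q1 + q2),
  };
}

// State vector s = (q1, q2, q1dot, q2dot, xg, yg), as in the table above.
function makeState(q1, q2, q1dot, q2dot, goal) {
  return [q1, q2, q1dot, q2dot, goal.x, goal.y];
}

// The episode ends on success (fingertip within tolerance) or on the time limit.
function isDone(s, step) {
  const tip = fingertip(s[0], s[1]);
  const dist = Math.hypot(tip.x - s[4], tip.y - s[5]);
  return dist < GOAL_TOL || step >= MAX_STEPS;
}
```

The state here carries the target explicitly, so one policy can serve many targets; the action (a torque per joint) would be applied by whatever simulator integrates the dynamics over `DT`.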

Why this forces lesson 18

The action is a continuous torque vector, and that single fact knocks out both branches of the fork from the orientation lesson, in exactly the way lesson 18 described:

argmax over torques is impossible

A value-based agent (lesson 05) picks a* = argmax_a Q(s,a). Here a ∈ ℝ² (or ℝ⁷) — there is no finite list of torque vectors to sweep, and no finite set of logits to softmax (lesson 06). You cannot enumerate ℝ^d. So we do exactly what lesson 18 prescribed: the actor μ_φ(s) emits the torque, and the critic's gradient ∇_aQ teaches it which way is uphill.
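To make the enumeration problem concrete, the sketch below first counts the actions you would get by discretizing each joint torque into bins, then shows the alternative: nudging an action uphill along ∇_aQ. The critic `toyQ` is a hypothetical quadratic and the gradient is estimated by finite differences. In DDPG, TD3, and SAC the gradient comes from backpropagation through a learned critic and is used to update the actor's parameters φ, not the action directly.

```javascript
// Number of discrete actions if each of d joints gets `bins` torque levels.
function gridSize(bins, d) {
  let n = 1;
  for (let i = 0; i < d; i++) n *= bins;
  return n;
}

// Hypothetical toy critic: Q peaks at the torque vector `best` (state ignored).
function toyQ(a, best) {
  return -a.reduce((sum, ai, i) => sum + (ai - best[i]) ** 2, 0);
}

// Central finite-difference estimate of the gradient of Q with respect to a.
function gradA(Q, a, eps = 1e-4) {
  return a.map((_, i) => {
    const hi = a.slice(); hi[i] += eps;
    const lo = a.slice(); lo[i] -= eps;
    return (Q(hi) - Q(lo)) / (2 * eps);
  });
}

// Gradient ascent on the action: a <- a + lr * grad_a Q.
function climb(Q, a0, steps = 50, lr = 0.1) {
  let a = a0.slice();
  for (let k = 0; k < steps; k++) {
    const g = gradA(Q, a);
    a = a.map((ai, i) => ai + lr * g[i]);
  }
  return a;
}

const best = [0.3, -0.7];
const aFinal = climb(a => toyQ(a, best), [0, 0]);
console.log(gridSize(11, 2), gridSize(11, 7)); // 121 19487171
console.log(aFinal);                           // close to [0.3, -0.7]
```

Even a coarse 11-level grid gives about 19.5 million actions for a 7-joint arm, while following the gradient needs no enumeration at all.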

Concretely, a robot lab reaches for SAC or TD3 from lesson 18. SAC's max-entropy objective is popular because the entropy term keeps the arm exploring a spread of torques instead of collapsing onto one twitchy motion, and its off-policy replay buffer means every expensive interaction is reused many times, which matters enormously when each "step" is a real motor moving real metal. Nothing in the algorithm is new. The thing the algorithm cannot fix for you is the reward.

The reward is the whole game: shaping

The honest reward for "reach the target" is sparse: r = 1 on the step the fingertip arrives, 0 everywhere else.

r_sparse(s) = 1 if ‖fingertip(s) − goal‖ < δ else 0

This is correct: it rewards exactly the thing we want and nothing else. It is also nearly untrainable from scratch. A flailing arm lands inside the tolerance ball of radius δ by chance maybe once in thousands of episodes, so the critic sees r = 0 almost always and has no gradient pointing toward the goal. This is the credit-assignment problem from lesson 01 at its worst: the one informative reward is buried under a sea of zeros.

The fix is reward shaping — hand the agent a denser signal that increases as it gets warmer. The standard choice is negative distance every step:

r_shaped(s) = − ‖fingertip(s) − goal‖

Now every step carries information: move the fingertip 1 cm closer and the reward rises by exactly that 1 cm. The gradient points at the goal from the very first episode, and training that took millions of sparse samples now succeeds in a fraction of them. The widget below makes the gap visible: flip from sparse to shaped and watch the reach time collapse.
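A small sketch of the contrast, assuming a 5 cm tolerance and a fingertip that walks in a straight line toward the goal (both illustrative choices). The sparse reward stays at 0 until the fingertip enters the tolerance ball, while the shaped reward rises on every step.

```javascript
const GOAL_TOL = 0.05; // illustrative tolerance (assumption)

function dist(tip, goal) {
  return Math.hypot(tip.x - goal.x, tip.y - goal.y);
}
// Sparse: 1 only inside the tolerance ball.
function rSparse(tip, goal) { return dist(tip, goal) < GOAL_TOL ? 1 : 0; }
// Shaped: negative distance on every step.
function rShaped(tip, goal) { return -dist(tip, goal); }

const goal = { x: 1.0, y: 0.0 };
// Fingertip positions approaching the goal along the x-axis.
const path = [0.0, 0.25, 0.5, 0.75, 0.98].map(x => ({ x, y: 0 }));

const sparse = path.map(p => rSparse(p, goal));
const shaped = path.map(p => rShaped(p, goal));
console.log(sparse); // [ 0, 0, 0, 0, 1 ]
console.log(shaped); // rises step by step from -1 toward 0
```

Four of the five steps look identical under the sparse reward, so nothing distinguishes a good move from a bad one until the arm is already at the goal.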

Shaping is a loaded gun: reward hacking

A shaped reward is a proxy for what you want, and the agent optimizes the proxy with no mercy. Reward the velocity toward the goal instead of the distance, and the arm learns to fly at the target at top speed and overshoot — it maximized approach speed, never "be at the goal." Reward "−distance" but forget an energy penalty, and the arm slams the joints to their limits. The optimizer does not read your intent; it reads your formula. This is the same reward-hacking failure the KL anchor fought in RLHF (lesson 16) — here there is no anchor, only the shape you wrote.

The clean fix, when you can afford it, is potential-based shaping: add F = γΦ(s') − Φ(s) for some potential Φ (e.g. Φ = −distance). Ng's theorem says this form leaves the optimal policy unchanged while still densifying the signal — you get the fast training of shaping with a guarantee you have not changed which behavior is best. The widget's "mis-shaped" mode is what you get when you ignore that discipline.
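The property behind that guarantee can be checked numerically. In the sketch below (γ = 0.99 and the two distance sequences are illustrative assumptions), the discounted sum of the shaping terms F along a trajectory telescopes to γ^T Φ(s_T) − Φ(s_0), so it depends only on where the trajectory starts and ends, not on the route taken. In use, F is added on top of the task reward; it does not replace it.

```javascript
const GAMMA = 0.99; // illustrative discount (assumption)

// Potential: negative distance to the goal (the example from the text).
const phi = d => -d;

// Shaping term for one transition, given the distances before and after.
function shapingF(dPrev, dNext) {
  return GAMMA * phi(dNext) - phi(dPrev);
}

// Discounted sum of F along a trajectory of distances d_0 ... d_T.
function shapingReturn(dists) {
  let total = 0;
  for (let t = 0; t + 1 < dists.length; t++) {
    total += Math.pow(GAMMA, t) * shapingF(dists[t], dists[t + 1]);
  }
  return total;
}

// Two different routes with the same start, end, and length.
const direct = [1.0, 0.7, 0.4, 0.1];
const detour = [1.0, 1.5, 0.9, 0.1];
const T = direct.length - 1;
const closedForm = Math.pow(GAMMA, T) * phi(0.1) - phi(1.0);

console.log(shapingReturn(direct), shapingReturn(detour), closedForm);
// All three agree up to floating-point rounding.
```

Because the shaping contribution is the same for every route between two states, it cannot make a detour look better than the direct path; only the underlying task reward can separate them.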

Interactive · reach the cup, and pick your reward

A two-link arm (anchored at the left) is driven by a small continuous policy: it reads the angle-to-target error and emits a torque on each joint, exactly the μ_φ(s) idea from lesson 18. The policy is trained live by a tiny hill-climbing update on whatever reward you select. Watch how fast — or whether — it learns to touch the green target.

Two-link reacher · sparse vs shaped vs mis-shaped reward

Pick a reward mode and press Train. Sparse (1 only at the goal) barely improves. Shaped (−distance) reaches fast. Mis-shaped (reward approach speed) is the bug-is-the-lesson: the arm learns to slam toward the target and overshoot. Toggle the sim gap to perturb the dynamics at test time — a policy tuned in clean "sim" degrades on the perturbed "real" arm (lesson 57).

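[Interactive widget. Controls: reward mode (sparse, shaped, mis-shaped) and a sim-gap toggle that perturbs the dynamics. Readouts: episodes trained, reach rate over the last 30 episodes, final distance to goal, max tip speed.]

The core JS (≈25 lines):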

// state s -> torque a = W · features(s)   (a tiny linear continuous policy, μ_φ from lesson 18)
// W, bestR, sigma, GOAL_TOL and rollout() are defined elsewhere in the widget.
function torques(W, s){                 // s = {e1, e2}: joint-angle errors to the target
  return [ W[0]*s.e1 + W[1]*s.e2, W[2]*s.e1 + W[3]*s.e2 ]; // a ∈ ℝ²
}
function reward(mode, dist, prevDist, speed){              // speed is not used in this excerpt
  if (mode==='sparse') return dist < GOAL_TOL ? 1 : 0;     // 1 only at the goal
  if (mode==='shaped') return -dist;                       // dense: −distance
  return (prevDist - dist) * 6;        // mis-shaped: reward APPROACH SPEED -> overshoot
}
// hill-climb the policy: perturb W, keep the change if the episode's return went up
function trainEpisode(){
  const cand = W.map(w => w + (Math.random()*2-1)*sigma);  // uniform noise in [-sigma, sigma)
  const Rnew = rollout(cand);          // sum of reward(...) over the episode
  if (Rnew > bestR){ W = cand; bestR = Rnew; }  // accept only if the return improved
}
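The listing above relies on the widget's own `rollout`, `W`, `bestR`, and `sigma`. For a version that runs on its own, here is a minimal sketch under loud assumptions: a hypothetical 1-D point mass stands in for the two-link arm, the policy has two weights, the reward is the shaped one, and a small seeded generator replaces `Math.random` so runs are repeatable. None of the constants come from the widget.

```javascript
// Tiny seeded LCG so runs are repeatable (returns values in [0, 1)).
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

const GOAL = 1.0, DT = 0.05, STEPS = 100; // illustrative constants (assumptions)

// Hypothetical 1-D "arm": a unit point mass pushed by a = W[0]*error + W[1]*velocity.
function rollout(W) {
  let x = 0, v = 0, ret = 0;
  for (let t = 0; t < STEPS; t++) {
    const a = W[0] * (GOAL - x) + W[1] * v;
    v += a * DT;
    x += v * DT;
    ret += -Math.abs(GOAL - x);          // shaped reward: -distance
  }
  return ret;
}

// Hill climbing: perturb W, keep the candidate only if the return improved.
function hillClimb(W0, episodes, sigma, rng) {
  let W = W0.slice();
  let bestR = rollout(W);
  for (let ep = 0; ep < episodes; ep++) {
    const cand = W.map(w => w + (rng() * 2 - 1) * sigma);
    const Rnew = rollout(cand);
    if (Rnew > bestR) { W = cand; bestR = Rnew; }
  }
  return { W, bestR };
}

const start = [0, 0];
const result = hillClimb(start, 300, 0.5, makeRng(42));
console.log(rollout(start), result.bestR); // bestR is never below the starting return
```

Because a candidate is accepted only when its return is higher, `bestR` can only go up; swapping the reward line inside `rollout` for a sparse or approach-speed reward is a quick way to reproduce the widget's three modes on this toy system.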

Three things to try. (1) Reset, choose sparse, train — the reach rate crawls; the arm almost never stumbles onto the reward. (2) Reset, choose shaped, train — reach rate climbs fast and final distance drops toward zero. (3) Reset, choose mis-shaped, train — watch max tip speed shoot up while the arm blows past the target: it optimized the proxy you wrote, not the goal you meant.

The first crack: the simulation gap

Everything above quietly assumed we can take millions of steps. On a real arm you cannot — each step is slow, and a flailing policy breaks hardware. So almost all robot RL trains in a physics simulator, where steps are cheap and crashes are free, then ships the learned policy to the real arm.

The problem is that the simulator is wrong in a thousand small ways: friction, motor latency, link masses, sensor noise, contact dynamics. A policy that exploited the sim's exact numbers — and lesson 18's off-policy optimizers are very good at exploiting exact numbers — can fall apart on the real arm. Flip the widget's sim gap toggle: the same trained policy now runs on perturbed dynamics and its reach rate drops. That degradation is the sim-to-real gap, and closing it (domain randomization, model-based RL, safe exploration) is the entire subject of lesson 57.
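As a preview of one remedy named above, domain randomization, here is a minimal sketch: instead of training against one fixed set of simulator constants, each episode draws its dynamics parameters from a range around the nominal values. The parameter names, the nominal values, and the ±20% spread are illustrative assumptions, not the widget's perturbation and not lesson 57's recipe.

```javascript
// Nominal simulator parameters (hypothetical names and values).
const NOMINAL = { linkMass: 1.0, jointFriction: 0.1, motorGain: 1.0 };

// Tiny seeded LCG so runs are repeatable (returns values in [0, 1)).
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Draw one randomized set of dynamics: each parameter is scaled by a factor
// sampled uniformly from [1 - spread, 1 + spread).
function sampleDynamics(nominal, spread, rng) {
  const out = {};
  for (const [name, value] of Object.entries(nominal)) {
    const scale = 1 + (rng() * 2 - 1) * spread;
    out[name] = value * scale;
  }
  return out;
}

// One parameter set per training episode.
const rng = makeRng(7);
const perEpisode = Array.from({ length: 5 }, () => sampleDynamics(NOMINAL, 0.2, rng));
console.log(perEpisode);
```

A policy trained across such draws cannot lean on any single value of mass or friction, which is the intuition for why it tends to hold up better when the real arm's numbers differ from the simulator's.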

Map back to the spine

Place this lesson on the value / policy / model map from orientation:

It is the policy branch reunited with value — actor–critic. The arm is driven by an actor μ_φ(s) emitting a continuous torque, trained by a critic, exactly the continuous control machinery of lesson 18 (DDPG → TD3 → SAC). Manipulation did not need a new algorithm; it needed the one we already had.
The new content is reward, not method. Sparse vs shaped reward is the signal question from orientation, made physical: the agent only ever optimizes the scalar you hand it, so a mis-specified shape produces a confidently wrong policy — reward hacking, the same disease as RLHF (lesson 16) without a KL leash.
It opens the next crack. Continuous control assumed cheap interaction; real robots make interaction expensive and dangerous, so we train in sim and pay the sim-to-real gap — the bridge into lesson 57.

Takeaway

A reaching arm is lesson 18 embodied: state = joint angles/velocities + target, action = a continuous torque vector (so argmax_a is out and SAC/TD3 are in). The algorithm is settled; the danger has moved to the reward — a sparse reward barely trains, a shaped reward (−distance) trains fast, and a mis-shaped reward trains the wrong behavior with full confidence. And because real interaction is costly, we train in simulation and inherit the sim-to-real gap that lesson 57 exists to close.