all lessons / reinforcement learning / 75 · Robotic grasping lesson 75 / 87

Robotic grasping

A gripper has to find an object, decide how to approach it, close on it, and not crush or drop it — all from noisy multimodal sensors, with an action that lives on a curved manifold, under hard safety limits, and a policy trained in simulation that must work on a real arm it has never touched. The binding difficulties here are multimodal partial observability (vision can be occluded, tactile is low-resolution), an action space that is not flat (6-DoF pose lives on SE(3), gripper force and position must share a critic), a reward that conflates "closed" with "grasped well", and a sim-to-real gap that turns a 99% sim policy into a 78% real one. Each names a tool.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. For grasping, four bind in turn — perception, action geometry, reward semantics, and the reality gap.

1 · Formulate — the MDP behind a grasp

Intuition. A grasping robot looks (camera, depth, sometimes tactile and an IMU), moves its hand to a pose, closes the fingers, and earns a reward if the object is lifted and held stably. That is a Markov Decision Process: the sensor fusion is the state, the end-effector pose plus gripper command is the action, "picked it up and it didn't slip" is the reward, and the contact physics is the transition. Every difficulty below is one of these four pieces being awkward for a real arm on a real line.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]
PieceFor a 6-DoF grasping armThe awkward part
State SRGB-D / point cloud + language instruction + IMU + tactile pressure, fused to a latentmodalities are heterogeneous and can drop out (occluded camera, 8×8 tactile) → multimodal POMDP
Action A6-DoF end-effector pose Δ(R,t) ∈ SE(3) × gripper open/close × grip forcepose lives on a curved manifold, not ℝ⁶; force-control and position-control are different physical channels
Reward Rgrasp success (lifted) + stability (no slip, low vibration)"fingers closed" ≠ "grasped well"; the obvious reward is hackable and sparse
Transition Pcontact dynamics — friction, mass, backlash, deformationtrained in sim, deployed on a real arm → sim-to-real gap

Same pattern as every domain: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.

2 · The state is a multimodal POMDP — align before you fuse

Intuition. Vision, language, IMU and touch each tell part of the story, in different units, at different rates, and any one can vanish (the camera gets blocked, the tactile pad is coarse). If you just concatenate the raw feature vectors and hand them to the policy, the network spends its capacity learning the correspondence between modalities instead of the task, and it collapses the moment one stream is missing. The fix is to first learn a shared latent space where "the same situation seen by different senses" lands in the same place — then fuse, then act.

Engineering detail — two-stage align-then-finetune. Stage one is self-supervised and offline: collect tens of millions of unlabeled multimodal sequences and train a shared vision–language–audio encoder with MoCo-style momentum contrastive learning — latent dimension 256, temperature τ=0.1, a negative queue of 65,536, driving the InfoNCE loss below ≈0.35 before alignment is called done.

InfoNCE = −log  exp(q·k₊/τ) / Σi exp(q·ki/τ)

Stage two attaches the pretrained encoder to a PPO policy and does a small end-to-end finetune. Two regularizers stop catastrophic forgetting of the alignment: a δ-constraint keeping the new encoder weights within ‖θ − θ₀‖₂ ≤ 0.02 of the pretrained weights, and modal-dropout — randomly masking ~30% of modalities each batch so the policy never depends on any single stream. At inference, the high-rate IMU is compressed by a 1-D CNN to 16 dims, concatenated with the 256-dim visual vector, and pushed through a GRU belief network into a unified 512-dim latent shared by actor and critic. In practice this two-stage scheme improves sample efficiency ≈4.7× over concatenating raw features, cutting downstream convergence from ~48 h to ~10 h.

Why a belief state, not just a fusion
Because the camera can be occluded, the environment is only partially observable: the current frame is not enough. Treating the alignment-network output as a belief state fed to a recurrent aggregator (GRU/LSTM/Transformer over history) lets the policy carry information across frames and ride out a dropout. This is the standard POMDP move — replace the missing observation with an estimate of the hidden state rather than reacting to noise frame-by-frame.

3 · The action is not flat — keep SE(3) on its manifold

Intuition. A 6-DoF pose is a rotation and a translation. Rotations don't add like numbers — "rotate 170° then 20° more" is not "rotate 190° in a straight line through the number line." If the policy outputs six real numbers and you pretend they're a pose, you either drift off the set of valid rotations or you clip and lose gradient. The reparameterization trick (sample noise, scale, shift) is what makes a stochastic policy differentiable — but done naïvely it violates the manifold. The fix is to sample in the flat tangent space (the Lie algebra) and map back onto the curved manifold with an exact exponential map.

Engineering detail — reparameterize in se(3). The policy outputs a 6-vector mean μξ and diagonal σξ. Sample ε ~ 𝒩(0, I₆), form ξ = μξ + σξ ⊙ ε in the tangent space, then push it onto the manifold with the exponential map. The left-Jacobian Jl carries gradients correctly back to μξ, σξ:

T = Exp(ξ) ∈ SE(3),  ξ ∈ se(3);  ∂ℒ/∂μ = Jl−T ∂ℒ/∂T

Variance is limited by a tangent-space norm penalty rather than an SVD clip on the manifold — so you keep the low-variance benefit of reparameterization with zero constraint violation and a path that stays differentiable and pose-legal throughout.

Two manifold traps
(1) Ill-conditioned Jacobian → exploding gradients. When the rotation is large (‖ω‖ > π/2), Jl−T becomes ill-conditioned. Clamp its singular values to a maximum condition number ~1e4 instead of letting "big rotation → singular Jl → gradient blow-up" run unchecked. (2) Quaternion double-cover jitter. Real controllers (ROS) take quaternion+translation, and q and −q are the same rotation. Add an "Exp → quaternion" projection layer and lock the quaternion sign (e.g. a post-projection hook) — otherwise the policy emits ±q flips and the control command chatters.

Sharing one critic across force and position. Gripper force control and end-effector position control are different physical channels but one task. Embed both action types into a shared latent and add the embeddings element-wise to a 256-dim unified action representation aemb; encode the state through a shared 512-dim MLP to a 256-dim semb; combine by element-wise subtraction (a state–action duality) and feed a 2×256→1 Q-head. Train with a composite reward r = 0.6·rpos + 0.3·rforce + 0.1·rslip, and exploration noise matched to each channel — Ornstein–Uhlenbeck for position, truncated Gaussian for force, both decaying with policy entropy. To keep the two channels converging together, track per-channel gradient norms and drop the learning rate on whichever channel's gradient norm exceeds threshold. A shared critic this way improves sample efficiency ≈35%, lifts grasp success from ≈89% to ≈96%, and cuts force overshoot ≈42%.

4 · Reward — "closed" is not "grasped well"

Intuition. The cheap reward is "did the fingers close on something?" But a policy optimizing that learns to slap the object shut — whipping the wrist to scoop the part in (a "tail-sweep" grab) — which closes the gripper but ruins alignment and damages parts. You have to reward alignment and stability, not just closure, and you have to keep the sparse success signal from being drowned out by dense shaping.

Engineering detail — stability shaping + curriculum. Add a vibration/stability term to the reward, but clean it first: a one-pole low-pass plus a 3σ outlier reject on the vibration channel, and feed the last 4 vibration steps through a 1-D CNN to extract a time-domain feature (≈8 dB SNR gain). Train with SAC and automatic temperature α so the stability reward is auto-scaled to the same magnitude as the task reward — otherwise gradients are dominated by the sparse success event. For the sparse 0/1 grasp-success signal, use a curriculum: split the task into difficulty tiers, advance only when a tier passes a "double-80" gate (≥80% on the tier metric for ≥80 consecutive episodes), and keep the network weights and replay buffer across tiers — only the environment config changes — to avoid catastrophic forgetting. Halve the learning rate on each promotion to damp oscillation; if validation success drops >10% or collision rate rises >20%, roll back the tier and cut the learning rate further.

r = rtask − 10·(1 − s) − 5·log(δ − dbound)

where s is a manipulability/singular-value margin and dbound is distance to a joint limit — softly steering the arm away from singular, near-limit poses. To stop "closed-but-misaligned" hacking specifically, run a curriculum regularizer KL(πθ ‖ πθ,prev-tier) with a weight that decays per tier so each new policy can't drift far from the last — this cut tier-rollback rate from ≈35% to ≈8% in practice.

1 · FORMULATE S, A, R, P 6-DoF + gripper 2 · DIAGNOSE multimodal POMDP, SE(3), sim-gap 3 · ENGINEER align, Lie reparam, domain rand. 4 · GUARD guardian net, safe-action cache 5 · ITERATE re-diagnose removing one difficulty exposes the next — re-run the loop

5 · The sim-to-real gap — randomize the physics, then bound the action

Intuition. A policy that lifts every object in simulation can fail on the real arm because the real arm has a different mass distribution, friction, gear backlash, and lighting. The fix has two halves. First, domain randomization: instead of training on one physics, train across a distribution of physics so the policy is forced to be robust — but randomize too widely and the policy becomes a mush that's mediocre everywhere. Second, keep the action legal on the real machine: the policy can still command a pose that's out of reach or violates a joint limit, so you bound the action before it ever moves a motor.

Engineering detail — search the randomization range, don't guess it. If real arms are available, use Safe-BO: treat the randomization vector θ as the Bayesian-optimization variable and an acquisition function that trades policy success against real-machine drop risk, validating a handful of real grasps per iteration. If real hardware is scarce, do sim-to-sim distributional robustification: build a Wasserstein ball around the mass distribution and use an evolution strategy (CMA-ES) to maximize worst-case return inside radius ε until the in-ball return variance is <2%. Output a Pareto front — randomization range on one axis, real-world success on the other — and let the line decide. Typical result: real grasp success rising from ≈78% to ≈93%, with tuning time collapsed from person-weeks to hours.

Engineering detail — differentiable projection + safe-action cache. Three layers keep the action legal. (1) The policy emits a normalized joint increment Δθ̂ ∈ [−1,1]ⁿ, scaled by the true joint-limit matrix. (2) Candidate joints go through a differentiable IK layer (Jacobian-transpose + damped least squares, ~5 iterations); if the resulting pose is inside the reachable workspace and the smallest singular value > 0.08 it passes, otherwise the error is back-propagated so the network learns to avoid that region next time. (3) At inference, a safe-action cache: if interpolation finds the next cycle would exceed a limit, zero-order-hold the previous action and signal the PLC to slow down — bounding the cycle-time loss to <200 ms. In practice this drives out-of-bounds rate from ≈12% to ≈0.3% with only ≈1.7% cycle-time cost.

Don't lean on color — texture randomization can backfire
If you randomize texture but the policy quietly learns to find objects by color, it breaks the day the bin lighting changes. Before deployment, run a color-corruption stress test: take ~1000 real trajectories, systematically sample 64 HSV color perturbations, and if cumulative return drops >5% or policy entropy drops >15%, auto-rollback and inject color-augmented samples into the replay buffer for a hot update. Done right, color-related failure rate falls from ≈3.2% to ≈0.14% for ≈11% more training steps.

6 · Sample efficiency — learn offline before you risk the arm

Intuition. Every real grasp costs time and risks damage, so you want to wring as much policy as possible out of logged data before going online. But a policy trained purely on logged data over-estimates the value of actions it never saw — and on a robot, an over-optimistic out-of-distribution action is a crash. Conservative Q-learning (CQL) pessimistically lowers the value of unseen actions so the offline policy stays inside what the data supports.

Engineering detail — CQL then a warm-started finetune. Pretrain a double-Q network (256×2, Mish, z-scored inputs) with the conservative penalty, approximating the log-sum-exp over the continuous action space by sampling ~10 actions from the current policy per minibatch. Start the conservatism weight α at 1.0 and raise it +0.5 whenever the validation OOD-Q over-estimation rate exceeds 5%, lower it otherwise; cosine-anneal LR 3e-4 → 1e-5; early-stop after 5 stalled evals. To hand off to online: warm-start with a behavior-cloning regularizer (weight 0.1) for the first ~10k steps to prevent policy collapse, then linearly decay α to 0.2 to let the policy explore regions humans never covered. The online/offline mix itself can be scheduled — define αt = Nonline / (Nonline + Noffline) and move it by projected gradient ascent on a risk-penalized objective, with a circuit-breaker that forces αt=0 (pure offline) if online drawdown or OOD-action rate spikes.

Keep augmentation semantically faithful — three layers
Data augmentation multiplies a small grasp dataset, but a careless transform can flip the grasp's meaning. (1) Equivariant geometry. Apply only SE(3) rigid transforms to the point cloud and transform the action vector in lockstep, protecting the real contact surface with a contact-point mask so the friction cone is preserved. (2) Reward consistency. For each augmented trajectory, run a fast ~0.2 s dynamics rollback; if the success criterion changes (the object's centroid no longer enters the gripper's closed region), discard or re-label. (3) Distribution-drift monitoring. Train an adversarial discriminator; if an augmented sample is flagged "synthetic" with >90% confidence it has left the real distribution — evict it from the buffer. These three layers let one team expand data ~8× and lift real grasp success from ≈82% to ≈91% with zero on-line incidents.

7 · Guard in production — detect, fall back, hot-update

Intuition. A grasping cell runs for hours unattended; the failure you must engineer for is not "the model is slightly worse" but "the model is silently wrong while the line keeps cycling." Guarding means a cheap online detector, a conservative fallback the instant it fires, and a path back to safe operation.

Engineering detail. A lightweight Guardian network watches the real grasp-success rate online; the moment it dips it switches the controller to a conservative scripted policy rather than the learned one. The safe-action cache (section 5) is the per-cycle backstop — zero-order-hold plus a PLC slow-down whenever the next action would breach a limit. When a modality drops out (the camera is occluded), a modality-compensation path maps language + IMU to a pseudo-visual latent through a cross-modal Transformer and keeps the policy running — gated by a confidence threshold, below which it falls back to the conservative strategy. All of this is auditable: plot the vibration-vs-success Pareto front so a process engineer can sign off the operating point, and snapshot every model version for traceability.

The through-line
Every section is one row of the MDP table turned into a mechanism: multimodal POMDP → contrastive alignment + belief state + modal-dropout; SE(3) action → Lie-algebra reparameterization + shared force/position critic; "closed ≠ grasped" → stability shaping + curriculum + KL regularizer; sim-to-real gap → searched domain randomization + differentiable IK + safe-action cache, with offline CQL to spend real grasps sparingly. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

8 · The central tension — how wide to randomize?

Intuition. Domain randomization is the dial that decides whether your sim policy survives contact with reality. Too narrow and the policy overfits the simulator: great in sim, brittle on the real arm. Too wide and the policy is asked to be robust to physics it will never see, so it learns a cautious, mediocre grasp that works everywhere and excels nowhere. The real-world success rate is a hump: it climbs as you cover the true reality gap, then falls as over-randomization dilutes the policy. The widget below is the napkin math — find the top of the hump for a given reality gap.

preal ≈ pmax · [1 − e−(w / g)] · e−κ·w

where w is the randomization width, g the true reality gap you must cover, and κ the dilution cost of over-randomizing. The first factor rewards covering the gap; the second penalizes spreading capacity too thin.

Domain-randomization width → real-world grasp success

Widen the randomization until it just covers the reality gap, then stop — past the peak you are paying capacity to be robust to physics that never occurs. Search this front with Safe-BO or CMA-ES instead of guessing.

gap coverage
dilution factor
real success
verdict

Further considerations