Robotic grasping

A gripper has to find an object, decide how to approach it, close on it, and not crush or drop it — all from noisy multimodal sensors, with an action that lives on a curved manifold, under hard safety limits, and a policy trained in simulation that must work on a real arm it has never touched. The binding difficulties here are multimodal partial observability (vision can be occluded, tactile is low-resolution), an action space that is not flat (6-DoF pose lives on SE(3), gripper force and position must share a critic), a reward that conflates "closed" with "grasped well", and a sim-to-real gap that turns a 99% sim policy into a 78% real one. Each names a tool.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. For grasping, four bind in turn — perception, action geometry, reward semantics, and the reality gap.

1 · Formulate — the MDP behind a grasp

Intuition. A grasping robot looks (camera, depth, sometimes tactile and an IMU), moves its hand to a pose, closes the fingers, and earns a reward if the object is lifted and held stably. That is a Markov Decision Process: the sensor fusion is the state, the end-effector pose plus gripper command is the action, "picked it up and it didn't slip" is the reward, and the contact physics is the transition. Every difficulty below is one of these four pieces being awkward for a real arm on a real line.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ]

Piece	For a 6-DoF grasping arm	The awkward part
State S	RGB-D / point cloud + language instruction + IMU + tactile pressure, fused to a latent	modalities are heterogeneous and can drop out (occluded camera, 8×8 tactile) → multimodal POMDP
Action A	6-DoF end-effector pose Δ(R,t) ∈ SE(3) × gripper open/close × grip force	pose lives on a curved manifold, not ℝ⁶; force-control and position-control are different physical channels
Reward R	grasp success (lifted) + stability (no slip, low vibration)	"fingers closed" ≠ "grasped well"; the obvious reward is hackable and sparse
Transition P	contact dynamics — friction, mass, backlash, deformation	trained in sim, deployed on a real arm → sim-to-real gap

Same pattern as every domain: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.

2 · The state is a multimodal POMDP — align before you fuse

Intuition. Vision, language, IMU and touch each tell part of the story, in different units, at different rates, and any one can vanish (the camera gets blocked, the tactile pad is coarse). If you just concatenate the raw feature vectors and hand them to the policy, the network spends its capacity learning the correspondence between modalities instead of the task, and it collapses the moment one stream is missing. The fix is to first learn a shared latent space where "the same situation seen by different senses" lands in the same place — then fuse, then act.

Engineering detail — two-stage align-then-finetune. Stage one is self-supervised and offline: collect tens of millions of unlabeled multimodal sequences and train a shared vision–language–audio encoder with MoCo-style momentum contrastive learning — latent dimension 256, temperature τ=0.1, a negative queue of 65,536, driving the InfoNCE loss below ≈0.35 before alignment is called done.

ℒ_InfoNCE = −log exp(q·k₊/τ) / Σ_i exp(q·k_i/τ)

Stage two attaches the pretrained encoder to a PPO policy and does a small end-to-end finetune. Two regularizers stop catastrophic forgetting of the alignment: a δ-constraint keeping the new encoder weights within ‖θ − θ₀‖₂ ≤ 0.02 of the pretrained weights, and modal-dropout — randomly masking ~30% of modalities each batch so the policy never depends on any single stream. At inference, the high-rate IMU is compressed by a 1-D CNN to 16 dims, concatenated with the 256-dim visual vector, and pushed through a GRU belief network into a unified 512-dim latent shared by actor and critic. In practice this two-stage scheme improves sample efficiency ≈4.7× over concatenating raw features, cutting downstream convergence from ~48 h to ~10 h.

Why a belief state, not just a fusion

Because the camera can be occluded, the environment is only partially observable: the current frame is not enough. Treating the alignment-network output as a belief state fed to a recurrent aggregator (GRU/LSTM/Transformer over history) lets the policy carry information across frames and ride out a dropout. This is the standard POMDP move — replace the missing observation with an estimate of the hidden state rather than reacting to noise frame-by-frame.

3 · The action is not flat — keep SE(3) on its manifold

Intuition. A 6-DoF pose is a rotation and a translation. Rotations don't add like numbers — "rotate 170° then 20° more" is not "rotate 190° in a straight line through the number line." If the policy outputs six real numbers and you pretend they're a pose, you either drift off the set of valid rotations or you clip and lose gradient. The reparameterization trick (sample noise, scale, shift) is what makes a stochastic policy differentiable — but done naïvely it violates the manifold. The fix is to sample in the flat tangent space (the Lie algebra) and map back onto the curved manifold with an exact exponential map.

Engineering detail — reparameterize in se(3). The policy outputs a 6-vector mean μ_ξ and diagonal σ_ξ. Sample ε ~ 𝒩(0, I₆), form ξ = μ_ξ + σ_ξ ⊙ ε in the tangent space, then push it onto the manifold with the exponential map. The left-Jacobian J_l carries gradients correctly back to μ_ξ, σ_ξ:

T = Exp(ξ) ∈ SE(3), ξ ∈ se(3); ∂ℒ/∂μ = J_l^−T ∂ℒ/∂T

Variance is limited by a tangent-space norm penalty rather than an SVD clip on the manifold — so you keep the low-variance benefit of reparameterization with zero constraint violation and a path that stays differentiable and pose-legal throughout.

Two manifold traps

(1) Ill-conditioned Jacobian → exploding gradients. When the rotation is large (‖ω‖ > π/2), J_l^−T becomes ill-conditioned. Clamp its singular values to a maximum condition number ~1e4 instead of letting "big rotation → singular J_l → gradient blow-up" run unchecked. (2) Quaternion double-cover jitter. Real controllers (ROS) take quaternion+translation, and q and −q are the same rotation. Add an "Exp → quaternion" projection layer and lock the quaternion sign (e.g. a post-projection hook) — otherwise the policy emits ±q flips and the control command chatters.

Sharing one critic across force and position. Gripper force control and end-effector position control are different physical channels but one task. Embed both action types into a shared latent and add the embeddings element-wise to a 256-dim unified action representation a_emb; encode the state through a shared 512-dim MLP to a 256-dim s_emb; combine by element-wise subtraction (a state–action duality) and feed a 2×256→1 Q-head. Train with a composite reward r = 0.6·r_pos + 0.3·r_force + 0.1·r_slip, and exploration noise matched to each channel — Ornstein–Uhlenbeck for position, truncated Gaussian for force, both decaying with policy entropy. To keep the two channels converging together, track per-channel gradient norms and drop the learning rate on whichever channel's gradient norm exceeds threshold. A shared critic this way improves sample efficiency ≈35%, lifts grasp success from ≈89% to ≈96%, and cuts force overshoot ≈42%.

4 · Reward — "closed" is not "grasped well"

Intuition. The cheap reward is "did the fingers close on something?" But a policy optimizing that learns to slap the object shut — whipping the wrist to scoop the part in (a "tail-sweep" grab) — which closes the gripper but ruins alignment and damages parts. You have to reward alignment and stability, not just closure, and you have to keep the sparse success signal from being drowned out by dense shaping.

Engineering detail — stability shaping + curriculum. Add a vibration/stability term to the reward, but clean it first: a one-pole low-pass plus a 3σ outlier reject on the vibration channel, and feed the last 4 vibration steps through a 1-D CNN to extract a time-domain feature (≈8 dB SNR gain). Train with SAC and automatic temperature α so the stability reward is auto-scaled to the same magnitude as the task reward — otherwise gradients are dominated by the sparse success event. For the sparse 0/1 grasp-success signal, use a curriculum: split the task into difficulty tiers, advance only when a tier passes a "double-80" gate (≥80% on the tier metric for ≥80 consecutive episodes), and keep the network weights and replay buffer across tiers — only the environment config changes — to avoid catastrophic forgetting. Halve the learning rate on each promotion to damp oscillation; if validation success drops >10% or collision rate rises >20%, roll back the tier and cut the learning rate further.

r = r_task − 10·(1 − s) − 5·log(δ − d_bound)

where s is a manipulability/singular-value margin and d_bound is distance to a joint limit — softly steering the arm away from singular, near-limit poses. To stop "closed-but-misaligned" hacking specifically, run a curriculum regularizer KL(π_θ ‖ π_θ,prev-tier) with a weight that decays per tier so each new policy can't drift far from the last — this cut tier-rollback rate from ≈35% to ≈8% in practice.

5 · The sim-to-real gap — randomize the physics, then bound the action

Intuition. A policy that lifts every object in simulation can fail on the real arm because the real arm has a different mass distribution, friction, gear backlash, and lighting. The fix has two halves. First, domain randomization: instead of training on one physics, train across a distribution of physics so the policy is forced to be robust — but randomize too widely and the policy becomes a mush that's mediocre everywhere. Second, keep the action legal on the real machine: the policy can still command a pose that's out of reach or violates a joint limit, so you bound the action before it ever moves a motor.

Engineering detail — search the randomization range, don't guess it. If real arms are available, use Safe-BO: treat the randomization vector θ as the Bayesian-optimization variable and an acquisition function that trades policy success against real-machine drop risk, validating a handful of real grasps per iteration. If real hardware is scarce, do sim-to-sim distributional robustification: build a Wasserstein ball around the mass distribution and use an evolution strategy (CMA-ES) to maximize worst-case return inside radius ε until the in-ball return variance is <2%. Output a Pareto front — randomization range on one axis, real-world success on the other — and let the line decide. Typical result: real grasp success rising from ≈78% to ≈93%, with tuning time collapsed from person-weeks to hours.

Engineering detail — differentiable projection + safe-action cache. Three layers keep the action legal. (1) The policy emits a normalized joint increment Δθ̂ ∈ [−1,1]ⁿ, scaled by the true joint-limit matrix. (2) Candidate joints go through a differentiable IK layer (Jacobian-transpose + damped least squares, ~5 iterations); if the resulting pose is inside the reachable workspace and the smallest singular value > 0.08 it passes, otherwise the error is back-propagated so the network learns to avoid that region next time. (3) At inference, a safe-action cache: if interpolation finds the next cycle would exceed a limit, zero-order-hold the previous action and signal the PLC to slow down — bounding the cycle-time loss to <200 ms. In practice this drives out-of-bounds rate from ≈12% to ≈0.3% with only ≈1.7% cycle-time cost.

Don't lean on color — texture randomization can backfire

If you randomize texture but the policy quietly learns to find objects by color, it breaks the day the bin lighting changes. Before deployment, run a color-corruption stress test: take ~1000 real trajectories, systematically sample 64 HSV color perturbations, and if cumulative return drops >5% or policy entropy drops >15%, auto-rollback and inject color-augmented samples into the replay buffer for a hot update. Done right, color-related failure rate falls from ≈3.2% to ≈0.14% for ≈11% more training steps.

6 · Sample efficiency — learn offline before you risk the arm

Intuition. Every real grasp costs time and risks damage, so you want to wring as much policy as possible out of logged data before going online. But a policy trained purely on logged data over-estimates the value of actions it never saw — and on a robot, an over-optimistic out-of-distribution action is a crash. Conservative Q-learning (CQL) pessimistically lowers the value of unseen actions so the offline policy stays inside what the data supports.

Engineering detail — CQL then a warm-started finetune. Pretrain a double-Q network (256×2, Mish, z-scored inputs) with the conservative penalty, approximating the log-sum-exp over the continuous action space by sampling ~10 actions from the current policy per minibatch. Start the conservatism weight α at 1.0 and raise it +0.5 whenever the validation OOD-Q over-estimation rate exceeds 5%, lower it otherwise; cosine-anneal LR 3e-4 → 1e-5; early-stop after 5 stalled evals. To hand off to online: warm-start with a behavior-cloning regularizer (weight 0.1) for the first ~10k steps to prevent policy collapse, then linearly decay α to 0.2 to let the policy explore regions humans never covered. The online/offline mix itself can be scheduled — define α_t = N_online / (N_online + N_offline) and move it by projected gradient ascent on a risk-penalized objective, with a circuit-breaker that forces α_t=0 (pure offline) if online drawdown or OOD-action rate spikes.

Keep augmentation semantically faithful — three layers

Data augmentation multiplies a small grasp dataset, but a careless transform can flip the grasp's meaning. (1) Equivariant geometry. Apply only SE(3) rigid transforms to the point cloud and transform the action vector in lockstep, protecting the real contact surface with a contact-point mask so the friction cone is preserved. (2) Reward consistency. For each augmented trajectory, run a fast ~0.2 s dynamics rollback; if the success criterion changes (the object's centroid no longer enters the gripper's closed region), discard or re-label. (3) Distribution-drift monitoring. Train an adversarial discriminator; if an augmented sample is flagged "synthetic" with >90% confidence it has left the real distribution — evict it from the buffer. These three layers let one team expand data ~8× and lift real grasp success from ≈82% to ≈91% with zero on-line incidents.

7 · Guard in production — detect, fall back, hot-update

Intuition. A grasping cell runs for hours unattended; the failure you must engineer for is not "the model is slightly worse" but "the model is silently wrong while the line keeps cycling." Guarding means a cheap online detector, a conservative fallback the instant it fires, and a path back to safe operation.

Engineering detail. A lightweight Guardian network watches the real grasp-success rate online; the moment it dips it switches the controller to a conservative scripted policy rather than the learned one. The safe-action cache (section 5) is the per-cycle backstop — zero-order-hold plus a PLC slow-down whenever the next action would breach a limit. When a modality drops out (the camera is occluded), a modality-compensation path maps language + IMU to a pseudo-visual latent through a cross-modal Transformer and keeps the policy running — gated by a confidence threshold, below which it falls back to the conservative strategy. All of this is auditable: plot the vibration-vs-success Pareto front so a process engineer can sign off the operating point, and snapshot every model version for traceability.

The through-line

Every section is one row of the MDP table turned into a mechanism: multimodal POMDP → contrastive alignment + belief state + modal-dropout; SE(3) action → Lie-algebra reparameterization + shared force/position critic; "closed ≠ grasped" → stability shaping + curriculum + KL regularizer; sim-to-real gap → searched domain randomization + differentiable IK + safe-action cache, with offline CQL to spend real grasps sparingly. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

8 · The central tension — how wide to randomize?

Intuition. Domain randomization is the dial that decides whether your sim policy survives contact with reality. Too narrow and the policy overfits the simulator: great in sim, brittle on the real arm. Too wide and the policy is asked to be robust to physics it will never see, so it learns a cautious, mediocre grasp that works everywhere and excels nowhere. The real-world success rate is a hump: it climbs as you cover the true reality gap, then falls as over-randomization dilutes the policy. The widget below is the napkin math — find the top of the hump for a given reality gap.

p_real ≈ p_max · [1 − e^{−(w / g)}] · e^−κ·w

where w is the randomization width, g the true reality gap you must cover, and κ the dilution cost of over-randomizing. The first factor rewards covering the gap; the second penalizes spreading capacity too thin.

Further considerations

Tactile geometry completion. When the point cloud is missing (occlusion, a transparent or specular surface), low-resolution touch can fill it in: make the action an active sliding trajectory and use information gain as an intrinsic reward, letting the agent "trace the edge" exactly as a human finger reconstructs a shape it cannot see. A diffusion prior can sit underneath as a hierarchical policy — high-level RL decides where to touch, the diffusion model decides what to fill in.
Multi-arm shared workspace. When two RL arms share ~50% of their workspace, each policy can oscillate against the other in a game-theoretic loop. Encode the dynamic collision body into a central critic so the value function sees both arms' intents, rather than letting two independent policies fight.
Deformable objects break equivariance. SE(3)-equivariant augmentation assumes rigidity; for cloth or cables it fails. Switch to material-aware differentiable augmentation — perturb Young's modulus / Poisson ratio in the simulator to generate semantically equivalent but dynamically different trajectories — and use meta-learning (MAML) so the policy adapts fast on contact.
Quantify backlash before buying hardware. Measure the value decay ΔV and action-jitter entropy H(a) between compensated (≤0.05°) and uncompensated (0.3°) backlash via a difference-in-differences design. Translate ΔV into scrap cost per part so the choice between a better harmonic drive and an algorithmic compensator becomes an ROI decision, not a taste.
Persistent alignment under drift. In a non-stationary environment (day → night), the multimodal latent distribution drifts. Treat the alignment network's KL change as an intrinsic reward and let SAC re-tune the encoder online, holding KL to the original alignment below ~0.05 — self-adaptive alignment with no human in the loop.