UAV path planning

A drone that plans its own route through a real sky has to reconcile five awkward facts at once: its position estimate degrades when GPS drops out, its action is a continuous thrust/heading vector that must stay physically flyable, its reward braids path length against energy against regulatory risk, the wind field is non-stationary, and in a swarm it must agree with neighbors over a flaky radio link. Each fact names a tool — and reaching for the tool before the fact bites is how you over-engineer.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here the binding difficulty is that the agent's own state estimate, its action's physical feasibility, and the air around it are all uncertain at the same time.

1 · Formulate — the MDP behind an autonomous drone

Intuition. A drone reads its sensors, commands its motors, and is rewarded for reaching a goal cheaply and safely. That is already a Markov Decision Process: the fused navigation estimate is the state, the thrust-and-heading command is the action, the mix of progress, energy and risk is the reward, and rigid-body dynamics plus wind is the transition. Every section below is one of these four pieces turning out to be awkward in real airspace.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ]

Piece	For an autonomous UAV	The awkward part
State S	fused position, velocity, attitude from IMU + GPS + vision	GPS multipath and occlusion → the state estimate itself is uncertain and sometimes unobservable
Action A	continuous speed / heading, ultimately 4 motor thrusts	much of the raw action space is physically infeasible: motors saturate, the airframe stalls
Reward R	reach the goal: short path, low energy, no incursions	three competing objectives, one of them a hard legal constraint, not a soft cost
Transition P	rigid-body dynamics + atmospheric wind	wind is non-stationary — the same action lands you somewhere different tomorrow

The rightmost column is the lesson. We will work the rows in order: a confidence-aware state (§2–3), a feasibility-projected action (§4), a regulation-anchored multi-objective reward (§5), a wind-robust transition (§6), and finally swarm consensus (§7).

2 · State — fuse the sensors, then hand the policy its own confidence

Intuition. Raw GPS jitters; raw IMU drifts. Neither alone is a trustworthy state. The classical fix is a filter that fuses them — but the deeper point for RL is that the policy must not be told "you are here" as if it were certain. It must be told "you are probably here, and here is how sure I am." A policy that knows its position is shaky can lean on relative cues instead of absolute coordinates exactly when GPS is failing.

Engineering detail. Use an error-state EKF (ES-EKF), loosely coupled, in four stages. (1) IMU integration: compensate the current accelerometer bias b_a and gyro bias b_g, then forward-integrate at 100 Hz to get the nominal position, velocity and attitude. (2) GPS correction: when a GPS frame arrives, form the residual z = p_GPS − p_IMU, with the measurement covariance R set by the fix quality — R = 0.01 m² for an RTK fixed solution, 2 m² for float, 9 m² with no differential. (3) Robust update: run a chi-squared test on the residual; if χ² > 9.5 the frame is treated as a multipath jump and R is inflated 100×, which gracefully degrades the fusion to nearly-pure IMU and keeps the state smooth.

(4) RL state output is the part that matters for the policy. Do not feed the network the full filter state. Compress it to an 8-dimensional vector: 3-D position, 3-D velocity, a GPS-status flag (0 = invalid, 1 = float, 2 = fixed), and the trace of the position covariance trace(P_pos). This both shrinks the input dimension and makes the policy explicitly aware of its own confidence, so it automatically reduces its reliance on absolute position when GPS is lost. The whole module runs on an embedded MCU at <0.3 ms per frame, comfortably inside a 100 Hz control loop.

s_RL = [ p_x,y,z, v_x,y,z, gps_flag, trace(P_pos) ] ∈ ℝ⁸
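The gate and the packed state can be sketched in a few lines. This is a minimal illustration, not the filter described above: it keeps only a position slice with a diagonal covariance, and the names `PosFilter`, `R_BY_FIX` and `rl_state` are hypothetical. The variances, the 9.5 gate and the 100× inflation are the values quoted in the text.

```python
from dataclasses import dataclass

# Measurement variance per fix quality in m^2 (values quoted in the text).
R_BY_FIX = {"fixed": 0.01, "float": 2.0, "none": 9.0}
CHI2_GATE = 9.5        # residual gate from the text
R_INFLATION = 100.0    # inflation applied to a suspected multipath frame


@dataclass
class PosFilter:
    """Position-only slice of the filter; diagonal covariance is an assumption."""
    p: list   # nominal position from IMU integration, metres
    P: list   # per-axis position variance, m^2

    def gps_update(self, p_gps, fix_quality):
        r = R_BY_FIX[fix_quality]
        z = [g - x for g, x in zip(p_gps, self.p)]       # residual z = p_GPS - p_IMU
        chi2 = sum(zi * zi / (Pi + r) for zi, Pi in zip(z, self.P))
        if chi2 > CHI2_GATE:                              # treat as a multipath jump
            r *= R_INFLATION
        for i in range(3):
            k = self.P[i] / (self.P[i] + r)               # scalar Kalman gain per axis
            self.p[i] += k * z[i]
            self.P[i] *= (1.0 - k)
        return chi2


def rl_state(filt, velocity, gps_flag):
    """Pack the 8-D policy input: position, velocity, GPS flag, trace(P_pos)."""
    return [*filt.p, *velocity, float(gps_flag), sum(filt.P)]
```

With a 10 m jump on a float-quality frame, the inflated variance keeps the correction to a few centimetres, which is the "nearly-pure IMU" behaviour the text describes.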

3 · Partial observability — surviving a GPS blackout

Intuition. Loose coupling is clean but brittle: lose the fix entirely and you lose all GPS information. The job under occlusion is not to pretend you know where you are — it is to bound the drift and degrade to a conservative policy rather than fly confidently into a wall.

Engineering detail. Pair the high-rate inertial estimate with a low-rate visual anchor, for example 200 Hz IMU inference plus 10 Hz visual keyframes, which fits in roughly 30% of the CPU budget on one embedded GPU module. Measured through a 100 s GPS dropout, this holds lateral error to about 0.27 m, inside the tolerance that road-test regulations require. Because the covariance trace from §2 is already rising during the dropout, the same policy that was tracking a waypoint now sees a high-uncertainty state and shifts toward the conservative branch on its own.

Why feed confidence, not just the estimate

Tightly coupling at the raw-measurement level (pseudorange + Doppler) keeps partial constraints even when only three satellites are visible for half a minute, holding drift under 0.3% where loose coupling would lose GPS entirely. But even loose coupling becomes far more robust the moment trace(P_pos) is in the state vector: the policy learns the equivalent of "when unsure, slow down and trust relative motion" without anyone hand-coding that rule.

4 · Action — project onto the feasible set, don't hope

Intuition. A policy head emitting a continuous thrust/heading vector will happily request commands the airframe cannot deliver: motors past their saturation speed, torques the battery can't source. If you let those commands through, the real drone clips them — and that clipping is invisible to the gradient, so training learns from actions that never actually happened. The cure is to make every action physically feasible before it leaves the policy.

Engineering detail — three layers that lock the action into the feasible set. First, replace the policy's final fully-connected layer with a feasibility-projection layer: map the raw 4-D motor command to total thrust and three-axis torque, then solve a bounded, minimum-norm quadratic program

min ‖u − u_raw‖² s.t. 0 ≤ ω_i ≤ ω_max, i = 1…4

An embedded QP solver clears this in <80 µs, sustaining 400 Hz on the flight MCU. Second, add a soft-constraint reward term −10·max(0, ‖τ‖₂ − τ_max)², active only when the torque norm exceeds its limit, so that early in training the policy learns on its own not to ride the limit, reducing the gradient noise that frequent QP truncation injects. Third, use the same motor model in simulation and on the real airframe, including the battery-voltage sag curve and ESC dead-zone, so the simulated constraint boundary is calibrated to within ±2% of the real one and sim-to-real transfer never suddenly hits a ceiling. With all three layers the policy output is physically feasible by construction.
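A small sketch shows what the projection layer computes in the easiest case. It assumes a hypothetical X-quad mixer `H`, normalised so that its rows are orthonormal, and it bounds per-motor thrust `f` rather than motor speed (thrust grows monotonically with speed, so the box constraint carries over). Under that assumption the bounded minimum-norm QP has a closed form; a general mixer would still need a real QP solver.

```python
# Hypothetical normalised mixer: per-motor thrusts f -> [thrust, roll, pitch, yaw].
# Rows are orthonormal, so H is an orthogonal matrix (assumption for this sketch).
H = [
    [0.5,  0.5,  0.5,  0.5],
    [0.5, -0.5, -0.5,  0.5],
    [0.5,  0.5, -0.5, -0.5],
    [0.5, -0.5,  0.5, -0.5],
]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def project_wrench(u_raw, f_max=1.0):
    """Closest achievable wrench to u_raw subject to 0 <= f_i <= f_max.

    Because H is orthogonal, ||H f - u_raw|| equals ||f - H^T u_raw||, so the
    box-constrained QP is solved exactly by clipping in motor space.
    """
    f_unclipped = matvec(transpose(H), u_raw)
    f = [min(max(fi, 0.0), f_max) for fi in f_unclipped]
    return matvec(H, f), f


u_ok, f_ok = project_wrench([1.0, 1.0, 0.0, 0.0])    # already feasible, returned unchanged
u_sat, f_sat = project_wrench([3.0, 0.0, 0.0, 0.0])  # too much thrust, all motors saturate
```

The second call asks for more total thrust than four saturated motors can supply, so the layer returns the largest achievable thrust with zero torque instead of letting the airframe clip the command silently.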

Train-time soft, inference-time hard

A subtle but load-bearing asymmetry. If you use a differential-flatness parameterization — the network outputs only the flat outputs z = [x, y, z, ψ] and their derivatives, then an analytic inverse map u = Ψ(z, ż, z̈) recovers thrust and torque — insert a differentiable projection afterward:

u_proj = u − Aᵀ(AAᵀ)⁻¹(Au − b)₊ (Au ≤ b are the saturation bounds)

At training time, a slight violation is allowed and a penalty κ·‖ReLU(Au − b)‖² is added to the immediate reward so the policy gradient stays smooth. At inference time, switch to hard projection so the command is strictly feasible — which is what safety audits demand. Keeping the network responsible only for "smooth flat outputs" while the layer enforces physics cuts complexity and lifts sample efficiency by 30%+.
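The soft/hard asymmetry is easy to see on plain box bounds. The sketch below is illustrative only: `box_to_Ab`, `soft_penalty` and `hard_project_box` are hypothetical helper names, κ = 10 is an arbitrary value, and box bounds are used because their hard projection is exactly a clip.

```python
def box_to_Ab(lo, hi):
    """Write lo <= u <= hi as A u <= b with A = [I; -I], b = [hi; -lo]."""
    n = len(lo)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    neg = [[-v for v in row] for row in eye]
    return eye + neg, list(hi) + [-l for l in lo]


def violation(A, b, u):
    """Per-row constraint violation ReLU(A u - b)."""
    return [max(0.0, sum(a * x for a, x in zip(row, u)) - bi)
            for row, bi in zip(A, b)]


def soft_penalty(A, b, u, kappa=10.0):
    """Training-time term kappa * ||ReLU(A u - b)||^2, subtracted from the reward."""
    return kappa * sum(v * v for v in violation(A, b, u))


def hard_project_box(u, lo, hi):
    """Inference-time hard projection; for box bounds this is an exact clip."""
    return [min(max(x, l), h) for x, l, h in zip(u, lo, hi)]


lo, hi = [0.0, 0.0], [1.0, 1.0]
A, b = box_to_Ab(lo, hi)
u = [1.5, -0.2]
train_penalty = soft_penalty(A, b, u)     # smooth, grows with the overshoot
u_deployed = hard_project_box(u, lo, hi)  # strictly feasible command
```

During training the command passes through with a penalty that scales with the squared overshoot; at deployment the same command is clipped, after which the penalty is zero.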

5 · Reward — path length vs energy vs regulatory risk

Intuition. "Reach the goal" hides three objectives that pull against each other: the shortest route, the lowest-energy route, and the safest route. The first two are soft costs you can trade off. The third — staying out of a no-fly zone — is not a cost, it is a legal constraint, and a reward that lets the agent "buy" an incursion with enough path savings is a misspecified reward. The design has to make the constraint dominate exactly when it is violated.

Engineering detail — a regulation-anchored, energy-coupled penalty. Model the no-fly zone as an ellipsoid and define a normalized distance that is ≥1 outside and <1 inside:

d(x_t) = √[ ((x−x₀)² + (y−y₀)²)/R_h² + (z−z₀)²/R_v² ]

with the horizontal/vertical radii R_h, R_v taken straight from the official airspace notice. The penalty is a three-segment piecewise function: zero outside the zone, continuous across the boundary d = 1, and stepping up in severity inside the core d < 0.5. Its coefficients are calibrated to the actual fine schedule so that the agent "sees" risk in the same units regulators do:

r_t = 0 for d ≥ 1
r_t = −λ₁(1−d)(1 + E_k/30 kJ) for 0.5 ≤ d < 1
r_t = −λ₂(1−d)²(1 + E_k/30 kJ) − c_crash for d < 0.5

with λ₁ = 2×10⁵ (the linear map onto a minimum administrative fine), λ₂ = 1×10⁶ (squared amplification once inside the core zone, mapping to the maximum fine), and c_crash = 5×10⁶ added when kinetic energy E_k ≥ 30 kJ so a fatal-energy collision reads as catastrophic. E_k uses the maximum takeoff mass and a conservative ground-speed-plus-wind estimate, so the agent is trained against the worst case. The energy multiplier (1 + E_k/30kJ) is what couples the risk objective to the energy objective — a fast, heavy incursion is punished harder than a slow drift across the line.
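As a worked check of the three segments, the function below evaluates the penalty from a precomputed normalized distance d and a kinetic energy in joules. The coefficients are the ones quoted above; the function names are hypothetical, and applying c_crash only when E_k ≥ 30 kJ follows the prose description.

```python
LAMBDA_1 = 2e5       # linear segment, mapped to the minimum administrative fine
LAMBDA_2 = 1e6       # squared segment inside the core zone
C_CRASH = 5e6        # added for a fatal-energy incursion
E_REF_J = 30_000.0   # 30 kJ reference energy


def kinetic_energy_j(mass_kg, speed_mps):
    """Worst-case kinetic energy: pass max takeoff mass and a conservative speed."""
    return 0.5 * mass_kg * speed_mps ** 2


def nofly_penalty(d, e_k_j):
    """Three-segment penalty as a function of normalized distance d and E_k."""
    if d >= 1.0:
        return 0.0
    energy_gain = 1.0 + e_k_j / E_REF_J
    if d >= 0.5:
        return -LAMBDA_1 * (1.0 - d) * energy_gain
    crash = C_CRASH if e_k_j >= E_REF_J else 0.0
    return -LAMBDA_2 * (1.0 - d) ** 2 * energy_gain - crash


slow_drift = nofly_penalty(0.75, 0.0)      # -50000.0
fast_core = nofly_penalty(0.25, 30_000.0)  # -6125000.0
```

The two sample values differ by more than two orders of magnitude, which is the intended effect of the energy multiplier and the crash term: a slow drift across the line costs a fine, a heavy fast incursion into the core reads as catastrophic.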

Three engineering details that make the reward trainable

(1) Penalty normalization. Z-score r_t inside the replay buffer so the huge fine-scale magnitudes don't distort Q-value estimation. (2) Safety backstop. If the policy's action would drive d(x_t+1) < 0.2, a control-barrier-function layer truncates it and assigns r_t = −1×10⁷, so the gradient pathway can never reward the lethal region. (3) Audit interface. Each episode emits a three-field record (incursion duration, peak penalty, cited regulation) for automated regulatory comparison. On a 1000-flight simulation benchmark this drove the incursion rate from 3.2% to 0.02% with convergence speed unchanged versus a no-penalty baseline.
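Penalty normalization can be done online without storing the whole buffer. The sketch below uses Welford's running mean and variance; the class name `RunningZScore` is hypothetical, and normalizing at sampling time is one possible design rather than the one the benchmark necessarily used.

```python
import math


class RunningZScore:
    """Online z-score of rewards using Welford's update."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)

    def normalize(self, r):
        if self.n < 2:
            return 0.0     # not enough data for a variance estimate
        std = math.sqrt(self.m2 / (self.n - 1))
        return (r - self.mean) / (std + self.eps)


norm = RunningZScore()
for reward in [0.0, -5.0e4, -6.125e6, 0.0, 0.0]:
    norm.update(reward)
scaled = norm.normalize(-6.125e6)   # a few standard deviations, not millions
```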

Engineering detail: when shortest still wins, constrain it. A reward that merely adds a risk penalty can still let "go short" dominate. Train instead with PPO-Lagrangian, where the multiplier λ on the safety cost is adjusted by dual gradient ascent, so each update is driven toward both the KL trust-region constraint and the cost constraint and you never hand-tune the penalty weight. Add a quadratic barrier on the rolling-average cost Ĉ:

r = r_length − λ·c − k·max(0, Ĉ − safety_budget)² (k ≈ 10³)

so that once the rolling cost exceeds the budget the negative reward grows quadratically and the gradient itself suppresses the shortcut impulse. If the multiplier oscillates and destabilizes training, slow λ's update to a quarter of the policy-update rate (one λ step per four policy updates) and smooth it with an EMA, monitoring the phase lag between λ and cost to trigger early stopping.
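The multiplier update itself is only a few lines. This is a minimal sketch of dual ascent with the slowed, EMA-smoothed schedule described above; the class name `SafetyMultiplier`, the learning rate and the EMA factor are illustrative assumptions, and the policy update that would consume λ is omitted.

```python
class SafetyMultiplier:
    """Dual ascent on lambda, stepped every `every` policy updates and EMA-smoothed."""

    def __init__(self, budget, lr=0.05, ema=0.9, every=4):
        self.budget = budget
        self.lr = lr
        self.ema = ema
        self.every = every
        self.cost_hat = 0.0   # rolling-average safety cost
        self.lam_raw = 0.0    # multiplier before smoothing
        self.lam = 0.0        # smoothed multiplier used in the reward
        self._ticks = 0

    def observe(self, batch_cost):
        """Call once per policy update with that batch's mean safety cost."""
        self.cost_hat = self.ema * self.cost_hat + (1.0 - self.ema) * batch_cost
        self._ticks += 1
        if self._ticks % self.every == 0:
            # Raise lambda when cost exceeds the budget, lower it otherwise, clamp at 0.
            self.lam_raw = max(0.0, self.lam_raw + self.lr * (self.cost_hat - self.budget))
            self.lam = self.ema * self.lam + (1.0 - self.ema) * self.lam_raw
        return self.lam


def shaped_reward(r_length, cost, lam):
    """Lagrangian reward r_length - lambda * c."""
    return r_length - lam * cost
```

Stepping λ on a slower clock than the policy and smoothing it are both ways of damping the λ-versus-cost oscillation; the phase-lag monitor mentioned above would sit on top of `cost_hat` and `lam`.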

6 · Transition — a wind field that won't sit still

Intuition. The same heading command at the same place gives a different trajectory in different wind. If you train on a fixed wind snapshot, the policy overfits to one day's weather. You need the distribution of wind fields in the loop, and you need the policy to be robust to drawing from it.

Engineering detail: generate the wind, then make the policy indifferent to it. A practical pipeline: run a high-fidelity fluid simulation to collect many non-stationary wind snapshots; compress them with POD (proper orthogonal decomposition), truncating at 99% energy to a few hundred basis modes; fit a conditional VAE over the time-coefficients so you can sample fresh, physically plausible 600 s wind sequences conditioned on a measured mean wind; and append a divergence-free projection (a differentiable Poisson layer) so every generated field is physically consistent. Wrap that as a gym-style environment: each reset samples a latent z and rolls out a wind sequence; each step interpolates hub-height wind and turbulence to the drone's position and feeds the policy. Finally, replace the average-case objective that plain domain randomization would optimize with a worst-case (minimax) objective over the wind latent

max_θ min_z E[ r(θ, z) ]

so the policy is trained against the worst wind latent, not the average. The source reports a real 5 MW deployment lifting yield 3.8% while staying inside structural-load safety margins.
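In practice the inner minimum is usually approximated over a finite batch of sampled latents. The sketch below shows only that inner step, under the assumption that a callable `policy_return(z)` (hypothetical) rolls out the current policy in the wind generated from latent z and returns the episode return; the outer maximization over θ is whatever policy-gradient update is already in use.

```python
def robust_score(policy_return, latents, rollouts_per_latent=4):
    """Approximate min_z E[r(theta, z)] over a finite set of sampled wind latents.

    Returns the worst mean return and the latent that produced it, so the next
    policy update can be weighted toward that wind condition.
    """
    per_latent = []
    for z in latents:
        returns = [policy_return(z) for _ in range(rollouts_per_latent)]
        per_latent.append(sum(returns) / len(returns))
    worst = min(range(len(latents)), key=per_latent.__getitem__)
    return per_latent[worst], latents[worst]


# Toy stand-in for a rollout: return degrades with the magnitude of the latent.
toy_return = lambda z: -abs(z)
worst_value, worst_latent = robust_score(toy_return, [-1.0, 0.5, 2.0])
average_value = sum(toy_return(z) for z in [-1.0, 0.5, 2.0]) / 3
```

The toy numbers make the difference visible: the average over the three latents is about −1.17, while the worst case is −2.0 at z = 2.0, and it is the latter that the minimax objective optimizes.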

Engineering detail — adapt fast to a brand-new site. When a genuinely new environment appears, meta-RL gives a 72-hour path to a usable policy in three tiers. (1) Offline meta-training on a library of historical sites with a proximal meta-RL method, the policy taking an extended state (winds, rates, plus a task vector m). (2) A lightweight context encoder — a GRU under 30k parameters — compresses a 10-minute sliding window of telemetry into an 8-D latent z spliced into the policy input, giving zero-gradient instant adaptation. (3) One-step gradient correction: collect ~12 trajectories on day one and take a single MAML inner-loop step (lr 1e-3) on just the last two layers, regularized by digital-twin replay. This is enough to raise a power-curve fill rate from 89% to 93% with no manual re-tuning. Crucially, if the context encoder reports high uncertainty (σ_z > 0.35) on an out-of-distribution event like a passing storm, the system degrades to a preset safe policy rather than extrapolating.

7 · Guard — stall protection and swarm consensus in production

Intuition. Two failure classes dominate in flight: a single airframe driving itself into a stall, and a swarm losing agreement when the radio link degrades. Both are guarded the same way — a fast, certifiable safety layer underneath the learned policy that can override it within milliseconds.

Engineering detail — anti-stall, two guard layers. Wrap the policy in an exploration-uncertainty escort: an ensemble of small Bayesian models predicts the next-state distribution, and if the predicted angle-of-attack variance exceeds 0.5° the step is judged out-of-bounds and the agent switches to a conservative Safe Recovery Policy, with that trajectory blacklisted so it cannot poison the training data. In production add a shadow-mode dual monitor: the online policy and an offline-trained safe policy infer in parallel, and when their action gap or the CBF-violation probability exceeds threshold (>1%), hardware forces a switch to the safe policy and raises a stall warning within 2 ms, inside the ≤3 ms loss-of-control protection bound. Across a large simulation-plus-flight campaign this reached a zero-stall record.
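Both guard layers reduce to threshold checks around the learned policy. The sketch below is illustrative: the function names are hypothetical, the action-gap threshold of 0.3 is invented for the example because the text gives no number, and the 0.5 limit is applied directly to the ensemble variance as the text states it.

```python
import math
import statistics

AOA_VARIANCE_LIMIT = 0.5         # limit quoted in the text, applied to the variance
CBF_VIOLATION_THRESHOLD = 0.01   # the ">1%" threshold from the text
ACTION_GAP_THRESHOLD = 0.3       # hypothetical value; not given in the text


def ensemble_out_of_bounds(aoa_predictions_deg):
    """Training-time guard: ensemble disagreement on next-step angle of attack."""
    return statistics.pvariance(aoa_predictions_deg) > AOA_VARIANCE_LIMIT


def select_action(a_online, a_safe, p_cbf_violation):
    """Production guard: run both policies, fall back when they diverge or risk is high."""
    gap = math.dist(a_online, a_safe)
    if gap > ACTION_GAP_THRESHOLD or p_cbf_violation > CBF_VIOLATION_THRESHOLD:
        return a_safe, True    # True = stall warning raised
    return a_online, False


action, warned = select_action([0.10, 0.00], [0.10, 0.05], p_cbf_violation=0.001)
```

In a real system the comparison runs in hardware or a high-priority task so that the switch meets the 2 ms bound; the logic itself stays this simple so that it can be audited.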

Engineering detail — swarm, three layers. Decompose into safety / coordination / execution. The safety layer is a control barrier function that hard-corrects whenever inter-drone distance d_ij drops below 1.5 m, independent of any reward shaping. The coordination layer uses centralized training, decentralized execution (MAPPO): each drone observes its own pose plus the relative poses of up to 6 neighbors (graph pruned dynamically), outputs a continuous 3-D acceleration from a Gaussian policy, and is rewarded by r = r_safe + 10·r_form + 0.1·r_smooth − 0.01·r_energy with curriculum-scheduled weights — survive first, then hold formation. MAPPO is chosen because PPO's clipped ratio quantizes to INT8 better than SAC on embedded NPUs, giving <8 ms per frame. The execution layer does sim-to-real with full domain randomization plus residual RL on the real airframe (train only the last layer, converging in ~5 minutes of data). A reported deployment flew 6 multirotors at 2 m spacing and 8 m/s with zero collisions and mean formation error <0.18 m.
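The observation pruning, the distance trigger for the safety layer, and the weighted reward can be written down directly. This is a sketch with hypothetical function names; it only detects when the barrier should act and does not implement the corrective control itself. The weights, the 6-neighbor cap and the 1.5 m threshold are the values quoted above.

```python
import math

MAX_NEIGHBORS = 6   # neighbor cap from the text
D_SAFE_M = 1.5      # CBF activation distance from the text


def nearest_neighbors(own_pos, others, k=MAX_NEIGHBORS):
    """Return up to k (distance, relative_position) pairs, closest first."""
    rel = [tuple(o - s for o, s in zip(other, own_pos)) for other in others]
    ranked = sorted((math.hypot(*r), r) for r in rel)
    return ranked[:k]


def cbf_active(neighbors):
    """True when any neighbor is inside the hard safety distance."""
    return any(dist < D_SAFE_M for dist, _ in neighbors)


def swarm_reward(r_safe, r_form, r_smooth, r_energy):
    """r = r_safe + 10*r_form + 0.1*r_smooth - 0.01*r_energy."""
    return r_safe + 10.0 * r_form + 0.1 * r_smooth - 0.01 * r_energy


line_formation = [(float(i), 0.0, 0.0) for i in range(1, 9)]   # 8 drones, 1 m apart
obs = nearest_neighbors((0.0, 0.0, 0.0), line_formation)
```

Sorting by distance before truncating keeps the observation size fixed at six entries no matter how large the swarm grows, which is what lets the same policy network run on every drone.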

When the radio link drops

Make communication event-triggered: neighbors exchange a 2-byte compressed vector only when relative speed exceeds 0.5 m/s. Measured under 50% packet loss, formation error rises only 6%. Under purely local communication, run a distributed policy-gradient scheme (D-PPO) with a consensus step every 10 updates using a doubly-stochastic weight matrix and a gradient-tracking variable, plus Top-K(1%) sparsification and 4-bit quantization for >100× compression — converging even with asymmetric links and 30% loss. If graph connectivity (Fiedler value λ₂) drops below 0.05, trigger relay election so the swarm graph stays connected.
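The consensus step and the event trigger can be sketched as follows. The Metropolis weighting is one standard way to obtain a doubly-stochastic matrix from a symmetric neighbor graph and is an assumption here, since the text does not say how the weights are built; the asymmetric-link case, gradient tracking and compression are left out.

```python
REL_SPEED_TRIGGER = 0.5   # m/s, from the text


def should_transmit(relative_speed_mps):
    """Event trigger: only send when relative speed exceeds the threshold."""
    return relative_speed_mps > REL_SPEED_TRIGGER


def metropolis_weights(adjacency):
    """Doubly-stochastic weights for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    deg = [sum(row) for row in adjacency]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i][j]:
                W[i][j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i][i] = 1.0 - sum(W[i])   # diagonal absorbs the remaining mass
    return W


def consensus_step(W, values):
    """One round of neighbor averaging: x <- W x."""
    return [sum(w * v for w, v in zip(row, values)) for row in W]


# Three drones in a line: 0 - 1 - 2. Each holds one scalar parameter.
W = metropolis_weights([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
params = [0.0, 3.0, 6.0]
for _ in range(50):
    params = consensus_step(W, params)
```

Because every row and column of W sums to one, the swarm-wide average is preserved at each step while the individual values contract toward it, which is the property the distributed update relies on.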

8 · The trade-off you actually tune — energy under nonlinear drag

Intuition. Of the three reward objectives, energy is the one with a sharp nonlinearity: aerodynamic drag grows with the square of airspeed-relative-to-wind. Flying faster shortens the path but the energy cost climbs quadratically, and a headwind shifts the whole curve.

Engineering detail. Shape the energy reward as a standardized log reward so the quadratic drag term is retained but a high-speed reward explosion is compressed:

r_t = −log( 1 + ( F_roll + ½·ρ·C_d·A·(v+v_wind)² + F_grade )·v / (η·P_ref) )

and train with a model-augmented PPO: a light physics sub-model gives the deterministic next state, a Gaussian-process residual network outputs the drag-prediction mean μ and variance σ², and imagined Dyna-style rollouts sample the drag term from 𝒩(μ, σ²) so the policy meets extreme high-drag events before it ever flies them — improving sample efficiency and cutting online exploration risk.
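The log-shaped energy reward is straightforward to evaluate. In the sketch below every default parameter (air density, drag coefficient, frontal area, efficiency, reference power) is a placeholder chosen for illustration, not a value from the text, and a positive `v_wind` is treated as a headwind to match the (v + v_wind) term.

```python
import math


def energy_reward(v, v_wind, rho=1.225, c_d=1.0, area=0.1,
                  f_roll=0.0, f_grade=0.0, eta=0.7, p_ref=500.0):
    """r = -log(1 + (F_roll + 0.5*rho*C_d*A*(v + v_wind)^2 + F_grade) * v / (eta * P_ref))."""
    drag = 0.5 * rho * c_d * area * (v + v_wind) ** 2
    power = (f_roll + drag + f_grade) * v
    return -math.log1p(power / (eta * p_ref))


calm_slow = energy_reward(10.0, 0.0)
calm_fast = energy_reward(20.0, 0.0)
headwind_slow = energy_reward(10.0, 5.0)
```

Doubling the cruise speed in calm air multiplies the power term by eight, but the logarithm compresses that into a bounded change in reward, which is the point of the shaping: the quadratic drag still orders the options correctly without letting high-speed samples dominate the gradient.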

Further considerations

Tight coupling and factor graphs. When only three satellites are visible for tens of seconds, a loosely-coupled filter loses GPS entirely; a tightly-coupled formulation still uses pseudorange and Doppler to keep partial constraints, bounding drift under 0.3%. The cost is a heavier, harder-to-tune estimator.
Learn the filter, not just the policy. Make the Kalman gain K a policy output and add a reward term for "estimated-versus-post-hoc-RTK error," training with PPO. The policy can then learn behaviors hand-tuning struggles to express — e.g. pre-emptively down-weighting GPS at a tunnel mouth. Across cities, viaducts and underground sites, a meta-RL context vector c lets one π(a|s,c) emit the appropriate R and Q matrices rather than maintaining three hand-tuned filters.
Variable payload shifts the feasible set. Dropping cargo changes mass m and inertia J in real time, moving the QP constraint boundary. Make the projection-layer parameters θ=[m,J] a network input so the policy senses and adapts online; feed battery voltage and motor temperature so it predicts a tightening thrust limit before hitting the wall, avoiding the policy oscillation that hard truncation causes.
Markov wind-error, not white noise. If forecast error is modeled as a Markov process rather than Gaussian white noise, the value-function error propagation changes: introduce an error-state variable and augment the MDP, so the Bellman equation recurses on the joint state (s, e). For spatially-correlated errors across neighboring turbines or drones, graph RL embeds the correlation directly into the value function.
Multi-agent credit and dynamic no-fly zones. When only some drones in a swarm intrude, decompose the total penalty back to each by marginal contribution (Shapley value or a counterfactual baseline) to avoid "collective punishment" pushing the whole policy conservative. When a no-fly zone expands for a major event, hot-update the penalty function within minutes via meta-learning treating the penalty parameters as a task vector — without catastrophic forgetting.
Explainability for certification. Auditors increasingly require an explainable safety boundary: visualize the CBF gradient to the operator's HMI, log B-spline coefficients and the flat-output path, and recover the algebraic u↔z relation post-hoc by symbolic regression. For wind-shear escape, a SHAP module reporting "which state component triggered the switch" is becoming a hard airworthiness gate.