Smart-logistics AGV scheduling

A warehouse floor is a fleet of automated guided vehicles (AGVs) that must pick, carry, yield at intersections, and recharge — all while orders arrive on a schedule that drifts hour by hour. The binding difficulties of this domain are different from a game's: the reward is a multi-agent yielding conflict that can deadlock, the order stream is non-stationary so a fixed-horizon policy ages out by noon, path and charging are two coupled timescales, the charging queue is partially observable, and the whole thing only learns fast enough if you can keep thousands of simulators feeding a GPU. Each one names a tool.

The method — five steps, every lesson

Applied RL is the same loop in every domain. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. For an AGV fleet the difficulties stack — multi-agent conflict, non-stationarity, coupled timescales, partial observation, throughput — so we run the loop once per row of the table.

1 · Formulate — the MDP behind a yielding, recharging AGV

Intuition. One AGV sees its local map, the priority tags and predicted paths of nearby vehicles, its own battery, and the layout of pick aisles. It chooses how to move and when to peel off to charge. It is rewarded for making progress without forcing a higher-priority vehicle to slam on the brakes, without colliding, and without running flat. The transition is the floor physics plus every other vehicle — which is where the trouble starts.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] subject to collision-rate ≤ ε

Piece	For one AGV in the fleet	The awkward part
State S	local map + yield zones + neighbour priority vector p∈{0,1,2,3} + predicted neighbour trajectories + own battery / SOC	you only see neighbours within a comm radius → the joint state is too big and partially observed
Action A	continuous velocity (vₓ, v_y), plus discrete macro-actions: yield, drop-package, go-charge	atomic control and minute-long deliveries live on two different timescales
Reward R	progress + yield + safety (TTC) + comfort + traffic-rule terms, summed per step	"who yields to whom" is a social-dilemma; a flat win/loss reward is far too sparse
Transition P	floor dynamics + the other N−1 learning vehicles + the charging-queue server	order arrivals drift over the day → non-stationary; queues are hidden → POMDP

Same pattern as every lesson: the MDP table writes itself, and the rightmost column is the lesson. Each awkward row below gets the one mechanism that answers it — and nothing more.

2 · Reward shaping for yielding — make the social goal locally perceptible

Intuition. If you only reward "reached the goal" (+100) and punish "crashed" (−1000), an AGV in dense traffic must stumble through an exponential number of steps before it ever sees a positive sample. Yielding is invisible to such a reward — it is just a delay with no payoff. The fix is to decompose the social goal into small, per-step, perceptible sub-goals: did I make progress, did I cause a high-priority vehicle to brake, am I about to collide, am I driving smoothly, did I break a rule.

Engineering detail — five per-step terms. The local reward is computed every step as a sum of five components:

r = R_progress + R_yield + R_safety + R_comfort + R_rule

R_progress: +1 per metre of legal forward progress; −0.1 per step while stopped to yield — small enough that yielding pays off, large enough that "never move" does not.
R_yield: with priority vector p (lower = higher priority), if your action forces a higher-priority vehicle into emergency braking (deceleration > 3 m/s²), an immediate −10; if you slow inside the yield zone (<15 m from the conflict point) so the priority vehicle never has to brake, +2.
R_safety: a continuous time-to-collision penalty R_safety = −5·exp(−TTC/1.0), active when TTC < 2.5 s; a collision adds −1000 and truncates the episode.
R_comfort: a quadratic penalty −0.05·a² − 0.02·j² on acceleration a and jerk j, clipped at −2, so the agent can't exploit "floor it" maneuvers.
R_rule: traffic rules as boolean features — run a red light −500, fail to keep right −1 per step, turn without yielding to oncoming traffic moving >5 km/h −20, fail to yield on turn −15.

Why decomposition beats a sparse reward

The sparse version (+100 success / −1000 collision) needs exponentially many exploration steps to find a positive sample in dense urban-style traffic. Breaking "yield" into perceptible per-step targets raises sample efficiency by roughly 5–10×. The shaping terms feed only the state vector (yield zones, priority tags, predicted neighbour trajectories from the HD map) — never future ground truth — so the reward stays causal and does not leak information the agent could not have at decision time.

3 · Safety as a constraint, not a reward weight — PPO-Lagrangian

Intuition. You could fold "don't crash" into the reward with a big negative weight, but then the agent silently trades a little safety for a lot of efficiency whenever the arithmetic says so. In logistics that trade is unacceptable: the policy must learn to not crash first, then optimize throughput. That is a constrained problem, not a weighting problem.

Engineering detail. Train with PPO-Lagrangian: set the collision rate as an explicit constraint (≤ 0.1%) and introduce a Lagrange multiplier λ on the constraint. The objective becomes the usual PPO surrogate minus λ times the constraint violation, and λ is updated by gradient ascent on the dual:

max_θ min_λ≥0 E[ A^PPO ] − λ·( E[collision] − ε ) , ε = 0.001

When collisions exceed the budget, λ rises and the safety terms dominate the gradient; once the policy is safe, λ relaxes and efficiency terms take over. This enforces a priority ordering that a fixed reward weight cannot. The same emitter is hot-pluggable: when a city piloted a "must-stop-on-right-turn" rule, the team added one boolean term (−50) to R_rule with no re-collection of data — the reward function is explainable and modular, and the per-step terms plus the RSS safety envelope are written into a safety-audit log for regulatory review.

4 · Multi-agent conflict — mean-field neighbours and deadlock intervention

Intuition. The transition depends on every other vehicle, and the joint action space explodes with fleet size. But one AGV does not need each neighbour individually — it needs the distribution of what neighbours are doing nearby. A mean-field approximation replaces "N−1 specific vehicles" with one averaged neighbour, collapsing the combinatorics.

Engineering detail — estimating the neighbour distribution. First fix the neighbour set N(i) by comm radius. Then pick an estimator by data regime: with abundant simulation, an empirical average μ₋ᵢ(a) = (1/|N(i)|) Σ I[aⱼ=a] over the batch; with limited sampling, a parameterized network μ₋ᵢ = φ_η(s_N(i)) trained end-to-end with the policy under a KL loss to the empirical distribution; and in a non-stationary floor, a momentum moving average μ₋ᵢ,ₜ₊₁ = (1−α)μ₋ᵢ,ₜ + α·batch_mean with α∈[0.01, 0.1]. Apply stop-gradient to μ₋ᵢ so it is a fixed target, feed the global state as an extra input to reduce local-observation bias, and importance-weight replayed neighbour actions to correct distribution drift. Acceptance: distribution error L1 = E[|μ₋ᵢ − true frequency|] < 0.05, and the mean-field policy beats the independent one by ≥ 8% within three iterations.

Deadlock: detect a cycle, intervene centrally, hand control back

Even with good local policies, a fleet can gridlock — A waits for B waits for C waits for A. Build a directed graph from the recent joint state-action sequence and run Tarjan's algorithm to find strongly-connected components; a cycle of length > 10 with average reward < 95% of baseline is flagged as a second-degree deadlock. A coordinator pushes a sub-1 KB intervention packet (<50 ms to deliver): heat up exploration (ε 0.05 → 0.3 for 5 episodes), add a potential term −η·‖a − a_dead‖² that repels the most-repeated cyclic action (η = 0.1), and mask the top-2 looping actions for just 50 steps. Each agent then measures KL(π_new ‖ π_inter); when KL < 0.01 and local TD-error < 0.05 it sends a FIN and reverts to distributed learning. At ≥ 80% FIN the deadlock is cleared. If < 80% FIN within 10 s, the coordinator degrades: roll the cycle's agents back to their last local checkpoint (≤ 5 MB) and exponentially back off intervention strength so throughput is never starved. Zero human intervention; in one rollout this cut deadlock frequency from ~3/day to ~1 per two weeks and lifted throughput ~12%.

5 · Non-stationarity — augment the state with time, adapt the horizon

Intuition. Orders do not arrive at a constant rate; a lunch rush looks nothing like 3 a.m. A policy that assumes one stationary distribution is fighting the calendar. Two fixes work together: tell the agent what time it is so the non-stationarity becomes part of the (now stationary) augmented state, and shorten the planning horizon when the world is changing fast so stale value estimates don't poison decisions.

Engineering detail — periodic state augmentation. Encode the hour-of-day h as a 4-dim sine-cosine vector instead of a one-hot (preserves cyclic continuity, avoids dimension blow-up), and feed "same-hour volume yesterday" as an external feature so the model locks onto the periodic pattern quickly. Train with PPO on a periodically-augmented state, replaying the last 7 days in time order with γ = 0.995 (≈ 6 h half-life, so intraday decisions dominate). Online, append the current hour h_t to the state vector — no engine change — and incrementally update the value network every 30 min, hot-swapping the policy every 2 h to meet the "updated before the lunch peak" operational requirement. On a sudden promotion spike, a context-switch branch appends a "promo flag" and enters a meta-policy π(a | s, z) for hour-scale adaptation. A time-ordered A/B lifted the lunch-peak 2-hour fulfilment rate by 3.8% and cut overnight empty-running by 2.1%.

Engineering detail — change-point-driven discount. A fixed γ is wrong when the regime shifts. Run online change-point detection (a CUSUM-style statistic g_t with threshold h calibrated to FPR ≤ 1% by a Neyman-Pearson rule, plus double-window confirmation to suppress false alarms). On a detected change-point, do not restart the network; instead decay the discount back toward its long-horizon value as the regime stabilizes:

γ_t = γ_min + (γ_max − γ_min)·exp(−λ·Δt)

where Δt is frames since the change-point and λ is estimated online by Bayesian regression. Feed γ_t straight into GAE so the advantage horizon shortens automatically right after a shift and lengthens as things settle, and clear replay across the change-point so out-of-distribution samples don't inflate variance.

6 · Coupled timescales — options and hierarchical RL

Intuition. "Deliver this package" is one decision to a human but thousands of 10 Hz velocity commands to a controller. Treating delivery as a sequence of atomic steps makes the horizon enormous and the reward hopelessly sparse. The options framework lets the high level pick a macro-action ("deliver") that runs a pre-trained low-level controller until a termination condition fires.

Engineering detail — defining the "deliver" option. The option is a triple ⟨initiation set, low-level policy, termination β⟩. The low-level policy acts on atomic velocities plus a discrete "drop" action, pre-trained with PPO + HER in a sub-region simulator (reward = −path-length − 10·collision − 50·package-loss) and run at 10 Hz closed-loop. The termination function β(s) = 1 iff (a) package horizontal error < 20 cm and no-contact timeout > 3 s, or (b) the robot has retreated > 1.5 m and received a cloud "delivery confirmed" message; a hard cap T_max = 1200 steps (2 min) forces β = 1 to prevent infinite loops. At the semi-MDP level "deliver" is a macro-action taking k steps with discount γᵏ, and the external reward is given only at β = 1: +200 success, −50 timeout, −100 loss. Internal pseudo-rewards train the low level only and never contaminate the external return. In one deployment this compressed the average atomic-step count for a 1.8 km route from 4200 to 120 and cut training time ~38%.

Engineering detail — aligning path and charging layers. The path layer and charging layer run at different time constants. Coordinate them with a potential-based shaping term F = α·remaining-distance + β·battery-deviation so that the low level's per-step reward and the high level's accumulated return (e.g. 5 s later) are linearly additive — which, by the potential-shaping theorem, leaves the optimal policy unchanged while removing the cross-layer non-stationarity. The high level writes goals to a lock-free ring queue; the low level reads them zero-copy. Cross-attention fuses the two feature streams: treat battery as the Query and priority as Key-Value, Attn = softmax(QKᵀ/√d)V with d = 16, residual back into the battery channel then LayerNorm, so the critical "low-battery + high-priority" combination gets explicit weight. Ablations show that flattening priority to a raw scalar drops PPO return ~18%, and bucketizing battery slows convergence ~1.9×.

7 · Partial observability — believe the charging queue

Intuition. An AGV deciding whether to wait at a charger or drive to another cannot see the full queue — only a partial image and the charger's status. Reservations, other vehicles' SOC, and power draw are hidden. That makes wait-time a belief, not a measurement.

Engineering detail. Model the charging station as a POMDP: observation = partial queue image + charger state, hidden state = reservations / power / vehicle SOC, action = predicted wait time, reward = negative absolute error plus an overtime penalty. A convolutional + GRU belief network maps raw image, text, and CAN signals to a 128-dim belief vector, then a head outputs the wait-time distribution parameters μ, σ. Train offline on counterfactually-augmented trajectories with importance sampling + Retrace(λ) to control bias, then online with recurrent DDPG (a KL regularizer keeps the actor close to the old policy), and MAML-finetune the belief net every 30 min to track non-stationary arrivals. After the prediction, a 3σ truncation safety layer recommends an alternate station when predicted wait > 45 min. In one field test this cut mean absolute wait-time error from 11.4 min to 4.7 min.

8 · Throughput — keep the GPU fed (the central trade-off)

Intuition. None of the above learns fast enough if your simulators can't feed the GPU. With 1000+ AGV environments the bottleneck is rarely the GPU's math — it's whether enough trajectories arrive each millisecond to keep the learner's batches full. Too few samplers and the GPU idles; too many and the experience queue overflows memory. The right number balances sampler throughput against GPU appetite, with backpressure as the safety valve.

Engineering detail. A working topology: a sampling layer of CPU servers, each process owning one AGV simulator; an inference layer of GPU nodes running a TensorRT FP16 engine (batch 128, ~3 ms); and one learner node using a decoupled A-PPO variant (minibatch 4096, gradient accumulation 8) to hold GPU-util > 90%. Samplers pack (s, a, logp, r, done) into a 48-byte contiguous struct written to a lock-free POSIX-shm queue; inference reads it via CUDA-IPC and writes actions back. Dynamic speed control watches SM and Tensor-Core utilization: below 85% for 5 s, request +10% environments; when the queue exceeds a threshold, raise a backpressure signal and have samplers sleep 1 ms to keep memory bounded. For async learners, aggregate gradients over a K-window (K = 8 ≈ 120 ms at 8 GPUs), drop gradients staler than Δ_max = 80 steps (keeps drop-rate < 3%), and weight by w_i = exp(−0.05·Δ_i). In a 1024-AGV / 256-dim-state task this sustained ~130k frames/s at 92–96% GPU-util, training 100M steps in ~6.5 h. The widget makes the napkin math concrete: feel how the sampler count sets GPU utilization.

9 · Guard in production — robustness testing and safe degradation

Intuition. A policy that is perfect in simulation can fail on a worn gearbox or a dead sensor. You cannot wait for production to discover this; you inject the faults during training and measure how gracefully the policy degrades.

Engineering detail — domain randomization with a curriculum. Take a real parameter like the motor reduction ratio κ: estimate its 6σ credible interval (e.g. [0.92κ₀, 1.08κ₀]) by maximum likelihood + Fisher information, write it to a JSON config, and sample it during training from a mixed distribution — a truncated normal core plus a uniform long tail. Use a curriculum: the first 2M steps sample only the core so the policy learns common deviations, then gradually widen the tail over the next 3M steps until the value-function variance under extreme κ stays < 5%. Train with PPO-Clip + GAE(λ=0.95), and triple the sample weight of any episode whose stopping-accuracy error exceeds 10 mm to force the policy to meet the spec.

Worst-case return, fault injection, and graceful degradation

Worst-case objective. Define an adversarial fault budget and solve a max-min: maximize the policy's return against an adversary that injects the worst faults within budget; an in-loop adversarial critic approximates the inner inf so every update targets the worst case rather than re-estimating it after the fact. Before release, run 10⁵-scale random + directed fault injection in a digital twin and require the 5th-percentile return to stay above the safety red line. Degradation tiers. Embed a Safety-Critic Q_safe that emits a real-time safety level; below a calibrated threshold it triggers tiers — mild: cap speed at 60% and lateral acceleration < 2 m/s²; medium: hazards on, move to the rightmost lane within 5 s; severe: emergency brake at 0.9 g to a stop. Quantify it. Inject burst packet-loss on LiDAR, thermal drift on the IMU, stuck-at-0 on the camera at an ISO-26262 random-hardware-failure rate, run 10⁶ km equivalent, and report safety-rate ≥ 99.7%, mean-time-to-safe-fallback ≤ 300 ms, and ΔReturn ≥ −500 vs the nominal policy. A representative report: safety-rate 99.82%, MTTSF 247 ms, ΔReturn 95% CI [−480, −310].

10 · Iterate — the through-line

The through-line

Every section is one row of the MDP table turned into a mechanism: sparse social reward → five perceptible per-step terms; safety → a constraint via PPO-Lagrangian, not a weight; multi-agent conflict → mean-field neighbours + Tarjan deadlock intervention; non-stationary arrivals → sine-cosine state augmentation + change-point-driven γ; coupled timescales → the options framework with potential-based cross-layer shaping; partial observation → a convolutional-GRU belief net; throughput → shm sampling with backpressure and staleness-weighted gradient windows; robustness → curriculum randomization, worst-case training, and tiered safe degradation. You never reached for a tool until a row demanded it. Removing one difficulty exposed the next — which is the whole loop.

Further considerations

The yielding social dilemma. When every vehicle trains with RL, "whoever yields loses" can emerge. Add an altruism coefficient α that folds neighbours' progress reward into your own, steering the fleet toward a system-optimal Nash equilibrium rather than locally greedy gridlock.
Heterogeneous agents and stale beliefs. If neighbours' action spaces differ in dimension, a shared encoder maps them to one latent space before averaging, dodging the curse of dimensionality. Under communication delay (e.g. ~20 ms jitter on 5G), model μ₋ᵢ as a delay-conditioned distribution and use an RNN/Transformer to complete the sequence. Mean-field error scales O(1/√n) in neighbour count, so for n < 10 add a pairwise correction term.
When periodicity is not enough. A bi-weekly promotion cycle would blow up a one-hot augmentation 336×; use a hierarchical MDP where a slow option-level variable selects sub-policies while the low level stays low-dimensional. For minute-scale spikes (a flash-sale at the top of the hour) an hour-granular embedding is too coarse — a temporal convolutional network over raw timestamps captures multi-scale periodicity. For non-periodic shocks (a sudden lockdown) periodic augmentation fails outright; pair change-point detection with meta-learning fast-finetune, or train with transition-matrix perturbations (robust RL) for unseen regimes.
Dynamic options and re-planning. If a recipient changes the drop address mid-delivery, a "re-planning gate" lets the high level swap options online instead of retraining from scratch. In crowdsourced last-mile delivery, couriers and AGVs share one option space and need a shared β and joint initiation set to avoid conflicts — a centralized semi-MDP-Q can learn the coordination policy.
Safety filter and forced recharge. A regulation requiring "battery < 10% must force a return-to-charge" is enforced by a safety-filter layer that zeros illegal-action probabilities and renormalizes after the policy head — this leaves the policy gradient untouched yet can drive the violation rate from ~0.7% to 0. Transfer to a cold-chain sorting site (priority redefined from order-deadline to temperature-control class) by retraining only the embedding layer and freezing the CNN + attention, reusing the policy with ~5k new samples.