Smart-manufacturing scheduling

A factory floor is a sequential decision problem: every shift, an algorithm decides which job runs on which machine, in what order, and when. The MDP writes itself, but four difficulties bind hard and all at once. The action space is a huge, structured, mostly-illegal set governed by process-precedence constraints. Sequences of individually legal actions can still deadlock shared resources. The state hides a variable that cannot be measured online (tool wear). And the reward is a multi-objective trade-off with hard constraints (delivery rate against changeover cost against worker overtime), where naïve weighting silently violates the one limit you cannot cross. Each difficulty names a tool.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. On a factory floor several bind together, so we run the loop on each.

1 · Formulate — the MDP behind a scheduler

Intuition. A dispatcher looks at the shop (open orders, machine states, AGV positions), picks the next job-to-machine assignment, and learns later whether the order shipped on time and what it cost. That is an MDP: the shop snapshot is the state, the dispatch decision is the action, the RMB-denominated delivery-minus-cost is the reward, and the physical plant plus its other machines is the transition.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] s.t. E[ Σₜ cₜ ] ≤ d

Piece	For a shop-floor scheduler	The awkward part
State S	open orders, machine/AGV status, queue lengths, plus tool-wear and remaining-life of each spindle	tool wear is physically unmeasurable online → partially observable; raw feature vector is huge and sparse
Action A	"feed sequence + machine dispatch + AGV path" — which operation, on which machine, next	combinatorial and structured; most assignments are illegal under process precedence, and some legal-looking ones deadlock
Reward R	on-time delivery − changeover cost − energy − inventory penalty	multi-objective with a hard overtime/safety limit; naïve weights either go short-sighted or violate the limit
Transition P	the plant + other workshops + a stream of rush orders inserted mid-episode	non-stationary; order distribution drifts and spikes, so a fixed policy forgets

As always, the rightmost column is the lesson. Sections 2–5 and 7 each take one awkward row and turn it into a named mechanism (the Action row needs two: masking and deadlock defence), and you reach for the mechanism only once the row demands it.

2 · Action — precedence masking, not reward weights

Intuition. A casting must be machined before it is polished; a furnace above 1200 °C must not be switched mid-heat. These are rules of physics and process, not preferences. The tempting shortcut is to leave every action legal and add a big negative reward for violations, but then you are hand-tuning penalty weights against the objective forever, and the agent still occasionally violates while it explores. The clean fix is to make illegal actions impossible to sample: build a precedence mask each step and push the logit of any operation whose predecessors are not yet complete to a large negative value, so its softmax probability is zero. (Setting a logit to 0 would not work: a zero logit still carries positive probability.)

logits = logits.masked_fill(mask == 0, -1e8) → softmax ⇒ P(illegal) ≡ 0
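A minimal pure-Python sketch of the masking step, assuming a hypothetical three-operation job (cast, machine, polish). The helper names precedence_mask and masked_softmax are illustrative, not from the source; in a PyTorch policy the same effect comes from masked_fill on the logits tensor, as in the formula above.

```python
import math

NEG_FILL = -1e8  # same fill value as the formula above


def precedence_mask(predecessors, done):
    """1 for operations that are not done and whose predecessors are all done, else 0."""
    return [
        1 if op not in done and all(p in done for p in preds) else 0
        for op, preds in predecessors.items()
    ]


def masked_softmax(logits, mask):
    """Softmax after filling masked logits with a large negative value."""
    filled = [l if m else NEG_FILL for l, m in zip(logits, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical job: cast -> machine -> polish, with casting already finished.
predecessors = {"cast": [], "machine": ["cast"], "polish": ["machine"]}
done = {"cast"}
mask = precedence_mask(predecessors, done)
probs = masked_softmax([0.2, 1.0, 3.0], mask)
```

Note that polish has the highest raw logit but receives zero probability because machining is not yet complete. If every action were masked, this sketch would return a uniform distribution, so a real dispatcher should treat an all-zero mask as a separate "no legal move" case.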

Engineering detail. The mask is recomputed from the live process-DAG every decision step. For discrete dispatch the mask sits before the categorical head; for continuous control (e.g. choosing a chemical reaction's next-operation start-time curve) you cannot mask logits, so you append a differentiable projection layer: interpret the raw action as a proposed start-time, then run a topological-sort operator that projects any illegal time-offset onto the feasible region. The projection is autograd-friendly, so PPO's clipped objective is unchanged and gradients still flow. Eliminating violations before they happen — rather than penalising them after — is what let one line cut its constraint-violation rate from 0.12% to an absolute 0 while shortening the mean cycle by 11.4% versus manual scheduling, and pass a process-safety audit on the first attempt.
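For the continuous case, one simple feasibility repair is a forward pass in topological order that pushes each proposed start time up to the earliest moment its predecessors have finished. The sketch below assumes fixed operation durations and hypothetical names (project_start_times, durations). It is not the source's operator; a training setup would express the same max operations in an autograd framework so that gradients flow through them. It needs Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter


def project_start_times(proposed, durations, predecessors):
    """Raise each proposed start time to the earliest precedence-feasible value."""
    start = {}
    # static_order() yields every operation after all of its predecessors.
    for op in TopologicalSorter(predecessors).static_order():
        earliest = max(
            (start[p] + durations[p] for p in predecessors.get(op, ())),
            default=0.0,
        )
        start[op] = max(proposed[op], earliest)
    return start


# Hypothetical job; the proposed machining start (2.0) precedes the end of casting (4.0).
predecessors = {"cast": [], "machine": ["cast"], "polish": ["machine"]}
durations = {"cast": 4.0, "machine": 3.0, "polish": 2.0}
proposed = {"cast": 0.0, "machine": 2.0, "polish": 9.0}
feasible = project_start_times(proposed, durations, predecessors)
```

Only the illegal offset moves: machining is delayed to 4.0, while the already-feasible polish start at 9.0 is left untouched.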

When the process graph itself changes

Insertions and machine faults add or remove operations, so the DAG is dynamic. Upgrade the static mask to an online GNN: a GraphSAGE pass over the current process graph re-derives each node's "unlocked / locked" status in milliseconds, refreshing the mask without retraining. In multi-agent shops that share equipment, fuse precedence with an interlock mask (two agents cannot seize the same machine) into one joint mask, and gate its refresh on a deterministic real-time scheduling hook so the update latency stays under 1 ms.
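Fusing the two masks amounts to a logical AND over candidate actions. The sketch below assumes actions are (operation, machine) pairs and that the set of unlocked operations has already been derived from the process graph; joint_mask and all identifiers in the example are hypothetical.

```python
def joint_mask(actions, unlocked_ops, held_machines):
    """1 only if the operation is unlocked AND the target machine is not held."""
    return [
        1 if op in unlocked_ops and machine not in held_machines else 0
        for op, machine in actions
    ]


# Hypothetical candidate actions for one agent.
actions = [("machine", "mill_1"), ("machine", "mill_2"), ("polish", "buffer_1")]
unlocked_ops = {"machine"}    # would come from the process-graph pass
held_machines = {"mill_1"}    # already seized by another agent
mask = joint_mask(actions, unlocked_ops, held_machines)
```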

3 · Deadlock — the legal action that strands the whole floor

Intuition. Masking enforces local legality, but a sequence of individually-legal moves can still wedge the system: AGV A holds aisle 1 waiting for aisle 2, AGV B holds aisle 2 waiting for aisle 1, nobody releases, throughput goes to zero. No single action looks wrong — the trap is a global resource cycle. So you defend at four layers, from soft to hard.
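The AGV example is a cycle in a wait-for graph, which can be found by following "waits for the holder of" links until an agent repeats. This sketch assumes each agent holds at most one resource and waits for at most one; find_deadlock and the AGV and aisle names are illustrative.

```python
def find_deadlock(holds, waits_for):
    """Return the agents forming a wait-for cycle, or [] if there is none.

    holds:     agent -> resource it currently holds
    waits_for: agent -> resource it is blocked on
    """
    owner = {resource: agent for agent, resource in holds.items()}
    for start in waits_for:
        path, seen = [], set()
        agent = start
        while agent is not None and agent not in seen:
            seen.add(agent)
            path.append(agent)
            # Next hop: whoever holds the resource this agent is waiting for.
            agent = owner.get(waits_for.get(agent))
        if agent is not None:  # we came back to an agent already on the path
            return path[path.index(agent):]
    return []


holds = {"AGV_A": "aisle_1", "AGV_B": "aisle_2"}
cycle = find_deadlock(holds, {"AGV_A": "aisle_2", "AGV_B": "aisle_1"})
no_cycle = find_deadlock(holds, {"AGV_A": "aisle_2"})
```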

Engineering detail — the four-piece kit. (1) Reward guidance: a potential-based shaping term Φ steers the agent away from states with low resource slack, without changing the optimal policy. (2) Policy regularisation: penalise policies that drive utilisation into known cycle-prone regions. (3) State augmentation: feed held-vs-waited resource flags into the state so the value function can see an incipient cycle. (4) System-level rollback & simulation: a deadlock-detector thread watches for repeated states with no resource release; on trip it rolls back to the most recent safe snapshot and writes the offending trajectory into the replay buffer with negative weight, forcing the Q-network to mark that path as bad. In parallel, a Petri-net offline analysis enumerates the high-frequency policy paths and pre-labels unreachable/unsafe states, so online you intercept them with a microsecond table lookup. This four-piece kit — "reward guidance + policy regularisation + state augmentation + system rollback" — drove the deadlock rate on an overhead-rail panel-transport system from 0.7% to 0, while raising mean transport tempo by 4.3%.
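Piece (1) of the kit has a standard form: the shaped reward is r + γ·Φ(s′) − Φ(s), which is the construction that leaves the optimal policy unchanged. The sketch below assumes a hypothetical potential equal to the fraction of shared resources still free; the source does not specify Φ, and the state dictionary layout is invented for illustration.

```python
def potential(state):
    """Hypothetical potential: fraction of shared resources still free."""
    return state["free"] / state["total"]


def shaped_reward(reward, state, next_state, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


# Moving from 2 of 4 free resources to 1 of 4 lowers slack, so the bonus is negative.
before = {"free": 2, "total": 4}
after = {"free": 1, "total": 4}
r_shaped = shaped_reward(1.0, before, after)
```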

Two failure modes the deadlock guard must survive

(1) Multi-agent deadlock. When several agents compete for shared resources, each agent's individually-"optimal" policy can still trigger a system-level lock. A per-agent safety layer is not enough — you need a centralised safety value function or an explicit consensus protocol over resource acquisition order. (2) Continual-learning drift. Once the plant's parameters drift online, a potential function Φ tuned offline can go stale and stop repelling cycles. Use meta-learning to re-fit Φ quickly, or robust-RL that models resource-availability uncertainty as an adversarial disturbance, so the guard degrades gracefully instead of silently failing.

4 · State — estimating the hidden variable (tool wear)

Intuition. The single most decision-relevant quantity — how worn the cutting tool is — has no online sensor. You only ever see proxies (spindle current, vibration, acoustic emission), and the true wear is a hidden state you must infer. That makes the problem a POMDP: maintain a belief over wear, and act on the belief, not a point estimate.
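To make "maintain a belief" concrete, here is a scalar Kalman-filter sketch that tracks a Gaussian belief (mean and variance) over wear. It assumes a linear-Gaussian model in which a single proxy reading equals gain_h times wear plus noise; that is a simplification chosen for illustration, and every number below is made up. A learned belief encoder would replace both functions in practice.

```python
def predict(mean, var, wear_rate, process_var):
    """Time update: wear grows by wear_rate and uncertainty widens."""
    return mean + wear_rate, var + process_var


def update(mean, var, z, gain_h, obs_var):
    """Measurement update from one proxy reading z = gain_h * wear + noise."""
    k = var * gain_h / (gain_h * gain_h * var + obs_var)  # Kalman gain
    new_mean = mean + k * (z - gain_h * mean)
    new_var = (1.0 - k * gain_h) * var
    return new_mean, new_var


# Illustrative numbers in mm (mean) and mm^2 (variances).
mean, var = predict(0.10, 0.0004, wear_rate=0.01, process_var=0.0001)
mean, var = update(mean, var, z=0.13, gain_h=1.0, obs_var=0.0005)
```

The policy then conditions on both numbers, so a confident 0.12 mm and an uncertain 0.12 mm can lead to different actions.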

Engineering detail: digital-twin pretrain, on-site finetune. Negative samples (a tool actually breaking) are far too rare to learn from in production. So generate them in simulation: a physics-based material model (Johnson–Cook constitutive law) synthesises ~10,000 wear curves across operating conditions, and the belief-encoder + actor pretrain on that. On the real machine, a short 5-minute online adaptation finetunes with the learning rate decayed 1e-4 → 1e-5 so the network adapts without catastrophic forgetting of the twin's priors. Deploy as a lightweight ONNX model on an edge box at <8 ms inference, and bolt on a control-barrier-function safety layer: the moment the wear-belief uncertainty (standard deviation, in the same units as the wear estimate) exceeds 0.05 mm, trigger a feed-hold to protect tool and operator. Over a six-month batch validation this cut tool-wear estimation error from a traditional empirical model's 0.12 mm to 0.028 mm and dropped unexpected tool-breakage events by 73% with no added downtime.

Compressing the rest of the state. The non-hidden part of the state is still huge and sparse. Two moves keep it learnable: drop near-constant / one-to-one features (clear physical meaning keeps the scheduling team able to interpret it), then jointly train an end-to-end representation — a Transformer encoder maps the remaining high-dimensional time-series into a 64-dim latent fed to PPO, with an information-bottleneck regulariser (β-VAE form) holding mutual information I(s;z) below ~25 bits to prevent over-fitting. Before promotion, run a policy-equivalence check: on thousands of replayed fault trajectories the action-difference rate between full and compressed state must stay <2% and recovery-time difference <0.1 s.
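The promotion gate can be sketched as a small check over replayed trajectories. The text does not say whether the recovery-time difference is a mean or a worst case; this sketch assumes the worst case, and the function name and sample data are hypothetical.

```python
def policy_equivalent(actions_full, actions_compressed,
                      recovery_full, recovery_compressed,
                      max_action_diff=0.02, max_recovery_diff_s=0.1):
    """Compare full-state and compressed-state policies on replayed trajectories."""
    mismatches = sum(a != b for a, b in zip(actions_full, actions_compressed))
    action_diff_rate = mismatches / len(actions_full)
    # Assumption: the 0.1 s bound applies to the worst-case gap, not the mean.
    worst_gap = max(abs(x - y) for x, y in zip(recovery_full, recovery_compressed))
    passed = action_diff_rate < max_action_diff and worst_gap < max_recovery_diff_s
    return passed, action_diff_rate, worst_gap


# Made-up replay: 1 differing action out of 100, recovery gaps of at most 0.05 s.
actions_full = ["hold"] * 99 + ["dispatch"]
actions_compressed = ["hold"] * 100
passed, rate, gap = policy_equivalent(
    actions_full, actions_compressed, [2.0, 3.5, 1.2], [2.05, 3.45, 1.2]
)
```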

5 · Reward — the multi-objective trade-off with a hard limit

Intuition. Delivery, cost, and people pull against each other. Push delivery and you switch product mix constantly (each changeover costs real money) and you burn worker overtime (which has a legal cap). The classic mistake is to fold everything into one weighted sum: tune the weights once, ship, and discover the agent maximised the soft objective by quietly blowing through the hard overtime ceiling. The right frame is a constrained MDP — maximise return subject to a hard limit — solved with a Lagrangian multiplier the algorithm tunes itself.

Engineering detail: changeover cost as state, then as penalty. First make changeover cost visible: align with operations on the average cost of one switch (say 12,000 RMB) and write it straight into the reward, R_t = output_t − 12000·𝟙[a_t ≠ a_{t−1}]. Crucially, augment the state with the previous action a_{t−1}. That adds exactly one dimension, so historical data is still reusable and you need no fresh data collection. With continuous actions, use PPO-Clip with a cost penalty in the policy loss, λ initialised at the switch cost and auto-tuned against the daily financial report until switches fall to ≤ 2 per shift. A furnace above 1200 °C must never switch: enforce that with an action mask (logits → −10⁸), not a penalty. On an 8-million-tonne/yr hot-rolling line this cut changeover count by 42% and per-tonne cost by 3.8 RMB.
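The reward and the one-dimension state augmentation are small enough to write out directly. The 12,000 RMB figure is the example value from the text; the function names and the sample product grades are hypothetical.

```python
SWITCH_COST_RMB = 12_000  # example average cost of one changeover, from the text


def changeover_reward(output_rmb, action, prev_action):
    """Output value minus the switch cost whenever the product changes."""
    return output_rmb - SWITCH_COST_RMB * (action != prev_action)


def augment_state(state, prev_action_id):
    """Append the previous action as one extra state dimension."""
    return tuple(state) + (prev_action_id,)


r_switch = changeover_reward(50_000, "grade_B", "grade_A")
r_stay = changeover_reward(50_000, "grade_A", "grade_A")
s_aug = augment_state((0.3, 0.7), 2)
```

Because the previous action is already present in logged trajectories, the augmented state can be rebuilt from historical data without new collection.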

The hard limit — PPO-Lagrangian. For the overtime cap, define cost c per forced-overtime unit and the constraint E[c] ≤ 36 h/month/worker. Solve the constrained objective with a self-tuning multiplier updated every few steps:

λ_{k+1} = max(0, λ_k + α·(ĉ − 36)) , maximise E[r_b] − λ·E[c]

where ĉ is sliding-window mean overtime. λ starts near 0.1 and rises automatically until the constraint is met — no manual weight-tuning. Shape the cost into a soft wall with an exponential, c′ = 1.2^c − 1, so gradients steepen near the cap, and terminate any episode that exceeds it with a large negative penalty so the agent stops probing past the limit. Live, this held GMV flat (+0.8%) while cutting monthly per-head overtime from 41.3 h to 33.7 h, dropping violators by 62%; the multiplier settled around 0.42 — the constraint is "tightly bound," the signature of a healthy constrained optimum.
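A sketch of the multiplier update with a sliding-window estimate of overtime and the exponential cost shaping. The clamp at zero is the usual dual-ascent projection and is an assumption here, as are the step size α = 0.01, the 30-sample window, and the class and method names.

```python
from collections import deque

OVERTIME_CAP_H = 36.0  # hours per month per worker, from the text


def shaped_cost(c):
    """Soft wall: c' = 1.2**c - 1, which steepens as c grows."""
    return 1.2 ** c - 1.0


class Multiplier:
    def __init__(self, lam=0.1, alpha=0.01, window=30):
        self.lam = lam
        self.alpha = alpha                   # assumed step size
        self.history = deque(maxlen=window)  # sliding window of overtime samples

    def step(self, overtime_h):
        """Dual ascent: raise lam while mean overtime exceeds the cap."""
        self.history.append(overtime_h)
        c_hat = sum(self.history) / len(self.history)
        self.lam = max(0.0, self.lam + self.alpha * (c_hat - OVERTIME_CAP_H))
        return self.lam

    def objective(self, reward, cost):
        """Penalised objective handed to the policy update."""
        return reward - self.lam * cost


m = Multiplier()
lam_after = m.step(41.3)  # overtime above the cap pushes lam upward
```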

6 · The delivery-vs-overtime trade-off — feel the multiplier

Intuition. The whole constrained-RL story collapses to one feeling: as you push the delivery target harder, overtime climbs, and the Lagrangian multiplier rises to hold the cap. Past a point, though, the cap simply cannot be met without slashing throughput. The napkin math is to push the delivery target and track three things: overtime, the auto-tuned λ, and whether you stay inside the 36-hour ceiling.

7 · Guard — drift, forgetting, and rollback in production

Intuition. A factory is non-stationary: rush orders spike, product mix shifts, machines age. A policy that was optimal last month silently rots. Production safety is therefore three reflexes — detect drift, resist forgetting when you adapt, and roll back instantly when a metric crosses a red line.

Engineering detail. Detect: monitor the KL divergence between the online state distribution and the offline training distribution; when KL exceeds ~0.1, auto-trigger incremental finetuning so the compressed representation stays valid. For rush orders, split detection from adjustment: a lightweight temporal-CNN flags urgent orders (an order enters the re-queue only if urgency p > 0.8), then a constrained-MDP solver (SAC with a Lagrangian λ that rises whenever the cost estimate goes positive) re-prioritises while holding the on-time rate at ~98%. Resist forgetting when you online-update: PopArt-normalise the value head to lock old-task return scale, hold the target network at Polyak 0.995 (a soft update whose effective averaging window is about 1/(1 − 0.995) = 200 updates), and add online EWC, a recent-window Fisher diagonal penalty (λ≈10⁴) that cut old-task degradation from 8.7% to 1.2% for <6% extra training time. Roll back: ship every update by shadow mode first (the RL policy runs in parallel with the incumbent rule engine; only promote after it beats the rule by ≥5% for three straight days) and gate gray-release on metrics: divert 5% of flow, and if the violation rate rises >0.2 pp, revert immediately to the last compliant version snapshot.
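The detect and roll-back reflexes reduce to three threshold checks. The sketch assumes the state distributions are summarised as histograms over the same bins and that KL is measured in nats from online to offline; those choices, the function names, and the sample numbers are assumptions, while the thresholds (0.1, 5% for three days, 0.2 pp) come from the text.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two histograms over the same bins."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def drifted(online_hist, offline_hist, threshold=0.1):
    """True when the online state distribution has moved too far from training."""
    return kl_divergence(online_hist, offline_hist) > threshold


def should_promote(daily_uplifts, min_uplift=0.05, days=3):
    """Promote only after beating the rule engine on each of the last `days` days."""
    recent = daily_uplifts[-days:]
    return len(recent) == days and all(u >= min_uplift for u in recent)


def should_rollback(baseline_violation_pct, canary_violation_pct, max_rise_pp=0.2):
    """Revert when the canary's violation rate rises by more than max_rise_pp."""
    return canary_violation_pct - baseline_violation_pct > max_rise_pp


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]  # made-up post-drift histogram
drift_flag = drifted(skewed, uniform)
```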

The through-line

Every section is one row of the MDP table turned into a mechanism: structured/illegal action → precedence + interlock masking and a differentiable topological projection; resource cycles → the four-piece deadlock kit with Petri-net pre-labelling; hidden tool wear → POMDP belief, digital-twin pretrain, CBF feed-hold; multi-objective reward with a hard cap → constrained MDP solved by self-tuning PPO-Lagrangian; non-stationary order stream → KL-drift detection, PopArt + EWC anti-forgetting, shadow-mode rollback. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations

Distributed multi-workshop MARL. Each workshop is an agent; the joint action is "feed rhythm + machine dispatch + AGV path." Use hierarchical multi-agent PPO — a meta-agent decides cross-workshop load balance every ~30 min on plant-level KPIs, while shop-floor agents decide machine/AGV moves every ~100 ms on local state, communicating only a target load-rate. A shared global critic solves credit assignment, and parameter-sharing with role embeddings shrank one model from 200 MB to 38 MB so it runs on an edge box. When workshops belong to different legal entities, reward-split with Shapley values turns the objective into negotiable "internal profit" and the policy converges near a Nash equilibrium.
Communication-limited consistency. Under bandwidth or packet-loss limits, keep agents' policies aligned with a three-layer defence: a delay-tolerant Bellman operator that writes the communication lag τ into the TD-error (convergence slows only by a τ²-order constant); a federated-distillation layer that skips uploads when KL to the last global policy < 0.01, else Top-10% gradient sparsification + INT8 quantisation (≈20× less traffic); and an event-triggered gossip-consensus critic whose trigger threshold ε tracks the topology's algebraic connectivity, bounding policy divergence δ even at 5% packet loss.
Discount factor as a delay-penalty knob. γ encodes how much late delivery is punished: estimate it from the task's effective horizon (≈200 steps ⇒ 1/(1−γ) ≈ 200 ⇒ γ ≈ 0.995), then grid-search {0.98, 0.99, 0.995, 0.999} watching value-variance and policy entropy. If Q-values diverge, lower γ by 0.005 before touching the learning rate. For finance-aligned audits lock γ to the risk-free discount factor so the RL objective matches NPV; γ = 1 is only valid for finite-horizon, zero-terminal-reward tasks.
Multi-process shared policy. When one line machines both titanium blades and aluminium housings with very different wear dynamics, train a context-based meta-RL policy π(a | s, c) with material coefficients as context c, for zero-shot adaptation across part types — and couple wear estimation to remaining-useful-life via hierarchical RL (an upper policy decides "replace tool?", a lower policy decides "how to measure?") to minimise total maintenance cost.
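Returning to the discount-factor rule of thumb in the list above, the horizon arithmetic is quick to check. The function names are illustrative; the grid is the one quoted in the text.

```python
def gamma_from_horizon(steps):
    """Invert 1 / (1 - gamma) = steps."""
    return 1.0 - 1.0 / steps


def effective_horizon(gamma):
    """Approximate number of steps the agent looks ahead."""
    return 1.0 / (1.0 - gamma)


gamma = gamma_from_horizon(200)
grid = {g: round(effective_horizon(g)) for g in (0.98, 0.99, 0.995, 0.999)}
```

The grid spans effective horizons of 50, 100, 200 and 1000 steps, so each step up in γ at least doubles how far ahead a late delivery is felt.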