Resource scheduling — cloud computing to logistics
Which job goes on which machine next? Which stop does the truck visit next? These are sequential decisions over a combinatorial action space, and they are everywhere — cluster schedulers, bin-packers, delivery routers. This lesson is about why the value-based argmax we leaned on in the foundations breaks here, and why policy gradient with a structured policy is the move that makes RL practical for combinatorial optimization.
- The MDP framing — lesson 01: state, action, reward, next state. We'll show that "schedule these jobs" is a perfectly ordinary sequential decision process — it just has a colossal discrete action space.
- Policy gradient — lesson 10: ∇θ J = 𝔼[ Σt ∇θ log πθ(at|st) · At ]. Because PG never needs to enumerate or argmax over actions — it only needs the log-probability of the action it actually took — it survives action spaces that value-based methods choke on.
Three problems, one shape
Cloud scheduling, bin-packing, and vehicle routing look like different fields. Structurally they are the same MDP: a stream of items, a set of resources, and a sequence of "assign this to that" decisions whose order and grouping determine the total cost.
| Problem | State s | Action a | Reward (to maximize) |
|---|---|---|---|
| Job scheduling on machines | pending jobs + current load of each machine | place next job on a machine | − makespan (time the last machine finishes) |
| Bin-packing | items left + remaining space in each bin | put next item in a bin | − number of bins opened |
| Vehicle routing / delivery | stops left + truck's current location | drive to the next stop | − total distance / latency |
In every row the reward is a negative cost, and — crucially — it is only fully known at the end: you don't know the makespan until the last job lands, you don't know the route length until the truck returns. This is the delayed, terminal-reward regime that the return Gt = Σ γk rt+k from lesson 01 was built for. The classic operations-research framing is to solve these as one-shot integer programs; the RL framing is to build the solution one decision at a time and let a learned policy pick each decision.
arriving jobs: [#3] [#5] [#2] [#8] ...
|
v action a_t = "which machine?"
┌────────┬────────┬────────┐
machine │ M0 │ M1 │ M2 │ state s_t = current loads + next job size
loads │ ████ │ ██ │ ███████│ reward at end = − max(load) (the makespan)
└────────┴────────┴────────┘
Why value-based argmax breaks
Recall the value-based recipe from lesson 05: learn Q(s,a), then act by a* = argmaxa Q(s,a). That single argmax is the hidden assumption — it presumes you can enumerate every action and score it. In a gridworld with four moves, fine. In scheduling, the action space detonates.
There are two escapes, and they map onto the fork from orientation:
- Factor the decision. Don't pick the whole assignment at once — pick one job-to-machine step at a time. Now each step's action space is just "which of the m machines," small enough to argmax. But the state still has to summarize a variable-length set of remaining jobs, and greedy per-step argmax of a myopic value is exactly the hand-crafted heuristic we're trying to beat (see below).
- Parameterize the policy directly. This is the lesson 10 move. A policy πθ(a|s) outputs a distribution over the next choice. Policy gradient never enumerates the full action space — the gradient estimator only ever evaluates log πθ at the action you sampled. The combinatorial space is traversed by sampling a sequence of cheap local choices, never by scoring all of it.
Pointer networks: a policy that outputs a sequence of choices
Here is the structural twist that makes PG work for combinatorial problems. The action set isn't fixed — it's the set of remaining items, which shrinks as you decide. A standard softmax head with a fixed number of outputs can't handle "choose one of the inputs, whose count changes." The fix (Vinyals et al., 2015) is a pointer network: instead of scoring a fixed vocabulary, the policy uses attention to point at one of the current candidate items.
The query encodes "what I've committed to so far" (the partial schedule / partial route); each itemi is a candidate next choice; the softmax over the current candidates gives a valid distribution no matter how many remain. Roll this autoregressively and the model emits a full sequence of choices — a complete schedule or tour — one pointer at a time. The whole construction factorizes exactly like the trajectory log-prob from lesson 10:
and so the policy-gradient theorem applies verbatim. Train it by REINFORCE with a baseline (Bello et al., 2016, "Neural Combinatorial Optimization"): sample a tour, measure its negative cost as the return, push up the log-probability of the choices that produced cheaper-than-baseline solutions. No labels, no expert solutions — just a cost function you can evaluate. That is the whole appeal: the reward is computable (you can always measure a makespan), even when the optimal solution is unknown and the action space is astronomical.
The appeal: learned heuristics beat hand-crafted ones
For decades these problems were solved with hand-crafted heuristics: "shortest-processing-time first," "best-fit decreasing," "nearest-neighbor tour." Each is a human guess at a good per-step rule. A PG-trained policy is also a per-step rule — but it's learned from the actual workload distribution. If your cluster's jobs are bimodal (a few huge jobs among many tiny ones), the policy can learn to reserve a machine for the giants instead of greedily load-balancing them into a straggler. DeepMind's data-center work and the Decima Spark scheduler (Mao et al., 2019) showed exactly this: a learned policy adapting to the workload beat the tuned heuristic. The policy is not solving one instance — it's amortizing across the whole distribution of instances it will face.
Interactive · learned packing vs greedy load-balancing
Jobs of varying size arrive in a stream. You must place each on one of three machines as it arrives; you can't see the future. The KPI is makespan — when the busiest machine finishes — so lower is better, and you want all three machines to finish at nearly the same time (high, even utilization). Two policies place the same job stream:
- Greedy load-balance — the natural hand-crafted heuristic: always drop the next job on the currently least-loaded machine. The bug-is-the-lesson: being myopically fair creates stragglers — when a big job arrives late, greedy has already filled every machine evenly and is forced to stack the giant on top of a full machine, spiking the makespan.
- Learned packing — a policy that can see the pending jobs (they're part of the state — see the table above) and has learned the workload tends to spike late. While a big job is still queued, it keeps one machine's headroom in reserve as a landing pad and packs the small jobs onto the others; once the last big job has been placed, it reverts to greedy. The late giants land on the reserve instead of being stacked on top of an already-balanced cluster. Same jobs, lower makespan.
Notice what the widget is really showing: the greedy rule is a fixed per-step argmax of a myopic value ("minimize load right now"). The learned rule conditions on the same state but encodes knowledge of the distribution — that big jobs are coming — which no greedy single-step value can see. Crank the big-job fraction and you are widening the gap between "optimize this step" and "optimize the trajectory." That gap is exactly what policy gradient, optimizing the terminal return rather than a per-step proxy, is built to close.
Mapping back to the spine
Hold the value / policy / model map from orientation in your head and place each piece:
| Scheduling piece | Core idea | Place on the map |
|---|---|---|
| "Assign jobs / route trucks" as a stream of decisions | Sequential decision MDP (L01) | ordinary MDP — state, action, terminal reward |
| Action = which machine / which stop next | Huge discrete action space | where value-based argmax breaks (mn, n!) |
| Build the solution one choice at a time | Pointer-network policy πθ | policy, learned directly — never enumerates actions |
| Train on sampled solutions by cost | REINFORCE + baseline (L07) | policy gradient over a combinatorial action space |
| Reward = −makespan / −cost / −latency | Verifiable terminal reward | the signal — exact, no reward model needed |
| Beat the hand-crafted heuristic | Amortize over the workload distribution | the payoff: a learned per-step rule that sees the whole distribution |
Read top to bottom, scheduling answers one question: how do you do RL when the action space is too big to argmax? You refuse to enumerate it. The MDP framing of lesson 01 says the problem is a sequence of decisions; the policy gradient of lesson 10 says you can train a policy on that sequence using only the log-probability of the choices you actually sampled. The pointer network is the architecture that makes "choose one of a changing set" expressible, and the negative cost is a reward you can always compute. Combinatorial optimization, which looked like a different universe from gridworlds and bandits, is just policy gradient pointed at a very large discrete action space.