all lessons / reinforcement learning / 61 · Resource scheduling lesson 61 / 87

Resource scheduling — cloud computing to logistics

Which job goes on which machine next? Which stop does the truck visit next? These are sequential decisions over a combinatorial action space, and they are everywhere — cluster schedulers, bin-packers, delivery routers. This lesson is about why the value-based argmax we leaned on in the foundations breaks here, and why policy gradient with a structured policy is the move that makes RL practical for combinatorial optimization.

What this lesson reuses
This is an applications lesson — it invents no new algorithm. It is the combinatorial-optimization payoff of two ideas we already built:

Three problems, one shape

Cloud scheduling, bin-packing, and vehicle routing look like different fields. Structurally they are the same MDP: a stream of items, a set of resources, and a sequence of "assign this to that" decisions whose order and grouping determine the total cost.

ProblemState sAction aReward (to maximize)
Job scheduling on machinespending jobs + current load of each machineplace next job on a machine makespan (time the last machine finishes)
Bin-packingitems left + remaining space in each binput next item in a bin number of bins opened
Vehicle routing / deliverystops left + truck's current locationdrive to the next stop total distance / latency

In every row the reward is a negative cost, and — crucially — it is only fully known at the end: you don't know the makespan until the last job lands, you don't know the route length until the truck returns. This is the delayed, terminal-reward regime that the return Gt = Σ γk rt+k from lesson 01 was built for. The classic operations-research framing is to solve these as one-shot integer programs; the RL framing is to build the solution one decision at a time and let a learned policy pick each decision.

arriving jobs:   [#3]  [#5]  [#2]  [#8]  ...
                   |
                   v   action a_t = "which machine?"
        ┌────────┬────────┬────────┐
machine │  M0    │  M1    │  M2    │     state s_t = current loads + next job size
 loads  │ ████   │ ██     │ ███████│     reward at end = − max(load)   (the makespan)
        └────────┴────────┴────────┘

Why value-based argmax breaks

Recall the value-based recipe from lesson 05: learn Q(s,a), then act by a* = argmaxa Q(s,a). That single argmax is the hidden assumption — it presumes you can enumerate every action and score it. In a gridworld with four moves, fine. In scheduling, the action space detonates.

Combinatorial explosion — the arithmetic
Assigning n jobs to m machines is mn joint assignments. For a routing tour over n stops there are n! orders. With n=30 stops that is 30! ≈ 2.6 × 1032 — more tours than atoms in your body. A Q-network that outputs one value per action would need an output layer with 1032 heads. You cannot store it, you cannot argmax over it, you cannot even visit a millionth of it. Value-based RL, as stated, is dead on arrival.

There are two escapes, and they map onto the fork from orientation:

Pointer networks: a policy that outputs a sequence of choices

Here is the structural twist that makes PG work for combinatorial problems. The action set isn't fixed — it's the set of remaining items, which shrinks as you decide. A standard softmax head with a fixed number of outputs can't handle "choose one of the inputs, whose count changes." The fix (Vinyals et al., 2015) is a pointer network: instead of scoring a fixed vocabulary, the policy uses attention to point at one of the current candidate items.

πθ(at = itemi | st) = softmaxi( scoreθ(queryt, itemi) )

The query encodes "what I've committed to so far" (the partial schedule / partial route); each itemi is a candidate next choice; the softmax over the current candidates gives a valid distribution no matter how many remain. Roll this autoregressively and the model emits a full sequence of choices — a complete schedule or tour — one pointer at a time. The whole construction factorizes exactly like the trajectory log-prob from lesson 10:

log πθ(solution | problem) = Σt log πθ(at | a<t, problem)

and so the policy-gradient theorem applies verbatim. Train it by REINFORCE with a baseline (Bello et al., 2016, "Neural Combinatorial Optimization"): sample a tour, measure its negative cost as the return, push up the log-probability of the choices that produced cheaper-than-baseline solutions. No labels, no expert solutions — just a cost function you can evaluate. That is the whole appeal: the reward is computable (you can always measure a makespan), even when the optimal solution is unknown and the action space is astronomical.

Where the reward comes from — and why it's clean here
Unlike RLHF (lesson 16), scheduling needs no learned reward model and no human labels. The reward is a verifiable number: simulate the schedule, read off makespan; trace the route, read off distance. This puts scheduling in the same comfortable regime as the verifiable-reward tasks the sibling systems course trains on — the signal is exact, only the policy is hard.

The appeal: learned heuristics beat hand-crafted ones

For decades these problems were solved with hand-crafted heuristics: "shortest-processing-time first," "best-fit decreasing," "nearest-neighbor tour." Each is a human guess at a good per-step rule. A PG-trained policy is also a per-step rule — but it's learned from the actual workload distribution. If your cluster's jobs are bimodal (a few huge jobs among many tiny ones), the policy can learn to reserve a machine for the giants instead of greedily load-balancing them into a straggler. DeepMind's data-center work and the Decima Spark scheduler (Mao et al., 2019) showed exactly this: a learned policy adapting to the workload beat the tuned heuristic. The policy is not solving one instance — it's amortizing across the whole distribution of instances it will face.

Interactive · learned packing vs greedy load-balancing

Jobs of varying size arrive in a stream. You must place each on one of three machines as it arrives; you can't see the future. The KPI is makespan — when the busiest machine finishes — so lower is better, and you want all three machines to finish at nearly the same time (high, even utilization). Two policies place the same job stream:

Job scheduler — greedy load-balance vs learned packing
Same arriving job stream fed to both policies. Watch the machine bars fill. The dashed line on each panel is that policy's makespan (the tallest bar). The "big-job fraction" knob makes the workload spikier — crank it up and greedy's straggler problem gets worse, while the learned policy's reserve pays off.
Jobs placed
0 / 0
Greedy makespan
Learned makespan
Learned advantage
Show the two placement policies (≈18 lines)
// Greedy load-balance: always the currently least-loaded machine.
function greedyPick(loads) {
  let best = 0;
  for (let m = 1; m < loads.length; m++)
    if (loads[m] < loads[best]) best = m;
  return best;                       // myopically fair -> creates stragglers
}

// Learned packing: the policy sees the PENDING jobs (part of the state), so it
// knows whether a big job is still queued. While one is, it protects the
// emptiest machine as a landing pad and packs small jobs onto the 2nd-emptiest;
// once the last big job has passed it reverts to greedy. A PG-trained pointer
// net converges to this kind of rule; we hard-code it for reproducibility.
function learnedPick(loads, jobSize, bigStillComing) {
  if (jobSize >= BIG_THRESH) return greedyPick(loads);  // big -> least-loaded
  if (bigStillComing) {                                  // keep a pad in reserve
    const order = [0,1,2].sort((a,c) => loads[a] - loads[c]);
    return order[1];                                     // pack onto 2nd-emptiest
  }
  return greedyPick(loads);                              // no more bigs -> greedy
}

Notice what the widget is really showing: the greedy rule is a fixed per-step argmax of a myopic value ("minimize load right now"). The learned rule conditions on the same state but encodes knowledge of the distribution — that big jobs are coming — which no greedy single-step value can see. Crank the big-job fraction and you are widening the gap between "optimize this step" and "optimize the trajectory." That gap is exactly what policy gradient, optimizing the terminal return rather than a per-step proxy, is built to close.

Mapping back to the spine

Hold the value / policy / model map from orientation in your head and place each piece:

Scheduling pieceCore ideaPlace on the map
"Assign jobs / route trucks" as a stream of decisionsSequential decision MDP (L01)ordinary MDP — state, action, terminal reward
Action = which machine / which stop nextHuge discrete action spacewhere value-based argmax breaks (mn, n!)
Build the solution one choice at a timePointer-network policy πθpolicy, learned directly — never enumerates actions
Train on sampled solutions by costREINFORCE + baseline (L07)policy gradient over a combinatorial action space
Reward = makespan / cost / latencyVerifiable terminal rewardthe signal — exact, no reward model needed
Beat the hand-crafted heuristicAmortize over the workload distributionthe payoff: a learned per-step rule that sees the whole distribution

Read top to bottom, scheduling answers one question: how do you do RL when the action space is too big to argmax? You refuse to enumerate it. The MDP framing of lesson 01 says the problem is a sequence of decisions; the policy gradient of lesson 10 says you can train a policy on that sequence using only the log-probability of the choices you actually sampled. The pointer network is the architecture that makes "choose one of a changing set" expressible, and the negative cost is a reward you can always compute. Combinatorial optimization, which looked like a different universe from gridworlds and bandits, is just policy gradient pointed at a very large discrete action space.

Takeaway
Job scheduling, bin-packing, and vehicle routing are sequential-decision MDPs (L01) with action spaces so large — mn assignments, n! tours — that the value-based argmax of L02 is impossible to even represent. The escape is to parameterize the policy directly (L07): a pointer-network πθ emits the solution one choice at a time, trained by REINFORCE on the negative cost (makespan / distance) — a reward you can always measure. Policy gradient wins because it never enumerates the action space; it only needs log πθ at the action it sampled. The widget shows the payoff: greedy load-balancing is myopically fair and creates stragglers when a big job arrives late, while a learned policy that reserves headroom for big jobs packs tighter and finishes sooner — a learned heuristic beating the hand-crafted one by seeing the whole workload distribution. Scheduling = policy gradient over a combinatorial action space.