Resource scheduling — cloud computing to logistics

Which job goes on which machine next? Which stop does the truck visit next? These are sequential decisions over a combinatorial action space, and they are everywhere — cluster schedulers, bin-packers, delivery routers. This lesson is about why the value-based argmax we leaned on in the foundations breaks here, and why policy gradient with a structured policy is the move that makes RL practical for combinatorial optimization.

What this lesson reuses

This is an applications lesson — it invents no new algorithm. It is the combinatorial-optimization payoff of two ideas we already built:

The MDP framing — lesson 01: state, action, reward, next state. We'll show that "schedule these jobs" is a perfectly ordinary sequential decision process — it just has a colossal discrete action space.
Policy gradient — lesson 10: ∇_θ J = 𝔼[ Σ_t ∇_θ log π_θ(a_t|s_t) · A_t ]. Because PG never needs to enumerate or argmax over actions — it only needs the log-probability of the action it actually took — it survives action spaces that value-based methods choke on.

Three problems, one shape

Cloud scheduling, bin-packing, and vehicle routing look like different fields. Structurally they are the same MDP: a stream of items, a set of resources, and a sequence of "assign this to that" decisions whose order and grouping determine the total cost.

Problem	State s	Action a	Reward (to maximize)
Job scheduling on machines	pending jobs + current load of each machine	place next job on a machine	− makespan (time the last machine finishes)
Bin-packing	items left + remaining space in each bin	put next item in a bin	− number of bins opened
Vehicle routing / delivery	stops left + truck's current location	drive to the next stop	− total distance / latency

In every row the reward is a negative cost, and — crucially — it is only fully known at the end: you don't know the makespan until the last job lands, you don't know the route length until the truck returns. This is the delayed, terminal-reward regime that the return G_t = Σ γ^k r_t+k from lesson 01 was built for. The classic operations-research framing is to solve these as one-shot integer programs; the RL framing is to build the solution one decision at a time and let a learned policy pick each decision.

arriving jobs:   [#3]  [#5]  [#2]  [#8]  ...
                   |
                   v   action a_t = "which machine?"
        ┌────────┬────────┬────────┐
machine │  M0    │  M1    │  M2    │     state s_t = current loads + next job size
 loads  │ ████   │ ██     │ ███████│     reward at end = − max(load)   (the makespan)
        └────────┴────────┴────────┘

Why value-based argmax breaks

Recall the value-based recipe from lesson 05: learn Q(s,a), then act by a* = argmax_a Q(s,a). That single argmax is the hidden assumption — it presumes you can enumerate every action and score it. In a gridworld with four moves, fine. In scheduling, the action space detonates.

Combinatorial explosion — the arithmetic

Assigning n jobs to m machines is mⁿ joint assignments. For a routing tour over n stops there are n! orders. With n=30 stops that is 30! ≈ 2.6 × 10³² — more tours than atoms in your body. A Q-network that outputs one value per action would need an output layer with 10³² heads. You cannot store it, you cannot argmax over it, you cannot even visit a millionth of it. Value-based RL, as stated, is dead on arrival.

There are two escapes, and they map onto the fork from orientation:

Factor the decision. Don't pick the whole assignment at once — pick one job-to-machine step at a time. Now each step's action space is just "which of the m machines," small enough to argmax. But the state still has to summarize a variable-length set of remaining jobs, and greedy per-step argmax of a myopic value is exactly the hand-crafted heuristic we're trying to beat (see below).
Parameterize the policy directly. This is the lesson 10 move. A policy π_θ(a|s) outputs a distribution over the next choice. Policy gradient never enumerates the full action space — the gradient estimator only ever evaluates log π_θ at the action you sampled. The combinatorial space is traversed by sampling a sequence of cheap local choices, never by scoring all of it.

Pointer networks: a policy that outputs a sequence of choices

Here is the structural twist that makes PG work for combinatorial problems. The action set isn't fixed — it's the set of remaining items, which shrinks as you decide. A standard softmax head with a fixed number of outputs can't handle "choose one of the inputs, whose count changes." The fix (Vinyals et al., 2015) is a pointer network: instead of scoring a fixed vocabulary, the policy uses attention to point at one of the current candidate items.

π_θ(a_t = item_i | s_t) = softmax_i( score_θ(query_t, item_i) )

The query encodes "what I've committed to so far" (the partial schedule / partial route); each item_i is a candidate next choice; the softmax over the current candidates gives a valid distribution no matter how many remain. Roll this autoregressively and the model emits a full sequence of choices — a complete schedule or tour — one pointer at a time. The whole construction factorizes exactly like the trajectory log-prob from lesson 10:

log π_θ(solution | problem) = Σ_t log π_θ(a_t | a_<t, problem)

and so the policy-gradient theorem applies verbatim. Train it by REINFORCE with a baseline (Bello et al., 2016, "Neural Combinatorial Optimization"): sample a tour, measure its negative cost as the return, push up the log-probability of the choices that produced cheaper-than-baseline solutions. No labels, no expert solutions — just a cost function you can evaluate. That is the whole appeal: the reward is computable (you can always measure a makespan), even when the optimal solution is unknown and the action space is astronomical.

Where the reward comes from — and why it's clean here

Unlike RLHF (lesson 16), scheduling needs no learned reward model and no human labels. The reward is a verifiable number: simulate the schedule, read off −makespan; trace the route, read off −distance. This puts scheduling in the same comfortable regime as the verifiable-reward tasks the sibling systems course trains on — the signal is exact, only the policy is hard.

The appeal: learned heuristics beat hand-crafted ones

For decades these problems were solved with hand-crafted heuristics: "shortest-processing-time first," "best-fit decreasing," "nearest-neighbor tour." Each is a human guess at a good per-step rule. A PG-trained policy is also a per-step rule — but it's learned from the actual workload distribution. If your cluster's jobs are bimodal (a few huge jobs among many tiny ones), the policy can learn to reserve a machine for the giants instead of greedily load-balancing them into a straggler. DeepMind's data-center work and the Decima Spark scheduler (Mao et al., 2019) showed exactly this: a learned policy adapting to the workload beat the tuned heuristic. The policy is not solving one instance — it's amortizing across the whole distribution of instances it will face.

Interactive · learned packing vs greedy load-balancing

Jobs of varying size arrive in a stream. You must place each on one of three machines as it arrives; you can't see the future. The KPI is makespan — when the busiest machine finishes — so lower is better, and you want all three machines to finish at nearly the same time (high, even utilization). Two policies place the same job stream:

Greedy load-balance — the natural hand-crafted heuristic: always drop the next job on the currently least-loaded machine. The bug-is-the-lesson: being myopically fair creates stragglers — when a big job arrives late, greedy has already filled every machine evenly and is forced to stack the giant on top of a full machine, spiking the makespan.
Learned packing — a policy that can see the pending jobs (they're part of the state — see the table above) and has learned the workload tends to spike late. While a big job is still queued, it keeps one machine's headroom in reserve as a landing pad and packs the small jobs onto the others; once the last big job has been placed, it reverts to greedy. The late giants land on the reserve instead of being stacked on top of an already-balanced cluster. Same jobs, lower makespan.

Job scheduler — greedy load-balance vs learned packing

Same arriving job stream fed to both policies. Watch the machine bars fill. The dashed line on each panel is that policy's makespan (the tallest bar). The "big-job fraction" knob makes the workload spikier — crank it up and greedy's straggler problem gets worse, while the learned policy's reserve pays off.

big-job fraction: 0.25

Jobs placed

0 / 0

Greedy makespan

—

Learned makespan

—

Learned advantage

—

Show the two placement policies (≈18 lines)

// Greedy load-balance: always the currently least-loaded machine.
function greedyPick(loads) {
  let best = 0;
  for (let m = 1; m < loads.length; m++)
    if (loads[m] < loads[best]) best = m;
  return best;                       // myopically fair -> creates stragglers
}

// Learned packing: the policy sees the PENDING jobs (part of the state), so it
// knows whether a big job is still queued. While one is, it protects the
// emptiest machine as a landing pad and packs small jobs onto the 2nd-emptiest;
// once the last big job has passed it reverts to greedy. A PG-trained pointer
// net converges to this kind of rule; we hard-code it for reproducibility.
function learnedPick(loads, jobSize, bigStillComing) {
  if (jobSize >= BIG_THRESH) return greedyPick(loads);  // big -> least-loaded
  if (bigStillComing) {                                  // keep a pad in reserve
    const order = [0,1,2].sort((a,c) => loads[a] - loads[c]);
    return order[1];                                     // pack onto 2nd-emptiest
  }
  return greedyPick(loads);                              // no more bigs -> greedy
}

Notice what the widget is really showing: the greedy rule is a fixed per-step argmax of a myopic value ("minimize load right now"). The learned rule conditions on the same state but encodes knowledge of the distribution — that big jobs are coming — which no greedy single-step value can see. Crank the big-job fraction and you are widening the gap between "optimize this step" and "optimize the trajectory." That gap is exactly what policy gradient, optimizing the terminal return rather than a per-step proxy, is built to close.

Mapping back to the spine

Hold the value / policy / model map from orientation in your head and place each piece:

Scheduling piece	Core idea	Place on the map
"Assign jobs / route trucks" as a stream of decisions	Sequential decision MDP (L01)	ordinary MDP — state, action, terminal reward
Action = which machine / which stop next	Huge discrete action space	where value-based argmax breaks (mⁿ, n!)
Build the solution one choice at a time	Pointer-network policy π_θ	policy, learned directly — never enumerates actions
Train on sampled solutions by cost	REINFORCE + baseline (L07)	policy gradient over a combinatorial action space
Reward = −makespan / −cost / −latency	Verifiable terminal reward	the signal — exact, no reward model needed
Beat the hand-crafted heuristic	Amortize over the workload distribution	the payoff: a learned per-step rule that sees the whole distribution

Read top to bottom, scheduling answers one question: how do you do RL when the action space is too big to argmax? You refuse to enumerate it. The MDP framing of lesson 01 says the problem is a sequence of decisions; the policy gradient of lesson 10 says you can train a policy on that sequence using only the log-probability of the choices you actually sampled. The pointer network is the architecture that makes "choose one of a changing set" expressible, and the negative cost is a reward you can always compute. Combinatorial optimization, which looked like a different universe from gridworlds and bandits, is just policy gradient pointed at a very large discrete action space.

Takeaway

Job scheduling, bin-packing, and vehicle routing are sequential-decision MDPs (L01) with action spaces so large — mⁿ assignments, n! tours — that the value-based argmax of L02 is impossible to even represent. The escape is to parameterize the policy directly (L07): a pointer-network π_θ emits the solution one choice at a time, trained by REINFORCE on the negative cost (−makespan / −distance) — a reward you can always measure. Policy gradient wins because it never enumerates the action space; it only needs log π_θ at the action it sampled. The widget shows the payoff: greedy load-balancing is myopically fair and creates stragglers when a big job arrives late, while a learned policy that reserves headroom for big jobs packs tighter and finishes sooner — a learned heuristic beating the hand-crafted one by seeing the whole workload distribution. Scheduling = policy gradient over a combinatorial action space.