Controller — the orchestrator
The only file that sees every role. The whole training loop fits in seven steps.
What the controller is
By now we have six roles: rollout, environment, reference, algorithm, trainer, weight-sync. Each has a single responsibility and a tight interface. The controller is the one piece of code that holds references to all of them and calls them in order. In production it's a small Python driver (tens of lines) that dispatches work to Ray actors; in this in-process framework it's a dataclass with a step() method. Same interface either way.
def step(self):
    trajs = self.rollout.generate(self.env, K)                        # ❶ sample K with old_logp, reward
    self.reference.score(trajs)                                       # ❷ fill ref_logp
    self.algorithm.compute_advantages([trajs])                        # ❸ assign advantage per token
    if all(t.info.get("degenerate") for t in trajs):                  # skip if no signal
        return degenerate_metrics(trajs)
    batch = collate_trajectories(trajs, ...)                          # ❹ response-local → padded full-seq
    m = self.trainer.train_step(batch, self.algorithm.compute_loss)   # ❺ forward+backward+step
    self.syncer.sync(self.trainer, self.rollout)                      # ❻ push fresh weights to rollout
    return m                                                          # ❼ metrics for logging
That's the whole training loop. The widely used frameworks (verl, OpenRLHF, NeMo-RL, TRL, SLIME) are built around essentially this same loop. They differ in where each step runs (which process, which GPU, which rank) and in how data moves between steps (NCCL collectives, Ray actors, shared memory), not in what the loop does.
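To make the "dataclass with a step() method" shape concrete, here is a minimal in-process sketch. It is not this framework's actual code: `Traj`, `StubRole`, and the `k` field are hypothetical stand-ins, one stub class plays every role, and collation is replaced by a plain list copy. The stubs only record the order in which the controller calls them.

```python
from dataclasses import dataclass, field


@dataclass
class Traj:
    """Hypothetical trajectory record; a real one carries tokens and log-probs."""
    reward: float
    info: dict = field(default_factory=dict)


class StubRole:
    """Stand-in for every role: appends the name of each call to a shared log."""

    def __init__(self, log, rewards=(1.0, 0.0)):
        self.log = log
        self.rewards = rewards

    def generate(self, env, k):
        self.log.append("rollout.generate")
        return [Traj(reward=self.rewards[i % len(self.rewards)]) for i in range(k)]

    def score(self, trajs):
        self.log.append("reference.score")

    def compute_advantages(self, groups):
        self.log.append("algorithm.compute_advantages")
        for group in groups:
            # A group with one distinct reward carries no learning signal.
            same = len({t.reward for t in group}) == 1
            for t in group:
                t.info["degenerate"] = same

    def compute_loss(self, batch):
        return 0.0  # placeholder loss

    def train_step(self, batch, loss_fn):
        self.log.append("trainer.train_step")
        return {"loss": loss_fn(batch), "skipped": False}

    def sync(self, trainer, rollout):
        self.log.append("syncer.sync")


@dataclass
class Controller:
    rollout: StubRole
    env: object
    reference: StubRole
    algorithm: StubRole
    trainer: StubRole
    syncer: StubRole
    k: int = 4  # rollouts per step (assumed to live on the controller)

    def step(self):
        trajs = self.rollout.generate(self.env, self.k)
        self.reference.score(trajs)
        self.algorithm.compute_advantages([trajs])
        if all(t.info.get("degenerate") for t in trajs):
            return {"skipped": True}
        batch = list(trajs)  # placeholder for the collate utility
        metrics = self.trainer.train_step(batch, self.algorithm.compute_loss)
        self.syncer.sync(self.trainer, self.rollout)
        return metrics
```

Constructing `Controller(role, None, role, role, role, role)` with a single `StubRole` and calling `step()` leaves the call order in `role.log`, which makes the "calls them in order" claim easy to check by hand.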
Interactive · run one full step
Hit Step to advance through the seven phases. Each phase lights up the relevant role in the diagram and shows what's filling in on the data side. The bottom strip shows the running trajectory list and what fields are populated.
What the controller does not do
The controller never:
- Touches model parameters directly. (That's the trainer's and syncer's job.)
- Computes a reward. (Environment.)
- Sees a log-probability or a logit. (Rollout, reference, trainer, algorithm — never the controller.)
- Knows which RL algorithm is running. (Algorithm.)
- Knows what task is being trained on. (Environment.)
This is by design — when something breaks, the breakage is in one role's file, not scattered across the orchestrator. If the controller ever grows past 200 lines of real logic, it's a sign one of the sub-roles is leaking responsibility upward.
Degenerate-group skip
One subtlety in the loop above: if every rollout in a group received the same reward, each reward equals the group mean, so the GRPO advantage is zero for every token in every trajectory. The optimizer step would just be numerical noise. We saw this in lesson 2's K-rollout widget and lesson 4's advantage widget. The controller filters it explicitly:
if all(t.info.get("degenerate", False) for t in trajs):
    return degenerate_metrics(trajs)  # skip forward, skip optimizer step
This matters in two ways. It saves one forward + backward pass per degenerate step, and it keeps the optimizer's running variance estimate (the second-moment term in Adam-style optimizers) from accumulating noise during periods when the policy is uniformly succeeding (or uniformly failing) on the current prompt. Real frameworks do the same at batch level: they keep only the groups with non-degenerate advantages.
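The batch-level variant can be sketched in a few lines. This assumes the common mean/std group normalization for the advantage (some variants drop the std term) and represents each group as a plain list of scalar rewards. The function names are illustrative, not taken from any framework.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps), one value per rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_degenerate(rewards):
    """A group is degenerate when every rollout received the same reward."""
    return max(rewards) == min(rewards)


def keep_informative(groups):
    """Batch-level filter: drop groups whose advantages would all be zero."""
    return [g for g in groups if not is_degenerate(g)]
```

A group such as `[1.0, 1.0, 1.0, 1.0]` yields all-zero advantages and is dropped by `keep_informative`, while a mixed group such as `[0.0, 1.0]` survives with one negative and one positive advantage.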
Why this one file is so small
The controller's brevity is the framework's biggest win. Look at what it doesn't have to deal with:
- It doesn't know if rollout is colocated, disaggregated, or async (lesson 6's three patterns).
- It doesn't know if the algorithm is GRPO, DAPO, or Dr.GRPO (lesson 4).
- It doesn't know if the env is single-turn or multi-turn (lesson 8).
- It doesn't know if the trainer is single-GPU or FSDP-sharded across 64 GPUs (lesson 5).
- It doesn't know if sync is an in-process copy or a 70B NCCL broadcast.
It only knows seven interfaces — one per role plus the collate utility. Each is one method call. That decoupling is what lets the same orchestrator drive a CPU laptop demo and a multi-node FSDP+vLLM production run, unchanged.
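One way to write those interfaces down is with structural typing. The sketch below uses `typing.Protocol` for the five roles whose method names appear in `step()` above. The environment and collate interfaces are omitted because their methods are not shown here. Type annotations are deliberately loose, and the two syncer classes are hypothetical stand-ins that show how an implementation can be swapped without the controller noticing.

```python
from typing import Any, Callable, Protocol, Sequence, runtime_checkable


@runtime_checkable
class Rollout(Protocol):
    def generate(self, env: Any, k: int) -> Sequence[Any]: ...


@runtime_checkable
class Reference(Protocol):
    def score(self, trajs: Sequence[Any]) -> None: ...


@runtime_checkable
class Algorithm(Protocol):
    def compute_advantages(self, groups: Sequence[Sequence[Any]]) -> None: ...
    def compute_loss(self, batch: Any) -> Any: ...


@runtime_checkable
class Trainer(Protocol):
    def train_step(self, batch: Any, loss_fn: Callable[[Any], Any]) -> dict: ...


@runtime_checkable
class Syncer(Protocol):
    def sync(self, trainer: Any, rollout: Any) -> None: ...


class InProcessSync:
    """Hypothetical laptop-scale syncer: copy a dict of weights in memory."""

    def sync(self, trainer, rollout):
        rollout.weights = dict(trainer.weights)


class BroadcastSync:
    """Hypothetical stand-in for a distributed broadcast; it only counts calls."""

    def __init__(self):
        self.calls = 0

    def sync(self, trainer, rollout):
        self.calls += 1
        rollout.weights = dict(trainer.weights)
```

Both syncers satisfy `Syncer` without inheriting from it, so a controller typed against the protocol accepts either one. A real distributed syncer would replace the body of `sync` and leave the signature alone.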
The exit conditions
This framework's controller has no explicit stop-on-convergence. The outer loop is just:
for step in range(cfg.steps):
    metrics = self.step()
    ...
Real frameworks add a maximum wall-time, a target reward EMA, evaluation on a held-out set every N steps, and a checkpoint every M steps. Each is one extra dispatch the controller makes. None of them requires changing the sub-roles. The seven-step loop above is the load-bearing wall.
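As a rough sketch of how those exit conditions wrap the loop without touching the roles, the driver below takes the step function and optional hooks as arguments. Everything here is an assumption made for illustration: the `run` name, its parameters, and the `"reward"` key in the metrics dict are not taken from any framework.

```python
import time


def run(step_fn, max_steps, *, max_seconds=None, target_reward=None,
        ema_decay=0.9, eval_every=None, eval_fn=None,
        ckpt_every=None, ckpt_fn=None, clock=time.monotonic):
    """Call step_fn until a stop condition fires; return (steps_done, reason)."""
    start = clock()
    ema = None
    for step in range(1, max_steps + 1):
        metrics = step_fn()

        # Exponential moving average of the reward, if the step reported one.
        reward = metrics.get("reward")
        if reward is not None:
            ema = reward if ema is None else ema_decay * ema + (1 - ema_decay) * reward

        # Periodic dispatches: held-out evaluation and checkpointing.
        if eval_every and eval_fn and step % eval_every == 0:
            eval_fn(step)
        if ckpt_every and ckpt_fn and step % ckpt_every == 0:
            ckpt_fn(step)

        # Stop conditions, checked after the step has fully completed.
        if target_reward is not None and ema is not None and ema >= target_reward:
            return step, "target_reward"
        if max_seconds is not None and clock() - start >= max_seconds:
            return step, "wall_time"
    return max_steps, "max_steps"
```

Passing `controller.step` as `step_fn` keeps the seven-step loop unchanged, and each added condition is one more check or callback in the driver. The injectable `clock` exists only so the wall-time branch can be exercised without waiting.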