Controller — the orchestrator

The only file that sees every role. The whole training loop fits in seven steps.

What the controller is

By now we have six roles: rollout, environment, reference, algorithm, trainer, weight-sync. Each has a single responsibility and a tight interface. The controller is the one piece of code that holds references to all of them and calls them in order. In production it's a small Python driver (tens of lines) that dispatches work to Ray actors; in this in-process framework it's a dataclass with a step() method. Same interface either way.

def step(self):
    trajs  = self.rollout.generate(self.env, K)        # ❶ sample K with old_logp, reward
    self.reference.score(trajs)                        # ❷ fill ref_logp
    self.algorithm.compute_advantages([trajs])         # ❸ assign advantage per token
    if all(t.info.get("degenerate") for t in trajs):   #    skip if no signal
        return degenerate_metrics(trajs)
    batch  = collate_trajectories(trajs, ...)          # ❹ response-local → padded full-seq
    m      = self.trainer.train_step(batch, self.algorithm.compute_loss)  # ❺ forward+backward+step
    self.syncer.sync(self.trainer, self.rollout)       # ❻ push fresh weights to rollout
    return m                                           # ❼ metrics for logging

That's the whole training loop. Every SOTA framework (verl, OpenRLHF, NeMo-RL, TRL, SLIME) has this same loop at its core. Differences between them are in where each step runs (which process, which GPU, which rank) and how data moves between them (NCCL collectives, Ray actors, shared memory) — not in what the loop does.
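To see the loop run end to end, here is a minimal in-process sketch with stub roles. Everything in it other than the method names taken from step() is an assumption made for illustration: the Traj fields, the Stub* classes, the placeholder loss, and the pass-through that stands in for collate_trajectories are hypothetical and not the framework's real implementations.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Traj:
    # Hypothetical trajectory record: a scalar reward plus a free-form info dict.
    reward: float
    info: dict = field(default_factory=dict)


class StubRollout:
    def __init__(self, rewards):
        self.rewards = rewards
        self.version = 0  # bumped by the syncer after each training step

    def generate(self, env, k):
        # A real rollout would sample from the policy; here rewards are canned.
        return [Traj(reward=r) for r in self.rewards[:k]]


class StubReference:
    def score(self, trajs):
        for t in trajs:
            t.info["ref_logp"] = 0.0  # placeholder value


class StubAlgorithm:
    def compute_advantages(self, groups):
        for group in groups:
            mean = sum(t.reward for t in group) / len(group)
            same = all(t.reward == group[0].reward for t in group)
            for t in group:
                t.info["adv"] = t.reward - mean
                t.info["degenerate"] = same

    def compute_loss(self, batch):
        # Placeholder loss, only so that train_step has something to call.
        return sum(t.info["adv"] ** 2 for t in batch)


class StubTrainer:
    def __init__(self):
        self.steps = 0

    def train_step(self, batch, loss_fn):
        self.steps += 1
        return {"loss": loss_fn(batch), "skipped": False}


class StubSyncer:
    def sync(self, trainer, rollout):
        rollout.version = trainer.steps


@dataclass
class Controller:
    rollout: Any
    env: Any
    reference: Any
    algorithm: Any
    trainer: Any
    syncer: Any
    k: int = 4

    def step(self):
        trajs = self.rollout.generate(self.env, self.k)
        self.reference.score(trajs)
        self.algorithm.compute_advantages([trajs])
        if all(t.info.get("degenerate", False) for t in trajs):
            return {"loss": None, "skipped": True}
        batch = trajs  # stand-in for the collate step
        metrics = self.trainer.train_step(batch, self.algorithm.compute_loss)
        self.syncer.sync(self.trainer, self.rollout)
        return metrics
```

Constructing Controller(StubRollout([1.0, 0.0, 1.0, 0.0]), None, StubReference(), StubAlgorithm(), StubTrainer(), StubSyncer()) and calling step() runs all seven phases. Swapping the rewards for four identical values takes the skip branch instead, and the trainer is never called.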

Interactive · run one full step

[Interactive widget: one controller iteration, phase by phase. Each click advances one of the seven phases, highlights the active role, and shows which fields are filled in: resp_ids, old_logp, reward, ref_logp, and adv for each trajectory, then batch.full_ids, the loss, and the sync count.]

What the controller does not do

The controller never:

Touches model parameters directly. (That's the trainer's and syncer's job.)
Computes a reward. (Environment.)
Reads a log-probability or a logit. (It hands trajectories that carry them from role to role, but only rollout, reference, trainer, and algorithm look inside.)
Knows which RL algorithm is running. (Algorithm.)
Knows what task is being trained on. (Environment.)

This is by design — when something breaks, the breakage is in one role's file, not scattered across the orchestrator. If the controller ever grows past 200 lines of real logic, it's a sign one of the sub-roles is leaking responsibility upward.

Degenerate-group skip

One subtlety in the loop above: if every rollout in a group received the same reward, each reward equals the group mean, so the GRPO advantage is zero for every token in every trajectory of that group. The optimizer step would just be numerical noise. We saw this in lesson 2's K-rollout widget and lesson 4's advantage widget. The controller filters it explicitly:

if all(t.info.get("degenerate", False) for t in trajs):
    return degenerate_metrics(trajs)        # skip forward, skip optimizer step

This matters in two ways. It saves one forward and backward pass per degenerate step, and it keeps zero-signal gradients out of the optimizer's running gradient statistics (such as Adam's second-moment estimate) during periods when the policy is uniformly succeeding (or uniformly failing) on the current prompt. Real frameworks apply the same filter at batch level: they keep only the groups with non-degenerate advantages.
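The group-level filter described above can be sketched in a few lines. This assumes the mean/std-normalized form of the GRPO advantage; the population standard deviation, the eps value, and the helper names are assumptions chosen for the example, not taken from this framework.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_degenerate(rewards):
    """A group carries no learning signal when all of its rewards are equal."""
    return max(rewards) == min(rewards)


def keep_informative(groups):
    """Batch-level filter: drop every group whose rewards are all identical."""
    return [g for g in groups if not is_degenerate(g)]
```

For a group with rewards [1.0, 1.0, 1.0, 1.0] every numerator is zero, so group_advantages returns all zeros and keep_informative drops the group.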

Why this one file is so small

The controller's brevity is the framework's biggest win. Look at what it doesn't have to deal with:

It doesn't know if rollout is colocated, disaggregated, or async (lesson 6's three patterns).
It doesn't know if the algorithm is GRPO, DAPO, or Dr.GRPO (lesson 4).
It doesn't know if the env is single-turn or multi-turn (lesson 8).
It doesn't know if the trainer is single-GPU or FSDP-sharded across 64 GPUs (lesson 5).
It doesn't know if sync is an in-process copy or a 70B NCCL broadcast.

It only knows seven interfaces — one per role plus the collate utility. Each is one method call. That decoupling is what lets the same orchestrator drive a CPU laptop demo and a multi-node FSDP+vLLM production run, unchanged.
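One way to express this decoupling in Python is a structural interface via typing.Protocol. The sketch below does it for the weight-sync role only. The sync(trainer, rollout) signature matches the call in step(); the two implementations, the Box holder, and the weights attribute are hypothetical stand-ins, and the second class only imitates a distributed push by counting calls.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Syncer(Protocol):
    """What the controller needs from weight sync: a single method."""

    def sync(self, trainer, rollout) -> None: ...


class Box:
    """Hypothetical holder for a dict of weights (used for trainer and rollout)."""

    def __init__(self, weights):
        self.weights = weights


class InProcessSyncer:
    def sync(self, trainer, rollout):
        # In-process case: copy the weights across directly.
        rollout.weights = dict(trainer.weights)


class RecordingSyncer:
    """Stand-in for a remote push: same interface, plus a counter of pushes."""

    def __init__(self):
        self.pushes = 0

    def sync(self, trainer, rollout):
        self.pushes += 1
        rollout.weights = dict(trainer.weights)


def push_weights(syncer: Syncer, trainer, rollout) -> None:
    # The caller depends only on the protocol, never on the concrete class.
    syncer.sync(trainer, rollout)
```

Neither implementation inherits from Syncer. Any object with a matching sync method satisfies the protocol, which is the property that lets the calling code stay unchanged when the implementation behind it is replaced.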

The exit conditions

This framework's controller has no explicit stop-on-convergence. The outer loop is just:

for step in range(cfg.steps):
    metrics = self.step()
    ...

Real frameworks add a wall-clock limit, a target on the reward's exponential moving average (EMA), evaluation on a held-out set every N steps, and a checkpoint every M steps. Each is one extra dispatch the controller makes. None of them requires changing the sub-roles. The seven-step loop above is the load-bearing wall.
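A hedged sketch of what such an outer loop might look like follows. The function name, its parameters, the "reward" key in the metrics dict, and the injected clock are all assumptions made for the example. The point is only that each exit condition wraps the step call without reaching into any role.

```python
import time


def run(step_fn, max_steps, max_seconds=None, target_reward=None,
        ema_decay=0.9, eval_every=None, eval_fn=None,
        ckpt_every=None, ckpt_fn=None, clock=time.monotonic):
    """Call step_fn until a stop condition fires; return (steps_done, reason)."""
    start = clock()
    ema = None
    for i in range(1, max_steps + 1):
        metrics = step_fn()

        # Exponential moving average of the reward (assumes a "reward" metric).
        reward = metrics.get("reward")
        if reward is not None:
            ema = reward if ema is None else ema_decay * ema + (1 - ema_decay) * reward

        # Periodic dispatches: evaluation and checkpointing.
        if eval_every and eval_fn and i % eval_every == 0:
            eval_fn(i)
        if ckpt_every and ckpt_fn and i % ckpt_every == 0:
            ckpt_fn(i)

        # Early exits: reward target first, then the wall-clock limit.
        if target_reward is not None and ema is not None and ema >= target_reward:
            return i, "target_reward"
        if max_seconds is not None and clock() - start >= max_seconds:
            return i, "wall_time"
    return max_steps, "max_steps"
```

A call such as run(controller.step, max_steps=1000, ckpt_every=100, ckpt_fn=save) would leave step() itself untouched; save here is a hypothetical checkpoint callback.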

Takeaway

Every modern post-training RL system reduces to the same seven-step orchestrator: rollout → reference → advantages → collate → train_step → weight_sync → log. Frameworks differ in where each step runs, not in what the steps are. If you've understood this loop, you've understood the system.