Algorithm — the plugin interface
The only piece of the system that's actually different across PPO, GRPO, RLOO, DAPO, and Dr.GRPO. Everything else is identical. This lesson defines the interface; Part II walks each algorithm in depth.
The two-method contract
In this framework, an Algorithm answers exactly two questions:
- Advantage: Given a group of trajectories with rewards, what advantage does each token get?
- Loss: Given a batch with advantages, old log-probs, and reference log-probs, what is the loss as a function of the current policy's log-probs?
That's it. The rollout engine doesn't know which algorithm is running. The trainer doesn't either. Swapping GRPO for DAPO is one file change — exactly the design pressure that makes algorithm.py a plugin and not a tangle of conditionals throughout the trainer.
algorithm.py
from typing import Protocol

class Algorithm(Protocol):
    def compute_advantages(self, groups: list[list[Trajectory]]) -> None:
        """Write traj.advantage on every Trajectory in place."""

    def compute_loss(self, batch: Batch, new_logp: Tensor) -> tuple[Tensor, dict]:
        """Return (scalar_loss, metrics_dict). Gradient flows through new_logp."""
Why this is one role, not five
Look at any two algorithms in RL/algorithms/ side by side — GRPO vs DAPO, or REINFORCE vs Dr.GRPO. They differ in:
- What the advantage is. (Raw reward? r − mean? Divided by std? Leave-one-out? GAE from a value head?)
- What the loss looks like. (Multiply A · log π? Use a PPO ratio? Clip symmetrically? Asymmetrically? Normalize per-token or per-sequence?)
They don't differ in: how rollouts are sampled, what the reference is for, where gradients flow, when weights sync. Those live in other roles. The whole point of factoring algorithm out is that everything algorithmic sits behind these two methods, and the trainer / rollout / controller never need to know about it.
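To make that separation concrete, here is a minimal sketch of a toy plugin and a caller that only ever touches the two methods. Everything in it is an assumption for illustration: the stripped-down Trajectory, the MeanBaseline class, and train_step are hypothetical names, and plain Python floats stand in for tensors. It is not the framework's code.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Trajectory:
    """Hypothetical stand-in: only the fields this sketch needs."""
    reward: float
    advantage: float = 0.0


@runtime_checkable
class Algorithm(Protocol):
    def compute_advantages(self, groups: list[list[Trajectory]]) -> None: ...
    def compute_loss(self, batch, new_logp) -> tuple[float, dict]: ...


class MeanBaseline:
    """Toy plugin (hypothetical): advantage = reward minus the group mean."""

    def compute_advantages(self, groups):
        for group in groups:
            mean = sum(t.reward for t in group) / len(group)
            for t in group:
                t.advantage = t.reward - mean

    def compute_loss(self, batch, new_logp):
        # REINFORCE-style term -A * log pi, one log-prob per trajectory here.
        terms = [-t.advantage * lp for t, lp in zip(batch, new_logp)]
        loss = sum(terms) / len(terms)
        return loss, {"pg_loss": loss}


def train_step(algo: Algorithm, groups, new_logp):
    """The caller never branches on which algorithm it was handed."""
    algo.compute_advantages(groups)
    batch = [t for group in groups for t in group]
    return algo.compute_loss(batch, new_logp)
```

Swapping in a different plugin would mean passing a different object to train_step; nothing inside train_step changes.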
The shared signature — what flows in
By the time compute_advantages is called, every Trajectory has:
| Field | Set by | Shape |
|---|---|---|
| prompt_ids, response_ids | rollout | list of ints |
| response_mask | rollout | (R,); 1 for assistant tokens, 0 for tool / EOS-pad |
| old_logp | rollout (at sampling time, lesson 02) | (R,) |
| ref_logp | reference (lesson 03) | (R,) |
| reward | env.verify (lesson 03) | scalar (single-turn) or per-turn sum (agentic, lesson 08) |
The algorithm's job is to fill in one more field — traj.advantage — then later, when given a packed Batch and the trainer's fresh new_logp, produce a scalar loss.
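The field table can be read as a record type. The sketch below is one possible shape, not the framework's definition: it uses plain lists where the real code presumably uses tensors, treats advantage as a single scalar per trajectory (an assumption), and adds a hypothetical check_shapes helper to show the (R,) constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    prompt_ids: list[int]
    response_ids: list[int]
    response_mask: list[int]   # (R,) 1 = assistant token, 0 = tool / EOS-pad
    old_logp: list[float]      # (R,) recorded by rollout at sampling time
    ref_logp: list[float]      # (R,) from the reference model
    reward: float              # scalar, or a per-turn sum in the agentic case
    advantage: Optional[float] = None  # filled in later by compute_advantages

    def check_shapes(self) -> None:
        """Hypothetical helper: every per-token field must have length R."""
        r = len(self.response_ids)
        for name in ("response_mask", "old_logp", "ref_logp"):
            n = len(getattr(self, name))
            if n != r:
                raise ValueError(f"{name} has length {n}, expected {r}")
```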
Interactive · feel the advantage assignment
Three rollouts. You set their rewards. The table shows what advantage each algorithm assigns. This is a preview of lesson 11's widget; read the depth there. The point here: on the advantage side, this table is all that differs across algorithms.
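As a static counterpart, the following sketch computes the advantage each critic-free algorithm would assign to one group of rewards. It is a simplified illustration under stated assumptions: the function name advantage_table is hypothetical, GRPO is shown with the population standard deviation plus a small eps (implementations differ on both choices), and PPO is left out because its advantage needs a value head.

```python
from statistics import mean, pstdev


def advantage_table(rewards, eps=1e-6):
    """Per-rollout advantages for one group, keyed by algorithm name."""
    k = len(rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, not sample std
    total = sum(rewards)
    return {
        # baseline = 0: the raw reward is the advantage
        "REINFORCE": list(rewards),
        # baseline = group mean, then divide by group std
        "GRPO": [(r - mu) / (sigma + eps) for r in rewards],
        # baseline = mean of the other K-1 rollouts
        "RLOO": [r - (total - r) / (k - 1) for r in rewards],
        # GRPO without the /std
        "Dr.GRPO": [r - mu for r in rewards],
    }


if __name__ == "__main__":
    for name, adv in advantage_table([1.0, 0.0, 0.0]).items():
        print(f"{name:10s}", [round(a, 3) for a in adv])
```

With rewards of 1, 0, 0 the single success gets a positive advantage under every centred variant, but the scale differs: dividing by the group std stretches it, while leave-one-out compares it against the other two rollouts only.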
The loss — same skeleton, different fillings
Once advantages are written, every algorithm produces its loss the same way:
# Simplified from RL/framework/algorithm.py: the shared skeleton
A = batch.advantage
log_ratio = (new_logp - batch.old_logp) * mask  # masked positions get ratio 1
ratio = torch.exp(log_ratio)
# ── ONE OF THESE LINES is the algorithm-specific part ──
# REINFORCE: pg_tok = -A * new_logp
# GRPO / PPO: pg_tok = -torch.min(ratio * A, torch.clamp(ratio, 1 - eps, 1 + eps) * A)
# DAPO: same min, but the clamp is asymmetric: (1 - eps_low, 1 + eps_high)
# Dr.GRPO: same min, but A is built without /std
pg_loss = _masked_mean(pg_tok, mask)  # token-level
kl_loss = beta * _masked_mean(compute_kl_k3(new_logp, batch.ref_logp), mask)
loss = pg_loss + kl_loss
The KL anchor (lesson 03) is added by every algorithm — it doesn't really "belong to" any of them but we apply it inside compute_loss because new_logp and ref_logp are already in the batch. Same arithmetic, one fewer call site for the masking.
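For readers who want to run the skeleton by hand, here is a dependency-free sketch of the clipped variant on plain lists. It is an approximation under assumptions, not the framework's implementation: the function names are hypothetical, eps = 0.2 and beta = 0.01 are placeholder values, the advantage is given per token, and compute_kl_k3 is assumed to be the k3 estimator exp(d) - d - 1 with d = ref_logp - new_logp.

```python
import math


def clipped_pg_token(new_logp, old_logp, adv, eps_low=0.2, eps_high=0.2):
    """Per-token clipped surrogate; eps_low != eps_high gives an asymmetric clip."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * adv, clipped * adv)


def kl_k3(new_logp, ref_logp):
    """k3 estimator of KL(new || ref) for one token; always >= 0."""
    d = ref_logp - new_logp
    return math.exp(d) - d - 1.0


def masked_mean(values, mask):
    """Token-level mean over positions where mask == 1."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


def policy_loss(new_logp, old_logp, ref_logp, adv, mask,
                beta=0.01, eps_low=0.2, eps_high=0.2):
    pg = [clipped_pg_token(n, o, a, eps_low, eps_high)
          for n, o, a in zip(new_logp, old_logp, adv)]
    kl = [kl_k3(n, r) for n, r in zip(new_logp, ref_logp)]
    return masked_mean(pg, mask) + beta * masked_mean(kl, mask)
```

When the current policy equals both the sampling policy and the reference, the ratio is 1 and the KL term is 0, so the loss reduces to the negated masked mean of the advantages. That makes a convenient sanity check for any implementation of this skeleton.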
_masked_mean is the token-level mean: (per_token_loss * mask).sum() / mask.sum(). The alternative, averaging within each sequence and then averaging across sequences, introduces a length bias. DAPO and Dr.GRPO (lessons 13 and 14) make this their headline fix. The token-level form is what this framework uses everywhere.
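A small numeric sketch shows where the length bias comes from. It compares the token-level mean with a per-sequence mean followed by an average across sequences (the 1/|y_i| normalisation). The function names and the toy per-token losses are made up for illustration.

```python
def token_level_mean(seqs):
    """Every unmasked token in the batch carries the same weight."""
    total = sum(sum(seq) for seq in seqs)
    count = sum(len(seq) for seq in seqs)
    return total / count


def sequence_level_mean(seqs):
    """Mean within each sequence, then mean across sequences."""
    return sum(sum(seq) / len(seq) for seq in seqs) / len(seqs)


def per_token_weights(seqs):
    """Weight one token receives under sequence_level_mean, per sequence."""
    return [1.0 / (len(seqs) * len(seq)) for seq in seqs]


if __name__ == "__main__":
    short = [1.0, 1.0]   # 2 tokens
    long = [0.25] * 8    # 8 tokens
    print(token_level_mean([short, long]))
    print(sequence_level_mean([short, long]))
    print(per_token_weights([short, long]))
```

In the sequence-level form a token in the 8-token response carries a quarter of the weight of a token in the 2-token response, so the same per-token error counts for less the longer the response gets. The token-level form weights all ten tokens equally.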
Where to read each algorithm
| Algorithm | One-line change vs predecessor | Depth |
|---|---|---|
| REINFORCE | baseline = 0 (no patches; the floor) | Lesson 09 |
| PPO | add value head + importance ratio + symmetric clip | Lesson 10 |
| GRPO | drop value head; baseline = group mean / std | Lesson 11 |
| RLOO | baseline = mean of other K-1 rollouts | Lesson 12 |
| DAPO | 4 patches: clip-higher, dynamic sampling, token-level loss, overlong shaping | Lesson 13 |
| Dr.GRPO | delete /std and /\|y_i\| from GRPO | Lesson 14 |
The whole interface is two methods: compute_advantages and compute_loss. Six algorithms in this curriculum, one interface, one trainer that doesn't know which one is running. The depth (variance arguments, clip mechanics, length-bias debates) lives in Part II. Read this lesson as "where the plugin fits"; read Part II as "what fills the plugin".