Agentic RL — multi-turn + tool masking

When the trajectory contains tokens the model didn't generate, the algorithm doesn't need to change — but the mask does.

The shift from single-turn to agentic

Lessons 1–7 lived in a single-turn world: prompt in, response out, reward at the end. Modern agentic systems — Claude with tools, OpenAI function calling, SWE-bench solvers, Voyager — are multi-turn. The model emits, the environment responds with tool output or new observations, the model emits again, and so on. The trajectory is a sequence of alternating assistant turns and environment turns.

turn 0 (assistant):   "let me compute   c(123+456+789)#"
turn 1 (env tool):    "=1368#"
turn 2 (assistant):   "the answer is 1368#"   ← terminal turn
                                              reward = verify("1368") = 1

The trajectory is one continuous token stream. From the model's perspective it has emitted some tokens and "observed" others. The framework's job is to record per-token old_logp on tokens the model emitted and mark the rest as not-trainable.
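
The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the framework's code: the `Trajectory` container here is a stripped-down hypothetical, `encode` is an assumed character-level tokenizer, and the `-0.5` log-prob is a made-up stand-in for whatever the sampler would report.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical minimal container; the framework's real Trajectory has more fields."""
    ids: list = field(default_factory=list)            # token ids of the response stream
    response_mask: list = field(default_factory=list)  # 1.0 = assistant token, 0.0 = injected
    old_logp: list = field(default_factory=list)       # sampling-time log-prob, 0.0 where masked


def encode(text):
    # Character-level "tokenizer", assumed only for illustration.
    return [ord(ch) for ch in text]


def append_turn(traj, text, from_assistant, logp_per_token=-0.5):
    """Append one turn and keep ids, mask, and old_logp aligned token by token."""
    ids = encode(text)
    traj.ids.extend(ids)
    traj.response_mask.extend([1.0 if from_assistant else 0.0] * len(ids))
    # Injected tokens get a 0.0 placeholder; the mask keeps them out of the loss.
    traj.old_logp.extend([logp_per_token if from_assistant else 0.0] * len(ids))


traj = Trajectory()
append_turn(traj, "let me compute c(123+456+789)#", from_assistant=True)
append_turn(traj, "=1368#", from_assistant=False)
append_turn(traj, "the answer is 1368#", from_assistant=True)

n_assistant = int(sum(traj.response_mask))
n_tool = len(traj.ids) - n_assistant
```

The three lists always have the same length, which is the invariant everything downstream relies on: position t in `ids` is trainable exactly when position t in `response_mask` is 1.0.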

The one-line key idea

Why tool-output tokens get response_mask = 0

We want the policy to learn when to call a tool, not to imitate the tool's output. Masking tool tokens makes the loss include exactly the assistant-generated tokens — the gradient flows only through what the model could have done differently.

Without this mask, two failure modes appear:

Tool mimicry. The model learns to guess the tool output from the query alone — defeating the point of having a tool, and failing to generalize to inputs the model didn't see during training.
Gradient pollution. Tool-output tokens carry the same advantage as the rest of the trajectory but the model didn't produce them — moving their log-probabilities is just noise on the gradient.

The fix is one line in agent_loop.MultiTurnRolloutEngine._one_trajectory:

response_mask.extend([0.0] * len(obs_ids))    # ← the key line: tool tokens, not trainable
old_logp.extend([0.0] * len(obs_ids))         #    placeholder: the policy never sampled these tokens, and the mask drops them

Interactive · the trajectory and its mask

Below is a complete multi-turn trajectory on the calculator-arithmetic task. Each tile is one token, colored by who emitted it. Toggle masking and watch the loss change — when masking is off, the model is being trained to imitate =1368#.

Multi-turn trajectory · tokens, mask, contribution to loss

Blue = assistant (mask=1, contributes to loss). Orange = tool output (mask=0, ignored). Toggle the mask to see what would break.

[Interactive figure. Toggle: apply tool mask (correct). Token panels: PROMPT; TURN 0 · ASSISTANT (sampled, mask=1); TURN 1 · ENV TOOL OUTPUT (injected, mask=0); TURN 2 · ASSISTANT (sampled, mask=1); TERMINAL REWARD. Readouts: tokens scored, assistant tokens, tool tokens, expected learning.]

What just broke

With masking off, the loss gradient is also acting on tokens like =1368# — and since this trajectory was rewarded, the gradient is pushing π_θ to predict =1368# after c(123+456+789)#. The model is now memorizing tool outputs, exactly the failure we wanted to avoid.
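
To make the failure concrete, here is a small sketch of which tokens receive gradient. It assumes a plain REINFORCE-style surrogate, loss = -(1/N) Σ_t w_t · A · logp_t, where w_t is the mask (or all ones when masking is off) and N is the number of scored tokens. The function name `grad_coefficients` and the 30/6/19 token split (a character-level tokenization of the example trajectory) are assumptions made for illustration.

```python
def grad_coefficients(advantage, response_mask, apply_mask=True):
    """Return d(loss)/d(logp_t) for loss = -(1/N) * sum_t w_t * advantage * logp_t."""
    weights = response_mask if apply_mask else [1.0] * len(response_mask)
    n_scored = sum(weights)
    return [-advantage * w / n_scored for w in weights]


# Assumed layout: 30 assistant tokens, 6 tool tokens ("=1368#"), 19 assistant tokens.
mask = [1.0] * 30 + [0.0] * 6 + [1.0] * 19
tool_positions = slice(30, 36)

masked = grad_coefficients(1.0, mask, apply_mask=True)
unmasked = grad_coefficients(1.0, mask, apply_mask=False)

tool_grad_masked = masked[tool_positions]      # all zero: tool tokens are ignored
tool_grad_unmasked = unmasked[tool_positions]  # nonzero: the policy is pushed toward "=1368#"
```

A negative coefficient on the loss means gradient descent raises that token's log-probability. With the mask off and a positive advantage, every tool-output position gets exactly that push.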

What the rest of the framework didn't have to change

This is the prize. Going from single-turn to agentic RL required:

A new environment type (MultiTurnEnv) with step() in addition to reset().
A new rollout engine (MultiTurnRolloutEngine) that drives the assistant/env back-and-forth.

That's it. Everything else is unchanged:

The reference engine scores ref_logp the same way: masked positions are zeroed out and ignored.
The algorithm's advantage computation is unchanged — the scalar advantage broadcasts to all (assistant) response tokens via response_mask.
The trainer's forward and loss don't know whether they're looking at single-turn or multi-turn — the mask handles it.
The weight syncer doesn't care.
The controller's seven-step loop is unchanged.

This is the framework's value proposition delivering on a real workload: add a new task shape, and the algorithms you already have just work.
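
The advantage broadcast mentioned above can be sketched as follows. This assumes a GRPO-style group normalization (reward minus group mean, divided by group standard deviation); the function names, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not the framework's actual algorithm.py.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """One scalar advantage per trajectory, normalized within the sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast(advantage, response_mask):
    """Spread the scalar over the token axis; masked positions receive zero."""
    return [advantage * m for m in response_mask]


rewards = [1.0, 0.0, 0.0, 1.0]           # four rollouts of the same prompt
advs = group_advantages(rewards)

mask = [1.0, 1.0, 0.0, 0.0, 1.0]         # assistant, assistant, tool, tool, assistant
per_token = broadcast(advs[0], mask)
```

Nothing in `group_advantages` knows whether the trajectory was single-turn or multi-turn. The mask is the only place where the difference shows up.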

One subtlety: per-turn rewards

The single-turn case has one terminal reward. Multi-turn allows per-turn rewards: maybe the environment gives partial credit for an intermediate observation, or penalizes a malformed tool call. In our toy task the only reward comes at the terminal turn, but MultiTurnRolloutEngine accumulates rewards across turns into traj.reward:

reward_total = 0.0                  # running sum of per-turn rewards
for _turn in range(cfg.max_turns):
    gen_ids, gen_logp = self._generate_assistant(ids)
    obs_text, reward, done, info = env.step(decode(gen_ids), info)
    reward_total += reward          # ← accumulated across turns
    if done:
        break
return Trajectory(..., reward=reward_total)

For dense per-turn rewards (for example, from a process reward model) you'd want a per-token reward field instead of a scalar, plus a different advantage estimator (GAE-style) that does temporal credit assignment. That's one new field on Trajectory and a new Algorithm subclass; the rest of the framework is unchanged. Same pattern.
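
As a rough sketch of what a per-token reward field could look like: place each turn's reward on the last token of the assistant turn that earned it, then compute a discounted reward-to-go. This is not GAE itself, which also needs a per-token value estimate from a critic; it is only the return part. The function names and the example rewards are hypothetical.

```python
def per_token_rewards(turn_lengths, turn_is_assistant, turn_rewards):
    """Put each turn's reward on the final token of that assistant turn."""
    rewards = []
    for length, is_assistant, r in zip(turn_lengths, turn_is_assistant, turn_rewards):
        chunk = [0.0] * length
        if is_assistant and length > 0:
            chunk[-1] = r
        rewards.extend(chunk)
    return rewards


def reward_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards at every token position."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Assumed example: a small penalty on turn 0 (malformed call), success on turn 2.
rewards = per_token_rewards(
    turn_lengths=[3, 2, 2],
    turn_is_assistant=[True, False, True],
    turn_rewards=[-0.1, 0.0, 1.0],
)
returns = reward_to_go(rewards, gamma=1.0)
```

Tokens in the first turn see both the penalty and the later success, while tokens in the final turn see only the success. A scalar trajectory reward cannot express that distinction. The per-token returns would still be multiplied by response_mask before entering the loss.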

Why the mask "belongs to" the trajectory, not the trainer

A design question worth answering: why does the rollout engine emit a per-token mask, instead of the algorithm or the trainer figuring out which tokens are assistant tokens?

Because only the rollout engine knows. By the time the trajectory reaches the trainer, all tokens look the same — a tensor of ids. The fact that token 47 was generated by the policy and token 48 came from env.step is information that exists at sampling time and would be expensive (and bug-prone) to reconstruct later.

Storing the mask on the trajectory at sampling time is the canonical solution. Every algorithm that needs to be "correct" on agentic data simply multiplies by the mask everywhere — exactly what algorithm.py's _masked_mean already does.
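
The source of algorithm.py's _masked_mean is not shown here, so the following is only a guess at its shape: average the values where the mask is 1 and ignore the rest. The guard against an all-zero mask is an added assumption.

```python
def masked_mean(values, mask):
    """Mean of values over positions where mask is 1.0; masked positions do not count."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    # Guard against an all-zero mask (e.g. a trajectory with no assistant tokens).
    return total / max(count, 1.0)


per_token_loss = [2.0, 4.0, 100.0, 6.0]
mask = [1.0, 1.0, 0.0, 1.0]            # the third position is a tool token

with_mask = masked_mean(per_token_loss, mask)
without_mask = sum(per_token_loss) / len(per_token_loss)
```

Note that the denominator is the number of unmasked tokens, not the sequence length. Dividing by the full length would silently shrink the loss on trajectories with long tool outputs.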

Reading the code

Two files implement the agentic surface:

environment.py — defines MultiTurnEnv (the protocol) and CalculatorArithmeticEnv (the concrete task). The env's step() takes the assistant text, detects a c(...) tool call via regex, evaluates it, and returns the result as an injection.
agent_loop.py — defines MultiTurnRolloutEngine. The _one_trajectory method is the loop: sample an assistant turn until EOS or budget, call env.step, append the env's response with mask=0, repeat.

The whole agentic addition is ~200 lines. Most of it is bookkeeping around the per-token mask.
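
For orientation, here is a hedged sketch of what the env side of that loop might look like. It is not the code in environment.py: the regex, the restriction to + and - on integers, the `info["target"]` key, and the standalone `step` function are all assumptions. Only the returned tuple order (obs_text, reward, done, info) follows the snippet shown earlier in this lesson.

```python
import re

# Matches a tool call such as c(123+456+789); only integers joined by + or -.
TOOL_CALL = re.compile(r"c\((\d+(?:[+-]\d+)*)\)")


def eval_sum(expr):
    """Add up signed integers, e.g. "10-3+5" gives 12, without calling eval()."""
    return sum(int(tok) for tok in re.findall(r"[+-]?\d+", expr))


def step(assistant_text, info):
    """Return (obs_text, reward, done, info) for one assistant turn."""
    match = TOOL_CALL.search(assistant_text)
    if match:
        # Tool call: inject the result; the rollout engine appends it with mask=0.
        return f"={eval_sum(match.group(1))}#", 0.0, False, info
    # No tool call: treat the turn as the final answer and verify it.
    answer = re.search(r"\d+", assistant_text)
    correct = answer is not None and int(answer.group()) == info["target"]
    return "", (1.0 if correct else 0.0), True, info


info = {"target": 1368}
obs1, r1, done1, info = step("let me compute c(123+456+789)#", info)
obs2, r2, done2, info = step("the answer is 1368#", info)
```

Parsing the expression with a narrow regex instead of eval() keeps the tool from executing arbitrary model output, which matters once the policy starts exploring.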

Where the framework ends and the next steps begin

The eight lessons cover the full surface of the framework: six roles plus the orchestrator plus the agentic extension. Beyond this, the natural next questions are:

Critic / value head for PPO. Adds a value field to Trajectory, a second head on the model, a second term in compute_loss. See rl_sota/01_ppo.py.
Reward model as a role. RLHF (vs RLVR) uses a learned reward model — a RewardModelEngine with the same forward-only shape as ReferenceEngine.
Process reward / per-turn rewards. Per-token reward field + a different advantage estimator (GAE).
Async rollout. Versioned weights and a weights_generation tag on every trajectory; weight_sync.py's header describes the pattern.
FSDP / tensor parallel. Wrap the trainer's model; reference and rollout pick their own (possibly different) sharding.

Each of these is a single-file change because of the role boundaries, which is the framework's whole value proposition. The neighboring folders flesh out specific pieces: rl_sota/ for the algorithm space, vllm/ for the rollout backend, gpt_mini/ for the full SFT → CoT → DPO → RLVR pipeline.

Takeaway

Agentic RL is single-turn RL plus one bit of bookkeeping: which tokens did the model produce, and which did the environment inject? Get the mask right and every existing algorithm — REINFORCE, GRPO, DAPO, Dr.GRPO — just works on multi-turn trajectories without modification.

You've now seen the complete framework. Open RL/framework/ — every file should read as a single role with a single responsibility, and the controller's seven-step loop should feel inevitable. Part II of the series (lessons 09–14) drills into the algorithms that live inside algorithm.py — start with REINFORCE to see what the loss can actually be.