The playbook — match the framework to the wall you measured
The three previous lessons did the work: 09a named the five resources and the goodput equation, 09b derived the weight-sync wall, 09c derived the rollout wall and its staleness cost. This lesson is the payoff — a single linear path from workload shape → arithmetic → binding wall → framework. The framework is never the starting point; it is the last decision, and it follows the number.
1 · The path from workload to wall
Before any framework name, run the four readings. Each is a piece of arithmetic from 09a–09c, and each points at a specific term of τstep:
| You read… | It sets… | The arithmetic | From |
|---|---|---|---|
| Model size N | τT (16N bytes of train state) and τS (2N bytes × copies broadcast) | 140 GB sync at 70B, 810 GB at 405B; learner OOM and MFU | 09b §2 |
| Output length L̄ and tail c | τR and the straggler tax | Lmax(K) ≈ μ·exp(s√(2lnK)−s²/2); ~5× at K=16 | 09c §1 |
| Environment cost | τV and the p95 tail | verifier ms vs tool/sandbox seconds; off-GPU latency | 09a §2 |
| Staleness tolerance | placement and the freshness wall | ρ = π_θ / π_{θ−Δ}; wall at Δ ≈ 4–8 | 09c §3 |
The largest term is the binding wall, and the binding wall — not the framework's reputation — is what you optimize first. Everything below is just a lookup keyed on that term.
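The readings above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the byte counts are the 16N and 2N readings from the table, and `s = 0.83` is an assumed log-scale spread chosen because it reproduces the table's ~5× at K=16 under the stated formula. The per-step timings passed to `binding_wall` are made-up numbers standing in for a real trace.

```python
import math

BYTES_PER_PARAM_TRAIN = 16  # the 16N train-state reading
BYTES_PER_PARAM_SYNC = 2    # the 2N per-copy broadcast reading


def train_state_gb(n_params: float) -> float:
    """Learner train state in GB (16 bytes per parameter)."""
    return BYTES_PER_PARAM_TRAIN * n_params / 1e9


def sync_gb(n_params: float, copies: int = 1) -> float:
    """Weights moved per sync in GB (2 bytes per parameter per copy)."""
    return BYTES_PER_PARAM_SYNC * n_params * copies / 1e9


def straggler_ratio(s: float, k: int) -> float:
    """Lmax(K) / mu, using exp(s * sqrt(2 ln K) - s^2 / 2)."""
    return math.exp(s * math.sqrt(2.0 * math.log(k)) - s * s / 2.0)


def binding_wall(tau_seconds: dict[str, float]) -> str:
    """Return the name of the largest per-step term."""
    return max(tau_seconds, key=tau_seconds.get)


if __name__ == "__main__":
    print(sync_gb(70e9), sync_gb(405e9))  # 140.0 810.0
    print(straggler_ratio(0.83, 16))      # roughly 5
    # Hypothetical per-step seconds from a small traced run.
    trace = {"tau_R": 42.0, "tau_S": 9.0, "tau_T": 14.0, "tau_V": 3.0}
    print(binding_wall(trace))            # tau_R
```

With these placeholder timings the rollout term binds, so the rollout row of the lookup is the one to read first.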
2 · Frameworks as optimization bets
No framework is "best"; each places a bet that one resource is your wall and optimizes hard against it. Read the "primary bet" column as "this framework wins when that term dominates" — the mechanisms are the ones derived in 09b–09c:
| Framework | Primary bet (the wall it attacks) | Best fit | Watch out for |
|---|---|---|---|
| TRL | Algorithm ergonomics inside Hugging Face — not a systems bet. | Single-node research, small GRPO/PPO/RLOO, learning the algorithm. | Not built to push 70B+ async rollout efficiency. |
| OpenRLHF | Practical disaggregation: Ray + vLLM + ZeRO, scheduling around the four sync ops (09b §4). | 7B–70B RLHF/RLVR with familiar engines, fast iteration. | Engine-version compatibility and weight-update plumbing. |
| verl / HybridFlow | Layout conversion: 3D-HybridEngine reshards the actor in place, killing the duplicate copy (09b §3). | Large open RL, MoE, backend flexibility, research-to-scale path. | More knobs; you need a clear placement + backend plan. |
| slime | Thin layer over SGLang rollout + Megatron training with bucketed weight updates (09b op 3). | Teams already on SGLang + Megatron, frequent weight updates. | Less insulation; backend expertise carries the weight. |
| NeMo-RL | NVIDIA-integrated stack + speculative decode to shrink τR losslessly (09c §5). | NVIDIA clusters, multimodal, speculative rollout, production pipelines. | Best when infra is already close to NVIDIA's stack. |
| LlamaRL | Weight sync at scale: distributed direct-memory-access (DDMA) sync + off-policy correction (09b op 3). | Very large models where τS and async overlap dominate. | Hardware coupling and staleness-algorithm complexity. |
| AReaL | Break the global barrier: fully async generation/training, staleness-aware PPO (09c §3). | Reasoning workloads blocked by longest-output sync. | You must monitor reward at fixed compute, not utilization. |
| Laminar | Trajectory-level async + relay workers + repacking — attacks the max-over-K tail (09c §1–2). | Long-tail rollouts where the batch boundary wastes the cluster. | More moving parts in versioning and relay recovery. |
| AsyncFlow / Relax | Streaming TransferQueue + service decoupling + a continuous staleness knob. | Heterogeneous engines, multimodal/agentic services, modularity. | The queue policy becomes part of correctness. |
| Agent Lightning | Training-agent disaggregation: a unified transition interface for arbitrary agents. | Existing agent runtimes too costly to rewrite as an RL trainer. | Credit assignment and token-faithful logging get hard. |
3 · Choose by bottleneck
The direct lookup: measure the wall (09a's triage), then read across. This table is the playbook's core — every other section is context for it.
| If the wall is… | You'll see… | Architecture move | Framework direction |
|---|---|---|---|
| Rollout generation (τR) | Actors busy, learner idle, long output p95. | Disaggregate actors; vLLM/SGLang, continuous batching, speculative decode. | OpenRLHF, verl, slime, NeMo-RL; AReaL/Laminar if the tail is severe. |
| Weight sync (τS) | Large fraction of each step reshaping/broadcasting weights. | Colocate/hybrid if you can; else bucket, DMA, relay, or loosen freshness. | verl HybridFlow, slime, LlamaRL, Laminar, Relax. |
| Learner memory/FLOPs (τT) | Low MFU, OOM, tiny microbatches. | Megatron/FSDP/ZeRO, sequence packing, recompute, LoRA, fp8 rollout. | verl, NeMo-RL, OpenRLHF; TRL for small experiments. |
| Env/reward latency (τV) | Reward p95 dominates; actors blocked on tools/tests. | Remote sandbox pool, async reward queue, caching, timeouts, replayable logs. | Relax/AsyncFlow service graph; Agent Lightning for existing agents. |
| Long-tail skew (high c) | p95/p50 rollout ratio high; batch waits for stragglers. | Trajectory-level async, partial rollout, repacking, abort/retract. | Laminar, AReaL, OpenRLHF async/partial, SGLang RL controls. |
| Sparse-reward exploration | K samples rarely succeed; PPO epochs don't help. | Scale search actors, prioritize high-reward/recency, off-policy objective. | TBA-style (Trajectory Balance with Asynchrony) search/learning decoupling around verl/OpenRLHF. |
| Agent observability | Token IDs, tool events, rewards can't be reconstructed. | Instrument the runtime; log token-faithful transitions; split runner from trainer. | Agent Lightning; OpenRLHF token-in-token-out agents. |
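
The table above can be read mechanically as a dictionary lookup. The sketch below encodes five of its rows verbatim; the key names, the `Move` type, and the `tail_ratio_limit` threshold of 3.0 are hypothetical choices for illustration, not values from the lessons. The one piece of logic is the escalation the rollout row describes: if rollout binds and the p95/p50 ratio is high, read the long-tail row instead.

```python
from typing import NamedTuple


class Move(NamedTuple):
    architecture: str
    frameworks: tuple[str, ...]


# Rows copied from the bottleneck table; keys are hypothetical short names.
PLAYBOOK: dict[str, Move] = {
    "rollout": Move(
        "Disaggregate actors; vLLM/SGLang, continuous batching, speculative decode",
        ("OpenRLHF", "verl", "slime", "NeMo-RL"),
    ),
    "weight_sync": Move(
        "Colocate/hybrid if you can; else bucket, DMA, relay, or loosen freshness",
        ("verl HybridFlow", "slime", "LlamaRL", "Laminar", "Relax"),
    ),
    "learner": Move(
        "Megatron/FSDP/ZeRO, sequence packing, recompute, LoRA, fp8 rollout",
        ("verl", "NeMo-RL", "OpenRLHF", "TRL"),
    ),
    "env_reward": Move(
        "Remote sandbox pool, async reward queue, caching, timeouts, replayable logs",
        ("Relax/AsyncFlow", "Agent Lightning"),
    ),
    "long_tail": Move(
        "Trajectory-level async, partial rollout, repacking, abort/retract",
        ("Laminar", "AReaL", "OpenRLHF async/partial", "SGLang RL controls"),
    ),
}


def lookup(step_seconds: dict[str, float], p95_over_p50: float,
           tail_ratio_limit: float = 3.0) -> tuple[str, Move]:
    """Pick the largest measured term, then read the matching row."""
    wall = max(step_seconds, key=step_seconds.get)
    # Assumed rule: a rollout wall with a severe tail is treated as long-tail skew.
    if wall == "rollout" and p95_over_p50 >= tail_ratio_limit:
        wall = "long_tail"
    return wall, PLAYBOOK[wall]


if __name__ == "__main__":
    measured = {"rollout": 40.0, "weight_sync": 8.0, "learner": 12.0, "env_reward": 2.0}
    wall, move = lookup(measured, p95_over_p50=4.0)
    print(wall, move.frameworks)
```

The sparse-reward and observability rows are left out because they are diagnosed from reward statistics and logs, not from a per-step timing.
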
4 · Workload recipes
| Workload | Reasonable starting design | Upgrade trigger or next move |
|---|---|---|
| One-node 7B prototype | TRL or a simple verl/OpenRLHF recipe; prioritize reward correctness and evals. Colocate — τS is a pointer swap. | You wait on rollout more than you debug the algorithm. |
| 7B–32B RLVR research | OpenRLHF or verl, vLLM/SGLang rollout, FSDP/ZeRO learner. | Output lengths go long-tailed or weight sync shows up in traces. |
| 70B reasoning model | Disaggregated actors + learner pool, versioned trajectory queue, bounded async. | Sync barriers dominate or actor/learner want independent sizing. |
| 405B / MoE | Megatron/FSDP-aware framework with an explicit reshard/sync plan — never naive full fanout (810 GB). | Use relay, bucketed update, or DMA sync before adding actor copies. |
| Long-CoT math/code | Actor-heavy layout, continuous batching, dynamic sampling, length/time controls. | Add Laminar/AReaL trajectory async when the p95 tail wastes the cluster. |
| Agentic coding/web/tool RL | Separate env-runner fleet, sandbox logs, token-faithful traces, async reward queues. | Move to Agent Lightning / Relax service decoupling if the agent runtime is large. |
| Multimodal RL | Framework with VLM data processors, media transport, env isolation, modality-aware batching. | NeMo-RL, verl-omni, or Relax-style service roles get more attractive. |
5 · The minimum production design
Independent of framework, a serious RL platform must expose these controls — each is a hook the earlier lessons proved you need:
- Versioned weights — every actor knows which policy version it served; the learner records which it trained on. (09c §4: needed to compute ρ.)
- Versioned trajectories — token IDs, logprobs, rewards, masks, env events, verifier logs, all replayable. (09c §4.)
- Role-level utilization — actor, reward, learner, queue, sync metrics separated. (09a: you can't find the wall without the split.)
- Freshness SLO — max policy lag configurable and evaluated against reward quality. (09c §3.)
- Tail controls — timeout, abort, retract, repack, length-bucket policies, explicit. (09c §1–2.)
- Weight-sync trace: gather, convert, broadcast, and load each timed separately, plus a timestamp for when the new weights become visible. (09b §2: four operations, four levers.)
- Reward audits — hidden evals and verifier consistency checks run continuously, not after the expensive run.
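Three of the controls above (versioned weights, versioned trajectories, and the freshness SLO) fit in one small record type. The following is a minimal sketch under assumptions: the `Trajectory` fields, function names, and the example `max_lag` are hypothetical, and a real record would also carry masks, env events, and verifier logs. It shows why the version stamp and the sampling-time logprobs must be stored: without them neither the policy lag nor ρ can be computed later.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    token_ids: tuple[int, ...]
    logprobs: tuple[float, ...]  # logprobs under the policy that sampled the tokens
    reward: float
    policy_version: int          # weight version the actor was serving


def policy_lag(traj: Trajectory, learner_version: int) -> int:
    """Number of weight versions between sampling and training."""
    return learner_version - traj.policy_version


def importance_ratios(traj: Trajectory, current_logprobs: list[float]) -> list[float]:
    """Per-token rho = pi_current / pi_sampling, computed in log space."""
    if len(current_logprobs) != len(traj.logprobs):
        raise ValueError("logprob sequences must align token for token")
    return [math.exp(new - old) for new, old in zip(current_logprobs, traj.logprobs)]


def split_by_freshness(batch: list[Trajectory], learner_version: int,
                       max_lag: int) -> tuple[list[Trajectory], list[Trajectory]]:
    """Apply the freshness SLO: keep trajectories within max_lag versions."""
    fresh, stale = [], []
    for traj in batch:
        target = fresh if policy_lag(traj, learner_version) <= max_lag else stale
        target.append(traj)
    return fresh, stale


if __name__ == "__main__":
    batch = [
        Trajectory((1, 2), (-1.0, -2.5), reward=1.0, policy_version=10),
        Trajectory((3, 4), (-0.7, -0.9), reward=0.0, policy_version=3),
    ]
    fresh, stale = split_by_freshness(batch, learner_version=10, max_lag=4)
    print(len(fresh), len(stale))  # 1 1
    print(importance_ratios(batch[0], [-1.0, -2.0]))
```

Returning the stale set instead of discarding it keeps the filter auditable, so the lag limit can be evaluated against reward quality.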
Interactive · framework design selector
The selector takes the workload shape and names the likely architecture pattern and framework bias, which is the output of running §1's readings. It is a design hint, not a universal winner; real selection also weighs team expertise, cluster topology, and how much framework code you can debug.
6 · Anti-patterns
| Anti-pattern | Why it fails | Better |
|---|---|---|
| Pick the PPO framework before measuring the rollout wall | The algorithm is rarely the bottleneck; generation usually is (09a). | Trace actor, reward, learner, sync separately on a small run. |
| Naive full weight reload every update | At 70B+ this can eat the whole iteration or force tiny actor fleets (09b §2). | Reshard in place, bucket, direct sync, or sync less often. |
| Async with no policy-version accounting | You can't tell real learning from stale-data noise — ρ is uncomputable (09c §4). | Log policy version and stale-by-version on every trajectory. |
| Drop long samples silently | You train away the long reasoning you wanted (09c §2). | Make timeout/length filters explicit, sampled, auditable. |
| Retokenize agent transcripts later | Token/logprob drift corrupts the RL loss and ρ. | Capture generated token IDs and logprobs at the server boundary. |
| Scale envs without sandbox isolation | One flaky tool poisons throughput and reward labels. | Hermetic sandboxes, retries, deterministic artifacts, verifier logs. |
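
The "explicit, sampled, auditable" fix for dropped long samples can be sketched as a filter that reports what it removed. This is an illustration under assumptions: `FilterAudit`, `filter_by_length`, the 4096-token limit, and the sample size are hypothetical names and values, and trajectories are reduced to an id-to-length mapping for brevity.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FilterAudit:
    max_tokens: int
    seen: int = 0
    dropped: int = 0
    dropped_sample: list[int] = field(default_factory=list)  # ids kept for inspection

    @property
    def drop_rate(self) -> float:
        return self.dropped / self.seen if self.seen else 0.0


def filter_by_length(lengths: dict[int, int], max_tokens: int,
                     sample_size: int = 2, seed: int = 0) -> tuple[list[int], FilterAudit]:
    """Keep trajectory ids within max_tokens and record what was dropped."""
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    audit = FilterAudit(max_tokens=max_tokens)
    kept: list[int] = []
    dropped_ids: list[int] = []
    for traj_id, n_tokens in lengths.items():
        audit.seen += 1
        if n_tokens <= max_tokens:
            kept.append(traj_id)
        else:
            dropped_ids.append(traj_id)
    audit.dropped = len(dropped_ids)
    audit.dropped_sample = rng.sample(dropped_ids, min(sample_size, len(dropped_ids)))
    return kept, audit


if __name__ == "__main__":
    lengths = {1: 100, 2: 5000, 3: 300, 4: 9000, 5: 7000}
    kept, audit = filter_by_length(lengths, max_tokens=4096)
    print(kept, audit.dropped, audit.drop_rate)  # [1, 3] 3 0.6
```

Logging `drop_rate` per step makes it visible when the filter starts removing the long reasoning traces the run was meant to reward.
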
What carries forward
- Framework choice is the last decision, not the first: read model size, output length, env cost, staleness tolerance → those name the binding wall → the wall names the framework.
- Every framework is a bet on one wall. verl/HybridFlow bets on layout conversion; LlamaRL on τS; AReaL/Laminar on the rollout tail; Relax/AsyncFlow on service decoupling; Agent Lightning on agent observability.
- The escalation path is fixed: synchronous (validate reward + loss) → disaggregated (when τR binds) → bounded async (only after measuring reward at fixed GPU-hours).
- The minimum platform is the same regardless of framework: versioned weights + trajectories, role-level utilization, a freshness SLO, tail controls, a sync trace, and continuous reward audits.
- The whole Part-IV arc in one line: RL post-training is an inference system and a training system wired in a loop; you design it by finding which term of τstep binds, and the framework is whichever one optimizes that term best.
Sources used
| Source | Optimization lesson used |
|---|---|
| verl / HybridFlow | Hybrid controller, FSDP/Megatron + vLLM/SGLang backends, 3D-HybridEngine reshard. |
| OpenRLHF docs / paper | Ray + vLLM + DeepSpeed/ZeRO, async/partial rollout, agent execution. |
| slime / SGLang RL | SGLang-native rollout, Megatron training, bucketed updates, abort/retract. |
| NeMo-RL / spec-decode paper | Multimodal post-training and system-integrated speculative decoding. |
| LlamaRL | Fully async PyTorch, direct-memory weight sync, large-model speedups. |
| AReaL | Fully async generation/training and staleness-aware PPO. |
| Laminar | Trajectory-level asynchrony, relay workers, dynamic repacking. |
| AsyncFlow / Relax | Streaming TransferQueues, service decoupling, explicit staleness control. |
| Agent Lightning | Training-agent disaggregation for arbitrary agent runtimes. |
| Trajectory Balance with Asynchrony | Decoupled search and learning for sparse-reward post-training. |