
The playbook — match the framework to the wall you measured

The three previous lessons did the work: 09a named the five resources and the goodput equation, 09b derived the weight-sync wall, 09c derived the rollout wall and its staleness cost. This lesson is the payoff — a single linear path from workload shape → arithmetic → binding wall → framework. The framework is never the starting point; it is the last decision, and it follows the number.

The linearized decision, in one breath
Model size sets the learner and weight-sync problem (τT, τS). Output length sets the actor problem (τR and its tail). Environment cost sets the data-plane problem (τV). Staleness tolerance sets whether sync, near-on-policy, or fully async is even allowed. Measure those four, and the framework choice is almost forced.

1 · The path from workload to wall

Before naming any framework, run the four readings. Each is a calculation from 09a–09c, and each points at a specific term of τstep:

| You read… | It sets… | The arithmetic | From |
| --- | --- | --- | --- |
| Model size N | τT (16N train state) and τS (2N × copies broadcast) | 140 GB sync at 70B, 810 GB at 405B; learner OOM and MFU | 09b §2 |
| Output length L̄ and tail c | τR and the straggler tax | Lmax(K) ≈ μ·exp(s√(2 ln K) − s²/2); ~5× at K=16 | 09c §1 |
| Environment cost | τV and the p95 tail | verifier ms vs tool/sandbox seconds; off-GPU latency | 09a §2 |
| Staleness tolerance | placement and the freshness wall | ρ = π_θ / π_{θ−Δ}; wall at Δ ≈ 4–8 | 09c §3 |
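The first two rows of the table can be checked with a few lines of arithmetic. The sketch below assumes the table's byte counts (16 bytes per parameter of training state, 2 bytes per parameter per broadcast copy) and reads μ as the mean output length. The function names are illustrative, and the shape parameter `s = 0.8` is an assumed value chosen only to show the order of magnitude; the table does not state which `s` it uses.

```python
import math

TRAIN_BYTES_PER_PARAM = 16  # the table's "16N train state"
SYNC_BYTES_PER_PARAM = 2    # the table's "2N" per broadcast copy


def train_state_gb(n_params: float) -> float:
    """Learner training-state footprint in GB for a model with n_params parameters."""
    return TRAIN_BYTES_PER_PARAM * n_params / 1e9


def sync_gb(n_params: float, copies: int = 1) -> float:
    """Bytes moved per weight sync, in GB, for the given number of actor copies."""
    return SYNC_BYTES_PER_PARAM * n_params * copies / 1e9


def straggler_factor(k: int, s: float) -> float:
    """Lmax(K) / mu from the table: exp(s * sqrt(2 ln K) - s^2 / 2)."""
    return math.exp(s * math.sqrt(2 * math.log(k)) - s * s / 2)


if __name__ == "__main__":
    print(f"70B sync:   {sync_gb(70e9):.0f} GB")
    print(f"405B sync:  {sync_gb(405e9):.0f} GB")
    print(f"70B train:  {train_state_gb(70e9):.0f} GB")
    # s = 0.8 is an assumption for illustration, not a value from the lesson.
    print(f"straggler factor at K=16, s=0.8: {straggler_factor(16, 0.8):.2f}x")
```

Raising `copies` shows why adding actor replicas multiplies the sync traffic rather than leaving it fixed.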

The largest term is the binding wall, and the binding wall — not the framework's reputation — is what you optimize first. Everything below is just a lookup keyed on that term.
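As a minimal sketch of "the largest term is the binding wall", the function below takes measured per-step times and returns the dominant term with its share of the step. It assumes the terms run serially, so the shares add up to one; with overlap between phases the shares would need a trace-based definition instead. The dictionary keys and the sample numbers are hypothetical.

```python
def binding_wall(step_terms: dict[str, float]) -> tuple[str, float]:
    """Return (name, share) of the largest term in a per-step time breakdown.

    step_terms maps a term name (e.g. "tau_R") to measured seconds per step.
    Assumes serial phases, so share = term / sum of all terms.
    """
    total = sum(step_terms.values())
    if not step_terms or total <= 0:
        raise ValueError("need at least one positive measurement")
    name = max(step_terms, key=step_terms.get)
    return name, step_terms[name] / total


if __name__ == "__main__":
    # Hypothetical trace from a small run, in seconds per step.
    measured = {"tau_R": 42.0, "tau_S": 9.0, "tau_T": 15.0, "tau_V": 4.0}
    wall, share = binding_wall(measured)
    print(f"binding wall: {wall} ({share:.0%} of step time)")
```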

2 · Frameworks as optimization bets

No framework is "best"; each places a bet that one resource is your wall and optimizes hard against it. Read the "primary bet" column as "this framework wins when that term dominates" — the mechanisms are the ones derived in 09b–09c:

| Framework | Primary bet (the wall it attacks) | Best fit | Watch out for |
| --- | --- | --- | --- |
| TRL | Algorithm ergonomics inside Hugging Face; not a systems bet. | Single-node research, small GRPO/PPO/RLOO, learning the algorithm. | Not built to push 70B+ async rollout efficiency. |
| OpenRLHF | Practical disaggregation: Ray + vLLM + ZeRO, scheduling around the four sync ops (09b §4). | 7B–70B RLHF/RLVR with familiar engines, fast iteration. | Engine-version compatibility and weight-update plumbing. |
| verl / HybridFlow | Layout conversion: 3D-HybridEngine reshards the actor in place, killing the duplicate copy (09b §3). | Large open RL, MoE, backend flexibility, research-to-scale path. | More knobs; you need a clear placement + backend plan. |
| slime | Thin layer over SGLang rollout + Megatron training with bucketed weight updates (09b op 3). | Teams already on SGLang + Megatron, frequent weight updates. | Less insulation; backend expertise carries the weight. |
| NeMo-RL | NVIDIA-integrated stack + speculative decode to shrink τR losslessly (09c §5). | NVIDIA clusters, multimodal, speculative rollout, production pipelines. | Best when infra is already close to NVIDIA's stack. |
| LlamaRL | Weight sync at scale: direct-memory (RDMA) sync + off-policy correction (09b op 3). | Very large models where τS and async overlap dominate. | Hardware coupling and staleness-algorithm complexity. |
| AReaL | Break the global barrier: fully async generation/training, staleness-aware PPO (09c §3). | Reasoning workloads blocked by longest-output sync. | You must monitor reward at fixed compute, not utilization. |
| Laminar | Trajectory-level async + relay workers + repacking; attacks the max-over-K tail (09c §1–2). | Long-tail rollouts where the batch boundary wastes the cluster. | More moving parts in versioning and relay recovery. |
| AsyncFlow / Relax | Streaming TransferQueue + service decoupling + a continuous staleness knob. | Heterogeneous engines, multimodal/agentic services, modularity. | The queue policy becomes part of correctness. |
| Agent Lightning | Training-agent disaggregation: a unified transition interface for arbitrary agents. | Existing agent runtimes too costly to rewrite as an RL trainer. | Credit assignment and token-faithful logging get hard. |
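The AReaL row's warning, to monitor reward at fixed compute rather than utilization, can be made concrete by comparing two runs at the same GPU-hour budget. The sketch below linearly interpolates an evaluation-reward curve at a chosen budget. The function name, the curve format, and the sample curves are all hypothetical; nothing here is taken from a framework's API.

```python
from bisect import bisect_left


def reward_at_budget(curve: list[tuple[float, float]], gpu_hours: float) -> float:
    """Linearly interpolate eval reward at a GPU-hour budget.

    curve is a list of (gpu_hours, eval_reward) points sorted by gpu_hours.
    """
    xs = [x for x, _ in curve]
    if not xs or not xs[0] <= gpu_hours <= xs[-1]:
        raise ValueError("budget lies outside the measured curve")
    i = bisect_left(xs, gpu_hours)  # first index with xs[i] >= gpu_hours
    if xs[i] == gpu_hours:
        return curve[i][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (gpu_hours - x0) / (x1 - x0)


if __name__ == "__main__":
    # Hypothetical curves: the async run does more steps per hour,
    # but the comparison is made at equal GPU-hours, not equal steps.
    sync_run = [(0.0, 0.20), (100.0, 0.40), (200.0, 0.50)]
    async_run = [(0.0, 0.20), (100.0, 0.44), (200.0, 0.48)]
    for budget in (150.0, 200.0):
        s = reward_at_budget(sync_run, budget)
        a = reward_at_budget(async_run, budget)
        print(f"{budget:.0f} GPU-h: sync={s:.3f} async={a:.3f}")
```

In this made-up example the async run leads at 150 GPU-hours and trails at 200, which is the kind of crossover a steps-per-hour metric would hide.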

3 · Choose by bottleneck

The direct lookup: measure the wall (09a's triage), then read across. This table is the playbook's core — every other section is context for it.

| If the wall is… | You'll see… | Architecture move | Framework direction |
| --- | --- | --- | --- |
| Rollout generation (τR) | Actors busy, learner idle, long output p95. | Disaggregate actors; vLLM/SGLang, continuous batching, speculative decode. | OpenRLHF, verl, slime, NeMo-RL; AReaL/Laminar if the tail is severe. |
| Weight sync (τS) | Large fraction of each step reshaping/broadcasting weights. | Colocate/hybrid if you can; else bucket, DMA, relay, or loosen freshness. | verl HybridFlow, slime, LlamaRL, Laminar, Relax. |
| Learner memory/FLOPs (τT) | Low MFU, OOM, tiny microbatches. | Megatron/FSDP/ZeRO, sequence packing, recompute, LoRA, fp8 rollout. | verl, NeMo-RL, OpenRLHF; TRL for small experiments. |
| Env/reward latency (τV) | Reward p95 dominates; actors blocked on tools/tests. | Remote sandbox pool, async reward queue, caching, timeouts, replayable logs. | Relax/AsyncFlow service graph; Agent Lightning for existing agents. |
| Long-tail skew (high c) | p95/p50 rollout ratio high; batch waits for stragglers. | Trajectory-level async, partial rollout, repacking, abort/retract. | Laminar, AReaL, OpenRLHF async/partial, SGLang RL controls. |
| Sparse-reward exploration | K samples rarely succeed; PPO epochs don't help. | Scale search actors, prioritize high-reward/recency, off-policy objective. | TBA-style search/learning decoupling around verl/OpenRLHF. |
| Agent observability | Token IDs, tool events, rewards can't be reconstructed. | Instrument the runtime; log token-faithful transitions; split runner from trainer. | Agent Lightning; OpenRLHF token-in-token-out agents. |
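The long-tail row's symptom, a high p95/p50 rollout ratio, is cheap to compute from logged output lengths or rollout times. Here is a minimal sketch using only the standard library. The function names and the threshold of 3.0 are assumptions for illustration; the lesson does not give a cutoff.

```python
import statistics


def tail_ratio(values: list[float]) -> float:
    """p95 / p50 of rollout lengths (tokens) or rollout times (seconds)."""
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    # method="inclusive" keeps the estimate within the observed min and max.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[94] / cuts[49]


def is_long_tailed(values: list[float], threshold: float = 3.0) -> bool:
    """Flag a batch whose stragglers dominate. The threshold is an assumed value."""
    return tail_ratio(values) > threshold


if __name__ == "__main__":
    # Hypothetical batch: most outputs are short, a tenth run very long.
    lengths = [100.0] * 90 + [2000.0] * 10
    print(f"p95/p50 = {tail_ratio(lengths):.1f}, long-tailed: {is_long_tailed(lengths)}")
```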

4 · Workload recipes

| Workload | Reasonable starting design | Upgrade when… |
| --- | --- | --- |
| One-node 7B prototype | TRL or a simple verl/OpenRLHF recipe; prioritize reward correctness and evals. Colocate: τS is a pointer swap. | You wait on rollout more than you debug the algorithm. |
| 7B–32B RLVR research | OpenRLHF or verl, vLLM/SGLang rollout, FSDP/ZeRO learner. | Output lengths go long-tailed or weight sync shows up in traces. |
| 70B reasoning model | Disaggregated actors + learner pool, versioned trajectory queue, bounded async. | Sync barriers dominate or actor/learner want independent sizing. |
| 405B / MoE | Megatron/FSDP-aware framework with an explicit reshard/sync plan; never naive full fanout (810 GB). | Use relay, bucketed update, or DMA sync before adding actor copies. |
| Long-CoT math/code | Actor-heavy layout, continuous batching, dynamic sampling, length/time controls. | Add Laminar/AReaL trajectory async when the p95 tail wastes the cluster. |
| Agentic coding/web/tool RL | Separate env-runner fleet, sandbox logs, token-faithful traces, async reward queues. | Move to Agent Lightning / Relax service decoupling if the agent runtime is large. |
| Multimodal RL | Framework with VLM data processors, media transport, env isolation, modality-aware batching. | NeMo-RL, verl-omni, or Relax-style service roles get more attractive. |

5 · The minimum production design

Independent of framework, a serious RL platform must expose these controls — each is a hook the earlier lessons proved you need:

The staff-level answer
"I'd start synchronous to validate the reward and the loss. Then add disaggregated rollout serving once a trace shows generation is the wall. Then loosen to bounded async — but only after measuring policy-lag quality at fixed GPU-hours, not steps/hour. The framework choice follows that path; it doesn't lead it."

Interactive · framework design selector

Set the workload shape. The selector names the likely architecture pattern and framework bias — the output of running §1's readings for you. It is a design hint, not a universal winner; real selection also weighs team expertise, cluster topology, and how much framework code you can debug.


The recommendation is a planning hint. It encodes the §3 lookup: size → τT/τS, output → τR, env → τV, staleness → placement.
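For readers without the interactive widget, the same idea can be sketched as a small rule table. Everything in the block below is an assumption made for illustration: the input names, the rule order, and the size thresholds (7B, 70B, 400B) are hypothetical, and the output strings are paraphrased from the §3 and §4 tables. It is not the selector's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    architecture: str
    framework_bias: str
    first_optimization: str
    risk: str


def select(params_b: float, long_tail: bool, slow_env: bool, max_policy_lag: int) -> Hint:
    """Map a workload shape to a planning hint. Rule order and thresholds are assumed."""
    if slow_env:  # env cost -> tau_V
        return Hint("separate env-runner fleet with an async reward queue",
                    "Relax/AsyncFlow; Agent Lightning for existing agents",
                    "env/reward latency (τV)",
                    "the queue policy becomes part of correctness")
    if params_b >= 400:  # model size -> tau_S
        return Hint("Megatron/FSDP-aware learner with an explicit reshard/sync plan",
                    "verl HybridFlow, slime, LlamaRL",
                    "weight sync (τS)",
                    "naive full weight fanout")
    if long_tail and max_policy_lag > 0:  # output tail -> tau_R, if staleness is allowed
        return Hint("trajectory-level or bounded async rollout",
                    "Laminar, AReaL",
                    "rollout tail (τR)",
                    "stale data without policy-version accounting")
    if params_b >= 70:
        return Hint("disaggregated actors + learner pool",
                    "OpenRLHF, verl",
                    "rollout generation (τR)",
                    "engine-version compatibility and weight-update plumbing")
    if params_b <= 7:
        return Hint("colocated, synchronous",
                    "TRL or a simple verl/OpenRLHF recipe",
                    "reward correctness and evals",
                    "not built for 70B+ async rollout efficiency")
    return Hint("vLLM/SGLang rollout with an FSDP/ZeRO learner",
                "OpenRLHF or verl",
                "rollout generation (τR)",
                "long-tailed outputs or weight sync showing up in traces")


if __name__ == "__main__":
    print(select(params_b=70, long_tail=True, slow_env=False, max_policy_lag=4))
```

Like the widget, this only narrows the search; a measured trace should override any rule here.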


6 · Anti-patterns

| Anti-pattern | Why it fails | Better |
| --- | --- | --- |
| Pick the PPO framework before measuring the rollout wall | The algorithm is rarely the bottleneck; generation usually is (09a). | Trace actor, reward, learner, sync separately on a small run. |
| Naive full weight reload every update | At 70B+ this can eat the whole iteration or force tiny actor fleets (09b §2). | Reshard in place, bucket, direct sync, or sync less often. |
| Async with no policy-version accounting | You can't tell real learning from stale-data noise; ρ is uncomputable (09c §4). | Log policy version and stale-by-version on every trajectory. |
| Drop long samples silently | You train away the long reasoning you wanted (09c §2). | Make timeout/length filters explicit, sampled, auditable. |
| Retokenize agent transcripts later | Token/logprob drift corrupts the RL loss and ρ. | Capture generated token IDs and logprobs at the server boundary. |
| Scale envs without sandbox isolation | One flaky tool poisons throughput and reward labels. | Hermetic sandboxes, retries, deterministic artifacts, verifier logs. |
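Two of the "Better" cells, logging the policy version on every trajectory and capturing token IDs and logprobs at the server boundary, amount to a small record type. The sketch below shows one possible shape. The class and field names are hypothetical. The per-token ratio is computed as exp(current logprob − behavior logprob), which is the standard log-space form of a ratio between the current and the stale policy.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    token_ids: tuple[int, ...]             # generated IDs exactly as sampled, never retokenized
    behavior_logprobs: tuple[float, ...]   # logprobs under the policy that generated them
    policy_version: int                    # version of the weights used for generation


def staleness(traj: Trajectory, learner_version: int) -> int:
    """Policy lag in versions between the learner and the generating actor."""
    return learner_version - traj.policy_version


def importance_ratios(traj: Trajectory, current_logprobs: list[float]) -> list[float]:
    """Per-token ratio of current-policy to behavior-policy probability."""
    if len(current_logprobs) != len(traj.behavior_logprobs):
        raise ValueError("logprob sequences must align token for token")
    return [math.exp(c - b) for c, b in zip(current_logprobs, traj.behavior_logprobs)]


if __name__ == "__main__":
    # Hypothetical values for illustration.
    traj = Trajectory(token_ids=(11, 42, 7),
                      behavior_logprobs=(-0.5, -1.0, -0.2),
                      policy_version=12)
    print("stale by", staleness(traj, learner_version=15), "versions")
    print(importance_ratios(traj, [-0.4, -1.3, -0.2]))
```

Without the stored `behavior_logprobs` and `policy_version`, neither quantity can be recovered after the fact, which is the failure the table describes.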

What carries forward

Sources used

| Source | Optimization lesson used |
| --- | --- |
| verl / HybridFlow | Hybrid controller, FSDP/Megatron + vLLM/SGLang backends, 3D-HybridEngine reshard. |
| OpenRLHF docs / paper | Ray + vLLM + DeepSpeed/ZeRO, async/partial rollout, agent execution. |
| slime / SGLang RL | SGLang-native rollout, Megatron training, bucketed updates, abort/retract. |
| NeMo-RL / spec-decode paper | Multimodal post-training and system-integrated speculative decoding. |
| LlamaRL | Fully async PyTorch, direct-memory weight sync, large-model speedups. |
| AReaL | Fully async generation/training and staleness-aware PPO. |
| Laminar | Trajectory-level asynchrony, relay workers, dynamic repacking. |
| AsyncFlow / Relax | Streaming TransferQueues, service decoupling, explicit staleness control. |
| Agent Lightning | Training-agent disaggregation for arbitrary agent runtimes. |
| Trajectory Balance with Asynchrony | Decoupled search and learning for sparse-reward post-training. |