The playbook — match the framework to the wall you measured

The three previous lessons did the work: 09a named the five resources and the goodput equation, 09b derived the weight-sync wall, 09c derived the rollout wall and its staleness cost. This lesson is the payoff — a single linear path from workload shape → arithmetic → binding wall → framework. The framework is never the starting point; it is the last decision, and it follows the number.

The linearized decision, in one breath

Model size sets the learner and weight-sync problem (τ_T, τ_S). Output length sets the actor problem (τ_R and its tail). Environment cost sets the data-plane problem (τ_V). Staleness tolerance sets whether sync, near-on-policy, or fully async is even allowed. Measure those four, and the framework choice is almost forced.

1 · The path from workload to wall

Before naming any framework, take the four readings. Each is a piece of arithmetic from 09a–09c, and each points at a specific term of τ_step:

You read…	It sets…	The arithmetic	From
Model size N	τ_T (16N bytes of train state) and τ_S (2N bytes × copies broadcast)	140 GB sync at 70B, 810 GB at 405B; learner OOM and MFU	09b §2
Output length L̄ and tail c	τ_R and the straggler tax	L_max(K) ≈ μ·exp(s√(2lnK)−s²/2); ~5× at K=16	09c §1
Environment cost	τ_V and the p95 tail	verifier ms vs tool/sandbox seconds; off-GPU latency	09a §2
Staleness tolerance	placement and the freshness wall	ρ = π_θ / π_{θ−Δ}; wall at Δ ≈ 4–8	09c §3
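
The first two rows of the table can be reproduced in a few lines. This is a minimal sketch, not code from 09b or 09c: the function names are hypothetical, 16N and 2N are read as bytes per parameter, and the log-normal shape parameter s = 0.83 is an assumed value chosen only because it makes the table's formula give roughly 5× at K = 16.

```python
import math

def train_state_bytes(n_params: float) -> float:
    """Learner state, reading the table's 16N as 16 bytes per parameter."""
    return 16 * n_params

def sync_bytes(n_params: float, copies: int = 1) -> float:
    """Weights to broadcast: 2 bytes per parameter, times the actor copies."""
    return 2 * n_params * copies

def straggler_ratio(s: float, k: int) -> float:
    """L_max(K) / mu from the table: exp(s * sqrt(2 ln K) - s^2 / 2)."""
    return math.exp(s * math.sqrt(2 * math.log(k)) - s * s / 2)

sync_70b_gb = sync_bytes(70e9) / 1e9      # one copy of a 70B model
sync_405b_gb = sync_bytes(405e9) / 1e9    # one copy of a 405B model
tail_k16 = straggler_ratio(0.83, 16)      # s = 0.83 is an assumption

print(f"70B sync:  {sync_70b_gb:.0f} GB")
print(f"405B sync: {sync_405b_gb:.0f} GB")
print(f"straggler tax at K=16: {tail_k16:.1f}x the mean length")
```

Plugging in your own N, copy count, and a fitted s gives the numbers to compare against the measured step trace.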

The largest term is the binding wall, and the binding wall — not the framework's reputation — is what you optimize first. Everything below is just a lookup keyed on that term.
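
Picking the binding wall is an argmax over the measured terms. The sketch below assumes the four terms were traced as non-overlapping wall-clock seconds per step; the trace values and the function name are hypothetical.

```python
def binding_wall(tau: dict[str, float]) -> tuple[str, float]:
    """Return the largest term of tau_step and its share of the total."""
    total = sum(tau.values())
    name = max(tau, key=tau.get)
    return name, tau[name] / total

# Hypothetical per-step trace, in seconds.
trace = {"tau_R": 41.0, "tau_S": 9.0, "tau_T": 14.0, "tau_V": 3.0}
wall, share = binding_wall(trace)
print(f"binding wall: {wall} ({share:.0%} of the step)")
```

If stages overlap in your system, the shares no longer sum to the step time, so treat the share as a ranking aid rather than a speedup estimate.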

2 · Frameworks as optimization bets

No framework is "best"; each places a bet that one resource is your wall and optimizes hard against it. Read the "primary bet" column as "this framework wins when that term dominates" — the mechanisms are the ones derived in 09b–09c:

Framework	Primary bet (the wall it attacks)	Best fit	Watch out for
TRL	Algorithm ergonomics inside Hugging Face — not a systems bet.	Single-node research, small GRPO/PPO/RLOO, learning the algorithm.	Not built to push 70B+ async rollout efficiency.
OpenRLHF	Practical disaggregation: Ray + vLLM + ZeRO, scheduling around the four sync ops (09b §4).	7B–70B RLHF/RLVR with familiar engines, fast iteration.	Engine-version compatibility and weight-update plumbing.
verl / HybridFlow	Layout conversion: 3D-HybridEngine reshards the actor in place, killing the duplicate copy (09b §3).	Large open RL, MoE, backend flexibility, research-to-scale path.	More knobs; you need a clear placement + backend plan.
slime	Thin layer over SGLang rollout + Megatron training with bucketed weight updates (09b op 3).	Teams already on SGLang + Megatron, frequent weight updates.	Less insulation; backend expertise carries the weight.
NeMo-RL	NVIDIA-integrated stack + speculative decode to shrink τ_R losslessly (09c §5).	NVIDIA clusters, multimodal, speculative rollout, production pipelines.	Best when infra is already close to NVIDIA's stack.
LlamaRL	Weight sync at scale: direct-memory (RDMA) sync + off-policy correction (09b op 3).	Very large models where τ_S and async overlap dominate.	Hardware coupling and staleness-algorithm complexity.
AReaL	Break the global barrier: fully async generation/training, staleness-aware PPO (09c §3).	Reasoning workloads blocked by longest-output sync.	You must monitor reward at fixed compute, not utilization.
Laminar	Trajectory-level async + relay workers + repacking — attacks the max-over-K tail (09c §1–2).	Long-tail rollouts where the batch boundary wastes the cluster.	More moving parts in versioning and relay recovery.
AsyncFlow / Relax	Streaming TransferQueue + service decoupling + a continuous staleness knob.	Heterogeneous engines, multimodal/agentic services, modularity.	The queue policy becomes part of correctness.
Agent Lightning	Training-agent disaggregation: a unified transition interface for arbitrary agents.	Existing agent runtimes too costly to rewrite as an RL trainer.	Credit assignment and token-faithful logging get hard.

3 · Choose by bottleneck

The direct lookup: measure the wall (09a's triage), then read across. This table is the playbook's core — every other section is context for it.

If the wall is…	You'll see…	Architecture move	Framework direction
Rollout generation (τ_R)	Actors busy, learner idle, long output p95.	Disaggregate actors; vLLM/SGLang, continuous batching, speculative decode.	OpenRLHF, verl, slime, NeMo-RL; AReaL/Laminar if the tail is severe.
Weight sync (τ_S)	Large fraction of each step reshaping/broadcasting weights.	Colocate/hybrid if you can; else bucket, DMA, relay, or loosen freshness.	verl HybridFlow, slime, LlamaRL, Laminar, Relax.
Learner memory/FLOPs (τ_T)	Low MFU, OOM, tiny microbatches.	Megatron/FSDP/ZeRO, sequence packing, recompute, LoRA, fp8 rollout.	verl, NeMo-RL, OpenRLHF; TRL for small experiments.
Env/reward latency (τ_V)	Reward p95 dominates; actors blocked on tools/tests.	Remote sandbox pool, async reward queue, caching, timeouts, replayable logs.	Relax/AsyncFlow service graph; Agent Lightning for existing agents.
Long-tail skew (high c)	p95/p50 rollout ratio high; batch waits for stragglers.	Trajectory-level async, partial rollout, repacking, abort/retract.	Laminar, AReaL, OpenRLHF async/partial, SGLang RL controls.
Sparse-reward exploration	K samples rarely succeed; PPO epochs don't help.	Scale search actors, prioritize high-reward/recency, off-policy objective.	Trajectory Balance with Asynchrony (TBA)-style search/learning decoupling around verl/OpenRLHF.
Agent observability	Token IDs, tool events, rewards can't be reconstructed.	Instrument the runtime; log token-faithful transitions; split runner from trainer.	Agent Lightning; OpenRLHF token-in-token-out agents.
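
The first five rows of the table can be held as a plain lookup, with the long-tail row selected by a measured p95/p50 ratio. This is an illustrative sketch: the keys, the function names, and the skew threshold of 3.0 are assumptions, since the table only says the ratio is "high".

```python
import statistics

LOOKUP = {
    "rollout": {"move": "disaggregate actors; continuous batching, speculative decode",
                "frameworks": ["OpenRLHF", "verl", "slime", "NeMo-RL"]},
    "weight_sync": {"move": "colocate/hybrid, else bucket, DMA, relay, or loosen freshness",
                    "frameworks": ["verl HybridFlow", "slime", "LlamaRL", "Laminar", "Relax"]},
    "learner": {"move": "Megatron/FSDP/ZeRO, sequence packing, recompute, LoRA",
                "frameworks": ["verl", "NeMo-RL", "OpenRLHF", "TRL"]},
    "env_reward": {"move": "remote sandbox pool, async reward queue, caching, timeouts",
                   "frameworks": ["Relax/AsyncFlow", "Agent Lightning"]},
    "long_tail": {"move": "trajectory-level async, partial rollout, repacking",
                  "frameworks": ["Laminar", "AReaL", "OpenRLHF async/partial"]},
}

def tail_ratio(lengths: list[int]) -> float:
    """p95 / p50 of rollout lengths (inclusive percentile method)."""
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return cuts[94] / cuts[49]

def recommend(wall: str, rollout_lengths=None, skew_threshold: float = 3.0):
    """Read across the table; switch to the long-tail row if skew is severe."""
    if wall == "rollout" and rollout_lengths:
        if tail_ratio(rollout_lengths) > skew_threshold:  # threshold is an assumption
            wall = "long_tail"
    return wall, LOOKUP[wall]

# Hypothetical batch: 90 short rollouts and 10 very long ones.
lengths = [100] * 90 + [2000] * 10
row, advice = recommend("rollout", lengths)
print(row, advice["frameworks"])
```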

4 · Workload recipes

Workload	Reasonable starting design	Upgrade when…
One-node 7B prototype	TRL or a simple verl/OpenRLHF recipe; prioritize reward correctness and evals. Colocate — τ_S is a pointer swap.	You wait on rollout more than you debug the algorithm.
7B–32B RLVR research	OpenRLHF or verl, vLLM/SGLang rollout, FSDP/ZeRO learner.	Output lengths go long-tailed or weight sync shows up in traces.
70B reasoning model	Disaggregated actors + learner pool, versioned trajectory queue, bounded async.	Sync barriers dominate or actor/learner want independent sizing.
405B / MoE	Megatron/FSDP-aware framework with an explicit reshard/sync plan; never naive full fanout (810 GB).	You are about to add actor copies: put relay, bucketed update, or DMA sync in place first.
Long-CoT math/code	Actor-heavy layout, continuous batching, dynamic sampling, length/time controls.	Add Laminar/AReaL trajectory async when the p95 tail wastes the cluster.
Agentic coding/web/tool RL	Separate env-runner fleet, sandbox logs, token-faithful traces, async reward queues.	Move to Agent Lightning / Relax service decoupling if the agent runtime is large.
Multimodal RL	Framework with VLM data processors, media transport, env isolation, modality-aware batching.	NeMo-RL, verl-omni, or Relax-style service roles get more attractive.

5 · The minimum production design

Independent of framework, a serious RL platform must expose these controls — each is a hook the earlier lessons proved you need:

Versioned weights — every actor knows which policy version it served; the learner records which it trained on. (09c §4: needed to compute ρ.)
Versioned trajectories — token IDs, logprobs, rewards, masks, env events, verifier logs, all replayable. (09c §4.)
Role-level utilization — actor, reward, learner, queue, sync metrics separated. (09a: you can't find the wall without the split.)
Freshness SLO — max policy lag configurable and evaluated against reward quality. (09c §3.)
Tail controls — timeout, abort, retract, repack, length-bucket policies, explicit. (09c §1–2.)
Weight-sync trace — gather, convert, broadcast, load, visibility timestamps measured separately. (09b §2: four operations, four levers.)
Reward audits — hidden evals and verifier consistency checks run continuously, not after the expensive run.
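
The first, second, and fourth controls fit in one small record type. This is a sketch under stated assumptions, not any framework's schema: `Trajectory`, `split_by_freshness`, and `importance_ratios` are hypothetical names, and policy lag is taken to be the learner version minus the version the actor served.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    policy_version: int           # version the actor served
    token_ids: tuple[int, ...]    # captured at the server boundary
    logprobs: tuple[float, ...]   # behavior-policy logprobs, one per token
    reward: float

def split_by_freshness(batch, learner_version: int, max_lag: int):
    """Apply the freshness SLO: lag = learner version - actor version."""
    fresh = [t for t in batch if learner_version - t.policy_version <= max_lag]
    stale = [t for t in batch if learner_version - t.policy_version > max_lag]
    return fresh, stale

def importance_ratios(traj: Trajectory, current_logprobs):
    """Per-token rho = pi_theta / pi_{theta-Delta}, computed in log space."""
    return [math.exp(new - old) for new, old in zip(current_logprobs, traj.logprobs)]

# Hypothetical batch: one near-on-policy trajectory, one eight versions behind.
batch = [
    Trajectory(10, (5, 9), (-0.7, -1.2), 1.0),
    Trajectory(3, (5, 2), (-0.4, -2.0), 0.0),
]
fresh, stale = split_by_freshness(batch, learner_version=11, max_lag=4)
rho = importance_ratios(fresh[0], current_logprobs=(-0.7, -1.0))
print(len(fresh), len(stale), rho)
```

Without the stored version and logprobs, neither the lag nor ρ can be computed after the fact, which is why both belong on the record rather than in a side log.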

The staff-level answer

"I'd start synchronous to validate the reward and the loss. Then add disaggregated rollout serving once a trace shows generation is the wall. Then loosen to bounded async — but only after measuring policy-lag quality at fixed GPU-hours, not steps/hour. The framework choice follows that path; it doesn't lead it."

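Comparing at fixed GPU-hours means reading each run's eval-reward curve at the same compute budget rather than at the same step count. The sketch below uses linear interpolation; the curves, the budget, and the function name are hypothetical.

```python
from bisect import bisect_left

def reward_at(curve, budget: float) -> float:
    """Eval reward at a GPU-hour budget, interpolated linearly.

    curve: list of (gpu_hours, eval_reward), sorted by gpu_hours.
    """
    hours = [h for h, _ in curve]
    if not hours[0] <= budget <= hours[-1]:
        raise ValueError("budget outside the measured range")
    i = bisect_left(hours, budget)
    if hours[i] == budget:
        return curve[i][1]
    (h0, r0), (h1, r1) = curve[i - 1], curve[i]
    return r0 + (r1 - r0) * (budget - h0) / (h1 - h0)

# Hypothetical runs: (GPU-hours, eval reward).
sync_run = [(0, 0.20), (100, 0.40), (200, 0.50)]
async_run = [(0, 0.20), (100, 0.44), (200, 0.48)]

budget = 150
sync_reward = reward_at(sync_run, budget)
async_reward = reward_at(async_run, budget)
print(f"at {budget} GPU-hours: sync {sync_reward:.3f}, bounded async {async_reward:.3f}")
```

In this made-up data the async run leads at 150 GPU-hours and trails at 200, which is the kind of crossover a steps/hour chart would hide.
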
Interactive · framework design selector

Set the workload shape. The selector names the likely architecture pattern and framework bias — the output of running §1's readings for you. It is a design hint, not a universal winner; real selection also weighs team expertise, cluster topology, and how much framework code you can debug.


The recommendation is a planning hint. It encodes the §3 lookup: size → τ_T/τ_S, output → τ_R, env → τ_V, staleness → placement.

Selector inputs: model size, output shape, env cost, staleness, backend preference.

Selector outputs: architecture, framework bias, first optimization, risk.
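
The selector's own rules are not recoverable from the page text, so the following is only an illustrative stand-in that reads §3 and §4 into a function. Every threshold (8B, 70B), the rule ordering, and the `backend` string are assumptions.

```python
def select(model_b: float, long_tail: bool, env_heavy: bool,
           staleness_ok: bool, backend: str = "any") -> dict[str, str]:
    """Map a workload shape to a design hint. Rules are illustrative only."""
    if env_heavy:
        return {
            "architecture": "separate env-runner fleet, async reward queue",
            "framework_bias": "Relax/AsyncFlow; Agent Lightning for existing agents",
            "first_optimization": "tau_V: sandbox pool, caching, timeouts",
            "risk": "queue policy becomes part of correctness",
        }
    if long_tail and staleness_ok:
        return {
            "architecture": "trajectory-level async with partial rollout",
            "framework_bias": "Laminar, AReaL",
            "first_optimization": "tau_R tail: repacking, abort/retract",
            "risk": "versioning and staleness accounting",
        }
    if model_b >= 70:  # assumed cutoff for 'weight sync matters'
        bias = "slime" if backend == "sglang+megatron" else "verl HybridFlow, LlamaRL"
        return {
            "architecture": "disaggregated actors and learner, explicit sync plan",
            "framework_bias": bias,
            "first_optimization": "tau_S: reshard in place, bucketed or direct sync",
            "risk": "naive full weight fanout",
        }
    if model_b <= 8:  # assumed cutoff for a one-node prototype
        return {
            "architecture": "colocated, single node",
            "framework_bias": "TRL or a simple verl/OpenRLHF recipe",
            "first_optimization": "reward correctness and evals",
            "risk": "waiting on rollout as outputs grow",
        }
    return {
        "architecture": "vLLM/SGLang rollout with FSDP/ZeRO learner",
        "framework_bias": "OpenRLHF or verl",
        "first_optimization": "tau_R: continuous batching",
        "risk": "long-tailed outputs or weight sync showing up in traces",
    }

hint = select(model_b=32, long_tail=True, env_heavy=False, staleness_ok=True)
print(hint["architecture"], "|", hint["framework_bias"])
```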

6 · Anti-patterns

Anti-pattern	Why it fails	Better
Pick the PPO framework before measuring the rollout wall	The algorithm is rarely the bottleneck; generation usually is (09a).	Trace actor, reward, learner, sync separately on a small run.
Naive full weight reload every update	At 70B+ this can eat the whole iteration or force tiny actor fleets (09b §2).	Reshard in place, bucket, direct sync, or sync less often.
Async with no policy-version accounting	You can't tell real learning from stale-data noise — ρ is uncomputable (09c §4).	Log policy version and stale-by-version on every trajectory.
Drop long samples silently	You train away the long reasoning you wanted (09c §2).	Make timeout/length filters explicit, sampled, auditable.
Retokenize agent transcripts later	Token/logprob drift corrupts the RL loss and ρ.	Capture generated token IDs and logprobs at the server boundary.
Scale envs without sandbox isolation	One flaky tool poisons throughput and reward labels.	Hermetic sandboxes, retries, deterministic artifacts, verifier logs.
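
The "drop long samples silently" row has a simple alternative: make the filter return what it dropped. This is a minimal sketch with hypothetical names; it assumes each sample is a dict with a `token_ids` list and that a fixed fraction of dropped samples is kept for manual audit.

```python
import random

def filter_long(samples, max_tokens: int, audit_rate: float = 0.1, seed: int = 0):
    """Length filter that reports its drops and keeps a sampled audit set."""
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    kept, dropped, audit = [], [], []
    for sample in samples:
        if len(sample["token_ids"]) <= max_tokens:
            kept.append(sample)
        else:
            dropped.append(sample)
            if rng.random() < audit_rate:
                audit.append(sample)
    report = {
        "total": len(samples),
        "dropped": len(dropped),
        "drop_rate": len(dropped) / len(samples) if samples else 0.0,
        "audited": len(audit),
        "max_tokens": max_tokens,
    }
    return kept, audit, report

# Hypothetical batch with one over-length rollout.
samples = [{"token_ids": [0] * n} for n in (10, 50, 200)]
kept, audit, report = filter_long(samples, max_tokens=100, audit_rate=1.0)
print(report)
```

Logging the report every step makes a rising drop rate visible before it quietly shortens the model's reasoning.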

What carries forward

Framework choice is the last decision, not the first: read model size, output length, env cost, staleness tolerance → those name the binding wall → the wall names the framework.
Every framework is a bet on one wall. verl/HybridFlow bets on layout conversion; LlamaRL on τ_S; AReaL/Laminar on the rollout tail; Relax/AsyncFlow on service decoupling; Agent Lightning on agent observability.
The escalation path is fixed: synchronous (validate reward + loss) → disaggregated (when τ_R binds) → bounded async (only after measuring reward at fixed GPU-hours).
The minimum platform is the same regardless of framework: versioned weights + trajectories, role-level utilization, a freshness SLO, tail controls, a sync trace, and continuous reward audits.
The whole Part-IV arc in one line: RL post-training is an inference system and a training system wired in a loop; you design it by finding which term of τ_step binds, and the right framework is whichever one optimizes that term best.

Sources used

Source	Optimization lesson used
verl / HybridFlow	Hybrid controller, FSDP/Megatron + vLLM/SGLang backends, 3D-HybridEngine reshard.
OpenRLHF docs / paper	Ray + vLLM + DeepSpeed/ZeRO, async/partial rollout, agent execution.
slime / SGLang RL	SGLang-native rollout, Megatron training, bucketed updates, abort/retract.
NeMo-RL / spec-decode paper	Multimodal post-training and system-integrated speculative decoding.
LlamaRL	Fully async PyTorch, direct-memory weight sync, large-model speedups.
AReaL	Fully async generation/training and staleness-aware PPO.
Laminar	Trajectory-level asynchrony, relay workers, dynamic repacking.
AsyncFlow / Relax	Streaming TransferQueues, service decoupling, explicit staleness control.
Agent Lightning	Training-agent disaggregation for arbitrary agent runtimes.
Trajectory Balance with Asynchrony	Decoupled search and learning for sparse-reward post-training.