
Platforms & tools — from OpenAI Gym to Ray

Every equation in this course turns into a loop of two function calls. This lesson is the engineering substrate that makes that loop runnable, reproducible, and — when you have a thousand machines — scalable.

What this lesson reuses

Nothing here is a new algorithm. This is the lesson where the whole course meets a keyboard.

So this lesson is glue. But the glue is where research becomes results — and where the famous reproducibility crisis in RL lives.

The environment API is lesson 01

Recall the loop from orientation. The abstract loop comes first, followed by the literal Gymnasium code that implements it. Read them together: they are the same five steps.

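repeat:
    observe state  s
    choose action  a ~ π(·|s)
    receive r, s'
    learn from (s,a,r,s')

import gymnasium as gym

env = gym.make("CartPole-v1")   # policy and agent below are placeholders for your own code
obs, info = env.reset(seed=0)
done = False
while not done:
    action = policy(obs)                                     # a ~ π(·|s)
    next_obs, reward, terminated, truncated, info = env.step(action)
    agent.learn(obs, action, reward, next_obs, terminated)   # (s,a,r,s')
    obs = next_obs                                           # s ← s'
    done = terminated or truncated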

The contract is tiny and that is the point — it is the same contract for CartPole, Atari, a robot arm, or an LLM rollout harness:

| Call | Returns | Maps to (lesson 01) |
| --- | --- | --- |
| env.reset(seed) | obs, info | the initial state s0 of a fresh episode |
| env.step(a) | obs, reward, terminated, truncated, info | one tick: next state s', reward r, and whether the episode ended |

terminated vs truncated: a real bug, not a pedantic distinction
Old Gym returned a single done. Gymnasium splits it: terminated means the MDP reached a true terminal state (the pole fell, the goal was reached) — there is no future return, so you bootstrap with 0. truncated means you hit a time limit but the world would have continued — so you should bootstrap with γ·V(s'), not 0. Collapsing them into one done silently biases your value targets on every time-limited env. The API shape encodes a piece of the math from lesson 11.
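The bootstrapping rule above can be sketched as a one-step TD target. This is a minimal illustration, not code from any library: the function names `td_target` and `td_target_old_gym` are hypothetical, and `next_value` stands in for whatever estimate of V(s') your critic produces.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step target r + gamma * V(s'), dropping the bootstrap only on true termination."""
    # `truncated` is deliberately not an argument: a time-limit cutoff does not
    # change the target, it only tells the rollout loop to reset the env.
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


def td_target_old_gym(reward, next_value, done, gamma=0.99):
    """The biased variant: a merged `done` zeroes the bootstrap on time limits too."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap


# A step that hits the time limit (truncated, not terminated) while V(s') = 10:
correct = td_target(1.0, 10.0, terminated=False)      # keeps gamma * V(s')
biased = td_target_old_gym(1.0, 10.0, done=True)      # silently drops it
```

The gap between `correct` and `biased` is exactly the γ·V(s') term, and it shows up on every episode that ends by time limit.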

Spaces — typing the state and action sets

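Lesson 01's s and a need concrete shapes so a neural net can consume them. Every env declares two spaces: env.observation_space (the states it can emit) and env.action_space (the actions it accepts).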

The common space types are exactly the two cases our algorithms split on:

| Space | Example | Which algorithms (this course) |
| --- | --- | --- |
| Discrete(n) | 4 moves in a gridworld | Q-learning / DQN (L02), softmax policy (L03): argmax over n |
| Box(low, high, shape) | a torque vector ∈ ℝ⁶ | continuous control: DDPG / TD3 / SAC (L15); argmax is impossible |

This is why lesson 05 and lesson 18 exist as separate lessons: the type of the action space decides whether you can take a max over actions. The platform makes that type a first-class, checkable object — env.action_space.sample() gives you a random legal action, the simplest possible exploration policy.
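To make the "type decides whether you can take a max" point concrete, here is a small stand-in for the two space types. These classes are hypothetical simplifications written for this sketch, not Gymnasium's implementation (the real `Box` takes array bounds and a shape); `greedy_action` is likewise an illustrative helper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrete:
    """Stand-in for a finite action set {0, ..., n-1}."""
    n: int

    def sample(self, rng):
        return rng.randrange(self.n)

    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n


@dataclass(frozen=True)
class Box:
    """Stand-in for a bounded continuous set [low, high]^dim."""
    low: float
    high: float
    dim: int

    def sample(self, rng):
        return [rng.uniform(self.low, self.high) for _ in range(self.dim)]

    def contains(self, a):
        return len(a) == self.dim and all(self.low <= x <= self.high for x in a)


def greedy_action(space, q_values):
    """argmax over actions: only defined when the actions can be enumerated."""
    if not isinstance(space, Discrete):
        raise TypeError("cannot enumerate a continuous action set; use an actor network")
    return max(range(space.n), key=lambda a: q_values[a])


rng = random.Random(0)
moves = Discrete(4)                  # 4 moves in a gridworld
torque = Box(-1.0, 1.0, 6)           # a torque vector in R^6
random_move = moves.sample(rng)      # random legal action: the simplest exploration policy
random_torque = torque.sample(rng)
best_move = greedy_action(moves, [0.1, 0.7, 0.2, 0.0])
```

Calling `greedy_action(torque, ...)` raises a `TypeError`, which is the same split the course makes between value-based and actor-critic methods.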

Vectorized envs — the cheapest parallelism

One env steps once per call. A neural-net policy on a GPU would love to score a batch of states at once. A vectorized env bundles N copies and steps them together:

envs = gym.make_vec("CartPole-v1", num_envs=8)
obs, info = envs.reset(seed=0)          # obs.shape == (8, 4)
actions   = policy(obs)                 # one batched forward pass → (8,)
obs, rewards, term, trunc, info = envs.step(actions)   # 8 transitions at once

Now one policy forward pass produces 8 actions, and you collect 8 transitions per tick. This is pure throughput: the GPU stays busy, and the 8 streams are decorrelated — which, as lesson 09 argued, is what lets on-policy methods learn stably without a replay buffer. Vectorized envs are A3C's decorrelation idea on a single machine.
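What "bundles N copies and steps them together" means can be sketched by hand. `CountdownEnv` and `SyncVectorEnv` below are hypothetical toy classes for illustration; the sketch resets a finished sub-env in the same step, which is an assumption of this example, and Gymnasium's own vector envs have their own reset semantics that it does not try to match.

```python
class CountdownEnv:
    """Toy env: obs counts down from `start`; the episode terminates at 0."""

    def __init__(self, start):
        self.start = start
        self.t = start

    def reset(self):
        self.t = self.start
        return self.t, {}

    def step(self, action):
        self.t -= 1
        terminated = self.t == 0
        return self.t, 1.0, terminated, False, {}


class SyncVectorEnv:
    """Steps N envs in lockstep and returns one batch per call."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset()[0] for env in self.envs]

    def step(self, actions):
        obs, rewards, terms, truncs = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, term, trunc, _ = env.step(action)
            if term or trunc:
                o, _ = env.reset()   # auto-reset so the batch never stalls
            obs.append(o)
            rewards.append(r)
            terms.append(term)
            truncs.append(trunc)
        return obs, rewards, terms, truncs


envs = SyncVectorEnv([CountdownEnv(2), CountdownEnv(3)])
first_obs = envs.reset()
history = [envs.step([0, 0]) for _ in range(3)]   # dummy actions; 2 transitions per tick
```

Because the two sub-envs have different episode lengths, they drift out of phase after the first reset, which is a small-scale version of the decorrelation the text describes.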

Ray & RLlib — A3C, productionized

Vectorized envs scale until one machine runs out of cores. Lesson 09's A3C answered "more actors" with asynchronous workers. Ray is a general distributed-execution framework; RLlib is its RL library, and its architecture is lesson 09 drawn at cluster scale:

[Diagram: rollout workers 1 … N, each running env.step × many, send trajectories → to the learner (holds θ); the learner takes a gradient step and broadcasts the updated weights θ ← back to every worker.]

Workers run rollouts (each one is just the lesson-01 reset/step loop), ship trajectories to the learner, the learner takes a gradient step and broadcasts fresh weights back. Swap "trajectories" for "tokens" and "learner" for "trainer" and you have the LLM post-training topology — which is exactly where the systems course picks up.
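The worker/learner cycle described above can be sketched on one machine with threads standing in for cluster workers. This does not use Ray or RLlib; the toy task (reward peaks when the action equals `TARGET`), the Gaussian policy, and the names `rollout_worker`, `learner_step`, and `train` are all assumptions made for this illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

TARGET = 3.0  # hypothetical toy task: reward is highest when the action equals TARGET


def rollout_worker(theta, seed, num_steps=64):
    """Sample actions a ~ N(theta, 1) and return the (action, reward) trajectory."""
    rng = random.Random(seed)
    trajectory = []
    for _ in range(num_steps):
        action = rng.gauss(theta, 1.0)
        reward = -(action - TARGET) ** 2
        trajectory.append((action, reward))
    return trajectory


def learner_step(theta, trajectories, lr=0.05):
    """One REINFORCE-style gradient step on the pooled samples, with a mean-reward baseline."""
    samples = [pair for traj in trajectories for pair in traj]
    baseline = sum(r for _, r in samples) / len(samples)
    grad = sum((a - theta) * (r - baseline) for a, r in samples) / len(samples)
    return theta + lr * grad


def train(num_workers=4, iterations=200):
    theta = 0.0                                   # the learner holds the weights
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for it in range(iterations):
            # Broadcast the current theta; each worker gets its own seed.
            seeds = [it * num_workers + w for w in range(num_workers)]
            trajectories = list(pool.map(rollout_worker, [theta] * num_workers, seeds))
            theta = learner_step(theta, trajectories)   # gradient step on all trajectories
    return theta


final_theta = train()
```

This version is synchronous: the learner waits for every worker before each update. An asynchronous design in the spirit of A3C would let workers report back at their own pace, at the cost of acting on slightly stale weights.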

Algorithm libraries — don't re-derive, fork

You have spent 29 lessons learning why PPO clips and why SAC adds entropy. In practice you reach for a reference implementation:

| Library | Philosophy | Best for |
| --- | --- | --- |
| Stable-Baselines3 | polished, modular, batteries-included (DQN, PPO, SAC, TD3, …) | applying a known-good algorithm to your env |
| CleanRL | single-file, no abstraction: one algorithm = one readable script | understanding and modifying an algorithm; research |
| RLlib (Ray) | distributed-first, scales to clusters | throughput, many workers, production |

The tradeoff is the usual one: SB3's abstraction saves you typing but hides the loop; CleanRL's single file shows you everything but you copy-paste to extend it. For learning, read CleanRL — its PPO file is the cleanest way to see lessons 10–14 as runnable code.

Seeds & the reproducibility crisis

RL results are notoriously hard to reproduce. The estimator is high-variance (lesson 10 was an entire lesson about that), so two runs that differ only in random seed can land in completely different places — one "solves" the task, one never learns. A 2018 study reran the same algorithm on the same benchmark with different seeds and got results that, taken individually, would support opposite conclusions.

The seeds you must pin
Reproducibility means seeding every source of randomness: the env (env.reset(seed=k)), the action sampling, the network init, and the framework RNG (NumPy / PyTorch). And you must report results over many seeds with their spread — a single learning curve from one lucky seed is not evidence. The variance from lesson 10 is not just a training nuisance; it is a scientific one.
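A minimal sketch of the "many seeds, with their spread" habit, using only the standard library. `run_experiment` is a hypothetical stand-in for a full training run (a noisy random walk, not a real algorithm); in a real project the same seed would also be passed to env.reset(seed=k), to NumPy, and to torch.manual_seed(k).

```python
import random
import statistics


def run_experiment(seed, steps=200):
    """Stand-in for one training run: returns a noisy final score."""
    rng = random.Random(seed)            # one private RNG, fully determined by the seed
    score = 0.0
    for _ in range(steps):
        score += rng.gauss(0.05, 1.0)    # small drift, large noise
    return score


seeds = [0, 1, 2, 3, 4]
finals = [run_experiment(s) for s in seeds]

mean = statistics.mean(finals)
spread = statistics.stdev(finals)        # sample standard deviation across seeds
report = f"final score over {len(seeds)} seeds: {mean:.2f} ± {spread:.2f}"
```

Rerunning with the same seed gives the same number, while different seeds give visibly different ones; the reported quantity is the mean and spread, never a single run.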

Experiment-tracking tools (TensorBoard, Weights & Biases) exist for the same reason: when a run's outcome depends on a seed, you must log the seed, the exact config, the code version, and the full curve — not just the final number — so a result can be re-run and trusted.
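What a tracking tool stores per run can be sketched as a plain JSON record. This is not the TensorBoard or Weights & Biases API; `make_run_record` is a hypothetical helper, and the config values and curve numbers are placeholders, not real results.

```python
import json


def make_run_record(seed, config, code_version, curve):
    """Bundle everything needed to re-run and audit one training run."""
    return {
        "seed": seed,
        "config": config,
        "code_version": code_version,
        "curve": list(curve),            # the full learning curve, not just the end
        "final_return": curve[-1],
    }


record = make_run_record(
    seed=0,
    config={"algo": "ppo", "lr": 3e-4, "env": "CartPole-v1"},   # placeholder config
    code_version="<git commit hash>",                            # fill in from your VCS
    curve=[12.0, 48.5, 130.0, 200.0],                            # placeholder numbers
)

line = json.dumps(record, sort_keys=True)   # one line per run, append to a log file
restored = json.loads(line)
```

Writing one such line per run keeps the seed, config, and curve together, so a surprising final number can always be traced back to the run that produced it.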

Interactive · a tiny Gym-style env you can step

Below is a real reset()/step()/done contract — a 4×4 gridworld, written in vanilla JS, exposing the exact API above. The agent (blue) starts top-left; the goal (green) is bottom-right; one wall (gray) blocks a cell. Reward is −0.04 per step (a nudge to hurry) and +1.0 on reaching the goal, which terminates the episode. Hitting the 50-step cap truncates it.

Step it by hand with the arrow buttons, let a random policy (action_space.sample()) drive it, or let a greedy policy head for the goal. Watch the transition log: every line is one step() return — (obs, action, reward, terminated, truncated). The done flag is what stops the loop.

Gridworld env · reset() / step() / done
obs = (row, col) flattened to a cell index 0–15. Actions: 0↑ 1→ 2↓ 3←. Reward −0.04/step, +1 at goal. Episode ends on goal (terminated) or 50 steps (truncated).
[Interactive widget: shows the step count, obs (cell), episode return, done flag, and a transition log that starts at env.reset() → obs=0.]

The env (the reset/step/done contract, ≈25 lines):
const N = 4, GOAL = 15, WALL = 5, MAX = 50;
let pos, steps;

function reset() { pos = 0; steps = 0; return pos; }   // → obs

function step(a) {                                      // a ∈ {0,1,2,3}
  const r = Math.floor(pos / N), c = pos % N;           // row-major: pos = r * N + c
  let nr = r, nc = c;
  if (a === 0) nr--; if (a === 1) nc++;       // ↑  →
  if (a === 2) nr++; if (a === 3) nc--;       // ↓  ←
  const next = nr * N + nc;
  // illegal (off-grid or wall) → stay put
  if (nr >= 0 && nr < N && nc >= 0 && nc < N && next !== WALL) pos = next;
  steps++;
  const terminated = (pos === GOAL);                    // reached the goal
  const truncated  = !terminated && (steps >= MAX);     // hit the 50-step cap
  const reward     = terminated ? 1.0 : -0.04;
  return [pos, reward, terminated, truncated];          // the Gym contract
}
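For readers who would rather drive the same contract from Python, here is a sketch of the gridworld above with a random policy and a hand-written greedy policy. It is a port written for this example, not the page's own code; `GridworldEnv`, `run_episode`, and the two policy functions are hypothetical names, and the greedy rule only works from the top-left start.

```python
import random

N, GOAL, WALL, MAX_STEPS = 4, 15, 5, 50
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}   # 0 up, 1 right, 2 down, 3 left


class GridworldEnv:
    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        row, col = divmod(self.pos, N)            # row-major cell index
        d_row, d_col = MOVES[action]
        new_row, new_col = row + d_row, col + d_col
        nxt = new_row * N + new_col
        # Illegal moves (off-grid or into the wall) leave the agent in place.
        if 0 <= new_row < N and 0 <= new_col < N and nxt != WALL:
            self.pos = nxt
        self.steps += 1
        terminated = self.pos == GOAL
        truncated = not terminated and self.steps >= MAX_STEPS
        reward = 1.0 if terminated else -0.04
        return self.pos, reward, terminated, truncated


def run_episode(env, policy):
    """Run one episode; log one (obs, action, reward, terminated, truncated) per step()."""
    obs = env.reset()
    total, log = 0.0, []
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated = env.step(action)
        log.append((obs, action, reward, terminated, truncated))
        total += reward
        if terminated or truncated:
            return total, log


rng = random.Random(0)


def random_policy(obs):
    return rng.randrange(4)                       # the action_space.sample() analogue


def greedy_policy(obs):
    # Go right along the top row, then straight down the last column.
    return 1 if obs % N < N - 1 else 2


greedy_return, greedy_log = run_episode(GridworldEnv(), greedy_policy)
random_return, random_log = run_episode(GridworldEnv(), random_policy)
```

The greedy episode ends with `terminated` set on its last log entry; a random episode ends either the same way or with `truncated` set at the step cap, and in both cases the flags are what stop the loop.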
WHERE THIS GOES NEXT (SYSTEMS)
The rollout-worker → learner picture above is the whole story of distributed RL at LLM scale, where each "env" is a generation engine producing token sequences and the learner is a sharded trainer across hundreds of GPUs. For how that topology is actually laid out on a cluster — placement, weight sync, and the rollout/training split — see the systems course: 19 · Topology & placement, and its full index, RL Post-Training — From First Principles.

Mapping back to the spine

This lesson added no theory; it made the theory run.

Takeaway
Platforms turn the course into running, scalable code: the Gym reset/step/done contract is lesson 01's MDP loop made literal, and Ray/RLlib is lesson 09's parallel actors productionized into rollout workers feeding a learner. The algorithms were the hard part; the platforms are what let those algorithms touch a thousand environments at once — and what make a reported result reproducible enough to believe.