Platforms & tools — from OpenAI Gym to Ray

Every equation in this course turns into a loop of two function calls. This lesson is the engineering substrate that makes that loop runnable, reproducible, and — when you have a thousand machines — scalable.

What this lesson reuses

Nothing here is a new algorithm. This is the lesson where the whole course meets a keyboard:

Lesson 01 — the agent–environment loop. observe state → choose action → receive reward and next state → repeat. The Gym / Gymnasium API is that exact loop, frozen into two functions: reset() and step(). We will show they are the same object.
Lesson 09 — A3C's parallel actors. Many workers rolling out in parallel, decorrelating experience, feeding one learner. Ray / RLlib is that idea productionized: rollout workers on many machines feeding a central learner.
Lessons 05–21 — every algorithm. You almost never re-implement DQN, PPO, or SAC from scratch in production. Libraries (Stable-Baselines3, CleanRL) are the reference implementations everyone forks.

So this lesson is glue. But the glue is where research becomes results — and where the famous reproducibility crisis in RL lives.

The environment API is lesson 01

Recall the loop from orientation. On the left is the abstract loop; on the right is the literal Gymnasium code that implements it. Read them side by side — they are the same five lines.

repeat:
    observe state  s
    choose action  a ~ π(·|s)
    receive r, s'
    learn from (s,a,r,s')

obs, info = env.reset(seed=0)
done = False
while not done:
    action = policy(obs)                        # a ~ π(·|s)
    next_obs, reward, terminated, \
        truncated, info = env.step(action)      # r, s'
    agent.learn(obs, action, reward, next_obs)  # (s,a,r,s')
    obs = next_obs                              # s ← s'
    done = terminated or truncated

The contract is tiny and that is the point — it is the same contract for CartPole, Atari, a robot arm, or an LLM rollout harness:

Call	Returns	Maps to (lesson 01)
`env.reset(seed)`	`obs, info`	the initial state s₀ of a fresh episode
`env.step(a)`	`obs, reward, terminated, truncated, info`	one tick: next state s', reward r, and whether the episode ended

terminated vs truncated — a real bug, not a pedantic distinction

Old Gym returned a single done. Gymnasium splits it: terminated means the MDP reached a true terminal state (the pole fell, the goal was reached) — there is no future return, so you bootstrap with 0. truncated means you hit a time limit but the world would have continued — so you should bootstrap with γ·V(s'), not 0. Collapsing them into one done silently biases your value targets on every time-limited env. The API shape encodes a piece of the math from lesson 11.
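The bootstrapping rule above fits in one function. This is a minimal sketch, assuming a scalar estimate `next_value` for V(s') and a discount `gamma`; the name `td_target` is illustrative and not part of Gymnasium.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target r + γ·V(s'), cutting the bootstrap only on true termination.

    `truncated` is deliberately not an argument: a time-limit cutoff ends the
    rollout, but the target still bootstraps from V(s').
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


# Same transition (r = 1.0, V(s') = 10.0), three ways the step can end.
mid_episode = td_target(1.0, 10.0, terminated=False)   # ordinary step
at_terminal = td_target(1.0, 10.0, terminated=True)    # pole fell: no future return
at_timeout = td_target(1.0, 10.0, terminated=False)    # truncated: the world continues

# A single `done` flag would be True at the timeout and wrongly zero the bootstrap.
biased_timeout = td_target(1.0, 10.0, terminated=True)
```

The gap between `at_timeout` and `biased_timeout` is the bias the old single-flag API introduced at every time limit.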

Spaces — typing the state and action sets

Lesson 01's s and a need concrete shapes so a neural net can consume them. Every env declares two:

env.observation_space — what an obs looks like.
env.action_space — what a valid action looks like.

The common space types are exactly the two cases our algorithms split on:

Space	Example	Which algorithms (this course)
`Discrete(n)`	4 moves in a gridworld	Q-learning / DQN (L02), softmax policy (L03) — argmax over n
`Box(low, high, shape)`	a torque vector ∈ ℝ⁶	continuous control: DDPG / TD3 / SAC (L15) — argmax is impossible

This is why lesson 05 and lesson 18 exist as separate lessons: the type of the action space decides whether you can take a max over actions. The platform makes that type a first-class, checkable object — env.action_space.sample() gives you a random legal action, the simplest possible exploration policy.
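To make "first-class, checkable object" concrete, here is a sketch with two simplified stand-ins. `DiscreteSpace`, `BoxSpace`, and `greedy_action` are hypothetical names written for this example; they mimic the idea of Gymnasium's spaces but are not its classes, and their `sample(rng)` signature is an assumption of the sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class DiscreteSpace:
    """Stand-in for a discrete action set {0, ..., n-1}."""
    n: int

    def sample(self, rng):
        return rng.randrange(self.n)

    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n


@dataclass
class BoxSpace:
    """Stand-in for a bounded continuous set [low, high]^dim."""
    low: float
    high: float
    dim: int

    def sample(self, rng):
        return [rng.uniform(self.low, self.high) for _ in range(self.dim)]

    def contains(self, a):
        return len(a) == self.dim and all(self.low <= x <= self.high for x in a)


def greedy_action(space, q_values):
    """Argmax over actions: only defined when the actions can be enumerated."""
    if not isinstance(space, DiscreteSpace):
        raise TypeError("cannot enumerate a continuous space; use a parameterized policy")
    return max(range(space.n), key=lambda a: q_values[a])


rng = random.Random(0)
moves = DiscreteSpace(4)           # 4 moves in a gridworld
torque = BoxSpace(-1.0, 1.0, 6)    # a torque vector in R^6

random_move = moves.sample(rng)    # the simplest exploration policy
random_torque = torque.sample(rng)
best_move = greedy_action(moves, [0.1, 0.7, 0.2, 0.0])   # index of the largest Q-value
```

Calling `greedy_action(torque, ...)` raises a `TypeError`, which is the type-level version of the fork between value-based and policy-based methods.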

Vectorized envs — the cheapest parallelism

One env steps once per call. A neural-net policy on a GPU would love to score a batch of states at once. A vectorized env bundles N copies and steps them together:

import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=8)
obs, info = envs.reset(seed=0)          # obs.shape == (8, 4)
actions   = policy(obs)                 # one batched forward pass → (8,)
obs, rewards, term, trunc, info = envs.step(actions)   # 8 transitions at once

Now one policy forward pass produces 8 actions, and you collect 8 transitions per tick. This is pure throughput: the GPU stays busy, and the 8 streams are decorrelated — which, as lesson 09 argued, is what lets on-policy methods learn stably without a replay buffer. Vectorized envs are A3C's decorrelation idea on a single machine.
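What "bundles N copies and steps them together" means mechanically can be sketched in pure Python. `CoinFlipEnv` and `SyncVecEnv` are hypothetical classes for illustration only. The same-step auto-reset is a simplification, and Gymnasium's own vector envs handle resets in their own way depending on version.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy env: each step ends the episode with probability 0.2."""

    def reset(self, seed=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.rng.random() < 0.2
        truncated = not terminated and self.t >= 10
        return self.t, 1.0, terminated, truncated, {}


class SyncVecEnv:
    """Steps N env copies in lockstep and returns batched results."""

    def __init__(self, make_env, num_envs):
        self.envs = [make_env() for _ in range(num_envs)]

    def reset(self, seed=0):
        # Each copy gets its own seed so the streams differ.
        return [env.reset(seed=seed + i)[0] for i, env in enumerate(self.envs)]

    def step(self, actions):
        obs, rewards, terms, truncs = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, term, trunc, _ = env.step(action)
            if term or trunc:
                o, _ = env.reset()       # auto-reset so the batch never stalls
            obs.append(o)
            rewards.append(r)
            terms.append(term)
            truncs.append(trunc)
        return obs, rewards, terms, truncs


envs = SyncVecEnv(CoinFlipEnv, num_envs=8)
obs = envs.reset(seed=0)
transitions = 0
for _ in range(100):
    actions = [0] * len(obs)             # stands in for one batched policy forward pass
    obs, rewards, terms, truncs = envs.step(actions)
    transitions += len(rewards)
```

One hundred ticks yield 800 transitions, and because each copy has its own seed, the episodes end at different times rather than in lockstep.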

Ray & RLlib — A3C, productionized

Vectorized envs scale until one machine runs out of cores. Lesson 09's A3C answered "more actors" with asynchronous workers. Ray is a general distributed-execution framework; RLlib is its RL library, and its architecture is lesson 09 drawn at cluster scale:

Workers run rollouts (each one is just the lesson-01 reset/step loop), ship trajectories to the learner, the learner takes a gradient step and broadcasts fresh weights back. Swap "trajectories" for "tokens" and "learner" for "trainer" and you have the LLM post-training topology — which is exactly where the systems course picks up.
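The rollout, ship, update, broadcast cycle can be sketched on one machine with the standard library. This is not Ray or RLlib code: a thread pool stands in for the cluster, the loop is synchronous where A3C is asynchronous, and the one-parameter Bernoulli policy and every function name are assumptions made for the example.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rollout(worker_id, weights, iteration, episode_len=20):
    """Worker: act with the broadcast weights and return a trajectory.

    Toy task: action 1 earns reward 1, action 0 earns reward 0.
    """
    rng = random.Random(1000 * iteration + worker_id)   # per-worker, per-iteration seed
    p = sigmoid(weights["theta"])                       # P(action = 1)
    actions = [int(rng.random() < p) for _ in range(episode_len)]
    rewards = [float(a) for a in actions]
    return {"actions": actions, "rewards": rewards, "p": p}


def learner_update(weights, trajectories, lr=0.5):
    """Learner: one REINFORCE-style step (no baseline) on all workers' data."""
    # For a Bernoulli policy with p = sigmoid(theta), d/dtheta log pi(a) = a - p.
    grads = [
        (a - t["p"]) * r
        for t in trajectories
        for a, r in zip(t["actions"], t["rewards"])
    ]
    theta = weights["theta"] + lr * sum(grads) / len(grads)
    return {"theta": theta, "version": weights["version"] + 1}


weights = {"theta": 0.0, "version": 0}
with ThreadPoolExecutor(max_workers=4) as pool:
    for iteration in range(5):
        # Broadcast: every worker receives the same weights snapshot.
        futures = [pool.submit(rollout, w, weights, iteration) for w in range(4)]
        trajectories = [f.result() for f in futures]       # ship trajectories to the learner
        weights = learner_update(weights, trajectories)    # gradient step, fresh weights
```

After five rounds `weights["version"]` is 5 and `theta` has moved above zero, so the policy favors the rewarded action. The workers never see a gradient, only weights.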

Algorithm libraries — don't re-derive, fork

You have spent 29 lessons learning why PPO clips and why SAC adds entropy. In practice you reach for a reference implementation:

Library	Philosophy	Best for
Stable-Baselines3	polished, modular, batteries-included (DQN, PPO, SAC, TD3, …)	applying a known-good algorithm to your env
CleanRL	single-file, no abstraction — one algorithm = one readable script	understanding and modifying an algorithm; research
RLlib (Ray)	distributed-first, scales to clusters	throughput, many workers, production

The tradeoff is the usual one: SB3's abstraction saves you typing but hides the loop; CleanRL's single file shows you everything but you copy-paste to extend it. For learning, read CleanRL — its PPO file is the cleanest way to see lessons 10–14 as runnable code.

Seeds & the reproducibility crisis

RL results are notoriously hard to reproduce. The estimator is high-variance (lesson 10 was an entire lesson about that), so two runs that differ only in random seed can land in completely different places: one "solves" the task, one never learns. A 2018 study (Henderson et al., "Deep Reinforcement Learning that Matters") reran the same algorithm on the same benchmark with different seeds and got results that, taken individually, would support opposite conclusions.

The seeds you must pin

Reproducibility means seeding every source of randomness: the env (env.reset(seed=k)), the action sampling, the network init, and the framework RNG (NumPy / PyTorch). And you must report results over many seeds with their spread — a single learning curve from one lucky seed is not evidence. The variance from lesson 10 is not just a training nuisance; it is a scientific one.
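A sketch of both halves of that advice: pin the seed, then report the spread. `run_experiment` is a hypothetical stand-in for a training run, with one seeded `random.Random` playing the role of every randomness source. In a real project you would also pass the seed to `env.reset(seed=k)` and call `np.random.seed` and `torch.manual_seed`.

```python
import random
import statistics


def run_experiment(seed, steps=200):
    """Hypothetical stand-in for one training run; returns a final score.

    All randomness flows from one seeded generator, so the run is repeatable.
    """
    rng = random.Random(seed)
    score = rng.gauss(0.0, 1.0)              # stands in for network init
    for _ in range(steps):
        score += rng.gauss(0.01, 0.5)        # noisy progress, like a high-variance estimator
    return score


seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(s) for s in seeds]

# Report the spread across seeds, not one lucky curve.
mean = statistics.mean(scores)
spread = statistics.stdev(scores)
summary = f"{mean:.2f} ± {spread:.2f} over {len(seeds)} seeds"

# Same seed, same result: the check that pinning actually worked.
repeatable = run_experiment(0) == run_experiment(0)
```

If `repeatable` is ever `False`, some source of randomness is still unseeded.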

Experiment-tracking tools (TensorBoard, Weights & Biases) exist for the same reason: when a run's outcome depends on a seed, you must log the seed, the exact config, the code version, and the full curve — not just the final number — so a result can be re-run and trusted.
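What "log the seed, the exact config, the code version, and the full curve" looks like as data, sketched with plain JSON rather than any tracker's API. `make_run_record` is a hypothetical helper, and the config values and curve are made-up placeholders, not results.

```python
import json


def make_run_record(seed, config, code_version, curve):
    """Bundle everything needed to re-run and audit one training run."""
    return {
        "seed": seed,
        "config": config,
        "code_version": code_version,    # e.g. a git commit hash, supplied by the caller
        "curve": curve,                  # the full learning curve, not just the last point
        "final_return": curve[-1],
    }


record = make_run_record(
    seed=3,
    config={"algo": "PPO", "lr": 3e-4, "gamma": 0.99},   # illustrative values
    code_version="<commit-hash>",                         # placeholder
    curve=[12.0, 35.5, 80.2, 141.0],                      # made-up numbers for the demo
)

line = json.dumps(record, sort_keys=True)   # one JSON line per run, appendable to a log file
restored = json.loads(line)
```

A record like this is what lets someone else re-run seed 3 with the same config and check the curve, instead of trusting a single final number.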

Interactive · a tiny Gym-style env you can step

Below is a real reset()/step()/done contract — a 4×4 gridworld, written in vanilla JS, exposing the exact API above. The agent (blue) starts top-left; the goal (green) is bottom-right; one wall (gray) blocks a cell. Reward is −0.04 per step (a nudge to hurry) and +1.0 on reaching the goal, which terminates the episode. Hitting the 50-step cap truncates it.

Step it by hand with the arrow buttons, let a random policy (action_space.sample()) drive it, or let a greedy policy head for the goal. Watch the transition log: every line is one step() return — (obs, action, reward, terminated, truncated). The done flag is what stops the loop.

Gridworld env · reset() / step() / done

obs = (row, col) flattened to a cell index 0–15. Actions: 0↑ 1→ 2↓ 3←. Reward −0.04/step, +1 at goal. Episode ends on goal (terminated) or 50 steps (truncated).

The env (the reset/step/done contract, ≈25 lines):

const N = 4, GOAL = 15, WALL = 5, MAX = 50;
let pos, steps;

function reset() { pos = 0; steps = 0; return pos; }   // → obs

function step(a) {                                      // a ∈ {0,1,2,3}
  let r = Math.floor(pos / N), c = pos % N;   // row-major: pos = r * N + c
  let nr = r, nc = c;
  if (a === 0) nr--; if (a === 1) nc++;       // ↑  →
  if (a === 2) nr++; if (a === 3) nc--;       // ↓  ←
  const next = nr * N + nc;
  // illegal (off-grid or wall) → stay put
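  if (nr>=0 && nr<N && nc>=0 && nc<N && next!==WALL) pos = next;
  steps++;
  const terminated = (pos === GOAL);
  const truncated  = !terminated && (steps >= MAX);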
  const reward     = terminated ? 1.0 : -0.04;
  return [pos, reward, terminated, truncated];          // the Gym contract
}
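For comparison with the Python snippets earlier in the lesson, here is a sketch of the same gridworld as a Python class that returns the Gymnasium-style 5-tuple. It is a hand-written port, not the page's widget code. The class name `GridEnv`, the helper `run_episode`, and the choice that truncation only applies when the episode has not terminated are assumptions of this sketch.

```python
import random

N, GOAL, WALL, MAX_STEPS = 4, 15, 5, 50
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}   # 0↑ 1→ 2↓ 3←


class GridEnv:
    """4x4 gridworld with Gymnasium-style reset/step return shapes."""

    def reset(self, seed=None):
        # The env is deterministic, so `seed` is accepted but unused.
        self.pos, self.steps = 0, 0
        return self.pos, {}

    def step(self, action):
        row, col = divmod(self.pos, N)           # row-major cell index
        d_row, d_col = MOVES[action]
        new_row, new_col = row + d_row, col + d_col
        nxt = new_row * N + new_col
        # Illegal move (off-grid or into the wall): stay put.
        if 0 <= new_row < N and 0 <= new_col < N and nxt != WALL:
            self.pos = nxt
        self.steps += 1
        terminated = self.pos == GOAL
        truncated = not terminated and self.steps >= MAX_STEPS
        reward = 1.0 if terminated else -0.04
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy):
    """The lesson-01 loop: reset, then step until terminated or truncated."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total, terminated, truncated


# Random policy: the equivalent of action_space.sample().
rng = random.Random(0)
random_return, _, _ = run_episode(GridEnv(), lambda obs: rng.randrange(4))

# Fixed path: right x3 then down x3 reaches the goal in 6 steps, avoiding the wall at cell 5.
path = iter([1, 1, 1, 2, 2, 2])
path_return, path_terminated, path_truncated = run_episode(GridEnv(), lambda obs: next(path))
```

The fixed path pays five step penalties and then collects the goal reward, so its return is about 0.8 and it ends with `terminated` rather than `truncated`.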

WHERE THIS GOES NEXT (SYSTEMS)

The rollout-worker → learner picture above is the whole story of distributed RL at LLM scale, where each "env" is a generation engine producing token sequences and the learner is a sharded trainer across hundreds of GPUs. For how that topology is actually laid out on a cluster — placement, weight sync, and the rollout/training split — see the systems course: 19 · Topology & placement, and its full index, RL Post-Training — From First Principles.

Mapping back to the spine

This lesson added no theory; it made the theory run:

The Gym/Gymnasium API (reset/step/done) is lesson 01's agent–environment loop, frozen into code. The terminated/truncated split even encodes the bootstrapping rule from lesson 11.
Spaces (Discrete vs Box) are the typed version of the value-vs-policy fork: discrete actions admit an argmax (DQN, L02), continuous ones force a parameterized policy (SAC, L15).
Vectorized envs and Ray/RLlib are lesson 09's parallel actors — first on one machine, then across a cluster — feeding one learner.
Seeds and tracking are the operational answer to lesson 10's villain: the estimator's variance is a reproducibility problem, not just a training one.

Takeaway

Platforms turn the course into running, scalable code: the Gym reset/step/done contract is lesson 01's MDP loop made literal, and Ray/RLlib is lesson 09's parallel actors productionized into rollout workers feeding a learner. The algorithms were the hard part; the platforms are what let those algorithms touch a thousand environments at once — and what make a reported result reproducible enough to believe.