Anatomy of an environment — spaces, the step/reset contract, and episodes

Lesson 01 wrote the agent–environment loop and gave the environment two jobs: return a reward, return the next state. This lesson opens the environment up. It is not a vague "world" — it is a small, precise object with an interface, two type signatures, and an episode structure. Get that object right and every algorithm in the course has something correct to talk to; get it wrong and they all learn nonsense from a clean-looking bug.

The one idea

The agent owns the policy π. The environment owns everything else in the MDP — the transition P and the reward R — and exposes them through exactly one interface: reset() hands you a first observation, step(a) takes an action and hands back the next observation, a reward, and whether the episode is over. Learn that contract and you can build, swap, or debug any environment.

Who owns what

RL's hardest bugs come from blurring the line between the two halves of lesson 01's loop. So draw it sharply. The agent is the only thing you train; it is a function from observations to actions. The environment is everything else, and crucially it holds the parts of the MDP (𝒮, 𝒜, P, R, γ) the agent is not allowed to see directly:

                 a_t  (the only thing the agent emits)
   ┌─────────┐   ───────────────►   ┌──────────────────────────┐
   │  AGENT  │                      │       ENVIRONMENT          │
   │         │                      │  owns the true state s_t   │
   │   π(a|o)│                      │  owns P(s'|s,a)  (dynamics)│
   │         │                      │  owns R(s,a)     (reward)  │
   └─────────┘   ◄───────────────   └──────────────────────────┘
                 o_{t+1}, r_{t+1}, done
        (an OBSERVATION — maybe not the full state; lesson 03)

Notice the return arrow says observation, not state. In lesson 01 we assumed they were the same — a Markov state. They often aren't, and that gap is the whole of lesson 03. For now hold the boundary: the agent proposes an action; the environment is the sole authority on what happens next and what it's worth.

The contract: reset and step

Lesson 01's loop — "observe, act, receive reward and next state, repeat" — becomes two function calls. This is the contract every environment in every framework implements (we meet the real Gym/Gymnasium version in lesson 65; here is its skeleton):

obs = env.reset()                 # start a fresh episode → first observation
done = False                      # no end flag has fired yet
while not done:
    a = policy(obs)               # the AGENT's job
    obs, reward, terminated, truncated, info = env.step(a)
    done = terminated or truncated    # either flag stops the LOOP

Two calls, and they map one-to-one onto the MDP. reset() samples a start state s₀ and returns its observation. step(a) does four things at once: sample s' ∼ P(·|s,a), compute r = R(s,a), decide whether the episode is over, and return the observation of s'. The agent never touches P or R — it only ever sees what step chooses to reveal. That asymmetry is what makes the problem reinforcement learning and not optimization.
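
To make the mapping concrete, here is a minimal sketch of an environment that honors the contract. It is an illustration under stated assumptions, not the Gym/Gymnasium API: the class name Corridor, the corridor length of 5, the action coding (0 = left, 1 = right) and the helper run_episode are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Corridor:
    """Hypothetical 1-D corridor: cells 0..length-1, goal at the right end."""
    length: int = 5        # assumed size, chosen for illustration
    step_limit: int = 6    # external cutoff that triggers truncation

    def reset(self):
        # Sample s0 (here always the left end) and return its observation.
        self.pos = 0       # true state, owned by the environment
        self.t = 0         # steps taken in this episode
        return self.pos    # observation equals state in this toy case

    def step(self, action):
        # Call reset() before the first step().
        # Dynamics P: deterministic move, clipped at the corridor walls.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.t += 1
        # Reward R and the two separate end-conditions.
        terminated = self.pos == self.length - 1
        reward = 1.0 if terminated else 0.0
        truncated = (not terminated) and self.t >= self.step_limit
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Drive one episode through the reset/step contract."""
    obs = env.reset()
    done = False
    total = 0.0
    terminated = truncated = False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total, terminated, truncated
```

Note that the policy only ever receives what step returns. It has no handle on pos, t, or the reward rule, which is the asymmetry described above.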

One subtlety the contract hides: step samples the next state from P(·|s,a). In general the same action in the same state can land you in different places — wheels slip, a card is drawn, an opponent moves. Our corridor below is deterministic for clarity, but most environments are stochastic, which is exactly why the agent maximizes expected return (lesson 01) and why two runs of one policy rarely match.
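
A small sketch of what "sampling from P(·|s,a)" can look like. The slip probability of 0.2, the corridor length, and the function name slippery_step are assumptions made up for this example; the lesson's own corridor stays deterministic.

```python
import random


def slippery_step(pos, action, length=5, slip=0.2, rng=random):
    """Sample s' ~ P(.|s,a): with probability `slip` the move is reversed."""
    move = 1 if action == 1 else -1
    if rng.random() < slip:
        move = -move                      # the wheels slipped
    return min(max(pos + move, 0), length - 1)


# Same state (cell 2), same action (right), many samples.
rng = random.Random(0)                    # seeded so the run is repeatable
outcomes = [slippery_step(2, 1, rng=rng) for _ in range(1000)]
counts = {s: outcomes.count(s) for s in sorted(set(outcomes))}
print(counts)                             # most samples land on 3, some on 1
```

Because the next state is a draw and not a fixed function of (s, a), a single rollout says little; it is the average over draws that the agent optimizes.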

Two type signatures: the action and observation spaces

Before you can call step(a) you must know what an a is, and what shape the observation comes back in. Those two types — the action space 𝒜 and the observation space 𝒪 — are the first thing you read off any environment, because the action space decides which family of algorithms you are even allowed to use.

Space	Looks like	Example env	Which methods fit
Discrete(n)	one of n choices: {0,1,…,n−1}	gridworld moves, Atari buttons, "buy/hold/sell"	value-based (you can take argmax_a Q(s,a)); lesson 05
Box (continuous)	a real vector, e.g. [−1,1]^k	joint torques, steering, portfolio weights	policy-based / actor-critic (argmax is intractable); lesson 18
MultiDiscrete / Dict	several discrete or named sub-spaces	strategy games, tool-calling agents	factored or structured policies
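
The first two rows of the table can be sketched as tiny types. These classes only borrow the names Discrete and Box for illustration; they are hypothetical stand-ins written for this lesson, and their fields and methods should not be read as any library's real signatures.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrete:
    """One of n choices: {0, 1, ..., n-1}."""
    n: int

    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n

    def sample(self, rng):
        return rng.randrange(self.n)


@dataclass(frozen=True)
class Box:
    """A real vector of length k with every entry in [low, high]."""
    low: float
    high: float
    k: int

    def contains(self, a):
        return len(a) == self.k and all(self.low <= x <= self.high for x in a)

    def sample(self, rng):
        return [rng.uniform(self.low, self.high) for _ in range(self.k)]


moves = Discrete(4)          # e.g. four gridworld moves
torques = Box(-1.0, 1.0, 3)  # e.g. three joint torques in [-1, 1]
```

A space is a type signature you can check against: moves.contains(a) answers "is this a legal action?" before the action ever reaches step.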

The action space is a fork in the road, not a detail

Value-based methods (lesson 05) need to compute max_a Q(s,a) at every step. That is trivial when 𝒜 is Discrete(4): compare four numbers and pick the largest. It is intractable in general when 𝒜 is a continuous torque vector, because there is no finite list to scan; you would have to run an optimization over a neural net's real-valued inputs at every step. That single fact is why continuous control (lesson 18) had to invent a different machine. So when you meet an environment, read its action space first: it has already narrowed your algorithm choices before you've written a line.
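
A rough sketch of the fork. The Q-values are made-up numbers, and the grid figures (11 points per dimension, 7 dimensions) are arbitrary assumptions chosen only to show how brute-force discretization scales.

```python
def greedy_action(q_values):
    """argmax_a Q(s, a) over a Discrete(n) space: one pass over n numbers."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def grid_candidates(points_per_dim, k):
    """Actions to score if a k-dim Box is brute-forced on a uniform grid."""
    return points_per_dim ** k


q_s = [0.1, 0.7, 0.3, 0.2]          # hypothetical Q(s, a) for Discrete(4)
best = greedy_action(q_s)           # index of the largest entry

coarse = grid_candidates(11, 7)     # 11**7 candidates, at every single step
```

The discrete case is a scan of four numbers. The gridded continuous case already needs millions of Q evaluations per step at a coarse resolution, and it still only approximates the true maximum.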

Episodes: where the loop stops — and the bug that hides here

Lesson 01 introduced a terminal state that ends the loop and pins its own value to 0. Real environments end episodes for two different reasons, and modern APIs return them as two separate flags because conflating them quietly corrupts learning:

terminated — the MDP itself reached an end: the goal, a win, a crash, death. This is a true terminal state; there is genuinely no future, so its value is 0.
truncated — an external cutoff fired: a step limit, a wall-clock timeout, the simulator was reset for batching. The MDP did not end — the pole was still balancing, the robot was still walking — you just stopped looking.

Why split hairs? Because of how value is learned. From lesson 01, the value of a state is the immediate reward plus the discounted value of where you go next: target = r + γ·V(s'). At a true terminal there is no "next," so the bootstrap term vanishes: target = r. At a truncation, the next state is perfectly real — you just aren't visiting it — so you must keep the bootstrap: target = r + γ·V(s').
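
The rule fits in one function. This is a minimal sketch, assuming a scalar estimate next_value for V(s') is already available; the function name value_target and its parameter names are hypothetical.

```python
def value_target(reward, next_value, terminated, gamma=0.9):
    """One-step target r + gamma * V(s'), with the bootstrap removed
    only when the MDP itself has ended."""
    if terminated:
        return reward                     # no next state: target = r
    return reward + gamma * next_value    # includes truncated transitions


# A truncated transition is not terminated, so it keeps its bootstrap.
at_goal = value_target(reward=1.0, next_value=0.0, terminated=True)
at_clock = value_target(reward=0.0, next_value=0.5, terminated=False)
```

The truncated flag is deliberately absent from the signature: it decides when to stop collecting, not what the target is.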

The bug is the lesson

Treat a truncation as a termination and you drop the γ·V(s') term on every time-limited episode. The agent is told that nothing follows the state where the clock ran out (that the world ends at the horizon), so it systematically under-values the future. It's a one-line mistake (done = terminated or truncated fed straight into the value target) that produces a subtly broken, timid policy and no error message. The widget below makes the dropped term visible.

Interactive · the environment contract, one step at a time

A 1-D corridor environment. reset() puts the agent at the left; each step moves it left or right; reaching the right end pays +1 and terminates; hitting the step limit truncates. Step through it and watch the contract tick: observation, reward, and the two end-flags. Then read the two value targets for the final transition — and flip "treat truncation as terminal" to watch the bootstrap term get thrown away exactly when the episode hit the clock, not the goal.

Corridor environment — reset / step, and the truncation bug

Goal at the right end pays +1 (terminated). The step limit truncates. γ = 0.9. The "value target" row shows the value the final state should take by lesson 01's recursion V(s)=r+γV(s′) — correct (keep the bootstrap on truncation) vs the bug (drop it). Here V(s′) is the optimal value of the truncation cell, i.e. the bootstrap that must not be thrown away. They agree on a real goal; they diverge on the clock.
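
For readers without the widget, the same comparison can be worked by hand. The sketch below assumes a corridor of 5 cells (the lesson does not state a length) and uses the closed form for the optimal value under these rules: a cell d ≥ 1 steps from the goal is worth γ^(d−1), because the +1 arrives on the step that reaches the goal. All function names are hypothetical.

```python
GAMMA = 0.9
LENGTH = 5   # assumed corridor size; the goal is cell LENGTH - 1


def optimal_value(cell):
    """V*(cell): 0 at the goal, else gamma^(d-1) for d steps to the goal."""
    steps_to_goal = (LENGTH - 1) - cell
    return 0.0 if steps_to_goal == 0 else GAMMA ** (steps_to_goal - 1)


def discounted_return(rewards):
    """G = sum over t of gamma^t * r_{t+1}."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards))


def final_target(reward, next_cell, terminated, truncated, bug=False):
    """Value target for the last transition of an episode.
    With bug=True a truncation is wrongly treated as terminal."""
    ended = terminated or (bug and truncated)
    if ended:
        return reward
    return reward + GAMMA * optimal_value(next_cell)


# Episode that reaches the goal: both versions agree.
goal_ok = final_target(1.0, 4, terminated=True, truncated=False)
goal_bug = final_target(1.0, 4, terminated=True, truncated=False, bug=True)

# Episode cut off by the clock while standing in cell 2: they diverge.
clock_ok = final_target(0.0, 2, terminated=False, truncated=True)
clock_bug = final_target(0.0, 2, terminated=False, truncated=True, bug=True)
```

On the clock, the correct target is 0 + 0.9 · V*(cell 2) = 0.9 · 0.9 ≈ 0.81, while the buggy target is 0: the whole remaining value of the corridor is thrown away.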

[Interactive widget. Controls: continuous action space; step limit T (shown at 6); treat truncation as terminal (the bug). Readouts: outcome; steps; return G (γ=0.9); final-state value target.]

Where this goes next

You now have the environment as a concrete object: a reset/step contract, an action space that pre-selects your algorithm, an observation space, and an episode boundary with two honest end-conditions. Two pieces of that object are deep enough to get their own lessons. The observation it hands back may not be the Markov state lesson 01 assumed — that is lesson 03. And the reward it returns is a design decision that can quietly break everything — that is lesson 04.

Takeaway

An environment is (𝒮,𝒜,P,R,γ) behind a two-call interface: reset()→o₀ and step(a)→(o',r,terminated,truncated,info). Read its action space first: discrete admits value-based argmax, continuous pushes you to policy-based or actor-critic methods. End episodes for the right reason: terminated kills the bootstrap (target=r), truncated keeps it (target=r+γV(s')). Conflating them under-values the future with no error message.