RL overview — the MDP & the agent–environment loop

One scalar of feedback, a loop that runs forever, and the single recursive equation that ties a future you can't see to a number you can compute. This lesson defines the symbols the other 31 reuse.

Start from the only thing that's different

Supervised learning hands you the answer. For every input there is a label, the gradient points straight at it, and learning is "make the output match." Reinforcement learning takes that crutch away. You act, and the world replies with a single number — a reward — that says better or worse, never here is what you should have done.

That one swap — evaluative feedback instead of instructive feedback — is the entire subject. It is why RL is hard, and it forces three problems that supervised learning never has to face:

Exploration. You only see the reward of the action you took. To know whether something else is better, you must try it.
Credit assignment. Reward often arrives late — a chess game is won at move 40, not 12. Which earlier actions deserve the credit?
Acting vs. learning. Every step you spend exploring is a step you didn't spend exploiting what you already know.

The throughline of the whole course

RL = learning to act from evaluative feedback. Hold onto that. Every method in the next 31 lessons is a different machine for turning a stream of scalar rewards into good behavior — and they split into exactly two families, which we name at the end of this page.

The loop

The world and the learner take turns. The learner is the agent; everything else is the environment. At each timestep t the agent sees a state, picks an action, and the environment answers with a reward and the next state. Then it repeats.

         action a_t
   ┌───────────────────────►┌─────────────┐
┌──┴──┐                      │             │
│agent│                      │ environment │
└──┬──┘                      │             │
   ◄───────────────────────┐└──────┬──────┘
     state s_{t+1},  reward r_{t+1}  │
   (and the loop runs again from s_{t+1})

That picture is the agent–environment loop, and it is all of RL's mechanics. Pseudocode:

observe s_0
for t = 0, 1, 2, ...:
    a_t  ~ π(·|s_t)              # the policy chooses an action
    r_{t+1}, s_{t+1} ~ env       # environment returns reward and next state
    # ... learn from (s_t, a_t, r_{t+1}, s_{t+1}) ...

The five symbols (used identically across the course)

Everything below is reused by lessons 02–32. Learn the notation once here.

Core vocabulary

s ∈ 𝒮 — the state. What the agent observes; enough to decide.
a ∈ 𝒜 — the action. What the agent does.
r — the reward. The scalar the environment returns. We write the expected reward of taking a in s as R(s,a).
P(s′ | s, a) — the transition. The probability the environment lands in s′ after a in s. Together (𝒮, 𝒜, P, R, γ) is a Markov decision process (MDP).
γ ∈ [0, 1) — the discount. How much a reward one step later is worth now. This is the knob the widget below lets you drag.

The Markov property

The "Markov" in MDP is a promise about the state: the future depends on the past only through the present state. Formally,

P(s_t+1 | s_t, a_t, s_t−1, a_t−1, …, s₀) = P(s_t+1 | s_t, a_t)

If that holds, the agent never needs to remember history — the current state is a sufficient statistic. This is what makes the recursive Bellman equation below even possible. When it fails (you can't see everything that matters), you have a partially observable MDP; we'll meet that later (financial trading, lesson 24). For now: a good state is a Markov state.

Trajectory, return, and why we discount

Run the loop and you produce a trajectory τ = (s₀, a₀, r₁, s₁, a₁, r₂, …). The agent does not care about any single reward; it cares about the total it can collect from now on. That total, discounted, is the return:

G_t = r_t+1 + γ r_t+2 + γ² r_t+3 + ⋯ = ∑_k=0^∞ γ^k r_t+k+1

Why discount at all? Two reasons, one mathematical and one human. Mathematically, if rewards never stop and γ < 1, the geometric weighting keeps the sum finite: a bounded reward |r| ≤ r_max gives |G_t| ≤ r_max / (1 − γ). Push γ → 1 in a loop that never ends and that bound explodes. Intuitively, γ sets a horizon: it is roughly how many steps into the future the agent "sees." Small γ is myopic; large γ is far-sighted. The widget makes this concrete — and makes the explosion visible.

Policy

The agent's behavior is its policy: a (possibly random) rule for choosing actions from states.

π(a | s) = Probability the agent takes action a in state s

A policy can be deterministic (always the same action) or stochastic. When we later learn it as a neural net with weights θ, we write π_θ. The agent's entire job is to find a good policy — and "good" needs a number.

Value: turning "good policy" into a number

How good is a policy? Run it and measure the return it earns. Because returns are random (the policy and the environment both flip coins), we take the expectation. Two value functions do this — one keyed on a state, one on a state–action pair.

The state-value V^π(s) — expected return if you start in s and follow π forever:

V^π(s) = 𝔼_π[ G_t | s_t = s ]

The action-value Q^π(s,a) — expected return if you start in s, take a right now (even if π wouldn't have), then follow π:

Q^π(s,a) = 𝔼_π[ G_t | s_t = s, a_t = a ]

They are linked by the obvious averaging step — V is Q averaged over the actions the policy would pick: V^π(s) = ∑_a π(a|s) Q^π(s,a). Keep Q in mind: knowing Q^π lets you compare actions, which is the seed of the value-based family.

The goal: the optimal policy π*

Among all policies there is at least one that is best in every state at once — the optimal policy π*, with value functions V* and Q* that no other policy beats:

V*(s) = max_π V^π(s) for every s. Solving the MDP = finding π*.

That is the entire objective of reinforcement learning, stated once. Every algorithm in this course is a route to π*.

The Bellman expectation equation

The return is an infinite sum reaching into a future you can't see. The escape is a self-reference: the return from now is the immediate reward plus the discounted return from the next state. Taking expectations of G_t = r_t+1 + γ G_t+1 gives the Bellman expectation equation:

V^π(s) = ∑_a π(a|s) [ R(s,a) + γ ∑_s′ P(s′|s,a) V^π(s′) ]

and the same idea for actions:

Q^π(s,a) = R(s,a) + γ ∑_s′ P(s′|s,a) ∑_a′ π(a′|s′) Q^π(s′,a′)

This is the workhorse of the whole field. It turns an infinite-horizon problem into a one-step consistency condition that value must satisfy: the value of where I am must equal what I get now plus the discounted value of where I go next. Lesson 02 swaps the ∑_a π(a|s) averaging for a max_a — the Bellman optimality equation — which removes the policy from the equation entirely and lets us learn Q* directly. That single change is the next lesson.

RL vs. supervised learning, in one line each

Supervised: a teacher gives the right answer; minimize the gap to it; one example never affects which examples you see next. RL: a critic gives a score; maximize expected return; your actions choose your own future data. That feedback loop — your policy shapes the distribution it then learns from — is the source of both RL's power and its instabilities.

Where this goes next (systems)

The sibling course, RL Post-Training, casts an LLM as exactly this loop: the "state" is the prompt-so-far, an "action" is the next token, the "policy" is the language model, and the "reward" is a verifier or preference model at the end. Same MDP, tokens for actions. We hand off to it explicitly in lessons 11–13.

The fork — the spine of everything that follows

There are exactly two ways to extract a good policy from an MDP, and they organize the entire course.

(A) Value-based learns a value function (the Q^π above) and lets the policy fall out of it: in any state, take the action with the highest value. (B) Policy-based skips the value and learns the policy π_θ directly, adjusting θ to push up the return. The two keep reconverging — Actor-Critic, GAE, PPO are all "a learned policy with a value function critiquing it." Lesson 02 takes branch (A); lesson 03 takes branch (B).

Interactive · discounting in a 5×5 gridworld

A robot lives on a 5×5 grid. The green cell is a +1 goal; every other step pays 0. We run policy evaluation (sweeping the Bellman expectation equation above) under a uniform-random policy and shade each cell by its value V^π(s). Drag γ and watch value propagate out from the goal: small γ keeps it local (myopic), large γ spreads it across the whole grid (far-sighted). The number on each cell is its value; the bar shows the effective horizon 1/(1−γ).

Gridworld value under a uniform-random policy — drag γ

Each sweep applies the Bellman expectation update V(s) ← Σ_a π(a|s)[R + γ Σ V(s′)] once per cell. The goal pays +1. Watch value bleed outward as γ grows. Then flip on "looping (non-episodic)" and push γ toward 1 — the bug is the lesson.

γ: 0.900 looping (non-episodic)

Sweeps

Effective horizon 1/(1−γ)

10.0

Max |V|

0.00

Status

stable

Show the core update (≈20 lines)

// One Bellman-expectation sweep, uniform-random policy over 4 moves.
function sweep(V, gamma, loop) {
  const next = V.slice();
  for (let s = 0; s < 25; s++) {
    if (!loop && s === GOAL) { next[s] = 0; continue; } // terminal: absorbs
    let acc = 0;
    for (const a of [UP, DOWN, LEFT, RIGHT]) {
      const sp = move(s, a);          // next cell (clamped at walls)
      const r  = (sp === GOAL) ? 1 : 0;
      acc += 0.25 * (r + gamma * V[sp]);   // π(a|s)=1/4
    }
    next[s] = acc;
  }
  return next;   // with loop=true & gamma→1 this never stops growing
}

Now the misbehavior. In the episodic setting the goal is terminal — reaching it ends the episode, so its value is pinned at 0 and the sweeps settle to a finite answer for any γ < 1. Tick looping (non-episodic) and the goal stops being terminal: you collect +1, get bounced back onto the grid, and can collect it again, forever. Now push γ toward 1 and run sweeps — the values diverge, climbing without bound. There is no finite return to converge to, because ∑ γ^k·1 → ∞ as γ → 1. That is the discount earning its keep: it is the only thing keeping an infinite future finite.

The bug is the lesson

γ isn't a cosmetic knob. In a non-terminating loop, γ = 1 makes the return ∑_k γ^k r a divergent series and V blows up — the Bellman fixed point ceases to exist. Either keep γ < 1, or guarantee the episode ends (a terminal/absorbing state). Many "my values exploded" bugs in real RL are exactly this.

Takeaway

An MDP is (𝒮, 𝒜, P, R, γ): states, actions, transitions, rewards, and a discount that keeps an infinite future finite. The agent runs the loop, earns a discounted return G_t = ∑ γ^k r_t+k+1, and seeks the policy π* that maximizes expected return. Value functions V^π and Q^π score a policy and obey the recursive Bellman expectation equation. From here the road forks: learn values and act greedily (value-based, lesson 02), or learn the policy directly (policy-based, lesson 03). Everything else is a variation on that fork.