rl_foundations / lessons / 01 · rl overview — the MDP lesson 01 / 32

RL overview — the MDP & the agent–environment loop

One scalar of feedback, a loop that runs forever, and the single recursive equation that ties a future you can't see to a number you can compute. This lesson defines the symbols the other 31 reuse.

Start from the only thing that's different

Supervised learning hands you the answer. For every input there is a label, the gradient points straight at it, and learning is "make the output match." Reinforcement learning takes that crutch away. You act, and the world replies with a single number — a reward — that says better or worse, never here is what you should have done.

That one swap — evaluative feedback instead of instructive feedback — is the entire subject. It is why RL is hard, and it forces three problems that supervised learning never has to face:

The throughline of the whole course
RL = learning to act from evaluative feedback. Hold onto that. Every method in the next 31 lessons is a different machine for turning a stream of scalar rewards into good behavior — and they split into exactly two families, which we name at the end of this page.

The loop

The world and the learner take turns. The learner is the agent; everything else is the environment. At each timestep t the agent sees a state, picks an action, and the environment answers with a reward and the next state. Then it repeats.

         action a_t
   ┌───────────────────────►┌─────────────┐
┌──┴──┐                      │             │
│agent│                      │ environment │
└──┬──┘                      │             │
   ◄───────────────────────┐└──────┬──────┘
     state s_{t+1},  reward r_{t+1}  │
   (and the loop runs again from s_{t+1})

That picture is the agent–environment loop, and it is all of RL's mechanics. Pseudocode:

observe s_0
for t = 0, 1, 2, ...:
    a_t  ~ π(·|s_t)              # the policy chooses an action
    r_{t+1}, s_{t+1} ~ env       # environment returns reward and next state
    # ... learn from (s_t, a_t, r_{t+1}, s_{t+1}) ...

The five symbols (used identically across the course)

Everything below is reused by lessons 02–32. Learn the notation once here.

Core vocabulary

The Markov property

The "Markov" in MDP is a promise about the state: the future depends on the past only through the present state. Formally,

P(st+1 | st, at, st−1, at−1, …, s0) = P(st+1 | st, at)

If that holds, the agent never needs to remember history — the current state is a sufficient statistic. This is what makes the recursive Bellman equation below even possible. When it fails (you can't see everything that matters), you have a partially observable MDP; we'll meet that later (financial trading, lesson 24). For now: a good state is a Markov state.

Trajectory, return, and why we discount

Run the loop and you produce a trajectory τ = (s0, a0, r1, s1, a1, r2, …). The agent does not care about any single reward; it cares about the total it can collect from now on. That total, discounted, is the return:

Gt = rt+1 + γ rt+2 + γ2 rt+3 + ⋯ = ∑k=0 γk rt+k+1

Why discount at all? Two reasons, one mathematical and one human. Mathematically, if rewards never stop and γ < 1, the geometric weighting keeps the sum finite: a bounded reward |r| ≤ rmax gives |Gt| ≤ rmax / (1 − γ). Push γ → 1 in a loop that never ends and that bound explodes. Intuitively, γ sets a horizon: it is roughly how many steps into the future the agent "sees." Small γ is myopic; large γ is far-sighted. The widget makes this concrete — and makes the explosion visible.

Policy

The agent's behavior is its policy: a (possibly random) rule for choosing actions from states.

π(a | s) = Probability the agent takes action a in state s

A policy can be deterministic (always the same action) or stochastic. When we later learn it as a neural net with weights θ, we write πθ. The agent's entire job is to find a good policy — and "good" needs a number.

Value: turning "good policy" into a number

How good is a policy? Run it and measure the return it earns. Because returns are random (the policy and the environment both flip coins), we take the expectation. Two value functions do this — one keyed on a state, one on a state–action pair.

The state-value Vπ(s) — expected return if you start in s and follow π forever:

Vπ(s) = 𝔼π[ Gt | st = s ]

The action-value Qπ(s,a) — expected return if you start in s, take a right now (even if π wouldn't have), then follow π:

Qπ(s,a) = 𝔼π[ Gt | st = s, at = a ]

They are linked by the obvious averaging step — V is Q averaged over the actions the policy would pick: Vπ(s) = ∑a π(a|s) Qπ(s,a). Keep Q in mind: knowing Qπ lets you compare actions, which is the seed of the value-based family.

The goal: the optimal policy π*

Among all policies there is at least one that is best in every state at once — the optimal policy π*, with value functions V* and Q* that no other policy beats:

V*(s) = maxπ Vπ(s)   for every s.    Solving the MDP = finding π*.

That is the entire objective of reinforcement learning, stated once. Every algorithm in this course is a route to π*.

The Bellman expectation equation

The return is an infinite sum reaching into a future you can't see. The escape is a self-reference: the return from now is the immediate reward plus the discounted return from the next state. Taking expectations of Gt = rt+1 + γ Gt+1 gives the Bellman expectation equation:

Vπ(s) = ∑a π(a|s) [ R(s,a) + γ ∑s′ P(s′|s,a) Vπ(s′) ]

and the same idea for actions:

Qπ(s,a) = R(s,a) + γ ∑s′ P(s′|s,a) ∑a′ π(a′|s′) Qπ(s′,a′)

This is the workhorse of the whole field. It turns an infinite-horizon problem into a one-step consistency condition that value must satisfy: the value of where I am must equal what I get now plus the discounted value of where I go next. Lesson 02 swaps the a π(a|s) averaging for a maxa — the Bellman optimality equation — which removes the policy from the equation entirely and lets us learn Q* directly. That single change is the next lesson.

RL vs. supervised learning, in one line each
Supervised: a teacher gives the right answer; minimize the gap to it; one example never affects which examples you see next. RL: a critic gives a score; maximize expected return; your actions choose your own future data. That feedback loop — your policy shapes the distribution it then learns from — is the source of both RL's power and its instabilities.
Where this goes next (systems)
The sibling course, RL Post-Training, casts an LLM as exactly this loop: the "state" is the prompt-so-far, an "action" is the next token, the "policy" is the language model, and the "reward" is a verifier or preference model at the end. Same MDP, tokens for actions. We hand off to it explicitly in lessons 11–13.

The fork — the spine of everything that follows

There are exactly two ways to extract a good policy from an MDP, and they organize the entire course.

(A) VALUE-BASED Learn Q(s,a): how good is each action? Then act greedily: π(s) = argmax_a Q. Q-learning → DQN — lesson 02 (B) POLICY-BASED Learn π_θ(a|s) directly: what to do. Nudge θ toward high-return actions. policy gradient → Actor-Critic — lesson 03 ACTOR–CRITIC — the reunion a policy that learns + a value that critiques it

(A) Value-based learns a value function (the Qπ above) and lets the policy fall out of it: in any state, take the action with the highest value. (B) Policy-based skips the value and learns the policy πθ directly, adjusting θ to push up the return. The two keep reconverging — Actor-Critic, GAE, PPO are all "a learned policy with a value function critiquing it." Lesson 02 takes branch (A); lesson 03 takes branch (B).

Interactive · discounting in a 5×5 gridworld

A robot lives on a 5×5 grid. The green cell is a +1 goal; every other step pays 0. We run policy evaluation (sweeping the Bellman expectation equation above) under a uniform-random policy and shade each cell by its value Vπ(s). Drag γ and watch value propagate out from the goal: small γ keeps it local (myopic), large γ spreads it across the whole grid (far-sighted). The number on each cell is its value; the bar shows the effective horizon 1/(1−γ).

Gridworld value under a uniform-random policy — drag γ
Each sweep applies the Bellman expectation update V(s) ← Σa π(a|s)[R + γ Σ V(s′)] once per cell. The goal pays +1. Watch value bleed outward as γ grows. Then flip on "looping (non-episodic)" and push γ toward 1 — the bug is the lesson.
Sweeps
0
Effective horizon 1/(1−γ)
10.0
Max |V|
0.00
Status
stable
Show the core update (≈20 lines)
// One Bellman-expectation sweep, uniform-random policy over 4 moves.
function sweep(V, gamma, loop) {
  const next = V.slice();
  for (let s = 0; s < 25; s++) {
    if (!loop && s === GOAL) { next[s] = 0; continue; } // terminal: absorbs
    let acc = 0;
    for (const a of [UP, DOWN, LEFT, RIGHT]) {
      const sp = move(s, a);          // next cell (clamped at walls)
      const r  = (sp === GOAL) ? 1 : 0;
      acc += 0.25 * (r + gamma * V[sp]);   // π(a|s)=1/4
    }
    next[s] = acc;
  }
  return next;   // with loop=true & gamma→1 this never stops growing
}

Now the misbehavior. In the episodic setting the goal is terminal — reaching it ends the episode, so its value is pinned at 0 and the sweeps settle to a finite answer for any γ < 1. Tick looping (non-episodic) and the goal stops being terminal: you collect +1, get bounced back onto the grid, and can collect it again, forever. Now push γ toward 1 and run sweeps — the values diverge, climbing without bound. There is no finite return to converge to, because ∑ γk·1 → ∞ as γ → 1. That is the discount earning its keep: it is the only thing keeping an infinite future finite.

The bug is the lesson
γ isn't a cosmetic knob. In a non-terminating loop, γ = 1 makes the return k γk r a divergent series and V blows up — the Bellman fixed point ceases to exist. Either keep γ < 1, or guarantee the episode ends (a terminal/absorbing state). Many "my values exploded" bugs in real RL are exactly this.
Takeaway
An MDP is (𝒮, 𝒜, P, R, γ): states, actions, transitions, rewards, and a discount that keeps an infinite future finite. The agent runs the loop, earns a discounted return Gt = ∑ γk rt+k+1, and seeks the policy π* that maximizes expected return. Value functions Vπ and Qπ score a policy and obey the recursive Bellman expectation equation. From here the road forks: learn values and act greedily (value-based, lesson 02), or learn the policy directly (policy-based, lesson 03). Everything else is a variation on that fork.