What the agent sees — observation, the Markov line, and POMDPs

Lesson 02's step() returned an observation, and slipped in a warning: it may not be the state. This lesson is that warning, in full. When the observation carries everything the future depends on, you have the clean MDP of lesson 01 and its algorithms apply as stated. When it doesn't, two situations that demand opposite actions can look identical, and no reflex, however well-trained, can be right in both. That failure has a name, a formal model, and three standard fixes.

The one idea

The Markov property (lesson 01) is a promise about the state: the future depends on the past only through the present. The agent, though, acts on the observation. If the observation is a sufficient statistic of the state, the promise holds and you have an MDP. If it throws away something the future depends on, the promise breaks: you have a partially observable MDP (POMDP), and a memoryless policy is no longer enough.

Observation is not state

Lesson 01 quietly assumed o_t = s_t: what you see is the whole truth. Reality rarely cooperates. A camera sees pixels, not the velocities behind them. A trading bot sees the last price, not the order book that will move it. A dialogue agent sees text, not the user's intent. The environment holds a true state s and reveals an observation o through an observation function Ω(o|s). The full object is a POMDP:

(𝒮, 𝒜, 𝒪, P, R, Ω, γ) — an MDP, plus an observation space 𝒪 and the lens Ω that maps hidden state to what you see

The MDP of lesson 01 is the special case o = s — the lens is a perfect mirror. Everything else is partial observability, and it is the common case, not the exception.

The Markov line: when is an observation "enough"?

An observation is Markov (a sufficient state) if knowing it makes the past irrelevant to the future: formally, if P(s_{t+1}, r_{t+1} | o_t, a_t) doesn't change when you also condition on the whole history. Cross that line and the Bellman equations of lesson 01 are well-defined on o; fall short of it and they are quietly solving the wrong problem.
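Written out in full, the condition compares two conditional distributions. This is a restatement of the sentence above, with the history spelled out term by term (the indexing of rewards as r_1, ..., r_t is an assumed convention):

```latex
P\bigl(s_{t+1}, r_{t+1} \mid o_t, a_t\bigr)
  \;=\;
P\bigl(s_{t+1}, r_{t+1} \mid o_0, a_0, r_1, o_1, a_1, \dots, r_t, o_t, a_t\bigr)
```

If the two sides agree for every history that can occur, the latest observation is all the policy needs. If some history changes the right-hand side, the observation has dropped information the future depends on.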

The classic shortfall is a missing derivative. Show a pole-balancing agent a single snapshot — angle but not angular velocity — and it cannot tell a pole falling left from one swinging back right; the two need opposite pushes but look the same. One frame is not Markov; the change between frames is the missing state. (This is exactly why Atari agents stack four frames — lesson 09.)
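A minimal sketch of the missing derivative, assuming angles in radians, negative meaning "tilted left", and a frame interval of 0.02 s (all three are illustrative choices, not values from any particular environment). Two situations share the same current angle, so a single frame cannot separate them; a finite difference over two frames recovers the sign of the angular velocity.

```python
DT = 0.02  # assumed time between consecutive frames, in seconds


def angular_velocity(prev_angle, angle, dt=DT):
    """Finite-difference estimate of angular velocity from two frames."""
    return (angle - prev_angle) / dt


# Hypothetical (previous angle, current angle) pairs.
falling_left = (-0.03, -0.05)    # tilting further left
swinging_right = (-0.07, -0.05)  # same current angle, moving back toward upright

# A single-frame observation keeps only the current angle.
single_a = falling_left[1]
single_b = swinging_right[1]

# Two stacked frames are enough to estimate the velocity.
vel_a = angular_velocity(*falling_left)
vel_b = angular_velocity(*swinging_right)

print(single_a == single_b)  # the snapshots are identical
print(vel_a < 0 < vel_b)     # the velocities point in opposite directions
```

The single-frame inputs are equal while the two-frame estimates have opposite signs, which is the information a policy needs to choose between the two pushes.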

Perceptual aliasing: the failure, precisely

Here is the sharp version of the problem. A memoryless policy is a function of the observation alone: π(a|o). So whenever two distinct states s₁ ≠ s₂ produce the same observation — they are aliased — the policy is forced to take the same action distribution in both. If the optimal actions in s₁ and s₂ differ, no memoryless policy can be optimal in both at once. The best a deterministic reflex can do is be right in one and wrong in the other.
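The argument can be checked mechanically when the state space is small. The sketch below assumes you can enumerate the states and already know an optimal action for each; the function name aliasing_conflicts and the toy states s1, s2, s3 are hypothetical. It groups states by observation and reports every observation whose states disagree on the optimal action.

```python
from collections import defaultdict


def aliasing_conflicts(states, observe, optimal_action):
    """Return {observation: {state: optimal action}} for every observation
    shared by states whose optimal actions differ."""
    groups = defaultdict(dict)
    for s in states:
        groups[observe[s]][s] = optimal_action[s]
    return {
        o: dict(by_state)
        for o, by_state in groups.items()
        if len(set(by_state.values())) > 1
    }


# Toy example: s1 and s2 look the same ("x") but need opposite actions.
observe = {"s1": "x", "s2": "x", "s3": "y"}
optimal_action = {"s1": "left", "s2": "right", "s3": "left"}

conflicts = aliasing_conflicts(observe.keys(), observe, optimal_action)
print(conflicts)
```

Any observation that appears in the result is one where a memoryless policy must be suboptimal in at least one of the listed states.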

Why this isn't fixed by more training

Aliasing is not a learning failure; it's an information failure. The signal the agent would need to choose correctly was destroyed by the observation function before learning began. Train forever on a non-Markov observation and you converge to the best reflex, which still takes the wrong action in at least one of the aliased states. The fix is never "train more"; it's "give the policy access to more of the state."

The three fixes — all of them rebuild a Markov state

Every remedy for partial observability does the same thing: reconstruct a sufficient statistic the policy can act on. They differ in how much they remember and how.

Frame stacking. Feed the last k observations as one input. The change across frames recovers velocities and direction. Cheap, fixed memory; the Atari/DQN default (lesson 09).
Recurrent policy (memory). An RNN/LSTM carries a hidden state h_t that summarizes the history so far, giving π(a | o_t, h_t). No fixed memory horizon in principle, learned end-to-end.
Belief state. Track a probability distribution b(s) over what the true state might be, updated by Bayes from each observation. The belief is Markov — planning on it is exact but expensive. The principled ideal the other two approximate.
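To make the belief-state fix concrete, here is a small Bayes filter. It is a sketch under stated assumptions: a hypothetical five-cell corridor with deterministic moves, and an observation that only reports whether a wall is adjacent on the left or right. The helper names predict and correct are illustrative.

```python
N = 5  # assumed number of cells, indexed 0..4


def observe(cell):
    """Observation = (wall on the left?, wall on the right?)."""
    return (int(cell == 0), int(cell == N - 1))


def move(cell, action):
    """Deterministic step; action is -1 (left) or +1 (right), clamped at walls."""
    return min(max(cell + action, 0), N - 1)


def predict(belief, action):
    """Push the belief through the transition model."""
    out = [0.0] * N
    for cell, prob in enumerate(belief):
        out[move(cell, action)] += prob
    return out


def correct(belief, obs):
    """Bayes correction: keep mass on cells consistent with obs, renormalize."""
    weights = [p if observe(c) == obs else 0.0 for c, p in enumerate(belief)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation is impossible under the current belief")
    return [w / total for w in weights]


belief0 = [1.0 / N] * N                         # start with no idea where we are
belief1 = correct(belief0, (0, 0))              # saw "no wall either side"
belief2 = correct(predict(belief1, +1), (0, 1)) # moved right, then saw a wall on the right

print(belief1)
print(belief2)
```

After the first observation the belief is spread evenly over the three interior cells. One action and one more observation collapse it onto a single cell, which is the sense in which history rebuilds the state a single observation hides.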

Interactive · the aliased corridor

Five cells, numbered 0 to 4; the +1 goal sits in the middle (cell 2). The agent doesn't see its position. It sees only whether a wall is immediately to its left or right. Cells 1 and 3 both report "no wall either side": they are aliased (amber), yet cell 1 must go right to reach the goal and cell 3 must go left. Run episodes from a random start and watch a memoryless agent, forced to pick one action for that identical observation, succeed from one side and bounce away from the goal on the other until the step cap ends the episode. Then switch it to act on the true state (the Markov fix) and watch it solve every start.

Mirrored corridor — a memoryless reflex on a POMDP

Observation = (wall left?, wall right?). Cells 1 & 3 both look like (0,0) but need opposite actions. "Observation only" = memoryless reflex; "true state" = an agent handed the Markov state directly — an upper bound on what frame-stacking / memory / belief recover from history. Goal in the middle pays +1; step cap 20.
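The setup can be reproduced in a few lines. This is a sketch, not the widget's actual code: it assumes starts are drawn uniformly from the four non-goal cells, that the reflex maps the aliased observation (0,0) to "right", and that 200 episodes are run with a fixed seed. The names reflex, markov, and success_rate are illustrative.

```python
import random

N_CELLS, GOAL, STEP_CAP = 5, 2, 20
LEFT, RIGHT = -1, +1


def observe(cell):
    """Observation = (wall on the left?, wall on the right?)."""
    return (int(cell == 0), int(cell == N_CELLS - 1))


def run_episode(policy, start, use_true_state):
    """Return True if the agent reaches the goal within the step cap."""
    cell = start
    for _ in range(STEP_CAP):
        if cell == GOAL:
            return True
        agent_input = cell if use_true_state else observe(cell)
        cell = min(max(cell + policy(agent_input), 0), N_CELLS - 1)
    return cell == GOAL


def reflex(obs):
    """Memoryless policy on the observation; (0,0) is assumed to map to RIGHT."""
    _wall_left, wall_right = obs
    return LEFT if wall_right else RIGHT


def markov(cell):
    """Policy on the true state: always head toward the goal."""
    return RIGHT if cell < GOAL else LEFT


def success_rate(policy, use_true_state, episodes=200, seed=0):
    rng = random.Random(seed)
    starts = [c for c in range(N_CELLS) if c != GOAL]  # assumed start distribution
    wins = sum(
        run_episode(policy, rng.choice(starts), use_true_state)
        for _ in range(episodes)
    )
    return wins / episodes


print("observation only:", success_rate(reflex, use_true_state=False))
print("true state:      ", success_rate(markov, use_true_state=True))
```

Under these assumptions the reflex solves starts 0 and 1 and fails from 3 and 4, so its success rate hovers around one half, while the true-state policy solves every start.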

The bug is the lesson

The deterministic memoryless agent tops out near 50%, and no amount of training moves it, because cells 1 and 3 are genuinely indistinguishable to it. Switch to the Markov agent and it hits 100% with the same policy logic. The gap is pure information: partial observability, not a bad learner. Every "my agent plateaus for no reason" story is worth checking against this.

Where this goes next

Observability decides whether the observation the environment hands back is something an algorithm can soundly learn on. The track returns to it whenever the world is genuinely hidden — financial trading as a POMDP (lesson 59), and recurrent/memory-based agents throughout. The last piece of the environment object is the number it returns alongside the observation: the reward, and the surprising ways it breaks — lesson 04.

Takeaway

The agent acts on the observation, not the state. If the observation is a sufficient statistic, you have an MDP and lesson 01's machinery just works. If it isn't, you have a POMDP: aliased states force one action where two are needed, and a memoryless policy is provably limited — no extra training helps. The fixes — frame stacking, recurrent memory, belief states — all rebuild a Markov state for the policy to stand on.