What the agent sees — observation, the Markov line, and POMDPs
Lesson 02's step() returned an observation and slipped in a warning: it may not be the state. This lesson is that warning in full. When the observation carries everything the future depends on, you have the clean MDP of lesson 01, and the algorithms built on it apply as stated. When it doesn't, two situations that demand opposite actions can look identical, and no reflex, however well-trained, can be right in both. That failure has a name, a formal model, and three standard fixes.
Observation is not state
Lesson 01 quietly assumed o_t = s_t: what you see is the whole truth. Reality rarely cooperates. A camera sees pixels, not the velocities behind them. A trading bot sees the last price, not the order book that will move it. A dialogue agent sees text, not the user's intent. The environment holds a true state s and reveals an observation o through an observation function Ω(o|s). The full object is a POMDP (partially observable Markov decision process): an MDP extended with an observation space and the observation function Ω.
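In tuple form, one common way to write the POMDP is sketched below. Only s, o, and Ω come from the lesson; the remaining symbols (the calligraphic sets, P, R, γ) are the usual textbook names and are assumed here rather than taken from the original.

```latex
\[
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, P, R, \Omega, \gamma \rangle,
\qquad
s_{t+1} \sim P(\cdot \mid s_t, a_t),
\qquad
o_t \sim \Omega(\cdot \mid s_t).
\]
```

The agent chooses a_t from what it has observed; it never reads s_t directly.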
The MDP of lesson 01 is the special case o = s — the lens is a perfect mirror. Everything else is partial observability, and it is the common case, not the exception.
The Markov line: when is an observation "enough"?
An observation is Markov (a sufficient state) if knowing it makes the past irrelevant to the future. Formally, P(s_{t+1}, r_{t+1} | o_t, a_t) doesn't change when you also condition on the whole history. Cross that line and the Bellman equations of lesson 01 are well-defined on o; fall short of it and they are quietly solving the wrong problem.
The classic shortfall is a missing derivative. Show a pole-balancing agent a single snapshot — angle but not angular velocity — and it cannot tell a pole falling left from one swinging back right; the two need opposite pushes but look the same. One frame is not Markov; the change between frames is the missing state. (This is exactly why Atari agents stack four frames — lesson 09.)
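The missing derivative can be made concrete with a minimal sketch. The numbers, the time step DT, and the function names below are illustrative assumptions, not part of any particular pole-balancing environment: two states share an angle but have opposite angular velocities, so a single frame cannot separate them, while two stacked frames can.

```python
DT = 0.02  # assumed time step in seconds (hypothetical value)

def observe(angle, angular_velocity):
    """Single-frame observation: the angle only; the velocity stays hidden."""
    return angle

def stacked_observation(prev_angle, angle):
    """Two stacked frames: current angle plus a finite-difference velocity estimate."""
    return (angle, (angle - prev_angle) / DT)

# Two true states: same angle (rad), opposite angular velocity (rad/s).
state_a = (0.05, -1.0)
state_b = (0.05, +1.0)

single_a = observe(*state_a)
single_b = observe(*state_b)

# The angle one step earlier, implied by each velocity.
prev_a = state_a[0] - state_a[1] * DT
prev_b = state_b[0] - state_b[1] * DT
stacked_a = stacked_observation(prev_a, state_a[0])
stacked_b = stacked_observation(prev_b, state_b[0])

print(single_a == single_b)    # the two snapshots coincide
print(stacked_a == stacked_b)  # the stacked pairs do not
```

The finite difference across two frames recovers the sign of the velocity, which is the information the single snapshot lacks.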
Perceptual aliasing: the failure, precisely
Here is the sharp version of the problem. A memoryless policy is a function of the observation alone: π(a|o). So whenever two distinct states s_1 ≠ s_2 produce the same observation (they are aliased), the policy is forced to take the same action distribution in both. If the optimal actions in s_1 and s_2 differ, no memoryless policy can be optimal in both at once. The best a deterministic reflex can do is be right in one and wrong in the other.
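The argument above can be written as one line of algebra. The symbol \bar{\pi} (the action distribution the policy induces in a given true state) is notation introduced here for illustration; Ω and π are the lesson's.

```latex
\[
\bar{\pi}(a \mid s) = \sum_{o} \Omega(o \mid s)\, \pi(a \mid o)
\quad\Longrightarrow\quad
\Omega(\cdot \mid s_1) = \Omega(\cdot \mid s_2)
\;\text{ implies }\;
\bar{\pi}(a \mid s_1) = \bar{\pi}(a \mid s_2) \;\; \text{for all } a.
\]
```

Whatever π is, the two aliased states receive the same behavior, so the policy cannot act differently in them.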
The three fixes — all of them rebuild a Markov state
Every remedy for partial observability does the same thing: reconstruct a sufficient statistic the policy can act on. They differ in how much they remember and how.
- Frame stacking. Feed the last k observations as one input. The change across frames recovers velocities and direction. Cheap, fixed memory; the Atari/DQN default (lesson 09).
- Recurrent policy (memory). An RNN/LSTM carries a hidden state h_t that summarizes the whole history: π(a | o_t, h_t). In principle the memory horizon is unbounded, and the summary is learned end-to-end.
- Belief state. Track a probability distribution b(s) over what the true state might be, updated by Bayes' rule after each action and observation. The belief is Markov, so planning on it is exact but expensive. It is the principled ideal the other two approximate.
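The belief-state fix is the easiest of the three to show in a few lines. What follows is a minimal sketch of a discrete Bayes filter, assuming small tabular models; the function name, the argument layout, and the two-state example with an 80%-accurate sensor are all hypothetical choices made for illustration.

```python
def belief_update(belief, action, obs, transition, obs_model):
    """One Bayes-filter step: predict through the dynamics, then correct by the observation.

    belief:     belief[s] = P(state = s) before acting
    transition: transition[action][s][s2] = P(s2 | s, action)
    obs_model:  obs_model[s2][obs] = P(obs | s2), playing the role of Omega(o|s)
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[action][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correct: weight each state by how well it explains the observation.
    unnormalized = [obs_model[s2][obs] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]

# Hypothetical example: two states, one action that leaves the state unchanged,
# and a sensor that reports the true state with probability 0.8.
transition = {0: [[1.0, 0.0], [0.0, 1.0]]}
obs_model = [[0.8, 0.2], [0.2, 0.8]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, obs=0, transition=transition, obs_model=obs_model)
b2 = belief_update(b1, action=0, obs=0, transition=transition, obs_model=obs_model)
print(b1)
print(b2)
```

Each matching observation pushes more probability onto state 0. The belief vector itself is what a planner would treat as the state.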
Interactive · the aliased corridor
Five cells; the +1 goal sits in the middle (cell 2). The agent doesn't see its position — it sees only whether a wall is immediately to its left or right. Cells 1 and 3 both report "no wall either side": they are aliased (amber), yet cell 1 must go right to reach the goal and cell 3 must go left. Run episodes from a random start and watch a memoryless agent — forced to pick one action for that identical observation — succeed from one side and loop forever on the other. Then switch it to act on the true state (the Markov fix) and watch it solve every start.
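The corridor is small enough to simulate directly. The sketch below is a stand-in for the interactive demo, not its actual code: the names, the step cap max_steps, and the choice to try every non-goal start instead of sampling one at random are assumptions made for illustration.

```python
N_CELLS, GOAL = 5, 2
LEFT, RIGHT = -1, +1

def observe(cell):
    """Return (wall_on_left, wall_on_right); cells 1, 2 and 3 all give (False, False)."""
    return (cell == 0, cell == N_CELLS - 1)

def run_episode(policy, start, use_true_state=False, max_steps=20):
    """Return True if the agent reaches the goal within max_steps moves."""
    cell = start
    for _ in range(max_steps):
        if cell == GOAL:
            return True
        key = cell if use_true_state else observe(cell)
        # Move one cell, clamped at the corridor ends.
        cell = min(max(cell + policy[key], 0), N_CELLS - 1)
    return cell == GOAL

# Memoryless reflex: one action per observation, so "no wall" gets a single choice.
memoryless = {(True, False): RIGHT, (False, True): LEFT, (False, False): RIGHT}
# State-based policy: one action per cell (the goal cell's entry is never used).
state_based = {0: RIGHT, 1: RIGHT, 2: RIGHT, 3: LEFT, 4: LEFT}

starts = [0, 1, 3, 4]
memoryless_results = {s: run_episode(memoryless, s) for s in starts}
state_results = {s: run_episode(state_based, s, use_true_state=True) for s in starts}
print(memoryless_results)  # {0: True, 1: True, 3: False, 4: False}
print(state_results)       # {0: True, 1: True, 3: True, 4: True}
```

With "no wall" mapped to RIGHT, starts on the left reach the goal while starts on the right bounce between cells 3 and 4 until the step cap. Mapping it to LEFT mirrors the failure instead of removing it.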
Where this goes next
Observability decides whether the observation the environment hands back is something an algorithm can soundly learn on. The track returns to it whenever the world is genuinely hidden — financial trading as a POMDP (lesson 59), and recurrent/memory-based agents throughout. The last piece of the environment object is the number it returns alongside the observation: the reward, and the surprising ways it breaks — lesson 04.