The reward and the simulation boundary — shaping, hacking, and sim-to-real
The reward is the last piece of the environment object — and the most dangerous. It is not "the score"; it is your entire specification of what you want, compressed into one scalar per step. The agent will optimize exactly what you wrote, not what you meant. This lesson is the engineering of that scalar: dense vs sparse, the one provably-safe way to add hints, the way clever shaping turns into self-sabotage, and the gap between the simulator you train in and the world you deploy to.
Sparse vs dense reward
The first choice is how often the scalar is non-zero.
- Sparse — reward only at the outcome: +1 at the goal, 0 everywhere else. It is honest (it says exactly what you want and nothing more) but brutal to learn from: until the agent stumbles onto the goal by chance, every action looks equally worthless. This is the credit-assignment and exploration problem of lesson 01 at its sharpest (and why exploration gets its own lesson 08).
- Dense / shaped — a hint every step: + for moving toward the goal, − for wasted motion. Learning is far faster because the gradient of "better" is visible everywhere. But now you've written a second objective, and the agent will optimize it literally — which is where the trouble starts.
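To make the two options concrete, here is a minimal sketch of a sparse and a dense reward for the same task. The 4×4 grid, the `GOAL` position, and the weights `0.1` and `0.01` are all hypothetical choices made for illustration, not values from the lesson.

```python
from typing import Tuple

State = Tuple[int, int]  # (row, col) on a hypothetical 4x4 grid

GOAL: State = (3, 3)  # assumed goal position, for illustration only


def manhattan(a: State, b: State) -> int:
    """Grid distance between two cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sparse_reward(s: State, s_next: State) -> float:
    """+1 only on arriving at the goal, 0 everywhere else."""
    return 1.0 if s_next == GOAL else 0.0


def dense_reward(s: State, s_next: State, step_cost: float = 0.01) -> float:
    """Sparse outcome plus a hand-written hint on every step."""
    # progress is +1 if the step got closer, -1 if farther, 0 if unchanged
    progress = manhattan(s, GOAL) - manhattan(s_next, GOAL)
    return sparse_reward(s, s_next) + 0.1 * progress - step_cost


# Two steps toward the goal, then one step away from it.
path = [(0, 0), (0, 1), (1, 1), (1, 0)]
for s, s_next in zip(path, path[1:]):
    print(s, "->", s_next,
          "sparse:", sparse_reward(s, s_next),
          "dense:", round(dense_reward(s, s_next), 2))
```

The sparse signal is silent on all three steps, while the dense one already separates the good moves from the bad one. That extra information is exactly the second objective the agent will go on to optimize literally.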
The one safe hint: potential-based shaping
Can you add a dense hint without changing the task? Yes, and there is essentially one way. Ng, Harada & Russell (1999) proved that a shaping term built as the difference of a potential Φ(s) over states leaves the optimal policy exactly unchanged:
F(s, a, s′) = γ·Φ(s′) − Φ(s)
The agent is trained on the shaped reward R(s, a, s′) + F(s, a, s′) in place of R(s, a, s′).
The reason is exact, and it is not quite "the terms telescope to zero". Along any trajectory the discounted shaping terms telescope to −Φ(s₀), where s₀ is the state the trajectory starts from, rather than to zero. Adding this F therefore shifts every action's value at a state by the same amount: Q*_shaped(s, a) = Q*(s, a) − Φ(s). An offset that doesn't depend on the action can't change which action is the argmax, so the best policy is untouched for any Φ (in episodic tasks, take Φ = 0 at terminal states). Set Φ(s) = −distance(s, goal), which is 0 at the goal, and the agent gets a "warmer/colder" hint whose optimum is still, provably, the real goal. If you must shape, shape with a potential.
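The shift Q*_shaped(s, a) = Q*(s, a) − Φ(s) can be checked numerically. The sketch below builds a small random MDP (a continuing one with no terminal states, and with sizes, seed, and potential chosen arbitrarily for illustration), solves it with and without the shaping term γ·Φ(s′) − Φ(s), and compares the results. It is a sanity check under these assumptions, not a proof.

```python
import random


def q_value_iteration(P, R, gamma, sweeps=500):
    """Tabular Q-value iteration.

    P[s][a][t] is the probability of landing in t after action a in s,
    R[s][a][t] is the reward for that transition.
    """
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        V = [max(row) for row in Q]  # V(t) = max over a' of Q(t, a')
        Q = [[sum(P[s][a][t] * (R[s][a][t] + gamma * V[t]) for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


rng = random.Random(0)
n_s, n_a, gamma = 5, 3, 0.9  # arbitrary sizes for the toy MDP


def random_distribution(n):
    """A random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


P = [[random_distribution(n_s) for _ in range(n_a)] for _ in range(n_s)]
R = [[[rng.gauss(0, 1) for _ in range(n_s)] for _ in range(n_a)] for _ in range(n_s)]
phi = [rng.gauss(0, 1) for _ in range(n_s)]  # an arbitrary potential

# Shaped reward: R + F with F(s, a, t) = gamma * phi(t) - phi(s)
R_shaped = [[[R[s][a][t] + gamma * phi[t] - phi[s] for t in range(n_s)]
             for a in range(n_a)] for s in range(n_s)]

Q = q_value_iteration(P, R, gamma)
Q_shaped = q_value_iteration(P, R_shaped, gamma)

# Largest deviation from the predicted shift Q_shaped = Q - phi(s)
max_shift_error = max(abs(Q_shaped[s][a] - (Q[s][a] - phi[s]))
                      for s in range(n_s) for a in range(n_a))
greedy = [max(range(n_a), key=lambda a: Q[s][a]) for s in range(n_s)]
greedy_shaped = [max(range(n_a), key=lambda a: Q_shaped[s][a]) for s in range(n_s)]

print("max deviation from Q - phi:", max_shift_error)
print("same greedy policy:", greedy == greedy_shaped)
```

The deviation should sit at floating-point noise and the two greedy policies should match, whatever potential is drawn. Replacing the shaping term with anything that depends on the action would break both checks.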
Reward hacking: when the proxy wins
Now the failure. Add a dense bonus without the potential structure — say +0.5 every time the agent visits a "progress" tile — and you have created a reward the agent can farm. If looping on that tile pays more discounted return than the one-time +1 at the goal, the optimal policy is to loop forever and never finish. The agent isn't broken; it solved the problem you actually wrote.
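The comparison between "loop" and "finish" is plain geometric-series arithmetic. The sketch below works it out under simplifying assumptions that are not from the lesson: the agent either walks straight to the goal or farms the bonus forever, the bonus pays once every `period` steps, and the first reward of an episode is undiscounted. The step counts and γ in the example are hypothetical.

```python
def goal_return(steps_to_goal: int, gamma: float) -> float:
    """Discounted return of a one-time +1 received on step `steps_to_goal`."""
    return gamma ** (steps_to_goal - 1)


def farm_return(bonus: float, period: int, gamma: float) -> float:
    """Discounted return of collecting `bonus` every `period` steps forever.

    Payouts land on steps period, 2*period, ...; summing the geometric series
    gives bonus * gamma^(period-1) / (1 - gamma^period).
    """
    return bonus * gamma ** (period - 1) / (1.0 - gamma ** period)


def break_even_bonus(steps_to_goal: int, period: int, gamma: float) -> float:
    """Smallest bonus at which farming forever matches walking to the goal."""
    return (goal_return(steps_to_goal, gamma)
            * (1.0 - gamma ** period) / gamma ** (period - 1))


# Hypothetical numbers: goal 6 steps away, bonus tile re-entered every 2 steps.
gamma, steps, period = 0.95, 6, 2
print("finish:", round(goal_return(steps, gamma), 3))
print("farm +0.5:", round(farm_return(0.5, period, gamma), 3))
print("break-even bonus:", round(break_even_bonus(steps, period, gamma), 3))
```

Under these assumptions the break-even bonus is a small fraction of the goal reward, because the farm pays forever while the goal pays once. A bonus that looks like a modest nudge can already dominate the real objective.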
The simulation boundary
One more thing the environment hides: most environments are simulators, and the simulator is not the world. Three consequences you carry forward:
- The reality gap. A policy trained in sim exploits the sim's quirks (friction, contact, and latency modeled in ways the real world doesn't share) and degrades on deployment. Closing that gap (e.g. with domain randomization: randomize sim parameters so reality looks like just another variant) is the heart of robotics, lesson 57.
- Determinism & seeding. A reproducible environment is seeded: same seed → same episode. Without it you cannot tell a real improvement from luck, and bugs become unrepeatable. Seed your envs.
- Throughput. The environment, not the GPU, is often the bottleneck — RL needs millions of steps. Vectorizing many env copies (lesson 65) is how the data plane keeps up.
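Seeding and domain randomization can both be sketched with a toy simulator. Everything below is hypothetical: `ToySlipEnv` is a made-up one-dimensional corridor, `slip_prob` stands in for a physical parameter such as friction, and the `reset(seed=...)` signature only mirrors a common convention rather than any particular library's API.

```python
import random


class ToySlipEnv:
    """Hypothetical corridor: each step moves right unless the agent slips."""

    def __init__(self, length=10, slip_prob=0.2):
        self.length = length
        self.slip_prob = slip_prob  # stand-in for a sim parameter like friction
        self.rng = random.Random()
        self.pos = 0

    def reset(self, seed=None):
        # Re-creating the generator from the seed makes the episode repeatable.
        if seed is not None:
            self.rng = random.Random(seed)
        self.pos = 0
        return self.pos

    def step(self):
        if self.rng.random() >= self.slip_prob:
            self.pos += 1
        done = self.pos >= self.length
        return self.pos, (1.0 if done else 0.0), done


def rollout(env, seed, max_steps=200):
    """Run one episode and return the sequence of positions."""
    env.reset(seed=seed)
    trajectory = []
    for _ in range(max_steps):
        pos, _reward, done = env.step()
        trajectory.append(pos)
        if done:
            break
    return trajectory


def randomized_env(param_rng):
    """Domain randomization: draw the sim parameter afresh for each episode."""
    return ToySlipEnv(slip_prob=param_rng.uniform(0.0, 0.5))


env = ToySlipEnv()
print("same seed, same episode:", rollout(env, seed=7) == rollout(env, seed=7))

param_rng = random.Random(123)  # seeded too, so the randomization is repeatable
variants = [randomized_env(param_rng) for _ in range(3)]
print("sampled slip_prob values:", [round(v.slip_prob, 3) for v in variants])
```

Note that the parameter sampler is itself seeded. Domain randomization adds variety across episodes, but the experiment as a whole should still be reproducible from its seeds.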
Interactive · the reward-shaping sandbox
A robot must reach the green goal (bottom-right, +1, terminal). We solve each version of the environment for its optimal policy by sweeping lesson 01's Bellman recursion while taking the best action at each state (you'll meet this as value iteration in lesson 07), then read off the greedy policy (arrows) and roll it out from the start (blue path). Switch the shaping and watch the policy itself change. None and potential-based produce the identical path to the goal: the guarantee of the shaping theorem, seen in action rather than proved. Naive bonus drops a farmable +b on the amber tile; turn b up and watch the optimal policy abandon the goal to loop on the bonus forever. That is reward hacking, derived.
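For readers without the interactive widget, the same experiment can be approximated in a few lines. This is a sketch under assumptions of our own, not the sandbox's actual code: a deterministic 4×4 grid, γ = 0.9, the amber tile in the top-right corner, and a bonus paid on every step that ends on that tile (including bumping a wall while standing on it).

```python
GAMMA, N = 0.9, 4
START, GOAL, AMBER = (0, 0), (N - 1, N - 1), (0, N - 1)  # assumed layout
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def move(s, a):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s


def phi(s):
    """Potential: negative Manhattan distance to the goal (0 at the goal)."""
    return -(abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1]))


def reward(s, s2, mode, b):
    r = 1.0 if s2 == GOAL else 0.0
    if mode == "potential":
        r += GAMMA * phi(s2) - phi(s)  # potential-based shaping term
    elif mode == "naive" and s2 == AMBER:
        r += b                         # farmable bonus on the amber tile
    return r


def greedy_path(mode, b=0.0, sweeps=500, max_steps=20):
    """Value iteration, then a rollout of the greedy policy from START."""
    states = [(r, c) for r in range(N) for c in range(N)]
    V = {s: 0.0 for s in states}

    def q(s, a):
        s2 = move(s, a)
        return reward(s, s2, mode, b) + GAMMA * V[s2]

    for _ in range(sweeps):
        # The goal is terminal, so its value stays 0.
        V = {s: 0.0 if s == GOAL else max(q(s, a) for a in ACTIONS)
             for s in states}

    path, s = [START], START
    while s != GOAL and len(path) <= max_steps:
        # Rounding keeps tie-breaking stable against floating-point noise.
        best = max(ACTIONS, key=lambda a: round(q(s, a), 9))
        s = move(s, best)
        path.append(s)
    return path


print("none:     ", greedy_path("none"))
print("potential:", greedy_path("potential"))
print("naive b=0.01:", greedy_path("naive", b=0.01))
print("naive b=0.5: ", greedy_path("naive", b=0.5)[:8], "...")
```

With no shaping and with the potential, the rollouts coincide and end at the goal. A small bonus only bends the route through the amber tile, while a large one parks the agent there until the step limit cuts the rollout off.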
Where this goes next
You now hold the environment whole: an interface (01a), an observation that may hide the state (01b), and a reward that encodes — and can betray — your intent (here). That is the object every algorithm in the rest of the course interacts with. With it nailed down, lesson 05 finally trains an agent: learn the value function and read the policy off it. Reward hacking returns as a first-class concern in RLHF (lesson 16), and the sim-to-real gap in robotics (lesson 57).