The reward and the simulation boundary — shaping, hacking, and sim-to-real

The reward is the last piece of the environment object — and the most dangerous. It is not "the score"; it is your entire specification of what you want, compressed into one scalar per step. The agent will optimize exactly what you wrote, not what you meant. This lesson is the engineering of that scalar: dense vs sparse, the one provably-safe way to add hints, the way clever shaping turns into self-sabotage, and the gap between the simulator you train in and the world you deploy to.

The one idea

Lesson 01 said RL maximizes expected return. That makes the reward function a contract: the agent will find the highest-return behavior your reward allows, including behaviors you never imagined. Designing the reward is designing the task. Two failure modes dominate — a reward too sparse to learn from, and a reward so helpful the agent games it — and there is exactly one shaping trick that is provably safe.

Sparse vs dense reward

The first choice is how often the scalar is non-zero.

Sparse — reward only at the outcome: +1 at the goal, 0 everywhere else. It is honest (it says exactly what you want and nothing more) but brutal to learn from: until the agent stumbles onto the goal by chance, every action looks equally worthless. This is the credit-assignment and exploration problem of lesson 01 at its sharpest (and why exploration gets its own lesson 08).
Dense / shaped — a hint every step: + for moving toward the goal, − for wasted motion. Learning is far faster because the gradient of "better" is visible everywhere. But now you've written a second objective, and the agent will optimize it literally — which is where the trouble starts.
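
To make the contrast concrete, here is a minimal sketch of both reward styles on a grid. The (row, col) state encoding, the `GOAL` cell, and the 0.1 and 0.01 coefficients are assumptions chosen for illustration, not values from this lesson. The dense version is a naive hint of the kind described above; nothing about it guarantees that the best behavior is unchanged.

```python
from typing import Tuple

State = Tuple[int, int]   # (row, col) on a grid; an assumed representation
GOAL: State = (3, 3)      # hypothetical goal cell


def manhattan(s: State, goal: State = GOAL) -> int:
    """Number of grid moves between s and the goal."""
    return abs(s[0] - goal[0]) + abs(s[1] - goal[1])


def sparse_reward(s: State, s_next: State) -> float:
    """+1 only on arrival at the goal, 0 everywhere else."""
    return 1.0 if s_next == GOAL else 0.0


def dense_reward(s: State, s_next: State, step_cost: float = 0.01) -> float:
    """Sparse reward plus a naive hint: +0.1 per cell of progress toward
    the goal (negative when moving away), minus a small cost per move."""
    progress = manhattan(s) - manhattan(s_next)
    return sparse_reward(s, s_next) + 0.1 * progress - step_cost
```

Under the sparse version every move away from the goal scores exactly like every move toward it (zero) until the final step; under the dense version each move carries a sign the agent can learn from immediately.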

The one safe hint: potential-based shaping

Can you add a dense hint without changing the task? Yes — and there is essentially one way. Ng, Harada & Russell (1999) proved that a shaping term built as the difference of a potential Φ(s) over states leaves the optimal policy exactly unchanged:

F(s, a, s′) = γ·Φ(s′) − Φ(s) ⟹ π* is preserved, for any Φ

The reason is exact, and it is not quite "the terms telescope to zero" (with γ < 1 they don't). Along any trajectory starting at s, the discounted shaping terms collapse to −Φ(s), provided Φ is bounded and equals 0 at terminal states. So adding F shifts every action's value at a state by the same amount: Q*_shaped(s,a) = Q*(s,a) − Φ(s). A shift that doesn't depend on the action can't change which action is the argmax, so the best policy is untouched. Set Φ(s) = −distance(s, goal), which is 0 at the goal, and the agent gets a "warmer/colder" hint whose optimum is still, provably, the real goal. If you must shape, shape with a potential.
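
The identity Q*_shaped(s,a) = Q*(s,a) − Φ(s) can be checked numerically. The sketch below assumes a tiny deterministic chain of five states with the goal at the right end, γ = 0.9, and Φ(s) = −distance to the goal; the chain, the names, and the sweep count are all illustrative choices, not part of the lesson's sandbox.

```python
GAMMA = 0.9
N = 5                 # states 0..4 on a line; state 4 is the terminal goal
GOAL = N - 1
ACTIONS = (-1, +1)    # left, right


def step(s, a):
    """Deterministic move, clipped at the left wall and at the goal."""
    return min(max(s + a, 0), GOAL)


def phi(s):
    """Potential: negative distance to the goal, so phi(GOAL) == 0."""
    return -float(GOAL - s)


def reward(s, a, s_next, shaped):
    r = 1.0 if s_next == GOAL else 0.0
    if shaped:
        r += GAMMA * phi(s_next) - phi(s)   # F(s, a, s')
    return r


def q_star(shaped, sweeps=500):
    """Q-value iteration; the terminal state's value is fixed at 0."""
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(sweeps):
        new_q = {}
        for (s, a) in q:
            s2 = step(s, a)
            v2 = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
            new_q[(s, a)] = reward(s, a, s2, shaped) + GAMMA * v2
        q = new_q
    return q


def greedy(q):
    """Best action at each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]


q_plain = q_star(shaped=False)
q_shaped = q_star(shaped=True)

# Largest deviation from Q*_shaped(s, a) = Q*(s, a) - phi(s).
max_gap = max(abs(q_shaped[k] - (q_plain[k] - phi(k[0]))) for k in q_plain)
```

Here `max_gap` comes out at floating-point noise and `greedy(q_plain)` equals `greedy(q_shaped)`: the shaped values differ from the plain ones only by a per-state offset, so the argmax at each state is the same.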

Reward hacking: when the proxy wins

Now the failure. Add a dense bonus without the potential structure — say +0.5 every time the agent visits a "progress" tile — and you have created a reward the agent can farm. If looping on that tile pays more discounted return than the one-time +1 at the goal, the optimal policy is to loop forever and never finish. The agent isn't broken; it solved the problem you actually wrote.
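
A back-of-the-envelope comparison shows how little it takes. The sketch assumes the agent can re-collect the bonus every `period` steps forever (for example by stepping off the tile and back on, so `period=2`) and that the goal pays once on arrival; both function names are hypothetical helpers, and the comparison ignores mixed strategies such as farming for a while and then finishing.

```python
def farm_return(bonus, gamma, period=2):
    """Discounted return of collecting `bonus` now and every `period` steps after, forever."""
    return bonus / (1.0 - gamma ** period)


def finish_return(gamma, steps_to_goal, goal_reward=1.0):
    """Discounted return of walking straight to the goal; the reward arrives on the last step."""
    return goal_reward * gamma ** (steps_to_goal - 1)


def break_even_bonus(gamma, steps_to_goal, period=2):
    """Smallest bonus at which farming forever matches finishing."""
    return finish_return(gamma, steps_to_goal) * (1.0 - gamma ** period)
```

With γ = 0.9 and the +0.5 bonus above, `farm_return(0.5, 0.9)` is about 2.63, while finishing can never be worth more than 1.0, so looping wins from any distance. Under these assumptions the break-even bonus three steps from the goal is only about 0.15.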

Specification gaming is the norm, not the exception

The famous case: a boat-racing agent (CoastRunners) found it could score more by spinning in a lagoon hitting respawning targets than by finishing the race — so it never raced. The same force shows up in RLHF (lesson 16): optimize a learned reward model too hard and the policy finds adversarial outputs the model over-scores — reward-model hacking. This is Goodhart's law — "when a measure becomes a target, it ceases to be a good measure" — and it is why the reward is the highest-leverage, highest-risk line in an RL project.

The simulation boundary

One more thing the environment hides: most environments are simulators, and the simulator is not the world. Three consequences you carry forward:

The reality gap. A policy trained in sim exploits the sim's quirks (friction, contact, and latency that the real world doesn't share) and degrades on deployment. Closing that gap, for example with domain randomization (randomize sim parameters so that reality looks like just another variant), is a central topic of robotics, lesson 57.
Determinism & seeding. A reproducible environment is seeded: same seed → same episode. Without it you cannot tell a real improvement from luck, and bugs become unrepeatable. Seed your envs.
Throughput. The environment, not the GPU, is often the bottleneck — RL needs millions of steps. Vectorizing many env copies (lesson 65) is how the data plane keeps up.
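
The first two points above can be sketched in a few lines. This is a toy, assuming a hypothetical one-dimensional simulator (`ToyEnv`) with a single randomized parameter; the friction range and noise scale are made-up illustrative values, and only Python's standard `random` module is used. The key design choice is that every random draw goes through one generator owned by the environment, so the seed fully determines the episode.

```python
import random


class ToyEnv:
    """Hypothetical 1-D simulator; every random draw comes from one seeded generator."""

    def __init__(self, seed, randomize=True):
        self.rng = random.Random(seed)      # private RNG, independent of global state
        self.randomize = randomize
        self.friction = 0.5                 # nominal value (made up)
        self.pos = 0.0

    def reset(self):
        if self.randomize:
            # Domain randomization: resample a sim parameter every episode.
            self.friction = self.rng.uniform(0.2, 0.8)   # range is illustrative
        self.pos = 0.0
        return self.pos

    def step(self, force):
        noise = self.rng.gauss(0.0, 0.01)   # small actuation noise
        self.pos += force * (1.0 - self.friction) + noise
        return self.pos


def rollout(seed, steps=5):
    """Run one short episode with a constant action and return the positions."""
    env = ToyEnv(seed)
    env.reset()
    return [env.step(1.0) for _ in range(steps)]
```

Calling `rollout(7)` twice returns the same list, while `rollout(8)` samples a different friction and a different trajectory. That is the property that lets you separate a real improvement from a lucky draw.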

Interactive · the reward-shaping sandbox

A robot must reach the green goal (bottom-right, +1, terminal). We solve each version of the environment for its optimal policy by sweeping lesson 01's Bellman recursion but taking the best action at each state (you'll meet this as value iteration in lesson 07), then read off the greedy policy (arrows) and roll it out from the start (blue path). Switch the shaping and watch the policy itself change. None and potential-based produce the identical path to the goal: a concrete demonstration that a potential is safe. Naive bonus drops a farmable +b on the amber tile; turn b up and watch the optimal policy abandon the goal to loop on the bonus forever. That is reward hacking, derived.
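
For readers without the interactive widget, the same experiment can be approximated offline. The sketch below is not the sandbox's actual code: it assumes a 4×4 grid, deterministic moves, γ = 0.9, a bonus tile at (1, 1) that pays on every entry, and a 30-step rollout cap. Ties between equally good actions are broken by a fixed action order with a small tolerance, so that floating-point noise cannot make the shaped and unshaped rollouts diverge.

```python
GAMMA, SIZE = 0.9, 4
START, GOAL, BONUS_TILE = (0, 0), (3, 3), (1, 1)   # layout is an assumption
ACTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]        # right, down, left, up
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]


def move(s, a):
    """Deterministic step; walls clip the move."""
    r = min(max(s[0] + a[0], 0), SIZE - 1)
    c = min(max(s[1] + a[1], 0), SIZE - 1)
    return (r, c)


def phi(s):
    """Potential = negative Manhattan distance to the goal."""
    return -float(abs(GOAL[0] - s[0]) + abs(GOAL[1] - s[1]))


def reward(s, s2, mode, b):
    r = 1.0 if s2 == GOAL else 0.0
    if mode == "potential":
        r += GAMMA * phi(s2) - phi(s)
    elif mode == "naive" and s2 == BONUS_TILE:
        r += b                       # paid on every entry, so it can be farmed
    return r


def q_values(v, s, mode, b):
    """One-step lookahead values for each action, in ACTIONS order."""
    out = []
    for a in ACTIONS:
        s2 = move(s, a)
        out.append(reward(s, s2, mode, b) + GAMMA * (0.0 if s2 == GOAL else v[s2]))
    return out


def solve(mode, b=0.0, sweeps=500):
    """Synchronous value iteration; the terminal goal keeps value 0."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: 0.0 if s == GOAL else max(q_values(v, s, mode, b)) for s in STATES}
    return v


def rollout(mode, b=0.0, cap=30):
    """Follow the greedy policy from START until the goal or the step cap."""
    v, s, path = solve(mode, b), START, [START]
    while s != GOAL and len(path) <= cap:
        q = q_values(v, s, mode, b)
        # First action within tolerance of the best, so ties break the same way in every mode.
        a = next(a for a, qa in zip(ACTIONS, q) if qa >= max(q) - 1e-9)
        s = move(s, a)
        path.append(s)
    return path
```

In this setup `rollout("none")` and `rollout("potential")` return the same path ending at the goal, while `rollout("naive", b=0.5)` runs to the step cap without ever reaching it.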

The bug is the lesson

Push the naive bonus high enough that farming's discounted return beats finishing, and the optimal policy stops reaching the goal: it farms the tile until the step cap. No code is broken; the reward is. Potential-based shaping, fed the same distance hint, never does this: its path matches the unshaped one step for step. That contrast is the whole discipline of reward design: add hints only in a form that provably can't move the optimum.

Where this goes next

You now hold the environment whole: an interface (01a), an observation that may hide the state (01b), and a reward that encodes — and can betray — your intent (here). That is the object every algorithm in the rest of the course interacts with. With it nailed down, lesson 05 finally trains an agent: learn the value function and read the policy off it. Reward hacking returns as a first-class concern in RLHF (lesson 16), and the sim-to-real gap in robotics (lesson 57).

Takeaway

The reward is your specification, optimized literally. Sparse is honest but hard to learn; dense shaping is fast but can change the task. The only provably-safe hint is potential-based: F = γΦ(s′) − Φ(s) leaves π* unchanged. Any other bonus risks reward hacking — the agent maximizing the proxy instead of your intent (Goodhart). And remember the simulator isn't the world: mind the reality gap, randomize, and seed.