
The reward and the simulation boundary — shaping, hacking, and sim-to-real

The reward is the last piece of the environment object — and the most dangerous. It is not "the score"; it is your entire specification of what you want, compressed into one scalar per step. The agent will optimize exactly what you wrote, not what you meant. This lesson is the engineering of that scalar: dense vs sparse, the one provably-safe way to add hints, the way clever shaping turns into self-sabotage, and the gap between the simulator you train in and the world you deploy to.

The one idea
Lesson 01 said RL maximizes expected return. That makes the reward function a contract: the agent will find the highest-return behavior your reward allows, including behaviors you never imagined. Designing the reward is designing the task. Two failure modes dominate — a reward too sparse to learn from, and a reward so helpful the agent games it — and there is exactly one shaping trick that is provably safe.

Sparse vs dense reward

The first choice is how often the scalar is non-zero.
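To make "how often the scalar is non-zero" concrete, here is a minimal sketch on a hypothetical 1-D corridor. The function names, the corridor, and the distance-based dense reward are illustrative assumptions, not part of the course code. The dense variant only illustrates signal frequency; whether a dense term preserves the task is a separate question.

```python
def sparse_reward(next_pos: int, goal: int) -> float:
    """Non-zero only on the transition that lands on the goal."""
    return 1.0 if next_pos == goal else 0.0


def dense_reward(next_pos: int, goal: int) -> float:
    """Illustrative dense signal: a distance penalty on (almost) every step."""
    return -float(abs(goal - next_pos))


def nonzero_fraction(reward_fn, path, goal) -> float:
    """Fraction of transitions along `path` that carry a non-zero reward."""
    rewards = [reward_fn(pos, goal) for pos in path[1:]]
    return sum(r != 0.0 for r in rewards) / len(rewards)


if __name__ == "__main__":
    path = [0, 1, 2, 3, 4, 5]  # hypothetical walk straight to the goal at position 5
    print("sparse:", nonzero_fraction(sparse_reward, path, goal=5))
    print("dense: ", nonzero_fraction(dense_reward, path, goal=5))
```

On this walk the sparse reward speaks once, at the very end, while the dense one gives feedback on every step before the goal.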

The one safe hint: potential-based shaping

Can you add a dense hint without changing the task? Yes — and there is essentially one way. Ng, Harada & Russell (1999) proved that a shaping term built as the difference of a potential Φ(s) over states leaves the optimal policy exactly unchanged:

F(s, a, s′) = γ·Φ(s′) − Φ(s)   ⟹   π* is preserved, for any Φ

The reason is exact, and it is not quite "the terms telescope to zero": along any trajectory the discounted shaping terms collapse to −Φ(s) at the starting state, not to zero. Adding this F therefore shifts every action's value at a state by the same amount: Q*_shaped(s, a) = Q*(s, a) − Φ(s). An offset that doesn't depend on the action can't change which action is the argmax, so the best policy is untouched, for any Φ (in episodic tasks, take Φ = 0 at terminal states). Set Φ(s) = −distance(s, goal) and the agent gets a "warmer/colder" hint whose optimum is still, provably, the real goal. If you must shape, shape with a potential.
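The shift can be derived in a few lines. This is a sketch under the usual assumptions: discounted return with γ < 1 and bounded Φ, or an episodic task with Φ = 0 at terminal states, so the boundary term vanishes. The trajectory notation s_0, a_0, r_0, s_1, … and the symbol G for the unshaped return are introduced here for illustration.

```latex
\begin{align*}
G_{\text{shaped}}
  &= \sum_{t=0}^{\infty} \gamma^{t}\,\bigl[r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr] \\
  &= \sum_{t=0}^{\infty} \gamma^{t} r_t
     + \sum_{t=0}^{\infty} \bigl[\gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t)\bigr] \\
  &= G - \Phi(s_0) + \lim_{T\to\infty} \gamma^{T}\Phi(s_T) \\
  &= G - \Phi(s_0).
\end{align*}
% The identity holds for every trajectory, hence for every policy in expectation.
% Maximizing over policies with s_0 = s and a_0 = a:
\[
Q^{*}_{\text{shaped}}(s,a) = Q^{*}(s,a) - \Phi(s)
\quad\Longrightarrow\quad
\arg\max_a Q^{*}_{\text{shaped}}(s,a) = \arg\max_a Q^{*}(s,a).
\]
```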

Reward hacking: when the proxy wins

Now the failure. Add a dense bonus without the potential structure — say +0.5 every time the agent visits a "progress" tile — and you have created a reward the agent can farm. If looping on that tile pays more discounted return than the one-time +1 at the goal, the optimal policy is to loop forever and never finish. The agent isn't broken; it solved the problem you actually wrote.
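The comparison between "finish" and "farm" is plain geometric-series arithmetic. The sketch below assumes a deterministic world where the reward arrives on the transition into a tile and the agent can re-enter the bonus tile every `period` steps. The function names and the example step counts are hypothetical.

```python
def goal_return(gamma: float, steps_to_goal: int, goal_reward: float = 1.0) -> float:
    """Discounted return of walking straight to a terminal goal.

    The goal reward is paid on the final transition, i.e. at step index
    steps_to_goal - 1, so it is discounted by gamma ** (steps_to_goal - 1).
    """
    return gamma ** (steps_to_goal - 1) * goal_reward


def farm_return(gamma: float, bonus: float, steps_to_tile: int, period: int) -> float:
    """Discounted return of reaching the bonus tile, then re-entering it
    every `period` steps forever (a geometric series with ratio gamma ** period)."""
    first_visit = gamma ** (steps_to_tile - 1) * bonus
    return first_visit / (1.0 - gamma ** period)


def break_even_bonus(gamma: float, steps_to_goal: int, steps_to_tile: int, period: int) -> float:
    """Bonus size at which farming forever pays exactly as much as finishing."""
    return (goal_return(gamma, steps_to_goal)
            * (1.0 - gamma ** period)
            / gamma ** (steps_to_tile - 1))


if __name__ == "__main__":
    # Hypothetical numbers: goal 8 steps away, bonus tile 4 steps away,
    # and a step-off/step-on loop that re-collects the bonus every 2 steps.
    gamma = 0.95
    print("finish:", goal_return(gamma, steps_to_goal=8))
    print("farm:  ", farm_return(gamma, bonus=0.5, steps_to_tile=4, period=2))
    print("break-even bonus:", break_even_bonus(gamma, 8, 4, 2))
```

Under these assumptions the break-even bonus is far below +0.5: with a discount close to 1, even a small repeatable bonus outweighs a one-time terminal reward.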

Specification gaming is the norm, not the exception
The famous case: a boat-racing agent (CoastRunners) found it could score more by spinning in a lagoon hitting respawning targets than by finishing the race — so it never raced. The same force shows up in RLHF (lesson 16): optimize a learned reward model too hard and the policy finds adversarial outputs the model over-scores — reward-model hacking. This is Goodhart's law — "when a measure becomes a target, it ceases to be a good measure" — and it is why the reward is the highest-leverage, highest-risk line in an RL project.

The simulation boundary

One more thing the environment hides: most environments are simulators, and the simulator is not the world. Three consequences you carry forward:

Figure: one axis of an environment parameter (friction, mass, latency, …) with reality marked on it. One sim setting misses reality: the policy overfits and fails on deployment. Domain-randomized sim: vary the parameters every episode so the band covers reality, and reality looks like just another sample.
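A minimal sketch of per-episode domain randomization with a seeded generator, so that a training run stays reproducible. The `SimParams` container, the parameter ranges, and the function names are hypothetical placeholders; a real simulator would have its own way of accepting these values.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimParams:
    """Physical parameters the simulator is (hypothetically) configured with."""
    friction: float
    mass: float
    latency_ms: float


# Assumed ranges, chosen only for illustration. In practice they should be
# wide enough that the real system plausibly falls inside the band.
RANGES = {
    "friction": (0.5, 1.5),
    "mass": (0.8, 1.2),
    "latency_ms": (0.0, 40.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one simulator configuration uniformly from the ranges."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def episode_params(seed: int, n_episodes: int) -> List[SimParams]:
    """One fresh configuration per episode; the seed makes the sequence repeatable."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n_episodes)]


if __name__ == "__main__":
    for i, params in enumerate(episode_params(seed=0, n_episodes=3)):
        print(f"episode {i}: {params}")
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the randomization stream independent of any other randomness in the program.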

Interactive · the reward-shaping sandbox

A robot must reach the green goal (bottom-right, +1, terminal). We solve each version of the environment for its optimal policy — sweeping lesson 01's Bellman recursion but taking the best action at each state (you'll meet this as value iteration in lesson 07) — then read off the greedy policy (arrows) and roll it out from the start (blue path). Switch the shaping and watch the policy itself change. None and potential-based produce the identical path to the goal — the proof that a potential is safe. Naive bonus drops a farmable +b on the amber tile; turn b up and watch the optimal policy abandon the goal to loop on the bonus forever — reward hacking, derived.

Gridworld reward shaping — safe vs hackable
5×5, start top-left, goal bottom-right (+1, terminal), bonus tile in the center. We compute the optimal policy under each reward for you (the planning method is value iteration, lesson 07); arrows show it, the blue path is the rollout. γ = 0.95.
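For readers without the interactive widget, the experiment can be approximated offline. This is a sketch, not the page's own implementation: it assumes deterministic moves in four directions, walls that leave the agent in place, rewards paid on the transition into a tile, Φ(s) = −Manhattan distance to the goal, and a 50-step rollout cap. All function names are hypothetical.

```python
GAMMA = 0.95
N = 5
START, GOAL, BONUS = (0, 0), (N - 1, N - 1), (N // 2, N // 2)
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # down, right, up, left as (d_row, d_col)


def step(state, action):
    """Deterministic move; walking into a wall leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else state


def phi(state):
    """Potential: negative Manhattan distance to the goal (zero at the goal)."""
    return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))


def reward(s, s2, shaping, b):
    """Reward for the transition s -> s2 under the chosen shaping mode."""
    base = 1.0 if s2 == GOAL else 0.0
    if shaping == "potential":
        return base + GAMMA * phi(s2) - phi(s)
    if shaping == "naive":
        return base + (b if s2 == BONUS else 0.0)
    return base  # "none"


def q_values(s, V, shaping, b):
    """One-step lookahead value of each action, in ACTIONS order."""
    return [reward(s, step(s, a), shaping, b) + GAMMA * V[step(s, a)] for a in ACTIONS]


def value_iteration(shaping, b=0.0, tol=1e-12, max_sweeps=5000):
    """Synchronous value iteration; the goal is terminal, so its value stays 0."""
    states = [(r, c) for r in range(N) for c in range(N)]
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        new_V = {s: 0.0 if s == GOAL else max(q_values(s, V, shaping, b)) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V


def greedy_rollout(V, shaping, b=0.0, max_steps=50):
    """Follow the greedy policy from START; ties go to the first action in ACTIONS."""
    s, path = START, [START]
    while s != GOAL and len(path) <= max_steps:
        q = q_values(s, V, shaping, b)
        best = max(q)
        a = next(a for a, qa in zip(ACTIONS, q) if qa >= best - 1e-9)
        s = step(s, a)
        path.append(s)
    return path, s == GOAL


if __name__ == "__main__":
    for shaping, b in [("none", 0.0), ("potential", 0.0), ("naive", 0.5)]:
        V = value_iteration(shaping, b)
        path, reached = greedy_rollout(V, shaping, b)
        print(f"{shaping:9s} b={b}: reached goal={reached}, steps={len(path) - 1}")
```

The tie-breaking tolerance in `greedy_rollout` matters: shaped and unshaped action values differ by Φ(s) only up to floating-point rounding, so exact ties between equally short paths are resolved by action order rather than by noise.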

The bug is the lesson
Push the naive bonus past the value of finishing and the optimal policy stops reaching the goal — it farms the tile to the step cap. No code is broken; the reward is. Potential-based shaping, fed the same distance hint, never does this — its path is byte-for-byte the unshaped one. That contrast is the whole discipline of reward design: add hints only in a form that provably can't move the optimum.

Where this goes next

You now hold the environment whole: an interface (01a), an observation that may hide the state (01b), and a reward that encodes — and can betray — your intent (here). That is the object every algorithm in the rest of the course interacts with. With it nailed down, lesson 05 finally trains an agent: learn the value function and read the policy off it. Reward hacking returns as a first-class concern in RLHF (lesson 16), and the sim-to-real gap in robotics (lesson 57).

Takeaway
The reward is your specification, optimized literally. Sparse is honest but hard to learn; dense shaping is fast but can change the task. The only provably-safe hint is potential-based: F = γΦ(s′) − Φ(s) leaves π* unchanged. Any other bonus risks reward hacking — the agent maximizing the proxy instead of your intent (Goodhart). And remember the simulator isn't the world: mind the reality gap, randomize, and seed.