Recommendation & advertising (下)
Lesson 54 recommended one item to maximize one click. Real systems serve a slate, across a session, against a budget — and they cannot A/B test every idea. This is where the bandit grows up into a full sequential MDP.
- Policy gradient (lesson 10) — the slate/session objective is non-differentiable (clicks, dwell, conversions are sampled), so we optimize the policy with the score-function estimator ∇J = 𝔼[G · ∇log π].
- Off-policy evaluation / importance sampling (lesson 12) — we cannot run a live experiment for every candidate policy, so we estimate a new policy's value from logged data using the importance ratio ρ = πnew/πlog.
From one item to a slate to a session
Lesson 54 modeled a recommendation as a contextual bandit: the user context is the state s, the chosen item is the action a, the click is the reward r. One decision, one reward, done. Two things break that picture in production:
- You serve a slate, not an item. A feed, a results page, a row of ads — the action is an ordered set of K items. Items interact: a great item in slot 1 can cannibalize clicks below it; two near-duplicates waste a slot. The action space is combinatorial.
- You serve across a session. The user comes back. What you show now changes the state later — their mood, their trust, whether they open the app tomorrow. Reward is no longer one-shot; it is a stream you sum over the whole session.
The second point is the big one. The moment your action today changes the state distribution tomorrow, you are no longer in a bandit — you are in a Markov decision process, exactly the object from lesson 01. State transitions P(s'|s,a) are back. And the goal flips from "maximize this click" to "maximize the return":
The greedy-CTR trap
The single most common failure in applied recsys is optimizing the wrong objective: greedy click-through rate. A policy that maximizes the immediate click probability P(click | s, a) will happily learn to serve clickbait — the lurid thumbnail, the outrage headline, the "you won't believe…" item. It wins the click in front of it.
The fix is to optimize the return, not the reward — to value the whole session. A long-horizon policy sometimes shows a slightly less clickable but more satisfying item, accepting a lower rt now because it keeps the session alive (and the user coming back), which yields a larger Σ γ^k r_{t+k}. This is precisely the value-vs-greedy distinction from the foundations: a value function looks down the tree; a greedy one-step rule does not. The widget below makes the two policies race on the same user.
Optimizing the session with policy gradient
Why policy gradient (lesson 10) and not value-based here? Because the slate action space is enormous and combinatorial — there is no tractable argmaxa Q(s,a) over all orderings of a million-item catalog. We instead parameterize a policy πθ(a|s) (e.g. a scorer that induces a distribution over slates) and push it uphill on the session return, using the score-function estimator from lesson 10:
where the advantage A uses a baseline/critic to cut variance — the same machinery as lessons 10–11. The reward is non-differentiable (a click either happens or it does not), so the log-derivative trick is exactly what lets us train through it. Everything you learned about reward-to-go, baselines, and variance carries over unchanged; only the meaning of "action" (a slate) and "reward" (session engagement) is new.
Ad bidding as a constrained MDP
Advertising adds one more twist: a budget. An advertiser gives you, say, $10,000 to spend over a day. Each impression is an auction; bidding higher wins more valuable impressions but spends faster. You must pace the spend — don't blow the budget by 9am, don't end the day with money unspent. This is a constrained MDP:
The standard move is a Lagrangian: fold the constraint into the reward with a multiplier λ (a "shadow price" of budget), turning each step's reward into valuet − λ · costt. Then it is an ordinary MDP again, and λ is tuned (a controller raises it when spending too fast, lowers it when too slow) so the budget binds exactly at the horizon. Budget pacing is just RL with a price on the resource — the same constrained-RL pattern that reappears in finance (lesson 60) and scheduling (lesson 61).
Off-policy evaluation: you can't A/B test everything
Here is the operational reality. You have a hundred ideas for a new policy πnew. You cannot ship a live A/B test for each — experiments are slow, expensive, and a bad policy shown to real users costs revenue and trust. But you do have a giant log of what your current policy πlog did and what happened: (s, a, r) tuples, plus the probability πlog(a|s) it assigned. Can you estimate πnew's value from that log before shipping it? Yes — this is off-policy evaluation (OPE), and it is importance sampling (lesson 12) applied to a recommender.
IPS — inverse propensity scoring
Reweight each logged reward by the ratio of how likely πnew was to take that action vs how likely πlog was — the same ratio ρ from lesson 12:
It is unbiased: actions πnew favors get upweighted, actions it would avoid get downweighted. But — exactly as in lesson 12 — when πlog(a|s) is tiny (the log rarely took an action πnew loves), the ratio explodes and the variance blows up. This is why the logging policy must explore (lesson 08 / 19): a deterministic logger leaves πlog(a|s)=0 for unseen actions and the estimate is undefined.
Doubly robust — the variance fix
Doubly-robust (DR) OPE pairs IPS with a learned reward model r̂(s,a) (a "direct method" estimate). It uses the model as a baseline and importance-weights only the residual:
"Doubly robust" = it is consistent if either the propensities πlog are correct or the reward model r̂ is correct. Notice the structure: the importance weight now multiplies a small residual instead of the full reward, so it is the same baseline-subtraction trick that tamed REINFORCE's variance in lessons 10–11 — reused here for evaluation instead of for the gradient.
Interactive · greedy CTR vs the long-horizon policy
One simulated user, one multi-step session. At each step a policy picks clickbait (high immediate click probability, but it erodes the user's patience) or a quality item (lower click now, but it preserves patience). The user's patience is the hidden state s; once it hits zero the user churns and the session ends. Greedy CTR always grabs the click in front of it — that is the bug-is-the-lesson. The long-horizon policy maximizes total session value via a one-step lookahead on patience. Run a session and watch greedy win the early clicks but churn early and lose the total.
Map back to the value / policy / model spine
| This lesson | The spine |
|---|---|
| Item → slate → session | Contextual bandit (lesson 54) → full MDP (lesson 01): actions change P(s'|s,a). |
| Greedy CTR fails; optimize session return | One-step reward vs return G_t; the value-vs-greedy lesson — a value looks down the tree. |
| Train the slate/session policy | Policy gradient (lesson 10): combinatorial action space, non-differentiable reward → score-function estimator with a baseline. |
| Ad budget pacing | Constrained MDP via a Lagrangian shadow price λ — same pattern as finance (25) and scheduling (26). |
| Can't A/B everything → OPE (IPS, DR) | Importance sampling (lesson 12): ratio ρ = πnew/πlog; DR's residual reweighting is the lesson-08 baseline trick reused for evaluation. |