Recommendation & advertising (下)

Lesson 54 recommended one item to maximize one click. Real systems serve a slate, across a session, against a budget — and they cannot A/B test every idea. This is where the bandit grows up into a full sequential MDP.

Which core lessons this reuses

Policy gradient (lesson 10) — the slate/session objective is non-differentiable (clicks, dwell, conversions are sampled), so we optimize the policy with the score-function estimator ∇J = 𝔼[G · ∇log π].
Off-policy evaluation / importance sampling (lesson 12) — we cannot run a live experiment for every candidate policy, so we estimate a new policy's value from logged data using the importance ratio ρ = π_new/π_log.

Lesson 54 was a contextual bandit: one state, one action, one reward — a one-step world. This lesson removes the "one-step" assumption.

From one item to a slate to a session

Lesson 54 modeled a recommendation as a contextual bandit: the user context is the state s, the chosen item is the action a, the click is the reward r. One decision, one reward, done. Two things break that picture in production:

You serve a slate, not an item. A feed, a results page, a row of ads — the action is an ordered set of K items. Items interact: a great item in slot 1 can cannibalize clicks below it; two near-duplicates waste a slot. The action space is combinatorial.
You serve across a session. The user comes back. What you show now changes the state later — their mood, their trust, whether they open the app tomorrow. Reward is no longer one-shot; it is a stream you sum over the whole session.

The second point is the big one. The moment your action today changes the state distribution tomorrow, you are no longer in a bandit — you are in a Markov decision process, exactly the object from lesson 01. State transitions P(s'|s,a) are back. And the goal flips from "maximize this click" to "maximize the return":

G_t = Σ_k≥0 γ^k r_t+k (γ < 1 discounts engagement that is further away in the session/lifetime)

The greedy-CTR trap

The single most common failure in applied recsys is optimizing the wrong objective: greedy click-through rate. A policy that maximizes the immediate click probability P(click | s, a) will happily learn to serve clickbait — the lurid thumbnail, the outrage headline, the "you won't believe…" item. It wins the click in front of it.

The bug

Greedy CTR maximizes r_t, the one-step reward. But clickbait raises r_t while poisoning the transition P(s'|s,a): the user is disappointed, trust erodes, and the probability they keep scrolling — or return tomorrow — drops. Greedy CTR is blind to s'. It optimizes a bandit reward inside an MDP, and the MDP punishes it later.

The fix is to optimize the return, not the reward — to value the whole session. A long-horizon policy sometimes shows a slightly less clickable but more satisfying item, accepting a lower r_t now because it keeps the session alive (and the user coming back), which yields a larger Σ γ^k r_{t+k}. This is precisely the value-vs-greedy distinction from the foundations: a value function looks down the tree; a greedy one-step rule does not. The widget below makes the two policies race on the same user.

Optimizing the session with policy gradient

Why policy gradient (lesson 10) and not value-based here? Because the slate action space is enormous and combinatorial — there is no tractable argmax_a Q(s,a) over all orderings of a million-item catalog. We instead parameterize a policy π_θ(a|s) (e.g. a scorer that induces a distribution over slates) and push it uphill on the session return, using the score-function estimator from lesson 10:

∇_θ J(θ) = 𝔼_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t|s_t) · A(s_t, a_t) ]

where the advantage A uses a baseline/critic to cut variance — the same machinery as lessons 10–11. The reward is non-differentiable (a click either happens or it does not), so the log-derivative trick is exactly what lets us train through it. Everything you learned about reward-to-go, baselines, and variance carries over unchanged; only the meaning of "action" (a slate) and "reward" (session engagement) is new.

Ad bidding as a constrained MDP

Advertising adds one more twist: a budget. An advertiser gives you, say, $10,000 to spend over a day. Each impression is an auction; bidding higher wins more valuable impressions but spends faster. You must pace the spend — don't blow the budget by 9am, don't end the day with money unspent. This is a constrained MDP:

maximize 𝔼[ Σ_t value_t ] subject to 𝔼[ Σ_t cost_t ] ≤ B

The standard move is a Lagrangian: fold the constraint into the reward with a multiplier λ (a "shadow price" of budget), turning each step's reward into value_t − λ · cost_t. Then it is an ordinary MDP again, and λ is tuned (a controller raises it when spending too fast, lowers it when too slow) so the budget binds exactly at the horizon. Budget pacing is just RL with a price on the resource — the same constrained-RL pattern that reappears in finance (lesson 60) and scheduling (lesson 61).

Off-policy evaluation: you can't A/B test everything

Here is the operational reality. You have a hundred ideas for a new policy π_new. You cannot ship a live A/B test for each — experiments are slow, expensive, and a bad policy shown to real users costs revenue and trust. But you do have a giant log of what your current policy π_log did and what happened: (s, a, r) tuples, plus the probability π_log(a|s) it assigned. Can you estimate π_new's value from that log before shipping it? Yes — this is off-policy evaluation (OPE), and it is importance sampling (lesson 12) applied to a recommender.

IPS — inverse propensity scoring

Reweight each logged reward by the ratio of how likely π_new was to take that action vs how likely π_log was — the same ratio ρ from lesson 12:

V̂_IPS(π_new) = (1/N) Σ_i π_new(a_i|s_i)π_log(a_i|s_i) · r_i

It is unbiased: actions π_new favors get upweighted, actions it would avoid get downweighted. But — exactly as in lesson 12 — when π_log(a|s) is tiny (the log rarely took an action π_new loves), the ratio explodes and the variance blows up. This is why the logging policy must explore (lesson 08 / 19): a deterministic logger leaves π_log(a|s)=0 for unseen actions and the estimate is undefined.

Doubly robust — the variance fix

Doubly-robust (DR) OPE pairs IPS with a learned reward model r̂(s,a) (a "direct method" estimate). It uses the model as a baseline and importance-weights only the residual:

V̂_DR = (1/N) Σ_i [ r̂(s_i, π_new) + ρ_i · ( r_i − r̂(s_i, a_i) ) ]

"Doubly robust" = it is consistent if either the propensities π_log are correct or the reward model r̂ is correct. Notice the structure: the importance weight now multiplies a small residual instead of the full reward, so it is the same baseline-subtraction trick that tamed REINFORCE's variance in lessons 10–11 — reused here for evaluation instead of for the gradient.

OPE is your offline filter

The workflow: train candidate policies with policy gradient (07) on logged data, score each one with DR off-policy evaluation (09), and A/B test online only the handful that OPE says are promising. OPE turns "we can't test everything" into "we can pre-screen everything and test the survivors."

Interactive · greedy CTR vs the long-horizon policy

One simulated user, one multi-step session. At each step a policy picks clickbait (high immediate click probability, but it erodes the user's patience) or a quality item (lower click now, but it preserves patience). The user's patience is the hidden state s; once it hits zero the user churns and the session ends. Greedy CTR always grabs the click in front of it — that is the bug-is-the-lesson. The long-horizon policy maximizes total session value via a one-step lookahead on patience. Run a session and watch greedy win the early clicks but churn early and lose the total.

Session simulator — greedy CTR vs long-horizon

Each policy plays the same user (same random seed per run). A bar per step shows what each served; the line tracks cumulative session value. Drag clickbait erosion — how much patience a clickbait item burns — to control how badly greedy poisons its own future.

clickbait erosion: 1.00 γ (long-horizon foresight): 0.90

Greedy — clicks

—

Greedy — session value

—

Greedy — steps before churn

—

Long-horizon — clicks

—

Long-horizon — session value

—

Long-horizon — steps before churn

—

Show the core JS (≈25 lines)

// Hidden state = user patience in [0,1]. Session ends when patience <= 0.
// Two items. Clickbait: high click prob, burns patience. Quality: lower click, restores it.
const CLICKBAIT = { pClick: 0.85, dPatience: -EROSION * 0.18 };
const QUALITY   = { pClick: 0.45, dPatience: +0.05 };

function clickProb(item, patience){ return item.pClick * (0.4 + 0.6*patience); } // bored users click less

// GREEDY-CTR: pick the item with higher immediate click prob — ignores s'. (the bug)
function greedyAction(p){ return clickProb(CLICKBAIT,p) > clickProb(QUALITY,p) ? CLICKBAIT : QUALITY; }

// LONG-HORIZON: 1-step lookahead on return = r + γ·(future value ∝ next patience).
function longAction(p, gamma){
  const score = it => clickProb(it,p) + gamma * Math.max(0, p + it.dPatience) * 1.5;
  return score(CLICKBAIT) > score(QUALITY) ? CLICKBAIT : QUALITY;
}

function runSession(policy, gamma){
  let p = 0.7, value = 0, clicks = 0, t = 0;
  while (p > 0 && t < 24){
    const item = policy(p, gamma);
    const clicked = Math.random() < clickProb(item, p);
    if (clicked){ value += 1; clicks++; }
    p = Math.min(1, p + item.dPatience);   // action changes the NEXT state — this is the MDP
    t++;
  }
  return { value, clicks, len: t };
}

Map back to the value / policy / model spine

This lesson	The spine
Item → slate → session	Contextual bandit (lesson 54) → full MDP (lesson 01): actions change P(s'\|s,a).
Greedy CTR fails; optimize session return	One-step reward vs return G_t; the value-vs-greedy lesson — a value looks down the tree.
Train the slate/session policy	Policy gradient (lesson 10): combinatorial action space, non-differentiable reward → score-function estimator with a baseline.
Ad budget pacing	Constrained MDP via a Lagrangian shadow price λ — same pattern as finance (25) and scheduling (26).
Can't A/B everything → OPE (IPS, DR)	Importance sampling (lesson 12): ratio ρ = π_new/π_log; DR's residual reweighting is the lesson-08 baseline trick reused for evaluation.

Takeaway

Recommendation is a sequential MDP, not a bandit: your action today reshapes the user's state tomorrow, so you optimize the session return, not the next click — greedy CTR is the clickbait trap. Train the slate/session policy with policy gradient (07), pace ad budgets as a constrained MDP, and because you can't A/B test every idea, pre-screen candidate policies offline with off-policy evaluation — IPS and doubly-robust — built on the lesson-09 importance ratio.