all lessons / reinforcement learning / 55 · Recommendation & advertising (下) lesson 55 / 87

Recommendation & advertising (下)

Lesson 54 recommended one item to maximize one click. Real systems serve a slate, across a session, against a budget — and they cannot A/B test every idea. This is where the bandit grows up into a full sequential MDP.

Which core lessons this reuses
Lesson 54 was a contextual bandit: one state, one action, one reward — a one-step world. This lesson removes the "one-step" assumption.

From one item to a slate to a session

Lesson 54 modeled a recommendation as a contextual bandit: the user context is the state s, the chosen item is the action a, the click is the reward r. One decision, one reward, done. Two things break that picture in production:

The second point is the big one. The moment your action today changes the state distribution tomorrow, you are no longer in a bandit — you are in a Markov decision process, exactly the object from lesson 01. State transitions P(s'|s,a) are back. And the goal flips from "maximize this click" to "maximize the return":

Gt = Σk≥0 γk rt+k     (γ < 1 discounts engagement that is further away in the session/lifetime)

The greedy-CTR trap

The single most common failure in applied recsys is optimizing the wrong objective: greedy click-through rate. A policy that maximizes the immediate click probability P(click | s, a) will happily learn to serve clickbait — the lurid thumbnail, the outrage headline, the "you won't believe…" item. It wins the click in front of it.

The bug
Greedy CTR maximizes rt, the one-step reward. But clickbait raises rt while poisoning the transition P(s'|s,a): the user is disappointed, trust erodes, and the probability they keep scrolling — or return tomorrow — drops. Greedy CTR is blind to s'. It optimizes a bandit reward inside an MDP, and the MDP punishes it later.

The fix is to optimize the return, not the reward — to value the whole session. A long-horizon policy sometimes shows a slightly less clickable but more satisfying item, accepting a lower rt now because it keeps the session alive (and the user coming back), which yields a larger Σ γ^k r_{t+k}. This is precisely the value-vs-greedy distinction from the foundations: a value function looks down the tree; a greedy one-step rule does not. The widget below makes the two policies race on the same user.

Optimizing the session with policy gradient

Why policy gradient (lesson 10) and not value-based here? Because the slate action space is enormous and combinatorial — there is no tractable argmaxa Q(s,a) over all orderings of a million-item catalog. We instead parameterize a policy πθ(a|s) (e.g. a scorer that induces a distribution over slates) and push it uphill on the session return, using the score-function estimator from lesson 10:

θ J(θ) = 𝔼τ ∼ πθ [ Σtθ log πθ(at|st) · A(st, at) ]

where the advantage A uses a baseline/critic to cut variance — the same machinery as lessons 10–11. The reward is non-differentiable (a click either happens or it does not), so the log-derivative trick is exactly what lets us train through it. Everything you learned about reward-to-go, baselines, and variance carries over unchanged; only the meaning of "action" (a slate) and "reward" (session engagement) is new.

Ad bidding as a constrained MDP

Advertising adds one more twist: a budget. An advertiser gives you, say, $10,000 to spend over a day. Each impression is an auction; bidding higher wins more valuable impressions but spends faster. You must pace the spend — don't blow the budget by 9am, don't end the day with money unspent. This is a constrained MDP:

maximize 𝔼[ Σt valuet ]   subject to   𝔼[ Σt costt ] ≤ B

The standard move is a Lagrangian: fold the constraint into the reward with a multiplier λ (a "shadow price" of budget), turning each step's reward into valuet − λ · costt. Then it is an ordinary MDP again, and λ is tuned (a controller raises it when spending too fast, lowers it when too slow) so the budget binds exactly at the horizon. Budget pacing is just RL with a price on the resource — the same constrained-RL pattern that reappears in finance (lesson 60) and scheduling (lesson 61).

Off-policy evaluation: you can't A/B test everything

Here is the operational reality. You have a hundred ideas for a new policy πnew. You cannot ship a live A/B test for each — experiments are slow, expensive, and a bad policy shown to real users costs revenue and trust. But you do have a giant log of what your current policy πlog did and what happened: (s, a, r) tuples, plus the probability πlog(a|s) it assigned. Can you estimate πnew's value from that log before shipping it? Yes — this is off-policy evaluation (OPE), and it is importance sampling (lesson 12) applied to a recommender.

IPS — inverse propensity scoring

Reweight each logged reward by the ratio of how likely πnew was to take that action vs how likely πlog was — the same ratio ρ from lesson 12:

IPSnew) = (1/N) Σi   πnew(ai|si)πlog(ai|si)   · ri

It is unbiased: actions πnew favors get upweighted, actions it would avoid get downweighted. But — exactly as in lesson 12 — when πlog(a|s) is tiny (the log rarely took an action πnew loves), the ratio explodes and the variance blows up. This is why the logging policy must explore (lesson 08 / 19): a deterministic logger leaves πlog(a|s)=0 for unseen actions and the estimate is undefined.

Doubly robust — the variance fix

Doubly-robust (DR) OPE pairs IPS with a learned reward model r̂(s,a) (a "direct method" estimate). It uses the model as a baseline and importance-weights only the residual:

DR = (1/N) Σi [   r̂(si, πnew)   +   ρi · ( ri − r̂(si, ai) )   ]

"Doubly robust" = it is consistent if either the propensities πlog are correct or the reward model is correct. Notice the structure: the importance weight now multiplies a small residual instead of the full reward, so it is the same baseline-subtraction trick that tamed REINFORCE's variance in lessons 10–11 — reused here for evaluation instead of for the gradient.

OPE is your offline filter
The workflow: train candidate policies with policy gradient (07) on logged data, score each one with DR off-policy evaluation (09), and A/B test online only the handful that OPE says are promising. OPE turns "we can't test everything" into "we can pre-screen everything and test the survivors."

Interactive · greedy CTR vs the long-horizon policy

One simulated user, one multi-step session. At each step a policy picks clickbait (high immediate click probability, but it erodes the user's patience) or a quality item (lower click now, but it preserves patience). The user's patience is the hidden state s; once it hits zero the user churns and the session ends. Greedy CTR always grabs the click in front of it — that is the bug-is-the-lesson. The long-horizon policy maximizes total session value via a one-step lookahead on patience. Run a session and watch greedy win the early clicks but churn early and lose the total.

Session simulator — greedy CTR vs long-horizon
Each policy plays the same user (same random seed per run). A bar per step shows what each served; the line tracks cumulative session value. Drag clickbait erosion — how much patience a clickbait item burns — to control how badly greedy poisons its own future.
Greedy — clicks
Greedy — session value
Greedy — steps before churn
Long-horizon — clicks
Long-horizon — session value
Long-horizon — steps before churn
Show the core JS (≈25 lines)
// Hidden state = user patience in [0,1]. Session ends when patience <= 0.
// Two items. Clickbait: high click prob, burns patience. Quality: lower click, restores it.
const CLICKBAIT = { pClick: 0.85, dPatience: -EROSION * 0.18 };
const QUALITY   = { pClick: 0.45, dPatience: +0.05 };

function clickProb(item, patience){ return item.pClick * (0.4 + 0.6*patience); } // bored users click less

// GREEDY-CTR: pick the item with higher immediate click prob — ignores s'. (the bug)
function greedyAction(p){ return clickProb(CLICKBAIT,p) > clickProb(QUALITY,p) ? CLICKBAIT : QUALITY; }

// LONG-HORIZON: 1-step lookahead on return = r + γ·(future value ∝ next patience).
function longAction(p, gamma){
  const score = it => clickProb(it,p) + gamma * Math.max(0, p + it.dPatience) * 1.5;
  return score(CLICKBAIT) > score(QUALITY) ? CLICKBAIT : QUALITY;
}

function runSession(policy, gamma){
  let p = 0.7, value = 0, clicks = 0, t = 0;
  while (p > 0 && t < 24){
    const item = policy(p, gamma);
    const clicked = Math.random() < clickProb(item, p);
    if (clicked){ value += 1; clicks++; }
    p = Math.min(1, p + item.dPatience);   // action changes the NEXT state — this is the MDP
    t++;
  }
  return { value, clicks, len: t };
}

Map back to the value / policy / model spine

This lessonThe spine
Item → slate → sessionContextual bandit (lesson 54) → full MDP (lesson 01): actions change P(s'|s,a).
Greedy CTR fails; optimize session returnOne-step reward vs return G_t; the value-vs-greedy lesson — a value looks down the tree.
Train the slate/session policyPolicy gradient (lesson 10): combinatorial action space, non-differentiable reward → score-function estimator with a baseline.
Ad budget pacingConstrained MDP via a Lagrangian shadow price λ — same pattern as finance (25) and scheduling (26).
Can't A/B everything → OPE (IPS, DR)Importance sampling (lesson 12): ratio ρ = πnewlog; DR's residual reweighting is the lesson-08 baseline trick reused for evaluation.
Takeaway
Recommendation is a sequential MDP, not a bandit: your action today reshapes the user's state tomorrow, so you optimize the session return, not the next click — greedy CTR is the clickbait trap. Train the slate/session policy with policy gradient (07), pace ad budgets as a constrained MDP, and because you can't A/B test every idea, pre-screen candidate policies offline with off-policy evaluation — IPS and doubly-robust — built on the lesson-09 importance ratio.