
Financial trading (part 1): stock trading

A market looks like the perfect RL problem: a clean sequence of states, an obvious action set, and a reward — money — that nobody has to hand-design. That promise is exactly why finance has chewed up more RL projects than almost any other domain. This lesson is about why: trading is an MDP whose state you can never fully see, whose rules change while you play, and whose past is a liar.

What this lesson reuses
This is an applications lesson, with no new algorithm. It is the MDP frame from the start of the course, stress-tested by a hostile domain.

Trading as an MDP — the seductive part

Recall the loop from orientation: observe a state, pick an action, receive a reward and a next state, repeat. A single-asset trading agent maps in directly:

MDP element | Trading
state s_t | recent prices, returns, volume, technical indicators, your current position
action a_t | buy / hold / sell, or more generally a target position size ∈ [−1, +1] (short to long)
reward r_t | mark-to-market PnL over the step, minus transaction cost when the position changes
return G_t | discounted cumulative profit, the thing you actually want to maximize
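
The mapping above can be sketched as a rollout loop. This is a minimal illustration under assumptions, not the course's environment: the names `rollout`, `policy`, `costRate`, and `gamma`, and the shape of the observation object, are all hypothetical.

```javascript
// Hypothetical rollout loop (illustrative names; not the course's environment API).
// policy: (observation) => target position in [-1, +1]
function rollout(prices, policy, costRate = 0, gamma = 1) {
  let position = 0;   // start flat
  let discount = 1;
  let G = 0;          // discounted return accumulated over the episode
  for (let t = 1; t < prices.length; t++) {
    // reward, part 1: PnL of the position held over the bar that just closed
    const pnl = position * (prices[t] - prices[t - 1]);
    // state: what the agent observes at time t
    const observation = {
      price: prices[t],
      lastChange: prices[t] - prices[t - 1],
      position,
    };
    // action: clip the policy output to the allowed range
    const target = Math.max(-1, Math.min(1, policy(observation)));
    // reward, part 2: transaction cost for changing the position
    const cost = costRate * Math.abs(target - position);
    G += discount * (pnl - cost);
    discount *= gamma;
    position = target;
  }
  return G;
}

// Usage: an always-long policy on a toy price path, cost rate 0.5 per unit of position change.
const alwaysLong = () => 1;
const toyReturn = rollout([100, 101, 103, 102], alwaysLong, 0.5);
const toyReturnNoCost = rollout([100, 101, 103, 102], alwaysLong, 0);
```

On this toy path the agent is flat over the first bar, pays once to go long, then earns +2 and −1, so the undiscounted return is 1 before cost and 0.5 after it.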

The reward is the part that makes practitioners' eyes light up. In most of this course we agonized over where the reward comes from — a verifier, a human-preference model (RLHF), an expert's behavior (inverse RL). Here the reward is just money: realized profit and loss, no labeling, no reward model, perfectly verifiable. That is genuinely a gift. The trouble is everything around the reward.

It is a POMDP, not an MDP

The Markov property (lesson 01) says the current state contains everything you need to predict the future. The "true state" of a market — every participant's intent, every pending order, the macro shock that hits in an hour — is enormous and hidden. What you observe is a thin, noisy projection: a price tape and some indicators.

true state z_t (hidden)  →  observation o_t = f(z_t) + noise  (prices, indicators)

That is the definition of a partially observable MDP (POMDP), the structure introduced in lesson 05. The standard patch is to fold history into the state (stack the last k bars, or carry a recurrent hidden state) so the agent acts on a belief about z_t rather than a single tick. It helps, but it never recovers the hidden state; the residual uncertainty is permanent, and it is why a price that "should" go up can still go down.
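
The "stack the last k bars" patch might look like the sketch below. The helper name `stackHistory` and the choice of raw price changes as features are assumptions made for illustration; a real agent could just as well stack returns, volume, or indicators.

```javascript
// Hypothetical helper: build an observation from the last k one-bar price changes.
function stackHistory(prices, t, k) {
  if (t < k) return null; // not enough history yet to fill the window
  const window = [];
  for (let i = t - k + 1; i <= t; i++) {
    window.push(prices[i] - prices[i - 1]); // price change over bar i
  }
  return window;
}

// Usage: the three most recent changes as of the last bar.
const stacked = stackHistory([100, 101, 103, 102, 104], 4, 3);
const tooEarly = stackHistory([100, 101, 103, 102, 104], 2, 3);
```

The window gives the policy a short memory, but it is still a function of observed prices only; nothing in it reveals the hidden z_t.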

Partial observability also poisons exploration. In a gridworld you explore for free — a wasted step costs you nothing real. Here every exploratory trade pays a spread and a fee and risks capital. The explore/exploit dilemma of lesson 08 is still the dilemma, but the "explore" arm now has a cash price tag, which pushes the whole field toward learning from logged data instead of live trial-and-error (hold that thought for offline RL).

The three hazards that make finance brutal

Plenty of domains are POMDPs and survive. Finance is singled out because three hazards stack on top of the partial observability, and each one breaks a different assumption the rest of this course leaned on.

Hazard 1 — non-stationarity (the rules change while you play)

Every method in this course assumed a fixed transition function P(s′|s,a) and reward R(s,a). Value iteration, Q-learning, policy gradients: all of them presume a fixed target to converge to. Markets have no fixed target. The data-generating distribution drifts: regimes flip from low to high volatility, a strategy that printed money for a year stops working when enough others discover it, a central-bank decision rewrites correlations overnight.

Why this is worse than it sounds
Non-stationarity is not just noise you can average out. It means the optimal policy π* is itself a moving target: yesterday's optimal policy may not even be a good policy tomorrow. Worse, the market is partly adversarial: if you find an edge, other participants trade against it until it vanishes. You are not solving a stationary MDP; you are playing a game whose payoff matrix is being edited, partly in response to your own success.
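
A toy illustration of the moving target, under loud assumptions: the two price series below are hand-made stand-ins for a "trending" and a "mean-reverting" regime, not market data, and `momentumPnl` is a hypothetical helper. The same fixed rule is profitable in one regime and a steady loser in the other.

```javascript
// Momentum rule: hold the sign of the most recent price change.
function momentumPnl(prices) {
  let pnl = 0, position = 0;
  for (let t = 1; t < prices.length; t++) {
    pnl += position * (prices[t] - prices[t - 1]);     // earn on the position held
    position = Math.sign(prices[t] - prices[t - 1]);   // then follow the last move
  }
  return pnl;
}

const trending  = [100, 101, 102, 103, 104, 105]; // every move repeats
const reverting = [100, 101, 100, 101, 100, 101]; // every move reverses

const pnlTrending  = momentumPnl(trending);
const pnlReverting = momentumPnl(reverting);
```

Here the rule earns +4 on the trending series and −4 on the reverting one. A policy fitted while the first regime held would be confidently wrong after the switch.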

Hazard 2 — transaction costs (the gross/net gap that eats you alive)

Add a fee. It sounds like a footnote — a few basis points per trade — but it silently rewrites which policy is optimal. The reward is not raw PnL; it is PnL net of cost:

r_t = position_{t−1} · (price_t − price_{t−1})  −  c · |position_t − position_{t−1}|

The first term is the market move you captured; the second is the toll you pay every time you change your position, with cost rate c. A naive RL agent trained with c = 0 learns to react to every wiggle — flip long, flip short, flip back — because each tiny correct call adds gross profit at no modeled cost. On paper it looks brilliant. Turn on a realistic c and the same policy bleeds out: it trades hundreds of times, and the cumulative toll swamps the edge. A patient, low-turnover policy that trades a tenth as often survives the toll and ends ahead. This is the bug in our widget below, and it is the single most common way real trading bots die.
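
A single step of that reward, worked through in code. The function name `stepReward` and the numbers are illustrative assumptions, chosen so the arithmetic is easy to follow rather than to be realistic.

```javascript
// One-step reward: captured move on the old position, minus the toll for changing it.
function stepReward(prevPos, pos, prevPrice, price, c) {
  const captured = prevPos * (price - prevPrice); // position_{t-1} * (price_t - price_{t-1})
  const toll = c * Math.abs(pos - prevPos);       // c * |position_t - position_{t-1}|
  return captured - toll;
}

// Long one unit while the price moves 100 -> 100.5, with cost rate c = 0.25.
const rewardHold = stepReward(1, 1, 100, 100.5, 0.25);  // keep the position
const rewardFlat = stepReward(1, 0, 100, 100.5, 0.25);  // close it
const rewardFlip = stepReward(1, -1, 100, 100.5, 0.25); // reverse to short
```

Holding keeps the full 0.5. Closing pays one unit of turnover and keeps 0.25. Reversing pays two units of turnover and keeps nothing, which is why a policy that flips on every wiggle needs each call to be worth more than 2c.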

Hazard 3 — backtest overfitting (the past is a liar)

This is the deepest trap, and it is pure distributional shift. Because you cannot rerun the live market to test a policy, you backtest: replay the policy over past data and read off its return. With enough parameters, indicators, and re-runs, it is trivial to discover a policy that aces the past and then loses money the moment it meets data it never saw.

Backtest overfitting is offline RL's distributional shift
This is exactly the failure that lesson 19 is about. A backtest is offline RL on a fixed dataset: you optimize a policy against a frozen log of the past and never get to call env.step() on the real, future market. The policy learns to exploit the specific noise of that log — the equivalent of querying Q at out-of-distribution actions that look amazing only because nothing ever checked them. Deployed forward, it is acting in a distribution the training data never covered, and the inflated "value" evaporates. Acing the backtest and failing live is the canonical distributional-shift trap, and it is why finance needs the offline-RL machinery of lessons 19–21.
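
One common guard, which this lesson does not itself prescribe, is walk-forward evaluation: fit on one window of history and score only on the bars that follow it. The sketch below just produces the index windows; `walkForwardSplits` and its parameters are hypothetical names.

```javascript
// Hypothetical helper: rolling train/test index windows over a time-ordered series.
// Each test window starts right where its training window ends, so a policy is
// always scored on bars that come later than anything it was fitted on.
function walkForwardSplits(n, trainLen, testLen) {
  const splits = [];
  for (let start = 0; start + trainLen + testLen <= n; start += testLen) {
    splits.push({
      train: [start, start + trainLen],                     // [from, to) indices
      test: [start + trainLen, start + trainLen + testLen], // [from, to) indices
    });
  }
  return splits;
}

// Usage: 10 bars, fit on 4, score on the next 2, then slide forward by 2.
const splits = walkForwardSplits(10, 4, 2);
```

This reduces the problem without removing it: re-running many policy variants against the same test windows overfits those windows too.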

Interactive · the transaction-cost cliff

Two agents trade the same synthetic price series. The churner reacts to every short-term wiggle (high turnover); the patient agent only acts on a strong, persistent trend (low turnover). Both ignore cost when they decide; the only thing that changes is the transaction-cost slider, which sets the cost rate deducted from realized PnL.

Start at zero cost: the churner looks like the better trader, its net PnL line climbing above the patient one. Now raise the cost. Watch the churner's net line bend down and cross into the red while the patient agent — trading far less — barely flinches. Same policy, same prices; the only thing that changed is the toll. Ignoring transaction cost is the bug, and the cost slider is the lesson.

Widget: two policies on one price series, with a draggable transaction cost.
Net PnL = captured price moves − (cost rate × turnover). At cost 0 the churner wins; raise the cost and its constant flipping bleeds it dry while the patient, low-turnover policy survives.
Readouts: net PnL and trade count for the churner and for the patient policy.
Core JS (excerpt)
// Both policies output a position in {-1,0,+1}; reward = PnL - cost*|Δposition|.
function runPolicy(prices, positions, cost) {
  let net = 0, trades = 0, prevPos = 0;
  const equity = [0];
  for (let t = 1; t < prices.length; t++) {
    const move = prevPos * (prices[t] - prices[t-1]);   // captured move
    const pos  = positions[t];
    const fee  = cost * Math.abs(pos - prevPos);         // toll on turning over
    if (pos !== prevPos) trades++;
    net += move - fee;
    equity.push(net);
    prevPos = pos;
  }
  return { net, trades, equity };
}
// The excerpt uses W and sma() without showing them; the two definitions below are assumptions.
const W = 20; // assumed slow-window length; the widget's actual value is not shown
// Assumed: simple moving average of the n prices ending at index t.
const sma = (xs, t, n) => { let s = 0; for (let i = t - n + 1; i <= t; i++) s += xs[i]; return s / n; };
// Churner: chase the 1-step change. Patient: act only on a strong slow trend.
const churn   = prices.map((p,t)=> t? Math.sign(p-prices[t-1]) : 0);
const patient = prices.map((p,t)=> t>W? Math.sign(sma(prices,t,3)-sma(prices,t,W)) : 0);

Notice what the widget is not doing: it is not learning. These are two fixed policies, and the point is that the question "which policy is better?" has no answer until you specify the cost. An RL agent trained with the cost left out of its reward will happily become the churner — it will discover the highest-gross policy, which is exactly the one the real market quietly bankrupts. The cost term is not a detail to add later; it changes the optimization target.
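
The point that the ranking has no answer until the cost is specified can be made concrete with a break-even cost rate. This is a sketch under assumptions: the helper names are hypothetical, and the two position paths are hand-picked for clean arithmetic, not produced by any trading rule.

```javascript
// Hypothetical helper: summarise a fixed position path as gross PnL and turnover.
function grossAndTurnover(prices, positions) {
  let gross = 0, turnover = 0, prevPos = 0;
  for (let t = 1; t < prices.length; t++) {
    gross += prevPos * (prices[t] - prices[t - 1]);  // captured move
    turnover += Math.abs(positions[t] - prevPos);    // size of the position change
    prevPos = positions[t];
  }
  return { gross, turnover };
}

// Cost rate at which two policies have equal net PnL (net = gross - c * turnover).
function breakEvenCost(a, b) {
  if (a.turnover === b.turnover) return null; // cost never changes the ranking
  return (a.gross - b.gross) / (a.turnover - b.turnover);
}

const toyPrices = [100, 101, 103, 102, 104, 105];
const churnerStats = grossAndTurnover(toyPrices, [0, 1, -1, 1, 1, 1]); // flips twice
const patientStats = grossAndTurnover(toyPrices, [0, 1, 1, 1, 1, 1]);  // enters once
const cStar = breakEvenCost(churnerStats, patientStats);
```

On this toy path the churner has gross 6 on turnover 5 and the patient policy has gross 4 on turnover 1, so the break-even rate is 0.5. Below it the churner nets more; above it the patient policy does.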

Map back to the spine — a POMDP with adversarial non-stationarity

Pull the domain back onto the value / policy / model map that organizes the whole course:

Trading reality | Spine concept | What it breaks
You see prices, not the market | Partial observability (POMDP), L05 | the Markov property of L01: the observation is not the state
The distribution drifts; others adapt to you | Non-stationary, adversarial environment | the fixed model P, R every solver assumed
Fees on every position change | Reward = PnL − cost (turnover penalty) | the naive value/policy objective: gross ≠ net optimal
Can't A/B the market; you replay history | Offline RL / distributional shift, L19 | online exploration; backtest overfit = OOD evaluation

So in spine terms, stock trading is a POMDP with adversarial non-stationarity, a turnover-penalized reward, and a forced offline learning regime. None of those is a new algorithm — they are the assumptions of lesson 01 being violated one at a time. That is precisely why the practical answer in finance leans on conservative, low-turnover, offline-trained policies rather than aggressive online learners: when you cannot trust the model, cannot explore for free, and cannot believe your backtest, humility is the edge. The next lesson takes the same lens to portfolio optimization, where the action becomes a continuous vector of allocations and the reward becomes risk-adjusted.

Takeaway
Trading drops neatly into the lesson-01 MDP — state, action (buy/hold/sell), reward (PnL) — and then violates its core assumptions. You never see the true state, so it is a POMDP (L05); the distribution drifts and adapts to you, so it is non-stationary and adversarial, voiding the fixed-model premise; transaction costs mean the highest-gross policy is often a net loser, as the widget shows when the cost slider sends the churner into the red while the patient policy survives; and because you can only replay history, backtest overfitting is the distributional-shift trap that motivates the offline RL of lesson 19. Mapped to the spine: a POMDP with adversarial non-stationarity — and the right response is to be conservative.