Financial trading (Part 1): stock trading
A market looks like the perfect RL problem: a clean sequence of states, an obvious action set, and a reward — money — that nobody has to hand-design. That promise is exactly why finance has chewed up more RL projects than almost any other domain. This lesson is about why: trading is an MDP whose state you can never fully see, whose rules change while you play, and whose past is a liar.
- The MDP & the agent–environment loop (lesson 01): state $s$, action $a$, reward $r$, return $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$. Trading drops almost cleanly into this frame, and then violates the one assumption that makes it solvable.
- Exploration & partial observability (lesson 08): you never observe the true market state, only noisy projections of it. That makes trading a POMDP, and it makes exploration expensive in a way bandits warned us about: every exploratory trade costs real money.
- Foreshadows offline RL (lesson 19): you cannot rewind the market to A/B test, so most "training" is replay on historical data, which is the distributional-shift trap of offline RL wearing a suit.
Trading as an MDP — the seductive part
Recall the loop from orientation: observe a state, pick an action, receive a reward and a next state, repeat. A single-asset trading agent maps in directly:
| MDP element | Trading |
|---|---|
| state $s_t$ | recent prices, returns, volume, technical indicators, your current position |
| action $a_t$ | buy / hold / sell, or more generally a target position size in $[-1, +1]$ (short to long) |
| reward $r_t$ | mark-to-market PnL over the step, minus transaction cost when the position changes |
| return $G_t$ | discounted cumulative profit, the quantity you actually want to maximize |
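The mapping in the table can be sketched as a tiny environment. This is a minimal illustration, not a production simulator: the class name `SingleAssetEnv`, the `window` and `cost_rate` parameters, and the choice of "recent returns plus current position" as the observation are all assumptions made for the example.

```python
import numpy as np


class SingleAssetEnv:
    """Hypothetical single-asset environment (illustrative, not a Gym API)."""

    def __init__(self, prices, window=3, cost_rate=0.001):
        prices = np.asarray(prices, dtype=float)
        # returns[i] is the simple return earned over step i
        self.returns = np.diff(prices) / prices[:-1]
        self.window = window
        self.cost_rate = cost_rate
        self.reset()

    def _observe(self):
        # State s_t: the last `window` returns plus the current position
        recent = self.returns[self.t - self.window:self.t]
        return np.append(recent, self.position)

    def reset(self):
        self.t = self.window
        self.position = 0.0
        return self._observe()

    def step(self, action):
        # Action a_t: target position in [-1, +1] (short to long)
        target = float(np.clip(action, -1.0, 1.0))
        # Cost is charged only when the position changes
        cost = self.cost_rate * abs(target - self.position)
        # Mark-to-market PnL of the position held over this step
        pnl = target * self.returns[self.t]
        self.position = target
        self.t += 1
        done = self.t >= len(self.returns)
        return self._observe(), pnl - cost, done


def discounted_return(rewards, gamma=0.99):
    # G = sum_k gamma^k * r_k over the collected rewards
    return sum(gamma ** k * r for k, r in enumerate(rewards))


env = SingleAssetEnv([100, 101, 102, 101, 103, 104, 103, 105])
obs, done, rewards = env.reset(), False, []
while not done:
    obs, reward, done = env.step(+1.0)  # always-long policy
    rewards.append(reward)
print(discounted_return(rewards))
```

With the always-long policy the cost is paid once, on entry; every later step earns the raw market return because the position never changes.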
The reward is the part that makes practitioners' eyes light up. In most of this course we agonized over where the reward comes from — a verifier, a human-preference model (RLHF), an expert's behavior (inverse RL). Here the reward is just money: realized profit and loss, no labeling, no reward model, perfectly verifiable. That is genuinely a gift. The trouble is everything around the reward.
It is a POMDP, not an MDP
The Markov property (lesson 01) says the current state contains everything you need to predict the future. The "true state" of a market — every participant's intent, every pending order, the macro shock that hits in an hour — is enormous and hidden. What you observe is a thin, noisy projection: a price tape and some indicators.
That is the definition of a partially observable MDP (POMDP), the structure we first met when bandits stripped the state away in lesson 08. The standard patch is to fold history into the state (stack the last k bars, or carry a recurrent hidden state) so the agent acts on a belief about the hidden market state $z_t$ rather than on a single tick. It helps, but it never recovers the hidden state; the residual uncertainty is permanent, and it is why a price that "should" go up just as easily goes down.
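The "stack the last k bars" patch is easy to sketch. The helper below is hypothetical (the name `stack_history` and the flat concatenation layout are assumptions for illustration); it turns a per-bar feature array into observations that each contain the most recent k bars.

```python
import numpy as np


def stack_history(features, k):
    """Return one row per time step, each holding the last k bars flattened."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features[:, None]  # treat a 1-D series as one feature per bar
    n_bars = features.shape[0]
    if k > n_bars:
        raise ValueError("k cannot exceed the number of bars")
    # Row i covers bars i .. i+k-1, oldest first
    return np.stack([features[i:i + k].ravel() for i in range(n_bars - k + 1)])


bars = np.arange(10).reshape(5, 2)  # 5 bars, 2 features each (toy data)
obs = stack_history(bars, k=3)
print(obs.shape)
```

A recurrent network plays the same role with a learned summary instead of a fixed window. Either way the agent sees a longer slice of the observable tape, not the hidden state itself.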
Partial observability also poisons exploration. In a gridworld you explore for free — a wasted step costs you nothing real. Here every exploratory trade pays a spread and a fee and risks capital. The explore/exploit dilemma of lesson 08 is still the dilemma, but the "explore" arm now has a cash price tag, which pushes the whole field toward learning from logged data instead of live trial-and-error (hold that thought for offline RL).
The three hazards that make finance brutal
Plenty of domains are POMDPs and survive. Finance is singled out because three hazards stack on top of the partial observability, and each one breaks a different assumption the rest of this course leaned on.
Hazard 1 — non-stationarity (the rules change while you play)
Every method in this course assumed a fixed transition function P(s′|s,a) and reward R(s,a). Value iteration, Q-learning, policy gradients — all of them converge to a fixed target. Markets have no fixed target. The data-generating distribution drifts: regimes flip from low- to high-volatility, a strategy that printed money for a year stops working when enough others discover it, a central-bank decision rewrites correlations overnight.
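A toy illustration of a regime flip, under loudly stated assumptions: returns follow a synthetic AR(1) process, and the only thing that changes between regimes is the sign of the autocorrelation `phi`. The fixed "momentum" rule and all parameter values are hypothetical and chosen only to make the effect visible.

```python
import numpy as np


def ar1_returns(n, phi, sigma, rng):
    """Synthetic returns r_t = phi * r_{t-1} + noise (an assumed toy model)."""
    eps = rng.normal(0.0, sigma, size=n)
    r = np.zeros(n)
    for t in range(1, n):
        r[t] = phi * r[t - 1] + eps[t]
    return r


def momentum_pnl(r):
    # Fixed policy: hold the sign of the previous return; costs ignored here
    return float(np.sum(np.sign(r[:-1]) * r[1:]))


rng = np.random.default_rng(0)
trending = ar1_returns(20_000, phi=+0.3, sigma=0.01, rng=rng)   # regime A
reverting = ar1_returns(20_000, phi=-0.3, sigma=0.01, rng=rng)  # regime B

pnl_trending = momentum_pnl(trending)
pnl_reverting = momentum_pnl(reverting)
print(pnl_trending, pnl_reverting)
```

The same policy earns in the trending regime and loses in the mean-reverting one. Nothing about the policy changed; the transition dynamics did.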
Hazard 2 — transaction costs (the gross/net gap that eats you alive)
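Add a fee. It sounds like a footnote, a few basis points per trade, but it silently rewrites which policy is optimal. The reward is not raw PnL; it is PnL net of cost. Writing $\rho_t$ for the asset's return over step $t$ and $a_t \in [-1, +1]$ for the position held during it:

$$r_t = a_t\,\rho_t - c\,\lvert a_t - a_{t-1} \rvert$$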
The first term is the market move you captured; the second is the toll you pay every time you change your position, with cost rate c. A naive RL agent trained with c = 0 learns to react to every wiggle — flip long, flip short, flip back — because each tiny correct call adds gross profit at no modeled cost. On paper it looks brilliant. Turn on a realistic c and the same policy bleeds out: it trades hundreds of times, and the cumulative toll swamps the edge. A patient, low-turnover policy that trades a tenth as often survives the toll and ends ahead. This is the bug in our widget below, and it is the single most common way real trading bots die.
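A four-step worked example makes the gross/net gap concrete. It assumes the net reward is position times step return, minus `c` times the absolute change in position; the return series, the two position sequences, and the 10 bps cost rate are invented for illustration.

```python
import numpy as np


def gross_and_net(returns, positions, c):
    """Gross PnL and PnL net of a proportional cost on position changes."""
    returns = np.asarray(returns, dtype=float)
    positions = np.asarray(positions, dtype=float)
    gross = float(np.sum(positions * returns))
    # Turnover: total absolute position change, starting from flat (0)
    turnover = float(np.sum(np.abs(np.diff(positions, prepend=0.0))))
    return gross, gross - c * turnover


returns = [0.002, -0.001, 0.002, -0.001]
churner = [+1, -1, +1, -1]   # calls every wiggle correctly, flips each step
patient = [+1, +1, +1, +1]   # enters once and holds

c = 0.001  # 10 basis points per unit of position change (assumed)
churn_gross, churn_net = gross_and_net(returns, churner, c)
patient_gross, patient_net = gross_and_net(returns, patient, c)
print(churn_gross, churn_net, patient_gross, patient_net)
```

The churner's gross PnL is three times the patient agent's (about 0.006 versus 0.002), but its turnover is 7 units against 1. At this cost rate the churner nets roughly -0.001 while the patient agent nets roughly +0.001.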
Hazard 3 — backtest overfitting (the past is a liar)
This is the deepest trap, and it is pure distributional shift. Because you cannot rerun the live market, you backtest: replay the policy over past data and read off its return. With enough parameters, indicators, and re-runs, it is trivial to discover a policy that aces the past and then loses money the moment it meets data it never saw.
In offline-RL terms, a backtest trains and scores a policy on a fixed log of past transitions; you never get to call env.step() on the real, future market. The policy learns to exploit the specific noise of that log, the equivalent of querying Q at out-of-distribution actions that look amazing only because nothing ever checked them. Deployed forward, it is acting in a distribution the training data never covered, and the inflated "value" evaporates. Acing the backtest and failing live is the canonical distributional-shift trap, and it is why finance needs the offline-RL machinery of lessons 19–21.
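The selection effect behind backtest overfitting can be reproduced on pure noise. In the sketch below every "strategy" is a random sequence of long/short positions and the returns have no exploitable structure at all; the counts, the volatility, and the train/test split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_train, n_test = 1000, 250, 250

# Pure-noise returns: by construction, no strategy has a real edge
returns = rng.normal(0.0, 0.01, size=n_train + n_test)

# Each row is one candidate strategy: a random +1/-1 position per step
positions = rng.choice([-1.0, 1.0], size=(n_strategies, n_train + n_test))

pnl_train = positions[:, :n_train] @ returns[:n_train]   # the "backtest"
pnl_test = positions[:, n_train:] @ returns[n_train:]    # unseen data

best = int(np.argmax(pnl_train))  # pick the backtest winner
print("backtest PnL of winner:", pnl_train[best])
print("forward PnL of winner: ", pnl_test[best])
```

Picking the best of many candidates on the log guarantees an impressive backtest number even though no edge exists; the winner's forward PnL is just another draw from the noise.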
Interactive · the transaction-cost cliff
Two agents trade the same synthetic price series. The churner reacts to every short-term wiggle (high turnover); the patient agent only acts on a strong, persistent trend (low turnover). Both ignore cost when they decide; the only thing that changes is the transaction-cost slider, whose rate is deducted from realized PnL on every position change.
Start at zero cost: the churner looks like the better trader, its net PnL line climbing above the patient one. Now raise the cost. Watch the churner's net line bend down and cross into the red while the patient agent — trading far less — barely flinches. Same policy, same prices; the only thing that changed is the toll. Ignoring transaction cost is the bug, and the cost slider is the lesson.
Notice what the widget is not doing: it is not learning. These are two fixed policies, and the point is that the question "which policy is better?" has no answer until you specify the cost. An RL agent trained with the cost left out of its reward will happily become the churner — it will discover the highest-gross policy, which is exactly the one the real market quietly bankrupts. The cost term is not a detail to add later; it changes the optimization target.
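For readers without the interactive widget, the experiment can be approximated offline. This is a rough stand-in, not the widget's actual code: the AR(1) return model, the churner rule (follow the last return's sign), the patient rule (follow a 50-bar mean only when it clears a band), and every parameter value are assumptions.

```python
import numpy as np


def simulate_returns(n=20_000, phi=0.2, sigma=0.01, seed=0):
    """Synthetic, mildly autocorrelated returns (assumed toy model)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    r = np.zeros(n)
    for t in range(1, n):
        r[t] = phi * r[t - 1] + eps[t]
    return r


def churner_positions(r):
    # React to every wiggle: hold the sign of the previous return
    pos = np.zeros_like(r)
    pos[1:] = np.sign(r[:-1])
    return pos


def patient_positions(r, window=50, band=0.001):
    # Change position only when the mean of the last `window` returns is strong
    pos = np.zeros_like(r)
    csum = np.concatenate(([0.0], np.cumsum(r)))
    current = 0.0
    for t in range(window, len(r)):
        trend = (csum[t] - csum[t - window]) / window  # uses returns before t only
        if trend > band:
            current = 1.0
        elif trend < -band:
            current = -1.0
        pos[t] = current  # otherwise keep the previous position
    return pos


def turnover(pos):
    return float(np.sum(np.abs(np.diff(pos, prepend=0.0))))


def net_pnl(r, pos, c):
    # Neither policy sees c; it is only deducted from the realized PnL
    return float(np.sum(pos * r)) - c * turnover(pos)


r = simulate_returns()
churn, patient = churner_positions(r), patient_positions(r)
for c in (0.0, 0.001, 0.005):
    print(c, net_pnl(r, churn, c), net_pnl(r, patient, c))
```

Each net PnL line is a straight line in `c` whose slope is minus the policy's turnover, so the high-turnover churner falls much faster as the cost rises and the ranking of the two fixed policies flips.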
Map back to the spine — a POMDP with adversarial non-stationarity
Pull the domain back onto the value / policy / model map that organizes the whole course:
| Trading reality | Spine concept | What it breaks |
|---|---|---|
| You see prices, not the market | Partial observability (POMDP), L08 | the Markov property of L01: the observation is not the state |
| The distribution drifts; others adapt to you | Non-stationary, adversarial environment | the fixed model P, R every solver assumed |
| Fees on every position change | Reward = PnL − cost (turnover penalty) | the naive value/policy objective — gross ≠ net optimal |
| Can't A/B the market; you replay history | Offline RL / distributional shift, L19 | online exploration; backtest overfit = OOD evaluation |
So in spine terms, stock trading is a POMDP with adversarial non-stationarity, a turnover-penalized reward, and a forced offline learning regime. None of those is a new algorithm — they are the assumptions of lesson 01 being violated one at a time. That is precisely why the practical answer in finance leans on conservative, low-turnover, offline-trained policies rather than aggressive online learners: when you cannot trust the model, cannot explore for free, and cannot believe your backtest, humility is the edge. The next lesson takes the same lens to portfolio optimization, where the action becomes a continuous vector of allocations and the reward becomes risk-adjusted.