Financial trading (part 2): portfolio optimization
Lesson 59 traded one stock with a discrete buy/hold/sell button. Real money managers do something harder and more continuous: split a dollar across many assets at once. The action is no longer a button press — it is a vector of weights that must sum to one. That single change drops us straight back into continuous control, and the fact that the dollars are real drops us into offline RL.
- Continuous control (lesson 18): the action is a continuous vector $a \in \mathbb{R}^n$ (here, portfolio weights), so we cannot argmax over actions; an actor must emit the allocation. DDPG / TD3 / SAC are the workhorses.
- Offline & conservative RL (lessons 19–21): you cannot trade live to "explore." Learning happens on a fixed history, where an over-optimistic value estimate (the OOD blow-up of lesson 19) is not a graph that looks bad; it is money that is gone.
The action is a simplex, not a button
In lesson 59 the action set was tiny and discrete: {buy, hold, sell} on one instrument. Portfolio optimization asks the manager's real question: given $n$ assets, what fraction of the book belongs in each? The action is a weight vector

$$w = (w_1, \dots, w_n), \qquad w_i \ge 0, \qquad \sum_{i=1}^{n} w_i = 1,$$

where $w_i$ is the fraction of capital in asset $i$ (and a cash/risk-free slot is just one more entry). This is the probability simplex $\Delta^{n-1}$, a set with uncountably many points. There is no list of actions to sweep, so Q-learning's $\arg\max_a Q(s,a)$ (lesson 05) cannot be evaluated by enumeration here, for exactly the reason it could not for a robot arm in lesson 18.
The MDP extends lesson 59's: the state is the market context (recent returns, volatilities, holdings, cash); the action w is the new target allocation; the transition applies next period's returns and charges transaction costs to rebalance from the old weights; the reward is the period's gain — adjusted for risk, the crux of the lesson.
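The transition described above can be sketched in a few lines. This is a minimal illustration, not the lesson's environment: the function name `portfolio_step`, the proportional `cost_rate`, and the choice to charge costs on turnover before returns are applied are all assumptions made for the example. The reward here is the raw log-return for the period, before any risk adjustment.

```python
import math

def portfolio_step(w_old, w_new, asset_returns, cost_rate=0.001):
    """One illustrative portfolio transition (hypothetical helper).

    w_old, w_new  : weight lists on the simplex (each sums to 1)
    asset_returns : simple returns over the next period, one per asset
    cost_rate     : assumed proportional cost per unit of turnover
    """
    # Turnover: total fraction of the book traded to move from w_old to w_new.
    turnover = sum(abs(b - a) for a, b in zip(w_old, w_new))
    cost = cost_rate * turnover

    # Growth factor of the book under the new weights, net of trading costs.
    gross = sum(w * (1.0 + r) for w, r in zip(w_new, asset_returns))
    growth = (1.0 - cost) * gross
    reward = math.log(growth)  # raw log-return, no risk adjustment yet

    # Prices moved, so holdings drift away from the target weights.
    drifted = [w * (1.0 + r) for w, r in zip(w_new, asset_returns)]
    w_next = [d / gross for d in drifted]
    return w_next, reward
```

The drifted weights `w_next` become part of the next state, which is why holdings belong in the state alongside market features: the cost of the next rebalance depends on where the book currently sits.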
Why "maximize return" is a trap
The obvious reward is the portfolio's log-return over the period:

$$r_t = \log\!\left(1 + w_t^\top \rho_{t+1}\right),$$

where $w_t$ is the allocation chosen at time $t$ and $\rho_{t+1}$ is the vector of next-period asset returns. Plug this straight into any of our objectives, $J(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t\right]$, and the optimizer does something technically correct and financially insane: it discovers that expected return scales with leverage, so it borrows as much as it can and concentrates everything into whatever asset had the highest mean return in the data.
The fix: a risk-adjusted reward
The fix is a better reward, not a new algorithm: reward return per unit of risk. The canonical choice is the Sharpe ratio, the mean excess return divided by its standard deviation:

$$\text{Sharpe} = \frac{\mathbb{E}[r] - r_f}{\sigma} \qquad \text{or, in mean-variance form,} \qquad \mathbb{E}[r] - \lambda\,\sigma^2,$$

where $r_f$ is the risk-free rate and $\sigma$ is the standard deviation of the portfolio return. The mean-variance form on the right is the one we will use in the widget: expected return minus a risk-aversion coefficient $\lambda$ times portfolio variance. Now the objective is concave in position size: variance grows with the square of leverage, so past some point adding risk costs more than it earns, and the optimizer chooses a finite, diversified allocation on its own. Diversification falls out for free: spreading across imperfectly correlated assets can lower $\sigma^2$ without lowering the mean.
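Both scores are easy to compute from a series of realized period returns. The sketch below uses only the standard library; the function names and the two return series are made up for illustration, and the sample (n − 1) standard deviation is an arbitrary choice.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def mean_variance_score(returns, lam):
    """Mean return minus lam times the sample variance of returns."""
    return statistics.mean(returns) - lam * statistics.variance(returns)

# Two hypothetical return streams: one steady, one with a higher mean but wild swings.
steady = [0.010, 0.012, 0.008, 0.011, 0.009]
wild = [0.10, -0.08, 0.12, -0.09, 0.05]
```

On these made-up numbers `wild` has the higher mean return, yet `steady` wins on both `sharpe_ratio` and `mean_variance_score(..., lam=5.0)`, which is the ranking reversal a risk-adjusted reward is meant to produce.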
| Reward | Shape in position size | What the optimizer does |
|---|---|---|
| raw return 𝔼[r] | linear — no maximum | infinite leverage, one asset, eventual ruin |
| mean − λ·variance | concave — has a peak | finite, diversified allocation; λ sets where the peak is |
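The two rows of the table can be checked on the simplest possible case: one risky asset held at leverage `k`. This is a worked sketch under assumed numbers (`mu`, `sigma2`, and `lam` are illustrative, not taken from the lesson). Setting the derivative of `k*mu - lam*k**2*sigma2` to zero gives the peak at `k = mu / (2*lam*sigma2)`.

```python
def raw_objective(k, mu):
    """Expected return at leverage k: linear in k, so it has no maximum."""
    return k * mu

def mean_variance_objective(k, mu, sigma2, lam):
    """Return minus lam * variance; variance scales with k squared."""
    return k * mu - lam * (k ** 2) * sigma2

def optimal_leverage(mu, sigma2, lam):
    """Peak of the concave mean-variance objective (first-order condition)."""
    return mu / (2.0 * lam * sigma2)

# Hypothetical asset: 8% mean excess return, 20% volatility (variance 0.04).
MU, SIGMA2, LAM = 0.08, 0.04, 2.0
K_STAR = optimal_leverage(MU, SIGMA2, LAM)
```

With these numbers the raw objective keeps rising as `k` grows, while the mean-variance objective is lower on either side of `K_STAR` and turns negative at high leverage.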
Constraints: no shorting, position limits
Real mandates add hard constraints the action must respect: no shorting ($w_i \ge 0$), position limits ($w_i \le c$, no single name above, say, 20%), fully invested ($\sum_i w_i = 1$). A softmax actor handles the first and last for free (outputs are non-negative and sum to one); position caps are enforced by clipping and renormalizing the actor's output, repeated until no name exceeds the cap because renormalizing can push a clipped weight back over it. This is the same "project the action onto the feasible set" instinct as the safety-clip in lesson 58. These constraints are also a free defense against leverage: a long-only, fully-invested book simply cannot reach the ruinous region.
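A minimal sketch of the softmax-then-cap idea follows. The helper names are hypothetical, and the redistribution rule (pin over-limit names at the cap, rescale the rest proportionally, repeat) is one simple heuristic rather than an exact Euclidean projection onto the capped simplex. It assumes the input weights are non-negative and sum to one.

```python
import math

def softmax(logits):
    """Map raw actor outputs to non-negative weights that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cap_weights(weights, cap, tol=1e-12):
    """Enforce w_i <= cap while keeping the book fully invested."""
    n = len(weights)
    if n * cap < 1.0 - tol:
        raise ValueError("infeasible: n * cap must be at least 1")
    w = list(weights)
    capped = set()
    while True:
        over = [i for i in range(n) if i not in capped and w[i] > cap + tol]
        if not over:
            return w
        for i in over:  # pin every over-limit name at the cap
            w[i] = cap
            capped.add(i)
        free = [i for i in range(n) if i not in capped]
        if not free:
            return w
        remaining = 1.0 - cap * len(capped)  # mass left for uncapped names
        free_mass = sum(w[i] for i in free)
        for i in free:
            # Rescale proportionally; split equally if the free names hold nothing.
            w[i] = w[i] * remaining / free_mass if free_mass > 0 else remaining / len(free)
```

For example, `cap_weights([0.6, 0.3, 0.05, 0.05], cap=0.4)` needs two passes: after the first name is pinned at 0.4, rescaling lifts the second name above the cap, so it is pinned as well. Each pass pins at least one more name, so the loop ends after at most `n` passes.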
Why offline & conservative methods are not optional here
Every continuous-control method from lesson 18 — DDPG, TD3, SAC — assumes a live environment to roll out in and fill a replay buffer. A trading book has no such thing. You cannot "explore" by placing speculative trades with the client's money to see what happens; one exploratory rollout is real, irreversible loss. So portfolio learning is structurally the offline setting of lesson 19: a fixed dataset of historical market states, allocations, and realized returns, and you must produce a good policy without ever calling env.step().
And we know exactly how that breaks. The Bellman backup queries Q(s, w) at allocations w the historical data never contained — out-of-distribution actions — where the critic's estimate is unchecked extrapolation. With no live trade to ever feel the consequence, those phantom values feed back through the bootstrap and inflate. In driving (lesson 58) that OOD blow-up was a phantom-safe maneuver; here it is a phantom-profitable allocation.
The cures are the conservatism of lessons 20–21, and they read naturally as risk management:
| Fix family | Idea | For a portfolio |
|---|---|---|
| BCQ: policy constraint (lesson 20) | Only consider actions the data supports. | "Only hold allocations resembling ones that have actually been traded in markets like this." |
| CQL: value pessimism (lesson 21) | Push Q down on the policy's own (possibly OOD) actions. | "Refuse to believe a strategy is profitable until the history proves it": pessimism as prudence. |
Conservatism, which was a mathematical necessity for correctness in offline RL, is here a fiduciary necessity for survival. An offline finance agent that is not conservative does not merely learn slowly — it confidently bets on fantasies.
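Both ideas can be caricatured in a few lines. The sketch below is deliberately simplified and every name in it is hypothetical: the BCQ-style filter uses an L1 nearest-neighbour test against logged allocations as a stand-in for BCQ's learned generative model, and the CQL-style term is only the core "soft maximum of Q over candidate actions minus Q on the logged action" gap, without the weighting or action sampling a real implementation needs.

```python
import math

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def bcq_style_choice(candidates, q_values, logged_allocations, max_distance):
    """Pick the best-valued candidate among those close to some logged allocation."""
    supported = [
        (q, w)
        for w, q in zip(candidates, q_values)
        if min(l1_distance(w, d) for d in logged_allocations) <= max_distance
    ]
    if not supported:
        return None  # nothing in the candidate set is backed by data
    return max(supported, key=lambda pair: pair[0])[1]

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_style_penalty(q_candidates, q_logged):
    """Gap between the soft maximum of Q over candidates and Q on the logged action.

    Adding this gap to the critic's loss pushes Q down wherever it is
    high on allocations the data never contained.
    """
    return logsumexp(q_candidates) - q_logged
```

With a critic that assigns a suspiciously high value to an all-in allocation the history never held, the filter simply never considers it, while the penalty grows with exactly that over-valuation.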
Interactive · the risk-aversion knob
The widget below is a one-period mean-variance allocator over four assets plus cash. The actor chooses weights to maximize $\mathbb{E}[r] - \lambda\sigma^2$; the risk-aversion knob λ slides from greedy (λ ≈ 0: chase return) to risk-averse (large λ: smooth the ride). For each setting it computes the optimal long-only allocation, then simulates an equity curve by drawing correlated returns, and plots the resulting risk/return point against a sampled efficient frontier.
- λ = 0 is the bug-is-the-lesson setting. With no risk penalty the allocator pours everything into the single highest-mean asset. The equity curve has the steepest average slope and the worst drawdown — the volatility point flies off to the right, and a bad streak craters the curve. This is "maximize raw return" blowing up.
- Crank λ up and the allocation diversifies across assets and into cash; the equity curve smooths, the max drawdown shrinks, and the risk/return point slides back along the frontier — less expected return, far less volatility.
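Notice what the knob is really doing. Write the objective as $\mu^\top w - \lambda\, w^\top \Sigma w$, with $\mu$ the vector of mean returns and $\Sigma$ their covariance matrix; its gradient with respect to $w_i$ is $\mu_i - 2\lambda(\Sigma w)_i$. At $\lambda = 0$ the gradient is just $\mu_i$, so softmax sends all the mass to the top-mean asset: concentration and the steepest, most fragile equity curve. As $\lambda$ grows, the $-2\lambda(\Sigma w)_i$ term penalizes whatever you already hold a lot of, pushing mass toward uncorrelated assets and cash. The Sharpe ratio peaks at a moderate $\lambda$, not at zero: chasing raw return actively destroys risk-adjusted performance.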
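The knob's effect can be reproduced with plain gradient ascent on softmax logits. This is a sketch of the idea, not the widget's code: the means `MU`, the covariance `COV` (three risky assets with correlation 0.3, plus a zero-risk cash slot), the step size, and the iteration count are all assumed values chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def portfolio_variance(w, cov):
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))

def mean_variance_weights(mu, cov, lam, steps=5000, lr=1.0):
    """Gradient ascent on logits z for J(w) = mu.w - lam * w'Cov w, with w = softmax(z)."""
    n = len(mu)
    z = [0.0] * n  # start from equal weights
    for _ in range(steps):
        w = softmax(z)
        cov_w = [sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]
        # Gradient of J with respect to the weights: mu_i - 2 * lam * (Cov w)_i.
        g = [mu[i] - 2.0 * lam * cov_w[i] for i in range(n)]
        g_bar = sum(w[i] * g[i] for i in range(n))
        # Chain rule through softmax: dJ/dz_i = w_i * (g_i - weighted mean of g).
        z = [z[i] + lr * w[i] * (g[i] - g_bar) for i in range(n)]
    return softmax(z)

# Hypothetical inputs; the last slot is cash (zero mean, zero variance).
MU = [0.10, 0.07, 0.05, 0.0]
COV = [
    [0.0900, 0.0180, 0.0126, 0.0],
    [0.0180, 0.0400, 0.0084, 0.0],
    [0.0126, 0.0084, 0.0196, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
```

Calling `mean_variance_weights(MU, COV, lam=0.0)` drives nearly all the mass into the first asset, while `lam=5.0` spreads it across the risky assets and cash at a much lower `portfolio_variance`.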
Mapping back to the spine
Place each piece on the value / policy / model map from orientation:
| Portfolio piece | Core method | Place on the map |
|---|---|---|
| Allocation = weights on the simplex | Continuous control, DPG/DDPG (lesson 18) | policy: a deterministic actor $\mu_\phi(s)$ emits $w$ via softmax, with no argmax |
| Reward = mean − λ·variance (Sharpe) | Risk-aware reward design | shapes R so the objective is concave → bounded, diversified |
| No shorting / position limits | Action projection / clip | constrain the policy to the feasible simplex (lesson 58's clip instinct) |
| Can't trade live to explore | Offline RL (lesson 19) | value learning on a fixed history, no env.step() |
| OOD allocation looks profitable | BCQ / CQL conservatism (lessons 20–21) | pull value down on unsupported actions; pessimism = prudence |
Read together: portfolio optimization is continuous control with a risk-adjusted reward, learned offline and conservatively. Lesson 59 gave us trading as an MDP with discrete actions; replace the button with a simplex and you are back in lesson 18. Reward raw return and the actor levers into ruin; reward Sharpe and it diversifies. And because the dollars are real, you cannot explore — so the offline conservatism of lessons 19–21, which was about correctness elsewhere, is here about not losing the client's money.