Reinforcement Learning, From First Principles

One linearized track from the Markov decision process to the systems that train today's reasoning models and the domains that deploy them. Every lesson assumes only the ones before it; every concept is justified from scratch; every trade-off comes with quantitative reasoning.

The series runs in four parts: the foundations (the theory and algorithms), the post-training systems (the engineering that turns those algorithms into reasoning-model training at cluster scale), and two parts on applications — first the domains as concepts mapped back to the core method, then twenty applied domains as production engineering. Read it straight through, or start at the orientation for the map.

Who this is for

You know basic probability and can read pseudocode. By the end you can derive the policy gradient and the TRPO/PPO trust region; explain why GRPO drops the critic and what bias that introduces; lay out the six roles of an RL post-training system and where its wall-clock goes; and formulate a real-world problem — a recommender, a robot, a trading desk, a power grid — as an MDP and name the one difficulty that binds it.

Where this connects

The LLM-era algorithms here are the same ones the gpt_mini RLVR lesson trains at toy scale and the on-policy distillation lesson borrows. The serving mechanics behind Part II (KV cache, paged attention, scheduling) are derived in the vLLM and SGLang tracks, and the cluster design in the ML systems design track.

Part I · Foundations (lessons 01–21)

The theory, from the Markov decision process to offline RL. The environment contract and reward design; the value branch (Q-learning → DQN) and the policy branch (policy gradient → Actor–Critic); planning and exploration; the policy-gradient lineage up to TRPO/PPO; the LLM-era algorithms (PPO, DPO, GRPO, RLHF) as a map; and the frontier — imitation/inverse RL, continuous control, and offline RL (BCQ, CQL).

RL overview — the MDP & the agent–environment loop

One scalar of feedback, a loop that runs forever, and the single recursive equation that ties a future you can't see to a number you can compute. This lesson defines the symbols the other 31 reuse.

Anatomy of an environment — spaces, the step/reset contract, and episodes

Lesson 01 wrote the agent–environment loop and gave the environment two jobs: return a reward, return the next state. This lesson opens the environment up. It is not a vague "world" — it is a small, precise object with an interface, two type signatures, and an episode structure. Get that object right and every algorithm in the course has something correct to talk to; get it wrong and they all learn nonsense from a clean-looking bug.
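
As a rough illustration of the interface described above, the sketch below builds a toy corridor environment with a reset/step pair and an episode loop. The class name `CorridorEnv`, the reward values, and the time limit are hypothetical. The `(observation, reward, terminated, truncated, info)` return shape is borrowed from the Gymnasium convention as an assumption, not taken from the lesson itself.

```python
from typing import Tuple


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length: int = 5, max_steps: int = 20) -> None:
        self.length = length        # the goal sits at position length - 1
        self.max_steps = max_steps  # time limit that truncates an episode
        self.pos = 0
        self.t = 0

    def reset(self) -> Tuple[int, dict]:
        """Start a new episode and return the first observation."""
        self.pos = 0
        self.t = 0
        return self.pos, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, dict]:
        """Apply an action (0 = left, 1 = right) and advance one time step."""
        if action not in (0, 1):
            raise ValueError("action must be 0 or 1")
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.t += 1
        terminated = self.pos == self.length - 1                   # reached the goal
        truncated = (not terminated) and self.t >= self.max_steps  # ran out of time
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


def run_episode(env: CorridorEnv) -> float:
    """Roll out one episode with an always-right policy and return the total reward."""
    _, _ = env.reset()
    total = 0.0
    done = False
    while not done:
        _, reward, terminated, truncated, _ = env.step(1)
        total += reward
        done = terminated or truncated
    return total
```

Keeping `terminated` (the task ended) separate from `truncated` (the clock ran out) is one way to avoid the kind of clean-looking bug the paragraph warns about: a time limit is not a true terminal state, so value targets should still bootstrap past it.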

What the agent sees — observation, the Markov line, and POMDPs

Lesson 01a's step() returned an observation, and slipped in a warning: it may not be the state. This lesson is that warning, in full. When the observation carries everything the future depends on, you have the clean MDP of lesson 01 and every algorithm works. When it doesn't, two situations that demand opposite actions can look identical — and no reflex, however well-trained, can be right in both. That failure has a name, a formal model, and three standard fixes.

The reward and the simulation boundary — shaping, hacking, and sim-to-real

The reward is the last piece of the environment object — and the most dangerous. It is not "the score"; it is your entire specification of what you want, compressed into one scalar per step. The agent will optimize exactly what you wrote, not what you meant. This lesson is the engineering of that scalar: dense vs sparse, the one provably-safe way to add hints, the way clever shaping turns into self-sabotage, and the gap between the simulator you train in and the world you deploy to.

Value-based — from Q-learning to DQN

Lesson 01 left us with the Bellman expectation equation — which needs a policy plugged in. One small change, swapping the average for a max, removes the policy entirely and lets us learn the optimal value Q* directly. That change is the whole of value-based RL.
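
A minimal sketch of that swap, assuming a tabular setting: the target uses a max over next actions instead of an average under a policy. The function name `q_learning_update`, the step size `alpha`, and the dictionary-based table are illustrative choices, not the lesson's code.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_learning_update(
    q: QTable,
    s: Hashable,
    a: Hashable,
    r: float,
    s_next: Hashable,
    actions: Sequence[Hashable],
    done: bool,
    alpha: float = 0.1,
    gamma: float = 0.99,
) -> float:
    """Apply one tabular Q-learning step in place and return the TD error."""
    # The max over next actions replaces the policy-weighted average.
    best_next = 0.0 if done else max(q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    td_error = td_target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


def new_q_table() -> QTable:
    """Create a table whose unseen entries default to 0.0."""
    return defaultdict(float)
```

No policy appears anywhere in the update, which is why the data can come from any behavior that keeps visiting the relevant state-action pairs.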

Policy-based — from policy gradient to Actor–Critic

Lesson 02 learned a value and read the policy off it with an argmax. That argmax is a dead end for continuous and huge action spaces, and it can only ever produce a deterministic policy. So stop going through a value: parameterize the policy π_θ and push on it directly.

Model & planning — from dynamic programming to MCTS

Lessons 02 and 03 learn by sampling — bootstrapping from transitions, averaging over rollouts — because the model is unknown. But if you know the model P(s'|s,a) and R(s,a) (a board game's rules, a simulator, or a model you've learned), you don't have to wait for experience. You can plan.

Exploration vs exploitation — from multi-armed bandits to Thompson sampling

Lessons 02–04 all assumed the experience just appeared: a stream of (s, a, r, s') to learn from. But the agent chooses its own actions, so it chooses its own data. Pick the action you currently think is best and you may never discover a better one. Strip the MDP down to its smallest non-trivial form — one state, no transitions — and that tension becomes the entire problem.
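
The one-state problem can be sketched as a Bernoulli bandit solved with Thompson sampling, the method named in the lesson title. This is a sketch under stated assumptions: a uniform Beta(1, 1) prior per arm, a hypothetical function name `thompson_bernoulli`, and made-up arm probabilities in the usage note.

```python
import random
from typing import List


def thompson_bernoulli(true_probs: List[float], steps: int, seed: int = 0) -> List[int]:
    """Run Beta-Bernoulli Thompson sampling and return how often each arm was pulled."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins = [1] * n    # Beta alpha parameter, starting from a uniform prior
    losses = [1] * n  # Beta beta parameter, starting from a uniform prior
    pulls = [0] * n
    for _ in range(steps):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        # Act greedily with respect to the sampled rates, not the posterior means.
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling `thompson_bernoulli([0.2, 0.8], steps=2000)` should concentrate most pulls on the second arm: uncertain arms still get sampled occasionally because their posteriors are wide, and that is the exploration.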

Deep RL — from DQN to A3C

Both prongs of the fork get a neural network. Then they get an army of them: parallel actors that make on-policy learning stable without a replay buffer.

Policy gradient, rigorously

Lesson 03 stated the policy-gradient theorem ∇J = 𝔼[G · ∇log π] and lesson 06 leaned on it to run A3C across many actors. We took it on faith. Now we prove it from the objective up, dissect every piece, and pin down the one thing that controls whether any of it works in practice: the variance of the estimator.

Advantage functions — Actor-Critic, GAE, and the road to TRPO

Lesson 07's single best variance tool was the baseline, which turned the policy gradient into an advantage-weighted update: A = Q − V. But there is no free lunch — estimating A itself trades bias against variance, and one knob, λ, dials between the two extremes.
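
The λ knob can be sketched as generalized advantage estimation over one trajectory. The function name `gae` and the convention that `values` carries one extra bootstrap entry are assumptions made for this example.

```python
from typing import List, Sequence


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute GAE advantages for one trajectory.

    `values` must have len(rewards) + 1 entries; the last one is the bootstrap
    value of the state after the final step (0.0 if that state is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: low variance, biased by errors in the value estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors, with decay gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0.0` the result is the one-step TD error (the low-variance, high-bias end); with `lam=1.0` it is the Monte Carlo return minus the value estimate (the high-variance, low-bias end).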

Importance sampling — on-policy vs off-policy

A3C and policy gradient throw their data away every step. Here is the one identity that lets you reuse samples drawn from an old policy — and the variance bomb hidden inside it that will shape the next two lessons.

TRPO — natural gradient, the KL trust region → PPO

Lesson 09 said "keep π_θ close to π_old" so the importance ratio doesn't explode. TRPO stops asking nicely: it makes closeness a hard constraint and proves that, inside that region, improving the surrogate also improves the real policy. This is the theoretical peak of the policy-gradient lineage, and the thing PPO will cheapen into a clip.

The LLM era (Part 1) — PPO, DPO, GRPO: the map

Lesson 10 ended at PPO's clipped surrogate. Here is the surprise that organizes the next three lessons: training a large language model with RL is PPO, unchanged. The "actions" are tokens, the "policy" is the LLM, and a full generated response is a trajectory. Once you see the mapping, PPO, DPO, and GRPO are just three answers to three different complaints about the same loop.

The LLM era (Part 2) — DPO & GRPO, derived

Lesson 11 drew the map: PPO is the cheap TRPO, DPO skips the RL loop, GRPO drops the critic. A map is not a derivation. Here we earn both of the new methods from one equation each — the KL-regularized objective for DPO, the policy-gradient baseline theorem for GRPO — so neither feels like a trick you have to memorize.

RLHF — the post-training workflow

Lessons 11–12 gave you the algorithms — PPO, DPO, GRPO. RLHF is the end-to-end recipe that wires them together, and its job is to answer one question: where does the reward come from for an open-ended task that no verifier can score?

Imitation learning & inverse RL

Every method so far — value-based, policy-based, RLHF — assumed a reward signal exists: a scalar the environment hands back, or a model that produces one. But what if all you have is a stack of expert demonstrations and no reward function at all? This lesson covers the two answers: copy the actions (imitation), or recover the reward behind them (inverse RL).

Frontier — discrete to continuous control

A robot's action is a torque vector in ℝ^d. Taking argmax_a over infinitely many actions is intractable, so we make the actor output the action and let the critic's gradient teach it. DDPG → TD3 → SAC.

Frontier — offline (batch) RL

Every method so far needed a live environment to try things in. But you cannot let a learning agent experiment on a patient, a self-driving car in traffic, or a trading book. Offline RL learns a policy from a fixed, pre-collected dataset — no new interaction at all — and that single restriction breaks the Bellman backup in a subtle, compounding way.

Offline RL — BCQ (Batch-Constrained Q-learning)

Lesson 16 named the disease: the Bellman backup queries Q(s', a') at actions the dataset never showed, those out-of-distribution estimates blow up, and there is no fresh interaction to correct them. BCQ's cure is almost embarrassingly direct — never consider an action the data doesn't support.

Offline RL — Conservative Q-Learning

BCQ (lesson 17) fixed the disease from the actor side — it forbade the policy from naming out-of-distribution actions. CQL attacks the same disease from the critic side: push Q down on OOD actions so the policy is never tempted by them in the first place.

Part II · Post-Training Systems (lessons 22–53)

The engineering behind modern reasoning-model training, building bottom-up. The system roles (rollout, reward/reference, trainer, weight sync, controller, agentic); the algorithm family REINFORCE → PPO → GRPO → RLOO → DAPO → Dr.GRPO; RLHF/DPO recipes, PRM/search, environments and data pipelines; cluster topology, the KV cache, scheduling and throughput math; famous recipes, the infra-engineer role, and bottleneck diagnosis.

What is post-training RL?

The minimum viable loop, and the one equation that makes it work.

Rollout — sampling and π_old

Autoregressive sampling, why we sample K per prompt, and why old_logp has to be captured at sampling time — not recomputed later.

Reward & Reference

Where the reward signal comes from, and why we need a frozen anchor or the policy will reward-hack itself off a cliff.

Algorithm — the plugin interface

The only piece of the system that's actually different across PPO, GRPO, RLOO, DAPO, and Dr.GRPO. Everything else is identical. This lesson defines the interface; Part II walks each algorithm in depth.

Trainer — where gradients flow

The smallest possible role, with the most responsibility per line of code.

Weight sync — closing the loop

The single most important wire in a production RL framework — and the most common source of "training looks fine but isn't learning" bugs.

Controller — the orchestrator

The only file that sees every role. The whole training loop fits in seven steps.

Agentic RL — multi-turn + tool masking

When the trajectory contains tokens the model didn't generate, the algorithm doesn't need to change — but the mask does.
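
One way to picture the mask is a per-token loss average that counts only the tokens the model generated. The function name `masked_mean` and the 0/1 mask encoding are assumptions for illustration.

```python
from typing import Sequence


def masked_mean(per_token_loss: Sequence[float], generated_mask: Sequence[int]) -> float:
    """Average the loss over model-generated tokens only.

    A mask value of 1 marks a token the policy sampled; 0 marks a token that
    was injected into the trajectory, such as tool output or a user turn.
    """
    if len(per_token_loss) != len(generated_mask):
        raise ValueError("loss and mask must have the same length")
    total = sum(loss * m for loss, m in zip(per_token_loss, generated_mask))
    count = sum(generated_mask)
    # With no generated tokens there is nothing to learn from in this sequence.
    return total / count if count else 0.0
```

Injected tokens stay in the context, so they still condition later predictions, but they contribute no gradient of their own because the policy never chose them.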

REINFORCE — the starting point (Williams 1992)

Everything from here on is a variance-reduction patch on one line of code. Before you can read the patches, you have to read the line.

PPO — the classical RLHF workhorse (Schulman 2017)

Three patches on REINFORCE: a learned baseline, off-policy correction, and a trust region. Powered InstructGPT, the original ChatGPT, early Claude — and is still the reference algorithm every reasoning-RL paper compares to.
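
The off-policy correction and the trust region meet in PPO's clipped surrogate, sketched here for a single token. The function name `ppo_clip_objective` and the default `eps=0.2` are illustrative; the advantage is taken as given rather than produced by a learned baseline.

```python
import math


def ppo_clip_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    eps: float = 0.2,
) -> float:
    """Per-token PPO clipped surrogate; this quantity is maximized."""
    # Importance ratio between the current policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - eps, 1 + eps].
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum makes the objective a pessimistic bound on the unclipped one.
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage the objective stops growing once the ratio passes `1 + eps`, so there is no incentive to move further from the sampling policy. For a negative advantage the unclipped term is kept, so a move in the wrong direction is still penalized in full.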

GRPO — drop the critic (DeepSeek-Math 2024 / R1 2025)

For verifiable-reward tasks, the value head can be replaced by a one-line statistic computed from the group itself. That's the entire algorithmic novelty.
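
A minimal sketch of that group statistic, assuming scalar rewards for the K rollouts of one prompt. The function name `grpo_advantages`, the small `eps` guard, and the use of the population standard deviation are assumptions; implementations differ on the last point.

```python
import statistics
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for the K rollouts sampled from one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every rollout received the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is why prompts that are uniformly solved or uniformly failed are wasted rollouts.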

RLOO — leave-one-out, strictly unbiased (Ahmadian et al. 2024, Cohere)

GRPO's group mean uses r_i in its own baseline. RLOO uses the mean of the other K−1 rollouts. Same cost, strictly unbiased, and it lets you drop the clip and the std divisor too.
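
The leave-one-out baseline can be sketched in a few lines. The function name `rloo_advantages` is hypothetical, and rewards are assumed to be one scalar per rollout of the same prompt.

```python
from typing import List, Sequence


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantages: each rollout is compared with the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of the other k - 1 rollouts,
    # so the baseline for rollout i never depends on rollout i itself.
    return [r - (total - r) / (k - 1) for r in rewards]
```

Reusing the running total keeps the cost at one pass over the group, the same as computing a plain group mean.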

DAPO — four practical fixes on GRPO (ByteDance-Seed 2025)

Open-source SOTA for verifiable-reward reasoning. Take GRPO. Add four targeted patches addressing four observed failure modes at scale. ~5 lines of code each.

Dr.GRPO — "GRPO Done Right" (Liu et al. 2025)

Two divisors in GRPO are doing nothing for variance reduction and silently biasing the gradient. Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator that the math actually asks for.

RLHF — the original recipe (Christiano '17 · InstructGPT '22)

Before verifiable rewards there were human rewards. Preferences become a scalar reward model; the reward model trains a policy with PPO. Three stages, one ladder. Every modern preference algorithm is a shortcut around at least one rung.

DPO — RL without RL (Rafailov '23)

RLHF compresses preferences into a reward model, then optimizes the policy against it. DPO observes that the optimal policy of that RL problem has a closed form in the reward — so you can skip the reward model and directly fit the policy from preferences with one supervised-style loss.
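
That supervised-style loss can be sketched for a single preference pair, assuming the four sequence log-probabilities have already been summed over tokens. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(
    pi_logp_chosen: float,
    pi_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # Implicit reward of a response: beta times its log-ratio against the frozen reference.
    chosen_reward = beta * (pi_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (pi_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's log-ratio relative to the rejected one.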

Outcome rewards, process rewards, and search

A verifier tells you the answer is wrong. A process reward model tells you where it went wrong. The first is enough to train; the second is what makes test-time search and trajectory-level credit assignment possible — and is the conceptual core of OpenAI's o-series and DeepSeek-R1 distillations.

Environments & verifiers — where reward really comes from

A reward function is the most consequential design decision in any RL system. Pick a leaky verifier and the policy will find the leak before it solves the task. This lesson is a tour of the verifier landscape: math, code, web, tool-use, multi-turn — and the failure modes each invites.

Data pipelines & curation — engineering the gradient signal

Lesson 18 told you where the reward number comes from. This lesson asks the question one step upstream: where do the prompts come from, and which ones are worth spending rollouts on? The answer reframes data curation from "good examples to imitate" into something more precise — gradient-signal engineering.

System topology — how a real RL cluster is wired

Rollout and training are two workloads with opposite hardware profiles. Rollout is memory-bandwidth-bound and cares about KV cache; training is compute-bound and cares about overlap. Where you put them on the cluster — colocated, disaggregated, or fully async — is the single biggest systems decision in RL post-training.

The KV cache and PagedAttention — why decode is its own problem

Inside the rollout engine sits a single data structure that decides almost everything about throughput: the KV cache. This lesson opens it up. Why generation is memory-bandwidth-bound, what PagedAttention is and why it changed the field, and how the cache's geometry decides what batch sizes you can run.

Scheduling tricks — continuous batching, prefix caching, chunked prefill, spec decode

Lesson 20 gave us a paged KV cache. This lesson is everything else the rollout engine does with that storage: how it admits and evicts sequences, how it shares prompts across K-rollout RL, how it interleaves prefill with decode, and how it asks a small model to do part of the work. Four mechanisms; each one independently a 1.5–3× throughput win.

Memory & throughput math — what fits where

Before you turn on training, you should be able to write down on the back of an envelope what the model's memory cost is, what sharding will bring it inside an H100, and what your steady-state tokens/second ceiling looks like. This lesson is that envelope.
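
As a rough version of that envelope, the sketch below counts only model and optimizer state. It assumes the common mixed-precision Adam rule of thumb of 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights and two Adam moments) and an 80 GB H100. Activations, KV cache, and fragmentation are ignored, and the function names are hypothetical.

```python
import math


def training_state_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_gpus_for_state(
    n_params: float,
    gpu_mem_gb: float = 80.0,
    bytes_per_param: float = 16.0,
    usable_fraction: float = 1.0,
) -> int:
    """Smallest number of GPUs whose combined memory holds the fully sharded state."""
    # usable_fraction leaves headroom for everything this estimate ignores.
    budget_per_gpu = gpu_mem_gb * usable_fraction
    return math.ceil(training_state_gb(n_params, bytes_per_param) / budget_per_gpu)
```

Under these assumptions a 7B-parameter model carries about 112 GB of training state, so it cannot fit on one 80 GB device without sharding. Setting `usable_fraction=0.5` to leave room for activations raises the minimum from two GPUs to three.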

The throughput equation — what you're actually optimizing

Lessons 19–22 give you topology, KV cache, scheduling, and memory math as a toolkit. This lesson is the assembly: write down the single equation those tools all push on, identify which term dominates under your setup, and the rest of system-level RL optimization writes itself as "lower the largest term until something else becomes the largest."

Long-tail rollouts — the max-of-K problem, packing, and dynamic K

The throughput equation (lesson 22a) said rollout dominates at 60–80% of wall-clock. This lesson explains why most of that isn't raw decode — it's the tail. A small fraction of trajectories generate most of the wall-clock, and three patches — sequence packing, length capping, dynamic K — collectively cut τ_R by 2–4× without changing a single FLOP.

Async pipelining & weight sync — engineering for overlap

The throughput equation (22a) said disaggregated topology leaves the trainer idle while rollout runs. Lesson 22b cut rollout dominance from ~75% to ~25%. The next wall is wired into the topology itself: every step still serializes on a weight broadcast. This lesson turns that serial dependency into a pipeline — with an off-policy bill the rest of the lesson explains how to pay.

Famous recipes & failure modes — putting it all together

Six post-training recipes that defined the modern reasoning era — DeepSeek-R1-Zero, R1, Tülu 3, Qwen-2.5, InstructGPT, and the o-series sketch — described in the language of the previous twenty-two lessons. Then a taxonomy of the ways an RL run fails and the dashboards that catch each one.

RL infrastructure engineer — what the role actually is

A modern post-training RL loop is the most expensive, most fragile, most under-tooled software you'll ever own. The engineer who builds it sits between research, systems, and ops, and is the only person who knows why a step is slow today. This lesson lays out the role from first principles.

Kernels for RL — what an RL infra engineer actually writes

"New kernels for rollout and training" is shorthand for six distinct kernel surfaces, each with its own bottleneck physics. This lesson maps every surface, derives why it matters, and names the kernels that pay back for the engineer who writes them.

Why RL is shaping this way today — the synthesis

You met the three forces briefly in lesson 00's map. Now that you've walked the 24 lessons in between, this is the unhurried version: each force in detail, the historical timeline that produced today's field, and the live simulator that lets you recover named recipes from extreme slider settings.

Missing concepts — what the curriculum doesn't yet cover

Lessons 01–25 explain the spine of modern post-training RL. They name several concepts in passing and leave them for later. This lesson is "later" — a catalog of the concepts that get mentioned, what each one means, why it matters, and how the curriculum could expand to teach it.

Throughput bottlenecks, the usual optimizations, and how to identify them

An RL training step is a pipeline of five distinct workloads. Each has different hardware physics, different memory pressure, and different failure modes. This lesson is the systems-level answer: where the wall-clock goes, which knob moves it, and how you'd diagnose any of it on a real cluster.

Part III · Applications — concepts (lessons 54–66)

Each application mapped back to the core method it reuses: recommendation, robotics, finance, resource scheduling, NLP, and vision — then the platforms and tools, and the frontier outlook.

Recommendation systems (Part 1) — personalized recommendation

The applications part begins. Each lesson here takes one domain and shows it is a method you already know wearing a costume. Today: a recommender is the lesson 05 bandit with a context vector bolted on — and the cold-start problem is nothing but the explore/exploit dilemma you already met.

Recommendation & advertising (Part 2)

Lesson 19 recommended one item to maximize one click. Real systems serve a slate, across a session, against a budget — and they cannot A/B test every idea. This is where the bandit grows up into a full sequential MDP.

Robot control (Part 1) — manipulators

A robot arm reaching for a cup is lesson 15 embodied. The state is joint angles and velocities, the action is a torque vector in ℝ^d, and there is no argmax_a to take. The new lesson is not the algorithm but the reward: how you write it decides whether the arm learns to reach, refuses to learn, or learns the wrong thing entirely.

Robot control (Part 2) — sim-to-real & sample efficiency

Lesson 21 trained a continuous-control policy in a simulator and ended on a cliffhanger: it works in sim, then fails on the real arm. This lesson is about closing that gap — making a simulator-trained policy survive contact with reality, and squeezing the most out of the few real samples you can afford.

Robot control (Part 3) — autonomous driving

A self-driving car is a robot whose mistakes are measured in human lives. That single fact rules out the one thing every online RL method takes for granted — the freedom to try a bad action and see what happens. This lesson is about how the autonomous-driving stack is built almost entirely out of the methods we introduced precisely because online trial-and-error is forbidden.

Financial trading (Part 1) — stock trading

A market looks like the perfect RL problem: a clean sequence of states, an obvious action set, and a reward — money — that nobody has to hand-design. That promise is exactly why finance has chewed up more RL projects than almost any other domain. This lesson is about why: trading is an MDP whose state you can never fully see, whose rules change while you play, and whose past is a liar.

Financial trading (Part 2) — portfolio optimization

Lesson 24 traded one stock with a discrete buy/hold/sell button. Real money managers do something harder and more continuous: split a dollar across many assets at once. The action is no longer a button press — it is a vector of weights that must sum to one. That single change drops us straight back into continuous control, and the fact that the dollars are real drops us into offline RL.

Resource scheduling — cloud computing to logistics

Which job goes on which machine next? Which stop does the truck visit next? These are sequential decisions over a combinatorial action space, and they are everywhere — cluster schedulers, bin-packers, delivery routers. This lesson is about why the value-based argmax we leaned on in the foundations breaks here, and why policy gradient with a structured policy is the move that makes RL practical for combinatorial optimization.

NLP (Part 1) — machine translation

A translator emits one token at a time, and is graded only at the end. That is a sequential token-action MDP with a terminal, non-differentiable reward — which is exactly the shape that the lesson 07 policy gradient was built for. Today we use it to fix a bug that the usual training objective, cross-entropy, bakes in: exposure bias.

NLP (Part 2) — dialogue systems

Lesson 27 made single-turn generation an RL problem: a sentence is a trajectory of token-actions scored by one terminal reward. Now stretch it across a conversation. A modern chat assistant is multi-turn RL whose reward is human preference — which means the assistant you talk to every day is the lesson 13 RLHF pipeline, scaled. This is the lesson where the theory course meets its sibling systems course.

Computer vision — detection to image generation

Vision is the home turf of backpropagation: convolutions, pixels, gradients flowing everywhere. So why is RL here at all? Because two of vision's most useful moves are not differentiable. Choosing where to look on an image is a discrete decision; scoring a generated image as "beautiful" or "faithful to the prompt" is a verdict you cannot differentiate through. Both gaps have the same shape, and we have already built the tool for that shape twice.

Platforms & tools — from OpenAI Gym to Ray

Every equation in this course turns into a loop of two function calls. This lesson is the engineering substrate that makes that loop runnable, reproducible, and — when you have a thousand machines — scalable.

Frontiers & outlook — from AGI to human-AI collaboration

Thirty lessons built one machine: an MDP, two ways to solve it, and a long campaign to scale, stabilize, and source the reward. This lesson points that machine at the open frontier — reasoning models, agents, world models, alignment — and asks which of the unsolved problems each of the earlier lessons actually bears on.

Part IV · Applications — engineering (lessons 67–86)

Twenty applied domains, each run through one loop — formulate the MDP, diagnose the binding difficulty, engineer the mechanism that removes it, guard it in production. Games and autonomous driving, trading and recommendation, robotics and UAVs, energy grids and networks, manufacturing and scheduling, medicine and epidemics.

Game-AI agent training

The first domain, and the one that built modern deep RL (Atari, AlphaGo, OpenAI Five, AlphaStar). We use it to install the method the whole track runs on: write the MDP, name the one thing that makes this MDP hard, reach for the mechanism that removes exactly that difficulty — and not before. For a MOBA bot the binding difficulties are a hybrid action space, partial observability, a sparse, hackable reward, a non-stationary self-play opponent, and a 16 ms frame budget. Each one names a tool.

Autonomous-driving path planning

A driving policy that turns a steering wheel and a pedal is a continuous-control MDP — but one where physics, the law, and a safety case all sit inside the loop. The binding difficulties of this domain are different from a game's: the action space is dimensionally inconsistent and physically constrained (a steering angle and a throttle are not the same unit, and the tires can only deliver so much grip); the reward must respect hard safety constraints you may never violate, not just optimize; the demonstrations from human drivers must be fused with exploration without forgetting; the observation is a high-dimensional, partially-observable sensor stream; and almost all training happens in simulation that does not match the real car. Each one names a mechanism.

Quant-trading order execution

You are handed a parent order — "buy 200k shares before the close" — and an RL agent that decides, microsecond by microsecond, what to post and where. The domain looks like a clean MDP until you write it down: the state is a level-2 order book that leaks the future if you encode it carelessly, the action is a hybrid price-and-size decision constrained by a 500 µs exchange round-trip, the reward conflates your own market impact with the market's noise, you can never run the policy live to evaluate it so everything rides on off-policy estimation under a non-stationary market, and every action is gated by regulation that can change at 9am. Each binding difficulty names a tool.

Recommender-system cold start

A brand-new user opens the app. You have no click history, roughly ten interactions before they decide whether to come back, and a real-time budget of tens of milliseconds per recommendation. Casting this as RL is natural — recommend, observe, update — but every piece of the MDP is awkward: the state starts empty and must be built on-device from non-sensitive priors, the reward (a purchase or next-day return) arrives tens of steps late through a conversion funnel, and the very exploration that learns a cold user faster is what scares them away. Each tension names a mechanism.

Industrial control & parameter tuning

A rolling mill, a battery sorter, a servo motor — physical plants where a bad action overheats a coil, scraps a part, or trips a safety relay. Game AI could afford a billion throwaway episodes; a plant cannot afford one unsafe one. So the binding difficulties of this domain are different: real samples are scarce and expensive, some constraints must hold on every single step, sensors arrive late, and the plant ages out from under the policy. Each one names a tool — and the recurring move is to start from the controller the plant already trusts (PID/MPC) and let RL correct it, never replace it cold.

Smart-logistics AGV scheduling

A warehouse floor is a fleet of automated guided vehicles (AGVs) that must pick, carry, yield at intersections, and recharge — all while orders arrive on a schedule that drifts hour by hour. The binding difficulties of this domain are different from a game's: the reward is a multi-agent yielding conflict that can deadlock, the order stream is non-stationary so a fixed-horizon policy ages out by noon, path and charging are two coupled timescales, the charging queue is partially observable, and the whole thing only learns fast enough if you can keep thousands of simulators feeding a GPU. Each one names a tool.

Network congestion control

A congestion controller decides how fast to send into a link it cannot see, using signals — round-trip time, loss, ECN marks — that are noisy, delayed, and not Markov. Cast as RL, the binding difficulties of this domain are: a partially-observable, non-stationary state built from jittery measurements; an action space that must change rate without oscillating; a throughput-vs-latency reward with a hard tail-latency constraint; a multi-agent fairness requirement (don't starve the TCP flows sharing your link); and a µs-scale inference budget that lives in the kernel datapath. Each names a mechanism.

Edge-computing task offloading

A device, a base station, and a wireless link between them. Every few milliseconds a task arrives and the agent must decide: run it locally or ship it to the edge server, and how much compute to claim if it goes. Casting this as an MDP is easy; making it converge and survive production is not. The binding difficulties here are a hybrid action space (a discrete offload switch fused with a continuous resource fraction), a reward built on top of error-prone energy and channel models whose bias propagates straight into the value function, a partially observable state (the hidden load on the edge node you cannot see), a non-stationary wireless channel that drifts faster than you can train, and a hard safety boundary (latency deadlines, untrusted nodes, battery limits). Each one names a tool.

Robotic grasping

A gripper has to find an object, decide how to approach it, close on it, and not crush or drop it — all from noisy multimodal sensors, with an action that lives on a curved manifold, under hard safety limits, and a policy trained in simulation that must work on a real arm it has never touched. The binding difficulties here are multimodal partial observability (vision can be occluded, tactile is low-resolution), an action space that is not flat (6-DoF pose lives on SE(3), gripper force and position must share a critic), a reward that conflates "closed" with "grasped well", and a sim-to-real gap that turns a 99% sim policy into a 78% real one. Each names a tool.

Energy-management microgrid

A microgrid's dispatch controller decides, every few seconds, how much to charge or discharge the battery, which loads to shed, and how much to draw from the grid, all to minimize the electricity bill without ever violating a physical limit. What makes this MDP hard is that the state is a forecast, not a fact (weather and price are uncertain), the action lives under hard safety constraints that a stochastic policy must never break, the reward spans multiple time-scales and a depreciation cost you cannot directly measure, and the environment is openly non-stationary (seasons) and distributed across sites that will not share raw load data. Each of those names a mechanism.

Wireless-network power allocation

A base station has a power budget and a fistful of users on shared spectrum. Every watt you give one user is interference to the others, and the channel that decides the payoff changes faster than you can measure it. Cast as RL, the binding difficulties are sharp and physical: the state (channel-state information) is enormous and stale, the action (transmit power) is continuous and hard-constrained by a power amplifier that distorts when saturated, the reward is a non-convex sum-rate-versus-energy trade-off, the transition is partially observable because coherence time is shorter than a decision slot, and the whole thing must run inside a 3 ms TTI across many base stations at once. Each one names a tool.

Cloud-resource elastic scaling

An autoscaler decides, every few seconds, how many container instances to run and how much CPU to grant each one — trading a latency SLA against a cloud bill. It looks like a textbook control problem, but the binding difficulties are RL-specific: a hybrid action space (how many instances × how much CPU each), a delayed transition because a new instance is not ready until it cold-starts, a two-objective reward (SLA vs. cost) that invites both reward-hacking and oscillation, non-stationary traffic that breaks the stationary-MDP assumption, and a hard budget that the policy may never violate. Each of those names a tool.
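To make the hybrid action and the hard budget concrete, here is a minimal sketch of one way a proposed action could be forced inside the budget before it is executed. The names ScalingAction and project_to_budget, the price and budget parameters, and the rule "shrink CPU first, then drop instances" are all assumptions for illustration, not the lesson's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingAction:
    instances: int            # discrete part of the hybrid action
    cpu_per_instance: float   # continuous part, in vCPUs


def project_to_budget(action, price_per_vcpu_hour, hourly_budget, min_cpu=0.25):
    """Shrink a proposed action until its hourly cost fits the budget.

    Hypothetical rule: reduce CPU per instance first (down to min_cpu),
    and only then reduce the number of instances.
    """
    if min_cpu * price_per_vcpu_hour > hourly_budget:
        raise ValueError("budget cannot cover even one minimal instance")
    instances = max(1, action.instances)
    cpu = max(min_cpu, action.cpu_per_instance)
    # Largest per-instance CPU the budget allows at this instance count.
    affordable_cpu = hourly_budget / (instances * price_per_vcpu_hour)
    if affordable_cpu >= min_cpu:
        return ScalingAction(instances, min(cpu, affordable_cpu))
    # Even minimal instances are too many: cut the instance count instead.
    instances = int(hourly_budget / (min_cpu * price_per_vcpu_hour))
    return ScalingAction(instances, min_cpu)
```

Projecting after the policy acts keeps the budget a property of the system rather than something the reward has to teach; the price is that the policy may not learn which of its proposals were overridden unless the projected action is what gets logged.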

UAV path planning

A drone that plans its own route through a real sky has to reconcile five awkward facts at once: its position estimate degrades when GPS drops out, its action is a continuous thrust/heading vector that must stay physically flyable, its reward braids path length against energy against regulatory risk, the wind field is non-stationary, and in a swarm it must agree with neighbors over a flaky radio link. Each fact names a tool — and reaching for the tool before the fact bites is how you over-engineer.

Personalized medical dosing

A clinician adjusts a drug dose every few hours from a patient's labs and vitals, trying to reach a therapeutic effect without crossing into toxicity. That is a sequential-decision problem — a closed-loop controller over a living body — and it is the cleanest place to learn the parts of applied RL that no game teaches you: the state is a heterogeneous, missing, high-dimensional clinical record; the action is a continuous dose under a hard safety cap; the reward is an efficacy-minus-toxicity signal that arrives days late; and you can never explore on a real patient — so everything is offline, conservative, and audited. Each of those names a mechanism.

Intelligent traffic-signal control

A controller watches an intersection through noisy, sparse sensors and decides which movement gets the green and for how long. Casting that as an MDP is easy; making it converge and pass a traffic-authority acceptance test is the hard part. The binding difficulties here are a partially observed, heterogeneous, high-dimensional state, an action space riddled with hard safety constraints (minimum green, phase conflicts), a multi-objective reward that hides perverse optima (suppressing side streets to flatter the arterial), non-stationary demand (peak/off-peak, holidays, storms), and multi-intersection coordination under communication delay. Each one names a tool.
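One common way to handle a hard constraint such as minimum green is action masking: the controller only ever chooses among phases that are legal right now. The sketch below assumes each action is a whole signal phase (so conflicting movements are excluded by how phases are defined) and that the policy exposes one score per phase; legal_phases, masked_argmax, and the 10-second default are hypothetical names and values.

```python
def legal_phases(current_phase, seconds_in_phase, n_phases, min_green_s=10):
    """Return the phase indices the controller may select at this step."""
    if seconds_in_phase < min_green_s:
        # Minimum green not yet served: the only legal action is to hold.
        return [current_phase]
    return list(range(n_phases))


def masked_argmax(scores, legal):
    """Pick the highest-scoring phase among the legal ones only."""
    return max(legal, key=lambda phase: scores[phase])


scores = [0.1, 0.9, 0.3, 0.2]  # illustrative per-phase scores
# Four seconds into phase 0, the mask forces a hold despite phase 1's score.
held = masked_argmax(scores, legal_phases(0, 4, n_phases=4))
# After twelve seconds, every phase is legal and phase 1 wins.
switched = masked_argmax(scores, legal_phases(0, 12, n_phases=4))
```

Because the illegal actions are never selectable, the constraint holds during exploration as well as after training, which is the property an acceptance test cares about.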

5G network-slicing resource allocation

One physical radio carrier is sliced into virtual networks that must each meet a contract: a video slice wants bandwidth, an industrial-control slice wants 99.999% reliability at sub-millisecond latency, a best-effort slice wants whatever is left. A controller hands out radio blocks and transmit power every transmission-time interval. The binding difficulties are a high-dimensional, fast-drifting state (channel reports that are huge, delayed, and partially observed), a hybrid constrained action (discrete block selection × continuous power under a total-power cap), a multi-objective reward where a reliability breach is a regulatory red line, non-stationary traffic that shifts faster than you can retrain, and a hard isolation requirement: no slice may starve its neighbour. Each names a tool.

Smart-manufacturing scheduling

A factory floor is a sequential decision problem: every shift, an algorithm decides which job runs on which machine, in what order, and when. The MDP writes itself, but three difficulties bind hard and all at once. The action space is a huge, structured, mostly-illegal set governed by process-precedence constraints that can deadlock shared resources. The state hides a physically unobservable variable (tool wear). And the reward is a multi-objective trade-off with hard constraints (delivery rate against changeover cost against worker overtime) where naïve weighting silently violates the one limit you cannot cross. Each names a tool.

Satellite-constellation power control

A low-Earth-orbit (LEO) broadband satellite must set its downlink transmit power every few milliseconds — enough to close a link through a channel that swings violently as geometry and the ionosphere change, but never enough to saturate the power amplifier, blow the harvested-energy budget, or burn battery cycle-life. The binding difficulties are sharp: the channel is only partially observable and non-stationary, the amplifier imposes a hard safety constraint, the true reward trades throughput against hardware lifetime, the orbit injects a periodic eclipse shock, and inference runs on a radiation-hardened FPGA with a millisecond deadline. Each one names a tool.

Data-center cooling

A hall full of racks, a fleet of fans, pumps and chilled-water valves, and one number the operator is judged on: PUE — total facility power divided by IT power. RL can shave it, but only against a wall of hard physics: you never see the whole temperature field, the cheapest cooling setting is always one step from a thermal runaway, and the IT load that drives everything lurches without warning. This lesson casts cooling as a constrained, partially-observable MDP and works the four binding difficulties: a hidden 3-D thermal state, a hybrid discrete-continuous actuator space with hard safety intervals, a multi-objective PUE-vs-temperature reward that is trivially hackable, and a non-stationary load.
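The PUE definition above, and the reason a reward built on PUE alone is hackable, fit in a few lines. This is a minimal sketch: the 27 °C inlet limit, the penalty weight, and the kW figures are illustrative assumptions, and cooling_reward is a hypothetical shaping rather than the lesson's reward.

```python
def pue(total_facility_kw, it_kw):
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


def cooling_reward(total_facility_kw, it_kw, max_inlet_c,
                   limit_c=27.0, penalty=10.0):
    """Negative PUE minus a penalty per degree above the inlet limit (assumed)."""
    overshoot_c = max(0.0, max_inlet_c - limit_c)
    return -pue(total_facility_kw, it_kw) - penalty * overshoot_c


# Same 1000 kW IT load, two hypothetical cooling settings.
safe = dict(total_facility_kw=1300.0, it_kw=1000.0, max_inlet_c=25.0)
starved = dict(total_facility_kw=1100.0, it_kw=1000.0, max_inlet_c=35.0)

# A PUE-only objective prefers the setting that overheats the racks.
pue_only_prefers_starved = pue(1100.0, 1000.0) < pue(1300.0, 1000.0)
# With the temperature term, the safe setting scores higher.
penalized_prefers_safe = cooling_reward(**safe) > cooling_reward(**starved)
```

A penalty term only discourages overheating; where the limit is truly hard, it would have to be enforced as a constraint on the action rather than traded off inside the reward.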

Epidemic intervention strategy

The last domain — and the one where a wrong action is measured in lives and in GDP at the same time. A public-health agency must decide, week after week, how hard to test, how hard to lock down, and where to ship a finite stockpile of vaccine, while the truth it is reacting to — the true number of infections — is never directly observed, the virus it is fighting mutates underneath the policy, and every action sits under hard legal and inventory ceilings. For an epidemic controller the binding difficulties are partial observability (you see reported cases, not infections), a multi-objective reward (health vs. economy, neither dominant), hard safety constraints (stockpile, cold-chain, statute), and a non-stationary transition (the pathogen evolves). Each one names a tool.

Closing (lesson 87)

The through-line in one paragraph, and an interactive self-test spanning the whole series.

Closing & self-test

Thirty-one lessons, one story. We tell that story in a single breath, point you to the systems course that picks up where we stop, and hand you a ten-question self-test to find the spots worth a second read.