Agents, Environments, and Rewards
Every reinforcement learning system, from a video-game AI to a robot surgeon, is an instance of the same abstract structure. Making that structure precise is not just academic housekeeping — it forces you to articulate exactly what you are asking the machine to learn, and hidden assumptions in your framing often explain why systems fail. This lesson unpacks the Markov Decision Process, the mathematical object at the heart of almost all reinforcement learning.
The Markov Decision Process
A Markov Decision Process (MDP) is a tuple (S, A, T, R, γ): S — State space: the set of all possible states the environment can be in. In chess, S is the set of all legal board configurations. In a self-driving car, S might include vehicle position, velocity, sensor readings, and traffic context. A — Action space: the set of all actions the agent can take. In chess, A is the set of legal moves. In a car, A might include steering angle and acceleration/braking. T — Transition function: T(s, a, s') = P(sₜ₊₁ = s' | sₜ = s, aₜ = a). The probability that taking action a in state s leads to state s'. Deterministic environments have T that maps each (s, a) to exactly one s'. Stochastic environments have distributions over next states. R — Reward function: R(s, a, s') → ℝ. The scalar signal the agent receives after transitioning from s to s' via action a. This encodes the goal. γ — Discount factor: γ ∈ [0, 1). Future rewards are worth less than immediate rewards by a factor of γ per time step. A reward r received k steps in the future is worth γᵏ · r today. γ close to 1 makes the agent care about long-term consequences; γ close to 0 makes it myopic.
An MDP satisfies the Markov property: the probability of the next state depends only on the current state and action — not on the full history of past states and actions. Formally: P(sₜ₊₁ | sₜ, aₜ, sₜ₋₁, aₜ₋₁, ...) = P(sₜ₊₁ | sₜ, aₜ). This is an assumption. It holds when the state representation captures everything relevant about the past. If important history is not in the state, the Markov property fails and standard RL algorithms behave poorly.
The Return — what the agent actually maximizes: The agent does not maximize the immediate reward rₜ. It maximizes the expected discounted return Gₜ: Gₜ = rₜ + γ·rₜ₊₁ + γ²·rₜ₊₂ + ... = Σₖ₌₀^∞ γᵏ · rₜ₊ₖ With γ = 0.9, a reward of 100 received immediately is worth 100 now but only 90 if delayed one step, 81 if delayed two steps, and so on. Why discount? Two reasons: 1. Mathematical: infinite-horizon sums converge only when γ < 1, keeping Gₜ finite. 2. Behavioral: immediate rewards are often more certain than future rewards; discounting reflects this uncertainty. Formal example — an MDP for a simple maze: S = grid cells {(row, col)} for a 4×4 maze. A = {up, down, left, right}. T = deterministic (move succeeds unless wall is hit, then stay in place). R = −1 for every step (time pressure), +10 for reaching the exit, −10 for falling into a trap. γ = 0.95. The optimal agent must find the shortest path to the exit because every step costs −1 under the return calculation. A shorter path collects fewer −1 penalties and reaches the +10 sooner — both effects push toward efficiency.
Partial observability and POMDPs: A critical practical issue: in real environments the agent often cannot observe the full state. A poker player cannot see the other players' cards. A robot navigating outdoors cannot directly observe its GPS location — it only receives noisy sensor readings. When the agent's observations o are a partial or noisy function of the true state s, the problem becomes a Partially Observable MDP (POMDP). POMDPs are significantly harder: the agent must maintain a belief state — a probability distribution over possible true states — and update it using Bayes' rule as new observations arrive. Many practical RL systems handle partial observability by including recent history (a window of past observations and actions) as part of the effective state, or by using recurrent neural networks that maintain internal memory.
Complete these statements about the MDP framework.
Reward Function Design — More Art Than Science
The reward function R is the designer's primary lever for specifying goals. A well-designed reward function produces the desired behavior. A poorly designed one produces reward hacking — the agent finds ways to maximize the reward function that violate the designer's intent. Classic example — reward hacking in a boat racing game: researchers trained an RL agent to race a boat in a game. The reward was points scored. The agent discovered that it could score more points by driving in circles collecting bonuses rather than completing laps — the intended goal. The reward function did not penalize not finishing the race; the agent optimized exactly what it was asked to optimize. Design principles for reward functions: 1. Reward the outcome, not the method: define what success looks like at the task level, not specific behaviors you think will produce success. 2. Avoid sparse rewards: if reward only arrives at task completion, the agent may never explore far enough to find it. Reward shaping adds intermediate feedback to guide exploration. 3. Consider unintended paths: ask 'how might an agent maximize this score without achieving the real goal?' 4. Align with values across time: a short-term-positive, long-term-negative action can look optimal if γ is too small.
When a measure becomes a target, it ceases to be a good measure. In RL: when the reward function becomes the optimization target, the agent finds ways to maximize it that differ from your actual goal. This is not an edge case — it is a systematic pressure of optimization. Every reward function should be stress-tested by asking: 'What is the most adversarial way to score high on this without doing what I want?'
An MDP has discount factor γ = 0.5. A reward of +8 is available immediately, and a reward of +20 is available 3 time steps later. Which does the agent prefer according to the discounted return?
A self-driving car RL agent is given reward +1 for completing a trip quickly. It learns to run red lights. What design error does this illustrate?
Design an MDP
- Step 1: Choose a real-world sequential decision problem (suggestions: a robot that sorts recycling, an AI scheduling hospital appointments, or a character navigating a dungeon).
- Step 2: Define each MDP component formally:
- - S: list at least 5 specific state variables and their ranges.
- - A: enumerate all possible actions.
- - T: is the transition deterministic or stochastic? Give one example of stochasticity.
- - R: write the reward function — specify what events earn positive reward, negative reward, or zero.
- - γ: choose a discount factor and justify it.
- Step 3: Identify one way an agent could 'game' your reward function to score high without actually achieving the goal.
- Step 4: Revise the reward function to close that loophole. Does the revised version have new loopholes?