Policies and Exploration
An RL agent has learned which restaurant to go to. It always orders the pasta — it is reliably good. But what about the new Thai place down the street? Maybe it is better. Maybe it is terrible. To find out, the agent must try something new, which risks a worse experience tonight. This is the explore-versus-exploit trade-off, and it sits at the center of every reinforcement learning algorithm. Resolving it well is the difference between an agent that gets stuck in a local maximum and one that finds genuinely optimal behavior.
What a Policy Is
A policy π is the agent's decision-making rule — a mapping from states to actions (or distributions over actions). Deterministic policy: π(s) = a. Given state s, always take action a. Stochastic policy: π(a | s) = P(action = a | state = s). Given state s, sample an action from a probability distribution. Stochastic policies are essential for exploration and for handling partial observability. The goal of RL is to find the optimal policy π* — the policy that maximizes expected discounted return from every state: π* = argmax_π E[Gₜ | π] Two fundamental value functions used to evaluate policies: State-value function V^π(s): the expected return starting from state s and following policy π thereafter. V^π(s) = E_π[Gₜ | sₜ = s] Action-value function Q^π(s, a): the expected return starting from state s, taking action a first, then following π. Q^π(s, a) = E_π[Gₜ | sₜ = s, aₜ = a] The Q-function is particularly important because it lets the agent compare actions: in state s, is action a₁ better or worse than action a₂? The greedy policy derived from Q* takes the action with the highest Q-value in every state.
The optimal Q-function satisfies the Bellman optimality equation: Q*(s, a) = E[r + γ · max_{a'} Q*(s', a')] This says: the value of (state s, action a) equals the immediate reward plus the discounted value of the best action in the next state. This recursive structure is the foundation of dynamic programming and Q-learning — value functions can be computed by solving these equations iteratively.
Q-learning — a foundational RL algorithm: Q-learning learns the optimal Q-function directly from experience, without needing a model of T (the transition function). It is an off-policy algorithm — it can learn from data generated by any policy, including a random one. Update rule (applied after every transition (s, a, r, s')): Q(s, a) ← Q(s, a) + α · [r + γ · max_{a'} Q(s', a') − Q(s, a)] Where: - α ∈ (0,1] is the learning rate — how much to update on each step. - r + γ · max_{a'} Q(s', a') is the Bellman target — what Q(s,a) should be. - The bracketed term r + γ · max_{a'} Q(s', a') − Q(s, a) is the temporal difference (TD) error — the gap between current estimate and the target. Intuition: if the reward received plus the best future value was higher than the current Q estimate, increase Q(s,a). If lower, decrease it. Repeat across millions of transitions; under mild conditions Q converges to Q*.
Deep Q-Networks (DQN): Q-learning with a table of (s, a) entries fails when S is large — a Go board has ~10^170 possible states. Deep Q-Networks (DQN), introduced by DeepMind in 2013, replace the table with a neural network Q(s, a; θ) parameterized by weights θ. The network takes state s as input and outputs Q-values for all actions. Two key tricks that make DQN stable: 1. Experience replay: store transitions (s, a, r, s') in a replay buffer; sample random mini-batches for training rather than learning from consecutive transitions. This breaks temporal correlations that destabilize training. 2. Target network: maintain a separate, slowly updated copy of the Q-network to compute Bellman targets. Using the same network for both current Q and target Q creates a moving-target problem; the separate network provides stable targets.
Flashcards — click each card to reveal the answer
The Explore-Exploit Trade-off
An agent that always exploits its current best estimates never discovers better options it does not yet know about. An agent that always explores never benefits from what it has already learned. This is the explore-exploit trade-off, and it has no universally optimal solution — the right balance depends on the problem. ε-greedy: with probability ε, take a random action (explore); with probability 1−ε, take the greedy action (exploit). Simple and widely used. Common practice: anneal ε from 1.0 (full exploration) at the start of training down to 0.01 or 0.1 as learning matures. Upper Confidence Bound (UCB): instead of exploring randomly, prefer actions that have high uncertainty — actions that have been tried fewer times. For a multi-armed bandit (a simplified RL problem), the UCB formula is: a* = argmax_a [Q(a) + c · √(ln t / Nₐ)] Where t is the total number of steps and Nₐ is how many times action a has been tried. Actions tried rarely have high uncertainty bonuses and get selected more. Thompson Sampling: maintain a probability distribution over the Q-value of each action (the Bayesian approach). Sample one value from each distribution and take the action with the highest sampled value. As evidence accumulates, distributions narrow and the agent naturally shifts toward exploitation. The multi-armed bandit problem is the pure exploration-exploitation setting: an agent repeatedly chooses one of k slot machines (arms), each with an unknown reward distribution, aiming to maximize total reward. It isolates the trade-off from the added complexity of sequential state transitions.
In practice, exploration is one of the hardest open problems in RL. ε-greedy works in simple environments but fails in large state spaces where random actions rarely reach interesting states. Curiosity-driven exploration — rewarding the agent for visiting states that surprise it (states where the world model's prediction error is high) — is a promising approach for sparse-reward environments.
An agent uses an ε-greedy policy with ε = 0.1 and has learned Q-values Q(s, a₁) = 3.2, Q(s, a₂) = 5.8, Q(s, a₃) = 2.1. What happens on 90% of time steps in state s?
Why does using the same neural network for both the current Q estimate and the Bellman target make DQN training unstable?
Multi-Armed Bandit Simulation
- Step 1: Set up three 'slot machines' (arms) on paper. Assign each arm a hidden average reward drawn from a hat: write one of {1, 3, 7} on a slip, fold it, and place it under each machine. Do not look at the values.
- Step 2: You have 20 pulls total. On each pull, choose an arm, then flip a coin — if heads, the reward equals the arm's hidden value; if tails, reward is zero. Record each pull and reward.
- Step 3: After 20 pulls, calculate your total reward and which arm you exploited most.
- Step 4: Reveal the true arm values. Did you identify the best arm? How many pulls did it take?
- Step 5: Replay the 20 pulls but use UCB: prefer arms you have tried fewer times. Compare total rewards.
- Step 6: Discuss: in a real RL problem with millions of possible actions, how would you scale the UCB idea?