Robotics & Embodied AI

Reinforcement Learning for Control

Supervised learning requires labeled examples. But who can label how a robot should walk? Walking emerges from thousands of failed attempts, minor corrections, and occasional successes, not from a dataset of correct footstep sequences. Reinforcement learning (RL) is the framework for learning from this kind of feedback: not 'here is the correct action' but 'here is how well you did.' RL has produced some of the most dramatic robot learning results of the past decade, from quadrupeds that walk across rubble to robot hands that manipulate Rubik's cubes. This lesson develops RL precisely, starting from the formal framework and building up to how it is applied to physical robot control.

The Markov Decision Process Framework

Reinforcement learning is formalized using the Markov Decision Process (MDP). An MDP has five components:

State space S: the set of all possible states of the environment. For a walking robot, the state might include joint angles, joint velocities, foot contact states, and base orientation, giving a vector of perhaps 40-80 numbers.

Action space A: the set of all actions the agent can take. For a robot arm, actions are usually continuous joint velocity or torque commands, although they can also be a discretized set of such commands.

Transition function T(s, a, s'): the probability of transitioning to state s' when the agent takes action a in state s. For a physical robot this function is not known in advance; the agent can only learn about it by acting and observing what happens.

Reward function R(s, a): the scalar reward signal received when the agent takes action a in state s. This is the critical design choice in RL, because it encodes what you want the robot to accomplish.

Discount factor gamma (0 < gamma <= 1): determines how much the agent values future rewards relative to immediate ones. A gamma near 1 means the agent plans far ahead; a gamma near 0 means it lives only in the present. Choosing gamma < 1 also keeps the infinite sum below finite when rewards are bounded.

The agent's goal is to find a policy pi(s), a mapping from states to actions, that maximizes the expected cumulative discounted reward: E[r0 + gamma*r1 + gamma^2*r2 + ...]. The discounted sum inside the expectation is called the return.
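
The five MDP components and the return can be made concrete with a toy example. The sketch below is illustrative only: the two-state "standing/fallen" robot, its transition probabilities, its reward values, and the always-step policy are all invented for this example and do not describe any real robot.

```python
import random

# Hypothetical toy MDP: the robot is either "standing" or "fallen".
# T[state][action] lists (next_state, probability) pairs.
T = {
    "standing": {
        "step": [("standing", 0.8), ("fallen", 0.2)],
        "wait": [("standing", 1.0)],
    },
    "fallen": {
        "step": [("fallen", 1.0)],
        "wait": [("fallen", 1.0)],
    },
}

def R(state, action):
    """Assumed reward: 1.0 for attempting a step while standing, else 0.0."""
    return 1.0 if (state == "standing" and action == "step") else 0.0

def policy(state):
    """A fixed deterministic policy pi(s): always try to step."""
    return "step"

def rollout(start, horizon, rng):
    """Follow the policy for `horizon` steps and record the rewards."""
    state, rewards = start, []
    for _ in range(horizon):
        action = policy(state)
        rewards.append(R(state, action))
        next_states, probs = zip(*T[state][action])
        state = rng.choices(next_states, weights=probs)[0]
    return rewards

def discounted_return(rewards, gamma):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward list."""
    g = 0.0
    # Work backwards, applying G_t = r_t + gamma * G_{t+1} at each step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rng = random.Random(0)
rewards = rollout("standing", horizon=20, rng=rng)
print("far-sighted return:  ", discounted_return(rewards, gamma=0.99))
print("short-sighted return:", discounted_return(rewards, gamma=0.1))
```

Averaging `discounted_return` over many rollouts gives a Monte Carlo estimate of the expected return that the policy is meant to maximize.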

The Reward Function Is the Specification

In RL, you do not write the behavior — you write the reward. This is powerful: you specify what success looks like numerically, and the algorithm discovers how to achieve it. But it is also dangerous: the agent will optimize the reward you wrote, not the reward you intended. Misspecified rewards produce surprising and often undesirable behaviors, a phenomenon called reward hacking.

Policies, Value Functions, and the Policy Gradient

A policy pi is the brain of the RL agent. A deterministic policy specifies exactly which action to take in each state: a = pi(s). A stochastic policy specifies a probability distribution over actions: pi(a|s) is the probability of taking action a in state s. Stochastic policies are important in robotics because they enable exploration: the agent tries different actions in the same state, which is how it discovers that some actions work better than others.

The value function V_pi(s) answers: if I follow policy pi starting from state s, what total discounted reward can I expect to accumulate? The action-value function Q_pi(s, a) answers: if I take action a in state s and then follow pi, what total discounted reward can I expect to accumulate?

Two families of RL algorithms use these ideas differently.

Policy gradient methods directly optimize the policy parameters (usually the weights of a neural network) by following the gradient of expected return with respect to those parameters. The key insight, formalized in the Policy Gradient Theorem, is that this gradient can be estimated from experience: run the policy, collect trajectories, and sum the gradients of the log-probabilities of the actions taken, each weighted by the return that followed. Actions that were followed by high returns are made more likely. Proximal Policy Optimization (PPO) is a widely used policy gradient method in robotics. Soft Actor-Critic (SAC), also widely used, is an off-policy actor-critic method: it updates its policy using a learned Q-function rather than raw returns.

Model-based RL learns a model of the transition function T(s, a, s') and uses that model to plan. This is more data-efficient because the robot can reason about consequences without physically executing every candidate action. However, model errors compound: if the learned transition model is slightly wrong, plans built on it may fail in the real world.
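
The policy gradient estimate described above can be sketched on the smallest possible problem. The example below is a minimal illustration, not PPO or SAC: it assumes a single state, one-step episodes (so the return equals the immediate reward), a softmax policy with one preference per action, and a running-average baseline to reduce variance. The reward means, learning rate, and function names are assumptions made for the example.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution pi(a)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_reinforce(mean_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE with a baseline on a single-state, one-step task."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)   # policy parameters
    baseline = 0.0                      # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(theta)
        # Stochastic policy: sample an action, which is what drives exploration.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        for k in range(len(theta)):
            # Gradient of log pi(action) w.r.t. theta[k] for a softmax policy.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log_prob
        baseline += (reward - baseline) / t
    return softmax(theta)

# Two hypothetical actions; the second has the higher average reward.
final_probs = train_reinforce([0.2, 1.0])
print(final_probs)
```

After training, most of the probability mass should sit on the higher-reward action. Real robot policies replace the preference table with a neural network and the one-step reward with a multi-step return, but the update has the same shape: log-probability gradient times a return-based weight.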

RL in Physical Robotics: Challenges and Successes

Applying RL to physical robots introduces challenges that don't exist in pure simulation or board games.

Sample efficiency: training AlphaGo required millions of self-play games. A physical robot cannot attempt a million falls and recoveries: hardware wears out, time is limited, and some falls are dangerous. Physical RL systems therefore demand far more efficient learning than their simulation counterparts. Off-policy algorithms such as SAC reuse past experience from a replay buffer, and SAC-based methods have been used to learn from on the order of thousands of real-world interactions rather than millions.

Safety during learning: an RL agent in early training takes essentially random actions to explore. For a robot arm, random torque commands can cause joint damage or injury. Safe RL methods impose constraints on the action space (for example, staying within joint limits), use conservative initial policies, or add safety layers that veto dangerous actions before execution.

Reward shaping: the sparse reward problem is severe in robotics. Telling a robot arm 'reward 1.0 if object is in bin, 0.0 otherwise' means the arm gets no learning signal until it accidentally succeeds, which may never happen if it never grasps the object at all. Reward shaping adds intermediate rewards (a small reward for touching the object, a larger one for lifting it) to guide learning. However, poorly shaped rewards can cause the agent to optimize the intermediate rewards rather than the final goal.

Real-world RL success: ETH Zurich's ANYmal quadruped team trained a controller with RL entirely in simulation, then transferred it to hardware, where it walked across snow, sand, and rubble. OpenAI's Dactyl (2019) trained a robotic hand in simulation to manipulate a Rubik's cube, using RL with heavily randomized simulation parameters and no human demonstrations. Both relied on large-scale simulation (for Dactyl, the equivalent of many years of simulated experience) and did their learning without physical trial-and-error.
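
The difference between the sparse bin-placement reward and a shaped version can be checked by scoring a few hand-written episodes. This is a sketch under stated assumptions: the bonus values (0.05 for touching, 0.2 for lifting), the per-timestep event flags, and the three episodes are all hypothetical and chosen only to illustrate the point.

```python
def sparse_reward(in_bin):
    """Reward 1.0 only when the object is in the bin, 0.0 otherwise."""
    return 1.0 if in_bin else 0.0

def shaped_reward(touching, lifted, in_bin):
    """Hypothetical shaped reward: per-timestep bonuses for partial progress."""
    reward = 0.0
    if touching:
        reward += 0.05   # assumed small bonus for contact
    if lifted:
        reward += 0.2    # assumed larger bonus for lifting
    if in_bin:
        reward += 1.0    # the actual goal
    return reward

def episode_totals(episode):
    """Sum both rewards over an episode of (touching, lifted, in_bin) steps."""
    sparse = sum(sparse_reward(in_bin) for _, _, in_bin in episode)
    shaped = sum(shaped_reward(*step) for step in episode)
    return sparse, shaped

# Reaches, touches, and lifts the object, but never places it.
near_miss = ([(False, False, False)] * 5
             + [(True, False, False)] * 3
             + [(True, True, False)] * 3)
# Same as above, then places the object in the bin on the final step.
success = near_miss + [(False, False, True)]
# Adversarial case: rests a finger on the object for 200 steps and does nothing else.
hover = [(True, False, False)] * 200

for name, episode in [("near_miss", near_miss), ("success", success), ("hover", hover)]:
    print(name, episode_totals(episode))
```

Under the sparse reward the near miss scores exactly zero, so the agent gets no signal that it was close. The shaped reward does distinguish it from doing nothing, but with these per-timestep bonuses the hover episode collects more total reward than the successful one, which is the failure mode the paragraph on poorly shaped rewards warns about.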

Reward Hacking Is Real

A simulated robot trained with reward proportional to forward speed learned to make itself very tall and then fall forward — technically moving fast, not at all what was intended. Reward specification is an engineering discipline in itself. Always test your reward function with adversarial scenarios before training.

Match each RL concept to the concrete robotics example that best illustrates it.

Terms

Sparse reward causing no learning signal
Reward hacking
Sample efficiency challenge
Exploration during policy gradient training

Definitions

A robot arm gets reward 1.0 only when it places an object in a bin but never grasps the object during random exploration
A stochastic policy adds noise to joint commands to discover that a slightly wider grip improves success rate
A physical robot can only attempt 5,000 grasps before its gripper must be serviced
A simulated locomotion robot learns to hop sideways because the reward only penalizes falling and never rewards forward progress

An RL-trained quadruped robot achieves excellent locomotion performance in simulation but fails to walk when deployed on the real hardware. The most likely explanation is:

A robot arm's reward function gives +10 for placing an object in the target bin and -0.01 per timestep elapsed. During training, the robot learns to place the object correctly but then immediately knocks it out of the bin and re-places it over and over. What reward design flaw caused this?

Design an RL Reward Function

  1. You are designing the reward function for an RL agent controlling a robotic arm that must pick up a fragile glass and place it upright in a rack without breaking it.
  2. Step 1: Write a first-draft reward function. Assign specific numeric rewards or penalties to at least five events the robot might produce during an episode (e.g., touching the glass, lifting it, placing it, dropping it, breaking it).
  3. Step 2: Play adversary. For each term in your reward function, describe a strategy the robot could use to maximize that specific reward without achieving the actual goal. This is reward hacking.
  4. Step 3: Revise your reward function to close the loopholes you identified. What changes did you make?
  5. Step 4: Identify one aspect of success (for example, the glass being placed gently without chipping) that is very hard to capture with a scalar reward signal. How would you measure it?
  6. Step 5: Discuss with a partner: is there always a perfect reward function for a complex physical task, or are some goals fundamentally hard to specify numerically?