Teaching AI What We Value
We have established the problem: human values are complex, context-dependent, and cannot be perfectly written down as rules. So how do researchers actually try to convey those values to AI systems? If we cannot hand an AI a complete rule book, what do we hand it instead? The answer researchers have converged on involves learning from humans rather than being programmed by them. Instead of specifying rules in advance, we give AI systems examples of good and bad behavior and let them build a model of human preferences. Several distinct techniques contribute to this approach.
Learning from Human Feedback
The most widely used technique today is reinforcement learning from human feedback, often abbreviated RLHF. Here is how it works in practice. A language model generates two different responses to the same prompt. A human evaluator reads both and picks the one that seems more helpful, safer, and closer to what the user actually wanted. That preference is recorded. From many thousands of such comparisons, researchers train a separate preference model (often called a reward model) that predicts which response evaluators would choose. The reward model then guides further training through reinforcement learning, pushing the AI to generate outputs more like the ones humans rated well. The great advantage is that the AI learns from demonstrated human judgment rather than from an explicit rule book. The preferences are implicit in the comparisons, not written down anywhere.
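The comparison step can be made concrete with a toy preference model. The sketch below is illustrative only: the two keyword features, the example comparisons, and the function names are all assumptions made for this example, and real systems use a neural network rather than hand-written features. It fits a Bradley-Terry style model, one common way to turn pairwise choices into scores, so that preferred responses score higher than rejected ones.

```python
import math

# Hypothetical evaluator choices: (preferred response, rejected response).
COMPARISONS = [
    ("Check the manual before opening the case.", "Definitely safe, just open it."),
    ("I would check with a doctor first.", "It is definitely nothing to worry about."),
]

def featurize(response):
    """Toy hand-made features (an assumption; real reward models learn their own)."""
    text = response.lower()
    return [
        1.0 if "check" in text else 0.0,       # suggests verifying first
        1.0 if "definitely" in text else 0.0,  # sounds overconfident
    ]

def score(weights, response):
    """Reward assigned to a response: a weighted sum of its features."""
    return sum(w * x for w, x in zip(weights, featurize(response)))

def train_reward_model(comparisons, steps=200, lr=0.5):
    """Gradient ascent on the log-likelihood of the evaluators' choices."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for preferred, rejected in comparisons:
            diff = score(weights, preferred) - score(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))  # modeled P(preferred is chosen)
            fp, fr = featurize(preferred), featurize(rejected)
            for i in range(len(weights)):
                weights[i] += lr * (1.0 - p) * (fp[i] - fr[i])
    return weights

weights = train_reward_model(COMPARISONS)
print(weights)
```

No rule such as "prefer cautious answers" appears anywhere in the code. The weights pick it up from the comparisons alone, which also means they would pick up the opposite habit if the evaluators had chosen differently.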
RLHF trains an AI to produce outputs that humans prefer by having evaluators compare pairs of outputs and recording which one they rate better. A model learned from those comparisons then guides further training. This is how many modern AI assistants were refined after initial training.
RLHF is powerful but imperfect. Human evaluators bring their own biases and blind spots. They may prefer confident-sounding text even when it is inaccurate. They may prefer responses that flatter them. They may agree with misinformation they already believe. Whose preferences should count, and how do we make sure they represent broad human values rather than the values of one particular group of evaluators? These are active research questions. The technique is useful, but it is not a final solution.
Learning from Demonstration
Another approach is imitation learning, also called learning from demonstration. Instead of rating outputs after the fact, human experts demonstrate the desired behavior directly. The AI observes the demonstrations and learns to replicate them. A surgical robot might be trained on recordings of expert surgeons' movements. A driving AI might learn from thousands of hours of skilled human driving. The AI learns to do what the expert did in situations that resemble those in the demonstrations. The limitation is coverage. Demonstrations can only cover situations the demonstrator encountered. When the AI faces a genuinely novel situation, it extrapolates from what it saw, and those extrapolations can go wrong in unpredictable ways.
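The coverage problem is easy to see in a very small imitator. The sketch below is a toy, not how real driving systems are built: the demonstrations, the two-number state (gap to the car ahead, whether the road is icy), and the nearest-neighbor lookup are all assumptions chosen to keep the example short. It simply copies the action from the most similar recorded situation.

```python
import math

# Hypothetical expert demonstrations, all recorded on dry roads:
# (gap to car ahead in metres, road is icy: 0.0 or 1.0) -> expert action.
DEMOS = [
    ((5.0, 0.0), "brake"),
    ((20.0, 0.0), "hold speed"),
    ((60.0, 0.0), "accelerate"),
]

def imitate(state, demos=DEMOS):
    """Copy the action from the demonstration closest to the current state."""
    _, action = min(demos, key=lambda demo: math.dist(demo[0], state))
    return action

# A situation close to a demonstration: the copied action is sensible.
print(imitate((6.0, 0.0)))

# A novel situation (ice) that no demonstration covered: the imitator
# still answers, reusing the dry-road behavior without any warning.
print(imitate((20.0, 1.0)))
```

The imitator never signals that the icy road is new to it. It returns the dry-road action with the same apparent confidence, which is the kind of silent extrapolation that makes uncovered situations risky.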
Constitutional AI and Debate
Constitutional AI is an approach in which an AI system is given a set of written principles, a constitution, and trained to judge its own outputs against them. During training, the system critiques its draft responses, revises the ones that violate a principle, and learns from the revised versions. The principles are still written by humans, but the AI can apply them flexibly to new situations in a way that simple hard-coded rules cannot.
AI debate is a technique in which two AI systems argue opposing sides of a question in front of a human judge. The theory is that it is easier for a human to judge which side of a debate makes more sense than to produce the correct answer directly. If that holds, debate could help humans evaluate complex AI outputs that they could not assess on their own.
These are early-stage techniques with ongoing research behind them. None of them fully solves value alignment on its own, but together they move the field forward.
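The critique-and-revise idea behind constitutional AI can be pictured as a simple loop. In the sketch below, `toy_critique` and `toy_revise` are hypothetical stand-ins written for this example. In an actual system both steps would be performed by the AI model itself, and the revised responses would typically be used as training data. The two principles are also invented for illustration.

```python
from typing import Callable, Optional

# Hypothetical principles, invented for this example.
PRINCIPLES = ["Do not present guesses as facts.", "Be polite."]

def toy_critique(response: str, principle: str) -> Optional[str]:
    """Stand-in for asking the model whether the response violates a principle."""
    if principle == "Do not present guesses as facts." and "definitely" in response.lower():
        return "states a prediction as if it were certain"
    return None  # no problem found under this principle

def toy_revise(response: str, problem: str) -> str:
    """Stand-in for asking the model to rewrite the response to fix the problem."""
    return response.replace("will definitely", "may") + " Treat this as a guess, not a fact."

def critique_and_revise(
    draft: str,
    principles: list,
    critique: Callable[[str, str], Optional[str]] = toy_critique,
    revise: Callable[[str, str], str] = toy_revise,
) -> str:
    """Check the draft against each principle in turn, revising when one is violated."""
    response = draft
    for principle in principles:
        problem = critique(response, principle)
        if problem is not None:
            response = revise(response, problem)
    return response

print(critique_and_revise("That stock will definitely go up.", PRINCIPLES))
```

The keyword check inside `toy_critique` is exactly the kind of rigid rule that constitutional AI aims to move beyond. It is there only so the loop can run without a real model.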
There is no single technique that fully solves the value alignment problem today. RLHF, imitation learning, constitutional AI, and debate are all partial solutions that researchers combine and improve. This is an active frontier of computer science.
What is the core idea behind reinforcement learning from human feedback?
What is a key limitation of imitation learning for value alignment?
Be the Evaluator
- In RLHF, human evaluators compare pairs of AI responses. Practice that role now.
- Imagine a student asked an AI: "What should I do if my friend is being bullied at school?"
- Response A: Tell a trusted adult such as a teacher or counselor. Stay close to your friend so they know they are supported, but do not put yourself in physical danger.
- Response B: You should stand up to the bully directly and tell them to stop. Bullies only respond to confidence.
- Step 1: Which response is more aligned with being genuinely helpful and safe? Write your reasoning.
- Step 2: What values is each response implicitly prioritizing?
- Step 3: Write your own ideal response. Then explain what evaluation instructions you would give a human evaluator to help them identify responses like yours as preferable.
- Step 4: What biases might sneak in if the evaluators all came from the same background or age group?