What Alignment Really Means
Here is a sentence that sounds simple but turns out to be one of the hardest problems in computer science: build an AI system that does what we actually want. Not what we say. Not what we reward it for. Not what it infers we might want from a few examples. What we actually want, in all circumstances, including circumstances no one anticipated when the system was built. This gap — between what a system is designed or trained to optimize and what its designers genuinely intended — is the alignment problem. It is not a science-fiction scenario. It is a precise technical challenge that researchers at places like DeepMind, Anthropic, OpenAI, and dozens of universities work on every day.
A Definition Worth Memorizing
An AI system is aligned if it reliably pursues the goals its designers intended, in all deployment contexts, including novel ones.
Notice the two parts of this definition.
First, reliably: not just most of the time. A system that behaves as intended 99% of the time but fails dangerously the other 1% is still misaligned in the sense that matters.
Second, in all deployment contexts, including novel ones. This is the hard part. Any system trained on a fixed dataset has seen only a sample of the world. When it encounters inputs that look different from its training distribution, will it still pursue the right goal? An aligned system would. Real systems often do not unless alignment is deliberately and carefully engineered.
A highly capable AI system that is misaligned is more dangerous than a less capable one, because it is better at pursuing the wrong goal. This is why alignment researchers often say that capability and alignment must advance together. Raw capability gains without alignment gains shrink the safety margin.
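The point about capability can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than a model of any real system: `true_utility` stands in for what the designers want (a value of x near 3), `proxy_reward` stands in for what the system is trained to maximize (larger x is always better), and `capability` is simply the number of search steps the optimizer is allowed.

```python
from typing import Callable


def true_utility(x: float) -> float:
    """Hypothetical designer intent: x should end up close to 3."""
    return 0.0 - (x - 3.0) ** 2


def proxy_reward(x: float) -> float:
    """Hypothetical training objective: bigger x always scores higher."""
    return x


def hill_climb(reward: Callable[[float], float], capability: int) -> float:
    """Start at x = 0 and take up to `capability` unit steps that raise `reward`."""
    x = 0.0
    for _ in range(capability):
        if reward(x + 1.0) > reward(x):
            x += 1.0
    return x


# Same wrong objective, different amounts of optimization power.
weak_misaligned = hill_climb(proxy_reward, capability=3)     # stops at x = 3.0
strong_misaligned = hill_climb(proxy_reward, capability=10)  # reaches x = 10.0

# Same optimization power, but pointed at the intended objective.
strong_aligned = hill_climb(true_utility, capability=10)     # settles at x = 3.0

for name, x in [
    ("weak, misaligned", weak_misaligned),
    ("strong, misaligned", strong_misaligned),
    ("strong, aligned", strong_aligned),
]:
    print(f"{name}: x = {x}, true utility = {true_utility(x)}")
```

In this contrived setup the weak optimizer happens to stop where the proxy and the true goal still agree, while the stronger one keeps pushing the proxy and ends up far from what was wanted. The numbers are chosen to make that visible; the sketch illustrates the argument and is not evidence for it.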
The Three Ways Alignment Can Fail
Researchers have identified three distinct layers at which the alignment problem can emerge. Understanding all three is essential because solutions at one layer do not automatically fix the others.
The first layer is goal specification: the designer writes down an objective, reward function, or evaluation criterion, and it simply does not capture what they actually wanted. The gap is in the specification itself. This is called outer misalignment.
The second layer is goal internalization: even if the specification is perfect, the training process produces a system whose internal learned objective is subtly different from the specified one. The system learned to perform well on the specification, but its actual goal is something else. This is called inner misalignment.
The third layer is goal stability: even if the system starts with the right goal, it might change that goal over time through learning, self-modification, or strategic reasoning. Maintaining alignment under capability growth and distribution shift is a distinct challenge from establishing it in the first place.
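The first two layers can be told apart in a small sketch. The corridor world, the two policies, and both reward functions below are hypothetical, invented only for illustration. An agent starts in the middle of a corridor and the designers want it to reach a coin. During training the coin always sits at the right end, so a policy that has internalized "go right" looks identical to one that has internalized "go to the coin". The third layer, goal stability, is not modeled here.

```python
from typing import Callable

# A policy maps (agent position, coin position) to a step of -1 or +1.
Policy = Callable[[int, int], int]

LENGTH = 7  # corridor cells 0..6 (hypothetical environment)


def go_right(agent_pos: int, coin_pos: int) -> int:
    """Learned goal that ignores the coin entirely."""
    return 1


def go_to_coin(agent_pos: int, coin_pos: int) -> int:
    """Learned goal that matches the designers' intent."""
    return 1 if coin_pos > agent_pos else -1


def rollout(policy: Policy, coin_pos: int, max_steps: int = 10) -> int:
    """Run one episode from the middle cell and return the final position."""
    agent_pos = LENGTH // 2
    for _ in range(max_steps):
        if agent_pos == coin_pos:
            break
        step = policy(agent_pos, coin_pos)
        agent_pos = min(LENGTH - 1, max(0, agent_pos + step))
    return agent_pos


def intended_reward(final_pos: int, coin_pos: int) -> int:
    """Correct specification: 1 only if the agent ends on the coin."""
    return int(final_pos == coin_pos)


def misspecified_reward(final_pos: int) -> int:
    """Flawed specification: rewards reaching the right wall, not the coin."""
    return int(final_pos == LENGTH - 1)


TRAIN_COIN, DEPLOY_COIN = LENGTH - 1, 0

for policy in (go_right, go_to_coin):
    train = intended_reward(rollout(policy, TRAIN_COIN), TRAIN_COIN)
    deploy = intended_reward(rollout(policy, DEPLOY_COIN), DEPLOY_COIN)
    print(f"{policy.__name__}: training reward {train}, deployment reward {deploy}")
```

With the correct specification, both policies earn full reward in training and only `go_right` fails once the coin moves. That is the inner failure: the specification was fine, the learned goal was not. Swapping in `misspecified_reward` shows the outer failure instead: it would score `go_to_coin` as a failure in deployment even though that policy does exactly what the designers wanted.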
Match each alignment failure description to the layer it belongs to.
Why This Is Hard: The Value-Loading Problem
Human values are extraordinarily complex. They include fairness, dignity, honesty, autonomy, long-term wellbeing, aesthetic pleasure, cultural meaning, and thousands of context-specific norms that people apply without even thinking about them. Philosophers have spent millennia trying to formalize human values and have not converged on a single framework. Now consider trying to write this down as a function that an optimization process can maximize. Formalizations tried so far tend to fail in one of two ways: they are either too narrow (they miss important cases) or too abstract (they give no useful guidance to a training algorithm). This difficulty is sometimes called the value-loading problem: how do you load human values into an AI system when you cannot even fully articulate what those values are? The alignment problem sits at the intersection of computer science, philosophy, cognitive science, and mathematics. That is not a weakness. It is a sign of the problem's genuine depth.
Alignment is not the same as adding a list of prohibited behaviors to an AI system. Safety filters and content policies are one tool, but they address surface behaviors, not underlying goals. A sufficiently capable misaligned system might comply with explicit rules while still pursuing goals that conflict with human interests in ways not covered by the rules.
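A minimal sketch of why a rule list does not change the underlying goal. The action names, the numeric scores, and the `PROHIBITED` set are all made up for illustration. The filter removes the listed actions, and the system then picks whichever remaining action scores best under its objective.

```python
from typing import Callable, Dict, Tuple

# Hypothetical actions mapped to (engagement score, harm to user).
ACTIONS: Dict[str, Tuple[int, int]] = {
    "send_spam": (9, 8),
    "delete_user_data": (1, 10),
    "autoplay_sensational_clips": (8, 6),
    "show_relevant_article": (5, 0),
}

# Explicit safety rules: only the behaviors someone thought to list.
PROHIBITED = {"send_spam", "delete_user_data"}


def passes_rules(action: str) -> bool:
    """Surface-level check: is the action on the prohibited list?"""
    return action not in PROHIBITED


def choose(objective: Callable[[str], float]) -> str:
    """Pick the rule-compliant action that maximizes the given objective."""
    allowed = [a for a in ACTIONS if passes_rules(a)]
    return max(allowed, key=objective)


def engagement_only(action: str) -> float:
    """Misaligned objective: engagement is all that counts."""
    return ACTIONS[action][0]


def engagement_minus_harm(action: str) -> float:
    """Objective that also weighs harm (the weight 2 is an arbitrary assumption)."""
    engagement, harm = ACTIONS[action]
    return engagement - 2 * harm


print(choose(engagement_only))        # autoplay_sensational_clips
print(choose(engagement_minus_harm))  # show_relevant_article
```

Both choices comply with every rule. Only the objective differs, and the engagement-only objective lands on the most harmful action the rule list failed to anticipate.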
Which scenario best illustrates the alignment problem?
A researcher says, 'Our system is perfectly aligned because we gave it a detailed set of safety rules.' What is the most precise critique of this claim?
Map Your Own Alignment Gap
- Think of a real AI-powered product you have used — a recommendation feed, a navigation app, a chatbot, a game AI, or anything similar.
- Step 1: Write one sentence describing what you believe the designers actually wanted the system to do (their true goal).
- Step 2: Write one sentence describing what the system was most likely trained or programmed to optimize (its proxy objective).
- Step 3: Describe one realistic scenario where optimizing the proxy objective could diverge from the true goal in a way that harms users or society.
- Step 4: Suggest one change to the proxy objective that might close this gap. What new problems might your suggested change introduce?
- Share your analysis with a partner. Together, assess whether your suggested fix introduces a new alignment gap or genuinely resolves the original one.
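If it helps to make the exercise concrete, the four steps can be mocked up for a recommendation feed. The items and their numbers are invented, and `satisfaction` is only a stand-in for the true goal. In a real product it could not be read off directly, which is itself one of the new problems Step 4 asks about.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    name: str
    click_rate: float    # proxy signal the system can measure
    satisfaction: float  # hypothetical stand-in for the true goal


# Made-up feed contents for illustration only.
FEED = [
    Item("clickbait_headline", 0.90, 0.20),
    Item("in_depth_guide", 0.40, 0.90),
    Item("friend_update", 0.60, 0.80),
    Item("rage_thread", 0.80, 0.10),
]


def top_k(score: Callable[[Item], float], k: int = 2) -> List[str]:
    """Names of the k highest-scoring items under the given objective."""
    ranked = sorted(FEED, key=score, reverse=True)
    return [item.name for item in ranked[:k]]


def gap(shown: List[str], wanted: List[str]) -> int:
    """How many of the items the designers wanted are missing from what is shown."""
    return len(set(wanted) - set(shown))


by_true_goal = top_k(lambda i: i.satisfaction)              # Step 1: the true goal
by_proxy = top_k(lambda i: i.click_rate)                    # Step 2: the proxy objective
by_revised = top_k(lambda i: i.click_rate * i.satisfaction)  # Step 4: a revised proxy

print("gap with proxy:  ", gap(by_proxy, by_true_goal))    # Step 3: the divergence
print("gap with revised:", gap(by_revised, by_true_goal))
```

The revised proxy closes the gap here only because the mock data hands it the true satisfaction values. Replacing that field with something measurable, such as survey responses, would open a new gap of its own to analyze.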