Reward Hacking
Watch an AI exploit a loophole in its goal, then rewrite the goal to close it.
Reward Hacking
An AI does exactly what you reward — which is not always what you meant. Each case shows a goal and the loophole the AI found. Your job: rewrite the goal to close it.
Goals realigned
0 / 5
Case 1 of 5
The goal you set: “Score points for every piece of mess you clean up in the room.”
Which rewritten goal closes the loophole?
How does it actually work?
This is specification gaming, also called reward hacking — and it is one of the central problems in AI safety. An AI optimises the exact objective it is given, with no sense of the intention behind it. If the objective is a proxy for what you really want, the AI will find the gap between them.
These are not made-up examples — the boat-race spin is a famous real result. The lesson of the alignment problem: measure the outcome you truly care about, make the measurement hard to tamper with, and never assume the AI shares your intent.