Defining and Measuring AGI
Before you can build something, you have to know what you are building. Before you can measure progress toward a goal, you have to define the goal precisely enough to tell when you have arrived. For artificial general intelligence, both of these prerequisites turn out to be profoundly difficult. Researchers disagree not just about whether AGI is achievable or near, but about what AGI even means, and that disagreement is not semantic hair-splitting. Different definitions imply different research programs, different safety concerns, and different criteria for declaring success or failure. Understanding why AGI is hard to define is itself a substantive intellectual accomplishment.
A Taxonomy of AGI Definitions
The term AGI is used in several distinct ways across the field, and conflating them produces confusion.
The task-based definition: AGI is a system that can perform any cognitive task a human can perform, at human level or better. This is appealing in its simplicity but raises immediate questions. Which tasks: all of them, or most? Does 'human level' mean the average human, the best human, or something else? Is a system that can perform every cognitive task except playing the piano AGI or not?
The economic productivity definition: AGI is a system that can do the work of a highly skilled human across the entire economy, given appropriate tools and access. This is more concrete and practically oriented, because it defines AGI by its labor-market impact. OpenAI uses something like this definition in its public discussions. The advantage is measurability: you could in principle track what fraction of economic activity AI can perform. The disadvantage is that 'economic productivity' is a moving target that depends on infrastructure, law, and social acceptance, not just raw capability.
The general learning definition: AGI is a system that, given a new task and sufficient experience with that task, can learn to perform it at human level. This emphasizes flexibility and generalization rather than current capability breadth. A system that knows nothing about chess but can become a grandmaster-level player after reasonable training would qualify, even if it starts from scratch. This definition is arguably the most scientifically rigorous, because it captures the essence of general intelligence as general learnability.
The self-improvement definition: AGI is a system that can improve its own cognitive capabilities, potentially triggering a feedback loop of recursive enhancement. This definition is central to certain safety concerns and goes beyond task performance to include meta-cognitive capabilities.
If you define AGI as task-based, you study which tasks AI cannot yet do and try to close those gaps. If you define it as general-learning, you study sample efficiency and meta-learning. If you define it by economic impact, you study deployment, infrastructure, and adoption. Same word, different research programs.
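The measurability claim behind the economic productivity definition can be made concrete with a small sketch. Everything in it is an illustrative assumption: the `Task` fields, the task names, the scores, and the labor shares are invented for the example and do not come from any real occupational dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    labor_share: float  # hypothetical share of total labor value
    ai_score: float     # hypothetical AI performance on a 0..1 scale
    human_score: float  # hypothetical skilled-human baseline on a 0..1 scale


def coverage(tasks: list[Task], weighted: bool = True) -> float:
    """Fraction of tasks where AI meets the human baseline.

    With weighted=True each task counts by its labor share;
    otherwise every task counts equally.
    """
    weights = [t.labor_share if weighted else 1.0 for t in tasks]
    total = sum(weights)
    if total <= 0:
        raise ValueError("tasks must carry a positive total weight")
    covered = sum(
        w for w, t in zip(weights, tasks) if t.ai_score >= t.human_score
    )
    return covered / total


# Invented example data, for illustration only.
TASKS = [
    Task("draft contract summary", 0.25, 0.90, 0.80),
    Task("repair plumbing", 0.50, 0.10, 0.90),
    Task("write unit tests", 0.25, 0.85, 0.80),
]

WEIGHTED_COVERAGE = coverage(TASKS)                    # counts by labor share
UNWEIGHTED_COVERAGE = coverage(TASKS, weighted=False)  # counts tasks equally
```

The two figures differ for the same system (one half versus two thirds here), which illustrates why a number like this depends on how the task list is built and weighted, not only on raw capability.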
The Measurement Problem
Even with a clear definition, measuring progress toward AGI is exceptionally difficult.
Benchmark saturation: AI researchers have historically used standardized benchmarks to measure progress. The problem is that AI systems saturate benchmarks, meaning they achieve near-perfect scores, faster than new benchmarks can be designed. The ImageNet classification benchmark was saturated within a few years of the deep learning revolution. Many language benchmarks followed. When a benchmark is saturated, it no longer measures what it was designed to measure. This is not a minor technical inconvenience. It reflects a deeper problem: human-designed benchmarks inevitably encode specific patterns that AI can match without possessing the underlying capability the benchmark was designed to test.
The generalization gap: a system that scores 90% on a reading comprehension benchmark may fail when the same questions are rephrased slightly, or when background knowledge that humans take for granted is not present in the text. High benchmark scores do not reliably indicate that the underlying capability is robust and generalizable. A true measure of AGI would need to assess performance on tasks the system has never seen in any form, under conditions that cannot be anticipated during evaluation design.
The Turing Test and its problems: Alan Turing proposed in 1950 that a machine able to sustain a conversation indistinguishable from a human's should be considered intelligent. Modern large language models can pass many versions of this test. Yet most researchers do not conclude that LLMs are generally intelligent, because the test measures conversational fluency rather than the broad cognitive generality the AGI concept requires. A test can be passed without the underlying competence it was designed to detect.
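The generalization gap described above can be quantified by scoring the same system on paired versions of each question, one in the original wording and one rephrased. The sketch below is a minimal illustration under stated assumptions: the function names, the exact-match scoring rule, and the toy answers are all hypothetical, and real evaluations usually need more careful answer matching.

```python
def is_correct(prediction: str, answer: str) -> bool:
    """Exact match after trimming whitespace and lowercasing (an assumed rule)."""
    return prediction.strip().lower() == answer.strip().lower()


def robustness_report(
    original_preds: list[str],
    rephrased_preds: list[str],
    answers: list[str],
) -> dict[str, float]:
    """Compare accuracy on original questions against rephrased ones."""
    if not answers or not (
        len(original_preds) == len(rephrased_preds) == len(answers)
    ):
        raise ValueError("inputs must be non-empty and of equal length")

    orig_hits = [is_correct(p, a) for p, a in zip(original_preds, answers)]
    reph_hits = [is_correct(p, a) for p, a in zip(rephrased_preds, answers)]

    original = sum(orig_hits) / len(answers)
    rephrased = sum(reph_hits) / len(answers)

    # Items answered correctly in the original wording but missed when rephrased.
    flipped = sum(o and not r for o, r in zip(orig_hits, reph_hits))
    solved = sum(orig_hits)

    return {
        "original": original,
        "rephrased": rephrased,
        "gap": original - rephrased,
        "flip_rate": flipped / solved if solved else 0.0,
    }


# Toy data: the same four questions asked in two wordings.
ANSWERS = ["paris", "4", "blue", "oxygen"]
ORIGINAL_PREDS = ["Paris", "4", "blue", "Oxygen"]
REPHRASED_PREDS = ["Paris", "5", "blue", "nitrogen"]

REPORT = robustness_report(ORIGINAL_PREDS, REPHRASED_PREDS, ANSWERS)
```

A large `gap` or `flip_rate` suggests the original score leaned on surface features of the question wording. A small gap is weaker evidence in the other direction, since rephrasing is only one of many possible perturbations.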
ARC-AGI and successor benchmarks: in 2019, researcher François Chollet proposed the Abstraction and Reasoning Corpus (ARC) as a test of general fluid intelligence, meaning the ability to recognize novel patterns from very few examples while relying on little prior knowledge beyond basic concepts such as objects and counting. For years after its release, AI systems scored far below human level on this benchmark even as they achieved superhuman performance on many others, suggesting ARC captures something qualitatively different. This is promising as a principled approach, though debate continues about whether any benchmark can fully operationalize the AGI concept.
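To make the few-example setting concrete, here is a toy sketch of an ARC-style task and a deliberately naive solver. The task layout (a "train" list and a "test" list of input/output grids) follows the general shape of the public ARC task files as this sketch understands it. The grids, the candidate transformations, and the function names are invented for illustration and do not represent Chollet's method or any real system.

```python
from typing import Callable, Optional

Grid = list[list[int]]


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


# A tiny, fixed hypothesis space (hypothetical, chosen for the example).
CANDIDATES: dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(train_pairs: list[dict]) -> Optional[str]:
    """Return the first candidate consistent with every demonstration pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in train_pairs):
            return name
    return None


def solve(task: dict) -> Optional[list[Grid]]:
    """Apply the inferred rule to each test input, or give up if none fits."""
    rule = infer_rule(task["train"])
    if rule is None:
        return None
    return [CANDIDATES[rule](item["input"]) for item in task["test"]]


# Invented task: two demonstrations, one held-out test grid.
TASK = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]}],
}

INFERRED = infer_rule(TASK["train"])
PREDICTIONS = solve(TASK)
# Scoring is all-or-nothing: every predicted grid must match exactly.
SOLVED = PREDICTIONS == [item["output"] for item in TASK["test"]]
```

The solver succeeds here only because the hidden rule happens to be in its fixed candidate list. A task whose rule lies outside that list returns `None`, which is the crux of the benchmark: each task is meant to require a rule the system has not been given in advance.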
The Consciousness Question
Some researchers and philosophers argue that any operational definition of AGI is incomplete unless it addresses consciousness: subjective experience, awareness, or what philosophers call 'qualia.' On this view, a philosophical zombie (a system that behaves exactly like a conscious agent but has no inner experience) would pass any behavioral test for AGI while not being truly intelligent in the morally relevant sense. This position has significant implications. If consciousness is necessary for 'real' AGI, and if consciousness cannot be detected from outside (a version of the problem of other minds, closely tied to the 'hard problem of consciousness'), then no behavioral or economic test could ever definitively establish AGI. The question would remain permanently open.
Most AI researchers working on practical systems set the consciousness question aside and focus on behavioral capabilities, not inner experience. But the question matters when evaluating claims about whether AI systems have interests, rights, or moral status. These are not merely philosophical puzzles; they are questions that societies will face increasingly as AI systems become more capable and more integrated into daily life.
Any time AI achieves a high score on a well-publicized benchmark, ask: does this score reflect the underlying capability the benchmark was designed to measure, or has the system found a shortcut — a pattern in the test format that allows high scores without the underlying competence? The history of AI benchmarks is full of examples of the latter.
A research team proposes measuring AGI progress by tracking what percentage of tasks listed in the O*NET occupational database AI can perform at median human expert level. Which definition of AGI are they implicitly using?
An AI system scores 98% on a reading comprehension benchmark. Researchers then create a slightly rephrased version of the same questions and the system's score drops to 61%. What does this most likely reveal?
Design an AGI Test
- Your challenge: design a test that would provide strong evidence that a system has achieved AGI, while being resistant to the benchmark-gaming problems discussed in this lesson.
- Step 1: Choose one definition of AGI (task-based, economic, general learning, or self-improvement) and state it explicitly as your target.
- Step 2: Describe the test you would administer: what tasks, in what conditions, evaluated by whom.
- Step 3: For each component of your test, identify one way a sophisticated AI system might score well without actually possessing the capability you intend to measure.
- Step 4: Revise your test to close at least one of those loopholes.
- Step 5: Write a one-paragraph argument for why your revised test is better than the Turing Test as a measure of AGI.
- Discuss: Is it possible to design a test that is immune to all gaming? What does your answer imply about the measurability of AGI?