Capability Assessment
AI capability claims arrive constantly. A company announces that its new model 'surpasses human experts' in a domain. A research paper claims a system has 'achieved superhuman performance' on a benchmark. A news headline declares that AI can now 'do X' — where X is something previously thought to require human intelligence. These claims range from carefully evidenced and honestly scoped to misleading, premature, or simply wrong. Being able to assess them rigorously — to distinguish genuine capability from hype, to understand what the evidence actually shows, and to identify the gap between benchmark performance and real-world reliability — is one of the most practically valuable skills you can develop as an AI-literate person.
A Framework for Assessing Capability Claims
When you encounter a claim that an AI system can do something, apply this five-part framework.
One: What exactly is the claim? Strip it of vague language. 'AI surpasses doctors' becomes: which AI system, on which specific task, with which patient population, compared to which doctors, under which conditions? Precision exposes what the claim is actually asserting.
Two: What is the evidence? Was this demonstrated on a benchmark, a real-world deployment, a controlled experiment, or an anecdote? Benchmark evidence is more standardized but subject to contamination and saturation. Real-world deployment evidence is more ecologically valid but harder to interpret. Anecdotal evidence (cherry-picked impressive outputs) is the weakest form.
Three: What is the evaluation's scope and setting? Does the benchmark match the real task? A model that reads radiology images accurately in a controlled study may perform differently on a hospital's actual scanner images, with their real noise, compression artifacts, and unusual presentations. The distributional match between evaluation and deployment matters enormously.
Four: What are the failure modes? Even systems that perform well on average fail on specific inputs. What kinds of inputs does the system struggle with? Are those inputs common in real deployment? A system that fails 5% of the time on average but fails 40% of the time on rare high-stakes cases is not safe for high-stakes deployment.
Five: What is the reliability gap? Capability (can it ever do this?) and reliability (does it always do this correctly and safely?) are different. A system that generates a correct answer 80% of the time is capable but not reliable enough for applications where errors are costly.
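The arithmetic behind parts four and five can be sketched in a few lines of Python. The 5%, 40%, and 80% figures come from the framework above; the slice sizes and failure counts are hypothetical numbers chosen so that the aggregate works out to 5%, and treating answers as independent is a simplifying assumption rather than a property of any real system.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """One category of inputs in a hypothetical evaluation set."""
    name: str
    total: int     # number of evaluated inputs in this slice
    failures: int  # number of inputs the system got wrong

    @property
    def failure_rate(self) -> float:
        return self.failures / self.total


def overall_failure_rate(slices: list[Slice]) -> float:
    """Aggregate failure rate across all slices (what a headline number reports)."""
    total = sum(s.total for s in slices)
    failures = sum(s.failures for s in slices)
    return failures / total


def prob_all_correct(per_answer_accuracy: float, n_answers: int) -> float:
    """Chance that n answers are ALL correct, assuming independent errors."""
    return per_answer_accuracy ** n_answers


# Hypothetical counts: the rare slice is only 3% of inputs,
# so its 40% failure rate barely moves the average.
slices = [
    Slice("routine", total=9700, failures=380),
    Slice("rare high-stakes", total=300, failures=120),
]

print(f"overall failure rate: {overall_failure_rate(slices):.1%}")
for s in slices:
    print(f"  {s.name}: {s.failure_rate:.1%}")

# Reliability gap: 80% per-answer accuracy compounds quickly over a workflow.
for n in (1, 5, 10):
    print(f"all {n} answers correct: {prob_all_correct(0.8, n):.1%}")
```

Under these assumptions the aggregate figure looks acceptable while the high-stakes slice does not, and a system that is right 80% of the time gets ten answers in a row right only about 11% of the time. Real errors are often correlated, so treat the compounding figure as an illustration of the gap, not a prediction.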
A single impressive demo does not establish reliable capability. AI demos are typically cherry-picked: the best outputs are selected for presentation. The relevant question is not 'can this system produce an impressive output?' but 'what is the distribution of outputs across the full range of inputs it will encounter in deployment?' Average and tail performance both matter, and tail failures — rare but severe errors — are often the most important to understand.
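To see why a cherry-picked demo says so little, consider how often at least one impressive output turns up when many attempts are made and only the best is shown. The sketch below uses a hypothetical per-attempt success rate of 5% and assumes attempts are independent; neither number describes any particular system.

```python
def prob_at_least_one_success(p_success: float, k_attempts: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** k_attempts


# Hypothetical system that produces an impressive output on 5% of attempts.
p = 0.05
for k in (1, 10, 50, 100):
    chance = prob_at_least_one_success(p, k)
    print(f"{k:>3} attempts -> at least one impressive output: {chance:.1%}")
```

With 50 attempts, a system that succeeds only 5% of the time still yields a showcase example more than 90% of the time. The demo is therefore compatible with both a strong system and a weak one, which is why the distribution of outputs matters more than the best one.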
Common Patterns of Capability Misrepresentation
Understanding how capability claims go wrong helps you spot problems faster. Several patterns recur.
Benchmark mismatch: the cited benchmark does not measure what the claim says it measures. A model that achieves high accuracy on a reading comprehension benchmark written by NLP researchers may struggle on reading comprehension tasks as medical professionals or lawyers actually encounter them.
Unfair baseline comparison: 'better than human' claims often compare to a narrow or unrepresentative set of humans, such as novice doctors, time-pressured nurses, or students. The same model may underperform specialists with adequate time, access to context, and the ability to ask follow-up questions.
Capability without context: demonstrating that a system can generate a correct answer in isolation does not mean it will generate correct answers when integrated into a real workflow with real users, real noise, and real edge cases. The system exists in a context, and context changes performance.
Equivocation on 'human-level': 'human-level' means many different things in different papers. Sometimes it means 'within the range of individual human variability,' which is a low bar. Sometimes it means 'matching the best human expert under controlled conditions,' which is a high bar. The phrase itself is almost meaningless without specification.
Assess a Frontier Capability Claim
- This is the core activity for Lesson 9. You will apply the five-part capability assessment framework to a real frontier AI capability claim.
- Step 1 — Choose your claim. Select one of the following (or find a recent one from a news source or research paper):
  - Option A: 'GPT-4 passed the bar exam at a score in the 90th percentile of human test-takers.'
  - Option B: 'AlphaFold solved the protein folding problem.'
  - Option C: 'AI agents can now autonomously resolve over 50% of real software engineering tasks on SWE-bench.'
  - Option D: A claim of your own choosing from a recent AI announcement. Find a specific, checkable claim.
- Step 2 — Apply the framework. For your chosen claim, write a structured assessment addressing each of the five dimensions:
  1. What exactly is the claim? (Restate it with full precision: system, task, conditions, comparison.)
  2. What is the evidence? (Benchmark, deployment, experiment, anecdote? Source?)
  3. What is the evaluation's scope and setting? (Does it match real-world use?)
  4. What are the failure modes? (What kinds of inputs does the system struggle with?)
  5. What is the reliability gap? (Capability vs. consistent real-world reliability.)
- Step 3 — Verdict. Write two to three sentences concluding: Is the claim well-evidenced and appropriately scoped? Is it overstated? Is it understated? What would you need to see to fully accept or reject it?
- Step 4 — Peer review. Exchange assessments with another student. Evaluate their assessment: Did they apply each framework dimension rigorously? Did they identify failure modes you missed? Do you agree with their verdict? Write two sentences of feedback.
- Step 5 — Class discussion. Share your claim and verdict. Together, identify: Are some claims systematically overstated? What kinds of evidence are most and least common? What does this suggest about how to consume AI news critically?
- This activity is the synthesis of the entire module so far. Take it seriously — apply everything you have learned about how capability is demonstrated, measured, and limited.
Good sources for real AI capability claims: arXiv.org (research preprints), the official blogs of Anthropic, OpenAI, Google DeepMind, and Meta AI, and technology journalism at venues like MIT Technology Review, The Verge, and Ars Technica. The research papers themselves are more reliable than the headlines about them, so read both and compare what each one actually claims.
A company announces: 'Our AI passed the medical licensing exam at the 75th percentile of human test-takers.' Which question is most important to ask before accepting this as evidence of strong medical capability?
A researcher shows that a new model generates a correct answer on a hard reasoning problem that previous models got wrong. A skeptic says 'that is just one example.' Why is the skeptic's objection methodologically valid?