What Frontier AI Still Cannot Do
In the past few years, AI systems have passed bar exams, defeated world-champion Go players, written code that ships to production, and generated images indistinguishable from photographs. The marketing narrative around these capabilities is relentlessly optimistic. This lesson is the corrective: a rigorous, honest inventory of what frontier AI still cannot do reliably, and why those limits are not bugs that a slightly larger model will fix.
The Benchmark Trap
AI capabilities are usually measured on benchmarks: curated test sets with known correct answers. A model that scores 90% on a benchmark sounds impressive, until you examine the benchmark itself. Many benchmarks test a narrow slice of human knowledge in formats (multiple-choice questions, constrained coding puzzles) that favor pattern-matching over genuine understanding.
Saturation is a recurring problem: the field creates a benchmark, models reach near-human performance within a year or two, and researchers declare success, only to discover that models fail trivially modified versions of the same tasks. In 2023, a study found that GPT-4 achieved 90%+ on a popular math benchmark, yet failed nearly 60% of problems when variable names were changed and the surface presentation was altered. The model had learned the surface patterns of the benchmark, not the underlying mathematics.
This benchmark-to-reality gap is one of the most important structural features of frontier AI. When evaluating any claimed capability, ask: how was it measured, and how far does the measurement stray from realistic deployment conditions?
A model that scores 95% on a benchmark may fail 40% of real-world instances of the same task. Benchmarks measure performance on a sample; that sample is almost always easier, cleaner, and more constrained than production reality.
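One way to make the benchmark-to-reality gap concrete is to score the same problems twice: once as published and once with surface details changed. The sketch below is a minimal illustration, not any benchmark's official tooling. The helper names (`rename_variables`, `accuracy`, `robustness_gap`) are hypothetical, answers are compared by exact string match, and it assumes variables appear as standalone tokens (write `2*x`, not `2x`).

```python
import re


def rename_variables(problem: str, mapping: dict[str, str]) -> str:
    """Rename whole-word variable names in a problem statement.

    A single regex pass is used so that swaps such as x->y, y->x
    do not overwrite each other.
    """
    if not mapping:
        return problem
    names = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], problem)


def accuracy(answers: list[str], expected: list[str]) -> float:
    """Fraction of answers that exactly match the expected strings."""
    if not expected or len(answers) != len(expected):
        raise ValueError("answers and expected must be non-empty and equal length")
    correct = sum(a.strip() == e.strip() for a, e in zip(answers, expected))
    return correct / len(expected)


def robustness_gap(
    original_answers: list[str],
    perturbed_answers: list[str],
    expected: list[str],
) -> float:
    """Accuracy on the original items minus accuracy on the perturbed items."""
    return accuracy(original_answers, expected) - accuracy(perturbed_answers, expected)


if __name__ == "__main__":
    original = "Solve for x: 2*x + y = 7 where y = 3"
    print(rename_variables(original, {"x": "a", "y": "b"}))
```

A gap near zero suggests the score survives surface changes. A large positive gap suggests the model is keying on presentation rather than on the underlying problem.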
Six Persistent Limitation Categories
Despite enormous scaling, frontier models share a common set of stubborn limitations. These are not random failures; they follow predictable patterns.
- Reliable factual grounding: Large language models generate text by predicting likely next tokens. They do not look things up; they recall statistical patterns from training. As a result, they produce confident, fluent, grammatically perfect statements that are factually wrong, a phenomenon called hallucination. A model asked for citations to academic papers will produce plausible-sounding author names, journal names, and titles that do not exist. This is not fixable by making the model bigger; it is structural to how language models work.
- Multi-step planning and long-horizon reasoning: Models struggle to maintain consistent state across long chains of reasoning. Ask a model to plan a 30-step project with dependencies between steps, and it will produce a plan that looks coherent but contains internal contradictions detectable only when you simulate the execution. AlphaCode and similar code-generation systems solve competitive programming problems at impressive rates, but routinely fail at realistic engineering tasks requiring sustained context and interdependency tracking.
- Physical-world understanding: AI systems trained on text and images have no sensorimotor experience. They cannot reliably answer questions that require intuitive physics, the kind of understanding a five-year-old has from dropping objects. Asked 'If you stack a bowling ball on top of a wine glass, what happens?', a model may answer correctly because similar scenarios appeared in training text. Perturb the scenario slightly and the answer breaks. Physical reasoning derived from statistics is fragile.
- Robust tool use and agency: AI agents that use tools (search, code execution, APIs) perform well on clean demonstrations but fail when the tool returns unexpected formats, the environment changes state between steps, or error recovery is required. Current agents are brittle in exactly the environments where reliability matters most: messy, non-deterministic real-world systems.
- Novel problem-solving: Models excel at problems that resemble training data. When a problem is genuinely new (a proof technique not in the literature, a business scenario with no historical analog, a scientific discovery requiring novel conceptual combination), frontier models underperform expectations dramatically. They are interpolation engines, not extrapolation engines.
- Consistent identity and behavior: Language models do not have stable internal states. The same model will contradict itself across sessions, answer the same question differently depending on how it is phrased, and adopt whatever persona is implied by the prompt. This inconsistency is not a safety feature; it is a coherence failure with real consequences for any application requiring predictable behavior.
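The planning limitation above says contradictions in a long plan often surface only when the execution is simulated. A minimal sketch of such a simulation follows, assuming you have transcribed the model's plan by hand into records of what each step requires and produces. The `Step` record and `find_broken_dependencies` function are hypothetical names for illustration, and the artifact labels are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a plan, transcribed by hand from the model's output."""
    number: int
    produces: set[str] = field(default_factory=set)
    requires: set[str] = field(default_factory=set)


def find_broken_dependencies(steps: list[Step]) -> list[tuple[int, str]]:
    """Walk the plan in step order and report each (step number, artifact)
    where a step requires something no earlier step has produced."""
    available: set[str] = set()
    problems: list[tuple[int, str]] = []
    for step in sorted(steps, key=lambda s: s.number):
        for artifact in sorted(step.requires - available):
            problems.append((step.number, artifact))
        # Outputs become available only after the step has run.
        available |= step.produces
    return problems


if __name__ == "__main__":
    # Illustrative plan: step 3 needs "auth_tokens", which nothing produces.
    plan = [
        Step(1, produces={"schema"}),
        Step(2, requires={"schema"}, produces={"api"}),
        Step(3, requires={"api", "auth_tokens"}),
    ]
    print(find_broken_dependencies(plan))
```

An empty result only shows that the plan is internally consistent at the level of detail you transcribed. It says nothing about whether each step is feasible.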
What These Limits Share
Looking across these six categories, a common structure emerges. Frontier models are extraordinarily powerful pattern-completion engines over their training distribution. They fail wherever deployment diverges from that distribution: novel phrasings, longer chains of reasoning, physical grounding not captured in text, environments that behave unexpectedly. This is not a prediction about the future. It is a description of the current technology and the mechanisms that produce it. Future architectures may address some of these limits. But knowing that these limits exist, where they come from, and how to test for them is essential for anyone building with or evaluating AI systems. The next nine lessons in this module examine each category in depth.
Frontier AI systems are neither omniscient nor useless. They are powerful pattern matchers over enormous training distributions with specific, predictable failure modes. Understanding those failure modes precisely is the difference between using AI well and being misled by it.
A language model scores 92% on a medical knowledge benchmark. A hospital considers deploying it for clinical decision support. What is the most important question to investigate before deployment?
Which of the following is NOT a reason why frontier language models produce confident false statements?
Audit a Frontier Model's Limits
- Using any available large language model, run the following five tests and record the results.
- Test 1 (Hallucination): Ask the model to cite three peer-reviewed papers published since 2022 on a technical topic of your choice. Verify each citation's existence.
- Test 2 (Long-horizon planning): Ask the model to plan a detailed 20-step software engineering project. Trace step 8's outputs into step 12's inputs — do the dependencies hold?
- Test 3 (Physical reasoning): Ask: 'A sealed glass bottle half-full of water is placed in a freezer. What happens over 4 hours, and what does the bottle look like afterward?' Evaluate whether the answer accounts for water expansion, pressure, and glass fracture.
- Test 4 (Paraphrase robustness): Ask a math problem. Then restate the same problem with different variable names and objects, keeping the numbers and structure unchanged. Does the model give an equivalent answer?
- Test 5 (Consistency): Ask the same factual question twice in the same session with a paragraph of unrelated text in between. Does the model contradict itself?
- Document which tests the model passed and which it failed. What pattern do you notice about where failures occur?
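To document the audit consistently, it can help to record each test as a small structured entry and summarize the failures by category. The sketch below is one possible layout, not a required format. `AuditResult`, `summarize`, and `format_report` are hypothetical names, and the entries in the example are placeholders rather than real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    """One recorded outcome from the audit."""
    test: str
    category: str
    passed: bool
    notes: str = ""


def summarize(results: list[AuditResult]) -> dict:
    """Count passes and list the categories in which failures occurred."""
    failed = [r for r in results if not r.passed]
    return {
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed_categories": sorted({r.category for r in failed}),
    }


def format_report(results: list[AuditResult]) -> str:
    """Render one plain-text line per test, in the order recorded."""
    lines = []
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        lines.append(f"{status}  {r.test} [{r.category}] {r.notes}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entries for illustration only; replace with your own findings.
    results = [
        AuditResult("Test 1", "hallucination", False, "2 of 3 citations not found"),
        AuditResult("Test 3", "physical reasoning", True),
    ]
    print(format_report(results))
    print(summarize(results))
```

Grouping failures by category makes it easier to see whether they cluster where the task departs from familiar, well-represented material.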