Frontier & Future AI

Coding and Scientific Capability

Two domains have emerged as bellwethers of frontier AI capability: software development and scientific research. These fields share important properties that make them excellent test beds for AI. Both are rigorous — code either runs correctly or it does not; scientific claims either replicate or they do not. Both are economically important — software development employs millions of people and underpins the global economy; scientific research drives medicine, engineering, and our understanding of the world. And both are areas where frontier AI has made dramatic, measurable progress in a short time.

Coding Capability: What Frontier Models Can Do

Frontier models can write functional code from natural-language descriptions across dozens of programming languages, including Python, JavaScript, TypeScript, Rust, Go, C++, Java, SQL, and bash. They can generate complete functions, classes, and modules; implement standard algorithms; write unit tests; refactor existing code for readability or performance; explain what a piece of code does; identify bugs; and suggest fixes. On competitive programming benchmarks, frontier models now match or exceed the performance of skilled human programmers on many problem classes.

The SWE-bench benchmark, which tasks models with resolving real GitHub issues from popular open-source Python repositories, has become a key measure of software engineering capability. It tests the full loop: reading a codebase, understanding a bug report, writing a patch that fixes the issue, and passing the repository's tests, including tests that failed before the fix. In 2023, top models scored below 5% on SWE-bench. By late 2024, frontier agents (models with code execution and file system tools) exceeded 50% on SWE-bench Verified, a human-validated subset of the benchmark. These are not synthetic toy problems: they are real bugs in real codebases, and resolving about half of them autonomously represents genuine software engineering capability.

In practice, frontier models are most useful as coding collaborators: they draft code that a human developer reviews and corrects, generate boilerplate that would otherwise take time, suggest approaches for unfamiliar libraries, and explain complex codebases to newcomers. The net effect can be a substantial productivity gain: in one controlled study of GitHub Copilot, developers completed a specific programming task about 55% faster than a control group, though measured gains vary with the task and the developer's experience.
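The pass/fail rule behind a SWE-bench-style score can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's actual harness: `TaskResult`, `is_resolved`, `resolved_rate`, and the sample data are hypothetical names and values invented here. The assumption is that a task counts as resolved only when the tests that exposed the bug now pass and no previously passing test has been broken.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of test outcomes after applying a model's patch."""
    task_id: str
    # Tests that failed before the fix; True means the test passes after the patch.
    fail_to_pass: dict = field(default_factory=dict)
    # Tests that already passed; True means the patch did not break them.
    pass_to_pass: dict = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if the bug is fixed and nothing else broke."""
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    unbroken = all(result.pass_to_pass.values())
    return fixed and unbroken


def resolved_rate(results: list) -> float:
    """Fraction of tasks resolved; 0.0 for an empty result list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Invented sample data, for illustration only.
results = [
    TaskResult("task-1", {"test_bug": True}, {"test_old": True}),    # resolved
    TaskResult("task-2", {"test_bug": True}, {"test_old": False}),   # fixed the bug, broke something else
    TaskResult("task-3", {"test_bug": False}, {"test_old": True}),   # bug not fixed
    TaskResult("task-4", {"test_a": True, "test_b": True}, {}),      # resolved
]
print(f"resolved: {resolved_rate(results):.0%}")
```

Note that a patch which fixes the reported bug but breaks an unrelated test (task-2 above) does not count, which is part of why the metric is harder to game than a simple "does it compile" check.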

Why Code Is a Strong Test of AI Capability

Code is unusually good for evaluating AI because it has objective ground truth: you can run the code and see whether it works. This makes it possible to benchmark rigorously and to train models using automated feedback, because a model that writes code can get an immediate signal about whether the code is correct by running it against tests. This verifiable feedback loop is one of the reasons coding capability has advanced so rapidly.
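A minimal sketch of such a feedback signal is shown below, assuming the candidate code arrives as a string and is scored by the fraction of test cases it passes. The function name `score_candidate` and the test cases are hypothetical; real training and evaluation systems run untrusted code in an isolated sandbox rather than calling `exec` directly.

```python
def score_candidate(source: str, func_name: str, cases: list) -> float:
    """Return the fraction of (args, expected) cases the candidate passes.

    WARNING: exec() on untrusted code is unsafe. This sketch skips the
    sandboxing that any real system would need.
    """
    namespace = {}
    try:
        exec(source, namespace)        # define the candidate function
        func = namespace[func_name]
    except Exception:
        return 0.0                     # code that does not even load scores zero

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                       # a crash counts as a failed case
    return passed / len(cases)


# Hypothetical test cases for an `add` function.
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"   # syntax error

for label, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(label, round(score_candidate(src, "add", CASES), 2))
```

The score is only as informative as the test cases: the buggy candidate still passes one of the three cases here, so a thin test suite can reward incorrect code.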

Scientific Capability: From Literature to Discovery

Science is harder to benchmark than code because scientific ground truth often takes years to establish through replication and peer review. Nevertheless, frontier AI has demonstrated striking scientific capability.

Literature navigation and synthesis: frontier models can read, summarize, and connect ideas across thousands of scientific papers, identifying relevant prior work and spotting connections across disciplines that a human researcher might miss. Tools like Semantic Scholar and Elicit build on this to accelerate literature review.

Protein structure prediction: AlphaFold 2 (DeepMind, 2020) predicted the three-dimensional structure of proteins from their amino acid sequences with accuracy approaching experimental methods, a problem that had resisted solution for fifty years. By 2022, the AlphaFold Protein Structure Database contained predicted structures for more than 200 million proteins, covering nearly every protein catalogued by science and transforming structural biology and drug discovery. AlphaFold 3 extended this to predict interactions between proteins, DNA, RNA, and small molecules.

Mathematics: AI systems have made verifiable contributions to mathematics. Google DeepMind's AlphaGeometry solved olympiad-level geometry problems using a hybrid of neural and symbolic methods. FunSearch, another DeepMind system, found new constructions for the cap set problem, a long-standing open problem in combinatorics, improving on the best results previously known in some settings.

Drug discovery and materials science: AI systems are being used to propose candidate drug molecules, predict their properties, and prioritize which to synthesize and test, which can substantially speed up the early stages of pharmaceutical research. Similar workflows are applied to discovering new materials with desired properties.

The coding and scientific capabilities of frontier AI are not without limits. In coding, models make subtle logical errors that superficially look correct — a bug that passes a test suite but fails on edge cases, a security vulnerability introduced by a plausible-looking but flawed implementation. They lack deep understanding of the system context in which code will run; they do not know your team's conventions, your production constraints, or your codebase's history unless these are explicitly in their context. In science, AI systems excel at pattern recognition and suggestion but do not yet reliably distinguish between a plausible-sounding hypothesis and a well-evidenced one. They can hallucinate citations, misrepresent the findings of papers, and propose experiments that violate known physical constraints. The most reliable scientific AI applications are those with tight verification loops — where every AI output is checked against objective reality quickly.
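As a small illustration of a bug that passes a test suite but fails on edge cases, consider the hypothetical leap-year functions below. The function names and the choice of test years are invented for this sketch; the point is only that a thin suite cannot distinguish the plausible version from the correct one.

```python
def is_leap_year_plausible(year: int) -> bool:
    """Looks right and passes common tests, but ignores the century rule."""
    return year % 4 == 0


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# A thin test suite that both versions pass.
COMMON_CASES = {2024: True, 2023: False, 2000: True}
for year, expected in COMMON_CASES.items():
    assert is_leap_year_plausible(year) == expected
    assert is_leap_year(year) == expected

# The edge case that separates them: 1900 was not a leap year.
print(is_leap_year_plausible(1900), is_leap_year(1900))
```

Both functions pass every assertion in the common suite; only the century case reveals the difference.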

Plausible Code Is Not Correct Code

Frontier models very often generate code that looks correct: it follows syntax rules, uses appropriate library calls, and implements the apparent logic. But looking correct and being correct are not the same. Security vulnerabilities, off-by-one errors, incorrect handling of edge cases, and subtle race conditions can all appear in AI-generated code that passes initial review. Treat AI-generated code as a draft requiring careful human review, especially in security-sensitive contexts.
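One common example of a plausible-looking security flaw is building a SQL query with string formatting. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table, the data, and the function names `find_user_unsafe` and `find_user_safe` are hypothetical and exist only to show the contrast with a parameterized query.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two invented users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Plausible but flawed: user input is pasted directly into the SQL text.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver treats the input as data, not as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "x' OR '1'='1"

# Both versions behave identically on ordinary input...
print(find_user_unsafe(conn, "alice") == find_user_safe(conn, "alice"))
# ...but the crafted input makes the unsafe version return every row.
print(len(find_user_unsafe(conn, payload)), len(find_user_safe(conn, payload)))
```

A review that only tries ordinary names would see no difference between the two functions, which is why security-sensitive code needs adversarial inputs as well as a happy-path check.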

Match each AI coding or science capability to the most accurate description of its current state.

Terms

Resolving real GitHub issues autonomously
Protein structure prediction
Generating unit tests for a given function
Guaranteeing absence of security vulnerabilities in AI-generated code
Mathematical discovery via AI

Definitions

Solved for essentially all known proteins by AlphaFold, transforming structural biology
A reliable high-value application where models excel and errors are quickly caught by execution
Demonstrated by AlphaGeometry and FunSearch on competition and research-level problems
Not achievable with current models — security review by humans remains essential
Demonstrated on over 50% of SWE-bench Verified problems by frontier agents as of late 2024

Why is automated code execution feedback particularly powerful for improving AI coding capability?

AlphaFold 2's protein structure prediction is described as one of the most significant scientific AI achievements. What makes it historically significant, specifically?

Code Generation and Verification Experiment

This activity explores the gap between plausible and correct AI-generated code.

  1. Generate: Give a frontier model (Claude, GPT-4o, or Gemini) a coding problem of moderate complexity. Suggested problems: implement a function that checks whether a string is a valid IPv4 address; write a function that finds the longest common subsequence of two strings; implement binary search on a sorted list. Record the code it produces.
  2. Review: Before running the code, read it carefully. Can you spot any potential edge cases it might handle incorrectly? Make a prediction: list at least two inputs where you think the code might fail or produce wrong output.
  3. Test: Run the code (in any Python environment, browser-based REPL, or coding tool). Test your predicted edge cases. Also test empty inputs, very large inputs, and inputs with special characters. Record the results: did the code pass? Did it fail where you predicted?
  4. Reflect: Write a paragraph on what this experiment reveals about the appropriate role of AI in a software development workflow. When should AI-generated code be trusted, and when must it always be reviewed?
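For the testing step, a small table-driven harness makes it easy to record where a draft fails. The sketch below is one possible setup for the IPv4 problem, not a reference solution: both validators and the case table are hypothetical, and the expected values assume a strict specification in which leading zeros and non-ASCII digits are invalid. Adjust the table if your specification differs.

```python
def naive_is_ipv4(text: str) -> bool:
    """A plausible first draft: four dot-separated numbers in 0..255."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    return all(p.isdigit() and 0 <= int(p) <= 255 for p in parts)


def strict_is_ipv4(text: str) -> bool:
    """Stricter draft: ASCII digits only, no leading zeros (assumed spec)."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for p in parts:
        if not (p.isascii() and p.isdigit()):
            return False
        if p != str(int(p)) or int(p) > 255:
            return False
    return True


# (input, expected) pairs under the assumed strict specification.
CASES = [
    ("192.168.1.1", True),
    ("0.0.0.0", True),
    ("255.255.255.255", True),
    ("256.1.1.1", False),       # octet out of range
    ("1.2.3", False),           # too few parts
    ("1.2.3.4.5", False),       # too many parts
    ("", False),                # empty input
    ("1..2.3", False),          # empty octet
    (" 1.2.3.4", False),        # leading whitespace
    ("1.2.3.-4", False),        # negative octet
    ("01.2.3.4", False),        # leading zero
    ("\uff11.2.3.4", False),    # full-width digit one, accepted by str.isdigit()
]


def failing_inputs(validator, cases) -> list:
    """Return the inputs where the validator's answer differs from expected."""
    failures = []
    for text, expected in cases:
        try:
            actual = validator(text)
        except Exception as exc:   # a crash is recorded as a failure
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            failures.append(text)
    return failures


print("naive failures: ", failing_inputs(naive_is_ipv4, CASES))
print("strict failures:", failing_inputs(strict_is_ipv4, CASES))
```

Swap in the model's generated function as the `validator` argument and extend `CASES` with the inputs you predicted in the review step.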