Evaluation and Release
Training a frontier model is only part of the story. Before that model reaches users, it must be rigorously tested. How capable is it? Where does it fail? Can it be prompted to produce harmful content? Does it behave reliably across thousands of different input types? The science and practice of AI evaluation — sometimes called 'evals' — has become one of the most active and contested areas in frontier AI development.
Capability Benchmarks
Benchmarks are standardized test suites that allow comparison of model capabilities across different models and over time. They are the vocabulary with which labs, researchers, and the public communicate about what models can do.
Some widely used benchmarks include:
- MMLU (Massive Multitask Language Understanding), which tests knowledge across 57 academic subjects from elementary to professional level.
- MATH and AIME, which test mathematical problem-solving.
- HumanEval and SWE-bench, which test coding ability.
- GPQA, which tests expert-level scientific reasoning.
- HellaSwag and WinoGrande, which test commonsense reasoning.
Benchmarks have significant limitations. Once a benchmark becomes widely used, labs optimize for it, either deliberately (by including benchmark-style questions in training data) or inadvertently (because the model has seen the benchmark's questions or similar material during training). This 'benchmark contamination' can make benchmark scores overstate real-world capability. A model that scores 90% on MMLU may still fail at novel tasks that require flexible application of the same knowledge.
For this reason, frontier labs increasingly rely on 'held-out' evaluation sets that are never shared publicly and thus cannot leak into training data. They also commission expert human evaluators to assess model performance on tasks where automated metrics are inadequate, such as creative writing quality, the correctness of legal analysis, or the accuracy of medical advice.
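To make the held-out idea concrete, the sketch below scores a model on a public split and on a private held-out split, then reports the gap between the two. It is a minimal illustration, not any lab's actual pipeline: the function names (`accuracy`, `heldout_gap`), the multiple-choice answer format, and the toy data are all assumptions made for this example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def heldout_gap(public_acc: float, heldout_acc: float) -> float:
    """Public accuracy minus held-out accuracy (positive = public looks easier)."""
    return public_acc - heldout_acc


# Toy multiple-choice results (hypothetical data, for illustration only).
public_acc = accuracy(["A", "C", "B", "D", "A"], ["A", "C", "B", "D", "B"])
heldout_acc = accuracy(["A", "B", "B", "D", "C"], ["A", "C", "B", "A", "B"])
print(f"public={public_acc:.2f} heldout={heldout_acc:.2f} "
      f"gap={heldout_gap(public_acc, heldout_acc):.2f}")
```

A large positive gap is only a hint that the public score may be inflated. The two splits can also differ in difficulty, so the gap does not prove contamination on its own.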
When the MATH benchmark was first released, models solved roughly 5% of problems. Within about three years, leading models exceeded 80%. Some of this reflects genuine improvement; some may reflect training data that resembles MATH problems. Distinguishing real capability gains from benchmark overfitting is one of the central methodological challenges in AI evaluation.
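One commonly used heuristic for spotting possible contamination is to look for word n-grams that a benchmark item shares with training text. The sketch below is a toy version under stated assumptions: whitespace tokenization, an in-memory corpus, and hypothetical helper names (`word_ngrams`, `flag_contaminated`). Real checks run over far larger corpora and usually need more careful text normalization.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercased, whitespace-tokenized word n-grams of `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(items: List[str], training_docs: Iterable[str],
                      n: int = 8) -> List[str]:
    """Return benchmark items sharing at least one n-gram with the training text."""
    seen: Set[NGram] = set()
    for doc in training_docs:
        seen |= word_ngrams(doc, n)
    return [item for item in items if word_ngrams(item, n) & seen]


# Toy data (hypothetical); n is kept small so the short strings can overlap.
benchmark = [
    "what is the capital of france",
    "solve x squared equals four",
]
corpus = ["trivia night: what is the capital of france? paris"]
flagged = flag_contaminated(benchmark, corpus, n=3)
print(f"{len(flagged)}/{len(benchmark)} items flagged: {flagged}")
```

Exact n-gram matching misses paraphrased or translated copies of a question, which is one reason overlap checks cannot fully separate real capability gains from benchmark overfitting.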
Safety Evaluation and Red-Teaming
Capability benchmarks measure what models can do. Safety evaluations measure what they should not do and whether existing safeguards hold.
Red-teaming involves human testers, internal or external, who attempt to elicit harmful, deceptive, or policy-violating outputs from the model by crafting adversarial prompts. Red teams might try to extract instructions for synthesizing dangerous chemicals, generate content that sexualizes minors, persuade the model to reveal confidential system-prompt contents, or cause the model to impersonate a trusted authority.
Frontier labs typically run multiple rounds of red-teaming. Initial red-teaming informs the final stages of RLHF training, allowing labs to add specific training examples addressing discovered failure modes. A second round of red-teaming after training verifies whether those issues were addressed. External red-teaming, conducted by independent researchers or organizations not affiliated with the lab, adds credibility because internal testers may unconsciously avoid finding certain failures.
Beyond adversarial testing, labs evaluate models on what might be called 'uplift risk': the degree to which the model could help a malicious actor accomplish dangerous tasks they could not otherwise do. The relevant question is not whether a model can describe a dangerous process but whether it provides meaningful capability uplift over what a determined adversary could access through other means. This framing guides decisions about which outputs to restrict.
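The uplift framing can be read as a comparison between two groups attempting the same task: one with ordinary resources and one with model access as well. The sketch below computes that difference as a simple gap in success rates. The study design, the `StudyArm` class, and all numbers are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyArm:
    """One arm of a hypothetical uplift study on a benign proxy task."""
    name: str
    participants: int
    successes: int

    @property
    def success_rate(self) -> float:
        if self.participants <= 0:
            raise ValueError("an arm needs at least one participant")
        return self.successes / self.participants


def uplift(baseline: StudyArm, assisted: StudyArm) -> float:
    """Absolute uplift: assisted success rate minus baseline success rate."""
    return assisted.success_rate - baseline.success_rate


# Hypothetical numbers, not results from any real study.
baseline = StudyArm("search engine only", participants=20, successes=4)
assisted = StudyArm("search engine + model", participants=20, successes=5)
print(f"uplift = {uplift(baseline, assisted):+.2f}")
```

An uplift near zero suggests the model adds little beyond what the baseline group could already find. With groups this small, a difference of one success is well within noise, so a real study would need larger samples and a significance test.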
The Release Decision
After evaluation, the lab must decide how to release the model. This decision involves multiple dimensions that do not always point in the same direction.
Gated versus open release. Will the model be available only through a controlled API (gated release), or will the model weights be released publicly for anyone to download and run (open release)? Gated release allows the lab to monitor usage, update the model, add safety guardrails, and revoke access if misuse is detected. Open release enables broader research access and allows others to build on the model, but it cannot be reversed: weights that have been released cannot be 'un-released.'
Staged rollout. Many labs release models first to a small set of trusted developers, then to a broader beta population, then to the general public. This staged approach allows labs to monitor real-world usage before exposure at full scale and to respond to unexpected failure modes before they affect millions of users.
Usage policies. Released models typically come with terms of service that prohibit specific uses. These policies are enforced through API usage monitoring, account review, and model-level safeguards. Their effectiveness depends on the technical robustness of the model's safety training: a model that can be easily jailbroken provides weaker guarantees than one whose safety behaviors are deeply embedded.
System card publication. Major labs now typically publish a 'system card' or 'model card' alongside a release. This document summarizes known limitations, dangerous capabilities, safety evaluation results, and intended use cases. These documents are important for accountability, though critics note that labs control what they choose to include.
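The staged rollout described above can be pictured as a gate that widens access only while monitored usage stays healthy. The following is a minimal sketch under assumptions: the stage names, the `next_stage` function, the idea of a "flagged request" rate, and the 0.1% default threshold are all invented for illustration and do not describe any particular lab's process.

```python
from typing import List

# Stage names follow the rollout order described in the text.
STAGES: List[str] = ["trusted_developers", "beta", "general_public"]


def next_stage(current: str, flagged_requests: int, total_requests: int,
               max_flag_rate: float = 0.001) -> str:
    """Advance one stage only if the observed flag rate is within tolerance."""
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current}")
    if total_requests <= 0:
        return current  # no usage data yet, so hold
    if flagged_requests / total_requests > max_flag_rate:
        return current  # too many flagged requests, so hold and investigate
    position = STAGES.index(current)
    return STAGES[min(position + 1, len(STAGES) - 1)]


print(next_stage("trusted_developers", flagged_requests=1, total_requests=10_000))
print(next_stage("beta", flagged_requests=50, total_requests=10_000))
```

The gate only ever holds or advances. Rolling access back, which is possible for a gated API but not for openly released weights, would need a separate decision path.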
A model achieves 91% accuracy on the MMLU benchmark, up from 76% two years earlier. Why should this improvement be interpreted cautiously rather than celebrated as straightforward evidence of capability progress?
A frontier lab considers releasing a model that can provide detailed technical information about certain dangerous topics. The safety team frames this as an 'uplift risk' question. What does this framing mean?
Design an Evaluation Suite
- You are on the evaluation team for a new frontier language model. Your job is to design a pre-release evaluation suite.
- Step 1: Identify three capability dimensions you want to measure (for example: coding ability, factual accuracy, reasoning). For each, name a specific existing benchmark or describe a test format you would use.
- Step 2: Identify two safety dimensions you want to red-team. For each, write two specific prompt examples a red-teamer might try. Explain what a 'pass' looks like (the model behaves safely) and what a 'fail' looks like.
- Step 3: Design a held-out evaluation set for one capability dimension. How would you ensure its questions have not leaked into any training data? Who would create the questions?
- Step 4: Write a one-paragraph release recommendation based on imaginary results: your model scores well on capabilities, passes red-teaming for most categories, but shows a 12% failure rate on one specific safety red-team scenario. Do you recommend releasing it? What conditions or additional work would you require first?
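When weighing a result like the 12% failure rate in Step 4, it helps to remember that a measured rate comes from a finite number of red-team attempts. The sketch below puts a Wilson score interval around the observed rate and blocks release if the upper bound exceeds a tolerated rate. The sample size of 100 attempts, the 5% tolerance, and the function names are assumptions for illustration; the exercise does not specify them.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def release_blocked(failures: int, trials: int, max_failure_rate: float) -> bool:
    """Block release if the interval's upper bound exceeds the tolerated rate."""
    _, upper = wilson_interval(failures, trials)
    return upper > max_failure_rate


# Hypothetical: 12 failures in 100 red-team attempts, 5% tolerated failure rate.
low, high = wilson_interval(12, 100)
print(f"failure rate 95% interval: {low:.3f} to {high:.3f}")
print("blocked:", release_blocked(12, 100, max_failure_rate=0.05))
```

With these assumed numbers the interval runs from roughly 7% to 20%, so the true failure rate could be well above the observed 12%. Gating on the upper bound is a deliberately conservative design choice; a recommendation might instead call for more red-team attempts to narrow the interval before deciding.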