Emergence and Capabilities
Scaling laws describe a smooth, predictable process: as you add compute, the model's average loss decreases along a curve you can extrapolate. But something strange and important also happens at scale that the smooth loss curve does not predict. Certain capabilities appear abruptly: models perform near chance on a task at smaller scales and then perform well once they pass a threshold size. This phenomenon is called emergence, and it is one of the most surprising and consequential properties of large language models.
What Is Emergence?
The term emergence comes from complex systems science: an emergent property is one that arises in a complex system but is not predictable from, or reducible to, the properties of the system's components. Water is wet; individual H2O molecules are not. Ant colonies build sophisticated structures; individual ants follow simple rules. Emergence describes phenomena where the whole has properties the parts do not.
In large language models, an emergent capability is one where models perform near chance on a task at small scale and then significantly above chance past a critical larger scale, without any explicit training on that capability. This is distinct from gradual improvement: the model does not get a little better at multi-step arithmetic with every factor of 10 in compute. It seems to fail entirely and then, past a threshold, it works.
The 2022 paper 'Emergent Abilities of Large Language Models', by researchers at Google and collaborating institutions, documented dozens of such capabilities across a range of benchmarks. Examples include multi-step arithmetic (adding three-digit numbers with intermediate steps), chain-of-thought reasoning (explaining reasoning before giving an answer), translation across multiple languages without explicit translation training, and solving college-level science exam problems. None of these capabilities was directly trained into the models. They appeared as byproducts of large-scale next-token prediction pretraining.
Gradual improvement: each 10x of compute makes the model somewhat better at the same tasks. Emergence: the model cannot do a task at all below a threshold scale, and then suddenly can. The difference matters enormously for capability prediction: gradual improvement can be extrapolated from smaller models; emergence cannot, because you do not know in advance which capabilities will emerge or at what scale.
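To make the contrast concrete, here is a minimal sketch of what extrapolation looks like in each case. All numbers are synthetic and chosen only for illustration: the loss values follow an exact power law, the accuracy values are nearly flat, and the "observed" values at the largest scale are hypothetical. The helper names (`fit_line`, `extrapolate_power_law`) are likewise made up for this example.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def extrapolate_power_law(scales, values, target_scale):
    """Fit a straight line in log-log space and evaluate it at target_scale."""
    slope, intercept = fit_line(
        [math.log10(s) for s in scales], [math.log10(v) for v in values]
    )
    return 10 ** (slope * math.log10(target_scale) + intercept)

# Synthetic measurements at three compute budgets (arbitrary units).
scales = [1, 10, 100]
loss = [3.0, 2.7, 2.43]          # loss shrinks by 10% per 10x of compute
accuracy = [0.05, 0.055, 0.06]   # task accuracy barely moves

target = 1000
predicted_loss = extrapolate_power_law(scales, loss, target)
predicted_accuracy = extrapolate_power_law(scales, accuracy, target)

# Hypothetical "observed" values at the larger scale, chosen to show a jump.
observed_loss = 2.187
observed_accuracy = 0.48

print(f"loss:     predicted {predicted_loss:.3f}, observed {observed_loss:.3f}")
print(f"accuracy: predicted {predicted_accuracy:.3f}, observed {observed_accuracy:.3f}")
```

In this toy setup the loss forecast lands on the hypothetical observation, while the accuracy forecast misses badly: nothing in the three small-scale accuracy points signals the jump.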
The precise mechanism behind emergence is an active area of research with multiple competing hypotheses.
Hypothesis one: phase transitions in learned representations. The model may need a minimum representational capacity to encode the structure required for a task. Below that threshold, even the best attempt produces random-level performance. Above it, the structure clicks into place. This is analogous to phase transitions in physics: water does not gradually become ice; it crosses a temperature threshold.
Hypothesis two: metric artifacts. A controversial counter-argument proposed by researchers at Stanford in 2023 suggests that many apparent emergent abilities are artifacts of the evaluation metrics used. When a metric credits only exactly right answers (zero credit for almost right), a smooth underlying improvement in model capability looks like a sudden jump. Under continuous metrics such as partial credit or probability scoring, many apparent emergent abilities look more gradual. This argument does not eliminate emergence as a real phenomenon, but it suggests some claimed instances may be measurement artifacts.
Hypothesis three: multi-component skill requirements. Some tasks require the simultaneous presence of multiple distinct sub-skills. A model might develop each sub-skill gradually, but the task only becomes tractable when all of them are reliable at once. Because the task succeeds only when every sub-skill succeeds, the overall success rate behaves like a product of the individual rates: it stays near zero until all of them are fairly high and then climbs quickly, producing a sharp transition even if each individual skill improves smoothly.
The truth is likely some combination of all three. The honest position in 2026 is that we do not fully understand why and when emergence occurs.
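The second and third hypotheses share a simple piece of arithmetic, sketched below under strong assumptions. It supposes that a task needs `K` steps (or sub-skills) to all succeed, that each succeeds independently with the same probability, and that this per-step probability rises linearly with the logarithm of model size. The trend line and its constants are invented for illustration and are not fitted to any real model.

```python
def per_step_accuracy(log10_params):
    """Assumed smooth trend: per-step accuracy grows linearly in log10(params).

    The constants are illustrative only; the result is clipped to [0, 1].
    """
    return min(1.0, max(0.0, 0.5 + 0.15 * (log10_params - 8)))

def exact_match_accuracy(step_accuracy, num_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** num_steps

K = 10  # hypothetical number of steps the task requires

rows = []
for log10_params in range(8, 12):  # 100M to 100B parameters
    p = per_step_accuracy(log10_params)
    rows.append((log10_params, p, exact_match_accuracy(p, K)))

for log10_params, p, em in rows:
    print(f"10^{log10_params} params: per-step {p:.2f}, all-{K}-steps {em:.3f}")
```

The per-step column rises by the same amount at every 10x of scale, while the all-steps column sits near zero for the first rows and then climbs steeply. The underlying improvement is identical in both columns; only the way it is scored differs.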
Implications and Limits
Emergence has profound practical implications. If you cannot predict which capabilities will appear and at what scale, capability prediction for future models becomes uncertain. An organization planning safety evaluations for a future model cannot exhaustively test for capabilities that have not yet emerged. This is one of the central arguments for proactive AI safety research: you may not be able to wait and see what emerges before deciding how to respond.
Emergence also complicates the narrative that scaling is just engineering. If larger models might acquire capabilities qualitatively different from smaller ones, each generation of frontier models may pose genuinely novel challenges rather than just more of the same. The argument 'we have handled these models safely so far' does not straightforwardly apply to a model that may have new capabilities the current one lacks.
On the other side, not every alarming capability is truly emergent. Some claimed emergent capabilities turn out, on closer inspection, to be present in smaller models at lower accuracy rates and are better described as continuous improvements that cross a threshold of usefulness. The field has become more careful about distinguishing 'this capability appeared suddenly' from 'this capability crossed the bar of being useful to measure.'
Finally, emergence does not mean unlimited capability growth. Models have systematic weaknesses that have persisted across scales: unreliable long-range mathematical reasoning, tendency to confabulate facts, difficulty with novel spatial reasoning, and susceptibility to adversarial prompts. Scale does not appear to be resolving all of these. Capability is not monolithic: a model can be simultaneously superhuman at some tasks and unreliable at others.
If capabilities can appear unexpectedly at scale, then the safety properties of a future model cannot be fully inferred from tests on the current model. This is why frontier AI labs now conduct pre-deployment capability evaluations, sometimes called evals, specifically designed to probe for dangerous capabilities like assisting with weapons synthesis or autonomous deception that might have emerged in a new model.
A model achieves 5% accuracy on a multi-step reasoning benchmark at 1 billion parameters, 7% at 10 billion, and 48% at 100 billion. A critic argues this is not true emergence but a gradual improvement that crossed a usefulness threshold. What evidence would most strongly support the critic's position?
Why does the possibility of emergent capabilities complicate safety evaluation for future frontier AI models?
Design an Emergence Investigation
- You are a researcher studying emergent capabilities in language models. Design a rigorous investigation of one candidate emergent capability.
- Step 1: Choose a candidate capability that might emerge with scale. Examples: solving logic puzzles, generating valid formal mathematical proofs, correctly citing obscure historical facts, or debugging code from a description alone.
- Step 2: Write a precise definition of the capability. What exactly must the model do to succeed? What counts as failure?
- Step 3: Design a benchmark with at least three difficulty levels. What does a hard instance look like? An easy instance?
- Step 4: Choose a scoring metric. Will you use binary correct-or-wrong scoring, or a partial-credit continuous metric? Argue for your choice, and explain what each choice would reveal versus conceal.
- Step 5: Describe how you would determine whether an observed capability jump is true emergence or a metric artifact. What experimental comparison would settle the question?
- Step 6: If your capability turns out to be genuinely emergent, what safety evaluation would you design to test whether it poses any risk?
- Present your investigation design to the class and critique each other's methodologies.
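For Steps 4 and 5, one possible starting point is to score the same model outputs under both kinds of metric and compare the resulting curves. The sketch below is only an illustration: the reference answers and the outputs for the three model sizes are invented, and character-level similarity from Python's standard `difflib` module stands in, crudely, for whatever partial-credit metric suits your capability.

```python
from difflib import SequenceMatcher

def exact_match_score(prediction, reference):
    """Binary metric: full credit only for an exactly matching answer."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def partial_credit_score(prediction, reference):
    """Continuous metric: character-level similarity in [0, 1].

    A rough proxy chosen for this sketch, not a standard benchmark metric.
    """
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

def score_model(outputs, references):
    """Average both metrics over a list of benchmark items."""
    n = len(references)
    exact = sum(exact_match_score(o, r) for o, r in zip(outputs, references)) / n
    partial = sum(partial_credit_score(o, r) for o, r in zip(outputs, references)) / n
    return {"exact": exact, "partial": partial}

# Invented reference answers and model outputs for three arithmetic items.
references = ["579", "1110", "802"]
outputs_by_model = {
    "small": ["512", "1000", "900"],
    "medium": ["589", "1100", "812"],
    "large": ["579", "1110", "802"],
}

scores = {name: score_model(outs, references) for name, outs in outputs_by_model.items()}
for name, s in scores.items():
    print(f"{name:>6}: exact {s['exact']:.2f}, partial {s['partial']:.2f}")
```

With these invented outputs, exact-match scoring shows nothing until the largest model, while the partial-credit score rises at each size. If your real data shows the same pattern, that is evidence for a metric artifact; if both metrics stay flat and then jump together, the case for genuine emergence is stronger.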