Frontier & Future AI

Why This Paradigm Works

The previous seven lessons have described what the current AI paradigm is: transformers pretrained at scale with self-supervised objectives, then aligned with human feedback. But why does this specific combination work? Why did decades of other AI approaches, each with their own logic and promise, fail to produce systems of comparable capability? And why has the current paradigm continued to work across domains, model sizes, and application areas that no one anticipated when the transformer was introduced in 2017? These are questions about the paradigm's foundations, and answering them precisely will let you reason about its future: what it can likely continue to do, where it is structurally constrained, and what would need to change for a different paradigm to supersede it.

The Richness of the Pretraining Objective

Next-token prediction seems, on the surface, like a narrow task: predict which word comes next. The reason the paradigm works begins with recognizing how extraordinarily rich this objective actually is. Human language is not a random sequence of words. Every sentence encodes decisions: what entities to mention, what properties to attribute to them, what causal and logical relationships hold between events, what the speaker assumes the listener already knows, what tone and register are appropriate. Predicting the next token in naturally occurring text requires learning a model of all of these factors simultaneously.

Consider what is needed to correctly predict the next word in 'The surgeon scrubbed in and prepared to ___.' The model must know that surgeons perform operations, that scrubbing in is part of surgical preparation, and that the next step in operating room protocol typically involves a specific action. To get this right across thousands of similar sentences in medicine, law, engineering, history, and mathematics, the model must develop representations that encode each domain's structure.

This is why next-token prediction is what machine learning researchers call a rich self-supervised task: it is simple to specify and compute, but the internal representations required to minimize its loss are complex, general, and useful for downstream tasks. Contrast this with narrower self-supervised tasks like detecting whether two sentences are adjacent in a document, which requires much less general knowledge and produces less transferable representations.

A Proxy Task That Requires World Knowledge

To predict text accurately, a model must implicitly model the world that text describes: entities, causality, domain knowledge, logical structure, social context. This is why next-token prediction on diverse human-generated text produces surprisingly general world knowledge and reasoning ability as a byproduct. The objective is narrow; the knowledge required to excel at it is vast.
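The objective described above can be made concrete with a deliberately tiny sketch. It is not how large language models are built: it swaps the transformer for a bigram count model (each prediction depends only on the previous word) and uses a made-up two-sentence corpus. The function names, the corpus, and the smoothing constant `alpha` are all illustrative assumptions. What it shares with the real objective is the loss: the average negative log-probability assigned to the word that actually came next.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev, vocab, alpha=1.0):
    """Add-alpha smoothed distribution over the next token given `prev`."""
    total = sum(counts[prev].values()) + alpha * len(vocab)
    return {tok: (counts[prev][tok] + alpha) / total for tok in vocab}

def average_loss(counts, tokens, vocab):
    """Mean negative log-likelihood of each actual next token."""
    losses = [
        -math.log(next_token_probs(counts, prev, vocab)[nxt])
        for prev, nxt in zip(tokens, tokens[1:])
    ]
    return sum(losses) / len(losses)

# Hypothetical toy corpus, chosen only to echo the surgeon example.
corpus = (
    "the surgeon scrubbed in and prepared to operate . "
    "the nurse prepared to assist ."
).split()
vocab = sorted(set(corpus))
counts = train_bigram(corpus)

# Baseline: a model that knows nothing spreads probability evenly.
uniform_loss = math.log(len(vocab))
model_loss = average_loss(counts, corpus, vocab)

print(f"uniform guess loss: {uniform_loss:.3f}")
print(f"bigram model loss:  {model_loss:.3f}")
```

Even this crude model beats uniform guessing on its own training text, because it has picked up that 'prepared' tends to be followed by 'to'. Lowering the loss much further on diverse text is what requires the richer knowledge the lesson describes.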

The second reason the paradigm works is the inductive bias of the transformer architecture, which is closely matched to the structure of the problem. Language and many other sequential domains have a particular property: relevance is non-local. In the sentence 'The trophy did not fit in the suitcase because it was too big,' resolving what 'it' refers to requires connecting a pronoun to a noun several positions back. In a legal document, a reference to 'the aforementioned clause' may refer to content on a previous page. In code, a variable declared at the top of a function is referenced throughout. The relevant context is not always the immediately preceding words.

Self-attention is specifically designed for non-local relevance: every token can attend directly to any other token in the context, regardless of how far apart they are (in autoregressive language models, to any earlier token). This matches the structure of language directly. By contrast, convolutional neural networks impose a locality bias (nearby pixels or tokens are most relevant), and RNNs impose a recency bias (recent tokens dominate the hidden state). These biases are appropriate for some domains but create structural disadvantages for language.

The transformer's minimal inductive bias (any token can be relevant to any other token) is not a weakness. It is a strength when the domain has genuinely non-local structure. The architecture does not impose incorrect assumptions and then struggle to overcome them. It allows the data to determine which tokens are relevant to which others, learning language-specific structure directly from evidence.
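A minimal sketch of scaled dot-product attention weights shows why distance does not enter the computation. The query and key vectors below are hand-picked for illustration, not learned, and the sketch leaves out value vectors, multiple heads, positional encodings, and causal masking, all of which a real transformer includes.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one row per query token."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        rows.append(softmax(scores))
    return rows

tokens = ["the", "trophy", "did", "not", "fit", "in",
          "the", "suitcase", "because", "it"]

# Hypothetical 2-d keys: the first dimension stands for "is a candidate
# referent", the second is a filler direction. Real keys are learned.
keys = [[0.0, 1.0] for _ in tokens]
keys[tokens.index("trophy")] = [4.0, 0.0]
keys[tokens.index("suitcase")] = [2.0, 0.0]

# Hypothetical query for "it": it is looking for a referent.
query_it = [4.0, 0.0]

weights = attention_weights([query_it], keys)[0]
for tok, w in zip(tokens, weights):
    print(f"{tok:>9}: {w:.4f}")
```

With these vectors, most of the weight for 'it' lands on 'trophy', eight positions back. The score depends only on how well the query matches each key, so a relevant token far away is as reachable as an adjacent one.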

The third reason the paradigm works is scale as an amplifier of latent structure. Human language is not arbitrary. The patterns in text are generated by minds that share cognitive architecture, biological constraints, social structures, and physical worlds. This means there is real, learnable structure in text that is stable across time and language communities. Scale amplifies the ability to detect and encode this structure. A small model trained on a million words can learn common word associations. A large model trained on a trillion words can learn subtle patterns that occur infrequently but consistently, the kind of specialized knowledge in science, law, and engineering that is rare in any individual document but present with regularity across a large enough corpus.

The striking practical result is that capability improvements from scale are not merely quantitative. A larger model trained on more data learns patterns that are invisible to a smaller model trained on less, patterns that underlie capabilities like analogical reasoning, multi-step planning, and scientific problem-solving. This is why scaling has produced qualitatively different systems, not just faster or more accurate versions of the same system.

Finally, the paradigm benefits from what might be called multi-domain transfer. Because the pretraining corpus is diverse, the representations learned are not specialized to a single domain. A model that learned about cause-and-effect from medical literature, historical narrative, and physics textbooks has a richer representation of causality than one trained on any single domain. When that representation is applied to a new domain, it transfers more robustly. This is a structural advantage of generalist pretraining over specialist pretraining that was not anticipated in the design of the paradigm but turned out to be one of its most powerful properties.
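The earlier point about rare but consistent patterns can be illustrated with simple arithmetic. The sketch below assumes a hypothetical pattern that appears, on average, once per billion words. That rate is invented for illustration, and real patterns are not spread this evenly. The corpus sizes echo the million-word and trillion-word examples above.

```python
def expected_occurrences(corpus_words, words_per_occurrence):
    """Expected number of times a pattern appears in a corpus,
    assuming it shows up once per `words_per_occurrence` words."""
    return corpus_words / words_per_occurrence

# Hypothetical rate: one occurrence per billion words.
WORDS_PER_OCCURRENCE = 10**9

for corpus_words in (10**6, 10**9, 10**12):
    n = expected_occurrences(corpus_words, WORDS_PER_OCCURRENCE)
    print(f"{corpus_words:>16,} words -> about {n:g} expected occurrences")
```

Under this assumption, a million-word corpus will almost certainly contain no example of the pattern, while a trillion-word corpus contains about a thousand. That is enough repetition for a sufficiently large model to pick up the regularity.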

Match each reason the current paradigm works to the specific mechanism it describes.

Terms

Richness of the pretraining objective
Matched inductive bias of the transformer
Scale amplifying latent structure
Multi-domain transfer
GPU parallelism enabling scale

Definitions

The matrix operations underlying attention and feedforward layers map efficiently onto GPU hardware, making billion-parameter training computationally feasible
Self-attention allows any token to attend to any other, fitting the non-local relevance structure of language without imposing incorrect locality assumptions
Larger training corpora reveal rare but consistent patterns invisible to smaller models, enabling qualitatively different reasoning abilities
Next-token prediction requires implicitly modeling world knowledge, causality, and domain structure across every domain in the training corpus
Pretraining on diverse domains produces representations of general concepts like causality that transfer more robustly to new domains than specialist training

Why Prior Paradigms Failed

Understanding why the current paradigm works is complemented by understanding what prior paradigms lacked.

Expert systems encoded knowledge as explicit, hand-written rules. They failed because the coverage problem is insurmountable: the space of possible inputs to a real-world system is too large to cover with rules. Every exception requires another rule, and the maintenance cost grows until the system collapses. The current paradigm sidesteps this by learning patterns from data rather than having rules written by hand.

Earlier neural networks (pre-2012, pre-transformer) failed for compounding reasons: not enough data, not enough compute, and architectural limitations that prevented them from capturing long-range dependencies. The key insight is that each of these three limitations was contingent, not structural. When data, compute, and architecture improved together, the approach worked. The current paradigm is not a new idea; it is the same core idea, neural networks learned from data, operating in the regime where its prerequisites are met.

Symbolic AI and logic-based systems failed because real-world language and perception are too noisy, ambiguous, and context-dependent for formal logical inference. A system that requires unambiguous formal inputs cannot process raw human language reliably. The current paradigm embraces ambiguity by learning statistical patterns from noisy data rather than requiring clean formal representation.

The honest conclusion is that the current paradigm did not win on theoretical grounds. It won on empirical grounds: it works better, scales better, and generalizes better than alternatives. It may be superseded when another paradigm is demonstrated to work better still. Several candidate directions, including world models, neurosymbolic integration, and test-time compute as a scaling axis, are actively researched precisely because the current paradigm has known structural limits.

The Paradigm Won Empirically, Not Theoretically

We do not have a complete theoretical account of why large neural networks trained on next-token prediction produce such general and capable systems. The paradigm won because it works, not because we proved it would. This should inspire both confidence (the evidence is overwhelming) and humility (we do not fully understand why, which makes it harder to predict where it will fail).

A team argues that because the transformer architecture has a minimal inductive bias and allows every token to attend to every other, it should work equally well on all sequential data, including time-series data from financial markets. Why is this argument incomplete?

A student claims that expert systems failed simply because computers were not fast enough in the 1980s, and with today's hardware they would work as well as large language models. What is wrong with this argument?

Explain the Paradigm to a Skeptic

  1. Practice explaining why the current AI paradigm works by responding to each skeptic argument below. Write two to three precise sentences for each.
  2. Skeptic 1: 'Language models just memorize text. They do not really understand anything. So of course they are good at predicting text, but they will never generalize to truly new situations.'
  3. Skeptic 2: 'All these AI labs did was throw more compute at the problem. Anyone could have done that. There is no real scientific insight here.'
  4. Skeptic 3: 'The transformer was designed for translation, not for reasoning or science. Using it for everything is just a lucky accident, not principled engineering.'
  5. Skeptic 4: 'If we do not understand theoretically why these models work, how can we trust them? Science requires explanation, not just empirical results.'
  6. After writing your responses individually, discuss as a class: which skeptic argument is hardest to refute? What would you need to know to refute it definitively?