Frontier & Future AI


The Transformer and Large Models

In 2017, eight researchers at Google published a paper with a deceptively simple title: 'Attention Is All You Need.' They were proposing a new architecture for neural networks called the transformer. At the time, it seemed like a technical improvement in a specialized corner of AI — machine translation. Within five years, it had become the foundation of the most powerful AI systems ever built, reshaped how people write, code, search for information, and interact with machines, and triggered a debate about the future of human work. Few technical papers in computing history have had a faster or more sweeping impact.

The Problem Transformers Solved

Before the transformer, the best way to process language in AI was a type of network called a recurrent neural network, or RNN. An RNN reads text one word at a time, maintaining a kind of running memory of what it has read. This sequential approach had two major weaknesses. First, by the time the model reached the end of a long paragraph, information from the beginning had often been compressed or forgotten. Second, because each word had to be processed after the previous one, training could not be easily parallelized: you had to wait for step one before starting step two.

The transformer solved both problems. Instead of processing words one at a time, it processes the entire input simultaneously. It also uses a mechanism called self-attention to let every word in a sentence consider every other word when building its representation. When processing the word 'it' in the sentence 'The cat sat on the mat because it was tired,' self-attention allows the model to look at every other word and determine that 'it' most likely refers to 'the cat,' not 'the mat.'
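The two RNN weaknesses described above can be seen in a toy example. The sketch below is not a real RNN: it assumes a single hidden number instead of a vector, and the weights `w_hidden` and `w_input` are invented values rather than learned ones. It only illustrates that each step must wait for the previous one, and that the influence of the first input shrinks as the sequence grows.

```python
import math

def run_rnn(inputs, w_hidden=0.5, w_input=1.0):
    """Toy one-unit recurrent network (hypothetical weights, not learned)."""
    hidden = 0.0
    for x in inputs:
        # Strictly sequential: this step needs the hidden value from the last one.
        hidden = math.tanh(w_hidden * hidden + w_input * x)
    return hidden

def first_input_influence(length):
    """How much the final state changes when only the first input changes."""
    rest = [0.1] * (length - 1)
    return abs(run_rnn([1.0] + rest) - run_rnn([-1.0] + rest))

short_influence = first_input_influence(3)    # first word is 2 steps back
long_influence = first_input_influence(30)    # first word is 29 steps back

print(f"influence after 3 steps:  {short_influence:.6f}")
print(f"influence after 30 steps: {long_influence:.12f}")
```

In this toy setup the first input still visibly changes the result after 3 steps, but after 30 steps its effect is close to zero. Real recurrent networks are more capable than this sketch, yet they face the same basic pressure on long inputs.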

What Self-Attention Does

Self-attention lets each word ask: which other words in this input are most relevant to understanding my meaning? It then weights those relationships and builds a richer, context-aware representation. This is what gives transformers their remarkable ability to understand nuance, reference, and long-range meaning.
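The weighting step described above can be sketched in a few lines of Python. This is a minimal illustration, not how a production transformer is written. It assumes tiny two-number embeddings that were hand-picked so that 'it' resembles 'cat', and it skips the learned query, key, and value projections a trained model would apply. The names `self_attention`, `tokens`, and `embeddings` are made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Similarity score between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Assumption: queries, keys, and values are all the raw input vectors.
    A trained transformer would first multiply each vector by learned
    query, key, and value matrices.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Each word scores every word in the input, itself included.
        scores = [dot(query, key) / scale for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted blend of all the vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d embeddings, invented for illustration only.
tokens = ["cat", "mat", "it"]
embeddings = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3]]

outputs, weights = self_attention(embeddings)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)
```

With these invented numbers, 'it' places more weight on 'cat' than on 'mat'. In a real model that preference is not hand-coded. It comes from embeddings and projection matrices learned during training.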

Scaling Up: The Birth of Large Language Models

Researchers quickly discovered something remarkable about transformers: they scaled. Unlike earlier architectures that hit diminishing returns as they grew larger, transformers kept getting better the more parameters they had and the more data they trained on. Parameters are the adjustable numbers inside a neural network, the knobs that get tuned during training. A model with more parameters can represent more complex patterns.

In 2018, Google released BERT, a transformer trained on a huge amount of text that achieved state-of-the-art results on many language tasks at once. OpenAI released the first GPT model the same year. By 2020, GPT-3 had 175 billion parameters and could write coherent essays, answer questions, summarize documents, and generate code, all from a single model trained once on general text. Nobody had reprogrammed it for each task. The capability emerged from scale.

These systems are called large language models, or LLMs. The word large refers to both their number of parameters and the volume of data they trained on. The word language reflects that they were trained primarily on text, though newer versions process images, audio, and other types of data as well.
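To make the idea of parameters as adjustable numbers concrete, the sketch below counts them for a small fully connected network. This is a simplification: a transformer has additional kinds of layers, and the layer sizes here are invented for illustration. The helper names `dense_layer_parameters` and `network_parameters` are hypothetical.

```python
def dense_layer_parameters(inputs, outputs):
    """One weight per input-output pair, plus one bias per output."""
    return inputs * outputs + outputs

def network_parameters(layer_sizes):
    """Total adjustable numbers in a stack of fully connected layers."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(dense_layer_parameters(i, o) for i, o in pairs)

# Hypothetical layer sizes: 4 inputs, one hidden layer, 2 outputs.
small = network_parameters([4, 8, 2])    # (4*8 + 8) + (8*2 + 2) = 58
wider = network_parameters([4, 16, 2])   # (4*16 + 16) + (16*2 + 2) = 114

print(small, wider)
```

Doubling the hidden layer roughly doubles the count in this tiny case. The same bookkeeping, applied to far larger layers repeated many times, is how a model reaches billions of parameters.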

Emergent Capabilities

One of the most surprising findings about large language models is the existence of emergent capabilities: abilities that appear suddenly as models grow larger, without anyone explicitly programming them in. A model at one size might fail completely at multi-step arithmetic. Scale it up past a certain threshold and, seemingly without any targeted training, it starts getting the arithmetic right. Chain-of-thought reasoning, solving analogy problems, and even rudimentary programming all emerged in this way.

Researchers debate the exact interpretation of emergence. Some argue it reflects genuine new capabilities. Others suggest it is partly a measurement artifact: the capability was always improving, just too weak to score above zero until the model crossed a threshold. Either way, the phenomenon raises a challenging question: if we cannot predict what new abilities will appear as models grow larger, how should we think about the safety and behavior of future systems?
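The measurement-artifact argument can be illustrated with simple arithmetic. The sketch below assumes a task with 20 steps that must all be correct, assumes each step succeeds independently, and uses invented per-step accuracies for four made-up model sizes. None of these numbers come from a real experiment.

```python
def exact_match_rate(per_step_accuracy, steps):
    """Chance of getting every step right, assuming independent steps."""
    return per_step_accuracy ** steps

# Hypothetical per-step accuracies that improve steadily with model size.
per_step_by_size = {"small": 0.50, "medium": 0.70, "large": 0.90, "largest": 0.99}
steps = 20  # an all-or-nothing task with 20 steps

exact = {name: exact_match_rate(p, steps) for name, p in per_step_by_size.items()}

for name, p in per_step_by_size.items():
    print(f"{name:8s} per-step {p:.2f} -> whole task {exact[name]:.4f}")
```

Under these assumptions the per-step skill rises gradually, but the all-or-nothing score stays near zero for the two smaller sizes and then climbs sharply. A steady improvement can therefore look like a sudden jump when the test only rewards fully correct answers.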

Beyond Text: Multimodal Models

The transformer architecture proved versatile far beyond language. The same self-attention mechanism that helps a model understand which words relate to which turned out to be useful for images, audio, and video. Multimodal models — systems trained on multiple types of data simultaneously — can now describe photos in words, generate images from text descriptions, transcribe speech with high accuracy, and answer questions about a video clip. This convergence, where one architecture handles many different types of input, is one of the defining characteristics of the frontier in 2025.

The transformer architecture uses a mechanism called _____ to let each word in a sentence consider every other word when building its meaning. Unlike recurrent networks, transformers process the entire input _____, which makes training much faster. As these models grew larger, researchers discovered _____ capabilities: abilities that appeared suddenly without anyone programming them in. Systems trained on text, images, and other inputs together are called _____ models.

What is a key advantage of transformer self-attention over the sequential processing used in recurrent neural networks?

What does it mean that large language models exhibit 'emergent capabilities'?

Self-Attention in Plain English

  1. Read the following sentence: 'Maya gave her friend the book she had borrowed because she thought she would enjoy it.'
  2. Identify every pronoun in the sentence (she, her, it, etc.).
  3. For each pronoun, write what word or phrase it most likely refers to, and note if it is ambiguous.
  4. You just did a human version of self-attention. Write a short paragraph explaining how this exercise relates to what self-attention does inside a transformer model.
  5. Write one sentence explaining why getting pronouns right is important for an AI system that helps students understand a complex paragraph in their textbook.