The Transformer Architecture
In June 2017, a team of researchers, most of them at Google, published a paper with a deceptively modest title: "Attention Is All You Need". The paper introduced the transformer, an architecture for processing sequences that discarded the recurrent structure of earlier models and replaced it with a mechanism called self-attention. Within a few years, the transformer had become the dominant architecture for language and had spread to vision, audio, protein structure prediction, and reinforcement learning. Understanding it is essential for anyone who wants to reason about modern AI.
The Problem with Recurrence
Before transformers, sequence modeling was dominated by recurrent neural networks (RNNs) and their best-known variant, long short-term memory networks (LSTMs). These models processed a sequence one token at a time, maintaining a hidden state that summarized everything seen so far. To interpret the word 'bank' in a sentence about finance, an RNN had to thread through every preceding word in order. This created two serious problems.
First, limited parallelism: because each step depends on the previous hidden state, the tokens of a sequence cannot be processed simultaneously, so training on long sequences was slow.
Second, long-range dependencies: relevant information had to survive many intermediate hidden-state updates. In practice, RNNs struggled to carry information across gaps of more than a few dozen tokens, a severe constraint when understanding a paragraph or document. LSTMs eased this second problem with gated memory cells, but not enough for the document lengths researchers wanted to handle. Something fundamentally different was needed.
The transformer's answer was self-attention: every token attends directly to every other token in the sequence, computing a weighted mixture of all positions. There is no sequential bottleneck. Any two tokens are connected in a single step regardless of how far apart they are, so long-range dependencies are no harder to reach than local ones. And because no position waits on another, the whole operation can be parallelized on modern GPU hardware.
A transformer is organized into layers called transformer blocks, stacked on top of each other. Each block contains two main components: a multi-head self-attention sublayer and a feedforward sublayer. Residual connections and layer normalization sit around each sublayer to stabilize training. Information flows through the model in four stages: input tokens are converted to dense vector embeddings; positional encodings are added to indicate token order; the embeddings pass through the stack of transformer blocks; and the output embeddings are mapped to the model's output distribution.
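The four stages of information flow can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions that are not part of the lesson: a single attention head instead of multiple heads, a ReLU feedforward sublayer, layer normalization without learned scale and shift, sinusoidal positional encodings, an output projection that reuses the embedding matrix, and random untrained weights. All names (`forward`, `transformer_block`, the keys of the parameter dictionary) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (simplification: no learned scale or shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Row-wise softmax with the usual max subtraction for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_positions(n, d):
    # Fixed sine/cosine positional encodings; d is assumed to be even.
    pos = np.arange(n)[:, None]
    even_dims = np.arange(0, d, 2)[None, :]
    angles = pos / (10000 ** (even_dims / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def transformer_block(x, p):
    # Self-attention sublayer (single head here), then residual + layer norm.
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = layer_norm(x + (weights @ v) @ p["Wo"])
    # Feedforward sublayer applied to each position, then residual + layer norm.
    hidden = np.maximum(0.0, x @ p["W1"])
    return layer_norm(x + hidden @ p["W2"])

def forward(token_ids, embed, blocks):
    x = embed[token_ids]                                  # 1. tokens -> embeddings
    x = x + sinusoidal_positions(len(token_ids), embed.shape[1])  # 2. add positions
    for p in blocks:                                      # 3. stacked blocks
        x = transformer_block(x, p)
    return softmax(x @ embed.T)                           # 4. distribution over vocabulary

# Toy sizes and random weights, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
vocab, d, d_ff, n_layers = 50, 16, 64, 2
embed = rng.normal(size=(vocab, d))
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, d_ff), "W2": (d_ff, d)}
blocks = [{name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
          for _ in range(n_layers)]

probs = forward(np.array([3, 14, 15, 9, 2]), embed, blocks)
print(probs.shape)  # (5, 50): one distribution over the vocabulary per input position
```

Note that nothing in `transformer_block` loops over positions: every token is updated in the same matrix operations, which is the parallelism that recurrent models lack.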
Self-attention computes, for each token, a weighted sum over the representations of all tokens in the sequence, itself included. The weights come from learned similarity scores: each token's query vector is compared with every token's key vector, and the scores are normalized so that they sum to one. Tokens that are highly relevant to each other get high weights; irrelevant tokens get near-zero weights. This is how a transformer decides which parts of the context to use when processing each token.
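For readers who prefer notation, the same idea can be written as the standard scaled dot-product attention formula. The symbols are introduced here for illustration and do not appear elsewhere in the lesson: X is the matrix of token representations (one row per token), W_Q, W_K, and W_V are learned projection matrices, and d_k is the width of the key vectors.

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
A &= \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \\
\operatorname{Attention}(X) &= A V
\end{aligned}
```

Row i of A holds the attention weights for token i. The softmax is applied row by row, so each row is non-negative and sums to one, and row i of the output is the corresponding weighted sum of the rows of V.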
Why Transformers Generalize Across Domains
One of the most striking facts about the transformer is its domain-agnosticism. The same core architecture that processes language also processes images, treating 16x16 pixel patches as tokens in Vision Transformers. It processes audio by treating spectrogram frames as tokens, protein sequences by treating each amino acid as a token, and even game states in reinforcement learning.
Why does one architecture work across such different domains? Because the transformer makes very few assumptions about the structure of its input. It does not assume spatial locality, as convolutional neural networks do. It does not assume sequential order is the primary source of information, as RNNs do. It simply asks: given a set of tokens with associated positions, which tokens should inform the representation of each other? This abstraction is general enough to accommodate any data that can be split into tokens with positions.
The trade-off for this generality is computational cost. Self-attention scales quadratically with sequence length: doubling the number of tokens quadruples the attention computation. This is why context windows were historically limited to a few thousand tokens, and why extending them to hundreds of thousands required innovations such as FlashAttention and sparse attention.
Parameter count scales with the number of layers and the width of each layer: linearly with depth, and roughly with the square of the width. A small transformer might have 100 million parameters. GPT-3 has 175 billion. Unconfirmed estimates for GPT-4 suggest roughly 1.8 trillion, potentially as a mixture-of-experts model where only a fraction of the parameters are active for any given input.
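The parameter figures above can be sanity-checked with a common back-of-the-envelope rule. The sketch below assumes a standard block layout that the lesson does not spell out: four width-by-width attention projections (query, key, value, output) and a feedforward sublayer whose hidden layer is four times the model width. It ignores embeddings, biases, and layer-normalization parameters, so it slightly undercounts. The function name is hypothetical; the 96-layer, 12,288-wide configuration is the one published for GPT-3.

```python
def approx_block_params(n_layers, d_model, ff_multiplier=4):
    """Rough parameter count for the stacked blocks only (no embeddings or biases)."""
    attention = 4 * d_model * d_model                     # Q, K, V, and output projections
    feedforward = 2 * ff_multiplier * d_model * d_model   # up-projection and down-projection
    return n_layers * (attention + feedforward)

# A hypothetical small configuration: 12 layers, width 768.
print(f"{approx_block_params(12, 768):,}")      # 84,934,656

# GPT-3's published configuration: 96 layers, width 12,288.
print(f"{approx_block_params(96, 12288):,}")    # 173,946,175,488
```

The small configuration lands near the 100 million mark once embeddings are added, and the GPT-3 configuration comes out within about one percent of 175 billion, which suggests the rule is a reasonable first approximation for dense models.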
Self-attention over a sequence of length n computes n squared pairwise scores in every attention head of every layer. For n equal to 1,000 this is one million scores. For n equal to 100,000 it is ten billion. This is why long-context models required years of algorithmic work to make practical, and why context length remains a competitive benchmark for frontier models.
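The arithmetic is easy to reproduce. The short sketch below counts pairwise scores and, as an added assumption, estimates the memory needed to hold one full score matrix at 2 bytes per score (16-bit floats). The function names are hypothetical, and the memory figure applies to a single head in a single layer when the matrix is stored in full.

```python
def attention_scores(n):
    # Every token is scored against every token, itself included.
    return n * n

def score_matrix_bytes(n, bytes_per_score=2):
    # Memory for one full n-by-n score matrix (assumes 16-bit floats by default).
    return attention_scores(n) * bytes_per_score

for n in (1_000, 2_000, 100_000):
    scores = attention_scores(n)
    gigabytes = score_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}  scores={scores:>15,}  memory={gigabytes} GB")
```

At 100,000 tokens a single score matrix would occupy 20 GB under these assumptions. Methods such as FlashAttention avoid storing that matrix in full, which is part of what makes long contexts practical.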
A developer wants to build a model that can read an entire 80,000-word novel and answer questions about it. Why would this task be difficult for an LSTM but more tractable for a transformer with a long context window?
Vision Transformers treat 16x16 pixel image patches as tokens. What must be added to patch embeddings before they enter transformer layers, and why is this necessary?
Trace Information Flow Through a Transformer
- Work with a partner on this structured diagram exercise.
- Step 1: Write this sentence as seven tokens: The, cat, sat, on, the, mat, period.
- Step 2: For each token, draw a circle representing its embedding. Add a label PE under each to indicate positional encoding has been added.
- Step 3: In the first transformer layer, every token can attend to every other token. Focus on the token 'sat': draw arrows from 'sat' to all other tokens, labeling each arrow with an intuitive attention weight from 0 (ignore) to 1 (highly relevant). Which tokens should 'sat' attend to most strongly to understand what is doing the sitting and where?
- Step 4: Draw a second row of circles representing the updated representations after attention. The new 'sat' representation is a weighted mix of all token representations, including its own. Write what contextual information is now encoded in the updated 'sat' circle.
- Step 5: In a second transformer layer, these updated representations attend to each other again. How does stacking a second layer allow 'sat' to capture more complex relationships, such as 'the cat is what sat on the mat'?
- Discuss: Why does stacking multiple layers allow the model to build richer, more contextual representations than a single attention layer?
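After finishing the drawing, one way to check the intuition from Step 5 is a small numerical sketch. The attention weights below are hand-picked for illustration, not learned, and the same matrix is reused for both layers. The sketch also ignores residual connections, feedforward sublayers, and value projections, so multiplying the two weight matrices only approximates how information spreads across layers.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
idx = {token: i for i, token in enumerate(tokens)}

# Hand-picked attention weights: row = the token being updated,
# column = the token it attends to. Each row sums to 1.
A = np.zeros((7, 7))
A[idx["The"], [idx["The"], idx["cat"]]] = [0.5, 0.5]
A[idx["cat"], [idx["The"], idx["cat"]]] = [0.4, 0.6]
A[idx["sat"], [idx["cat"], idx["sat"], idx["mat"]]] = [0.4, 0.2, 0.4]
A[idx["on"], [idx["on"], idx["mat"]]] = [0.5, 0.5]
A[idx["the"], [idx["the"], idx["mat"]]] = [0.5, 0.5]
A[idx["mat"], [idx["on"], idx["the"], idx["mat"]]] = [0.3, 0.2, 0.5]
A[idx["."], idx["."]] = 1.0

# Simplified two-layer picture: the second layer mixes representations
# that were already mixed by the first layer.
two_layers = A @ A

for token in tokens:
    one = A[idx["sat"], idx[token]]
    two = two_layers[idx["sat"], idx[token]]
    print(f"'sat' <- {token!r:6}  one layer: {one:.2f}   two layers: {two:.2f}")
```

In the first layer 'sat' puts no weight on 'on', yet after two layers it carries a nonzero contribution from it, because 'mat' had already absorbed information from 'on' before 'sat' attended to 'mat'.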