AI Foundations

⏱ About 20 min · 20 XP

The Transformer, Conceptually

From the outside, a large language model looks like a black box: text goes in, text comes out. From the inside, it is an architecture called the Transformer, introduced in a 2017 paper titled 'Attention Is All You Need.' The Transformer is the backbone of virtually every major language model built since: GPT, BERT, LLaMA, Claude, Gemini. You do not need to implement one to understand it — but you do need a clear conceptual model, because the architecture directly shapes what these systems can and cannot do.

The Core Problem: Relating Tokens to Each Other

Earlier neural architectures for language, chiefly recurrent neural networks (RNNs), processed tokens one at a time in sequence, passing a hidden state forward like a baton. Information from early tokens had to survive many steps to influence predictions much later, and in practice RNNs tended to lose long-range context.

The Transformer's solution was radical: drop the step-by-step recurrence. Instead, every token attends to every other token in the sequence in parallel. The mechanism that makes this possible is called self-attention.

Here is the idea. For each token, the model asks: which other tokens in this sequence are most relevant to understanding what this token means here? It computes a relevance score between every pair of tokens, a number that says how much token i should 'pay attention to' token j when building its representation. These scores are used to produce a weighted combination of the tokens' information, giving each token a richer representation informed by context from the entire sequence.

Concrete example: In the sentence 'The animal didn't cross the street because it was too tired,' what does 'it' refer to? A human reader immediately knows: 'animal.' An attention mechanism can learn to assign a high relevance score between 'it' and 'animal' in this context, allowing the model to resolve the reference correctly even though the words are far apart.

Self-Attention

Self-attention is a mechanism in which each token in a sequence computes a weighted sum over the tokens in that sequence (itself included), with weights determined by learned relevance scores. This lets the model relate any token to any other token regardless of distance, which addresses the long-range dependency problem that limited RNNs.

How are relevance scores computed? Each token's representation is projected into three vectors, called Query (Q), Key (K), and Value (V); the names are borrowed from database retrieval. The raw score between token i and token j is the dot product of token i's Query vector and token j's Key vector, divided by the square root of the Key dimension. For each token i, the scores against all tokens are passed through a softmax function, which turns them into positive weights that sum to 1. The output for token i is then a weighted sum of all tokens' Value vectors, using those weights.

In practice, this happens in parallel across multiple 'heads', an arrangement called multi-head attention. Each head can learn to attend to different kinds of relationships simultaneously: one head might track syntactic subject-verb agreement; another might track semantic similarity; a third might track coreference (which pronoun refers to which noun). The outputs of all heads are concatenated and projected to form the output of the attention step.

One important addition: the Transformer adds positional encodings to each token's initial embedding. Because self-attention treats all tokens in parallel (no inherent order), it needs explicit information about each token's position in the sequence. Positional encodings inject this information numerically.
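The Query/Key/Value computation described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not how any production model is implemented: the function names (softmax, dot, self_attention) and the toy vectors are hypothetical, and the learned projection matrices are skipped by hand-picking Q, K, and V directly.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])  # Key dimension, used for scaling
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token j to the current token i.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # positive, sums to 1
        # Weighted sum of Value vectors, one component at a time.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Assumption: hand-picked toy vectors for three tokens. In a real model
# these come from learned projections of each token's representation.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(Q, K, V)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each printed row shows how one token distributes its attention across all three tokens. Multi-head attention would run several such computations with different projections and concatenate the results.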

Layers, Feed-Forward Networks, and Scale

A full Transformer is not one attention layer but many, stacked on top of each other. In each layer, every token's representation is updated by attending to all other tokens (the attention step) and then passed through a position-wise feed-forward network: a small fully connected network applied independently to each token. Then the process repeats in the next layer.

Why stack layers? Early layers tend to capture low-level patterns such as local syntax and common phrases. Deeper layers build more abstract, contextually rich representations. Empirically, stacking more layers (and using larger hidden dimensions) has generally improved model quality, which is part of why modern models have dozens or even a hundred or more layers.

Scale: GPT-3 has 96 layers and 175 billion parameters. The 'parameters' are the numbers, held in the weight matrices of the attention and feed-forward components, that are learned during training. Each forward pass through a model of that size (generating one token) involves hundreds of billions of floating-point operations. This is why LLMs require specialized hardware (GPUs or TPUs) and significant energy.
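To see where a figure like 175 billion comes from, here is a back-of-the-envelope estimate. It is a rough sketch under stated assumptions: the helper name approx_transformer_params is hypothetical, the hidden size (12288) and vocabulary size (50257) are GPT-3's published configuration rather than numbers from this lesson, the feed-forward network is assumed to be four times the hidden size, and biases, normalization, and positional parameters are ignored.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, ffn_multiplier=4):
    """Rough weight count for a stack of Transformer layers."""
    # Attention: four d_model x d_model matrices (Q, K, V, and output projection).
    attention = 4 * d_model * d_model
    # Feed-forward: one matrix expanding to ffn_multiplier * d_model, one contracting back.
    feed_forward = 2 * ffn_multiplier * d_model * d_model
    per_layer = attention + feed_forward
    # Token embedding table: one d_model-sized vector per vocabulary entry.
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Assumption: d_model and vocab_size are taken from GPT-3's published
# configuration, not from this lesson.
gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)
print(f"{gpt3_estimate / 1e9:.1f} billion parameters")  # prints "174.6 billion parameters"
```

The estimate lands close to the quoted 175 billion, and it shows that nearly all of the parameters sit in the per-layer attention and feed-forward matrices rather than in the embedding table.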

Match each Transformer component to its function.

Terms

Self-attention
Query, Key, Value vectors
Positional encoding
Multi-head attention
Feed-forward sublayer

Definitions

Numerical signals added to token embeddings to indicate each token's position in the sequence
Running several attention computations in parallel, each learning different relationship types
Computes relevance scores between all pairs of tokens to let each token gather context from the full sequence
Projections of each token used to compute attention scores and aggregate information
A small fully connected network applied to each token after attention, adding nonlinear transformation


Intuition Check

Self-attention is not magic — it is a learned dot-product similarity between vectors. When it works well, the model has learned which vector directions correspond to 'relevant context' for each type of token in each type of sentence. When it fails, it has not. The architecture provides the capacity; training provides the content.

What fundamental problem does self-attention solve compared to recurrent neural networks?

A Transformer model generates the word 'she' and must determine what it refers to. Which architectural feature most directly enables this resolution?

Trace Attention by Hand

  1. This activity lets you simulate the logic of attention on a tiny example.
  2. Sentence: 'The scientist won the prize she deserved'
  3. Tokens (simplified, one per word): [The, scientist, won, the, prize, she, deserved]
  4. Step 1: Focus on token 'she' (index 5, counting from 0). Your task is to decide, for each other token, how relevant it is when determining what 'she' refers to. Assign each a relevance score from 0 (irrelevant) to 3 (highly relevant): The, scientist, won, the, prize, deserved.
  5. Step 2: Normalize your scores so they sum to 1 (divide each by the total). These are your attention weights.
  6. Step 3: Now change the sentence to 'The scientist won the prize it deserved.' How do your attention weights for the token 'it' differ from those for 'she'? Which tokens become more or less relevant?
  7. Step 4: Discuss: what kind of patterns in training data would a model need to see to learn appropriate attention weights for pronoun resolution? What could go wrong if the training data contained biased associations?
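If you want to check your arithmetic for Step 2, the normalization can be sketched as below. The scores shown are one hypothetical set of answers, not the correct ones, and the helper name normalize is made up for this example. Note that the activity divides by the total for simplicity, whereas a real Transformer uses softmax to turn scores into weights.

```python
def normalize(scores):
    """Divide each hand-assigned score by the total so the weights sum to 1."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("at least one score must be greater than 0")
    return {token: score / total for token, score in scores.items()}

# Assumption: example scores for the token 'she'; yours may differ.
hand_scores = {
    "The": 0,
    "scientist": 3,
    "won": 1,
    "the": 0,
    "prize": 1,
    "deserved": 1,
}

attention_weights = normalize(hand_scores)
for token, weight in attention_weights.items():
    print(f"{token:>10}: {weight:.2f}")
```

With these example scores, 'scientist' receives half of the total attention weight. Rerunning with your scores for 'it' in Step 3 makes the shift in weights easy to compare.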