Pretraining: Learning From a Sea of Text
You now know that a Transformer can, in principle, relate any token to any other in a sequence, and that a language model's job is to predict the next token. The question this lesson answers is: how does a model actually learn to do this well? The answer is pretraining — a process that is conceptually simple, computationally enormous, and epistemically interesting. Understanding pretraining is essential to understanding both what LLMs know and why they sometimes confidently say things that are wrong.
Self-Supervised Learning: No Labels Required
Most supervised machine learning requires labeled data: each input paired with the correct output, prepared by human annotators. Pretraining uses a different approach called self-supervised learning, which generates its own labels automatically from raw text.
The task is next-token prediction: given a sequence of tokens, predict the next one. Every piece of text in the training corpus automatically provides many training examples. A 500-token passage yields 499 training pairs: for each position n from 2 to 500, tokens 1 through n-1 are the input and token n is the target to predict. No human labeling is required, because the structure of the text itself supplies the supervision signal.
This is why pretraining can scale to enormous datasets. The web contains a vast amount of human-written text, by some estimates hundreds of trillions of tokens: Wikipedia, books, code, scientific papers, news articles, forum discussions, legal documents. All of it is potential training data. GPT-3 was trained on roughly 300 billion tokens; more recent models have used several trillion.
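The way raw text turns into training pairs can be sketched in a few lines. This is a minimal illustration, not how any particular model's data pipeline works: the helper name `next_token_pairs` is hypothetical, and whitespace splitting stands in for a real subword tokenizer.

```python
# Hypothetical helper: builds (context, target) pairs for next-token prediction.
def next_token_pairs(tokens):
    """Return one (context, target) pair for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Assumption: whitespace splitting stands in for a real subword tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(f"{' '.join(context):<20} -> {target}")

# A 500-token passage yields 499 pairs, matching the count in the text.
passage = [f"tok{i}" for i in range(500)]
print(len(next_token_pairs(passage)))
```

No label was written by hand at any point: every target is simply the token that already follows its context in the text.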
Pretraining is the process of training a language model on a massive text corpus using next-token prediction as the objective. Because the correct next token is already present in the text, no human annotation is needed. The model learns by repeatedly being shown a context and asked to predict what comes next, adjusting its parameters to reduce prediction error.
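The prediction error mentioned above is commonly written as an average negative log-likelihood over a passage. The notation is introduced here for illustration and is not part of the lesson itself: x_t is the token at position t, n is the passage length, and p_θ is the model's predicted distribution with parameters θ.

```latex
\mathcal{L}(\theta) = -\frac{1}{n-1} \sum_{t=2}^{n} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Each term is small when the model assigns high probability to the token that actually came next, so lowering this quantity means predicting the text better.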
Concretely, training works like this. The model receives a window of tokens, say the first 1,024 tokens of a document. It produces a probability distribution over the vocabulary at each position. The loss (prediction error) is the cross-entropy between the model's predicted distribution and the actual next token at each position. Backpropagation computes how much each of the model's billions of parameters contributed to this error, and gradient descent adjusts them all slightly to reduce it. This process repeats over hundreds of billions to trillions of token predictions, for weeks or months, on thousands of specialized chips.
What does the model actually learn from all this next-token prediction? Quite a lot that goes beyond surface statistics:
- Grammar and syntax: fluent text requires grammatically correct predictions
- World knowledge: to predict that 'The capital of France is ___' ends with 'Paris,' the model must encode something corresponding to that fact
- Reasoning patterns: to predict the conclusion of a logical argument, the model must learn something about logical structure
- Style and register: formal and casual text have different statistical profiles
- Code: programming languages have strict syntax, and the model learns it from billions of lines of code
All of this emerges from one objective: predict the next token.
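The training loop described above (predict a distribution, measure cross-entropy, nudge the parameters) can be sketched at toy scale. This sketch makes a large simplifying assumption: the "model" is a lookup table of logits indexed by the previous token only, not a Transformer, so the gradient can be written by hand instead of computed by backpropagation. The corpus, the learning rate, and all names are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy corpus; whitespace splitting stands in for a real tokenizer.
tokens = "the cat sat on the mat . the cat sat on the rug .".split()
vocab = sorted(set(tokens))
index = {tok: i for i, tok in enumerate(vocab)}
ids = [index[tok] for tok in tokens]

# Stand-in model: one row of logits per previous token (a bigram table).
# A real LLM computes these logits with a Transformer over the whole context.
V = len(vocab)
logits_table = [[0.0] * V for _ in range(V)]

def train_epoch(lr=0.5):
    """One pass over the corpus; returns the mean cross-entropy loss."""
    total_loss = 0.0
    for prev, target in zip(ids, ids[1:]):
        probs = softmax(logits_table[prev])
        total_loss += -math.log(probs[target])
        # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot).
        for j in range(V):
            grad = probs[j] - (1.0 if j == target else 0.0)
            logits_table[prev][j] -= lr * grad  # gradient descent step
    return total_loss / (len(ids) - 1)

def predict(prev_token):
    """Predicted next-token distribution after a given token."""
    probs = softmax(logits_table[index[prev_token]])
    return {vocab[j]: round(p, 2) for j, p in enumerate(probs)}

losses = [train_epoch() for _ in range(50)]
print(f"loss after epoch 1:  {losses[0]:.3f}")
print(f"loss after epoch 50: {losses[-1]:.3f}")
print(predict("cat"))
```

After training, the table puts most of its probability on 'sat' following 'cat', while the prediction after 'the' stays spread across 'cat', 'mat', and 'rug', because the corpus itself is ambiguous there. The loss cannot reach zero on text that has more than one plausible continuation.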
What Pretraining Does Not Teach
The power of pretraining is real and remarkable. But so are its limits, and understanding them is as important as understanding the capabilities.
No grounded perception: The model trains on text descriptions of the world, not the world itself. It has never seen a color, tasted a food, or moved through space. Its 'knowledge' of physical experience is entirely secondhand, derived from how humans write about such things.
No verified facts: The training corpus contains accurate information alongside misinformation, propaganda, outdated content, and contradictory claims. The model has no mechanism to distinguish them during pretraining; it simply learns to predict what text tends to follow what other text. A frequently repeated false claim may be predicted more confidently than a rare true one.
No persistent memory: Each training example is used to update parameters, but the model does not 'remember' individual documents the way a database stores records. Knowledge is distributed across billions of parameter values. This is why models can sometimes recall obscure facts and sometimes fail to recall well-known ones: it depends on how a fact was represented across the training distribution.
Static knowledge: Pretraining happens at a point in time, and the model's parameters are frozen after training. Events that occurred after the training cutoff date are simply absent. The model has no knowledge of them unless they are provided in the context.
A language model trained on the web learns to predict what text is statistically common, not what is factually correct. A false claim that appears thousands of times in the training data may be predicted more confidently than a true claim that appears rarely. This is a fundamental property of pretraining, not a bug that will be fixed by scaling.
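The effect of frequency on confidence can be illustrated with a model that simply counts continuations. This is a deliberately crude stand-in for a neural language model, and the corpus is invented: the 90-to-10 split between the false and the true sentence is an assumption chosen to make the point, not a measurement of real web text.

```python
from collections import Counter

# Made-up corpus: a false claim repeated often, the true one seen rarely.
# The counts are invented for illustration, not measured from real data.
corpus = (
    ["the capital of australia is sydney"] * 90      # false but common
    + ["the capital of australia is canberra"] * 10  # true but rare
)

def next_token_distribution(corpus, context):
    """Estimate P(next token | context) by counting what follows the context."""
    context_tokens = context.split()
    n = len(context_tokens)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == context_tokens:
                counts[tokens[i + n]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

dist = next_token_distribution(corpus, "the capital of australia is")
print(dist)  # {'sydney': 0.9, 'canberra': 0.1}
```

Nothing in the counting procedure knows which sentence is true. The false continuation wins only because it appears more often, which is the same pressure that next-token prediction places on a much larger model.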
Flashcards
Why does self-supervised pretraining not require human-labeled data?
A language model was trained on a corpus collected in 2023. A user asks it about an election that occurred in 2025. What should we expect?
Audit a Paragraph for What a Model Could Learn
- Read the following short passage and analyze what a language model trained on it could and could not learn.
- Passage: 'Marie Curie won two Nobel Prizes — one in Physics in 1903 and one in Chemistry in 1911. She was born in Warsaw in 1867 and moved to Paris to study science, as women were not permitted to attend university in Russian-controlled Poland at the time. She died in 1934 of aplastic anemia, likely caused by her prolonged exposure to radiation.'
- Step 1: List at least five specific facts a model could learn to predict from this passage.
- Step 2: List at least three things the model cannot learn from this passage alone — knowledge that would require other sources.
- Step 3: Consider: if this passage appeared in the training corpus 10,000 times (because it is widely reproduced across the web), which parts of it would the model predict most confidently? Does frequency of appearance correlate with accuracy of information? Give one example where it might not.
- Step 4: Discuss with a partner: given these properties of pretraining, what should a careful user always do before relying on an LLM's factual claims?