Pretraining and Self-Supervision
A language model trained from scratch on a narrow task, such as answering customer service questions, would face a brutal problem: it must simultaneously learn grammar, world knowledge, reasoning patterns, and the task itself, all from a small number of task-specific examples. This is like asking it to build a skyscraper without a foundation. Pretraining solves this by separating foundation-building from task-specific specialization. First, the model is trained on an enormous, general corpus using a self-supervised objective, which gives it rich representations of language and the world. Then, the pretrained model is adapted to specific tasks with far smaller labeled datasets. This two-phase approach is one of the central design decisions of the current AI paradigm.
Self-Supervised Learning: Supervision Without Labels
Traditional supervised learning requires labeled examples: someone must annotate each data point with the correct answer. For a dataset of a few thousand examples, this is feasible. For hundreds of billions of tokens from the internet, it is not: the cost would be astronomical, and any labeling scheme would cover only a narrow slice of what the text contains. Self-supervised learning sidesteps this by deriving the training signal from the data itself, so no human annotation is needed.
The most successful self-supervised objective for language is next-token prediction, also called causal language modeling. The model receives a sequence of tokens and must predict the next one. Consider the sentence 'The neural network learned to recognize patterns in the training data.' During training, the model might receive the tokens up to and including the second 'the' ('The neural network learned to recognize patterns in the') and must predict 'training.' The loss is computed by comparing the model's predicted probability distribution over the vocabulary with the true next token, and the model's parameters are updated to make 'training' more probable given that prefix.
This repeats at every position of every sequence in the training corpus. Across hundreds of billions or trillions of such predictions, the model is pushed to develop internal representations that capture grammar, syntax, semantics, factual knowledge, logical structure, and stylistic variation, because accurate prediction depends on all of them.
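As a rough illustration of the objective described above, the sketch below turns the example sentence into (prefix, target) training pairs and computes the loss for one of them. It is a minimal sketch, not how any production system is implemented: whitespace splitting stands in for a real tokenizer, the helper names `next_token_pairs` and `cross_entropy` are made up for this example, and the two probability tables are invented numbers standing in for a model's output before and after some parameter updates.

```python
import math

# Whitespace splitting stands in for a real subword tokenizer (assumption).
sentence = "The neural network learned to recognize patterns in the training data"
tokens = sentence.split()

def next_token_pairs(tokens):
    """Return one (prefix, target) training pair per position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Loss for one prediction: negative log-probability of the true next token."""
    return -math.log(probs[target])

pairs = next_token_pairs(tokens)
prefix, target = pairs[8]  # prefix ends at the second "the"; target is "training"

# Hypothetical model outputs over a four-word vocabulary, before and after
# some parameter updates. The numbers are invented for illustration.
probs_before = {"training": 0.10, "test": 0.30, "data": 0.20, "network": 0.40}
probs_after = {"training": 0.60, "test": 0.15, "data": 0.10, "network": 0.15}

loss_before = cross_entropy(probs_before, target)
loss_after = cross_entropy(probs_after, target)

print(" ".join(prefix), "->", target)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The loss falls as the probability assigned to the true next token rises, which is the only signal the model receives. Note that a single sentence of eleven words already yields ten training examples with no human labeling.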
To predict the next word well, a model must understand context, grammar, facts about the world, and how ideas connect. Next-token prediction is a simple objective, but optimizing it at massive scale forces the model to develop rich, general-purpose representations. The task is simple; the knowledge required to perform it well is vast.
What data is used for pretraining? Modern frontier language models train on a curated mixture of sources. Common Crawl, a public archive of web pages scraped at regular intervals, typically contributes the majority of tokens by volume. Books provide dense, high-quality prose with coherent long-form reasoning. Wikipedia contributes factual encyclopedic content with structured citations. Code repositories such as those hosted on GitHub contribute a domain that rewards precise logical structure; training on code has also been reported to improve reasoning on tasks unrelated to programming. Academic papers and filtered web content round out the corpus.
Data curation matters enormously. Raw web data is noisy: spam, duplicate content, and low-quality text are common. The major labs invest significant engineering effort in filtering, deduplicating, and quality-scoring their training corpora. The exact composition (how much code, how much multilingual content, how much scientific text) significantly affects which abilities emerge and where the model is weak.
The scale of pretraining is staggering. GPT-3 was trained on roughly 300 billion tokens. Models like GPT-4, Claude, and Gemini are estimated to have trained on one to ten trillion tokens or more. A single training run at frontier scale can cost tens to hundreds of millions of dollars in compute.
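To make the curation step more concrete, here is a toy sketch of deduplication followed by quality filtering. Everything in it is an assumption made for illustration: the function names, the unique-word-ratio heuristic, and the 0.5 threshold are invented, and real pipelines rely on far more sophisticated techniques such as near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def quality_score(text):
    """Toy heuristic: share of unique words; very short documents score zero."""
    words = normalize(text).split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)

def curate(docs, min_score=0.5):
    """Drop exact duplicates (after normalization), then low-scoring documents."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already processed
        seen.add(digest)
        if quality_score(doc) >= min_score:
            kept.append(doc)
    return kept

docs = [
    "Gradient descent updates parameters in the direction that reduces the loss.",
    "gradient descent  updates parameters in the direction that reduces the loss.",
    "buy now buy now buy now buy now buy now",
    "Click here",
]

for doc in curate(docs):
    print(doc)
```

Of the four documents, only the first survives: the second is a duplicate once case and whitespace are normalized, the third is repetitive spam, and the fourth is too short to score.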
What Pretraining Produces: Representations and World Knowledge
At the end of pretraining, a model has billions of parameters whose values encode patterns extracted from the training corpus. What exactly is stored in those parameters?
The model has learned token embeddings: dense vector representations in which semantically related words cluster together in a high-dimensional space. The words 'king' and 'queen' have nearby embedding vectors. An ambiguous word such as 'bank' starts from a single token embedding, and the model's deeper layers use the surrounding context to move its representation toward either the financial-institution sense or the river-shore sense.
The model has also learned attention patterns: which tokens tend to be relevant to which other tokens in various syntactic and semantic contexts. For example, a verb can attend strongly to its subject and object, and a pronoun to its antecedent.
Critically, the model also encodes factual knowledge in its parameters. Probing experiments, in which researchers analyze a model's internal representations without fine-tuning the model itself, show that large pretrained models encode factual associations like 'Paris is the capital of France,' causal relationships like 'fire causes heat,' and commonsense associations like 'knives are sharp.' This knowledge is implicit: it is stored in the weights as patterns that make certain completions more probable, not as an explicit database of facts.
This is what enables transfer learning. The pretrained model's representations are useful starting points for fine-tuning on specific tasks, because the model already knows a great deal about language and the world. Fine-tuning then only has to teach the specific format and focus of the new task, rather than building everything from scratch.
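The two ideas of "nearby" embeddings and attention between tokens can be sketched numerically. The example below is only an illustration under loud assumptions: the three-dimensional vectors are written by hand (learned embeddings have hundreds or thousands of dimensions), the words chosen for the attention example are hypothetical, and `attention_weights` shows a single scaled dot-product step, not a full transformer layer.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hand-written toy embeddings (invented values, not taken from any model).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "beaker": [0.1, 0.05, 0.95],
}
sim_related = cosine(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine(embeddings["king"], embeddings["beaker"])

# Hypothetical query vector for a pronoun and key vectors for three context
# tokens; the first key is constructed to resemble the pronoun's antecedent.
context = ["trophy", "because", "was"]
keys = [[0.9, 0.1, 0.8], [0.0, 0.2, 0.1], [0.1, 0.0, 0.2]]
query_pronoun = [1.0, 0.0, 1.0]
weights = attention_weights(query_pronoun, keys)

print(f"king~queen: {sim_related:.2f}, king~beaker: {sim_unrelated:.2f}")
for token, weight in zip(context, weights):
    print(f"{token}: {weight:.2f}")
```

In a trained model, none of these vectors is set by hand; they emerge from optimizing next-token prediction, and the geometry shown here is a simplified picture of what that optimization tends to produce.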
A language model does not have a database of facts it can look up. Its knowledge is implicit: encoded in parameter values that make certain token sequences more probable than others. This is why language models can be wrong about facts in ways that seem inconsistent, for example answering a question correctly in one phrasing and incorrectly in another. The model is not retrieving entries from a store of facts; it is generating plausible continuations based on distributional patterns.
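A very small count-based model can make this point tangible. The sketch below is a deliberate oversimplification: a bigram model that looks only at the previous word is a stand-in for a neural language model, and the three-sentence corpus and function names are invented for the example. It still shows how a "fact" exists only as a probability over continuations.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real pretraining corpus has trillions of tokens.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of italy is rome",
]

def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        toks = sentence.split()
        for prev, nxt in zip(toks, toks[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Relative frequencies of the tokens seen after `prev`."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return {tok: c / total for tok, c in following.items()}

counts = train_bigram(corpus)
probs = next_token_probs(counts, "is")

for tok, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"P({tok} | is) = {p:.2f}")
```

Because this toy model conditions only on the word 'is', it would assign 'rome' a nonzero probability even after 'the capital of france is'. Nothing was looked up and nothing was contradicted; the model simply produced a continuation that is plausible under its statistics. Large models condition on far more context, which reduces this kind of error but does not change its nature.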
A researcher trains two language models with the same architecture. Model A is trained with next-token prediction on a 1-trillion-token corpus. Model B is trained with supervised learning on 50 million labeled question-answer pairs. For a new task of classifying the sentiment of product reviews, which model will likely perform better after fine-tuning on 1,000 labeled examples?
Why might including code in the pretraining corpus of a language model improve its performance on mathematical reasoning tasks, even when those tasks involve no programming?
Reverse-Engineer What a Model Learned
- This exercise builds intuition for what self-supervised pretraining forces a model to learn.
- Step 1: Consider this sentence fragment. What must a model know to correctly predict the final blank?
- 'The defendant's lawyer objected to the evidence, arguing it was obtained without a valid ___'
- List at least four distinct types of knowledge required: vocabulary, grammar, legal domain knowledge, and so on.
- Step 2: Now consider this fragment:
- 'She placed the beaker over the Bunsen burner and waited for the solution to ___'
- Again, list the types of knowledge required to complete it correctly.
- Step 3: Compare your two lists. What knowledge is shared across both? What is domain-specific to each?
- Step 4: Imagine a model trained on billions of legal documents but almost no chemistry textbooks. For which fragment would it perform better? What does this tell you about how data composition shapes model capabilities?
- Step 5: Write your own sentence fragment in a domain of your choice, such as medicine, cooking, or astrophysics. Exchange fragments with a partner. For their fragment, identify the types of knowledge required and predict which training data sources would best support learning to complete it.
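After working through Step 4 on paper, the same question can be explored with a toy experiment. The sketch below is an assumption-heavy illustration, not a realistic training setup: it uses a bigram count model with add-one smoothing, a four-sentence corpus that is mostly legal text, and two invented test fragments. The function names `train` and `avg_loss` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented corpus: three legal sentences and a single chemistry sentence.
corpus = [
    "the lawyer objected to the evidence",
    "the evidence was obtained without a valid warrant",
    "the judge signed a valid warrant",
    "the solution began to boil",
]

def train(sentences):
    """Collect bigram counts and the vocabulary from the corpus."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        toks = sentence.lower().split()
        vocab.update(toks)
        for prev, nxt in zip(toks, toks[1:]):
            counts[prev][nxt] += 1
    return counts, vocab

def avg_loss(counts, vocab, sentence):
    """Average negative log-probability per predicted token (add-one smoothing)."""
    toks = sentence.lower().split()
    size = len(vocab) + 1  # one extra slot reserves probability for unseen tokens
    losses = []
    for prev, nxt in zip(toks, toks[1:]):
        following = counts.get(prev, Counter())
        prob = (following[nxt] + 1) / (sum(following.values()) + size)
        losses.append(-math.log(prob))
    return sum(losses) / len(losses)

counts, vocab = train(corpus)
legal_loss = avg_loss(counts, vocab, "the lawyer objected to the warrant")
chem_loss = avg_loss(counts, vocab, "she waited for the solution to boil")

print(f"legal fragment loss: {legal_loss:.3f}")
print(f"chemistry fragment loss: {chem_loss:.3f}")
```

With this corpus, the legal fragment receives the lower average loss. Changing the ratio of legal to chemistry sentences in `corpus` and rerunning is a quick way to see how data composition shifts what the model predicts well.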