The Training Pipeline
A frontier language model does not spring into existence fully formed. It is produced by a multi-stage pipeline that spans weeks or months, consumes enormous resources, and involves distinct phases with different objectives. Understanding this pipeline — pretraining, supervised fine-tuning, reinforcement learning from human feedback, and deployment evaluation — gives you a precise map of how raw data becomes a model that answers your questions.
Stage 1: Pretraining
Pretraining is the largest, most expensive stage. The model (a transformer neural network with billions to hundreds of billions of parameters) is trained on the full curated corpus using a simple objective: predict the next token given all preceding tokens. This is called autoregressive language modeling. At every step, the model sees a sequence of tokens, produces a probability distribution over the vocabulary for the next token, and receives a loss signal, typically the cross-entropy loss, that grows as the probability it assigned to the actual next token shrinks. The optimizer, typically AdamW, adjusts all parameters slightly in the direction that reduces that loss. Repeated across trillions of tokens of training data, this process forces the model to internalize the structure of language at every level: character and subword patterns, words, syntax, facts, reasoning patterns, and world knowledge. The result is a base model: extraordinarily knowledgeable about patterns in text, capable of continuing any sequence plausibly, but not yet aligned to be a helpful, honest, or safe assistant. The base model will complete text in any style (academic, tabloid, fiction, technical documentation) without preference. It has no concept of a 'user' or of being helpful.
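The next-token objective described above can be sketched in a few lines of plain Python. This is a toy illustration, not how any lab implements training: the four-token vocabulary, the hand-written logits, and the function names `softmax` and `next_token_loss` are all assumptions made for the example. It shows only that the loss falls as the model puts more probability on the token that actually comes next.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy between predicted distributions and the actual next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy sequence over a hypothetical 4-token vocabulary.
tokens = [0, 2, 1, 3]
targets = tokens[1:]  # at each position, the target is the token that follows

# A model that knows nothing: equal scores for every vocabulary entry.
uniform_logits = [[0.0] * 4 for _ in targets]

# A model that has learned the pattern: a high score on the correct next token.
confident_logits = [[0.0] * 4 for _ in targets]
for row, target in zip(confident_logits, targets):
    row[target] = 5.0

uniform_loss = next_token_loss(uniform_logits, targets)
confident_loss = next_token_loss(confident_logits, targets)
print(f"uniform guess loss:   {uniform_loss:.4f}")
print(f"confident guess loss: {confident_loss:.4f}")
```

With uniform scores the loss equals the natural log of the vocabulary size, which is the cost of guessing blindly. In a real run the optimizer's job is to push the loss below that baseline.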
Predicting the next token seems like a narrow task. But to predict the next token in a medical journal article accurately, you must understand medicine. To predict it in a Python script, you must understand programming logic. The task is simple; the underlying world model required to solve it at very high accuracy is extraordinarily rich. This is why large pretrained models appear to have broad knowledge — the knowledge is implicit in having learned to predict text well.
Stage 2: Supervised Fine-Tuning
After pretraining, the base model is further trained on a smaller, carefully curated dataset of (prompt, ideal response) pairs, a stage called supervised fine-tuning (SFT). The objective remains the same: predict the next token, given the prompt and the response so far. But now the training examples are structured as conversations, with the ideal response authored by expert human contractors or by the lab's own researchers. SFT teaches the model the format of a helpful assistant: that it should address the user's question directly, that its responses should have a certain structure, that it should acknowledge uncertainty, and so on. SFT is much smaller in compute than pretraining (the SFT dataset might contain a few million tokens rather than trillions), but it has an outsized effect on the model's surface behavior. The limitation of SFT is that it requires humans to write ideal responses, and humans are better at recognizing a good response than producing one. A human rater shown two responses can reliably say which is better, even when they could not have generated the better response from scratch. This limitation motivates the next stage.
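The sketch below shows one way an SFT loss can be computed on a (prompt, response) pair. It assumes a detail the text does not state: that the loss is averaged only over response tokens, with prompt tokens masked out. This is a common convention, but treat it here as an assumption. The function names and the log-probabilities are made up for illustration.

```python
import math

def build_loss_mask(prompt_len, response_len):
    """1 marks positions that count toward the loss; 0 marks prompt positions."""
    return [0] * prompt_len + [1] * response_len

def masked_next_token_loss(token_log_probs, loss_mask):
    """Average negative log-likelihood over the unmasked (response) positions."""
    selected = [-lp for lp, keep in zip(token_log_probs, loss_mask) if keep]
    if not selected:
        raise ValueError("loss mask selects no tokens")
    return sum(selected) / len(selected)

# Hypothetical log-probabilities the model assigned to each actual token:
# three prompt tokens followed by two response tokens.
token_log_probs = [
    math.log(0.1), math.log(0.1), math.log(0.1),  # prompt (ignored by the mask)
    math.log(0.5), math.log(0.25),                # response (trained on)
]
mask = build_loss_mask(prompt_len=3, response_len=2)
sft_loss = masked_next_token_loss(token_log_probs, mask)
print(f"SFT loss over response tokens only: {sft_loss:.4f}")
```

Under this assumption the model is never penalized for failing to predict the user's prompt, only for deviating from the ideal response.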
Stage 3: Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) is the stage that converts a capable but imperfect SFT model into a model that reliably produces responses humans prefer. The process has three components. First, a reward model is trained. Human raters are shown pairs of model responses to the same prompt and asked to indicate which they prefer. These preference judgments, typically hundreds of thousands to millions of them, are used to train a separate neural network that assigns a scalar reward score to any (prompt, response) pair. Second, the SFT model is fine-tuned using reinforcement learning, specifically the PPO algorithm or variants, to maximize the reward model's score on generated responses. The model explores different responses and learns to produce responses the reward model scores highly. Third, a KL-divergence penalty limits how far the RL process can exploit the reward model in degenerate ways. Without this penalty, the model can learn to produce responses that fool the reward model into giving high scores without actually being better, a phenomenon called reward hacking. The penalty keeps the model close to the SFT baseline. Direct Preference Optimization (DPO), introduced in 2023, has become a popular alternative to PPO-based RLHF: it still learns from human preference pairs, but it avoids the need for a separate reward model by fine-tuning directly on those pairs.
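The three quantities in this stage can be written down compactly. The sketch below uses the standard pairwise (Bradley-Terry style) loss for the reward model, a per-sample KL-penalized reward, and the DPO loss. The numeric inputs and the coefficient `beta=0.1` are illustrative assumptions, and real systems compute these over batches of token sequences rather than single scalars.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: small when the preferred response gets the higher score."""
    return -log_sigmoid(score_chosen - score_rejected)

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a penalty for drifting away from the reference (SFT) model.

    logp_policy - logp_reference is a single-sample estimate of the KL term.
    """
    return reward - beta * (logp_policy - logp_reference)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: a preference loss on policy-vs-reference log-ratios, no reward model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)

# Reward model: ranking the pair correctly costs less than ranking it backwards.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))

# Same raw reward, but the second response has drifted far from the SFT model.
print(kl_penalized_reward(1.0, logp_policy=-5.0, logp_reference=-5.0))
print(kl_penalized_reward(1.0, logp_policy=-2.0, logp_reference=-5.0))

# DPO loss drops once the policy favors the chosen response more than the reference does.
print(dpo_loss(-4.0, -6.0, ref_logp_chosen=-5.0, ref_logp_rejected=-5.0))
```

The KL term is what makes reward hacking expensive: a response the policy rates far more likely than the SFT model would has part of its reward taken back.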
Checkpointing, Monitoring, and Recovery
A training run spanning months cannot be treated as a single uninterrupted computation. Hardware fails. Software bugs corrupt state. Cloud provider outages interrupt runs. Frontier labs manage this through aggressive checkpointing: saving the full training state (the model's parameters, plus the optimizer state and data position needed to resume exactly) to persistent storage at regular intervals, typically every few hours. If a failure occurs, the run resumes from the most recent checkpoint rather than restarting from scratch. Continuous monitoring tracks metrics throughout the run: training loss (how well the model is predicting training data), validation loss (how well it generalizes to held-out data), gradient norms (which can signal instability), and GPU utilization (which signals efficiency). A sudden jump in loss, called a loss spike, can indicate data quality problems, a learning rate that is too high, or hardware issues. Engineers must detect and respond to these events promptly. Toward the end of a pretraining run, labs often perform 'annealing': reducing the learning rate sharply, and potentially shifting the data mixture toward higher-quality sources, to squeeze final quality improvements out of the model before pretraining ends.
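A minimal sketch of the two mechanisms above, checkpoint-and-resume and loss-spike detection, might look like the following. Everything here is a simplification chosen for illustration: the JSON file format, the function names, the tiny parameter list, and the "more than twice the recent average" spike rule are assumptions, not a description of any lab's tooling.

```python
import json
import os
import tempfile
from collections import deque

def save_checkpoint(path, step, params, optimizer_state):
    """Write to a temporary file, then rename, so a crash never leaves a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "params": params, "optimizer_state": optimizer_state}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the saved training state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def is_loss_spike(recent_losses, new_loss, factor=2.0):
    """Flag a loss that exceeds `factor` times the recent average (hypothetical rule)."""
    if not recent_losses:
        return False
    baseline = sum(recent_losses) / len(recent_losses)
    return new_loss > factor * baseline

# Simulate saving at step 100, "crashing", and resuming from disk.
with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "ckpt.json")
    assert load_checkpoint(ckpt_path) is None  # nothing saved yet
    save_checkpoint(ckpt_path, step=100, params=[0.5, -1.2], optimizer_state={"lr": 3e-4})
    restored = load_checkpoint(ckpt_path)

print("resuming from step", restored["step"])

# Monitor a rolling window of recent training losses.
window = deque([2.1, 2.0, 1.9, 2.0], maxlen=4)
normal_flag = is_loss_spike(window, 2.05)
spike_flag = is_loss_spike(window, 6.0)
print("normal step flagged:", normal_flag, "| spike flagged:", spike_flag)
```

Saving the optimizer state alongside the parameters matters because an optimizer such as AdamW keeps running statistics per parameter. Restoring the weights alone would resume the run with those statistics reset.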
A base model produced by pretraining is asked to complete the prompt: 'Write a helpful response explaining how to change a tire.' The model outputs a realistic-sounding but patronizing car-repair forum post from 2008. Why is this the expected behavior of a base model, not a bug?
What is 'reward hacking' in the context of RLHF, and what mechanism is specifically designed to prevent it?
Trace a Model's Journey
- Choose a publicly released frontier model (such as GPT-4, Claude 3 Opus, Gemini 1.5 Pro, or Llama 3). Using the model's public technical report or documentation, trace its training pipeline.
- Step 1: What does the technical report say about pretraining data size (in tokens) and model parameter count? Calculate the approximate tokens-per-parameter ratio and compare it to the Chinchilla-optimal ratio of roughly 20 tokens per parameter.
- Step 2: Does the report describe a supervised fine-tuning stage? What does it say about how the SFT data was collected?
- Step 3: Does the report describe an RLHF or DPO stage? What does it say about how human preference data was collected?
- Step 4: What does the report NOT say? Identify at least two significant details about the training pipeline that are absent from the public documentation. Why might a lab choose not to disclose these?
- Step 5: Write a one-paragraph assessment: how transparent is this lab about its training process, and what are the implications of that level of transparency for public trust and accountability?
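For the Step 1 calculation, a small helper like the one below may be useful. The token and parameter counts are placeholders, not figures from any real report. Substitute the numbers you find in the documentation for the model you chose.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio referenced in Step 1

def tokens_per_parameter(training_tokens, parameter_count):
    """Ratio of pretraining tokens to model parameters."""
    if parameter_count <= 0:
        raise ValueError("parameter_count must be positive")
    return training_tokens / parameter_count

# Placeholder figures for a hypothetical model: 2 trillion tokens, 10 billion parameters.
ratio = tokens_per_parameter(training_tokens=2e12, parameter_count=10e9)
relative_to_chinchilla = ratio / CHINCHILLA_TOKENS_PER_PARAM

print(f"tokens per parameter: {ratio:.0f}")
print(f"multiple of the Chinchilla ratio: {relative_to_chinchilla:.1f}x")
```

A multiple well above 1 means the model was trained on far more data than the Chinchilla ratio suggests for its size. Labs may choose this deliberately to get a smaller model that is cheaper to serve.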