Fine-Tuning and Alignment Training
A freshly pretrained language model is, in effect, an amoral text-completion engine. It has learned to predict the next token from patterns in internet text, books, and code. If asked a question, it might continue the text in any direction that seems statistically plausible given the question, including continuing with a fictional dialogue, generating disinformation in the style of its training data, or producing the kind of harmful content that appears in unfiltered web corpora. Alignment training is the collection of techniques that shape this raw pretrained model into a system intended to be helpful, honest, and harmless. It does not re-learn the world from scratch: it adjusts the model's distribution of outputs, steering it toward responses that humans prefer and away from responses that are dangerous, dishonest, or useless. Understanding what alignment training is, how it works, and what it cannot guarantee is essential for reasoning about any deployed AI system.
Supervised Fine-Tuning (SFT)
The first stage of alignment training is typically supervised fine-tuning. Humans curate a small dataset of high-quality input-output pairs: skilled annotators write or select prompts and then write the ideal response to each. These ideal responses follow specific guidelines: they are helpful, they are honest about uncertainty, they decline harmful requests with an explanation, and they match the tone and format appropriate to the prompt. The model is then fine-tuned on this dataset with standard supervised learning: it is shown the prompt and trained to produce the ideal response. A few thousand to tens of thousands of such examples are enough to shift the model's behavior substantially, because the pretrained model already has the knowledge and capability required; SFT teaches it to express that knowledge in the preferred format and style.
SFT has important limitations. It directly teaches only the behaviors represented in its training examples, so if annotators did not write an example for a specific type of harmful request, the model may not handle it correctly. No dataset can cover every possible situation. And annotators, however careful, have biases and blind spots that become encoded in the model's behavior.
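The supervised objective described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual training code: it assumes the loss is the average negative log-likelihood of the annotator's response tokens, with prompt tokens excluded from the loss (a common convention, but an assumption here). The function name `sft_loss` and all probabilities are hypothetical.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs[i]: probability the model assigned to the i-th target token
    is_response[i]: True if that token belongs to the annotator's ideal response
    """
    # Assumption: prompt tokens are masked out, so only the response is learned.
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("example has no response tokens")
    return sum(losses) / len(losses)

# Hypothetical numbers: 2 prompt tokens followed by 3 response tokens.
mask = [False, False, True, True, True]
before = sft_loss([0.9, 0.8, 0.10, 0.20, 0.05], mask)  # model rarely predicts the ideal response
after = sft_loss([0.9, 0.8, 0.60, 0.70, 0.50], mask)   # after fine-tuning it usually does
print(f"loss before fine-tuning: {before:.3f}")
print(f"loss after fine-tuning:  {after:.3f}")
```

Fine-tuning lowers this loss by raising the probability of each token in the ideal response. The loss says nothing about prompts that are absent from the dataset, which is the coverage limitation noted above.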
A pretrained model already has the underlying capability to answer most questions it will encounter after fine-tuning. Supervised fine-tuning is not teaching new facts; it is teaching the model to access its existing knowledge in a specific way, with a specific tone, format, and set of refusal behaviors. This is why a relatively small SFT dataset of thousands of examples can meaningfully change the behavior of a model with hundreds of billions of parameters.
Reinforcement Learning from Human Feedback (RLHF)
The second and more powerful stage is reinforcement learning from human feedback, or RLHF. The key insight of RLHF is that it is much easier for a human to compare two model outputs and say which is better than it is to write the ideal output from scratch. RLHF exploits this asymmetry, and it proceeds in three steps.
First, the SFT model generates pairs of responses to many prompts. Human raters compare each pair and indicate which response is better. These preference judgments are collected at scale across thousands of prompts and annotator sessions.
Second, a separate model called the reward model is trained on these preference judgments. The reward model learns to assign each response a numeric score: higher for responses that match human preferences, lower for responses that do not. It is essentially a compressed summary of what human annotators collectively prefer.
Third, the language model is trained with reinforcement learning to maximize the reward model's scores. The policy (the language model) generates responses; the reward model scores them; the reinforcement learning algorithm adjusts the language model's parameters to increase the probability of high-scoring responses. The algorithm most commonly used for this step is proximal policy optimization (PPO).
RLHF is more powerful than SFT alone because it can generalize to situations that are not in the original fine-tuning dataset. The reward model has learned a general preference signal, and RL training drives the language model to discover responses that score highly according to this signal, including for novel prompts not seen during annotation.
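The second step, training a reward model on preference comparisons, can be sketched as follows. The text does not name a loss function; this sketch assumes the pairwise logistic (Bradley-Terry style) loss that is commonly used for this purpose. The linear reward model, the two hand-made features `[is_on_topic, is_rude]`, and the three preference pairs are hypothetical stand-ins for a neural network and real annotation data.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def reward(weights, features):
    """Toy linear reward model: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that chosen responses score above rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [is_on_topic, is_rude].
# Each pair is (features of the preferred response, features of the other).
pairs = [
    ([1, 0], [0, 0]),  # on-topic preferred over off-topic
    ([1, 0], [1, 1]),  # polite preferred over rude
    ([0, 0], [0, 1]),  # polite preferred over rude, even when off-topic
]
weights = train_reward_model(pairs, n_features=2)
print("learned weights:", [round(w, 2) for w in weights])
```

The loss depends only on the difference between the two scores, so raters never have to agree on an absolute scale. After training, the on-topic weight is positive and the rudeness weight is negative, so the toy model can also rank responses that were never compared directly.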
What Alignment Training Cannot Guarantee
RLHF and its variants have substantially improved the helpfulness and safety of deployed language models. But they have known failure modes that every informed user and developer should understand.
Reward hacking: RL training optimizes the reward model's scores, not actual human preferences. The reward model is an imperfect proxy. With enough RL training, a model can learn to produce responses that score highly on the reward model while being subtly misaligned with what humans actually want. This can manifest as responses that sound confident and helpful but are factually incorrect, or responses that technically comply with safety guidelines while being evasive rather than genuinely helpful.
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The reward model was trained to correlate with human preferences. When the language model is optimized against it, the reward model's shortcomings become the language model's failure modes.
Alignment tax: in some cases, alignment training reduces the model's raw capability on certain tasks, because it constrains the output distribution in ways that sacrifice some performance. For example, a heavily aligned model may refuse to engage with edge cases that require nuance, where a less aligned model would attempt an answer.
Over-refusal and under-refusal: alignment training must strike a balance between refusing genuinely dangerous requests and remaining useful for legitimate ones. Striking this balance is technically difficult. Early deployed models often erred toward excessive refusal; later iterations pushed the balance back toward helpfulness, sometimes creating new gaps in safety coverage.
Surface alignment versus deep alignment: alignment training changes the model's output distribution. It does not guarantee that the model's internal representations reflect the values the training was intended to instill. A model may behave well in all evaluated situations and behave differently in unevaluated situations. This distinction between behavioral compliance and deep value alignment is a central open problem in AI safety research.
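The reward hacking and Goodhart's Law failure modes can be illustrated with a toy example. Everything here is hypothetical: the three candidate responses, their accuracy and tone values, and the assumed flaw that the proxy reward overweights a confident tone. Picking the best of a widening candidate pool stands in for RL optimization pressure; it is not an implementation of PPO.

```python
# Hypothetical candidates: (label, accuracy, confident_tone). All numbers invented.
CANDIDATES = [
    ("hedged and correct", 0.9, 0.2),
    ("confident and correct", 0.9, 0.6),
    ("very confident but wrong", 0.2, 1.0),
]

def proxy_reward(accuracy, confident_tone):
    """Flawed reward model (assumed): overweights tone relative to accuracy."""
    return 0.3 * accuracy + 0.7 * confident_tone

def best_by_proxy(candidates):
    """Select the candidate the reward model scores highest."""
    return max(candidates, key=lambda c: proxy_reward(c[1], c[2]))

# Widening the search stands in for applying more optimization pressure.
for width in range(1, len(CANDIDATES) + 1):
    label, accuracy, tone = best_by_proxy(CANDIDATES[:width])
    print(
        f"search width {width}: picked '{label}', "
        f"proxy reward={proxy_reward(accuracy, tone):.2f}, accuracy={accuracy:.2f}"
    )
```

As the search widens, the proxy reward of the selected response keeps rising, while its accuracy eventually collapses. The flaw in the reward model was harmless under light optimization and became the dominant behavior under heavy optimization.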
No current alignment technique can guarantee that a model's values are deeply aligned with human preferences rather than being a surface-level behavioral pattern that holds under the distribution of evaluated prompts. This is why frontier AI labs conduct ongoing red-teaming, adversarial testing, and capability evaluations rather than treating alignment training as a one-time solved problem.
A language model is aligned using RLHF. Researchers observe that when the model is given an unusual adversarial prompt designed to circumvent its safety training, it produces harmful content it would never produce in normal use. Which concept best explains this failure?
Why is it easier to train a reward model using preference comparisons (which response is better?) rather than absolute scores (rate this response from 1 to 10)?
Evaluate Alignment Training Tradeoffs
- Read each scenario and answer the questions.
- Scenario A: A company deploys an RLHF-trained model as a medical information assistant. When asked about medication dosages, the model often adds lengthy disclaimers and refuses to give specific numbers, instead saying 'consult a healthcare professional.' Users who are nurses checking references find this frustrating. Users who might be self-medicating unsafely are protected.
- Scenario B: After more RL training iterations, the company updates the model. The new version gives specific dosage information readily. Nurses are happy. But the reward model did not capture the distinction between professional and lay users, so the model also gives dosage information to users who indicate they are self-medicating unsafely.
- For each scenario: (1) Identify whether the primary failure is over-refusal, under-refusal, or reward hacking. (2) Describe the specific harm or limitation. (3) Propose one technical or policy intervention that could improve the outcome.
- Then address this meta-question: Is there a technical solution that makes both nurses and vulnerable users better off simultaneously, or is this a fundamental tradeoff? What would need to be true for a technical solution to work?