AI Foundations

⏱ About 20 min · 20 XP

Fine-Tuning and Alignment

After pretraining, a language model is a powerful but raw statistical engine. Ask it a question, and it may respond by predicting more questions — because in its training data, questions often appear in Q&A documents where one question is followed by another. Ask it to summarize a document, and it may simply continue the document. The model is excellent at predicting what text comes next; it has not been taught to be helpful to a human user. The processes of fine-tuning and alignment bridge this gap — and they raise deep, unresolved questions about how to specify and measure what we actually want from an AI system.

Instruction Tuning: Teaching the Model to Follow Directions

The first step after pretraining is supervised fine-tuning (SFT), often called instruction tuning. The idea is to continue training the model on a curated dataset of (instruction, ideal response) pairs: examples of exactly the helpful behavior we want. For example:

Instruction: 'Summarize the following article in three sentences: [article text]'
Response: '[A three-sentence summary written by a human]'

Instruction: 'Write a Python function that returns the factorial of n.'
Response: '[Correct Python code with a brief explanation]'

These datasets typically contain tens of thousands to hundreds of thousands of examples covering a wide range of tasks: summarization, translation, question answering, code generation, creative writing, and more. The model is fine-tuned on this dataset: its parameters are updated further to make the desired responses more likely given the corresponding instructions.

The result is a model that has learned a new pattern: when it sees an instruction framing (like 'Please do X'), it should produce a helpful, complete response rather than simply predicting more text in the style of its pretraining corpus. This is a significant behavioral shift even though the underlying architecture is unchanged. Only the parameter values differ, and they differ because the distribution of training examples has changed.

Instruction Tuning

Instruction tuning (supervised fine-tuning, SFT) is the process of further training a pretrained language model on a dataset of (instruction, desired response) pairs. It teaches the model to respond helpfully to user requests rather than simply continuing text. The model's architecture does not change; its parameter values are updated to make the trained behaviors more probable.
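As a rough illustration of what "making the desired responses more probable" means, the sketch below computes a next-token negative log-likelihood over the response tokens of a single (instruction, response) pair. The function names, the three-token toy vocabulary, and the choice to exclude instruction tokens from the loss are assumptions made for this example (masking the instruction is common but not universal); a real training run would use a deep learning framework rather than plain Python lists.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(step_logits, target_ids, is_response):
    """Mean negative log-likelihood over response tokens only.

    step_logits[t]: model scores over the vocabulary at position t
    target_ids[t]:  the token that actually comes next at position t
    is_response[t]: True if that target token belongs to the response
    """
    total, count = 0.0, 0
    for logits, target, counted in zip(step_logits, target_ids, is_response):
        if not counted:
            continue  # assumption: instruction tokens serve as context only
        total -= log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0

# Hypothetical toy data: 3-token vocabulary, 3 positions.
# Position 0 predicts an instruction token; positions 1 and 2 predict the response.
step_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
target_ids = [1, 0, 2]
is_response = [False, True, True]

loss = sft_loss(step_logits, target_ids, is_response)
print(f"SFT loss on response tokens: {loss:.4f}")
```

Gradient descent on this quantity raises the probability of each response token given the instruction and the response so far, which is the parameter update the definition above describes.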

Instruction tuning alone is not sufficient to produce a safe, well-aligned assistant. Even with good SFT data, models can still produce harmful content, confidently wrong answers, or responses that technically follow an instruction in an unhelpful way (answering 'Can you tell me the time?' with 'Yes' is technically responsive but useless).

A second step addresses this: Reinforcement Learning from Human Feedback, abbreviated RLHF. In RLHF, human evaluators are shown pairs of model responses to the same prompt and asked which is better. These preferences are used to train a reward model, a separate neural network that learns to predict human preference scores. The original language model is then trained with reinforcement learning to produce outputs that receive high scores from the reward model, while a penalty (KL divergence from the pre-RLHF model) prevents it from drifting too far from the original behavior.

More recent variants use different feedback mechanisms. Direct Preference Optimization (DPO) achieves similar goals without a separate reward model, by training the language model directly on preference data. Constitutional AI (Anthropic's approach) uses a set of written principles and a model trained to critique and revise its own outputs according to those principles. These are active areas of research, and the best methods continue to evolve.
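The three quantities named above (a reward model trained on pairwise preferences, a KL-penalized reward, and the DPO objective) can be written as small scalar functions. This is a minimal sketch under stated assumptions: the pairwise loss uses the commonly used Bradley-Terry form, the KL term is a single-sample estimate from the log-probabilities of one sampled response, and the function names and the default beta=0.1 are illustrative choices, not values taken from this lesson.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise (Bradley-Terry style) loss: small when the reward model
    scores the human-preferred response above the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for moving away from the
    pre-RLHF (reference) model. The log-probabilities are for the same
    response, sampled from the policy, so their difference is a
    single-sample estimate of the KL divergence."""
    return reward - beta * (logp_policy - logp_reference)

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: the same pairwise loss, with the reward replaced by the
    policy-to-reference log-probability ratio, so no separate reward
    model is trained."""
    chosen_margin = logp_policy_chosen - logp_ref_chosen
    rejected_margin = logp_policy_rejected - logp_ref_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Hypothetical numbers, for illustration only.
print(reward_model_loss(2.0, 0.0))            # reward model agrees with the rater
print(reward_model_loss(0.0, 2.0))            # reward model disagrees: larger loss
print(kl_penalized_reward(1.0, -5.0, -8.0))   # reward reduced by the drift penalty
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))       # policy already favors the chosen response
```

In all three functions a larger beta ties the trained model more tightly to the reference model; in the penalized reward it scales the drift penalty, and in DPO it scales how strongly the log-ratio margin is pushed.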

What Alignment Does and Does Not Guarantee

Alignment processes improve model behavior along dimensions that humans can evaluate and provide feedback on. When they work well, they produce assistants that are considerably more helpful, honest, and careful than raw pretrained models. But alignment has genuine limitations that are important to understand honestly.

Alignment does not eliminate hallucination: A model that has been aligned to be helpful may be more likely to attempt an answer rather than admit uncertainty, which can increase confident confabulation, not reduce it.

Alignment can be gamed: Models trained on human feedback learn to produce responses that humans rate highly, which is not identical to producing responses that are true or good. Humans have biases: they tend to prefer longer, more confident-sounding answers, and they can be fooled by sophisticated, plausible-sounding text. The reward model inherits these biases.

Alignment is not specification: Alignment processes train a model to match human preferences as observed in training examples. What humans demonstrably prefer is not always what they would prefer on reflection, and neither is always what is actually good or safe. The gap between 'what the model was trained to do' and 'what we actually want' is called the alignment gap, and closing it remains an unsolved research problem.

Alignment is also a form of capability: A model that accurately predicts what humans want and how to achieve it is, in some sense, a more capable model. Alignment and capability are not inherently in tension: a better understanding of what humans want requires a better model of human values, language, and intent.
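The point that alignment can be gamed can be made concrete with a toy calculation. In the sketch below, every number is invented for illustration: three hypothetical candidate responses each have a "true quality" and a length, and the rater-derived proxy reward is assumed to be quality plus a small bonus per word. Selecting the response with the highest proxy reward then picks a different response than selecting for quality directly.

```python
# Hypothetical candidates: (label, true_quality, length_in_words).
# These values are made up to illustrate the mechanism, not measured.
candidates = [
    ("short and correct", 0.9, 40),
    ("long and hedged", 0.7, 200),
    ("long and confidently wrong", 0.3, 450),
]

def proxy_reward(true_quality, length, length_bias=0.002):
    """Assumed rater behavior: the score tracks quality plus a bonus per word."""
    return true_quality + length_bias * length

# What we actually want: the highest-quality response.
best_by_quality = max(candidates, key=lambda c: c[1])

# What optimization against the biased proxy selects.
best_by_proxy = max(candidates, key=lambda c: proxy_reward(c[1], c[2]))

print("Best by true quality:", best_by_quality[0])
print("Best by proxy reward:", best_by_proxy[0])
```

With the length bonus set to zero the two selections agree, so in this toy setup the disagreement comes entirely from the assumed bias in the ratings.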

Alignment Is Not a Safety Guarantee

Alignment techniques significantly improve model behavior, but they do not produce a provably safe or reliably honest system. They reduce the probability of undesired outputs along dimensions evaluated during training. Sufficiently unusual prompts, adversarial inputs, or misspecified reward models can still elicit harmful or false responses. Treat aligned models as substantially better than raw models — not as guaranteed.

Match each alignment term to its correct description.

Terms

Supervised fine-tuning (SFT)
RLHF
Reward model
KL divergence penalty
Alignment gap

Definitions

Using human preference judgments to train a reward model, then optimizing the LLM against it
A neural network trained to predict human preference scores for model outputs
Training on curated instruction-response pairs to teach helpful behavior
The difference between what a model was trained to do and what we actually want it to do
A term that prevents the RLHF-trained model from straying too far from its pre-RLHF behavior


Why is instruction tuning (SFT) necessary after pretraining?

A language model trained with RLHF produces longer, more confident-sounding answers even when uncertain, because human evaluators tended to rate those answers higher. This is an example of:

Design a Preference Evaluation

You are designing a human evaluation protocol for an RLHF-style training run for a tutoring AI.

  1. Write a sample instruction that the tutoring AI would commonly receive. Example: 'Explain why the sky is blue to a 10th-grade student.'
  2. Write two responses to your instruction: Response A and Response B. Make them meaningfully different: perhaps one is more accurate and one is more engaging, or one is more concise and one more thorough.
  3. Write an evaluation rubric with exactly four criteria that a human rater should use to judge which response is better. Make each criterion specific and concrete.
  4. Apply your own rubric. Which response wins? Does the rubric capture everything you care about?
  5. Identify one way your rubric might systematically produce responses that are subtly worse than you intended: a bias in your evaluation criteria that the model would learn to exploit. This is the alignment gap in miniature.