AI Foundations

⏱ About 15 min · 15 XP

Training Data vs. Test Data

Suppose your teacher gives you a practice exam and tells you it is the same exam you will take on Friday — same questions, same answers. You memorize every answer and score 100% on Friday. Have you proven you know the subject? Not really. You have proven you can memorize a list. The same trap exists in machine learning. If you evaluate a model on the same data it trained on, you are not measuring learning — you are measuring memorization. The solution is the same as in your classroom: the test must be different from the practice.

The Split: Training, Validation, and Test

Before training begins, the full dataset is divided into separate portions that serve different purposes.

Training data: the examples the model actually learns from. During the training loop, the model sees these examples, makes predictions, measures loss, and adjusts its parameters. These examples shape what the model knows.

Validation data: a held-back portion used during development to check how the model is progressing on data it has not been directly trained on. Practitioners use validation data to tune settings (like the learning rate) and decide when to stop training. The model never trains on validation data, but because developers make decisions based on it, it can indirectly influence the model.

Test data: examples held back completely until after the model is fully trained and all decisions are made. The test set is touched exactly once, at the very end, to measure the model's real-world performance. It is the most honest measure of how the model will do on data it has truly never encountered.
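To make the three portions concrete, here is a minimal sketch of a shuffled three-way split in plain Python. The function name `three_way_split`, the 15% default fractions, and the fixed seed are illustrative assumptions, not a prescribed method; real projects often use a library helper instead.

```python
import random

def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples, then cut them into train, validation, and test lists.

    The fractions and seed are illustrative defaults (assumptions).
    """
    shuffled = list(examples)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                 # set aside, used once at the very end
    val = shuffled[n_test:n_test + n_val]    # used to guide development decisions
    train = shuffled[n_test + n_val:]        # the only portion the model learns from
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Shuffling before cutting matters: if the data is sorted (by date or by label, for example), slicing without shuffling could put very different kinds of examples in each portion.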

The Golden Rule of Evaluation

Never evaluate a model's true performance on data it has seen during training or development. The test set must remain completely untouched until training is complete. Using it earlier corrupts the evaluation and gives a falsely optimistic view of the model's ability to generalize.

A typical split might be 70% training, 15% validation, 15% test, though the exact proportions depend on how much data is available. With very large datasets (millions of examples), even 1% might be enough for validation and testing. With small datasets, this split can be tricky; there may not be enough data to hold back a meaningful test set without starving the model of training examples.

Here is the danger of skipping the test split. Imagine you train a model on 85% of your data and use the remaining 15% as validation to pick the best version of the model. You have now made dozens of decisions (learning rate, number of training rounds, model size) guided by that 15%. Even though the model never trained on those examples, your choices were shaped by them. Testing on this same 15% would show inflated performance. You need a third, completely fresh set, the test set, to get an honest number.
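The inflation described above can be seen even with "models" that have no skill at all. The sketch below is a toy simulation under stated assumptions: each candidate is a hypothetical coin-flipping guesser, the labels are random, and the set sizes (30 validation examples, 1000 test examples, 50 candidates) are made up for illustration. Picking the winner by validation score rewards luck, so the winner's validation accuracy tends to sit above the 50% that a coin flipper earns on average, while its score on fresh test data has no such advantage.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def selection_bias_demo(n_candidates=50, n_val=30, n_test=1000, seed=0):
    """Pick the best of many skill-free 'models' using validation accuracy."""
    rng = random.Random(seed)
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]

    results = []
    for _ in range(n_candidates):
        # Each candidate guesses at random, so any high score is pure luck.
        val_preds = [rng.randint(0, 1) for _ in range(n_val)]
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        results.append((accuracy(val_preds, val_labels),
                        accuracy(test_preds, test_labels)))

    # Model selection: keep the candidate with the highest validation accuracy.
    best_val_acc, test_acc_of_best = max(results, key=lambda r: r[0])
    mean_val_acc = sum(v for v, _ in results) / n_candidates
    return {
        "best_val_acc": best_val_acc,          # flattering: chosen because it was high
        "mean_val_acc": mean_val_acc,          # what a typical candidate scored
        "test_acc_of_best": test_acc_of_best,  # honest estimate for the chosen one
    }

print(selection_bias_demo())
```

The same effect, usually smaller, applies to real models: the more decisions are tuned against the validation set, the less its score can be trusted as a final number.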

Match each term to its definition.

Terms

Training data
Validation data
Test data
Data leakage
Generalization

Definitions

Held-back data used to guide development decisions without training on it
Untouched data used exactly once to measure final real-world performance
When information from the test set accidentally influences training or development
A model's ability to perform well on data it has never seen before
Examples the model learns from during the training loop


Generalization: The Whole Point

The reason this split matters so much is a concept called generalization. A model that has truly learned the underlying patterns of a problem should perform well on new examples from the same distribution, not just the specific examples it trained on. A model that only performs well on training data has not generalized; it has memorized. This is called overfitting, and you will study it in depth in the next lesson. The test set exists to catch this failure before the model is deployed in the real world.

Think about it from the model's perspective: if it has truly learned 'what spam looks like,' it should correctly classify spam emails it has never seen before. If it has only memorized the training spam examples, it will fail the moment it encounters a new spammer using slightly different tactics. The test set reveals which of those two situations you are in.
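The difference between memorizing and generalizing can be shown with a toy problem. In this sketch everything is a made-up assumption for illustration: the hidden rule (label is 1 when a number is at least 50), the even/odd split, and the two tiny "models". One model stores every training answer in a lookup table; the other learns a single cutoff from the training data.

```python
def label(x):
    """Hypothetical ground truth: 1 when x is at least 50, otherwise 0."""
    return 1 if x >= 50 else 0

train_x = list(range(0, 100, 2))   # even numbers: seen during training
test_x = list(range(1, 100, 2))    # odd numbers: never seen during training
train_y = [label(x) for x in train_x]
test_y = [label(x) for x in test_x]

# Model A memorizes: it stores each training answer and guesses 0 otherwise.
lookup = dict(zip(train_x, train_y))

def memorizer(x):
    return lookup.get(x, 0)

# Model B learns a pattern: a cutoff halfway between the largest training
# example labeled 0 and the smallest training example labeled 1.
largest_zero = max(x for x, y in zip(train_x, train_y) if y == 0)
smallest_one = min(x for x, y in zip(train_x, train_y) if y == 1)
threshold = (largest_zero + smallest_one) / 2

def threshold_model(x):
    return 1 if x > threshold else 0

def accuracy(model, xs, ys):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

print(accuracy(memorizer, train_x, train_y), accuracy(memorizer, test_x, test_y))
# 1.0 0.5
print(accuracy(threshold_model, train_x, train_y), accuracy(threshold_model, test_x, test_y))
# 1.0 1.0
```

Both models look perfect on the training data. Only the held-back examples reveal that the lookup table learned nothing it can reuse.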

Real Life Has No Answer Key

Once a model is deployed in the real world, there is no test set — it must handle whatever comes its way. The test split is your only opportunity, before deployment, to simulate that reality honestly. Protect it accordingly.

Fill in the blanks to complete these key ideas.

Data the model learns from is called _____ data. Data held back to measure final performance is called _____ data. A model that performs well on new data has successfully _____.

Why should the test set never be used during model development?

A model scores 99% on training data and 61% on test data. What does this tell you?

Design a Fair Evaluation

You are building a model to predict which students in a school might need extra support in reading, based on their quiz scores, attendance, and library visit frequency.

  1. You have data on 200 students. Plan how you would split this into training, validation, and test sets. How many students in each? Justify your numbers.
  2. A classmate suggests: 'just use all 200 to train the model, then check its accuracy on those same 200 students.' Explain in two sentences why this is not a valid evaluation.
  3. Describe one way that test data could accidentally 'leak' into training. (Hint: think about how the data was collected. Are some students connected to each other?)
  4. If your model scores 88% on the training set and 74% on the test set, what would you conclude, and what would you do next?