Machine Learning & Deep Learning

⏱ About 15 min · 15 XP

Splitting Data: Train and Test

Imagine studying for an exam by memorizing the exact answer sheet, then using that same answer sheet as the test. You would score 100 percent — and learn nothing. You would also have no idea how well you actually understand the material. Machine learning has the same problem, and the solution is the same: keep the test separate from the study material.

The Train-Test Split

When you prepare a dataset for supervised learning, you divide it into two non-overlapping parts. The training set is the portion the model learns from. It sees the features and the labels. It adjusts itself based on this data, lesson after lesson, until it has found the best pattern it can. The test set is held out — locked away — until training is completely finished. The model has never seen these rows. When evaluation time comes, you show the model only the features from the test set, ask it to predict the labels, and compare its predictions against the real labels. The result is an honest measure of how well the model generalizes to new, unseen examples. A common split is 80 percent training and 20 percent test, though the exact ratio depends on how much data you have.
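The 80/20 split described above can be sketched in a few lines using only the Python standard library. The function name `train_test_split_rows`, the seed, and the toy rows are illustrative assumptions, not part of any particular library; ML libraries such as scikit-learn ship their own splitting helpers.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into non-overlapping train and test lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]                 # held out until training is finished
    train = shuffled[n_test:]                # the only rows the model may learn from
    return train, test

# Toy dataset of (feature, label) pairs. Hypothetical data for illustration only.
rows = [(x, x % 2) for x in range(1000)]
train_rows, test_rows = train_test_split_rows(rows)

print(len(train_rows), len(test_rows))
```

Shuffling before slicing matters: if the rows are sorted (for example, all cats first), an unshuffled split would put very different examples in the two sets.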

The Honest-Evaluation Rule

A model's performance on the data it was trained on is almost always optimistic, sometimes dramatically so. Only performance on held-out test data that the model has never seen tells you how the model is likely to behave in the real world. Mixing training and test data in any way breaks this guarantee.

Here is a concrete example. You have 1,000 labeled photos of cats and dogs. You split them: 800 go into training, 200 go into testing. You train a model on the 800 photos. It learns patterns — pointy ears, snouts, fur textures, eye shapes. After training, you freeze the model and bring out the 200 test photos. You give it only the images, not the labels. It predicts each one. You then count how many it got right. If accuracy on training data is 98 percent but accuracy on test data is 62 percent, you have a serious problem: the model memorized the training examples instead of learning a general pattern. This failure mode is called overfitting, and a proper train-test split is how you detect it.
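To make the memorization failure concrete, here is a small sketch with a deliberately bad "model" that only memorizes its training examples. Everything in it is a made-up stand-in: the ids play the role of photos, the labels are assigned at random so there is no general pattern to learn, and the fallback guess of "cat" is an arbitrary choice.

```python
import random

rng = random.Random(0)

# Hypothetical dataset: 1,000 unique ids with randomly assigned labels,
# so there is nothing general to learn from them.
examples = [(i, rng.choice(["cat", "dog"])) for i in range(1000)]
rng.shuffle(examples)
train, test = examples[:800], examples[800:]

# "Training" here means memorizing every training example exactly.
memory = dict(train)

def predict(feature):
    # Return the memorized label if the example was seen, otherwise guess "cat".
    return memory.get(feature, "cat")

def accuracy(rows):
    correct = sum(1 for feature, label in rows if predict(feature) == label)
    return correct / len(rows)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"gap:            {train_acc - test_acc:.2f}")
```

The memorizer is perfect on the rows it has seen and roughly at coin-flip level on the rows it has not. A large gap between the two numbers is the signal to look for.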

Validation Sets and the Peeking Problem

In real ML projects there is often a third split: the validation set. Here is why. Suppose you train a model, evaluate it on the test set, find it has 62 percent accuracy, adjust some settings, retrain, and check the test set again. You repeat this ten times. Each time you look at the test set to make decisions, you are indirectly fitting your choices to the test set. After enough iterations, the test set stops being truly unseen. The solution is to keep the test set locked until the very end — final evaluation only, once — and use a separate validation set during development to make tuning decisions. The test set is your single, honest final exam.
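One way to picture the three-way workflow is the sketch below. It assumes a made-up one-dimensional dataset and a tiny nearest-neighbor classifier written by hand; the 600/200/200 split, the candidate values of `k`, and all function names are illustrative choices rather than a recommended recipe.

```python
import random

rng = random.Random(0)

def make_row():
    # Hypothetical data: label is 1 when x > 0.6, with 10% of labels flipped.
    x = rng.random()
    label = int(x > 0.6)
    if rng.random() < 0.1:
        label = 1 - label
    return (x, label)

rows = [make_row() for _ in range(1000)]
train, val, test = rows[:600], rows[600:800], rows[800:]

def knn_predict(x, k):
    # Majority vote among the k training points closest to x.
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return int(2 * votes_for_one > k)

def accuracy(data, k):
    return sum(knn_predict(x, k) == y for x, y in data) / len(data)

# Tuning decisions use the validation set only.
candidate_ks = [1, 5, 15, 45]
val_scores = {k: accuracy(val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, after every decision is final.
final_test_accuracy = accuracy(test, best_k)
print("chosen k:", best_k)
print("final test accuracy:", round(final_test_accuracy, 3))
```

The validation scores may be slightly flattering for the chosen `k`, because `k` was picked to maximize them. The test score does not have that problem, since nothing was chosen by looking at it.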

Never Peek at the Test Set During Training

Any decision made by looking at test-set performance during development — choosing features, tuning settings, deciding when to stop training — corrupts the test set. It becomes part of training in disguise. Reserve the test set for one final evaluation after all decisions are made.
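The effect of peeking can be simulated with a small thought experiment in code. In this sketch the labels are coin flips, so no "model" can truly beat 50 percent, and each hypothetical configuration is just a random guesser. The sizes, seeds, and names are assumptions chosen for illustration.

```python
import random

N_TEST, N_FRESH, N_CONFIGS = 100, 2000, 50

# Coin-flip labels: the true achievable accuracy is 50%.
label_rng = random.Random(0)
test_labels = [label_rng.randint(0, 1) for _ in range(N_TEST)]
fresh_labels = [label_rng.randint(0, 1) for _ in range(N_FRESH)]

def guess_predictions(config_seed):
    # Each "configuration" is a random guesser with its own seed.
    g = random.Random(config_seed)
    test_preds = [g.randint(0, 1) for _ in range(N_TEST)]
    fresh_preds = [g.randint(0, 1) for _ in range(N_FRESH)]
    return test_preds, fresh_preds

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

results = []
for seed in range(1, N_CONFIGS + 1):
    test_preds, fresh_preds = guess_predictions(seed)
    results.append((accuracy(test_preds, test_labels),
                    accuracy(fresh_preds, fresh_labels)))

# Peeking: keep whichever configuration scored best on the test set.
peeked_accuracy, fresh_accuracy = max(results, key=lambda r: r[0])
print("test accuracy of the chosen configuration:", peeked_accuracy)
print("its accuracy on fresh data:              ", fresh_accuracy)
```

The winning configuration looks better than chance on the test set only because it was selected for that score. On fresh data it falls back toward 50 percent, which is what the test score would have shown if it had been used once.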

Complete the key rule about splitting data.

The ____ set is used to train the model; the ____ set is held out to give an ____ measure of real-world performance.

A student trains a model and reports 99% accuracy. You discover they evaluated the model on the same data they trained on. What is the problem?

What is overfitting?

The Peeking Experiment

  1. Write ten simple math questions on separate slips of paper, a mix of easy and hard. These are your 'dataset.'
  2. Give eight slips to a partner as their 'training set.' Tell them to study these questions and their answers.
  3. Give them two of the eight training slips back as a 'test.' Note the score.
  4. Repeat, but this time use the two slips they have NOT studied as the test. Note that score.
  5. Compare. Which score was more informative about what your partner actually knows?
  6. Write one sentence explaining why this maps to machine learning.