Data Cleaning and Preparation
Raw data is almost never model-ready. Years of experience in industry have produced a rough consensus: data scientists spend somewhere between 50 and 80 percent of their time on data — cleaning it, reformatting it, understanding it, and arguing about what to do with its problems. This is not a failure of process. It is the nature of data collected in the real world, where humans make mistakes, systems crash, sensors drift, and recording formats change without warning. Data cleaning is not glamorous, but it is the stage where the most consequential decisions about a model's future behavior are quietly made.
Missing Values
Missing values are perhaps the most common data-quality problem. A field may be absent because a patient refused to answer a question, a sensor failed, a field did not exist in an older version of a form, or an entry was simply never recorded. These are not the same situation, and the right response differs. The simplest approach is deletion: remove any row with a missing value. This is safe if missingness is random and the fraction of affected rows is small — say, under 5%. But if missingness is systematic — if, for example, sicker patients are more likely to refuse to report certain symptoms — deleting those rows removes exactly the patients the model most needs to learn about. Imputation replaces missing values with estimated ones. Mean or median imputation fills a missing numeric value with the column average or median. Mode imputation fills a missing categorical value with the most common category. More sophisticated methods use a secondary model to predict missing values from the other columns. Every imputation strategy introduces assumptions; those assumptions should be documented. A third option, often overlooked: encode missingness as a feature. Add a binary column indicating whether the original value was missing. Sometimes the fact that a value is absent is itself informative — a patient who did not report pain may genuinely have none, or may have been too ill to respond.
Statisticians distinguish three types of missingness. Missing Completely At Random (MCAR): the probability of being missing is unrelated to any variable. Missing At Random (MAR): missingness depends on observed variables. Missing Not At Random (MNAR): missingness depends on the missing value itself. Only MCAR is safe to handle with simple deletion. MNAR is the most dangerous — the missing data tells you something, and deleting it silently biases your dataset.
Outliers are values that fall far from the bulk of the distribution. A recorded age of 312 years is almost certainly a data-entry error. A recorded income of $10,000,000 in a personal finance dataset may be real, or may be a misplaced decimal. The challenge: you cannot know for certain which case applies without domain knowledge. For outliers that are clearly errors, correction or deletion is justified. For outliers that are legitimately extreme values, removing them may harm the model — the model needs to know that such values exist. A common compromise is capping, also called winsorizing: replace values above the 99th percentile with the 99th-percentile value, and below the 1st percentile with the 1st-percentile value. This preserves the extreme cases while preventing them from distorting downstream statistics. Inconsistent formats are another cleaning challenge. Dates recorded as '2023-04-15', '04/15/2023', and 'April 15, 2023' all mean the same thing, but a model — or even a preprocessing script — may treat them as three different strings. City names like 'New York', 'new york', 'NY', and 'New York City' may all refer to the same place. Standardizing formats before modeling is essential.
Normalization and Scaling
Many ML algorithms are sensitive to the numerical scale of features. If one column represents age (range: 0-100) and another represents income (range: 0-500,000), a model that computes distances or gradients will be dominated by the income column simply because its numbers are larger — even if age is equally or more informative. Normalization addresses this. Min-max scaling transforms each value in a column to fall between 0 and 1, using the formula: (value - min) / (max - min). Standardization (also called Z-score normalization) transforms each value to have mean 0 and standard deviation 1, using: (value - mean) / standard deviation. Critically, the statistics used for normalization — the min, max, mean, and standard deviation — must be computed on the training set only, then applied to the validation and test sets using those same training-set statistics. Computing statistics on the full dataset before splitting is a form of data leakage: it allows information from the test set to influence how the training data is preprocessed.
A common mistake: compute normalization statistics on the whole dataset, then split into train/validation/test. This leaks test-set information into the training process. The correct order is always split first, then compute normalization statistics on the training split only, then apply those statistics to transform all three splits.
The Three-Way Split
The training set, validation set, and test set serve distinct roles that must not be conflated. The training set is the data the model actually learns from. All gradient updates or parameter fitting happen using training-set examples only. The validation set — sometimes called the development set or dev set — is used during model development to make design decisions: which model architecture to try, which hyperparameters to adjust, whether to add or remove features. Because you look at validation performance repeatedly and adjust your decisions accordingly, the validation set does get 'used up' gradually. Your model is implicitly tuned to it. The test set is therefore the only truly unseen data. It is used once, at the end of development, to produce the final honest performance estimate. If you evaluate on the test set, see disappointing results, make changes, and evaluate again, you have effectively made the test set part of your development process — and your final number is no longer an honest estimate. Typical split ratios depend on dataset size. For a dataset of 10,000 examples: 70% train, 15% validation, 15% test is reasonable. For a dataset of 1,000,000 examples: 98% train, 1% validation, 1% test — 10,000 validation and 10,000 test examples are more than enough to estimate performance reliably, and starving the training set by using 15% for validation wastes learning data.
Match each data preparation technique to its primary purpose.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
A data scientist computes the mean and standard deviation of all features on the full dataset, then splits into train/test sets, then applies standardization. What is wrong with this?
You have a dataset of 2 million rows. Which split is most appropriate?
Clean a Dirty Dataset
- Your instructor will provide (or you will create) a small table of 20 rows representing student records with columns: student_id, age, grade_point_average, home_city, and years_in_school.
- Introduce the following problems into the table (or have a partner introduce them):
- - Three missing values in different columns
- - One clearly impossible age value (e.g., 847)
- - Two inconsistent city-name formats (e.g., 'Los Angeles' and 'los angeles')
- - One grade point average above 4.0 (impossible on a standard scale)
- Then swap datasets with a partner. For each problem you find:
- 1. Identify the type of issue.
- 2. State your chosen fix and justify it in one sentence.
- 3. Note any assumption your fix requires.
- Discuss: did you and your partner make different decisions? Which decisions would affect a trained model most?