Good Data, Bad Data
By now you know that AI needs data to learn. But not just any data — good data. The difference between high-quality and low-quality data in a training set is the difference between an AI system that works and one that is worse than useless. Computer scientists have a blunt phrase for this: 'garbage in, garbage out.' It means exactly what it sounds like. If you train an AI on bad data, you get a bad AI. No algorithm, no matter how sophisticated, can fix fundamentally broken training data.
The Four Dimensions of Data Quality
Data quality is not a single yes/no question. It has several independent dimensions, and a dataset can be excellent on one and terrible on another. Understanding all of them is essential.
Accuracy means the recorded values actually correspond to the true state of the world. A thermometer that consistently reads 5°F too high produces inaccurate data. A survey where participants misremember or deliberately mislead produces inaccurate data. An accuracy problem is sometimes called measurement error.
Completeness means the dataset has all the information it needs. Missing values (cells in a table that are blank) are one of the most common data quality problems. If 40% of the entries in a 'blood pressure' column are missing, any AI trained on that data has a huge gap in its picture of the world.
Relevance means the data actually pertains to the task at hand. Using data about American driving patterns to train an AI for Japanese roads is a relevance problem. The data is not wrong, but it does not apply.
Consistency means the same thing is always recorded in the same way. If one person's age is stored as '14', another's as 'fourteen', and another's as '14 years old', a computer cannot easily use all three. If one row records a date as '05/17/2026' and another as '2026-05-17', the dataset is inconsistent.
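The consistency examples above can be made concrete with a short Python sketch. It is only an illustration, not a complete cleaning tool: the function names and the small word-to-number table are hypothetical, and it assumes that dates written with slashes are in month/day/year order.

```python
from datetime import datetime

# Hypothetical lookup for spelled-out ages; a real one would cover more words.
WORD_TO_NUMBER = {"twelve": 12, "thirteen": 13, "fourteen": 14}

def normalize_age(raw):
    """Turn '14', 'fourteen', or '14 years old' into the integer 14."""
    words = raw.strip().lower().split()
    if not words:
        return None  # blank entry: a completeness problem, not a format one
    first_word = words[0]
    if first_word.isdigit():
        return int(first_word)
    # Unrecognized words return None so a person can review them.
    return WORD_TO_NUMBER.get(first_word)

def normalize_date(raw):
    """Turn '05/17/2026' or '2026-05-17' into '2026-05-17'."""
    # Assumption: slash dates are month/day/year.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unknown format: flag it rather than guess

for raw in ["14", "fourteen", "14 years old"]:
    print(normalize_age(raw))  # each prints 14

for raw in ["05/17/2026", "2026-05-17"]:
    print(normalize_date(raw))  # each prints 2026-05-17
```

Notice that the functions return None when they do not recognize a value. Guessing would quietly turn a consistency problem into an accuracy problem.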
Accurate: values reflect the truth. Complete: no important values are missing. Relevant: data applies to the task. Consistent: the same format and meaning are used throughout. Every real dataset struggles with at least one of these.
Let's see how these quality problems show up in a real scenario. Suppose a hospital is building an AI to predict whether a patient will be readmitted within 30 days after surgery.
Accuracy problem: Patient vital signs were recorded by a faulty monitor. Blood pressure readings for several patients are systematically too high. The AI will learn to associate those inflated readings with readmission risk, which is a completely wrong lesson.
Completeness problem: 35% of patients have no recorded smoking status, because the intake form changed partway through the data collection period. The AI must either ignore this feature or make assumptions about the missing values, both of which hurt performance.
Relevance problem: Data was collected only from patients at a single urban hospital. Rural patients, or patients with different demographics, may have very different patterns, but the AI was not trained on them.
Consistency problem: Medications are recorded by brand name in some years and by generic name in others. Aspirin and acetylsalicylic acid are the same drug, but the AI may treat them as different.
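Two of the hospital problems are easy to check with a few lines of Python. The sketch below uses made-up patient records, and the field names and lookup table are hypothetical. It measures how much of a column is missing and maps different spellings of a drug to one agreed name.

```python
# Made-up patient records; None marks a value that was never recorded.
patients = [
    {"id": 1, "smoker": True, "medication": "Aspirin"},
    {"id": 2, "smoker": None, "medication": "acetylsalicylic acid"},
    {"id": 3, "smoker": False, "medication": "aspirin"},
    {"id": 4, "smoker": None, "medication": "Acetylsalicylic Acid"},
]

# Assumed lookup table: every known spelling points to one agreed name.
CANONICAL_NAMES = {
    "aspirin": "acetylsalicylic acid",
    "acetylsalicylic acid": "acetylsalicylic acid",
}

def missing_fraction(records, field):
    """Return the share of records where the field has no value."""
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)

def canonical_medication(name):
    """Map a medication name to its agreed spelling, if we know one."""
    key = name.strip().lower()
    return CANONICAL_NAMES.get(key, key)

print(missing_fraction(patients, "smoker"))  # 0.5
print({canonical_medication(p["medication"]) for p in patients})
# {'acetylsalicylic acid'}
```

Code like this can only reveal the problems. Deciding what to do about the missing smoking status still takes human judgment.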
No amount of algorithmic sophistication fixes fundamentally bad training data. A state-of-the-art deep learning model trained on inaccurate, incomplete, irrelevant, or inconsistent data will often perform worse than a simple model trained on clean data. Data quality is not a preprocessing detail. It is the foundation.
How Data Goes Bad
Understanding how data quality breaks down is as important as knowing what good data looks like. Here are the most common culprits:
Measurement errors occur at the source: a miscalibrated sensor, a survey question that is easily misunderstood, a human transcription error when moving data from paper to digital form.
Missing values accumulate for many reasons: a sensor went offline, a patient skipped a question, data was collected for one purpose and some fields were not needed then but are needed now. Missing values are so common that data scientists have entire toolboxes for handling them.
Outliers are values that are extreme and possibly wrong. If 99 students have quiz scores between 40 and 100, and one student has a score of 14,000, that is almost certainly a data entry error. But not all outliers are errors. Some are genuine extreme cases that should be kept.
Drift happens when the world changes but the data does not keep up. A language model trained on news articles from 2015 does not know about events that happened after 2015. An AI trained to recommend movies based on 2010 viewing patterns may not reflect 2026 tastes. Data that was accurate and relevant when it was collected can become outdated.
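The quiz-score outlier can be caught with a simple range check. This is a minimal Python sketch with made-up scores. It assumes valid scores run from 0 to 100, and the function name is hypothetical. It flags suspicious values instead of deleting them, because some outliers are genuine.

```python
def flag_out_of_range(scores, low=0, high=100):
    """Return (position, value) pairs for scores outside the expected range."""
    return [
        (position, score)
        for position, score in enumerate(scores)
        if not (low <= score <= high)
    ]

# Made-up quiz scores with one obvious data entry error.
scores = [72, 88, 40, 14000, 95, 100]

suspects = flag_out_of_range(scores)
print(suspects)  # [(3, 14000)]
```

A person then decides what each flagged value means. A score of 14,000 is clearly a typo, but an unusually low score of 40 is inside the range and may be perfectly real.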
A dataset for training a medical AI has accurate, complete, and consistent data — but it was collected entirely from patients at one large hospital in New York City. Which quality dimension is most at risk?
Why can a sophisticated AI algorithm not 'fix' bad training data?
Data Quality Audit
- Below is a small dataset with deliberate quality problems. Your job is to find them all.
Name   | Age | Test Score | City     | Date of Test
-------|-----|------------|----------|-------------
Sam    | 13  | 87         | Chicago  | 2026-03-10
Alex   | 12  | 412        | chicago  | 03/15/2026
Jordan |     | 76         | Chicago  | 2026-03-22
Taylor | 13  | 91         | New York | 2026-03-22
Morgan | 12  | 84         | Chicago  | 2026-03-29
- For each row and column, identify any problems.
- Classify each problem using one of the four quality dimensions: accuracy, completeness, relevance, or consistency.
- For each problem, suggest how you would handle it if you were preparing this dataset to train an AI.
- Compare your findings with a partner. Did you catch everything?