Data Detective
A detective does not just accept what they see at first glance. They look carefully, ask questions, and notice things others might miss. Data scientists are detectives too! Before a machine ever gets to study its examples, a data detective examines the pile carefully: Are there problems? Is anything missing? Are the labels right? Is the data fair? Today YOU are the Data Detective.
The Detective's Checklist
A real data detective checks four things:
1. Clarity: Can you clearly see or read each example? Are any blurry, too dark, or impossible to understand?
2. Correctness: Does the label match the example? If a photo shows a banana but the label says 'apple,' that is wrong.
3. Variety: Do the examples show the thing in many different ways? Or do they all look exactly the same?
4. Fairness: Are all the kinds of people or things that matter included? Is any group missing?
If you spot a problem, you flag it and the team fixes it before the machine starts learning.
A data detective examines examples before a machine learns from them. They check for clarity, correctness, variety, and fairness — and flag any problems they find.
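For curious detectives (and their grown-ups), here is a small Python sketch of how the four checks could be written down for a computer. Everything in it is pretend: the three fruit examples, the field names `shows`, `label`, `blurry`, and `background`, and the helper `inspect` are all made up for this illustration. Real data checks are usually much bigger.

```python
# Pretend examples. Every field name here is invented for this sketch.
examples = [
    {"shows": "banana", "label": "banana", "blurry": False, "background": "kitchen"},
    {"shows": "banana", "label": "apple", "blurry": False, "background": "kitchen"},
    {"shows": "apple", "label": "apple", "blurry": True, "background": "kitchen"},
]

# The kinds of fruit we decided should be included (an assumption).
kinds_that_matter = {"apple", "banana", "mango"}


def inspect(examples, kinds_that_matter):
    """Run the four checks and return a list of (check, message) flags."""
    flags = []
    for number, example in enumerate(examples, start=1):
        # Clarity: can we see the example properly?
        if example["blurry"]:
            flags.append(("clarity", f"example {number} is blurry"))
        # Correctness: does the label match what the example shows?
        if example["label"] != example["shows"]:
            flags.append(("correctness",
                          f"example {number} is labeled '{example['label']}' "
                          f"but shows a {example['shows']}"))
    # Variety: do the examples come in more than one setting?
    backgrounds = {example["background"] for example in examples}
    if len(backgrounds) < 2:
        flags.append(("variety", "every example has the same background"))
    # Fairness: is any kind that matters missing completely?
    missing = kinds_that_matter - {example["shows"] for example in examples}
    for kind in sorted(missing):
        flags.append(("fairness", f"no examples of {kind}"))
    return flags


for check, message in inspect(examples, kinds_that_matter):
    print(f"{check}: {message}")
```

The sketch only flags problems. Just like in the checklist, fixing them is a separate job for the team.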
Let us practice! Here are five pretend datasets. Read each one and decide: is there a problem? What kind of problem is it?
Dataset A: A team wants to teach a machine to recognize stop signs. They collect one thousand clear photos of stop signs from many different cities, in many kinds of weather, and from many angles. The labels are all correct.
Dataset B: A team wants to teach a machine to recognize handwriting. They collect samples only from adults over 40 years old. No children's handwriting is included.
Dataset C: A team collects cat photos, but half of them are so blurry you cannot tell if it is a cat or a blob.
Dataset D: A team collects one thousand fish photos. Someone accidentally labeled fifty salmon photos as 'trout.'
Dataset E: A team collects weather data (temperatures, rainfall, and cloud cover) from fifteen different countries and all four seasons. The labels are carefully checked.
Match each dataset to its problem (or no problem).
Great work, Detective! You spotted all five situations correctly. Now you know that finding problems in data is a skill — one that takes practice and careful attention. Real data scientists spend a lot of their time doing exactly this kind of checking work. Every problem you catch before a machine starts learning is a problem the machine will never have to deal with. Fixing the data early saves a lot of trouble later.
The best data detectives ask lots of questions. Who collected these examples? When? Where? Who is included? Who might be missing? What could go wrong? Curiosity is a detective's best tool.
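One way a detective might answer "Who is included? Who might be missing?" is simply to count. The Python sketch below does that for a pretend handwriting collection like Dataset B. The list of writers and the four age groups are made up for this example; a real team would decide for itself which groups matter.

```python
from collections import Counter

# Made-up data: the age group of each person who gave a handwriting sample.
writers = ["adult over 40"] * 8

# The groups we decided should be included (an assumption for this sketch).
groups_that_matter = ["child", "teenager", "adult under 40", "adult over 40"]

counts = Counter(writers)

for group in groups_that_matter:
    how_many = counts[group]  # a Counter gives 0 for a group it never saw
    note = "  <-- missing!" if how_many == 0 else ""
    print(f"{group}: {how_many}{note}")

# Collect the groups that have no samples at all, so they can be flagged.
missing_groups = [group for group in groups_that_matter if counts[group] == 0]
```

Counting does not fix anything by itself, but it turns "Who might be missing?" into a list the team can act on.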
You are checking a dataset of fruit photos and notice that the label 'mango' appears on pictures of pineapples. What kind of problem is this?
A machine is trained to recognize children's drawings, but all the examples came from one art class in one school. What should the data detective flag?
Be the Data Detective
- Ask a grown-up to help you find a small collection of similar things: ten photos from a magazine, ten objects of the same type, or ten drawings.
- Put on your detective hat — maybe a real hat for fun!
- Check each example using the four questions:
  - Is it clear and easy to understand?
  - Is the label (if there is one) correct?
  - Are there many different varieties?
  - Are all the important kinds of things represented?
- Write your findings: How many had problems? What type of problems were they?
- Report your findings like a real detective: 'I examined ten examples. I found two clarity problems and one correctness problem. I recommend...'
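If you and your grown-up want to try this on a computer, here is a small Python sketch that counts up findings and writes the first part of the report. The list `findings` and the helper `write_report` are made up for this example. The recommendation at the end of the report is still up to you!

```python
from collections import Counter

# Made-up findings for ten examples: "ok" means no problem was found.
findings = ["ok", "clarity", "ok", "ok", "correctness",
            "ok", "clarity", "ok", "ok", "ok"]


def write_report(findings):
    """Count each kind of problem and return a short detective report."""
    problems = Counter(kind for kind in findings if kind != "ok")
    parts = []
    for kind, count in problems.items():
        plural = "s" if count != 1 else ""
        parts.append(f"{count} {kind} problem{plural}")
    found = " and ".join(parts) if parts else "no problems"
    return f"I examined {len(findings)} examples. I found {found}."


print(write_report(findings))
# prints: I examined 10 examples. I found 2 clarity problems and 1 correctness problem.
```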