AI Foundations

⏱ About 15 min · 15 XP

Data Detective

You have now built a real toolkit: you know what data is, how massive and varied it is, how datasets are structured, what labels do, what separates good data from bad, where data comes from, how bias enters and spreads, and why your personal data matters. This lesson puts that toolkit to work. You are the detective. Your case files are six real-world-style datasets. Your job is to examine each one, identify what is wrong or missing, and explain what an AI trained on it would likely do as a result.

How to Investigate a Dataset

A good data detective asks the same questions every time. Run through this checklist on every dataset you examine:

Source: Where did this data come from? Who collected it, when, and from whom?
Coverage: Who or what is included? Who or what is missing? Is the sample representative of the population the AI is supposed to serve?
Quality: Check all four dimensions: accuracy, completeness, relevance, and consistency. Any red flags?
Labels: If the dataset is labeled, who assigned the labels? What criteria were used? Are the labels consistent?
Features: What was measured? What was left out? Could any missing feature be important?
Bias: Does the distribution of examples look balanced across groups that matter? If not, who is over- or underrepresented?
Consequences: If an AI were trained on this data and deployed, who would it serve well? Who might it fail or harm?
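The Coverage and Bias questions can be turned into a quick numerical check. The sketch below is one possible way to do that, not a prescribed method: the function names `group_shares` and `underrepresented`, the 10% threshold, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's share of the records, as a fraction of the total."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, threshold=0.10):
    """List groups whose share falls below the threshold (10% is an assumed cutoff)."""
    return sorted(group for group, share in shares.items() if share < threshold)


# Hypothetical toy dataset: 20 rows tagged by region, invented for illustration.
records = (
    [{"region": "north"}] * 15
    + [{"region": "south"}] * 4
    + [{"region": "east"}] * 1
)

shares = group_shares(records, "region")
flagged = underrepresented(shares)

print(shares)   # share of rows per region
print(flagged)  # regions below the assumed 10% cutoff
```

A low share is only a prompt for further investigation. Whether a group is truly underrepresented depends on how large that group is in the population the AI is meant to serve, which the dataset alone cannot tell you.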

The Detective's Core Question

For every dataset, ask: who is in this data, and who is not — and what does that mean for anyone who depends on the AI trained from it? That single question unifies data quality, bias, source analysis, and privacy into one practical frame.

Now work through the case files.

Case File 1: School Behavior Tracking

A school district wants to build an AI to predict which students are at risk of suspension, so counselors can intervene early. They train it on five years of disciplinary records. The dataset has 4,200 rows (incidents) and includes: student ID, grade, gender, race/ethnicity, type of incident, whether a suspension was issued, and which teacher reported the incident. There are no records from the first three years of the district's newest school, which serves primarily recent immigrant families and opened when the recording system was not yet in use.

Case File 2: Wildlife Population Survey

Researchers build an AI to estimate population sizes of bird species. Training data is drawn from eBird, a citizen science platform where birdwatchers log observations. There are 12 million observations from North America and Western Europe. There are 180,000 observations from sub-Saharan Africa. There are 40,000 from Southeast Asia. The dataset has no label for whether an observation was made by an expert or a novice.

Case File 3: Customer Churn Prediction

An internet service provider trains an AI to predict which customers are likely to cancel their service in the next 60 days, so they can offer retention deals. Training data: 800,000 customer accounts with features including account age, monthly bill, customer service call frequency, data usage, and a binary label: churned (1) or stayed (0). 96% of the dataset is labeled 'stayed.' The data was collected between 2017 and 2022.
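The 96% figure in Case File 3 is easier to reason about with the arithmetic written out. The sketch below assumes that exactly 96% of the 800,000 accounts are labeled 'stayed', and it scores a hypothetical do-nothing baseline that predicts 'stayed' for every customer. The baseline is an illustration device, not something the case file says the provider built.

```python
# Counts derived from Case File 3, assuming the 96% figure is exact.
total_accounts = 800_000
stayed = total_accounts * 96 // 100   # accounts labeled 0 (stayed)
churned = total_accounts - stayed     # accounts labeled 1 (churned)

# Hypothetical baseline: predict 'stayed' for every single account.
correct_predictions = stayed          # right on every stayer, wrong on every churner
churners_caught = 0                   # it never predicts 'churned'

accuracy = correct_predictions / total_accounts
recall_on_churners = churners_caught / churned

print(f"stayed={stayed}, churned={churned}")
print(f"accuracy={accuracy:.2f}, recall on churners={recall_on_churners:.2f}")
```

Under these assumptions the baseline is right on 96 of every 100 accounts while identifying none of the customers the provider actually wants to find. Accuracy alone can therefore look strong on a heavily imbalanced dataset even when the model is useless for its purpose.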

Case File 4: Hiring Recommendation

A corporation trains an AI to screen resumes and rank candidates for software engineering positions. Training data: 50,000 historical resumes with a label indicating whether the candidate was invited to interview ('yes'/'no'). The resumes span 15 years of hiring decisions. Internally, the company acknowledges that before 2018, its engineering teams were over 90% men, and interviewers were also predominantly men. Feature set includes: university attended, degree, years of experience, programming languages listed, and prior company names.

Case File 5: Medical Symptom Checker

A tech startup builds an AI symptom checker for common illnesses. Training data: 200,000 patient records from three large hospital systems in the American Midwest. Features include age, symptoms reported by the patient (in English), vital signs, and diagnosis. The hospitals serve mostly English-speaking patients; 3% of records come from patients whose primary language is not English. There is no feature recording whether the diagnosis was confirmed by a specialist or was a general practitioner's initial assessment.

Case File 6: Autonomous Vehicle Pedestrian Detection

A self-driving car company trains its pedestrian detection system using video footage captured in San Francisco, Seattle, and Boston over 18 months. The footage covers 24-hour periods but 78% of it is daytime footage. 12% is night with streetlights. 10% is dawn, dusk, or adverse weather. The training dataset does not include footage from cities outside the US.

No Perfect Datasets Exist

Every case file above has real problems — but so does every real-world AI training dataset. The goal of data investigation is not to find the perfect dataset (it does not exist) but to understand the imperfections clearly enough to anticipate where the AI might fail and for whom.

Match each case file to the most significant data problem it illustrates.

Terms

Case File 1 (School Behavior)
Case File 2 (Wildlife Survey)
Case File 3 (Customer Churn)
Case File 4 (Hiring)
Case File 6 (Pedestrian Detection)

Definitions

Underrepresentation of difficult conditions — night, adverse weather, non-US cities
Historical bias baked into labels from past discriminatory decisions
Heavy class imbalance — one outcome dominates the dataset
Severe geographic imbalance from citizen science participation patterns
Missing data from an underserved community due to a late system rollout


In Case File 3, 96% of examples are labeled 'stayed.' Why is this a data quality problem for an AI trained on it?

In Case File 4, why is historical bias especially hard to correct?

Full Investigation: Case File 5

Case File 5 is your deep-dive case. Using the detective checklist from the start of this lesson, write a complete investigation report.

  1. Source: Who generated this data and in what context?
  2. Coverage: Who is represented? Who is missing? What does 3% non-English speakers mean for an AI meant to serve a broader population?
  3. Quality: Which of the four quality dimensions (accuracy, completeness, relevance, consistency) are at risk? Give a specific example for each one you identify.
  4. Labels: Is the diagnosis feature consistently generated? What does the mix of specialist and general practitioner diagnoses mean?
  5. Bias: What groups are most likely to be underserved by an AI trained on this data?
  6. Consequences: Describe a specific, realistic scenario where this AI fails a patient because of the dataset's weaknesses.
  7. Recommendation: Name one concrete change to data collection that would improve the dataset most.

Write your investigation as if you are presenting findings to the company's engineering team. Be specific, be honest about uncertainty, and be constructive.