Machine Learning & Deep Learning

⏱ About 15 min · 15 XP

Module Check

You have covered the full data pipeline: organizing examples into datasets, choosing features, attaching labels, cleaning errors, splitting for honest evaluation, handling class imbalance, identifying where bias hides, and designing a dataset from scratch. This lesson ties every thread together. Work through it carefully — the questions are designed to test understanding, not memorization.

Key Terms Review

Multi-Topic Quiz

A row in a dataset represents a single house listing. The columns are: square footage, number of bedrooms, zip code, year built, and sale price. You want to predict sale price. Which columns are features?

You discover that 30% of rows in your dataset have a missing value in the 'income' column. You decide to fill each missing value with the average income from the entire dataset, including test rows. What mistake are you making?
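The scenario above can be made concrete with a small sketch. It assumes a toy `income` column with hypothetical values, and the helper names `fit_mean` and `impute` are illustrative only. It compares an average computed from training rows alone with one computed from all rows, test rows included.

```python
from statistics import mean

def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)

def impute(values, fill):
    """Replace each missing (None) value with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical toy data: None marks a missing income.
train_income = [40_000, None, 60_000, 50_000, None]
test_income = [None, 200_000]

# Train-only statistic: computed from training rows alone.
train_mean = fit_mean(train_income)
train_filled = impute(train_income, train_mean)
test_filled = impute(test_income, train_mean)  # reuse the training statistic

# Statistic computed over everything: the test rows shift the fill value.
leaky_mean = fit_mean(train_income + test_income)

print(train_mean, leaky_mean)
```

In this toy data a single test row moves the fill value from 50,000 to 87,500, so information from the test set reaches the training rows before evaluation.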

A fraud detection model achieves 99.2% accuracy on a dataset where 99% of transactions are legitimate. What should you do before celebrating?
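A quick arithmetic check helps put the 99.2% figure in context. The sketch below uses a hypothetical set of 1,000 transactions with 10 fraud cases (1%) and scores a "model" that labels every transaction as legitimate. The function names are illustrative and not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class examples that were predicted positive."""
    actual_pos = sum(1 for t in y_true if t == positive)
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    return true_pos / actual_pos

# Hypothetical data: 1 = fraud, 0 = legitimate.
y_true = [1] * 10 + [0] * 990
always_legit = [0] * 1000  # baseline that never flags fraud

baseline_accuracy = accuracy(y_true, always_legit)
baseline_recall = recall(y_true, always_legit)

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

The baseline reaches 99% accuracy while catching no fraud at all, which is why accuracy alone says little here and a per-class measure such as recall on the fraud class is worth checking.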

A hiring algorithm is trained on five years of past hiring decisions that favored candidates from elite universities. The data is accurate — those hires really happened. What type of bias is present?

You train a model, check its accuracy on the test set, adjust some features, retrain, and check the test set again — ten times. What problem does this create?

Which of the following best describes a labeled dataset?

The Data Pipeline in One View

Every ML project moves through the same data pipeline:

  1. Define the prediction question and label.
  2. Identify features that are measurable, available, and relevant.
  3. Collect examples with a representative sampling strategy.
  4. Clean: remove duplicates, fix errors, handle missing values.
  5. Split: training set to learn from, test set for final honest evaluation.
  6. Check class balance; address skew if needed.
  7. Audit for bias: who is represented, who is not, what the labels actually measure.

Each step protects the step after it. Skip one, and a flaw quietly flows forward.
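The clean, split, and balance-check steps of this pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a complete pipeline: the rows are hypothetical, "handle missing values" is simplified to dropping incomplete rows, and the helpers `clean`, `split`, and `class_balance` are names invented for this example.

```python
import random
from collections import Counter

def clean(rows):
    """Drop exact duplicates and rows with any missing (None) value."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or any(v is None for v in row.values()):
            continue
        seen.add(key)
        kept.append(row)
    return kept

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def class_balance(rows, label="label"):
    """Return the share of each label value among the rows."""
    counts = Counter(r[label] for r in rows)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

# Hypothetical toy rows: two features and a binary label.
rows = [
    {"sqft": 900, "beds": 2, "label": 0},
    {"sqft": 900, "beds": 2, "label": 0},      # exact duplicate
    {"sqft": 1500, "beds": None, "label": 1},  # missing feature
    {"sqft": 1200, "beds": 3, "label": 0},
    {"sqft": 2000, "beds": 4, "label": 1},
    {"sqft": 1100, "beds": 2, "label": 0},
    {"sqft": 1750, "beds": 3, "label": 0},
    {"sqft": 1300, "beds": 3, "label": 0},
    {"sqft": 2400, "beds": 5, "label": 1},
    {"sqft": 1000, "beds": 1, "label": 0},
]

cleaned = clean(rows)
train, test = split(cleaned)
balance = class_balance(train)
print(len(cleaned), len(train), len(test), balance)
```

Dropping rows is only one way to handle missing values. If values are filled in instead, the fill statistic should be computed after the split, from the training rows only.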

Capstone: Dataset Design Review

  1. Step 1: Read this scenario. A city wants to predict which traffic intersections are likely to have an accident in the next month, so they can send inspectors to check signal timing and road markings.
  2. Step 2: Define the prediction task precisely: write the question, define one example (one row), and define the label.
  3. Step 3: Propose five features. For each, justify why it is measurable, available before the prediction period, and likely related to accident risk.
  4. Step 4: Identify one potential sampling bias in how the city might collect this data. How would you reduce it?
  5. Step 5: Identify one potential historical bias risk. What past pattern might the data encode that should not be reinforced?
  6. Step 6: Describe your train-test split strategy. Would you split randomly or by time? Why?
  7. Step 7: The city asks: 'Is this model fair?' Write two sentences on what you would check to answer that honestly.
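For Step 6, one option to consider is a split by time, sketched below. The field names (`intersection_id`, `month`, `accident_next_month`) and the rows are hypothetical, chosen only to show the mechanics: every training row comes from before a cutoff date and every test row from the cutoff onward.

```python
from datetime import date

def split_by_time(rows, cutoff):
    """Put rows before the cutoff in train and rows at or after it in test."""
    train = [r for r in rows if r["month"] < cutoff]
    test = [r for r in rows if r["month"] >= cutoff]
    return train, test

# Hypothetical rows: one row per intersection per month.
rows = [
    {"intersection_id": "A", "month": date(2023, 1, 1), "accident_next_month": 0},
    {"intersection_id": "B", "month": date(2023, 1, 1), "accident_next_month": 1},
    {"intersection_id": "A", "month": date(2023, 2, 1), "accident_next_month": 0},
    {"intersection_id": "B", "month": date(2023, 2, 1), "accident_next_month": 0},
    {"intersection_id": "A", "month": date(2023, 3, 1), "accident_next_month": 1},
    {"intersection_id": "B", "month": date(2023, 3, 1), "accident_next_month": 0},
]

cutoff = date(2023, 3, 1)
train, test = split_by_time(rows, cutoff)
print(len(train), len(test))
```

A split like this mirrors how the model would be used, since it learns from past months and is scored on later ones. A random split would mix future months into training, which can make the evaluation look better than real deployment would.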