Machine Learning & Deep Learning


Building a Tiny Dataset

You have studied datasets from the outside — reading, analyzing, and cleaning them. Now you will think like the person who builds one from scratch. Even a tiny dataset, designed well, teaches you more about machine learning than memorizing a hundred definitions. And designing it badly teaches you even more — about every mistake that makes real datasets fail.

Step One: Define the Prediction Task

Every dataset starts with a question. Before you collect a single data point, write down exactly what you want to predict and for whom. Weak question: 'Something about plants.' Stronger question: 'Can I predict whether a houseplant will still be alive after two weeks, based on information I can observe when I first get it?' A sharp prediction question tells you three things automatically: what your label is (alive or not after two weeks), what your examples are (houseplants), and roughly what your features should cover (properties observable at purchase time). Write your question before you collect anything. Changing it later may mean throwing away data.

Start with the Question, Not the Data

The most common beginner mistake in ML is collecting data first and asking what to predict later. This almost always produces a messy, ill-defined dataset. The right order is: sharpen the prediction question, identify what counts as one example, list your features and label, then collect.

Step Two: Define One Example

Write a single sentence: 'One row in my dataset represents one ___.' For the plant dataset: 'One row represents one houseplant observed at the time of purchase.' This sentence prevents two common mistakes: accidentally mixing different types of things into the same table (one row is a plant, another is a watering event; those belong in separate tables), and accidentally double-counting (the same plant measured twice on different days creates two rows that look independent but are not).

Step Three: List Your Features

For each candidate feature, ask the three questions from Lesson 2: Is it measurable? Is it available at prediction time? Is it likely related to the outcome? Cut any feature that fails one of these tests.

Step Four: Plan Your Label

Who will assign it? When? How will you ensure consistency? If two people could look at the same example and disagree on the label, you need clearer labeling rules.
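The 'one row represents one ___' sentence can also be written down as a small record type, which makes double-counting easy to check for. The sketch below is only an illustration: the field names (`plant_id`, `pot_diameter_cm`, `leaf_count`) and the sample values are hypothetical and not part of the lesson's dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantRow:
    """One row represents one houseplant observed at the time of purchase."""
    plant_id: str                # hypothetical identifier, used to catch double-counting
    pot_diameter_cm: float       # hypothetical feature, measurable at purchase
    leaf_count: int              # hypothetical feature, measurable at purchase
    alive_after_two_weeks: bool  # the label, assigned two weeks after purchase


def find_duplicate_plants(rows):
    """Return the plant_ids that appear in more than one row."""
    counts = Counter(row.plant_id for row in rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Made-up rows; the third one re-measures plant p01 on a different day.
rows = [
    PlantRow("p01", 12.0, 8, True),
    PlantRow("p02", 9.5, 3, False),
    PlantRow("p01", 12.0, 9, True),
]

duplicates = find_duplicate_plants(rows)
print(duplicates)
```

A non-empty result means some rows are not independent examples. Whether to drop the repeats or move them to a separate table of observations is a design decision the sketch leaves open.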

Collection, Sampling, and Size

Once your design is clear, you collect data. Sampling strategy matters. If you only collect plants from one flower shop, your dataset may not generalize to plants from other sources. If you only collect in summer, your model might not predict accurately for plants bought in winter. Think deliberately about who or what your examples represent, and whether that matches the population your model will eventually serve. For a tiny dataset meant for learning, thirty to fifty rows are enough to practice every step of the pipeline. For a real ML project, hundreds to thousands of rows are a reasonable minimum, and millions are common for complex tasks. Put quality ahead of speed: it is better to collect 40 rows carefully than 400 rows sloppily. Every row with wrong feature values or an incorrect label is actively harmful, because it teaches the model a lie.
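A simple way to check a sampling plan is to count how many examples fall into each group the model is expected to serve and list the groups with none. The sketch below assumes a hypothetical collection log of (shop, season) pairs; the shop names and the set of expected groups are invented for illustration.

```python
from collections import Counter

# Made-up collection log: (shop, season) for each plant collected so far.
collected = [
    ("shop_a", "summer"),
    ("shop_a", "summer"),
    ("shop_b", "summer"),
    ("shop_a", "winter"),
]

# Groups the finished model is expected to serve (an assumption for this sketch).
expected_shops = {"shop_a", "shop_b", "shop_c"}
expected_seasons = {"summer", "winter"}


def coverage_gaps(records, shops, seasons):
    """Return the (shop, season) combinations that have no collected examples."""
    seen = Counter(records)
    return sorted(
        (shop, season)
        for shop in shops
        for season in seasons
        if seen[(shop, season)] == 0
    )


gaps = coverage_gaps(collected, expected_shops, expected_seasons)
for shop, season in gaps:
    print(f"no examples yet: {shop} in {season}")
```

An empty gap list does not prove the sample is representative. It only shows that no expected group is missing entirely.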

Write a Data Dictionary

A data dictionary is a short document that defines every column: its name, data type, what it measures, its units, and the range of acceptable values. Write it before you collect. It prevents errors (everyone collecting data uses the same definitions) and helps anyone else understand your dataset later.
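A data dictionary can also be kept in a machine-readable form, so that collected rows can be checked against it. The following is a minimal sketch under stated assumptions: the column names, units, ranges, and allowed values are hypothetical examples for the houseplant dataset, and `validate_row` is an illustrative helper, not a standard library function.

```python
# Hypothetical data dictionary: one entry per column.
DATA_DICTIONARY = {
    "pot_diameter_cm": {
        "type": float,
        "units": "cm",
        "description": "Outer diameter of the pot at purchase",
        "min": 3.0,
        "max": 60.0,
    },
    "light_level": {
        "type": str,
        "units": None,
        "description": "Light at the spot where the plant will live",
        "allowed": {"low", "medium", "high"},
    },
    "alive_after_two_weeks": {
        "type": bool,
        "units": None,
        "description": "Label: is the plant alive 14 days after purchase?",
    },
}


def validate_row(row, dictionary):
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for column, spec in dictionary.items():
        if column not in row:
            problems.append(f"{column}: missing")
            continue
        value = row[column]
        if not isinstance(value, spec["type"]):
            problems.append(f"{column}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{column}: {value!r} is below min {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{column}: {value!r} is above max {spec['max']}")
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{column}: {value!r} is not an allowed value")
    return problems


good_row = {"pot_diameter_cm": 12.0, "light_level": "medium", "alive_after_two_weeks": True}
bad_row = {"pot_diameter_cm": 120.0, "light_level": "bright", "alive_after_two_weeks": True}

good_problems = validate_row(good_row, DATA_DICTIONARY)
bad_problems = validate_row(bad_row, DATA_DICTIONARY)
print(good_problems)
print(bad_problems)
```

The type check here is deliberately strict: a diameter entered as the integer `12` would be flagged because the dictionary says `float`. A real project might choose to accept both.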

Flashcards

You want to build a dataset to predict whether a student will enjoy a book. What should you decide FIRST?

A classmate collects 200 plant observations, all from the same greenhouse, in the same week of the year. What risk does this introduce?

Design Sprint: Build a Dataset Plan

  1. Write your prediction question in one precise sentence. Make sure it includes what you want to predict and for whom.
  2. Write the sentence 'One row in my dataset represents one _____.'
  3. List five features. For each, write: data type (numerical/categorical/boolean), how you will measure it, and whether it will be available at prediction time.
  4. Define your label: what are the possible values, and how will you determine the correct one?
  5. Write two sentences about your sampling strategy: who or what will you collect examples from, and what groups or scenarios might you be missing?
  6. Write three rows of fake but realistic data that fit your design.
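The last part of the sprint can be tried in code as well: screen the candidate features with the three questions, then write a few fake rows in the resulting column order. The sketch below is one possible layout, not a required one. All feature names, screening answers, and row values are invented, and it uses four candidates rather than the five the sprint asks for, to keep the example short.

```python
import csv
import io

# Hypothetical candidate features with answers to the three screening questions.
candidates = [
    {"name": "pot_diameter_cm", "measurable": True,
     "available_at_prediction": True, "plausibly_related": True},
    {"name": "light_level", "measurable": True,
     "available_at_prediction": True, "plausibly_related": True},
    # Only known after the two weeks have passed, so it cannot be a feature.
    {"name": "days_until_first_wilt", "measurable": True,
     "available_at_prediction": False, "plausibly_related": True},
    # Easy to record, but there is no obvious link to survival.
    {"name": "price_tag_color", "measurable": True,
     "available_at_prediction": True, "plausibly_related": False},
]

# Keep only the features that pass all three questions.
kept = [
    c["name"]
    for c in candidates
    if c["measurable"] and c["available_at_prediction"] and c["plausibly_related"]
]

columns = ["plant_id"] + kept + ["alive_after_two_weeks"]

# Three fake but realistic rows that fit the design.
fake_rows = [
    {"plant_id": "p01", "pot_diameter_cm": 12.0, "light_level": "medium",
     "alive_after_two_weeks": True},
    {"plant_id": "p02", "pot_diameter_cm": 9.5, "light_level": "low",
     "alive_after_two_weeks": False},
    {"plant_id": "p03", "pot_diameter_cm": 20.0, "light_level": "high",
     "alive_after_two_weeks": True},
]

# Write to an in-memory buffer; a real project would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(fake_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the fake rows first is a cheap way to find design problems, such as a column nobody knows how to fill in, before any real collection starts.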