Datasets and Rows
Every time you have used a recommendation, seen a translated sentence, or heard a voice assistant answer a question, a dataset was involved. A model cannot learn from nothing — it needs organized examples, and a dataset is exactly that: a structured collection of examples. Understanding how datasets work is the first step in understanding machine learning.
What a Dataset Is
A dataset is a collection of examples about the same subject, arranged so that each example has the same set of information recorded about it. Think of a school nurse's logbook. Each visit is one example. For every visit the nurse records the same things: date, student name, grade, complaint, and action taken. That logbook is a dataset. In machine learning we almost always store datasets as tables — rows and columns. One row is one example. The columns are the different pieces of information recorded for every example.
In a dataset, each row represents a single example — one observation, one record, one data point. Every row has exactly the same columns as every other row. Missing a value is still a row; a row is defined by its position, not by being complete.
Here is a tiny dataset about houseplants: Plant Name | Pot Size (cm) | Sunlight (hrs/day) | Watered Daily? | Healthy? Peacock Fern | 12 | 2 | No | Yes Basil | 8 | 6 | Yes | Yes Cactus | 6 | 8 | No | Yes Mint | 10 | 3 | Yes | No Each row is one plant. Each column records the same attribute for every plant. The last column — Healthy? — is the outcome we might want a model to predict. The other columns are the information we give the model to predict it.
Columns: The Structure of Information
Columns define what gets measured. Every column has a name and a data type. Data types matter a great deal in machine learning. Numerical columns hold numbers that can be compared or added — height in centimeters, temperature in degrees, price in dollars. Machines handle these naturally. Categorical columns hold labels from a fixed set of options — color (red, blue, green), country, species. You cannot average them the way you average numbers. Boolean columns hold exactly two values — yes/no, true/false, 0/1. Text columns hold free-form written language — product reviews, news headlines, student essays. These are the hardest for models to work with directly, because the same meaning can be expressed in countless ways.
A dataset with 500 carefully collected, accurate rows often trains a better model than a dataset with 50,000 sloppy, error-filled rows. Quality and quantity both matter — but quality comes first.
Match each term to its definition.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
A table of weather records has 3,000 rows and 7 columns. How many examples does the dataset contain?
Which of the following is a categorical column?
Design Your Own Dataset
- Step 1: Choose a subject you know well — your classmates, your music library, the plants in your home, or anything else.
- Step 2: Decide what makes a single example. Write it down: 'Each row is one _____.'
- Step 3: List five columns you would record. For each column, write the data type: Numerical, Categorical, Boolean, or Text.
- Step 4: Collect or invent five real rows of data.
- Step 5: Look at your table. Is every row complete? Are any values missing? Note it — the next lessons explain why that matters.