From Data to Datasets
You now know that data is recorded information, and that the world generates it at staggering scale. But a pile of raw data points is like a pile of random puzzle pieces — not useful until they are organized. The moment you arrange data into a structured collection with a clear purpose, you have a dataset. Understanding what a dataset is, and how it is organized, is the bridge between raw data and trainable AI.
What Makes a Dataset
A dataset is an organized collection of related data gathered for a specific purpose. The two key words are organized and related. If you write down today's temperature, then a sentence from a novel, then a photo of your lunch, you have three data points, but not a dataset, because they are not organized or related. A dataset has structure: all its data points describe the same kind of thing, collected in a consistent way.

The most common structure for datasets is a table: rows and columns. Each row (also called an example, record, or instance) represents one thing being described: one person, one email, one photo, one day. Each column (also called a feature, attribute, or variable) represents one type of information recorded about every example.

Here is a small example dataset. Imagine a study tracking students' sleep and quiz scores:

Student | Hours of Sleep | Bedtime | Quiz Score
--------|----------------|---------|-----------
Alice   | 7.5            | 10 PM   | 88
Bruno   | 6.0            | 11 PM   | 74
Carla   | 8.0            | 9 PM    | 92
Dawn    | 5.5            | 12 AM   | 61
Ezra    | 7.0            | 10 PM   | 82

Five rows, four columns. Each row is one student. Each column is one attribute measured for every student. That is the anatomy of a tabular dataset.
Row = one example (one thing being described). Column = one feature (one type of information). Every row has a value for every column. This table structure is the most common way data is organized for AI training.
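To make the row and column idea concrete, here is a minimal Python sketch of the sleep-and-scores table. It assumes one simple representation, a list of dictionaries, which is only one of several reasonable choices. The column keys and the helper names `get_row` and `get_column` are invented for this illustration.

```python
# The sleep-and-scores table as a list of rows.
# Each row is a dictionary mapping a column name to that row's value.
dataset = [
    {"student": "Alice", "hours_of_sleep": 7.5, "bedtime": "10 PM", "quiz_score": 88},
    {"student": "Bruno", "hours_of_sleep": 6.0, "bedtime": "11 PM", "quiz_score": 74},
    {"student": "Carla", "hours_of_sleep": 8.0, "bedtime": "9 PM", "quiz_score": 92},
    {"student": "Dawn", "hours_of_sleep": 5.5, "bedtime": "12 AM", "quiz_score": 61},
    {"student": "Ezra", "hours_of_sleep": 7.0, "bedtime": "10 PM", "quiz_score": 82},
]


def get_row(rows, index):
    """Return one example: every recorded value for a single student."""
    return rows[index]


def get_column(rows, name):
    """Return one feature: the same kind of value taken from every row."""
    return [row[name] for row in rows]


first_example = get_row(dataset, 0)
quiz_scores = get_column(dataset, "quiz_score")

print(first_example["student"])  # Alice
print(quiz_scores)               # [88, 74, 92, 61, 82]
```

Reading across a dictionary gives you one example. Collecting the same key from every dictionary gives you one feature.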
Not all datasets are tables. Image datasets organize thousands or millions of image files, each tagged with information about what is in the image. Audio datasets are collections of sound recordings with associated text or labels. Text datasets are corpora — large collections of written language. But even these non-tabular datasets share the same principle: they are organized, related collections of data, gathered for a purpose. For this module, you will mostly think about tabular datasets, because they make the key ideas easiest to see. Everything you learn here applies to the messier, richer types of datasets that modern AI systems actually use.
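Even an image dataset can be pictured as an organized list of related records. The sketch below is a simplified illustration, not how any particular image dataset is stored. The file names, the labels, and the helper `count_labels` are all hypothetical.

```python
from collections import Counter

# Hypothetical file names and labels, invented for illustration only.
# Each record pairs one image file with a tag describing what it shows.
image_dataset = [
    {"file": "images/photo_001.jpg", "label": "cat"},
    {"file": "images/photo_002.jpg", "label": "dog"},
    {"file": "images/photo_003.jpg", "label": "cat"},
    {"file": "images/photo_004.jpg", "label": "bird"},
]


def count_labels(examples):
    """Count how many examples carry each label."""
    return Counter(example["label"] for example in examples)


label_counts = count_labels(image_dataset)
print(label_counts["cat"])  # 2
```

The images themselves are not tabular, but the collection still follows the same principle: every record describes the same kind of thing in a consistent way.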
Features: What You Measure
Choosing which columns to include in a dataset is one of the most important decisions in building an AI system. The columns you choose are called features, and they determine what the AI can learn from. In the sleep-and-scores example above, the features are hours of sleep, bedtime, and quiz score. They were chosen because the researchers thought these might be related. But notice what was left out: what subjects the quizzes covered, whether students had after-school jobs, what they ate for breakfast, whether they were fighting a cold, their general academic history. All of those things might also matter — but they were not collected. This is a fundamental truth about datasets: they always reflect choices. Someone decided what to measure and what to ignore. Those choices shape what the AI can and cannot learn. A medical AI trained on a dataset that tracked 50 health features might still miss the crucial 51st feature that was never recorded.
An AI cannot learn from information that was never collected. If a dataset about loan applications does not include neighborhood income data, the AI cannot use that information directly. But if the dataset does include zip codes, the AI might learn to use them as a proxy for income, which can recreate the very bias that leaving income out was meant to avoid. The choice of features matters enormously.
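The idea that a model only sees the columns someone chose to record can be sketched in a few lines of Python. This is an illustrative sketch, not a real training pipeline. The helper `select_features` and the column name `ate_breakfast` are assumptions made up for the example.

```python
# Two rows from the sleep study, holding only the columns that were collected.
collected = [
    {"hours_of_sleep": 7.5, "bedtime": "10 PM", "quiz_score": 88},
    {"hours_of_sleep": 6.0, "bedtime": "11 PM", "quiz_score": 74},
]


def select_features(rows, feature_names):
    """Keep only the named columns; fail loudly if one was never collected."""
    available = set(rows[0]) if rows else set()
    missing = [name for name in feature_names if name not in available]
    if missing:
        raise ValueError(f"never collected: {missing}")
    return [{name: row[name] for name in feature_names} for row in rows]


# Choosing features that exist works as expected.
inputs = select_features(collected, ["hours_of_sleep", "bedtime"])
print(inputs[0])  # {'hours_of_sleep': 7.5, 'bedtime': '10 PM'}

# Asking for a feature nobody recorded cannot succeed, however useful it might be.
try:
    select_features(collected, ["hours_of_sleep", "ate_breakfast"])
except ValueError as error:
    missing_message = str(error)
    print(missing_message)
```

No amount of clever code after this point can recover `ate_breakfast`. The only fix is to go back and collect it.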
Size and Shape
When data scientists talk about a dataset, one of the first things they describe is its shape: how many rows and how many columns.

A dataset with 5 rows is a tiny experiment. A dataset with around 1 million rows is large enough to train a solid image classifier. The largest language models are trained on far more, with datasets that run to billions of examples or beyond.

The number of columns also matters. A dataset with 3 features gives an AI limited information per example. A dataset with 500 features gives it much more, but also risks overwhelming it with irrelevant or redundant information. As features are added, the examples are spread ever more thinly across the possible combinations of values, a challenge called the curse of dimensionality.

Real AI training datasets are almost never as clean and tidy as the five-row example above. They are messy, inconsistent, sometimes enormous, and always imperfect. Learning to work with imperfect datasets is one of the core skills in AI. That messiness is the subject of Lesson 5.
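Shape and the curse of dimensionality can both be checked with simple arithmetic. The Python sketch below makes a simplifying assumption: each feature is split into 10 equal bins, so a dataset with d features has 10 to the power d possible cells. The helper names `shape` and `rows_per_cell` are invented for this illustration.

```python
def shape(rows):
    """Return (number of rows, number of columns) for a list-of-dicts table."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    return n_rows, n_cols


def rows_per_cell(n_rows, n_features, bins_per_feature=10):
    """Average number of rows per cell when every feature is split into bins.

    Assumption for illustration: each feature has the same number of bins,
    so the number of cells is bins_per_feature ** n_features.
    """
    n_cells = bins_per_feature ** n_features
    return n_rows / n_cells


# A compact copy of the five-row sleep study: 5 rows, 4 columns.
sleep_study = [
    {"student": "Alice", "hours_of_sleep": 7.5, "bedtime": "10 PM", "quiz_score": 88},
    {"student": "Bruno", "hours_of_sleep": 6.0, "bedtime": "11 PM", "quiz_score": 74},
    {"student": "Carla", "hours_of_sleep": 8.0, "bedtime": "9 PM", "quiz_score": 92},
    {"student": "Dawn", "hours_of_sleep": 5.5, "bedtime": "12 AM", "quiz_score": 61},
    {"student": "Ezra", "hours_of_sleep": 7.0, "bedtime": "10 PM", "quiz_score": 82},
]

print(shape(sleep_study))  # (5, 4)

# With one million rows, adding features thins out the data very quickly.
for n_features in (3, 6, 10):
    print(n_features, rows_per_cell(1_000_000, n_features))
```

With 3 features there are 1,000 cells and about 1,000 rows in each. With 10 features there are 10 billion cells, so most cells contain no rows at all, even though the dataset still has a million examples.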
What is the most important reason that feature selection matters for AI?
Which of the following best describes a dataset?
Design a Dataset
- Choose something you could study: sleep and mood, weather and attendance, practice time and skill improvement — or any measurable topic that interests you.
- Decide on your purpose: what question are you trying to answer?
- Design a table with at least 4 columns (features) and at least 5 rows (examples you could realistically collect).
- For each column, explain WHY you chose to include it — what information does it give the AI?
- Identify at least one feature you chose NOT to include, and explain what you might be missing by leaving it out.
- Compare your dataset design with a classmate's. Did they choose different features? How might those choices lead to different AI behavior?