AI Foundations

⏱ About 15 min · 15 XP

Data Is Everywhere

In the last lesson you learned what data is: recorded information about the world. Now ask a different question: how much of it is there? The answer is so large that it is almost impossible to hold in your head. Every moment of every day, billions of people and billions of connected devices and sensors generate and store data at a rate that has no historical parallel. Understanding that scale, and where all this data comes from, is essential for thinking seriously about AI.

The Scale of Modern Data

Here are some widely cited estimates from the mid-2020s. Exact figures vary by source and year, so treat them as orders of magnitude, and spend a moment with each one.

Every minute, people send roughly 300 million emails, watch 1 million minutes of video on streaming platforms, conduct 5.9 million Google searches, and post more than 500,000 images to social media.

A long-quoted estimate is that humanity generates approximately 2.5 quintillion bytes of data every day, and more recent estimates are far higher. That is 2,500,000,000,000,000,000 bytes; written out in full, the number has 19 digits. To put it in perspective: if you printed a single day of that data as text on standard paper, at about 3,000 characters per page and 0.1 mm per sheet, the stack of pages would reach roughly halfway from Earth to the Sun, a distance of about 93 million miles.

None of this data simply evaporates. It is stored, transmitted, analyzed, and increasingly fed into AI systems.
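Paper-stack comparisons depend heavily on what you assume, so it helps to work the arithmetic through. The sketch below is a rough back-of-the-envelope calculation, not a sourced figure: the 3,000 characters per page and the 0.1 mm sheet thickness are assumed values, and `stack_in_sun_distances` is a hypothetical helper name chosen for this example.

```python
# Back-of-the-envelope: how tall is a printed stack of one day's data?
# Assumptions (not from the lesson): one byte = one printed character,
# 3,000 characters per single-sided page, 0.1 mm of thickness per sheet.

CHARS_PER_PAGE = 3_000
SHEET_THICKNESS_M = 0.0001           # 0.1 mm, typical office paper
EARTH_SUN_M = 149_600_000_000        # about 93 million miles, in meters


def stack_in_sun_distances(total_bytes: float) -> float:
    """Return the stack height as a multiple of the Earth-Sun distance."""
    pages = total_bytes / CHARS_PER_PAGE
    height_m = pages * SHEET_THICKNESS_M
    return height_m / EARTH_SUN_M


daily_bytes = 2_500_000_000_000_000_000   # 2.5 quintillion bytes, from the text

print(len(str(daily_bytes)))                          # 19 digits
print(round(stack_in_sun_distances(daily_bytes), 2))  # 0.56
```

Under these assumptions, a day of data at 2.5 quintillion bytes stacks up to a little over half the Earth-Sun distance. The answer scales in direct proportion to the daily total and to the page assumptions, so a stack many times that distance would need a much larger daily total or a thinner page of text. Comparisons like this are best read as rough illustrations.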

Key Term: Exabyte

A byte is the basic unit of digital storage, enough for roughly one character of text. One exabyte is 1,000,000,000,000,000,000 bytes (10^18), so 2.5 quintillion bytes is 2.5 exabytes. Some more recent estimates put daily data generation far higher, at roughly 120 exabytes or more. For context, all the words ever spoken by human beings throughout history have been estimated at about 5 exabytes.
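To get a feel for the unit, it can help to convert between bytes and larger units yourself. This is a minimal sketch using decimal (SI) units, where each step is 1,000 times the previous one. The function name `human_readable` and the 4 MB photo size are assumptions made for illustration.

```python
# Decimal (SI) storage units, each 1,000 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value under 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at EB if the value is huge.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


ONE_EXABYTE = 10**18
PHOTO_BYTES = 4_000_000              # assumed size of one smartphone photo

print(human_readable(PHOTO_BYTES))   # 4 MB
print(human_readable(2.5e18))        # 2.5 EB
print(human_readable(ONE_EXABYTE))   # 1 EB

photos_per_exabyte = ONE_EXABYTE // PHOTO_BYTES
print(f"{photos_per_exabyte:,}")     # 250,000,000,000
```

In other words, one exabyte could hold about 250 billion photos of that assumed size.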

Where does it all come from? Think about a single morning of your life and trace the data trail.

You wake up and your phone syncs: your sleep tracker has recorded your heart rate, movement, and estimated sleep stages all night. That is hundreds of data points before you have even opened your eyes.

You check your messages: each message is stored with a timestamp, a sender ID, a recipient ID, the text, the device you read it on, your location when you read it, how long you spent reading it, and whether you replied.

You search for something online: the search engine records your query, the time, your approximate location, your device type, which results you clicked, how long you spent on each page, and whether you came back to the results.

You take a photo: one smartphone photo is about 3–5 megabytes. It contains not just the image but also metadata: when it was taken, the GPS coordinates, the camera settings, and sometimes the direction you were facing.

You walk to school: if you have a fitness tracker or a phone in your pocket, your steps, pace, route, and travel time are all logged.

All of this happens before noon on a typical day.
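The message example above can be pictured as a single stored record. The sketch below is purely illustrative: `MessageRecord` and its field names are hypothetical, the values are invented, and real messaging services store data in their own formats.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class MessageRecord:
    """Hypothetical record for one message; field names are illustrative."""
    timestamp: datetime
    sender_id: str
    recipient_id: str
    text: str
    device: str
    read_location: tuple[float, float]   # (latitude, longitude)
    seconds_spent_reading: float
    replied: bool


# Invented example values for a single short message.
record = MessageRecord(
    timestamp=datetime(2025, 1, 15, 7, 42, tzinfo=timezone.utc),
    sender_id="user_1001",
    recipient_id="user_2002",
    text="See you at school!",
    device="phone",
    read_location=(40.71, -74.01),
    seconds_spent_reading=3.5,
    replied=True,
)

fields = asdict(record)
print(len(fields))         # 8 stored values for one short message
print(len(record.text))    # the visible text is only 18 characters
```

Only one of the eight fields is something the sender typed. The rest is context recorded around the message.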

Active Data vs. Passive Data

Not all data is created the same way. It is useful to distinguish between data you generate on purpose and data that is generated about you automatically.

Active data is what you deliberately create: writing a message, posting a photo, filling out a form, giving a product rating. You are consciously putting information into a system.

Passive data is recorded without any deliberate action on your part: your location being pinged by your phone's GPS, the temperature sensor in your home recording the current reading, your browser logging which parts of a webpage your mouse hovers over. You did not decide to create this data; it was collected as a byproduct of your activities.

Most of the data generated today is passive. Sensors, logs, and automated systems record far more than people intentionally share. This is one reason the total volume of data is so much larger than most people expect.

Think About Your Data Trail

Before you finish reading this lesson, you have already generated data today — probably hundreds of data points. Getting in the habit of asking 'what data does this action generate, and who has it?' is one of the most useful skills in the modern world. You will return to this question in Lesson 8 when you study privacy.


Why Volume Matters for AI

The main reason AI has improved so dramatically in the past decade is not that algorithms got smarter. It is that data got bigger. Many modern AI techniques, especially deep learning, require enormous quantities of examples to learn from. A speech recognition system trained on 100 hours of audio makes frequent mistakes; trained on 100,000 hours, it can approach human-level accuracy on many tasks.

This creates a kind of data flywheel. Large technology companies build products that attract users. Users generate data. That data trains better AI. Better AI makes the products more compelling, which attracts more users, who generate more data. Understanding this cycle explains a great deal about why a few very large companies dominate AI research.

But it also raises questions that will run through this entire module: if AI needs massive amounts of data, who has it? Who collected it? From whom? And does that data fairly represent everyone, or just the people who happened to use those products?

More Data Is Not Always Better Data

Raw volume is not the only thing that matters. One million data points that are biased, mislabeled, or collected from only one type of person can train a worse AI than 100,000 high-quality, representative data points. Quantity and quality are both essential — and they are not the same thing.
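A small simulation can make this concrete. The sketch below is an analogy, not a trained AI model: it estimates a simple average from two samples of made-up data. The group averages, the spread, and the sample sizes (100,000 biased points versus 1,000 representative ones, smaller than the figures above so that it runs quickly) are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

# Made-up population: two equally sized groups with different averages.
GROUP_MEANS = {"A": 170.0, "B": 160.0}
SPREAD = 7.0
TRUE_MEAN = sum(GROUP_MEANS.values()) / len(GROUP_MEANS)   # 165.0


def draw(group: str) -> float:
    """Draw one measurement from the named group."""
    return rng.gauss(GROUP_MEANS[group], SPREAD)


# Large but biased: 100,000 points, all collected from group A.
biased = [draw("A") for _ in range(100_000)]

# Small but representative: 1,000 points, each from a randomly chosen group.
representative = [draw(rng.choice("AB")) for _ in range(1_000)]

biased_error = abs(statistics.fmean(biased) - TRUE_MEAN)
representative_error = abs(statistics.fmean(representative) - TRUE_MEAN)

print(f"biased sample error:         {biased_error:.2f}")
print(f"representative sample error: {representative_error:.2f}")
```

The biased sample is a hundred times larger, yet its estimate stays far from the true average because group B never appears in it. Adding more points from group A would not help.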

What is the difference between active and passive data?

Why has AI improved so much in the past decade, according to this lesson?

Map Your Data Trail

  1. Think about everything you did from the time you woke up until right now.
  2. List at least 8 actions you took — getting up, using a device, traveling somewhere, eating, communicating.
  3. For each action, identify at least one piece of data it likely generated and who probably collected it.
  4. Separate your list into active data (you created it on purpose) and passive data (recorded automatically).
  5. Count the totals in each column. What surprised you?
  6. Share your list with a partner and compare. Did they find data-generating actions you missed?
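If you want to tally your list on a computer, steps 3 to 5 can be sketched as below. Every entry here is a hypothetical example rather than real data, and the collector names are guesses of the kind the activity asks you to make.

```python
from collections import Counter

# Hypothetical entries: (action, data generated, likely collector, kind).
# "kind" is "active" if created on purpose, "passive" if recorded automatically.
trail = [
    ("woke up", "heart rate and sleep stages", "sleep-tracker app", "passive"),
    ("read a message", "time spent reading", "messaging service", "passive"),
    ("sent a reply", "message text", "messaging service", "active"),
    ("searched online", "search query", "search engine", "active"),
    ("clicked a result", "time spent on the page", "search engine", "passive"),
    ("took a photo", "image file", "photo app", "active"),
    ("walked to school", "steps and route", "fitness app", "passive"),
    ("bought a snack by card", "purchase record", "card issuer", "passive"),
]

# Step 5: count the totals in each column.
totals = Counter(kind for *_, kind in trail)
print(totals["active"], totals["passive"])   # 3 5

# Extra view: how many items each collector holds.
by_collector = Counter(collector for _, _, collector, _ in trail)
for collector, count in by_collector.most_common():
    print(f"{collector}: {count}")
```

Grouping by collector is one way to start answering the "who has it?" half of the question.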