Why More Data Helps (and When It Doesn't)
'Get more data' is the most commonly given advice in machine learning, and it is often correct. But it is not universally correct, and understanding precisely when more data helps — and when it does not — separates careful practitioners from those who collect data blindly. This lesson develops the reasoning from first principles.
How Data Reduces Variance
Recall from Lesson 6 that variance measures how much the learned model changes when trained on different samples. Each training example is a constraint on the parameter space. More constraints pin the solution down more tightly: the region of the hypothesis space compatible with all the data shrinks.
Think of it statistically: with n independent samples, the standard error of an estimate scales as 1/sqrt(n). Double the data and the estimation uncertainty falls by a factor of sqrt(2) ≈ 1.41. Quadruple the data and the uncertainty halves.
The implication: going from 100 to 1,000 examples (a 10× increase) reduces standard error by a factor of sqrt(10) ≈ 3.16, a large gain. Going from 10,000 to 100,000 examples (also 10×) gives the same relative reduction, but achieving that 10× now requires collecting 90,000 more examples instead of 900. The absolute improvement per additional example decreases. This is the law of diminishing returns on data.
Concrete numbers: a study on image classifiers found that, beyond 100,000 examples, each doubling of the training set improved test accuracy by about 0.5 to 1 percentage point, modest but real. The gains were larger at smaller sample sizes (e.g., 2 to 5 points for the doubling from 1,000 to 2,000). The curves flatten as data grows.
Additional data reduces estimation error in proportion to 1 over the square root of the sample size (equivalently, variance falls as 1 over the sample size). The benefit is real and reliable, but each additional data point contributes less than the last. This is why the first thousand examples often matter more than the next million.
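The square-root scaling can be checked with a few lines of arithmetic. The sketch below assumes the textbook case of a sample mean over n independent examples; the value of `sigma` is a hypothetical per-example standard deviation chosen only for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean over n independent samples."""
    return sigma / math.sqrt(n)

sigma = 1.0  # hypothetical per-example standard deviation

# The same 10x multiplier gives the same relative reduction at any scale...
ratio_small = standard_error(sigma, 100) / standard_error(sigma, 1_000)
ratio_large = standard_error(sigma, 10_000) / standard_error(sigma, 100_000)

# ...but the absolute gain per additional example is far smaller at large n.
gain_small = (standard_error(sigma, 100) - standard_error(sigma, 1_000)) / 900
gain_large = (standard_error(sigma, 10_000) - standard_error(sigma, 100_000)) / 90_000

print(f"relative reduction: {ratio_small:.2f}x vs {ratio_large:.2f}x")
print(f"gain per extra example: {gain_small:.2e} vs {gain_large:.2e}")
```

Both relative reductions come out at about 3.16, while the per-example gain in the second case is roughly a thousand times smaller than in the first.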
When Does More Data Not Help?
Case 1: High bias. If the model's hypothesis space does not contain a good approximation to the true function, no amount of additional data from the same distribution will find that approximation. Adding data reduces variance, but if bias is the dominant source of error, reducing variance has negligible effect on test error. The right remedy is a more expressive model, not more data.
Case 2: Distribution shift. If the additional data comes from a different distribution than the test inputs, it may actively harm generalization. Suppose a medical image model is trained on images from one hospital's scanner: adding more images from that same scanner does not help generalization to a different scanner with different calibration. Worse, if the new data reflects a different distribution and you weight it equally with in-distribution data, the optimizer may be pulled toward a solution that does not match deployment conditions.
Case 3: Label noise at scale. If the data collection process produces a fixed rate of mislabeled examples, adding more data adds more noise in the same proportion. In that situation, cleaning labels is usually more valuable than adding noisy ones. A dataset of 10,000 clean examples can outperform 100,000 examples with 15% label noise.
Case 4: Saturation. For some problems and model classes, there exists a quantity of data beyond which the irreducible noise (the floor of the bias-variance decomposition) dominates. Adding data cannot reduce irreducible noise. Only improving the measurement process or the feature representation can.
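Cases 1 and 4 can be made concrete with a toy error model. The sketch below assumes expected error splits into a squared-bias term, a variance term that shrinks as 1/n, and an irreducible noise term; the function name and all three numbers are hypothetical and chosen only to show the shape of the curve.

```python
def expected_error(n: int, bias_sq: float, var_coef: float, noise: float) -> float:
    """Toy decomposition: squared bias + var_coef / n + irreducible noise."""
    return bias_sq + var_coef / n + noise

# Hypothetical values, for illustration only.
bias_sq, var_coef, noise = 0.15, 20.0, 0.03
floor = bias_sq + noise  # the error that no amount of data removes

errors = []
for n in (100, 1_000, 10_000, 100_000):
    err = expected_error(n, bias_sq, var_coef, noise)
    errors.append(err)
    print(f"n={n:>7}: error={err:.4f}  (variance part={var_coef / n:.4f})")
```

Under these assumptions the variance part vanishes quickly, and the error settles just above `floor`. Lowering that floor requires reducing bias (a more expressive model) or noise (better measurements or features), not more examples.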
Reading Learning Curves to Diagnose Data Needs
A learning curve plots training error and test error as functions of training set size. Reading learning curves is a practical diagnostic skill.
Pattern 1: High variance (need more data). Training error is low and test error is high, with a large gap between them. As data increases, training error slowly rises (it is harder to fit more examples perfectly) and test error slowly falls. If the curves are converging and test error is still falling, more data will likely help.
Pattern 2: High bias (more data won't help). Both training error and test error are high, with little gap between them. As data increases, the two curves converge quickly and plateau at a high value. More data will not lower either error, because the model cannot represent the true function regardless of how many examples it is shown.
Pattern 3: Good fit. Training and test error are both low and have converged. Adding data will not substantially improve performance: you are at the model's capability limit for this hypothesis space.
This diagnostic framework lets you make a principled decision before spending months and money collecting more data. If your learning curve shows Pattern 2, invest in a more expressive model architecture, not a bigger dataset.
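The three patterns can be turned into a rough rule of thumb. The sketch below is one possible encoding, not a standard algorithm: the function name `diagnose` and the thresholds `target_err` and `gap_tol` are hypothetical and would need to be set per problem.

```python
def diagnose(train_err, test_err, target_err=0.10, gap_tol=0.03):
    """Classify a learning curve from errors measured at increasing training sizes.

    train_err and test_err are sequences of error rates (0.0 to 1.0), ordered by
    training set size. target_err and gap_tol are hypothetical thresholds.
    """
    gap = test_err[-1] - train_err[-1]
    if gap > gap_tol:
        return "high variance"  # large train/test gap: more data may help
    if test_err[-1] > target_err:
        return "high bias"      # curves converged, but at a high error
    return "good fit"           # curves converged at an acceptable error

# Flat 18% train and test error, as in a converged but underpowered model.
print(diagnose([0.18, 0.18, 0.18], [0.18, 0.18, 0.18]))
# Low training error with a test error that is still well above it.
print(diagnose([0.02, 0.04, 0.06], [0.30, 0.22, 0.16]))
# Both errors low and close together.
print(diagnose([0.04, 0.05, 0.05], [0.09, 0.07, 0.06]))
```

A real diagnosis should also look at the slope of the test curve: a large gap with a test error that has stopped falling suggests more data alone may not close it.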
Collecting more data has costs — financial, time, privacy, and ethical. Medical data requires patient consent; facial recognition data requires photos of real people. Before collecting, diagnose whether data is your actual bottleneck. A high-bias model wastes all the data you collect beyond the minimum needed to constrain its simple hypothesis space.
A team trains a linear regression model on housing prices. Training error and test error are both 18% and flat across training set sizes from 500 to 50,000 examples. The team proposes collecting another 100,000 examples. What should they do instead and why?
A sentiment analysis model trained on English movie reviews is deployed to analyze product reviews on a global e-commerce platform in 12 languages. Collecting 10 million more English movie reviews would likely:
Sketch and Interpret Learning Curves
- You will draw and reason about learning curves without a computer.
- Setup: Draw two axes — x-axis is 'Training set size' from 100 to 10,000; y-axis is 'Error' from 0% to 50%.
- Step 1: Sketch a learning curve for a high-variance model. Draw training error starting low and rising slightly; draw test error starting high and falling toward training error. Label at what point they converge.
- Step 2: Sketch a learning curve for a high-bias model. Draw both curves starting high and converging quickly to a plateau well above 0%.
- Step 3: For each curve, answer: (a) Should this team collect more data? (b) Should they change the model? (c) At what point (if any) is collecting more data wasteful?
- Step 4: A team reports their model's test error dropped from 32% to 29% when going from 1,000 to 10,000 examples (10× data). They now want to know if going to 100,000 examples is worth the cost. Assuming the reducible part of the error follows the 1/sqrt(n) approximation, estimate the expected test error at 100,000 examples. Is it worth it?
- Share your curves and reasoning with a partner. Do you agree on the recommendations?
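One way to check a hand calculation for Step 4 is sketched below. It assumes the test error has the form error(n) = floor + c / sqrt(n), fits `floor` and `c` through the two reported points, and evaluates the curve at the new size. The functional form and the helper name `extrapolate_error` are assumptions for illustration; real learning curves may follow a different power law.

```python
import math

def extrapolate_error(n1, e1, n2, e2, n_new):
    """Fit error(n) = floor + c / sqrt(n) through two points; evaluate at n_new."""
    c = (e1 - e2) / (1 / math.sqrt(n1) - 1 / math.sqrt(n2))
    floor = e2 - c / math.sqrt(n2)
    return floor + c / math.sqrt(n_new), floor

# Reported points: 32% error at 1,000 examples, 29% at 10,000 examples.
predicted, floor = extrapolate_error(1_000, 32.0, 10_000, 29.0, 100_000)

print(f"estimated test error at 100,000 examples: {predicted:.1f}%")  # 28.1%
print(f"implied floor as n grows: {floor:.1f}%")                      # 27.6%
```

Under this assumption the first 10× increase bought 3 points and the next buys about 3 / sqrt(10) ≈ 0.95 points, at the price of 90,000 additional examples. Whether that is worthwhile depends on the cost per example and on how much one point of error matters in deployment.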