Frontier & Future AI


Data and Compute Limits

For most of the past decade, the dominant strategy in AI research was scaling: train larger models on more data using more compute. This strategy produced genuine breakthroughs — GPT-4, Gemini Ultra, Claude, and similar systems represent qualitative improvements over their predecessors. But scaling is not unlimited. There are hard ceilings on what data and computation can provide, and the field is actively encountering them. This lesson examines those ceilings with precision.

Scaling Laws and Their Limits

In 2020, researchers at OpenAI published a landmark result: the loss of a language model (a measure of how poorly it predicts held-out text) follows a smooth power law as a function of model size, dataset size, and compute budget. Performance improvements were therefore predictable: double the compute, and the loss falls by a predictable amount. This was the empirical foundation of the scaling hypothesis, the idea that simply making models bigger with more data and compute would continue to yield predictable capability gains.

The Chinchilla scaling laws (2022, DeepMind) refined this: for a given compute budget, there is an optimal ratio of model parameters to training tokens. Many large models had been under-trained, meaning they were too large for the amount of data used. The right strategy was to use more data per parameter, not just larger models.

But scaling laws describe the training loss, not task performance on specific capabilities. The relationship between training loss and downstream capability is not a smooth power law; it is punctuated. Some capabilities appear abruptly as scale crosses thresholds (emergence), and others plateau even as loss continues to improve. A model with half the loss of a predecessor might be 10x better on some tasks and nearly identical on others. Scaling loss is not the same as scaling capability, and that distinction matters enormously for planning AI development.
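As a rough illustration of the compute-optimal trade-off described above, the sketch below splits a compute budget between parameters and training tokens. It relies on two widely quoted approximations that this lesson does not state, so treat both as assumptions: training compute of about 6 FLOPs per parameter per token, and a Chinchilla-style rule of thumb of about 20 training tokens per parameter. The function names are hypothetical.

```python
import math

# Assumption: training costs roughly 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: Chinchilla-style rule of thumb of ~20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the assumptions above.

    Solves C = 6 * N * D with D = 20 * N, which gives N = sqrt(C / 120).
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """Ratio used to judge whether a model is under-trained for its size."""
    return tokens / params


if __name__ == "__main__":
    for budget in (1e22, 1e23, 1e24, 1e25):
        n, d = compute_optimal_allocation(budget)
        print(f"C = {budget:.0e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")

    # Illustrative under-trained configuration: 175e9 parameters on 300e9 tokens.
    ratio = tokens_per_param(175e9, 300e9)
    print(f"tokens per parameter: {ratio:.2f} (rule of thumb: {TOKENS_PER_PARAM:.0f})")
```

Under these assumptions, a model with fewer than two tokens per parameter sits far from the rule-of-thumb ratio, which is the sense in which pre-2022 models were "too large for the amount of data used."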

Scaling Loss vs. Scaling Capability

Training loss follows smooth scaling laws. Task capability does not. A 10% reduction in training loss can produce a 200% improvement on one task and a 2% improvement on another. Treating scaling laws as guarantees of capability improvement is a category error.
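One way to see how the same loss reduction can matter very differently across tasks is a deliberately simplified toy model. It is an assumption made for illustration, not a claim about how real models behave: suppose the per-token probability of the correct token is exp(-loss), and a task counts as solved only if every token of the answer is correct, independently. The function names are hypothetical.

```python
import math


def per_token_accuracy(loss_nats: float) -> float:
    """Toy assumption: probability of the correct token is exp(-loss)."""
    return math.exp(-loss_nats)


def exact_match_rate(loss_nats: float, answer_tokens: int) -> float:
    """Toy assumption: all answer tokens must be right, independently."""
    return per_token_accuracy(loss_nats) ** answer_tokens


def improvement_factor(old_loss: float, new_loss: float, answer_tokens: int) -> float:
    """How many times more often the task is solved after the loss drops."""
    return exact_match_rate(new_loss, answer_tokens) / exact_match_rate(old_loss, answer_tokens)


if __name__ == "__main__":
    old_loss, new_loss = 1.0, 0.9  # a 10% reduction in per-token loss
    for k in (1, 5, 20):
        factor = improvement_factor(old_loss, new_loss, k)
        print(f"answer length {k:>2} tokens: success rate x{factor:.2f}")
```

In this toy setup the identical 10% loss reduction multiplies success on a one-token answer by about 1.1 and success on a twenty-token answer by about 7.4, because the improvement compounds over answer length.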

The Data Wall

The internet contains a finite amount of high-quality text. Epoch AI, a research organization that tracks AI compute and data trends, estimated in 2023 that at current consumption rates, the stock of high-quality English text on the internet would be largely exhausted for training purposes by 2026-2028. Models have already been trained on substantial fractions of the available web. This is called the data wall, and it creates a fundamental constraint.

Synthetic data (text generated by AI models themselves) is one proposed solution. But synthetic data has a documented failure mode: if models train on their own outputs, errors and biases in those outputs get amplified over generations, a phenomenon researchers call model collapse. Successive generations of synthetic-data training can degrade performance on the very capabilities the original training data supported.

The data wall is not uniform. For widely documented domains (English-language web content, widely used programming languages, well-resourced scientific literature) the wall is near. For domains with limited digital representation (low-resource languages, specialized professional knowledge, embodied physical experience) the wall was hit much earlier. This explains persistent performance gaps between high-resource and low-resource domains that scaling cannot easily close.

Multimodal data (images, audio, video) extends the runway: there is far more video data than text data. But video carries less information per token than high-quality text, and labeling video at the scale needed for supervision is expensive. The data wall does not disappear with multimodality; it shifts.
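The model-collapse failure mode described above can be sketched with a very small toy analogue. This is an illustrative assumption, not the experimental setup used in the research literature: the "model" is just a normal distribution fitted to a handful of samples, and each generation is fitted only to samples drawn from the previous generation's fit. The function name and parameter values are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations: int, samples_per_generation: int, seed: int = 0) -> list[float]:
    """Return the fitted standard deviation after each generation.

    Generation 0 stands in for real data: a normal with mean 0 and std 1.
    Every later generation sees only samples drawn from the previous fit.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # "Synthetic data": samples produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # "Retraining": refit the model to its own output.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse(generations=500, samples_per_generation=10)
    for generation in range(0, 501, 100):
        print(f"generation {generation:>3}: fitted std = {history[generation]:.3e}")
```

With small samples, each refit tends to lose a little of the spread, so the fitted distribution narrows over generations and the tails (the rare cases) disappear first. That is a loose analogue of a model losing rare knowledge when it trains on its own outputs.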

The Compute Ceiling

Compute constraints are the second major ceiling. Training a frontier model today requires on the order of 10^24 to 10^25 floating-point operations (FLOPs). The largest known training runs consumed tens of thousands of specialized AI accelerators (GPUs or TPUs) running continuously for months. The cost of a single frontier training run is estimated at between 50 and 200 million dollars.

This compute cost grows with both model size and dataset size. If the data wall limits how much more data can be used, and current model sizes are already at the edge of cost-feasibility, the scaling strategy faces a dual constraint.

The compute ceiling also has a hardware component: chip manufacturing is subject to physical limits (lithography constraints, heat dissipation at high transistor densities) that are slowing the rate of improvement in per-chip compute. AI chip performance has improved faster than general Moore's Law would predict, but this curve is itself expected to slow as transistors approach physical size limits.

Compute efficiency research, which aims to achieve equivalent performance with fewer FLOPs, is an active response to this constraint. Techniques like mixture-of-experts architectures (activating only a subset of parameters per forward pass), quantization (using fewer bits per parameter), and distillation (training smaller models to replicate larger ones) all accept added engineering complexity in exchange for compute efficiency.
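To make these orders of magnitude concrete, here is a back-of-the-envelope sketch. It assumes the common approximation of about 6 training FLOPs per parameter per token, which this lesson does not state. The accelerator count, peak throughput, and utilization in the example are hypothetical placeholders, not measurements of any real system.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Assumption: roughly 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


def training_days(total_flops: float, accelerators: int,
                  peak_flops_per_accelerator: float, utilization: float) -> float:
    """Wall-clock days if the cluster sustains `utilization` of its peak rate."""
    sustained_flops_per_second = accelerators * peak_flops_per_accelerator * utilization
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Memory needed just to store the weights, in gigabytes (1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical cluster: 20,000 accelerators, 3e14 peak FLOP/s each, 40% utilization.
    days = training_days(1e25, accelerators=20_000,
                         peak_flops_per_accelerator=3e14, utilization=0.4)
    print(f"1e25 FLOPs on the hypothetical cluster: about {days:.0f} days")

    # Quantization: the same 70e9-parameter model stored at different precisions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(70e9, bits):.0f} GB")
```

The quantization loop shows why fewer bits per parameter matter: weight storage shrinks in direct proportion to precision, although the sketch says nothing about any accuracy cost.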

Match each concept to what it means in the context of scaling limits.

Terms

Scaling laws
Data wall
Model collapse
Mixture of experts
Emergent capability

Definitions

A model behavior that appears abruptly at a scale threshold rather than improving smoothly
Degradation of model quality after successive rounds of training on AI-generated synthetic data
An architecture that activates only a subset of its parameters per inference step, reducing compute per forward pass
Power-law relationships predicting how training loss decreases with more compute, data, or parameters
The point at which high-quality training data is largely exhausted for a given domain


Complete these statements about scaling limits using the correct technical terms.

Scaling laws describe smooth improvements in training ____, but task ____ can change discontinuously. The ____ wall refers to the exhaustion of high-quality training data for a domain, while ____ collapse is the degradation caused by training on AI-generated synthetic data.

A lab trains successive generations of its model on text generated by the previous generation, to avoid using copyrighted internet data. Researchers find that the fifth-generation model performs worse than the first on rare knowledge domains. The most likely explanation is:

According to Chinchilla scaling laws, many large language models trained before 2022 were:

Model the Data Wall for a Domain

You will estimate when the data wall hits for a specific domain and think through the implications.

  1. Choose a specific domain (options: legal case law in a small country, patient medical notes in a regional hospital network, social media posts in a low-resource language, scientific papers in a specialized subfield).
  2. Estimate the total number of tokens of high-quality text available in your domain. (For scale: a typical novel is about 100,000 words or roughly 130,000 tokens. A large web crawl has trillions of tokens.)
  3. If a frontier model trains on approximately 10 trillion tokens total, what fraction of that budget could your domain realistically contribute?
  4. What are the implications for a model's performance on your domain versus, say, English Wikipedia? Would you expect better, similar, or worse performance?
  5. Propose one strategy for extending the data supply in your domain beyond the initial wall. What are the risks of that strategy?
  6. Present your analysis in a two-minute summary to the class.
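If it helps with the token estimate and the budget-fraction question, the arithmetic can be sketched as follows. The tokens-per-word ratio comes from the scale hint in the activity (100,000 words is roughly 130,000 tokens) and the 10 trillion token budget comes from the activity itself. The document count and average length in the example are hypothetical placeholders that you should replace with your own estimates.

```python
# From the activity's scale hint: 100,000 words ~ 130,000 tokens.
TOKENS_PER_WORD = 1.3
# From the activity: a frontier model trains on roughly 10 trillion tokens.
FRONTIER_BUDGET_TOKENS = 10e12


def domain_tokens(documents: int, avg_words_per_document: float) -> float:
    """Rough token count for a domain, given document count and average length."""
    return documents * avg_words_per_document * TOKENS_PER_WORD


def budget_fraction(tokens: float, budget: float = FRONTIER_BUDGET_TOKENS) -> float:
    """Share of the total training budget that the domain could supply."""
    return tokens / budget


if __name__ == "__main__":
    # Hypothetical domain: 200,000 documents averaging 4,000 words each.
    tokens = domain_tokens(documents=200_000, avg_words_per_document=4_000)
    share = budget_fraction(tokens)
    print(f"estimated tokens: {tokens:.2e}")
    print(f"share of a 10-trillion-token budget: {share:.4%}")
```

Even a generous estimate for a specialized domain usually lands at a tiny fraction of the total budget, which is a useful starting point for the performance comparison with English Wikipedia.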