Chapter 36: Scaling Laws - Why Bigger Keeps Winning

In 2020, researchers at OpenAI published a paper that changed how the AI industry thinks about progress. They trained hundreds of language models, varying the number of parameters from 768 to 1.5 billion, the dataset size from 22 million to 23 billion tokens, and the compute budget across six orders of magnitude. They measured the loss—how well each model predicted text—and discovered something remarkable: performance followed smooth, predictable power laws. Bigger models, trained on more data with more compute, consistently did better. And the relationship wasn’t just consistent—it was mathematically precise across massive scale ranges.

This discovery transformed AI development from trial-and-error to engineering. Before scaling laws, researchers tried different architectures, hoping for breakthroughs. After scaling laws, the industry realized: scale itself is the breakthrough. GPT-3’s 175 billion parameters, trained on 300 billion tokens, achieved capabilities that smaller models could not. GPT-4, rumored to be trained with 10-100x more compute, pushed further. The frontier of AI is not algorithmic cleverness—it is infrastructure: bigger clusters, more GPUs, larger datasets, months of training.

This chapter explains why scale drives progress, what that means for AI development, and why infrastructure has become destiny. Scaling laws make the future predictable—within the explored range. But they also reveal limits: each generation requires 10x more resources for incremental gains. Understanding scaling laws is understanding why AI is accelerating and where the limits lie.


Power Laws: How Performance Grows with Compute

In most engineering systems, performance plateaus. You add more resources, and gains diminish to nothing. Double the effort, get 10% improvement. Eventually, returns vanish. But language models are different. Performance improves smoothly as models get bigger, and the relationship follows a power law: a straight line on a log-log plot.

The scaling law for model size:

L(N) \propto N^{-\alpha}

Where L is loss (cross-entropy), N is the number of parameters, and \alpha is the scaling exponent (empirically around 0.076 for language models). This says: if you increase model size by 10x, loss improves by a factor of 10^{0.076} \approx 1.19. On a log-log plot, this is a straight line with slope -\alpha.
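To make the arithmetic concrete, here is a minimal sketch of the power law in Python. Only the exponent 0.076 comes from the text above; the constant of proportionality is an arbitrary illustrative value, not a fitted one.

```python
# Loss under the power law L(N) = c * N^(-alpha), with alpha ~ 0.076
# (the exponent quoted for language models; c is an arbitrary
# illustrative constant, not a value fitted to real models).
ALPHA = 0.076

def loss(n_params: float, c: float = 10.0) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return c * n_params ** -ALPHA

# Scaling the model 10x divides loss by 10^alpha ~ 1.19, regardless of c.
ratio = loss(1e9) / loss(1e10)
print(f"10x scale-up improves loss by a factor of {ratio:.3f}")  # ~1.191
```

The improvement factor is independent of the constant `c`, which is why the slope on a log-log plot is the only number that matters here.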

What this means in practice:

  • GPT-1 (117M parameters): Loss ~3.3
  • GPT-2 (1.5B parameters): Loss ~2.5 (13x larger, loss reduced ~25%)
  • GPT-3 (175B parameters): Loss ~2.0 (117x larger than GPT-2, loss reduced another ~20%)
  • GPT-4 (estimated 1.7T parameters): Loss unknown but likely ~1.7-1.8 (10x larger, loss reduced ~15-20%)

The power law holds across six orders of magnitude in model size. This is extraordinary: most systems break down at scale, but language models get predictably better.

Why power laws matter:

Power laws enable forecasting. Before training GPT-4, OpenAI could estimate its loss based on compute budget and model size. This turns AI development into an optimization problem: how much compute should we invest, and how should we allocate it between model size and training data? Scaling laws answer this quantitatively.

But power laws are not limitless.

The power law L \propto N^{-\alpha} implies diminishing returns. With \alpha \approx 0.076, reducing loss from 3.0 to 2.0 requires roughly 200x more parameters; reducing it from 2.0 to 1.0 requires roughly 9,000x more. Each improvement is harder than the last. Eventually, physical limits—energy, cost, available data—constrain further scaling.

The Chinchilla surprise:

In 2022, DeepMind published a follow-up study that revised the scaling laws. They found that most models were undertrained: too many parameters, not enough training data. OpenAI’s original scaling laws optimized for a fixed compute budget, but they underweighted data. DeepMind trained Chinchilla, a 70 billion parameter model, on 1.4 trillion tokens—4x more data than Gopher, a 280 billion parameter model trained on 300 billion tokens. Result: Chinchilla matched or exceeded Gopher’s performance despite being 4x smaller.

The revised scaling law:

Optimal compute allocation should balance model size and data roughly equally. If you double compute, increase both model size and training data by \sqrt{2} \approx 1.4x. This means previous models (GPT-3, Gopher) were too large for their training data. A smaller model, trained longer, performs better.
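The allocation rule can be sketched in a few lines. Two assumptions here: the cost model C ≈ 6ND (used later in this chapter) and the commonly cited ~20 tokens-per-parameter ratio from the Chinchilla paper; both are rules of thumb, not exact prescriptions.

```python
import math

# Chinchilla-style compute allocation: for a budget C ~ 6*N*D with the
# data-to-parameter ratio held fixed, optimal N and D each grow as
# sqrt(C). The ~20 tokens-per-parameter ratio is the commonly cited
# Chinchilla rule of thumb.
TOKENS_PER_PARAM = 20

def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) with D = 20 * N."""
    # C = 6 * N * D = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

n1, d1 = optimal_allocation(1e23)
n2, d2 = optimal_allocation(2e23)   # double the budget
print(f"{n2 / n1:.2f}x params, {d2 / d1:.2f}x tokens")  # both ~1.41x
```

Doubling the budget grows both levers by √2, exactly the balanced split described above.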

Why this matters:

Chinchilla showed that algorithmic improvements still matter. Scale is powerful, but how you allocate compute is equally important. GPT-3 trained on 300B tokens; GPT-4 likely trained on 10-100T tokens. This reallocation—more data, proportionally scaled model size—explains some of GPT-4’s improvements without requiring pure parameter scaling.


Data, Model, Compute: The Three Levers

Scaling laws reveal three levers for improving model performance:

1. Model Parameters (N)

The number of trainable weights in the model. More parameters = more capacity to memorize patterns and represent complex functions.

  • GPT-2: 1.5 billion parameters
  • GPT-3: 175 billion parameters (117x larger)
  • GPT-4: Estimated 1-1.7 trillion parameters (6-10x larger than GPT-3)
  • Llama 3.1: 405 billion parameters (largest open model as of 2024)

Increasing parameters requires more memory (GPU RAM or distributed across many GPUs), more compute per forward pass (matrix multiplications scale with parameters), and more time to train.

Example: GPT-3 (175B parameters) requires ~350GB of memory for its weights in FP16 precision (2 bytes per parameter); training needs several times that for gradients and optimizer states. Training requires thousands of GPUs for months.

2. Training Data (D)

The number of tokens (words, subwords) the model sees during training. More data = more examples to learn from, better generalization.

  • GPT-2: ~40GB of text (web scrape)
  • GPT-3: 300 billion tokens (~600GB of text, filtered from Common Crawl)
  • GPT-4: Unknown, but likely 10-100 trillion tokens
  • LLaMA 2: 2 trillion tokens (curated mix of web data, books, code)

Collecting and cleaning data is non-trivial. Common Crawl contains billions of web pages, but most are low-quality: spam, duplicates, non-English, generated text. Filtering and deduplicating data is an engineering challenge. The quality of training data determines model quality—garbage in, garbage out.

Chinchilla insight: Most models trained before 2022 were data-starved. Under the revised laws, compute spent on more data helps about as much as compute spent on more parameters. Optimal allocation: if the compute budget increases 10x, increase model size ~3x and data ~3x (each scales as the square root of compute).

3. Compute Budget (C)

Total floating-point operations (FLOPs) used during training. Compute is the product of model size, data size, and training time.

C \approx 6 \times N \times D

Where N is parameters, D is tokens, and the factor of 6 counts roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass.

GPT-3 training compute:

  • 175B parameters
  • 300B tokens
  • C \approx 6 \times (175 \times 10^9) \times (300 \times 10^9) \approx 3.15 \times 10^{23} FLOPs

At ~3 × 10^{14} FLOPs/second per GPU (roughly an NVIDIA A100's peak FP16 throughput), this is ~10^9 GPU-seconds, or roughly 12,000 GPU-days. With 10,000 GPUs that is just over a day of compute in theory, but real utilization is far below peak and training is not perfectly parallelized, so actual training time was likely weeks to months.
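The back-of-envelope numbers can be reproduced directly, using C ≈ 6ND and an assumed A100-class peak of ~3 × 10^14 FLOPs/s; since real utilization is well below peak, the GPU-day count is a lower bound.

```python
# Back-of-envelope training compute for GPT-3, using C ~ 6 * N * D.
params = 175e9          # GPT-3 parameters
tokens = 300e9          # GPT-3 training tokens
flops = 6 * params * tokens
print(f"Training compute: {flops:.2e} FLOPs")   # ~3.15e23

# Assumed peak throughput of ~3e14 FLOPs/s per GPU (A100-class, FP16).
# Real utilization is well below peak, so this is a lower bound.
gpu_flops_per_sec = 3e14
gpu_seconds = flops / gpu_flops_per_sec
gpu_days = gpu_seconds / 86_400
print(f"~{gpu_days:,.0f} GPU-days at peak throughput")  # ~12,000
```

At typical utilization (30-50% of peak), the real figure is several times larger, which is consistent with multi-week training runs on thousands of GPUs.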

Estimated cost: At $2-3 per GPU-hour for cloud compute, GPT-3 training cost approximately $4-5 million in compute alone (not counting engineering time, infrastructure, and failed runs).

GPT-4 training compute: Estimated 10-100x more than GPT-3, implying training costs of $40-500 million. Only organizations with deep pockets—OpenAI (funded by Microsoft), Google, Meta, Anthropic (funded by Google and others)—can afford frontier model training.

Energy costs:

Training large models consumes massive energy. GPT-3 training is estimated to have consumed 1,287 MWh (megawatt-hours) of electricity. For context, the average U.S. household uses ~10 MWh per year. GPT-3 training = 130 households for a year. GPT-4 training likely consumed 10,000+ MWh.

Energy costs and carbon emissions are becoming engineering constraints. Datacenters are limited by power availability. Sustainable AI requires efficient architectures, better hardware (specialized AI chips), and renewable energy sources.


Diminishing Returns: Why Progress Is Predictable

Power laws guarantee that scaling improves performance, but they also guarantee diminishing returns. The exponent \alpha \approx 0.076 means each 10x increase in model size divides loss by only 10^{0.076} \approx 1.19, about a 16% reduction. Cutting loss in half requires 2^{1/\alpha} \approx 9,000x more parameters; cutting it in half again requires another factor of ~9,000x.

Concrete example:

Start with a 1B parameter model with loss 2.5. Scaling to 100B parameters (100x) brings loss only to about 2.5 / 1.19^2 \approx 1.8. Halving loss to 1.25 would require a model roughly 9,000x larger, well into the trillions of parameters. Costs compound: compute, memory, energy, data.
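Solving the power law for the required scale factor shows how quickly the costs compound. This is a sketch with alpha ≈ 0.076 as above, not a forecast for any particular model.

```python
# Scale factor needed to reach a target loss under L ~ N^(-alpha):
# L2/L1 = (N2/N1)^(-alpha)  =>  N2/N1 = (L1/L2)^(1/alpha).
ALPHA = 0.076

def scale_factor(loss_old: float, loss_new: float) -> float:
    """How many times larger the model must be to reach loss_new."""
    return (loss_old / loss_new) ** (1 / ALPHA)

print(f"{scale_factor(2.5, 2.0):,.0f}x")   # ~19x for a 20% reduction
print(f"{scale_factor(2.5, 1.25):,.0f}x")  # roughly 9,000x to halve loss
```

The exponent 1/alpha ≈ 13 is what makes halving loss so expensive: a factor-of-2 improvement in loss costs a factor of 2^13 in parameters.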

What does lower loss buy you?

Loss measures how surprised the model is by the next token. Lower loss = better predictions = higher accuracy on downstream tasks. But the relationship between loss and task performance is not linear. Some tasks benefit enormously from small loss reductions; others plateau.

Example: Math word problems (GSM8K benchmark)

  • Small models (1B params, loss 2.5): ~5% accuracy (random guessing)
  • Medium models (13B params, loss 2.2): ~10-20% accuracy
  • Large models (100B+ params, loss 2.0): 40-60% accuracy
  • Frontier models (1T+ params, loss ~1.8): 80-90% accuracy

A 20% reduction in loss (2.5 → 2.0) yields a 10x improvement in accuracy (5% → 50%). But further loss reductions (2.0 → 1.8) yield smaller gains (50% → 80%). Diminishing returns appear twice: once in scaling compute to reduce loss, again in converting loss to task performance.

Economic implications:

Diminishing returns mean each generation of models is more expensive than the last. GPT-2 trained for ~$50K. GPT-3 trained for ~$5M (100x). GPT-4 trained for an estimated $50-100M (10-20x). GPT-5 might cost $500M-1B. At some point, ROI (return on investment) becomes unfavorable. Spending $1B to improve accuracy from 90% to 95% is not worthwhile for most applications.

Practical constraints:

  • Compute availability: Tens of thousands of GPUs are not easy to procure or manage
  • Energy: Training GPT-4 requires a small power plant’s worth of electricity
  • Data: High-quality text is finite; models risk running out of unique, valuable data
  • Time: Training takes months; faster iteration beats marginal performance gains

These constraints mean scaling will slow. The industry will shift focus from pure scale to efficiency: better architectures, better data, better training methods.


Emergent Abilities: Why New Skills Appear Suddenly

Scaling laws predict smooth improvements in loss. But some capabilities do not improve smoothly—they appear suddenly at a certain scale. These are called emergent abilities: skills that small models cannot perform at all, but large models can.

Examples of emergent abilities:

Few-shot in-context learning: GPT-3 (175B) can learn from a few examples in the prompt without fine-tuning. GPT-2 (1.5B) cannot. This capability “emerges” somewhere between 13B and 175B parameters.

Multi-step reasoning: Models below ~10B parameters fail at grade-school math word problems (GSM8K). Models above ~100B parameters achieve 40%+ accuracy. The capability jumps sharply, not gradually.

Instruction following: Small models generate text but do not follow instructions reliably (“Write a poem about trees” → random text). Large models (10B+) follow instructions accurately (“Write a poem about trees” → coherent poem).

Translation between languages not seen during training: Large models (100B+) can translate between low-resource language pairs (Swahili ↔ Turkish) despite minimal training data. Small models cannot.

Why do emergent abilities appear?

Two explanations:

1. Phase transitions in capability:

As loss decreases smoothly, the model crosses a threshold where a task becomes solvable. Below the threshold, the model lacks sufficient capacity to represent the solution. Above the threshold, the model can solve the task. Loss decreases smoothly, but task accuracy jumps sharply.

Analogy: Water temperature decreases smoothly from 5°C to -5°C, but at 0°C, water freezes—a phase transition. Similarly, model loss decreases smoothly, but task performance transitions sharply.

2. Measurement artifacts:

Emergent abilities might be an artifact of how tasks are measured. If a task is scored as binary (correct/incorrect), smooth improvements in loss appear as sudden jumps in accuracy. Using continuous metrics (e.g., partial credit) might reveal smooth improvements.

Recent research suggests emergence is partly measurement-dependent: tasks scored continuously show smoother scaling. But some capabilities (like few-shot learning) do appear genuinely emergent—small models cannot do it at all, large models can.
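A toy calculation illustrates the measurement argument. Suppose a hypothetical task requires 10 steps to all be correct, and per-step accuracy improves smoothly with scale; the numbers here are entirely synthetic, chosen only to show the shape of the two curves.

```python
# Toy illustration of metric-induced "emergence": per-step accuracy
# improves smoothly, but an all-or-nothing task score jumps sharply.
# All numbers are synthetic; no real benchmark data is used.
STEPS = 10  # hypothetical task requiring 10 correct steps in a row

for step_acc in [0.5, 0.7, 0.9, 0.95, 0.99]:
    exact_match = step_acc ** STEPS  # probability all 10 steps succeed
    print(f"per-step {step_acc:.2f} -> exact-match {exact_match:.4f}")
```

Per-step accuracy of 0.5 gives an exact-match score near zero (~0.001), while 0.9 gives ~0.35 and 0.99 gives ~0.90: a smooth underlying metric produces a sharp jump under binary scoring.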

Why emergent abilities matter:

Emergence means capabilities are unpredictable before you reach the necessary scale. GPT-3’s few-shot learning was not predicted by scaling laws—it was a surprise. This raises the question: what other capabilities will emerge at larger scales?

  • Will 10T parameter models develop true reasoning?
  • Will they learn to plan multi-step actions reliably?
  • Will they generalize across domains like humans do?

Scaling laws predict loss will decrease, but they do not predict which capabilities will emerge. This makes frontier AI development both exciting and uncertain.

Engineering implications:

You cannot forecast emergent abilities before training. You must train the model, evaluate it, and discover what it can do. This makes large-scale training a high-stakes bet: invest $100M in training, hope new capabilities emerge that justify the cost.


Engineering Takeaway

Forecasting becomes possible—but only for loss, not capabilities

Scaling laws allow predicting loss before training. Given compute budget, model size, and data size, you can estimate final loss within tight error bars. This enables planning: “If we invest $50M in compute, we’ll reach loss 1.9.” But loss does not directly predict downstream task performance. Emergent abilities can surprise. Forecasting is powerful but incomplete.

Infrastructure is destiny—compute access determines competitiveness

Training frontier models requires thousands of GPUs, months of time, and tens of millions of dollars. Only a few organizations can afford this: OpenAI (Microsoft-backed), Google, Meta, Anthropic (Google-backed), Amazon. Compute access is the bottleneck. Smaller labs cannot compete at the frontier without funding. Infrastructure—datacenter capacity, chip supply, energy availability—determines who wins the race.

Algorithmic innovations still matter—Chinchilla shows better allocation beats pure scale

Chinchilla (70B parameters, 1.4T tokens) matched Gopher (280B parameters, 300B tokens) by training longer on more data. This means smarter training—better data, better compute allocation—can match or beat larger models trained inefficiently. Pure scale is not the only path. Research into data quality, curriculum learning, and efficient architectures remains valuable.

Diminishing returns constrain strategy—each generation costs 10x more for marginal gains

GPT-2: ~$50K. GPT-3: ~$5M. GPT-4: ~$50-100M. GPT-5: ~$500M-1B? Each generation requires 10x more compute for incrementally smaller improvements. At some point, ROI becomes negative. The industry must shift focus from pure scale to efficiency: inference optimization, model compression, application-specific models. Scaling continues but slows.

Emergent abilities are unpredictable—new skills appear, but we don’t know which or when

Few-shot learning emerged at ~13B parameters. Multi-step reasoning emerged at ~100B parameters. What emerges at 1T? 10T? No one knows until the model is trained and evaluated. This makes large-scale training a high-risk, high-reward investment. Capabilities might exceed expectations—or plateau. Emergent abilities keep scaling exciting but uncertain.

Economics drive development—only well-funded organizations can afford frontier training

Training GPT-4 cost an estimated $50-100M. GPT-5 might cost $500M-1B. Only organizations with massive funding can train frontier models. This concentrates power: OpenAI (Microsoft), Google, Meta, Anthropic, Amazon. Smaller labs focus on fine-tuning, distillation, or open models. The economics of AI favor scale and capital.

Energy costs compound—sustainability becomes an engineering constraint

GPT-3 training consumed 1,287 MWh. GPT-4 consumed 10,000+ MWh. Datacenter power consumption is a bottleneck. Regions with limited power infrastructure cannot host large training runs. Energy efficiency—better hardware (H100 vs A100, specialized AI chips), algorithmic optimizations—becomes critical. Sustainable AI requires renewable energy and efficient architectures.




References and Further Reading

Scaling Laws for Neural Language Models - Kaplan et al. (2020), OpenAI

Why it matters: This paper established that language model performance follows predictable power laws across six orders of magnitude in model size, dataset size, and compute. It showed that loss scales as L \propto N^{-\alpha_N} in model parameters, L \propto D^{-\alpha_D} in dataset size, and L \propto C^{-\alpha_C} in compute budget, each with its own measured exponent. This enabled forecasting: given a compute budget, predict final loss before training. The paper changed AI development from trial-and-error experimentation to engineering optimization. It justified massive investments in scale: if power laws hold, bigger models will predictably perform better. This paper is the foundation for GPT-3, GPT-4, and the entire scaling paradigm.

Training Compute-Optimal Large Language Models - Hoffmann et al. (2022), DeepMind (Chinchilla paper)

Why it matters: This paper revised OpenAI’s scaling laws, showing that most models were undertrained—too many parameters, not enough training data. DeepMind trained Chinchilla (70B parameters) on 1.4 trillion tokens (4x more data than typical) and matched Gopher (280B parameters trained on 300B tokens). The key insight: optimal compute allocation should balance model size and data roughly equally. If compute increases 10x, increase model size 3x and data 3x. This finding reshaped the industry: GPT-4 likely followed Chinchilla’s allocation strategy, training on 10-100T tokens instead of scaling parameters alone. The paper showed that smarter training beats pure scale.

Emergent Abilities of Large Language Models - Wei et al. (2022), Google Brain

Why it matters: This paper catalogued capabilities that appear suddenly at scale: multi-step reasoning, instruction following, few-shot learning. Small models cannot perform these tasks at all; large models can. The paper showed that loss decreases smoothly, but task performance transitions sharply—phase-transition-like behavior. This raised a critical question: what other abilities will emerge at larger scales? The paper also cautioned that emergence may be measurement-dependent: tasks scored as binary (correct/incorrect) show sharper transitions than tasks scored continuously. Regardless, emergent abilities make scaling both exciting and unpredictable. They justify high-risk, high-reward investments in frontier models.