Chapter 25: Emergent Abilities and Scaling
Why Bigger Models Do New Things
Scaling Laws: The Physics of Language Models
Language model performance improves predictably with scale. Increase model size, training data, or compute, and loss decreases following a power law. This relationship—scaling laws—makes larger models not just quantitatively better but qualitatively different.
The empirical relationship (Kaplan et al., 2020):

L(N) = (N_c / N)^α_N,  L(D) = (D_c / D)^α_D,  L(C) = (C_c / C)^α_C

Where:
- L is the cross-entropy loss on held-out data
- N is the number of model parameters
- D is the dataset size (number of tokens)
- C is the compute used during training (measured in FLOPs)
- α_N, α_D, α_C are empirically determined exponents (each roughly 0.05–0.1)
These power laws hold across orders of magnitude. A 10× increase in parameters reduces loss by a predictable amount. A 100× increase produces proportionally greater improvement. The relationships are smooth and consistent, enabling forecasting: experiments with small models predict large model performance.
In log-log space, scaling laws appear as straight lines: taking logarithms of L(N) = (N_c / N)^α_N gives log L = α_N log N_c − α_N log N, a line with slope −α_N.
This linearity is remarkable. Most systems exhibit diminishing returns or saturation—performance improves rapidly at first, then plateaus. Language models don’t plateau within observed ranges. Doubling compute continues to reduce loss, suggesting no near-term ceiling on performance gains.
The diagram shows the empirical scaling relationship: loss decreases as a power law with model size. In log-log space, the relationship is linear. Each generation of models (GPT-1 → GPT-2 → GPT-3 → PaLM) follows the same trend, validating the predictability of scaling.
Why scaling laws matter: They transform AI development from alchemy to engineering. Before scaling laws, improving models required algorithmic innovations—new architectures, training tricks, clever regularization. Scaling laws show that size is sufficient: just make the model bigger, train on more data, use more compute. Performance improvements are nearly guaranteed. This shifts strategy from “find the right algorithm” to “invest in scale.”
The practical implication: forecasting. Train a 1B parameter model and measure its loss. Scaling laws predict the loss of a 100B parameter model trained the same way. This enables cost-benefit analysis—is the 100× compute investment worth the predicted performance gain? For applications where marginal improvements matter (search, translation, assistants), the answer is often yes.
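The forecasting workflow can be sketched in a few lines: fit a straight line to small-model runs in log-log space, then extrapolate the fitted power law to a larger model. The run data below is illustrative, not real measurements.

```python
import math

# Hypothetical small-model runs: (parameter count, measured loss).
# These numbers are made up for illustration.
runs = [(1e7, 4.20), (1e8, 3.55), (1e9, 3.00)]

# Fit log L = log_c + slope * log N by least squares in log-log space.
xs = [math.log(n) for n, _ in runs]
ys = [math.log(loss) for _, loss in runs]
k = len(runs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
alpha = -slope               # power-law exponent (positive for decreasing loss)
log_c = my - slope * mx      # intercept

def predict_loss(params: float) -> float:
    """Extrapolate the fitted power law to a larger model size."""
    return math.exp(log_c + slope * math.log(params))

# Forecast a 100B-parameter model from sub-1B experiments.
print(f"alpha ~ {alpha:.3f}")
print(f"predicted loss at 1e11 params ~ {predict_loss(1e11):.2f}")
```

This is the cost-benefit calculation in miniature: the predicted loss at 1e11 parameters, obtained before spending any large-scale compute, is what justifies (or rules out) the 100× investment.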
Phase Transitions: When Abilities Appear Suddenly
Scaling laws describe smooth improvement in loss, but some capabilities don’t scale smoothly—they appear suddenly at specific model sizes. These emergent abilities are tasks the model couldn’t perform at small scale but can perform at large scale, with the transition happening abruptly.
Examples:
- Multi-step reasoning: GPT-2 (1.5B) fails at problems requiring multiple reasoning steps. GPT-3 (175B) succeeds on some, GPT-4 on most. The ability doesn’t scale gradually—it’s nearly absent below a threshold, then present above it.
- Arithmetic: Small models (<1B parameters) can’t reliably add large numbers. Larger models (>10B) can, with accuracy improving sharply around 10B parameters.
- Translation: Small models translate common language pairs poorly. Scaling to 100B+ parameters unlocks reliable translation, even for rare language pairs.
These abilities emerge because the model crosses a capability threshold. Below the threshold, the model has insufficient capacity to compress the patterns required for the task. Above the threshold, capacity suffices, and training data provides enough signal to learn the pattern.
The diagram shows emergent abilities appearing suddenly at specific model sizes. Unlike smooth loss reduction, task accuracy jumps from near-zero to high performance at capability thresholds. Different tasks have different thresholds—arithmetic emerges earlier than multi-step reasoning.
Why emergence happens: Tasks vary in complexity. Simple tasks (completing common phrases) require minimal capacity and data—small models suffice. Complex tasks (multi-step reasoning, rare translations) require large capacity to compress the patterns and substantial data to provide signal. Below the capacity threshold, the model can’t represent the solution; above it, the model can learn the task given sufficient data.
This creates a discontinuous relationship between scale and capability. Loss decreases smoothly (more capacity = better compression), but specific tasks unlock suddenly (sufficient capacity for the pattern). The result: larger models don’t just do things better—they do things smaller models can’t do at all.
Few-Shot Learning: Learning from Context
Perhaps the most surprising emergent ability is few-shot learning: the model learns tasks from examples provided in the prompt, without any parameter updates. This in-context learning transforms language models into universal function approximators.
Example:
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
The model continues with “fromage” (French for cheese). It inferred the task (English→French translation) from the examples in the prompt, applied the pattern, and generated the correct output—all without fine-tuning.
This capability emerges at scale. GPT-2 (1.5B) struggles with few-shot learning. GPT-3 (175B) excels: given 0-5 examples, it performs tasks it was never explicitly trained on. Larger models learn more effectively from fewer examples—GPT-4 often succeeds with 1-shot or even 0-shot (just the task description, no examples).
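Assembling a few-shot prompt like the translation example above is plain string construction; a minimal sketch (model-calling details omitted — any completion API would take the resulting string as input):

```python
def few_shot_prompt(instruction: str,
                    examples: list[tuple[str, str]],
                    query: str) -> str:
    """Build an instruction + demonstrations + query prompt."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")   # the model continues after "=>"
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"),
     ("peppermint", "menthe poivrée"),
     ("plush girafe", "girafe peluche")],
    "cheese",
)
print(prompt)
```

The demonstrations define the task implicitly; the trailing `cheese =>` leaves exactly one slot for the model to fill.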
Why in-context learning works: During pretraining, the model sees countless patterns where a task is demonstrated then applied. Web pages explain concepts then give examples. Documentation shows function signatures then usage. Forums ask questions, then provide answers. The model learns the meta-pattern: “when text shows [task format] followed by [examples], generate [application of the pattern].”
The model doesn’t “understand” it’s doing few-shot learning—it’s predicting plausible continuations based on patterns in training data. But those patterns happen to include task demonstration → task execution, so the model learns to perform tasks from prompts.
Chain-of-thought prompting extends in-context learning to multi-step reasoning. Instead of asking for direct answers, prompt the model to show its reasoning:
Problem: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
Let's think step by step:
- Roger starts with 5 balls
- He buys 2 cans
- Each can has 3 balls, so 2 cans have 2 × 3 = 6 balls
- Total: 5 + 6 = 11 balls
Answer: 11
This prompting strategy significantly improves performance on math, logic, and reasoning tasks. The model generates intermediate steps, which serve as additional context for producing the final answer. Chain-of-thought turns next-token prediction into a reasoning process: each predicted token provides context that improves subsequent predictions.
Chain-of-thought only works at scale. Small models produce nonsensical reasoning chains. Large models (>100B parameters) produce coherent, often correct chains. This is another emergent ability: the capacity to generate useful intermediate reasoning steps appears suddenly above a threshold.
Generalization: Why LLMs Transfer Knowledge
Language models trained on internet text generalize to tasks they’ve never seen explicitly. This zero-shot transfer happens because pretraining data implicitly contains task patterns.
A model trained on web text sees:
- Question-answering: forums, FAQs, Quora, Stack Overflow
- Summarization: article abstracts, TL;DRs, executive summaries
- Translation: multilingual websites, bilingual documents
- Code generation: GitHub repositories with comments and implementations
- Dialogue: chat logs, Reddit threads, customer service transcripts
These aren’t labeled datasets but naturally occurring examples. The model learns these formats as statistical patterns. When prompted with “Q: … A:”, it predicts text matching the question-answer pattern seen during pretraining. Zero-shot transfer works because pretraining data is so diverse that most tasks appear implicitly.
Why generalization improves with scale: Larger models have more capacity to compress diverse patterns. A small model must prioritize—it can learn common tasks but not rare ones. A large model compresses common tasks and rare edge cases. The result: larger models generalize better. GPT-2 fails at most zero-shot tasks. GPT-3 succeeds at many. GPT-4 succeeds at most.
This explains the power of foundation models: they’re trained on such diverse data that they generalize to countless downstream tasks without task-specific training. The model hasn’t seen explicit “sentiment classification” training data, but it’s seen enough text discussing emotions that it can classify sentiment zero-shot. It hasn’t been trained on “code debugging,” but it’s seen enough code and explanations that it can debug zero-shot.
Generalization is compression-driven. The model can’t memorize all training data, so it compresses—extracts patterns. Those patterns happen to generalize because they capture the underlying structure of language and tasks described in text.
Engineering Takeaway
Scaling transforms language models from narrow predictors to general-purpose systems. Understanding scaling laws, emergent abilities, and in-context learning is essential for leveraging modern AI effectively.
Bigger models are qualitatively different, not just quantitatively better
A 10× larger model doesn’t just perform 10% better—it unlocks entirely new capabilities. Multi-step reasoning, complex arithmetic, chain-of-thought prompting, and few-shot learning emerge at scale. These aren’t architectural innovations; they’re consequences of capacity. When designing systems, assume larger models will do things smaller models can’t. Plan for emergent abilities: applications that seem impossible with current models may become trivial with the next generation. Scale changes the problem space.
Scaling is expensive but predictable
Training GPT-4-scale models costs tens of millions of dollars. But scaling laws make outcomes predictable: invest 10× compute, get a quantifiable performance improvement. This transforms AI development from research (uncertain outcomes) to engineering (predictable ROI). For organizations, this means scaling is a viable strategy—not a gamble. Budget for larger models based on forecasted performance gains from scaling laws. The economics favor scale: the fixed cost of training is high, but the marginal cost of inference is low, and the model serves millions of users.
Few-shot learning reduces need for task-specific fine-tuning
With large models, many applications don’t require fine-tuning. Provide examples in the prompt, and the model adapts. This is cheaper (no training cost, no labeled data) and faster (immediate deployment). The trade-off: few-shot performance is lower than fine-tuned performance, and prompt engineering requires iteration. For applications where fine-tuning is expensive or data-limited, few-shot learning is a powerful alternative. As models scale, few-shot performance approaches fine-tuned performance, reducing the need for task-specific training.
In-context learning is free but limited by context window
In-context learning happens at inference time—no parameter updates, no training. Just provide examples in the prompt. This makes it incredibly flexible: the same model adapts to different tasks dynamically. The constraint: context window size. Current models support 4K–128K tokens. Long-context models (1M+ tokens) enable even more in-context learning. For production systems, maximize in-context learning by designing prompts that fit essential examples and instructions within the context window. This reduces deployment complexity compared to maintaining fine-tuned models for every task.
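Fitting examples into a fixed context budget is a routine production concern. A rough sketch using a crude chars-per-token heuristic (~4 characters per English token — a real system would use the model's tokenizer):

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_examples(instruction: str, examples: list[str], budget: int) -> list[str]:
    """Keep as many in-context examples as fit alongside the instruction."""
    used = approx_tokens(instruction)
    kept = []
    for ex in examples:
        cost = approx_tokens(ex)
        if used + cost > budget:
            break
        kept.append(ex)
        used += cost
    return kept

examples = [f"input {i} => output {i}" for i in range(100)]
kept = fit_examples("Classify the sentiment:", examples, budget=120)
print(f"{len(kept)} examples fit in the budget")
```

The same pattern generalizes: prioritize the most informative examples first, and degrade gracefully when the window fills rather than truncating mid-example.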
Emergent abilities justify the cost of training large models
Smaller models are cheaper to train but lack critical capabilities. A 1B parameter model can’t do multi-step reasoning, arithmetic, or complex translation. A 100B parameter model can. The cost difference is 100×, but the capability difference is qualitative, not quantitative. For applications requiring these abilities, there’s no substitute for scale. Organizations building AI products must decide: train/use large models with emergent abilities (high cost, high capability) or use smaller models with limited abilities (low cost, limited capability). For many applications, the emergent abilities justify the cost.
Prompting becomes programming
With large models, prompts are code. Chain-of-thought prompting, few-shot examples, instruction formatting—these are programming constructs. Engineering effective prompts (Part VI) is now a critical skill. The model’s behavior is controlled entirely by the input prompt, making prompt design as important as algorithm design in traditional software. Production systems invest heavily in prompt engineering: iterating on formats, testing variations, optimizing for task performance. Understanding how prompts interact with emergent abilities (in-context learning, chain-of-thought) enables building sophisticated applications without fine-tuning.
Compute is the bottleneck for frontier models
Training frontier models requires clusters of tens of thousands of GPUs running for months. The bottleneck isn’t algorithms or data—it’s compute. Scaling laws show exactly how much compute is needed for a target performance level. Organizations with sufficient compute can train state-of-the-art models; those without must use smaller models or API access. The practical implication: AI development increasingly favors organizations with massive compute resources. For most practitioners, accessing frontier models via APIs (OpenAI, Anthropic, Google) is more viable than training from scratch. Understanding scaling laws helps evaluate trade-offs: when is it worth training your own model vs. using a provider’s API?
The lesson: Scale is the dominant factor in language model capability. Scaling laws make performance improvements predictable—invest in size, get better models. But scale doesn’t just improve performance; it unlocks emergent abilities that smaller models lack entirely. Few-shot learning, chain-of-thought reasoning, and zero-shot generalization appear suddenly at specific model sizes, transforming what’s possible. Modern AI strategy revolves around scale: training or accessing the largest models feasible, leveraging emergent abilities through prompt engineering, and planning for future capabilities as models continue to grow. Understanding scaling—why it works, what it unlocks, and how to leverage it—is essential for building and deploying AI systems that push the frontier of what’s possible.
References and Further Reading
Scaling Laws for Neural Language Models – Jared Kaplan, Sam McCandlish, Tom Henighan, et al. (2020) https://arxiv.org/abs/2001.08361
Kaplan et al. empirically characterized how language model performance scales with model size, dataset size, and compute. They showed that loss follows predictable power laws across orders of magnitude, enabling forecasting: small-scale experiments predict large-scale performance. The paper quantifies the trade-offs between model size and training data, providing a framework for compute-optimal training. This work transformed AI development from alchemy to engineering—performance improvements with scale are nearly guaranteed. The paper’s insights justified massive investments in training GPT-3 and subsequent models. Reading this explains why scaling became the dominant strategy in AI and how to predict performance gains from increased compute.
Emergent Abilities of Large Language Models – Jason Wei, Yi Tay, Rishi Bommasani, et al. (2022) https://arxiv.org/abs/2206.07682
Wei et al. documented abilities that appear suddenly at specific model sizes rather than scaling smoothly. They identified dozens of emergent abilities—multi-step reasoning, translation, arithmetic—that are absent in models below a threshold but present above it. The paper shows that scale doesn’t just improve performance; it unlocks qualitatively new capabilities. This explains why GPT-3 can do tasks GPT-2 can’t, and why GPT-4 outperforms GPT-3 not just marginally but categorically. Understanding emergent abilities is critical for planning AI applications: current models may lack necessary capabilities, but the next generation may possess them. Reading this clarifies what scale buys beyond lower loss—entirely new capabilities.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models – Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. (2022) https://arxiv.org/abs/2201.11903
Wei et al. showed that prompting large models to generate intermediate reasoning steps (chain-of-thought) dramatically improves performance on complex reasoning tasks. Instead of directly answering, the model shows its work—breaking problems into steps, which serve as context for subsequent predictions. This simple prompting strategy unlocks reasoning-like behavior in models trained only on next-token prediction. The paper demonstrates that emergent abilities (here, multi-step reasoning) can be amplified through clever prompting. Chain-of-thought is now standard in production systems for math, logic, and complex question answering. Reading this clarifies how to leverage large models effectively: prompting strategies can unlock capabilities that seem absent without the right input format.