Chapter 22: Pretraining
Learning from the Internet
Self-Supervised Learning Without Labels
Language model pretraining is self-supervised learning: training without human-provided labels. Instead of labeled datasets—“this image is a cat,” “this email is spam”—the training data itself provides the supervision.
The next token in a sequence is the label. Given “The capital of France is”, the label is “Paris”—not added by humans but extracted from the text itself. Every document becomes millions of training examples: each position provides a context (preceding tokens) and a target (the next token). A single sentence “The cat sat on the mat” yields five training examples:
Context: "The" Target: "cat"
Context: "The cat" Target: "sat"
Context: "The cat sat" Target: "on"
Context: "The cat sat on" Target: "the"
Context: "The cat sat on the" Target: "mat"
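The labeling scheme above can be sketched in a few lines of Python. Token lists stand in for a real tokenizer here; the function name is illustrative:

```python
def next_token_examples(tokens):
    """Each position i > 0 yields (context = tokens[:i], target = tokens[i])."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["The", "cat", "sat", "on", "the", "mat"]
examples = next_token_examples(sentence)
for context, target in examples:
    print(f"Context: {' '.join(context):<22} Target: {target}")
# An n-token sequence yields n - 1 (context, target) pairs.
```

No annotation step appears anywhere: the pairs fall directly out of the raw text.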
Self-supervision eliminates the need for manual annotation. Labeling data is expensive: humans must read, understand, and categorize each example. For image classification, experts label thousands of images. For machine translation, bilingual speakers align sentence pairs. These datasets are limited by cost and human effort—ImageNet has ~1 million images, translation corpora rarely exceed tens of millions of sentence pairs.
Language modeling bypasses this bottleneck. Any text provides supervision—books, articles, websites, code, conversations. The internet contains trillions of words, all freely available (legally or otherwise). No humans need to label what comes next; the text itself contains the answer. This abundance of free supervision is why language models scale beyond supervised methods.
The trade-off: self-supervised learning uses a proxy objective. The model optimizes next-token prediction, not the actual task (question answering, translation, summarization). The bet is that learning to predict text forces the model to internalize patterns useful for downstream tasks. Chapter 21 explained why this works: prediction requires compression, and compression captures structure. Pretraining bets that text prediction is a rich enough objective to learn general language understanding.
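The proxy objective is just the average negative log-likelihood the model assigns to the true next token. A minimal sketch, with made-up model probabilities (not from any real model):

```python
import math

def next_token_loss(predictions, targets):
    """Average negative log-likelihood of the true next tokens.
    predictions: one dict per position mapping candidate token -> probability."""
    nll = [-math.log(p[t]) for p, t in zip(predictions, targets)]
    return sum(nll) / len(nll)

# Hypothetical predicted distributions at two positions.
predictions = [
    {"Paris": 0.8, "Lyon": 0.2},   # after "The capital of France is"
    {".": 0.9, "!": 0.1},          # after "... is Paris"
]
loss = next_token_loss(predictions, ["Paris", "."])
```

Training pushes this number down, which means putting more probability mass on whatever actually comes next—the compression view from Chapter 21.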
Why Scale Matters: Coverage of Reality
The internet contains discussions of science, history, politics, fiction, code, recipes, instructions, arguments, and nonsense. Pretraining on this diversity exposes the model to vastly more concepts, patterns, and edge cases than any curated dataset.
Coverage determines capability. A model trained on 100M tokens sees limited vocabulary, few examples of rare constructions, and narrow domains. A model trained on 500B tokens (GPT-3) encounters:
- Common words tens of millions of times (strong statistical signal)
- Rare words thousands of times (sufficient to learn embeddings)
- Technical jargon across domains (medicine, law, engineering)
- Multiple languages (enabling multilingual understanding)
- Diverse reasoning patterns (math proofs, legal arguments, code logic)
- Edge cases and exceptions (idioms, sarcasm, historical events)
More data means more knowledge gets compressed into parameters. A model can only predict “Paris is the capital of France” if it saw enough instances to encode that association. For common facts (capitals of major countries), thousands of examples suffice. For rare facts (obscure historical events), few examples exist—the model may never learn them or overgeneralize from insufficient data.
Data diversity enables generalization. Training on Wikipedia alone produces models good at encyclopedia-style text but poor at conversations, code, or creative writing. Training on diverse sources—web crawls, books, GitHub, forums—produces models that generalize across domains. The model learns patterns common to all text (grammar, factual structure, reasoning) while also learning domain-specific conventions (code syntax, formal vs informal tone).
The scale of modern pretraining is staggering:
- GPT-2 (2019): 1.5B parameters, 40GB text (~10B tokens)
- GPT-3 (2020): 175B parameters, ~300GB text (~500B tokens)
- PaLM (2022): 540B parameters, ~780B tokens
- LLaMA 2 (2023): 70B parameters, 2 trillion tokens
Each generation trains on more data. Why? Because performance continues to improve. Scaling laws (Chapter 25) show predictable improvements with data size—double the data, reduce the loss by a consistent factor. Within the data regimes explored so far, there is no sign of saturation: more data keeps helping.
The diagram shows pretraining as compression: diverse training data (web text, books, code) is compressed into model parameters, which encode grammar, facts, reasoning patterns—and biases. The model learns whatever statistical regularities exist in the data, both useful and harmful.
Token Diversity and Concept Learning
A language model can only use what it has seen during training. If “quantum chromodynamics” never appears in the training data, the model is unlikely to produce it—subword tokenization can still spell the term out, but the model has learned no statistics about when it should appear.
This creates a coverage problem for rare concepts. Common words (“the”, “is”, “cat”) appear millions of times, providing strong signal. Technical terms (“glioblastoma”, “recombinase”, “merkle tree”) appear rarely, providing weak signal. If a medical term appears only 10 times, the model may not learn its meaning or may overgeneralize from insufficient context.
Token frequency determines learning. High-frequency tokens (top 1000 words) are learned robustly—the model sees them in countless contexts and learns precise embeddings and usage patterns. Mid-frequency tokens (10K–50K rank) are learned adequately if the dataset is large. Low-frequency tokens (tail of vocabulary) are learned poorly or not at all—few examples don’t provide enough signal to distinguish them from noise.
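The frequency tiers above can be made concrete with a toy bucketing pass. The thresholds here are toy values chosen for a few-word corpus; real pipelines use rank cutoffs computed over billions of tokens:

```python
from collections import Counter

def frequency_tiers(tokens, high=2, low=1):
    """Bucket vocabulary items by raw count into high/mid/low tiers."""
    counts = Counter(tokens)
    tiers = {"high": [], "mid": [], "low": []}
    for tok, c in counts.items():
        if c > high:
            tiers["high"].append(tok)    # seen often: learned robustly
        elif c > low:
            tiers["mid"].append(tok)     # learned adequately
        else:
            tiers["low"].append(tok)     # weak signal: learned poorly
    return tiers

corpus = "the cat sat on the mat the cat ran glioblastoma".split()
tiers = frequency_tiers(corpus)
```

Even in this tiny corpus, the shape is visible: function words dominate the high tier while the technical term sits in the long tail.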
This explains why larger datasets improve performance on specialized domains. A model trained on 10B tokens may rarely see biochemistry terminology. A model trained on 1T tokens sees biochemistry papers thousands of times, learning domain-specific patterns. The long tail of rare concepts requires massive data to cover adequately.
Multilingual coverage: English dominates the internet (~60% of web pages), but GPT models learn dozens of languages through exposure to non-English text. The model learns that “cat” (English), “chat” (French), and “Katze” (German) refer to similar concepts through contextual similarity in training data. However, low-resource languages (e.g., Swahili or Urdu) are learned poorly compared to high-resource languages (English, Spanish, Chinese).
The quality of learned representations correlates directly with data quantity. This creates performance disparities: models excel at English but struggle with low-resource languages, understand common topics better than obscure ones, and perform well on popular domains (general knowledge, programming) but poorly on specialized fields (rare medical conditions, niche legal subfields).
Spurious Correlations and Inherited Biases
Training data reflects reality—including its biases, misinformation, and problematic patterns. Models learn statistical associations without distinguishing true patterns from spurious correlations.
Biases are statistical regularities. If training data contains gender stereotypes (“doctor” often appears with “he”, “nurse” with “she”), the model learns these associations. Asked to complete “The doctor entered the room. She ___”, the model may generate text reflecting stereotyped assumptions because that pattern was statistically common in training data. The model doesn’t “believe” stereotypes—it predicts based on learned correlations.
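The “biases are statistical regularities” point can be demonstrated with a toy co-occurrence probe over a made-up corpus. This counts sentence-level co-occurrences only; real bias audits use embedding association tests or template prompts:

```python
from collections import Counter

def pronoun_association(sentences, term, pronouns=("he", "she")):
    """Count how often each pronoun co-occurs with `term` in the same sentence."""
    counts = Counter()
    for s in sentences:
        words = s.lower().split()
        if term in words:
            for p in pronouns:
                counts[p] += words.count(p)
    return counts

# Tiny illustrative corpus with a deliberate skew.
corpus = [
    "The doctor said he would call",
    "The doctor said he was late",
    "The nurse said she would help",
]
assoc = pronoun_association(corpus, "doctor")
```

Whatever skew exists in the counts is exactly what a predictive model will reproduce—the association is in the data before it is ever in the model.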
Misinformation becomes encoded. If conspiracy theories or false claims appear frequently in training data, the model learns to predict them as plausible continuations. It can’t distinguish truth from falsehood without external verification. The model learns that “vaccines cause autism” is a phrase that appears in text, even though it’s factually false. Prediction doesn’t require truth—only statistical frequency.
Toxic content and harmful patterns: Internet text includes hate speech, offensive language, and instructions for harmful activities. Models trained on unfiltered web crawls absorb these patterns. A model completing “Group X people are ___” may generate offensive completions because such text existed in training data. This is not malice—it’s compression of statistical patterns, including harmful ones.
Data curation attempts to mitigate these issues but trades coverage for safety:
- Filtering toxic content: Removes explicitly harmful text but reduces dataset size and may introduce new biases (over-filtering innocuous content)
- Deduplication: Removes repeated text (preventing memorization) but may eliminate rare examples needed for coverage
- Source selection: Prioritizing high-quality sources (books, Wikipedia) over low-quality ones (forums, comments) improves average quality but narrows diversity
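The first two curation steps above can be sketched as a single pass. This is a deliberately crude sketch—real pipelines use trained toxicity classifiers rather than a blocklist, and fuzzy rather than exact deduplication; `badword` is a placeholder term:

```python
def curate(documents, blocklist=("badword",)):
    """Toy curation pass: drop documents containing blocked terms,
    then drop exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        if any(term in doc.lower() for term in blocklist):
            continue  # filtering: removes harmful text, shrinks coverage
        h = hash(doc)
        if h in seen:
            continue  # deduplication: removes repeats, may drop rare text
        seen.add(h)
        kept.append(doc)
    return kept

docs = ["clean text", "clean text", "contains badword here", "another doc"]
kept = curate(docs)
```

Note that both branches discard data: every curation decision trades some coverage for some safety.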
There’s no perfect solution. Filtering reduces harms but also reduces coverage of reality. Unfiltered data maximizes coverage but includes harmful content. Modern pretraining balances these trade-offs through careful curation—removing the most egregious content while preserving diversity. However, biases persist because they’re woven into human-generated text. Models will reflect their training data’s values, biases, and errors unless explicitly aligned through post-training (Chapters 23–24).
The Mathematics of Data Scaling
Performance improves with data size following empirical scaling laws (detailed in Chapter 25). The relationship is a power law:
L(D) = (D_c / D)^α, where L is test loss, D is dataset size (measured in tokens), D_c is a fitted constant, and α is the scaling exponent (empirically small, on the order of 0.05–0.1). Doubling training data reduces loss by a predictable factor of 2^−α.
In log-log space, this appears as a straight line: log L = α log D_c − α log D, with slope −α. The linearity of this relationship enables forecasting: experiments with small datasets predict performance at large scales. This predictability justifies massive investments in compute and data—performance gains are nearly guaranteed with scale.
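The forecasting use of the power law is a one-line extrapolation. The exponent below is the data-scaling exponent reported by Kaplan et al. (2020); the reference loss and dataset sizes are illustrative, not measured values:

```python
def predict_loss(D, D_ref, L_ref, alpha=0.095):
    """Power-law extrapolation: L(D) = L_ref * (D_ref / D)**alpha.
    alpha ~0.095 is Kaplan et al.'s fitted data exponent."""
    return L_ref * (D_ref / D) ** alpha

# Doubling data multiplies loss by 2**-alpha ~= 0.936, a ~6% reduction.
factor = 2 ** -0.095
L_small = predict_loss(1e10, D_ref=1e10, L_ref=3.0)  # reference point itself
L_large = predict_loss(1e12, D_ref=1e10, L_ref=3.0)  # 100x more data
```

The per-doubling gain looks tiny, but compounding it over many doublings (10B → 1T tokens is more than six doublings) yields the large improvements seen between model generations.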
The diagram shows the empirical relationship between training data size and loss. In log-log space, the curve is approximately linear (power law). Each successive generation of models trains on more data and achieves lower loss. This predictable scaling justifies continued investment in larger datasets.
However, scaling has diminishing returns: the first 10B tokens provide massive improvement, the next 90B provide moderate improvement, the next 900B provide incremental gains. At some point, the cost of additional data exceeds the value of marginal performance gains. Current models (2023–2024) train on 1–2 trillion tokens, approaching the limit of available high-quality text on the internet.
Engineering Takeaway
Pretraining on massive datasets is the foundation of modern language models. Understanding data scale, diversity, and curation explains model capabilities and limitations.
Pretraining is expensive but amortizes across all tasks
Training GPT-3 cost an estimated $4–5 million in compute. Training cutting-edge models costs tens of millions. But once trained, the model serves millions of users and countless applications. Pretraining cost is amortized over all downstream uses—fine-tuning (Chapter 23), prompting (Part VI), and deployment. This is why foundation models dominate: the fixed cost of pretraining is enormous, but the marginal cost of reusing the model is near zero.
Data quality matters more than quantity at the margin
Early scaling (10B → 100B tokens) benefits from any data. Later scaling (500B → 2T tokens) saturates on low-quality sources. Adding more Reddit comments provides diminishing returns compared to high-quality books or technical documentation. Modern pretraining curates data sources, prioritizing quality and diversity over raw quantity. Filtering, deduplication, and source selection improve model behavior even if total token count decreases.
Data curation trades coverage for safety
Removing toxic content reduces harmful outputs but narrows the model’s understanding of the world. Over-filtering can introduce new biases (e.g., removing all mentions of certain topics makes the model ignorant of them, even for benign queries). Curation is a judgment call: what content is harmful enough to exclude vs. valuable enough to include? There’s no perfect answer. Modern pipelines use automated toxicity classifiers, source reputation, and manual review to balance safety and coverage.
Deduplication prevents memorization and improves generalization
Training data often contains duplicates (identical or near-identical text appearing multiple times). Duplicates cause models to memorize specific sequences rather than learning general patterns. Deduplication (removing repeated text) improves generalization by forcing the model to compress diverse examples rather than memorizing common ones. However, some duplication is benign (common phrases, idioms), and over-aggressive deduplication can remove valid training signal. Practical systems use fuzzy matching (e.g., MinHash, SimHash) to remove near-duplicates while preserving useful repetition.
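A minimal MinHash sketch shows how near-duplicate detection works: each document is reduced to a small signature, and the fraction of matching signature positions estimates Jaccard similarity between shingle sets. This uses `hashlib.md5` with seed prefixes as a stand-in for proper seeded hash families; production systems use optimized implementations:

```python
import hashlib

def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash; matching positions estimate Jaccard."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_similarity(a, b):
    sa, sb = minhash_signature(shingles(a)), minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Near-duplicates share most shingles, so most min-hashes agree.
dup = est_similarity("the cat sat on the mat today",
                     "the cat sat on the mat now")
diff = est_similarity("the cat sat on the mat today",
                      "completely unrelated text about trains here")
```

Signatures let a pipeline compare billions of documents without ever computing pairwise shingle intersections directly.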
Compute-optimal training balances model size and data size
The Chinchilla paper (Hoffmann et al., 2022) showed that prior models were undertrained: given a compute budget, it’s better to train smaller models on more data than larger models on less data. GPT-3 (175B parameters, ~500B tokens) was trained on insufficient data relative to its size. Chinchilla (70B parameters, 1.4T tokens) achieved lower loss with the same compute by reducing model size and increasing data. The optimal ratio: for every doubling of model size, double the amount of training data. This insight shifted pretraining strategies toward longer training runs on massive datasets rather than simply scaling model parameters.
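The compute-optimal allocation can be sketched from two widely cited Chinchilla-era rules of thumb: training cost C ≈ 6·N·D FLOPs, and roughly 20 training tokens per parameter. Both are approximations from Hoffmann et al. (2022), not exact prescriptions:

```python
import math

def chinchilla_allocation(flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~= 6 * N * D with D ~= 20 * N.
    Both N (parameters) and D (tokens) then scale as sqrt(C)."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: ~5.9e23 FLOPs -> ~70B params, ~1.4T tokens.
n, d = chinchilla_allocation(5.9e23)
```

Because both quantities scale as sqrt(C), doubling parameters without doubling tokens wastes compute—the core correction Chinchilla made to earlier scaling practice.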
Checkpoints capture knowledge at different training stages
Models improve continuously during training as they process more data. Early checkpoints (e.g., after 10% of training) have learned basic patterns but not specialized knowledge. Late checkpoints (after 90%) have learned nuanced patterns but may overfit to training distribution. Saving intermediate checkpoints allows selecting the model that balances generalization and performance. Some applications prefer earlier checkpoints (less overfitting), others prefer late checkpoints (maximum capability).
Why foundation models work: transfer from pretraining to any task
Pretraining creates general-purpose models. The model hasn’t seen tasks explicitly but has seen task-like patterns in diverse data. Question-answering appears in forums, FAQ pages, and textbooks. Translation appears in multilingual documents. Code generation appears in GitHub repositories with comments. The model learns these formats as statistical patterns, enabling zero-shot transfer to formal tasks (Chapter 23). Foundation models work because pretraining data is diverse enough to contain implicit examples of most tasks humans care about.
The lesson: Language model capabilities scale with data. More diverse, high-quality training data produces more capable, knowledgeable models. But data isn’t neutral—it encodes biases, misinformation, and toxic patterns. Pretraining compresses the internet, including both its wisdom and its flaws. Understanding data curation, coverage, and scaling laws is essential for building and deploying modern language models effectively.
References and Further Reading
Language Models are Few-Shot Learners – Tom B. Brown, Benjamin Mann, Nick Ryder, et al. (2020) https://arxiv.org/abs/2005.14165
The GPT-3 paper demonstrated that scaling model size and training data produces qualitative improvements in capability. Brown et al. trained a 175B parameter model on ~500B tokens and showed it could perform tasks zero-shot or few-shot without fine-tuning—answering questions, translating languages, writing code—purely from in-context examples. This established scale as a primary driver of performance: bigger models trained on more data unlock emergent abilities. The paper includes detailed ablations showing how data diversity (web text, books, Wikipedia) affects performance across domains. Reading this explains why pretraining on internet-scale data is the foundation of modern AI and how capabilities emerge from scale alone.
Training Compute-Optimal Large Language Models – Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. (2022) https://arxiv.org/abs/2203.15556
The Chinchilla paper challenged conventional scaling wisdom. Hoffmann et al. showed that GPT-3-scale models were undertrained—given fixed compute, it’s better to train smaller models on more data. Chinchilla (70B parameters, 1.4T tokens) outperformed much larger models (GPT-3, Gopher) by using a compute-optimal training regime. The paper derives scaling laws for optimal model size vs. data size and provides practical guidance: for every doubling of parameters, double training tokens. This reshaped pretraining strategy: modern models train longer on larger datasets rather than simply adding parameters. Understanding compute-optimal scaling clarifies the economics of training large models and why data matters as much as model size.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? – Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell (2021) https://dl.acm.org/doi/10.1145/3442188.3445922
Bender et al. critically examine the costs and risks of large language models. They discuss environmental impact (massive compute requires enormous energy), bias (models inherit training data biases), misinformation (models generate plausible but false text), and misalignment (models optimize prediction, not truth or ethics). The paper argues for responsible scaling: considering societal impacts, transparency in data curation, and alignment research alongside capability improvements. While not a technical paper, it provides essential context on why pretraining choices matter beyond performance metrics. Understanding the dangers and trade-offs of scale is crucial for building AI systems responsibly. Reading this grounds technical decisions in ethical and societal considerations.