Part V: Language Models
How do Transformers become language models like GPT? Through a simple objective applied at unprecedented scale: predict the next token. This part explains how next-token prediction, combined with massive datasets and careful training methods, produces systems with surprising capabilities.
Language models are Transformers trained to predict what comes next. Given a sequence of words, predict the next word. This self-supervised task requires no labeled data—just text. The model learns the patterns of language by making this prediction across billions of examples.
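To make the objective concrete, here is a minimal sketch of the next-token loss: at each position the model produces a distribution over the vocabulary, and the loss is the negative log-probability assigned to the token that actually came next. The five-word vocabulary and the random "model" below are stand-ins invented for illustration, not a real Transformer.

```python
import numpy as np

# Hypothetical 5-token vocabulary and one tokenized sentence.
vocab = ["the", "cat", "sat", "on", "mat"]
tokens = [0, 1, 2, 3, 0, 4]  # "the cat sat on the mat"

rng = np.random.default_rng(0)

def model_logits(context):
    # Stand-in for a Transformer: returns arbitrary scores, one per vocab word.
    return rng.normal(size=len(vocab))

def next_token_loss(tokens):
    losses = []
    for t in range(len(tokens) - 1):
        logits = model_logits(tokens[: t + 1])
        probs = np.exp(logits - logits.max())  # softmax over the vocabulary
        probs /= probs.sum()
        # Negative log-likelihood of the true next token.
        losses.append(-np.log(probs[tokens[t + 1]]))
    return float(np.mean(losses))

loss = next_token_loss(tokens)
```

Training drives this average loss down, which forces the model to assign high probability to whatever token the text actually contains next.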
We start with next-token prediction as the core task. Why does this simple objective work? Because predicting the next word requires understanding syntax, semantics, facts, and context. A model that predicts well has learned something about language structure.
Pretraining is where capabilities emerge. Models train on vast text corpora—books, websites, code—learning general patterns in language. This self-supervised learning creates models with broad knowledge that can be specialized for specific tasks.
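The self-supervised setup can be sketched with a deliberately tiny stand-in for a Transformer—a bigram count model. The point is not the model but the data flow: the corpus is raw text, and every position supplies its own label (the next word), so no annotation is needed.

```python
from collections import Counter, defaultdict

# A toy "corpus" of raw, unlabeled text (made up for illustration).
corpus = "the cat sat on the mat the cat ran".split()

# "Pretraining": every adjacent pair is a free (context, next-word) example.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation observed during pretraining.
    return counts[word].most_common(1)[0][0]

predict_next("the")  # "cat": seen twice after "the", versus "mat" once
```

A real pretraining run replaces the count table with a Transformer and gradient descent, but the supervision signal is exactly this: the text itself.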
Fine-tuning adapts pretrained models to specific applications. Take a model trained on general text and continue training on task-specific data. This transfer learning is efficient: general capabilities transfer, only task-specific adjustments are needed.
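The transfer-learning idea can be illustrated with a toy linear model standing in for a pretrained network: fine-tuning starts from the pretrained weights rather than from scratch, and a few small gradient steps on a small task dataset adapt them. All weights and data below are synthetic, invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

pretrained_w = rng.normal(size=4)   # stands in for weights from pretraining
w = pretrained_w.copy()             # fine-tuning starts here, not at zero

# A small labeled task-specific dataset (synthetic).
X = rng.normal(size=(32, 4))
y = (X @ rng.normal(size=4) > 0).astype(float)

def task_loss(w):
    p = 1 / (1 + np.exp(-(X @ w)))  # sigmoid predictions
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

before = task_loss(w)
for _ in range(100):                # a few gradient steps with a small rate
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)
after = task_loss(w)
```

The task loss drops while the weights remain anchored near their pretrained initialization—only task-specific adjustments are learned, which is why fine-tuning needs far less data than pretraining.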
Reinforcement learning from human feedback (RLHF) aligns models with human preferences. Predicting the next token doesn’t ensure helpful or truthful outputs. RLHF adjusts model behavior based on human judgments of quality, safety, and usefulness.
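One concrete piece of the RLHF pipeline is reward modeling: humans pick which of two responses they prefer, and a reward model is trained with a pairwise (Bradley–Terry-style) loss to score the chosen response above the rejected one. The sketch below assumes a linear reward model over made-up feature vectors; real systems use a Transformer over the full response text.

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(3)  # reward model parameters (linear for this sketch)

# Hypothetical feature vectors for human-labeled (chosen, rejected) pairs.
chosen = rng.normal(loc=1.0, size=(8, 3))
rejected = rng.normal(loc=-1.0, size=(8, 3))

for _ in range(100):
    margin = (chosen - rejected) @ w          # r(chosen) - r(rejected)
    sig = 1 / (1 + np.exp(-margin))
    # Gradient ascent on log sigmoid(margin): widen the preference gap.
    w += 0.1 * ((1 - sig)[:, None] * (chosen - rejected)).mean(axis=0)

# Fraction of pairs where the learned reward ranks chosen above rejected.
frac = float(np.mean(chosen @ w > rejected @ w))
```

The learned reward then supplies the training signal for the policy-optimization step (e.g., PPO), steering generation toward outputs humans rate highly.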
Emergent abilities appear at scale: chain-of-thought reasoning, in-context learning, instruction following. These abilities weren’t explicitly trained for; they emerge from next-token prediction once models and datasets are large enough. The mechanisms aren’t fully understood, but the pattern is clear.
After this part, you’ll understand what language models are and how they’re trained. But models alone don’t make systems. Part VI shows how to build production applications using these models as components.