Chapter 21: Next-Token Prediction
The Engine of Language Models
Language as Probability
Every language model—from GPT-3 to Claude to LLaMA—is trained to solve a single task: predict the next word. That’s it. No explicit instruction following, no question answering, no code generation. Just predict what comes next in a sequence of text.
This simple objective is the foundation of all modern large language models. Everything these models can do—answering questions, writing code, translating languages, engaging in conversation—emerges from learning to predict the next token in a sequence. Understanding this objective is essential to understanding how language models work and why they behave the way they do.
The architecture is a Transformer (Chapter 20), but the task is next-token prediction. Given a sequence of tokens x_1, …, x_{t-1}, the model predicts the probability distribution over the next token x_t: P(x_t | x_1, …, x_{t-1}). The model is trained on billions of sequences, learning statistical patterns that enable it to assign high probability to likely continuations and low probability to unlikely ones.
Tokens: The Atoms of Language
Before prediction, text must be broken into tokens—the atomic units of language modeling. A token might be a word, a subword, or even a character, depending on the tokenization scheme.
Tokenization converts raw text into a sequence of integers from a vocabulary. For English, common tokenization schemes like Byte-Pair Encoding (BPE) or WordPiece create vocabularies of 30,000 to 100,000 tokens. Common words get their own token (“hello” → token 4521), while rare words are split into subwords (“unhappiness” → “un” + “happiness” → tokens 837, 9204).
Why subword tokenization? It balances vocabulary size with coverage. A character-level vocabulary (26 letters + punctuation) is tiny but sequences are long. A word-level vocabulary covers common words but struggles with rare terms and produces massive vocabularies. Subword tokenization compresses common patterns while handling rare words through composition.
Example tokenization for “The cat sat on the mat”:
Text: "The" "cat" "sat" "on" "the" "mat"
Tokens: 464 3797 3332 319 262 3298
The model operates on token IDs, not raw text. The vocabulary size (typically V ≈ 50,000) determines the output dimension: for each input sequence, the model predicts a probability distribution over all V possible next tokens.
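This word-to-ID mapping can be sketched as a toy tokenizer. The IDs below are the ones from the example above; the whole-word lookup is a stand-in for real subword algorithms like BPE, which would split rare words rather than fail on them:

```python
# Toy word-level tokenizer: a stand-in for subword schemes like BPE.
vocab = {"The": 464, "cat": 3797, "sat": 3332, "on": 319, "the": 262, "mat": 3298}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    # Map each whitespace-separated word to its integer ID.
    return [vocab[w] for w in text.split()]

def decode(ids):
    # Map integer IDs back to words.
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("The cat sat on the mat")
assert ids == [464, 3797, 3332, 319, 262, 3298]
assert decode(ids) == "The cat sat on the mat"
```

Real tokenizers also handle out-of-vocabulary text by composition, which this sketch omits.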
The Training Objective: Predicting What Comes Next
Given a sequence of tokens x_1, x_2, …, x_{t-1}, the model computes:

P(x_t | x_1, x_2, …, x_{t-1})
This is a probability distribution over the vocabulary. The model assigns a probability to every possible next token. For “The cat sat on the ___”, the model might output:
P("mat" | context) = 0.24
P("floor" | context) = 0.18
P("chair" | context) = 0.12
P("couch" | context) = 0.08
P("roof" | context) = 0.02
...
P("quantum" | context) = 0.0001
High probability for plausible continuations, low probability for implausible ones. The model’s parameters (billions of weights in the Transformer) determine these probabilities.
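These probabilities come from applying a softmax to the model's raw output scores (logits). A minimal sketch with hypothetical logit values:

```python
import math

def softmax(logits):
    # Convert raw scores to probabilities; subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate next tokens.
probs = softmax([2.0, 1.7, 1.3, 0.9, -0.5])
assert abs(sum(probs) - 1.0) < 1e-9   # a valid probability distribution
assert probs[0] == max(probs)          # highest logit -> highest probability
```

In a real model this runs over the full vocabulary of ~50,000 entries, not five.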
Training adjusts these probabilities to match reality. The model sees text from the internet: “The cat sat on the mat.” It predicts the probability of “mat” given “The cat sat on the”, then updates its weights to increase that probability. Over billions of examples, the model learns the statistical structure of language.
The loss function is cross-entropy:

L = -(1/N) · Σ_{t=1}^{N} log P(x_t | x_1, …, x_{t-1})

Where N is the sequence length. This measures surprise: how unexpected was the actual next token given the model’s prediction? If the model assigned high probability to the correct token, loss is low. If it assigned low probability (was surprised), loss is high. Training minimizes average surprise over all sequences in the training data.
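The per-token loss can be sketched directly; the distributions below are hypothetical, with token 0 as the correct next token:

```python
import math

def cross_entropy(probs, target_index):
    # Loss = -log P(correct token): low when the model assigned high
    # probability to what actually came next, high when it was surprised.
    return -math.log(probs[target_index])

# Hypothetical distributions over a 4-token vocabulary; token 0 is correct.
confident = [0.90, 0.05, 0.03, 0.02]
surprised = [0.01, 0.50, 0.30, 0.19]
assert cross_entropy(confident, 0) < cross_entropy(surprised, 0)
```

Averaging this quantity over every position in every training sequence gives the loss that gradient descent minimizes.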
Autoregressive generation: The same model that predicts one token can generate entire sequences. Start with a prompt (“Once upon a time”), predict the next token (“there”), append it, and repeat. Each token becomes context for predicting the next:
Prompt: "Once upon a time"
Output: "there" (predicted given prompt)
Append: "Once upon a time there"
Output: "was" (predicted given extended sequence)
Append: "Once upon a time there was"
Output: "a" (predicted given full context)
...
This autoregressive process—using model outputs as inputs for the next step—continues until the model emits a stop token or reaches a length limit. The entire sequence is generated by repeatedly applying the same prediction operation.
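The loop above can be sketched with a toy lookup table standing in for the Transformer's next-token distribution (here each token has exactly one successor, so generation is deterministic):

```python
# Autoregressive generation with a toy "model": a fixed successor table
# stands in for P(next token | context) from a real Transformer.
next_token = {"Once": "upon", "upon": "a", "a": "time", "time": "there",
              "there": "was", "was": "<eos>"}

def generate(prompt_tokens, max_len=10):
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        nxt = next_token.get(tokens[-1], "<eos>")
        if nxt == "<eos>":          # stop token ends generation
            break
        tokens.append(nxt)          # model output becomes new context
    return tokens

out = generate(["Once", "upon", "a", "time"])
assert out == ["Once", "upon", "a", "time", "there", "was"]
```

A real model would condition on the whole sequence (not just the last token) and sample from a distribution rather than follow a fixed table.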
The diagram shows the core process: input tokens flow through the Transformer, which outputs a probability distribution over all vocabulary tokens. The model assigns high probability to plausible continuations and low probability to implausible ones.
Why Prediction Creates Understanding
How does predicting the next word lead to capabilities like question answering, reasoning, and code generation? The model was never explicitly trained on these tasks—it only saw next-token prediction. Yet it performs them.
The answer: to predict well, the model must learn patterns that capture the structure of language and the world it describes.
Grammar and syntax: To predict likely next words, the model must learn grammatical structure. “The cat sat” → “on” is likely, “The cat sat” → “purple” is unlikely. The model learns subject-verb agreement, tense, word order—not through explicit rules but through statistical patterns in billions of examples.
Facts and knowledge: To predict “Paris is the capital of ___”, the model needs to have learned that “France” is highly probable. This requires encoding factual knowledge about the world. The model doesn’t memorize individual facts—it compresses statistical regularities. If “Paris is the capital of France” appears thousands of times in training data, the association becomes encoded in the model’s parameters.
Reasoning patterns: To complete “If X is greater than Y, and Y is greater than Z, then X is ___”, the model must recognize logical patterns. It learns that “greater” follows transitivity. Not through formal logic, but through exposure to reasoning-like text where such patterns hold.
Task formats: Training data includes text formatted as questions and answers, instructions and responses, code and comments. The model learns these formats as statistical patterns. When prompted with “Question: … Answer:”, it predicts text that looks like an answer because that’s what followed “Question:” in training data.
This is compression-driven understanding. The model can’t memorize all training data (trillions of tokens won’t fit in billions of parameters). It must compress—extract patterns that generalize. The patterns that best compress language are the ones that capture its underlying structure: grammar, facts, reasoning, conventions.
Prediction doesn’t require “understanding” in a philosophical sense. It requires learning statistical patterns that happen to align with the structure of reality. The model assigns high probability to “Paris” after “capital of France” not because it “knows” geography but because that compression minimizes prediction error on training data.
Sampling and Temperature: Controlling Randomness
Given a probability distribution over tokens, how does the model choose which one to output? Several strategies exist, each with trade-offs.
Greedy decoding: Always pick the highest-probability token. Simple and deterministic, but it leads to repetitive, boring text. The model falls into loops: “I think that I think that I think…”, because at each step the highest-probability continuation cycles back through the same few tokens.
Temperature sampling: Before selecting, adjust probabilities using a temperature parameter T:

P(x_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Where z_i are the model’s raw logits (pre-softmax scores). Temperature controls randomness:
- T = 1: Original distribution (default)
- T → 0: Peaked distribution (approaches greedy decoding; high-probability tokens dominate)
- T > 1: Flattened distribution (increases probability of lower-ranked tokens)
Temperature trades off quality and diversity. Low T produces safe, predictable completions. High T produces creative but potentially nonsensical completions.
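Temperature scaling can be sketched directly from the formula, using hypothetical logits:

```python
import math

def apply_temperature(logits, T):
    # p_i = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                 # hypothetical raw scores
low = apply_temperature(logits, 0.1)
mid = apply_temperature(logits, 1.0)
high = apply_temperature(logits, 2.0)
assert low[0] > mid[0] > high[0]         # low T peaks the distribution
assert high[-1] > mid[-1]                # high T lifts the tail
```

Note that T = 1 reproduces the ordinary softmax, so temperature is a post-hoc reshaping of the same logits, not a change to the model.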
Example: “The weather today is ___”
At T = 0.1 (low temperature, peaked distribution):
P("sunny") = 0.85
P("cloudy") = 0.10
P("rainy") = 0.03
P("purple") = 0.0001
→ Likely output: “sunny” (nearly deterministic)
At T = 1.0 (default):
P("sunny") = 0.42
P("cloudy") = 0.28
P("rainy") = 0.18
P("purple") = 0.001
→ Likely output: one of {sunny, cloudy, rainy} (balanced)
At T = 2.0 (high temperature, flat distribution):
P("sunny") = 0.30
P("cloudy") = 0.25
P("rainy") = 0.20
P("purple") = 0.08
→ Possible output: even unlikely tokens like “purple” have non-negligible probability
Top-k sampling: Only consider the k most probable tokens, then sample from the renormalized distribution over those k. This prevents sampling from the long tail of improbable tokens (which can produce nonsense) while allowing diversity among plausible options.
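A minimal sketch of the top-k filtering step, applied to a hypothetical distribution:

```python
def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero out the rest, renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.42, 0.28, 0.18, 0.11, 0.01]       # hypothetical distribution
filtered = top_k_filter(probs, 3)
assert filtered[3] == 0.0 and filtered[4] == 0.0   # long tail removed
assert abs(sum(filtered) - 1.0) < 1e-9             # renormalized
```

Sampling then proceeds from the filtered distribution instead of the full one.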
Nucleus (top-p) sampling: Instead of a fixed k, choose the smallest set of tokens whose cumulative probability exceeds p (e.g., p = 0.9). This adapts to the distribution: peaked distributions use fewer tokens, flat distributions use more. Prevents both repetitive text (low diversity) and nonsense (sampling from very low-probability tokens).
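The adaptive behavior of nucleus sampling can be sketched with two hypothetical distributions, one peaked and one flat:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p, zero out the rest, renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in ranked:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

peaked = [0.85, 0.10, 0.03, 0.02]
flat = [0.30, 0.25, 0.20, 0.25]
# A peaked distribution needs fewer tokens to reach p = 0.9 than a flat one.
assert sum(1 for q in top_p_filter(peaked, 0.9) if q > 0) == 2
assert sum(1 for q in top_p_filter(flat, 0.9) if q > 0) == 4
```

This adaptivity is the advantage over a fixed k: the sampling pool shrinks when the model is confident and grows when it is uncertain.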
These sampling strategies control the trade-off between quality (selecting high-probability, sensible completions) and diversity (avoiding repetitive, predictable text). Production systems tune these parameters based on the application: factual question answering uses low temperature (prioritize accuracy), creative writing uses higher temperature (prioritize variety).
Engineering Takeaway
Next-token prediction is the foundation of modern language models. Understanding this objective explains their capabilities, limitations, and how to control them effectively.
Next-token prediction is all you need
No task-specific training is required. Models trained solely on next-token prediction can answer questions, write code, translate languages, and engage in conversation. These capabilities emerge from learning to predict text—the objective compresses all tasks that can be expressed in language. This is why pretraining (Chapter 22) on raw internet text produces general-purpose models that transfer to countless applications.
Prompts are control mechanisms
The input to a language model is not a command—it’s a prefix that steers the probability distribution over completions. “Translate to French:” changes the distribution to favor French continuations. “Once upon a time” triggers story-like patterns. Prompt engineering (Part VI) is the art of constructing prefixes that induce desired probability distributions. The model doesn’t “obey” prompts—it predicts plausible continuations given the prompt as context.
Temperature trades off quality and diversity
Low temperature (T < 0.5): Peaked distributions favor high-probability tokens. Use for factual tasks where correctness matters (question answering, code generation). Outputs are safe but repetitive.
High temperature (T > 1.0): Flat distributions increase diversity. Use for creative tasks where variety matters (story writing, brainstorming). Outputs are interesting but may be nonsensical.
Tune temperature based on whether you prioritize accuracy or creativity. Most production systems use T ≈ 0.7–1.0 as a reasonable balance.
Top-k and nucleus sampling prevent nonsense
Sampling from the full distribution (including very low-probability tokens) can produce gibberish. Top-k (k ≈ 40–50) and nucleus (p ≈ 0.9–0.95) sampling restrict attention to plausible tokens while maintaining diversity. These strategies prevent catastrophic failures (nonsensical completions) while avoiding repetitive loops (greedy decoding). Use them in production systems to balance quality and diversity.
Greedy decoding is deterministic but repetitive
Always choosing the highest-probability token produces the same output every time (deterministic) but leads to repetitive loops. Use greedy decoding only when reproducibility is critical and repetition is acceptable (e.g., debugging). For most applications, stochastic sampling with temperature and top-p is preferable.
Evaluation: perplexity measures surprise
Perplexity quantifies how surprised the model is by test data:

PPL = exp( -(1/N) · Σ_{t=1}^{N} log P(x_t | x_1, …, x_{t-1}) )
Lower perplexity means the model assigns higher probability to the actual text—it’s less surprised. Perplexity is the standard metric for evaluating language models during training. A model with perplexity 100 is “as surprised as if it had to choose uniformly among 100 equally likely alternatives” at each step. Better models have lower perplexity.
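Perplexity, the exponentiated average negative log-probability of the test tokens, can be sketched as:

```python
import math

def perplexity(token_probs):
    # PPL = exp(-(1/N) * sum_t log P(x_t | context))
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Uniform probability 1/100 per token gives perplexity exactly 100:
# as surprised as a 100-way uniform guess at every step.
assert abs(perplexity([0.01] * 5) - 100.0) < 1e-6
assert perplexity([0.5, 0.4, 0.3]) < perplexity([0.1, 0.1, 0.1])
```

The `token_probs` input is the probability the model assigned to each actual token in the test sequence, which is what a real evaluation harness would collect position by position.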
Why language modeling transfers to all tasks
Any task that can be expressed as text can be framed as next-token prediction. Question answering: predict the answer tokens after “Q: … A:”. Translation: predict French tokens after English text and “French:”. Code generation: predict code tokens after a description. The model learns these patterns from training data where such formats appear. This universality—all language tasks reduce to prediction—is why language modeling is so powerful.
The lesson: Language models are probability engines. They don’t “understand” in any deep sense—they assign probabilities to token sequences based on statistical patterns learned from training data. But those patterns are rich enough to capture grammar, facts, reasoning, and task conventions. Next-token prediction, applied at scale, compresses the structure of language and the world it describes. Controlling these models means constructing prompts and tuning sampling strategies to shape probability distributions toward desired outputs.
References and Further Reading
Improving Language Understanding by Generative Pre-Training – Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018) https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
This is the GPT-1 paper that demonstrated language models trained purely on next-token prediction transfer to diverse tasks. Radford et al. showed that unsupervised pretraining on web text, followed by minimal fine-tuning, achieves strong performance on question answering, classification, and inference tasks. This established the paradigm: pretrain on prediction, fine-tune for tasks. The paper introduced the idea that language modeling isn’t just about text generation—it’s a general-purpose learning objective. Understanding GPT-1 clarifies why next-token prediction is sufficient for building capable AI systems.
The Curious Case of Neural Text Degeneration – Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi (2019) https://arxiv.org/abs/1904.09751
Holtzman et al. analyzed why greedy decoding and beam search produce repetitive, low-quality text despite high-probability outputs. They introduced nucleus (top-p) sampling, which balances quality and diversity by dynamically adjusting the sampling pool based on cumulative probability. The paper explains sampling strategies’ impact on generation quality and provides empirical evidence for why stochastic methods outperform deterministic ones. Reading this clarifies how to control language model outputs in production and why temperature and sampling parameters matter.
The Unreasonable Effectiveness of Recurrent Neural Networks – Andrej Karpathy (2015) http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Karpathy’s blog post (pre-Transformers) demonstrated that character-level language models trained on simple next-character prediction learn surprisingly rich patterns: Shakespeare-like text, Linux source code structure, even LaTeX formatting. Though RNNs are outdated, the core insight remains: prediction objectives drive models to internalize the structure of their training data. This accessible post builds intuition for why next-token prediction is powerful—it’s not about the architecture (RNNs vs Transformers), it’s about the objective forcing compression of statistical regularities. Reading this provides intuition for what language models learn and why prediction creates “understanding.”