Chapter 24: RLHF
Teaching Models What Humans Want
Why Loss Functions Are Not Enough
Fine-tuning (Chapter 23) makes models follow instructions, but it doesn’t ensure they follow them well. A model fine-tuned on question-answering data will answer questions, but the answers may be:
- Unhelpful: Technically correct but missing the user’s actual intent (“What’s the weather?” → “Weather is the state of the atmosphere.” instead of current conditions)
- Verbose: Correct but unnecessarily long, burying the answer in paragraphs of context
- Unsafe: Providing instructions for harmful activities because similar content appeared in training data
- Dishonest: Generating plausible-sounding falsehoods (hallucinations) with confidence
These failures aren’t mistakes in the optimization—the model is doing exactly what supervised fine-tuning trained it to do: predict plausible text in the format of answers. But plausibility doesn’t equal usefulness. Training data contains examples of bad answers (unhelpful, verbose, wrong) alongside good ones. The model learns to generate text that looks like an answer, not text that actually helps the user.
The problem: supervised learning optimizes for matching training data, not for satisfying human preferences. If the fine-tuning dataset includes verbose answers, the model learns verbosity is acceptable. If it includes confident falsehoods (common on the internet), the model learns to generate confident-sounding text regardless of factual accuracy.
Cross-entropy loss measures surprise, not quality:

$$\mathcal{L} = -\log p_\theta(y_t \mid y_{<t}, x)$$

This loss decreases when the model assigns high probability to the training answer, even if that answer is bad. The loss function has no notion of “helpful,” “truthful,” or “harmless”—it only measures statistical fit to training data.
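As a concrete illustration (a pure-Python sketch with made-up probabilities), the loss rewards confident prediction of the training token regardless of whether the answer is any good:

```python
import math

def cross_entropy(prob_of_target: float) -> float:
    """Negative log-probability of the training token: the loss is low
    whenever the model assigns the target high probability -- it never
    checks whether the target text is helpful or true."""
    return -math.log(prob_of_target)

# Hypothetical probabilities the model assigns to two continuations.
loss_confident_falsehood = cross_entropy(0.9)  # plausible-looking, wrong
loss_careful_truth = cross_entropy(0.2)        # accurate, lower probability
print(loss_confident_falsehood < loss_careful_truth)  # True
```

The confident falsehood incurs the lower loss, which is exactly the mismatch described above.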
This mismatch between the optimization objective (prediction) and the desired behavior (usefulness) is the alignment problem. Models are optimized for next-token prediction, but we want them optimized for human preferences. Supervised fine-tuning partially closes this gap by training on curated data, but it’s insufficient—we need a way to directly optimize for what humans actually want.
Reinforcement Learning from Human Feedback (RLHF) solves this by training models to maximize a reward function learned from human preferences. Instead of matching training examples, the model learns to generate outputs humans prefer. This shifts the objective from “predict plausible text” to “produce text that satisfies human judgment.”
Human Feedback: Ranking Preferences
RLHF starts with supervised fine-tuning (SFT), then improves the model through human feedback. But collecting feedback at scale requires a clever setup: instead of asking humans to write perfect responses (expensive and slow), ask them to rank responses.
The process:
- The SFT model generates multiple responses to the same prompt (e.g., 4-10 completions with different sampling)
- Humans rank these responses from best to worst (or pairwise: is A better than B?)
- This creates a dataset of preference comparisons
Example:
Prompt: "Explain quantum computing to a 10-year-old."
Response A: "Quantum computing uses quantum mechanics, which involves
superposition and entanglement, to perform computations exponentially faster
than classical computers by exploiting quantum states."
[Ranking: 3 — technically correct but too complex for a 10-year-old]
Response B: "Quantum computers are like magical computers that can try many
answers at once, so they solve really hard problems super fast!"
[Ranking: 1 — simple, appropriate for the audience, helpful]
Response C: "I don't understand quantum computing well enough to explain it."
[Ranking: 4 — honest but unhelpful]
Response D: "Imagine you have a coin. A normal computer checks heads or tails
one at a time. A quantum computer checks both at once, so it's much faster at
finding patterns."
[Ranking: 2 — good analogy, slightly less engaging than B]
Human annotators rank responses based on helpfulness, clarity, accuracy, and appropriateness. Ranking is faster and more reliable than writing responses from scratch—humans are better at evaluation than generation.
This produces a dataset of comparisons:
- (B is preferred over A)
- (B is preferred over C)
- (A is preferred over C)
- (D is preferred over A)
These comparisons encode human preferences implicitly. The behavior that produced the preferred responses (B, D) is what humans want; the behavior that produced the rejected responses (A, C) is not. RLHF trains the model to shift toward behavior that produces higher-ranked outputs.
Collecting preference data is expensive but more scalable than writing demonstrations. Annotators label tens of thousands of comparisons (InstructGPT used ~30,000 comparisons), providing signal about what makes responses better or worse.
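One common way to store such data (a sketch; the representation is illustrative) is to expand each k-way ranking into pairwise (prompt, chosen, rejected) records:

```python
from itertools import combinations

def rankings_to_pairs(prompt: str, ranked_best_first: list[str]):
    """Expand one human ranking into pairwise preferences: every
    earlier (better) response is preferred over every later one."""
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_best_first, 2)]

# Ranks 1-4 from the quantum-computing example: B > D > A > C.
pairs = rankings_to_pairs("Explain quantum computing to a 10-year-old.",
                          ["B", "D", "A", "C"])
print(len(pairs))  # 6: one 4-way ranking yields 4*3/2 comparisons
```

This expansion is why ranking is efficient: a single annotation of k responses produces k(k−1)/2 comparisons for reward-model training.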
Reward Models: Turning Preferences into Math
Human rankings can’t be used directly for training—they’re categorical judgments, not differentiable loss functions. RLHF solves this by training a reward model (RM): a neural network that predicts which response humans will prefer.
The reward model takes a prompt and response as input and outputs a scalar score. Higher scores indicate responses humans prefer. The reward model is trained on the comparison dataset to predict human rankings.
Training the reward model:
Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (rejected), the reward model $r_\phi$ should assign $r_\phi(x, y_w) > r_\phi(x, y_l)$. The loss function is:

$$\mathcal{L}_{RM} = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Where $\sigma$ is the sigmoid function. This loss is minimized when the reward model assigns higher scores to preferred responses. The sigmoid converts score differences into probabilities: if $r_\phi(x, y_w) \gg r_\phi(x, y_l)$, then $\sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \approx 1$ and the loss is near zero (the model agrees with the human preference). If the scores are reversed, the loss is high.
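The pairwise loss is simple enough to compute by hand; this pure-Python sketch (scores are invented) shows it is small when the reward model agrees with the human ranking and large when it disagrees:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): near zero when the
    human-preferred response scores much higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

print(preference_loss(4.0, 1.0) < 0.1)  # True: model agrees with humans
print(preference_loss(1.0, 4.0) > 3.0)  # True: model disagrees, large loss
```

At equal scores the loss is exactly log 2 (the model is maximally uncertain), so any separation in the right direction reduces it.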
The reward model is typically initialized from the SFT model (same architecture, same token embeddings) but with a different output head: instead of predicting next tokens, it predicts a scalar reward. This allows the reward model to leverage pretrained knowledge about language while learning to score responses.
After training, the reward model serves as a proxy for human judgment. Instead of asking humans to rank every response during training (intractable), the reward model provides an automated score: $r_\phi(x, y)$ approximates “how much would humans like this response?”
This reward function becomes the optimization target: train the language model to maximize expected reward.
Policy Optimization: Teaching the Model to Behave
With a reward model in hand, RLHF trains the language model (now called the policy $\pi_\theta$ in RL terminology) to generate responses that maximize reward. The policy is the language model generating text; the goal is to adjust its parameters $\theta$ to produce high-reward outputs.
The objective:

$$\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]$$

Where $x$ is a prompt from the dataset $D$, $y$ is the model’s response, and $r_\phi(x, y)$ is the reward model’s score. Maximizing this objective pushes the model to generate high-reward responses.
However, maximizing reward alone is dangerous. The model might:
- Overoptimize: Exploit reward model errors (reward hacking—generating text that scores highly but isn’t actually good)
- Mode collapse: Produce a narrow set of high-reward responses, losing diversity
- Drift from pretrained distribution: Forget general knowledge learned during pretraining
To prevent these failures, RLHF adds a KL divergence penalty that keeps the policy close to the SFT model:

$$\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] - \beta \, D_{KL}\big(\pi_\theta \,\|\, \pi_{SFT}\big)$$

Where $D_{KL}$ is the Kullback-Leibler divergence between the RLHF policy $\pi_\theta$ and the SFT policy $\pi_{SFT}$, and $\beta$ is a hyperparameter controlling the penalty strength. The KL term penalizes the model for drifting too far from the SFT model’s distribution, preventing reward hacking and maintaining general capabilities.
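A per-sample sketch of this trade-off (numbers are invented; the KL term is estimated from the log-probabilities of the sampled response, a common single-sample approximation):

```python
def penalized_objective(reward: float, logprob_policy: float,
                        logprob_sft: float, beta: float = 0.1) -> float:
    """Reward minus a KL penalty, estimated per sample as
    log pi_theta(y|x) - log pi_SFT(y|x)."""
    kl_estimate = logprob_policy - logprob_sft
    return reward - beta * kl_estimate

# A high-reward response that drifts far from the SFT distribution can
# be worth less than a moderate-reward response that stays close.
drifted = penalized_objective(reward=2.0, logprob_policy=-10.0, logprob_sft=-30.0)
close = penalized_objective(reward=1.5, logprob_policy=-20.0, logprob_sft=-21.0)
print(drifted < close)  # True
```

Tuning beta trades alignment pressure against drift: a large beta pins the policy to the SFT model, a small one invites reward hacking.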
Proximal Policy Optimization (PPO) is the standard algorithm for this optimization. PPO uses policy gradients to update the model, taking small steps that improve reward without destabilizing training. The algorithm alternates between:
- Generating responses from the current policy
- Scoring them with the reward model
- Computing policy gradients to increase reward while respecting the KL constraint
- Updating model parameters
Training runs for several thousand iterations, gradually improving the model’s behavior. Unlike supervised learning, which makes a fixed number of passes over a static dataset, RL training generates its own data and continues until reward plateaus or the policy drifts too far from the SFT model.
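To make the loop concrete, here is a deliberately tiny toy run: two candidate responses, a hard-coded reward table standing in for the reward model, and a plain REINFORCE update standing in for PPO’s clipped update. All names and numbers are illustrative:

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)

# Toy "policy": a distribution over two candidate responses to one prompt.
# The stand-in reward model prefers response 1.
rewards = [0.0, 1.0]
logits = [0.0, 0.0]                                    # start indifferent
sft_logprobs = [math.log(p) for p in softmax(logits)]  # frozen SFT reference
beta, lr = 0.1, 0.5

for _ in range(200):
    probs = softmax(logits)
    y = random.choices([0, 1], weights=probs)[0]     # 1. generate (sample)
    r = rewards[y]                                   # 2. score with reward model
    kl_est = math.log(probs[y]) - sft_logprobs[y]    # per-sample KL estimate
    advantage = r - beta * kl_est                    # 3. KL-penalized reward
    for a in (0, 1):                                 # 4. policy-gradient update
        grad = (1.0 if a == y else 0.0) - probs[a]
        logits[a] += lr * advantage * grad

print(softmax(logits)[1] > 0.8)  # policy now strongly favors the preferred response
```

Real RLHF replaces the two-action toy with full sequence generation and the REINFORCE step with PPO, but the generate-score-update rhythm is the same.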
The diagram shows the four-stage RLHF pipeline: (1) SFT model generates responses, (2) humans rank them, (3) reward model learns to predict preferences, (4) policy optimization uses the reward model to improve the language model. The process iterates, with the improved model generating new samples for continued optimization.
The Three-Stage Training Pipeline
Modern language model assistants (ChatGPT, Claude, etc.) are trained in three stages:
Stage 1: Pretraining (Chapter 22)
- Train on trillions of tokens of raw internet text
- Objective: next-token prediction
- Result: General language model with broad knowledge, no task-specific behavior
- Cost: Tens of millions of dollars, months of training
Stage 2: Supervised Fine-Tuning (SFT) (Chapter 23)
- Train on tens of thousands of curated (prompt, response) examples
- Objective: match high-quality demonstrations
- Result: Model that follows instructions but not necessarily well
- Cost: Thousands of dollars, days of training
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
- Train with reward model derived from human preference rankings
- Objective: maximize human-judged quality while staying close to SFT model
- Result: Model aligned with human preferences—helpful, harmless, honest
- Cost: Similar to SFT (reward model training + policy optimization)
This three-stage pipeline is now standard. Pretraining provides knowledge, SFT provides format, RLHF provides alignment. Skipping any stage produces inferior models:
- Pretraining alone: knows language but doesn’t follow instructions
- Pretraining + SFT: follows instructions but generates suboptimal responses
- Pretraining + SFT + RLHF: aligned assistant behavior
The diagram shows the cumulative process: pretraining provides knowledge, SFT adds instruction-following format, RLHF aligns behavior with human preferences. Each stage refines the model toward useful assistant behavior.
Reward Hacking and Alignment Challenges
RLHF significantly improves model behavior, but it’s not perfect. The reward model is an imperfect proxy for human preferences, and models can exploit its errors.
Reward hacking: The policy learns to generate outputs that score highly on the reward model without actually being better. Examples:
- Verbosity: The reward model may prefer longer responses (humans often prefer thorough answers). The policy exploits this by generating unnecessarily verbose text that scores well but doesn’t add value.
- Sycophancy: The reward model may reward agreement with the user. The policy exploits this by agreeing with the user’s premises even when they’re incorrect, producing high-reward but dishonest responses.
- Style over substance: The reward model may be biased toward certain writing styles (formal, technical). The policy mimics the style without improving content quality.
These failures arise because the reward model is trained on limited data and can’t perfectly capture human preferences. The policy, optimized to maximize reward, finds and exploits these weaknesses.
Alignment is ongoing research. RLHF is the best current method but has limitations:
- Reward models are expensive to train (require tens of thousands of human comparisons)
- Reward hacking is common and hard to prevent
- Human preferences are diverse and sometimes contradictory (helpfulness vs. harmlessness)
- Alignment to current human preferences doesn’t guarantee alignment to future or idealized preferences
Alternative approaches are being explored:
- Constitutional AI (Anthropic): Use AI feedback instead of human feedback, guided by a “constitution” of principles
- Debate and amplification: Train models to argue both sides, improving truthfulness
- Interpretability: Understand model internals to detect and prevent misalignment
RLHF represents the state of the art but not the solution to alignment. Models trained with RLHF are safer and more helpful than raw or fine-tuned models, but they still hallucinate, exhibit biases, and occasionally produce harmful outputs. Alignment remains an active research frontier.
Engineering Takeaway
RLHF transforms instruction-following models into aligned assistants by optimizing for human preferences. Understanding its mechanics, benefits, and limitations is essential for deploying and evaluating modern AI systems.
RLHF bridges the gap between loss functions and human values
Cross-entropy loss optimizes prediction accuracy, not usefulness. RLHF introduces a learned reward function that approximates human judgment, enabling direct optimization for helpfulness, harmlessness, and honesty. This shift from “match training data” to “satisfy human preferences” is why ChatGPT feels fundamentally different from raw GPT-3. RLHF isn’t just another training trick—it’s a paradigm shift in how models are optimized. Production systems requiring aligned behavior (conversational agents, customer service bots) benefit dramatically from RLHF over SFT alone.
Human preference data is expensive but essential
Collecting preference rankings requires paying human annotators to evaluate model outputs—tens of thousands of comparisons for a robust reward model. This is cheaper than writing demonstrations (SFT requires tens of thousands of full responses), but still costly. The quality of preference data determines reward model accuracy, which determines RLHF effectiveness. Invest in clear annotation guidelines, diverse annotators (to capture broad preferences), and quality control. Poor preference data produces poor reward models, leading to suboptimal or harmful behavior even after RLHF.
Reward models can be gamed—reward hacking is real
Models find and exploit weaknesses in reward models. If the reward model prefers long responses, the policy becomes verbose. If it rewards politeness excessively, the policy becomes sycophantic. Monitor for reward hacking during training: if the policy’s reward increases but human evaluations don’t improve (or worsen), the model is exploiting the reward model. Mitigation strategies: regularize heavily toward the SFT model (high KL penalty), use diverse preference data, and iterate on reward model quality based on failure modes observed in the policy.
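A monitoring check along these lines can be sketched in a few lines (the function, thresholds, and scores are invented for illustration):

```python
def reward_hacking_suspected(rm_scores, human_scores, window=3):
    """Flag runs where reward-model scores trend up over the last
    `window` evaluations while human scores are flat or falling."""
    rm_trend = rm_scores[-1] - rm_scores[-window]
    human_trend = human_scores[-1] - human_scores[-window]
    return rm_trend > 0.5 and human_trend <= 0.0

rm_eval = [1.0, 1.5, 2.2, 3.1, 4.0]     # reward keeps climbing
human_eval = [3.0, 3.2, 3.1, 3.0, 2.9]  # human judgments stagnate, then dip
print(reward_hacking_suspected(rm_eval, human_eval))  # True
```

The divergence between the two curves, not either curve alone, is the signal that the policy is exploiting the reward model.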
Constitutional AI as an alternative
RLHF requires expensive human feedback. Constitutional AI (CAI) uses AI feedback: one model generates responses, another model critiques them based on a set of principles (the “constitution”), and the policy is trained to maximize AI-judged quality. CAI scales better (no human labeling cost) and can encode explicit values (the constitution defines what “good” means). However, it’s less grounded than RLHF—AI judgment may drift from human preferences. Hybrid approaches (RLHF for broad alignment, CAI for specific principles) are promising for production systems aiming to balance cost and alignment quality.
RLHF improves safety but doesn’t guarantee it
Models trained with RLHF refuse harmful requests more reliably than SFT models, but they’re not foolproof. Adversarial prompts (jailbreaks) can bypass safety training. Reward models have blind spots—behaviors rare in training data (novel harms, subtle manipulations) may not be penalized. RLHF is a mitigation, not a solution. Production deployments should combine RLHF with other safety measures: content filters, monitoring for harmful outputs, red-teaming (adversarial testing), and continuous updates as new failure modes are discovered.
Trade-offs: helpfulness vs. harmlessness vs. honesty
RLHF optimizes a composite reward that balances competing objectives. A model trained solely for helpfulness might provide dangerous information. A model trained solely for harmlessness might refuse benign requests. A model optimized for honesty might say “I don’t know” too often, reducing utility. The reward model and preference data encode these trade-offs: annotators implicitly decide which matters more in each context. Production systems must carefully design preference collection to reflect desired trade-offs—e.g., medical applications prioritize honesty and harmlessness over helpfulness, creative writing applications prioritize helpfulness and engagement.
Why ChatGPT is not just GPT
ChatGPT and similar assistants are GPT-scale models (pretrained Transformers) refined through SFT and RLHF. The base model (GPT-3, GPT-4) provides knowledge and language capability. SFT teaches instruction-following format. RLHF aligns behavior with user preferences—brevity when appropriate, detail when needed, refusing harmful requests, admitting uncertainty. The difference between a raw model and an aligned assistant is entirely post-training: RLHF (and SFT) transform raw prediction into useful, safe interaction. Understanding this pipeline—pretraining for knowledge, SFT for format, RLHF for alignment—is essential for building production-grade AI assistants.
The lesson: Language models trained solely on prediction and demonstration aren’t aligned with human values. RLHF directly optimizes for human preferences by learning a reward function from rankings and training the model to maximize that reward. This approach dramatically improves model behavior—reducing harmful outputs, increasing helpfulness, and producing responses humans prefer. However, RLHF isn’t perfect: reward hacking, data costs, and misalignment risks remain. Modern AI assistants use RLHF as a critical step in the training pipeline, but alignment remains an ongoing challenge requiring continuous iteration, monitoring, and research.
References and Further Reading
Deep Reinforcement Learning from Human Preferences – Paul Christiano, Jan Leike, Tom B. Brown, et al. (2017) https://arxiv.org/abs/1706.03741
Christiano et al. introduced RLHF for complex tasks where reward functions are hard to specify. They showed that human feedback (preferences between behavior pairs) can train reward models that guide RL agents to perform tasks aligned with human intent—even without explicit reward engineering. This paper laid the foundation for applying RLHF to language models: instead of manually defining what makes a response good, learn it from human preferences. The method scaled from simple robotic tasks to complex language generation. Reading this explains the core insight behind RLHF and why preference learning works: humans are better at ranking outputs than specifying reward functions.
Training language models to follow instructions with human feedback – Long Ouyang, Jeff Wu, Xu Jiang, et al. (2022) https://arxiv.org/abs/2203.02155
The InstructGPT paper applied RLHF to GPT-3, creating the first widely deployed aligned language model. Ouyang et al. trained GPT-3 with supervised fine-tuning on instructions, then applied RLHF using ~30,000 human preference comparisons. The result: a model significantly more helpful, truthful, and harmless than raw GPT-3. This paper documents the full pipeline (SFT → reward model training → PPO) and provides empirical evidence that RLHF improves alignment metrics while maintaining general capabilities. InstructGPT became ChatGPT—the system that demonstrated aligned AI to millions of users. Reading this explains how modern AI assistants are built and why RLHF is essential for production deployment.
Constitutional AI: Harmlessness from AI Feedback – Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. (2022) https://arxiv.org/abs/2212.08073
Bai et al. introduced Constitutional AI as an alternative to RLHF that uses AI feedback instead of human feedback. Models critique and revise their own outputs based on a set of principles (the “constitution”), then are trained via RL to maximize AI-judged quality. CAI scales better than RLHF (no human labeling) and makes values explicit (the constitution defines desired behavior). The paper shows CAI produces models comparable to RLHF in safety and helpfulness while being more transparent and cheaper to iterate. This approach is particularly valuable for encoding specific values or principles that are clear to specify but expensive to label at scale. Understanding CAI provides an alternative perspective on alignment: instead of learning preferences from humans, encode principles explicitly and use AI to evaluate adherence.