Chapter 23: Fine-Tuning

Turning a Brain into a Tool

Why Pretrained Models Are Wild

A freshly pretrained language model is not an assistant. It’s a prediction engine trained to complete any text, regardless of usefulness or appropriateness. Give it “User: What is the capital of France?” and it might continue with:

User: What is the capital of France?
User: How do I cook pasta?
User: Why is the sky blue?
...

The model treats the input as the beginning of a forum thread or FAQ document and generates more questions. Or it might complete:

User: What is the capital of France? I think it's London but I'm not sure.

The model predicts plausible text based on training data patterns—questions often follow questions, uncertainty often follows questions. But this isn’t helpful behavior. The user wants an answer, not more questions or speculation.

This is the distinction between raw prediction and instruction following. Pretrained models predict likely continuations. They don’t distinguish between helpful responses, irrelevant completions, or harmful outputs. They simply model the statistical distribution of text—all text, including bad forum posts, wrong answers, and toxic rants.

Raw models are “wild” in the sense that their outputs are unconstrained. They might:

  • Complete the user’s query with another query
  • Generate factually incorrect information (because wrong answers appear in training data)
  • Produce toxic or harmful content (because such content exists on the internet)
  • Refuse to answer simple questions (if “I don’t know” is a statistically plausible continuation)

This isn’t a bug—it’s the natural behavior of a next-token predictor trained on diverse internet text. The model predicts what comes next, not what would be helpful. Making models useful requires additional training: fine-tuning.

Supervised Fine-Tuning: Teaching Specific Behaviors

Fine-tuning specializes a pretrained model for specific tasks by training on curated examples. Unlike pretraining (self-supervised on raw text), fine-tuning is supervised: each example includes an input and a desired output, explicitly demonstrating the behavior we want.

For instruction following, the training data consists of (prompt, response) pairs:

Input:  "User: What is the capital of France?\nAssistant:"
Output: "The capital of France is Paris."

The model continues to optimize next-token prediction, but now the training data shows helpful responses. By training on thousands or tens of thousands of such examples, the model learns the pattern: when text matches “User: … Assistant:”, generate helpful, relevant answers, not arbitrary completions.

Supervised fine-tuning (SFT) applies gradient descent with a lower learning rate than pretraining. The model’s parameters have already learned language structure from pretraining (Chapter 22)—we don’t want to erase that knowledge. We want to adjust the model to specialize in a particular format (instruction following, dialogue, task completion).

The loss function remains cross-entropy:

\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{N} \log P(y_i \mid x, y_1, \ldots, y_{i-1}; \theta)

Where x is the input (user query), y is the desired output (assistant response), and θ are the model parameters. The model learns to assign high probability to correct responses given the input format.
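The loss above can be computed with a toy sketch. Assuming we already have the probability the model assigned to each response token (the function name and inputs here are illustrative, not from any particular framework), note that the prompt tokens x contribute no loss terms:

```python
import math

def sft_loss(response_token_probs):
    """Cross-entropy loss for one (prompt, response) example.

    `response_token_probs` holds, for each response token y_i, the
    probability P(y_i | x, y_1, ..., y_{i-1}) the model assigned to it.
    Only the response is supervised: prompt tokens add no loss terms.
    """
    return -sum(math.log(p) for p in response_token_probs)

# Toy example with made-up probabilities for the tokens of
# "The capital of France is Paris."
loss = sft_loss([0.9, 0.8, 0.95, 0.7])
```

A perfectly confident model (all probabilities 1.0) gets zero loss; every token the model finds surprising adds -log p to the total, which is exactly what gradient descent then pushes down.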

The key difference from pretraining: the data is curated. Instead of random internet text, fine-tuning uses high-quality examples written or selected by humans. This shifts the model’s predictions from “what text is statistically likely on the internet?” to “what text is helpful/correct/appropriate for this task?”

Dataset construction is critical. Fine-tuning on 10,000 carefully written examples outperforms fine-tuning on 100,000 noisy examples. Quality trumps quantity because the model is learning specific behavioral patterns, not broad language coverage (which it already learned during pretraining).

Examples of fine-tuning datasets:

  • Question answering: Questions paired with accurate answers
  • Dialogue: Conversational turns with helpful, engaging responses
  • Instruction following: Instructions paired with correct completions
  • Task-specific: Summarization input/output, translation pairs, code explanations

Modern instruction-tuned models (InstructGPT, ChatGPT, Claude) are fine-tuned on diverse instruction datasets covering many tasks. This multi-task fine-tuning improves robustness: the model learns general instruction-following behavior, not just specific tasks.

Diagram: pretrained completion vs. fine-tuned instruction following

The diagram shows the transformation: pretrained models complete text arbitrarily, fine-tuned models follow instructions helpfully. Fine-tuning doesn’t add knowledge (that comes from pretraining)—it shapes behavior.

Instruction Following: Learning to Obey

The format of fine-tuning data determines learned behavior. To make models follow instructions, the data uses explicit instruction patterns:

User: {instruction or question}
Assistant: {helpful response}

or:

Instruction: {task description}
Output: {correct result}

By training on thousands of examples with this format, the model learns the pattern: text following “User:” or “Instruction:” is a query, text following “Assistant:” or “Output:” is the expected response. The model adjusts its predictions to favor helpful, relevant responses in this context.
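A minimal sketch of this formatting step (illustrative only: real pipelines operate on token IDs rather than characters, and the helper name is hypothetical):

```python
def format_example(instruction, response):
    """Render one training example in the User/Assistant template.

    During SFT, loss is typically computed only on the response tokens.
    Here the character offset where the response begins stands in for
    the token-level loss mask.
    """
    prompt = f"User: {instruction}\nAssistant: "
    full_text = prompt + response
    return full_text, len(prompt)  # loss applies from this offset on

text, loss_start = format_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
```

Everything before `loss_start` is context the model conditions on; only the span after it generates gradient signal, so the model learns to produce answers, not to parrot questions.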

This is pattern matching, not understanding. The model doesn’t “understand” it should be helpful—it learns that statistically, after “User: [question]”, text matching “Assistant: [answer]” has high probability in fine-tuning data. The model’s prediction distribution shifts from generic text completion to task-specific completion.

Multi-task instruction fine-tuning trains on diverse tasks simultaneously:

  • Question answering: factual queries with accurate answers
  • Summarization: long text with concise summaries
  • Translation: source language text with target language output
  • Code generation: descriptions with code implementations
  • Creative writing: prompts with stories or essays

Training on diverse tasks teaches the model a general capability: extract intent from the instruction, generate an appropriate response. This is more robust than single-task fine-tuning because the model learns to adapt to different instruction types rather than memorizing specific task formats.
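A toy sketch of assembling such a multi-task stream, mixing a few hypothetical task datasets into one shuffled list so every batch sees a variety of instruction types (real pipelines also weight tasks by size and quality):

```python
import random

def mix_tasks(datasets, seed=0):
    """Interleave examples from several task-specific datasets into one
    shuffled fine-tuning stream.

    `datasets` maps a task name to its list of (prompt, response) pairs.
    """
    mixed = [
        (task, example)
        for task, examples in datasets.items()
        for example in examples
    ]
    random.Random(seed).shuffle(mixed)  # fixed seed for reproducibility
    return mixed

stream = mix_tasks({
    "qa": [("What is 2+2?", "4")],
    "summarization": [("Summarize: The cat sat on the mat.", "A cat sat.")],
    "translation": [("Translate to French: Hello", "Bonjour")],
})
```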

Example progression from raw to fine-tuned model:

Raw pretrained model:

Input:  "Translate to French: Hello, how are you?"
Output: "Translate to Spanish: Hola, ¿cómo estás?"

→ The model completes with another translation instruction (statistically likely in multilingual corpora)

After fine-tuning on translation examples:

Input:  "Translate to French: Hello, how are you?"
Output: "Bonjour, comment allez-vous?"

→ The model recognizes the instruction format and generates the requested translation

The model learns: when the input matches “[Task]: [Input]”, generate [Output for that task], not arbitrary continuations. This behavioral shift comes from training data, not architectural changes. The same Transformer that did raw prediction now does instruction following because the optimization objective was applied to different data.

Catastrophic Forgetting: Why Tuning Must Be Careful

Fine-tuning overwrites the model’s pretrained knowledge if not done carefully. This phenomenon is catastrophic forgetting: when training on new data causes a neural network to “forget” previously learned information.

During pretraining, the model’s parameters encode broad language knowledge—grammar, facts, reasoning patterns. During fine-tuning, gradient updates adjust these parameters to specialize on task data. If the learning rate is too high or training runs too long, the model’s parameters drift far from their pretrained values, erasing general knowledge in favor of task-specific patterns.

Symptoms of catastrophic forgetting:

  • The model becomes excellent at the fine-tuning task but poor at everything else
  • It loses factual knowledge not represented in the fine-tuning data
  • It produces lower-quality outputs on out-of-domain queries

The risk scales with fine-tuning data size relative to pretraining. Fine-tuning on 10,000 examples (tiny compared to trillions of pretraining tokens) is unlikely to cause forgetting if learning rates are low. Fine-tuning on millions of examples with high learning rates can significantly shift the model’s distribution, degrading general capabilities.

Mitigation strategies:

  1. Low learning rate: Use learning rates 10–100× smaller than pretraining (e.g., 1e-5 instead of 1e-3). Small steps adjust behavior without erasing knowledge.

  2. Few epochs: Train for 1–3 passes over fine-tuning data instead of dozens. Minimize total parameter drift.

  3. Early stopping: Monitor validation loss and stop when task performance saturates, before overfitting to fine-tuning data.

  4. Multi-task fine-tuning: Train on diverse tasks simultaneously so the model doesn’t overspecialize on a single distribution.

  5. Regularization: Add penalties that keep parameters close to pretrained values (L2 penalty on parameter changes, KL divergence between fine-tuned and pretrained distributions).

The key insight: fine-tuning is adjustment, not retraining. The heavy lifting (learning language structure from scratch) happened during pretraining. Fine-tuning nudges parameters to specialize without forgetting general capabilities.
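The regularization strategy above can be sketched as a penalty term added to the fine-tuning loss. This toy version treats the model as a flat list of parameters (real implementations iterate over per-layer tensors):

```python
def l2_to_pretrained(theta, theta_pretrained, coeff=0.01):
    """Penalty that grows as fine-tuned parameters drift from their
    pretrained values: coeff * sum_j (theta_j - theta0_j)^2.

    Added to the SFT loss, this discourages the large parameter drift
    that causes catastrophic forgetting. `coeff` trades off task fit
    against knowledge retention (value here is illustrative).
    """
    return coeff * sum(
        (t - t0) ** 2 for t, t0 in zip(theta, theta_pretrained)
    )

# Drift only in the first parameter: penalty = 0.01 * (0.1)^2
penalty = l2_to_pretrained([0.5, -0.2], [0.4, -0.2])
```

With zero drift the penalty vanishes, so the pretrained model itself is always a "free" solution; the optimizer only moves parameters when the task loss justifies it.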

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all model parameters—billions of weights. This is computationally expensive and risks catastrophic forgetting. Parameter-efficient fine-tuning (PEFT) updates only a small subset of parameters, freezing most of the pretrained model.

LoRA (Low-Rank Adaptation):

Instead of updating a weight matrix W ∈ ℝ^{d×d} directly, LoRA adds a low-rank update:

\mathbf{W}_{\text{tuned}} = \mathbf{W}_{\text{pretrained}} + \mathbf{B}\mathbf{A}

Where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} with r ≪ d (e.g., r = 8, d = 4096). The pretrained weights W are frozen; only A and B are trained. This reduces trainable parameters from billions to millions while maintaining performance comparable to full fine-tuning.
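A toy sketch of the LoRA update with plain Python lists (illustrative sizes only; real implementations use GPU tensor libraries and apply the low-rank product on the fly rather than materializing the summed matrix):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
        for row in A
    ]

def lora_weight(W, B, A):
    """Effective weight after a LoRA update: W + B @ A.

    W (d x d) stays frozen; only B (d x r) and A (r x d) are trained.
    """
    BA = matmul(B, A)
    return [
        [w + delta for w, delta in zip(w_row, ba_row)]
        for w_row, ba_row in zip(W, BA)
    ]

# Toy d=2, r=1 example (not a real model size)
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weights
B = [[0.5], [0.0]]            # d x r, trained
A = [[0.0, 1.0]]              # r x d, trained
W_tuned = lora_weight(W, B, A)

# Trainable parameters: 2*d*r instead of d*d.
# At d=4096, r=8: 2*4096*8 = 65,536 vs 16,777,216, a ~256x reduction.
```

The rank r controls the trade-off: larger r gives the update more expressive capacity at the cost of more trainable parameters.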

Adapters:

Insert small trainable layers (adapters) between frozen Transformer layers. Adapters have far fewer parameters than full layers (e.g., 100K parameters vs. 100M). During fine-tuning, only adapter parameters are updated. This modularizes specialization: different adapter sets can be swapped for different tasks without retraining the entire model.

PEFT methods reduce fine-tuning cost (less compute, less memory), prevent catastrophic forgetting (most parameters remain unchanged), and enable multi-task deployment (store multiple small adapter sets instead of multiple full models).

Engineering Takeaway

Fine-tuning transforms raw language models into useful tools by teaching them task-specific behaviors. Understanding fine-tuning techniques, trade-offs, and alternatives is essential for deploying models effectively.

Fine-tuning specializes general models for specific domains

Pretrained models are generalists—decent at many tasks, excellent at none. Fine-tuning on domain-specific data (medical records, legal documents, code repositories) makes models experts in that domain. A model fine-tuned on medical literature will predict medical terminology accurately, understand clinical context, and generate domain-appropriate responses. This specialization comes from training on domain data, not architectural changes. For production systems in specialized fields, fine-tuning on proprietary or domain-specific data significantly improves performance over generic pretrained models.

Low learning rates and few epochs prevent catastrophic forgetting

Use learning rates 10–100× smaller than pretraining. Fine-tune for 1–3 epochs (passes over data), not dozens. Monitor validation loss and stop early when task performance plateaus. These strategies adjust behavior without erasing general knowledge. In practice, fine-tuning often uses learning rates around 1e-5 to 1e-6, compared to pretraining rates of 1e-3 to 1e-4. Training for too long or with too high a rate degrades the model’s general capabilities—a fine-tuned model should retain its broad knowledge while gaining task-specific expertise.

Data quality matters more than quantity for fine-tuning

10,000 carefully curated examples outperform 100,000 noisy examples. Fine-tuning data teaches behavior patterns—low-quality examples (wrong answers, poorly formatted, off-topic) teach bad behaviors. Pretrained models already have language coverage; fine-tuning needs precise examples demonstrating desired behavior. Invest in data quality: human-written responses, expert annotations, thorough filtering. For instruction following, quality means clear instructions paired with correct, helpful, appropriately formatted responses. Noisy fine-tuning data degrades model behavior even if the volume is large.

LoRA and adapters enable efficient fine-tuning of massive models

Updating all 175B parameters of GPT-3 requires massive GPU memory and compute. LoRA adds low-rank matrices (millions of parameters) instead of updating billions, reducing memory by 10–100×. Adapters insert small trainable layers, achieving similar savings. These methods make fine-tuning practical on commodity hardware (single or few GPUs instead of clusters). For production deployment, PEFT enables rapid task-specific specialization without the cost of full fine-tuning. Many applications now use LoRA as the default fine-tuning approach, training low-rank updates on task data while keeping base model weights frozen.

Prompt engineering emerges as an alternative to fine-tuning

Not all applications require fine-tuning. For tasks with clear patterns that can be described in prompts, prompt engineering (Part VI) achieves good performance without any training. Compare costs: fine-tuning requires labeled data, compute for training, and careful hyperparameter tuning. Prompting requires only designing effective prompts. The trade-off: fine-tuning produces task-optimized models (higher performance ceiling), prompting is faster and cheaper (lower effort, immediate deployment). For many applications—especially with powerful pretrained models like GPT-4—prompting suffices. Fine-tuning is worthwhile when marginal performance improvements justify the cost or when tasks require learning from proprietary data not seen during pretraining.

Multi-task fine-tuning improves robustness

Training on diverse tasks simultaneously produces more robust models than single-task fine-tuning. A model fine-tuned only on question answering may struggle with summarization. A model fine-tuned on question answering, summarization, translation, and dialogue learns general instruction-following behavior. Multi-task fine-tuning prevents overfitting to a narrow distribution and improves zero-shot performance on unseen tasks. This is the approach used by instruction-tuned models (InstructGPT, FLAN): train on hundreds of tasks to learn adaptable instruction following. For production systems, multi-task fine-tuning is preferable to single-task when the deployment scenario involves diverse user requests.

Why foundation models + fine-tuning is the dominant paradigm

Training large models from scratch is prohibitively expensive for most organizations. Pretraining costs millions of dollars and requires massive datasets. Fine-tuning pretrained models costs orders of magnitude less—thousands of dollars, thousands of examples, days instead of months. The paradigm: large labs (OpenAI, Anthropic, Google, Meta) pretrain foundation models and release them (openly or via API). Users fine-tune on task-specific data to specialize for their applications. This division of labor makes large language models accessible: the fixed cost of pretraining is absorbed by providers, the variable cost of fine-tuning is manageable for practitioners. Understanding how to fine-tune effectively—data curation, hyperparameters, PEFT methods—is now an essential skill for deploying AI systems.

The lesson: Fine-tuning is the bridge between general-purpose language models and task-specific tools. Pretrained models provide broad capabilities; fine-tuning narrows focus to desired behaviors. The process is delicate—adjust parameters enough to learn the task, but not so much that general knowledge is lost. Modern techniques (low learning rates, PEFT, multi-task training) make fine-tuning reliable and efficient. Whether through full fine-tuning, LoRA, or prompting, specializing pretrained models is now the standard approach for building production AI systems.


References and Further Reading

Training language models to follow instructions with human feedback – Long Ouyang, Jeff Wu, Xu Jiang, et al. (2022) https://arxiv.org/abs/2203.02155

The InstructGPT paper introduced the approach behind ChatGPT. Ouyang et al. fine-tuned GPT-3 on human-written instructions and responses, then applied reinforcement learning from human feedback (Chapter 24) to align with human preferences. They showed that fine-tuning on ~13,000 high-quality instruction examples dramatically improved helpfulness, truthfulness, and safety compared to raw GPT-3. The paper demonstrates that fine-tuning data quality matters more than quantity and establishes the instruction-following paradigm now used in all major language model assistants. Reading this clarifies how ChatGPT differs from GPT-3: supervised fine-tuning on instruction data transforms a raw predictor into a helpful assistant.

LoRA: Low-Rank Adaptation of Large Language Models – Edward Hu, Yelong Shen, Phillip Wallis, et al. (2021) https://arxiv.org/abs/2106.09685

Hu et al. introduced LoRA, the most widely used parameter-efficient fine-tuning method. LoRA freezes pretrained weights and adds trainable low-rank matrices, reducing trainable parameters by 10,000× (from billions to millions) while maintaining performance comparable to full fine-tuning. The paper shows LoRA works across tasks (question answering, summarization, translation) and models (GPT, T5, BERT). This breakthrough made fine-tuning massive models practical on accessible hardware—fine-tuning GPT-3-scale models now requires a single GPU instead of a cluster. LoRA is now the default method for task-specific model adaptation in production systems. Understanding LoRA is essential for practitioners deploying large models with limited compute.

Finetuned Language Models Are Zero-Shot Learners – Jason Wei, Maarten Bosma, Vincent Zhao, et al. (2021) https://arxiv.org/abs/2109.01652

The FLAN paper showed that fine-tuning on diverse tasks simultaneously improves zero-shot performance on unseen tasks. Wei et al. fine-tuned models on over 60 NLP tasks with instructions and demonstrated improved generalization to new tasks without further training. Multi-task instruction fine-tuning teaches models adaptable instruction-following behavior rather than memorizing specific task formats. This established the recipe for modern instruction-tuned models: train on hundreds of diverse tasks to learn general helpfulness. The paper explains why InstructGPT, GPT-4, and other assistants perform well on novel tasks—they learned instruction-following as a general capability through diverse fine-tuning. Reading this clarifies how to construct fine-tuning datasets that maximize generalization.