Chapter 10: Loss Functions and Optimization

How Models Know They Are Wrong

A machine learning model doesn’t “know” anything. It has parameters—weights and biases—that determine what it predicts. Training is the process of adjusting those parameters to make better predictions. But “better” needs to be defined precisely. This is where the loss function comes in.

The loss function measures how wrong the model is. It takes the model’s predictions and the true labels and computes a single number: the loss. Training is the process of finding parameters that minimize this number. Everything in machine learning—from linear regression to GPT—is fundamentally about minimizing a loss function.

Understanding loss functions is critical because they define what the model optimizes. If you choose the wrong loss function, the model will optimize for the wrong objective, no matter how sophisticated the architecture. The loss function encodes your goals, your assumptions, and your tradeoffs.

What Loss Is

A loss function, also called a cost function or objective function, maps predictions and ground truth to a scalar value that represents error. For a single training example, the loss measures how far the prediction is from the correct answer. For the entire dataset, the total loss is typically the average loss across all examples.

Mean Squared Error (MSE) is the most common loss for regression. It measures the average squared difference between predictions and targets:

L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Where $y_i$ is the true value and $\hat{y}_i$ is the predicted value for the $i$-th example. Squaring the error has two effects: it makes all errors positive (so over-predictions and under-predictions don't cancel out), and it penalizes large errors more than small errors. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1.

This quadratic penalty means the model is strongly incentivized to avoid large errors, even at the cost of making more small errors. If your application cares equally about all errors, MSE is appropriate. If outliers should be tolerated, MSE might be too harsh.
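The definition translates directly into code. A minimal NumPy sketch (the `mse` helper is illustrative, not a library function) that reproduces the 10-vs-1 comparison above:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# One error of 10 contributes as much loss as a hundred errors of 1.
print(mse([0.0], [10.0]))           # 100.0
print(mse([0.0] * 10, [1.0] * 10))  # 1.0
```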

Cross-Entropy Loss is the standard loss for classification. For binary classification:

L_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Where $y_i \in \{0, 1\}$ is the true label and $\hat{p}_i$ is the predicted probability of class 1. This loss measures how surprised the model is by the true label. If the model predicts $\hat{p} = 0.9$ and the true label is 1, the loss is $-\log(0.9) \approx 0.105$—low surprise. If the model predicts $\hat{p} = 0.1$ and the true label is 1, the loss is $-\log(0.1) \approx 2.30$—high surprise.

The logarithm makes the loss approach infinity as the predicted probability approaches 0 for the true class. This creates a strong gradient signal for confident wrong predictions, ensuring the model learns from mistakes. Cross-entropy is derived from information theory and is the natural loss for probabilistic classification.
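A small sketch of the same computation (the `binary_cross_entropy` helper is illustrative; clipping to `eps` is a common practical guard against `log(0)`):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; clipping avoids log(0) for extreme probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident and correct: low loss. Confident and wrong: high loss.
print(round(binary_cross_entropy([1], [0.9]), 3))  # 0.105
print(round(binary_cross_entropy([1], [0.1]), 3))  # 2.303
```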

Why Optimization Is Needed

Training a model means finding the parameters that minimize the loss. For a linear model, the parameters are the weights $\mathbf{w}$ and bias $b$. The loss depends on these parameters because predictions depend on them: $\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$. Changing $\mathbf{w}$ changes predictions, which changes the loss.

The goal is to find the $\mathbf{w}$ and $b$ that minimize:

L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \text{loss}(f(\mathbf{x}_i; \mathbf{w}, b), y_i)

Where $f(\mathbf{x}_i; \mathbf{w}, b)$ is the model's prediction for input $\mathbf{x}_i$ given parameters $\mathbf{w}$ and $b$.

For simple cases like linear regression with MSE, there's a closed-form solution: you can solve for the optimal $\mathbf{w}$ algebraically using linear algebra (the normal equations). But for most models—logistic regression, neural networks, decision trees—there's no closed form. The loss function is nonlinear, high-dimensional, and too complex to solve analytically.

This is why optimization is necessary. We can’t solve for the best parameters directly, so we iteratively improve them using optimization algorithms.
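The closed form for linear regression can be verified in a few lines. A sketch on synthetic, noise-free data (so the true weights are recovered exactly; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0  # known weights, bias of 3.0

# Fold the bias in by appending a column of ones, then solve the
# normal equations: w* = (X^T X)^{-1} X^T y
Xb = np.hstack([X, np.ones((100, 1))])
w_star = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_star)  # ≈ [2.0, -1.0, 0.5, 3.0]
```

Solving the linear system directly (rather than inverting $X^\top X$) is the numerically preferred route; for models without such a closed form, the iterative methods below take over.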

Gradient Descent and Variants

Optimization is a search process. We start with random or initialized parameters and iteratively adjust them to reduce the loss. The most important optimization algorithm is gradient descent.

Gradient descent works by computing the gradient of the loss with respect to the parameters—the direction in which the loss increases most steeply—and then taking a step in the opposite direction to decrease the loss.

The gradient $\nabla L(\mathbf{w}, b)$ is a vector of partial derivatives:

\nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n}, \frac{\partial L}{\partial b} \right]

Each component tells us how much the loss changes if we increase that parameter by a small amount. If $\frac{\partial L}{\partial w_1} = 3.2$, increasing $w_1$ by 0.1 increases the loss by approximately 0.32. To decrease the loss, we move in the opposite direction:

\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} L

Where $\eta$ is the learning rate, a small positive number (e.g., 0.01) that controls the step size. This update rule is applied repeatedly until the loss stops decreasing.

Gradient Descent and Variants diagram

The diagram shows gradient descent navigating a loss surface. The algorithm starts at a random point and iteratively moves downhill by following the negative gradient, eventually reaching a minimum. Each step size is determined by the learning rate.
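The full training loop is short enough to write out. A sketch of batch gradient descent for linear regression with MSE on synthetic data (the gradient formulas follow from differentiating the MSE loss; names are illustrative):

```python
import numpy as np

# Synthetic data from a known linear model: y = 1.5*x1 - 2.0*x2 + 0.5
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

# Gradients of MSE: dL/dw = (2/n) X^T (Xw + b - y), dL/db = (2/n) sum(Xw + b - y)
w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    residual = X @ w + b - y
    w -= eta * (2 / len(y)) * X.T @ residual
    b -= eta * (2 / len(y)) * residual.sum()

print(w, b)  # ≈ [1.5, -2.0] and 0.5
```

Because this loss is convex, the loop recovers the true parameters; the learning rate 0.1 is small enough here that each step decreases the loss.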

Variants of gradient descent improve efficiency:

Batch Gradient Descent computes the gradient on the entire training set before making an update. This is accurate but slow for large datasets—computing gradients for millions of examples before taking a single step is inefficient.

Stochastic Gradient Descent (SGD) computes the gradient on a single training example and updates parameters immediately. This is much faster—you take one step per example. The gradient is noisy (high variance) because it’s computed from one example, but this noise can be beneficial: it helps escape shallow local minima and flat regions. SGD is the foundation of deep learning optimization.

Mini-batch Gradient Descent is the practical compromise. Compute the gradient on a small batch of examples (e.g., 32, 128, or 256) and update parameters. Mini-batches balance computational efficiency (batches can be parallelized on GPUs) with gradient accuracy (larger batches reduce noise). Most modern training uses mini-batch gradient descent with batch sizes tuned to hardware (GPUs process batches efficiently).

Why Stochastic Updates Work: The noise in stochastic gradients is not just acceptable—it’s helpful. The loss surface has many flat regions and shallow local minima. Noisy gradients add randomness that helps escape these regions. This is why SGD often generalizes better than batch gradient descent: the noise acts as implicit regularization, preventing the model from settling into sharp, overfit minima.
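The mini-batch variant changes only the inner loop: shuffle, slice, and update per batch. A sketch for a linear model without a bias term (the `minibatch_sgd` helper and its defaults are illustrative):

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD for a linear model (no bias term) with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = (2 / len(idx)) * X[idx].T @ (X[idx] @ w - y[idx])
            w -= eta * grad                   # noisy but cheap update
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 2.0, -0.5])
print(minibatch_sgd(X, y))  # ≈ [1.0, 2.0, -0.5]
```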

Momentum accelerates gradient descent by accumulating a velocity vector. Instead of updating parameters directly based on the current gradient, momentum maintains an exponentially decaying sum of past gradients:

v \leftarrow \beta v + \nabla L, \quad \mathbf{w} \leftarrow \mathbf{w} - \eta v

Where $\beta$ (typically 0.9) controls how much past gradients influence the current update. Momentum smooths out oscillations and speeds up convergence in consistent directions. If gradients consistently point in one direction, momentum builds up speed. If gradients oscillate, momentum dampens the oscillations.

Adam (Adaptive Moment Estimation) is one of the most popular optimizers. It adapts the learning rate for each parameter based on the first moment (mean) and second moment (uncentered variance) of the gradients. Adam combines momentum with per-parameter adaptive learning rates. Parameters with large gradients get smaller learning rates, and parameters with small gradients get larger learning rates. This makes training more robust to learning rate choice and speeds up convergence. Adam is the default optimizer for most deep learning applications because it works well with minimal tuning.
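Adam's update combines both ideas in a few lines. A simplified, framework-free sketch (the `adam_step` helper is illustrative; the defaults match the commonly cited values), minimizing a toy quadratic:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) plus per-parameter scaling (second moment)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running uncentered variance
    m_hat = m / (1 - beta1**t)               # bias correction (m and v start at zero)
    v_hat = v / (1 - beta2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3.0), m, v, t, eta=0.01)
print(w)  # close to 3.0
```

Note how the effective step is roughly $\eta$ regardless of the gradient's scale—that per-parameter normalization is what makes Adam robust to the learning rate choice.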

Learning Rate Scheduling

The learning rate is the most important hyperparameter in optimization. Too large, and training diverges—parameters overshoot the minimum and bounce around chaotically. Too small, and training is painfully slow—it might take thousands of epochs to converge. But the optimal learning rate changes during training: early on, large steps make rapid progress. Near convergence, small steps avoid overshooting.

Learning rate schedules adapt the learning rate during training to balance these needs:

Fixed Learning Rate: The simplest approach—use the same learning rate throughout training (e.g., $\eta = 0.001$). This works for simple problems but is suboptimal for complex models. Early training could benefit from larger steps, and late training could benefit from smaller steps.

Step Decay: Reduce the learning rate by a factor every $N$ epochs. For example, start with $\eta = 0.1$, then reduce to 0.01 after 30 epochs, 0.001 after 60 epochs. Common reduction factors: 0.1× or 0.5×. Step decay is simple and widely used, but the schedule requires manual tuning (when to decay, by how much).

Exponential Decay: Continuously reduce the learning rate:

\eta_t = \eta_0 \cdot e^{-kt}

Where $\eta_0$ is the initial learning rate, $t$ is the epoch number, and $k$ controls the decay rate. Exponential decay is smoother than step decay but still requires tuning $k$.

Cosine Annealing: Reduce the learning rate following a cosine curve:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)

Where $T$ is the total number of epochs. The learning rate starts at $\eta_{\max}$, smoothly decreases to $\eta_{\min}$, and can optionally restart (SGDR: Stochastic Gradient Descent with Warm Restarts). Cosine annealing is popular in deep learning because it provides smooth transitions and the restart mechanism helps escape local minima.

Learning Rate Warmup: Start with a very small learning rate and gradually increase it over the first few epochs before applying the main schedule. Warmup prevents early instability—with randomly initialized parameters, large gradients can cause divergence in the first few steps. By starting small and ramping up, the model stabilizes before full-speed training. Warmup is standard in transformer training (e.g., BERT, GPT).
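Warmup and cosine annealing are often combined into a single schedule. A sketch of one common recipe (the `lr_schedule` helper and its step-based parameterization are illustrative, not a standard API):

```python
import math

def lr_schedule(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup, then cosine annealing down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps        # ramp up from near zero
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# A 1000-step run with a 100-step warmup.
lrs = [lr_schedule(s, 1000, 100, lr_max=0.1) for s in range(1000)]
print(lrs[0], max(lrs), lrs[-1])  # starts tiny, peaks at 0.1, decays toward 0
```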

Why Scheduling Matters: Without scheduling, you’re stuck with a single learning rate that’s a compromise: not too large (avoid divergence), not too small (avoid slow training). With scheduling, you get the best of both: fast initial progress (large learning rate) and fine-tuning near convergence (small learning rate). Modern training pipelines almost always use some form of learning rate scheduling.

Convexity and Local Minima

The loss surface—the function mapping parameters to loss—is rarely a simple bowl. It can have many local minima (points where the loss is lower than neighboring points but not globally minimal), saddle points (flat regions that are minima in some directions and maxima in others), and plateaus (flat regions with near-zero gradients).

For classical models like linear regression with MSE and logistic regression with cross-entropy, the loss surface is convex: there’s a single global minimum, and gradient descent is guaranteed to find it (with a small enough learning rate). This is why these models train reliably and consistently. Convexity means no local minima to get stuck in—any minimum you find is the global minimum.

For neural networks, the loss surface is highly non-convex with many local minima and saddle points. In high-dimensional spaces (millions of parameters), saddle points are far more common than local minima. A saddle point is a point where the gradient is zero but it’s a minimum in some directions and a maximum in others—like a mountain pass.

Despite non-convexity, gradient descent works well on neural networks because:

  • Most local minima are nearly as good as the global minimum: In high dimensions, local minima tend to have similar loss values. Getting stuck in a “bad” local minimum is rare.
  • Stochastic gradient descent’s noise helps escape poor minima: The noise in mini-batch gradients acts as a perturbation that can push the optimization out of shallow local minima.
  • Overparameterized networks have benign loss landscapes: Modern neural networks often have more parameters than training examples. This overparameterization creates many equivalent good solutions—a high-dimensional space of low-loss parameters. The loss surface is more like a valley or plateau than a collection of isolated peaks.

Training stability depends on the learning rate. Too large, and the updates overshoot the minimum, causing the loss to oscillate or diverge. Too small, and training is extremely slow. Adaptive optimizers like Adam adjust the learning rate automatically, which makes training more robust to poor initial choices.

Regularization and Loss

The total loss function often includes not just the data term (how well predictions match labels) but also a regularization term that penalizes model complexity:

L_{\text{total}} = L_{\text{data}} + \lambda L_{\text{regularization}}

Where $\lambda$ controls the tradeoff between fitting the data and keeping the model simple.

L2 Regularization (Ridge) adds a penalty proportional to the squared magnitude of the weights:

L_{\text{regularization}} = \sum_{i} w_i^2

This encourages small weights. Large weights mean the model is sensitive to specific features, which can lead to overfitting. By penalizing large weights, L2 regularization forces the model to distribute importance more evenly across features, producing smoother, more generalizable solutions. L2 is also called weight decay because it causes weights to shrink toward zero during training.
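The shrinkage effect is easy to observe with ridge regression, which has a closed form. A sketch on synthetic data (the `ridge` helper is illustrative; larger $\lambda$ pulls the weights toward zero):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: w* = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)

for lam in (0.0, 10.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 2))  # weights shrink as lam grows
```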

L1 Regularization (Lasso) adds a penalty proportional to the absolute value of the weights:

L_{\text{regularization}} = \sum_{i} |w_i|

This encourages sparse models where many weights are exactly zero. L1 effectively performs feature selection during training—unimportant features get zero weight and are ignored. This is useful when you have many features but suspect only a few are relevant. L1 produces interpretable models because the non-zero weights reveal which features matter.

Elastic Net combines L1 and L2:

L_{\text{regularization}} = \alpha \sum_{i} |w_i| + (1-\alpha) \sum_{i} w_i^2

This balances sparsity (L1) and stability (L2). Elastic net is useful when features are correlated—L1 alone might arbitrarily pick one feature from a correlated group, while L2 alone doesn’t produce sparsity.

Connection to Compression: Regularization enforces the compression principle from Chapter 3. By penalizing model complexity, regularization forces the model to find simpler explanations of the data—explanations that compress the patterns without memorizing specifics. A regularized model has fewer degrees of freedom and is constrained to learn only the most important patterns, which generalize better.

In practice, regularization is not optional. Almost all production models use L2 regularization (weight decay) to prevent overfitting. The regularization strength $\lambda$ is a hyperparameter tuned on validation data: too large, and the model underfits (too simple); too small, and the model overfits (too flexible).

Engineering Takeaway

The loss function is the most important choice in building a machine learning system because it defines what the model optimizes. If you train a model to minimize squared error, it will minimize squared error—even if that doesn’t align with your actual business objective.

Align loss with business goals. If your goal is to maximize user engagement, but you train on click-through rate, the model will optimize clicks, not engagement. If your goal is to minimize customer churn, but you train on classification accuracy, the model will optimize accuracy, not churn cost. Custom losses encode domain-specific tradeoffs. For fraud detection, false negatives (missed fraud) might cost $1000 while false positives (flagging legitimate transactions) cost $10. The loss should reflect this asymmetry. For ranking systems, the loss should penalize errors at the top of the list more than errors at the bottom. Always ask: does the loss function capture what I actually care about?

Loss curves are debugging tools. The loss curve—plotting training and validation loss over epochs—is the primary diagnostic for debugging training. If training loss doesn’t decrease, the learning rate might be too high, the model might be too simple, or the data might be too noisy. If training loss decreases but validation loss increases, the model is overfitting. If both losses plateau early, the model might be underfitting (increase capacity). If training is unstable (loss jumps around), reduce the learning rate or increase batch size. Understanding loss curves is essential to debugging ML systems.

Learning rate is critical and often scheduled. The learning rate determines convergence speed and final performance. Too high, and training diverges. Too small, and training is slow. Start with a moderate learning rate (0.001 for Adam, 0.01-0.1 for SGD) and use scheduling (step decay, cosine annealing) to reduce it during training. Learning rate warmup prevents early instability in complex models. Tuning the learning rate schedule is often more important than choosing the optimizer—a well-tuned SGD with momentum can outperform poorly tuned Adam.

Optimization is a computational bottleneck. Gradient computation dominates training time. For large models, computing gradients for millions of parameters on large batches is expensive. Mini-batches balance gradient accuracy (larger batches → more accurate gradients) with throughput (batches must fit in GPU memory). Batch size is constrained by hardware—GPUs have limited memory. Modern training pipelines use gradient accumulation (accumulate gradients over multiple small batches before updating) to simulate large batches on limited hardware.

Adaptive optimizers speed up training. Adam and RMSprop adjust learning rates per parameter based on gradient statistics. Parameters with large, noisy gradients get smaller learning rates. Parameters with small, consistent gradients get larger learning rates. This makes training more robust to learning rate choice and accelerates convergence. Adam is the default for most deep learning because it works well out-of-the-box with minimal tuning. But for some tasks (especially large-scale vision models), well-tuned SGD with momentum can generalize better than Adam.

Regularization via loss modification. L1 and L2 regularization directly modify the loss function by adding penalty terms. Dropout (randomly disabling neurons) and early stopping (stopping when validation loss stops improving) also act as regularization but through different mechanisms. All regularization methods constrain the model’s effective capacity, forcing it to learn simpler, more generalizable patterns. Regularization is not optional—almost all production models use weight decay (L2) to prevent overfitting.

Connection to neural networks. All deep learning is gradient descent on loss functions. The principles in this chapter—loss design, gradient descent, learning rate schedules, regularization—apply directly to neural networks. The difference is scale: neural networks have millions of parameters and compute gradients via backpropagation (covered in Part III). But the core idea is identical: define a loss that captures your goal, compute gradients, update parameters to minimize the loss. Master optimization on linear models, and you understand optimization for deep learning.

The lesson: Training is optimization. The model doesn’t “learn”—it searches for parameters that minimize a function you define. If you define the wrong function, the model will optimize for the wrong thing. Choose the loss carefully, tune the learning rate, monitor the loss curves, and regularize to prevent overfitting. These principles are universal across all machine learning.


References and Further Reading

Convex Optimization – Stephen Boyd and Lieven Vandenberghe https://web.stanford.edu/~boyd/cvxbook/

This is the definitive textbook on convex optimization. While much of machine learning involves non-convex problems (especially deep learning), understanding convex optimization is essential for understanding why classical models like linear regression and logistic regression train reliably. Chapters 9-10 cover gradient descent and Newton’s method. It’s mathematical but accessible to engineers with linear algebra background. This book reveals why some models (linear regression, logistic regression) always converge while others (neural networks) require careful tuning.

An Overview of Gradient Descent Optimization Algorithms – Sebastian Ruder (2016) https://arxiv.org/abs/1609.04747

This is a clear, comprehensive survey of gradient descent variants: SGD, momentum, Nesterov, Adam, RMSprop, and others. Ruder explains why each variant exists, what problems it solves, and how to choose between them. Reading this will give you intuition for why Adam is the default optimizer for neural networks and when simpler methods like SGD with momentum are better. This paper is essential for understanding the zoo of optimizers and knowing which to use when.

Adam: A Method for Stochastic Optimization – Diederik Kingma and Jimmy Ba (2014) https://arxiv.org/abs/1412.6980

Adam is the most widely used optimizer in deep learning. This paper introduces the algorithm and explains how it adapts learning rates for each parameter based on the first and second moments of gradients (mean and variance). Understanding Adam is essential for training modern neural networks and knowing when to use learning rate schedules or switch to other optimizers. Adam’s success demonstrates that adaptive per-parameter learning rates are critical for training deep networks efficiently.