Chapter 14: Optimization in Deep Learning
Why Naive Training Fails
Training deep neural networks is harder than training shallow ones. Even with backpropagation providing exact gradients, optimization can fail in ways that don’t occur with classical models. The two primary pathologies are vanishing gradients and exploding gradients—both consequences of repeatedly multiplying gradients through many layers.
Vanishing Gradients:
When backpropagating through many layers, gradients are multiplied by weight matrices and activation function derivatives at each layer. If these multipliers are less than 1, gradients shrink exponentially with depth. By the time the gradient reaches early layers, it’s vanishingly small—close to zero.
When gradients vanish, early layers stop learning. Their weights barely change because the gradient signal is too weak. The network becomes effectively shallow: only the last few layers learn, and the early layers remain random. The network fails to learn deep hierarchical representations.
This was a major problem historically with sigmoid and tanh activations, whose derivatives are at most 0.25 and 1, respectively. Multiplying by these repeatedly causes exponential decay. A gradient passing through 10 sigmoid layers might be multiplied by as little as $0.25^{10} \approx 10^{-6}$, effectively zero.
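To make the decay concrete, here is a small numeric sketch (plain NumPy, nothing framework-specific) of the best-case attenuation through a stack of sigmoid layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative peaks at x = 0, where it equals exactly 0.25.
peak = sigmoid_derivative(0.0)

# Best case: a gradient backpropagated through 10 sigmoid layers is
# scaled by at most 0.25 per layer.
attenuation = peak ** 10
print(peak, attenuation)  # 0.25, and roughly 1e-6
```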
Exploding Gradients:
The opposite problem: if weights are large or activation derivatives are greater than 1, gradients grow exponentially. By the time gradients reach early layers, they’re enormous—so large that weight updates drastically overshoot and destabilize training.
When gradients explode, the loss diverges rather than converging. Weights become NaN (not a number) after a few updates. Training collapses. This is especially problematic in recurrent networks, where gradients are backpropagated through many time steps.
Both problems stem from the same root cause: deep networks involve composing many transformations, and the chain rule multiplies gradients across all of them. Without careful design, this multiplication causes numerical instability.
Learning Rates: Stability vs Speed
The learning rate is the single most important hyperparameter in training. It controls how much to change weights per gradient update:

$w \leftarrow w - \eta \nabla_w L$
Too large: Updates overshoot the minimum, causing the loss to diverge. Training is unstable. Weights oscillate wildly or blow up to infinity.
Too small: Updates barely move, and training is extremely slow. The network takes millions of iterations to converge. You might run out of patience or compute before reaching a good solution.
The optimal learning rate balances speed and stability: large enough to make progress quickly, small enough to not overshoot. But this optimal rate changes during training. Early on, when the network is far from a good solution, large steps are safe. Later, when approaching a minimum, smaller steps are needed for fine-tuning.
Learning Rate Schedules:
Instead of using a fixed learning rate, schedules reduce it over time:
- Step decay: Reduce by a factor (e.g., 0.1) at fixed intervals (e.g., every 30 epochs).
- Exponential decay: Reduce by a constant factor each epoch: $\eta_t = \eta_0 \gamma^t$ with $\gamma < 1$.
- Cosine annealing: Vary following a cosine curve, sometimes with restarts.
These schedules allow the network to make large steps initially (fast progress) and small steps later (precise convergence).
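The three schedules above fit in a few lines each; the constants below (a 10x drop every 30 epochs, a per-epoch factor of 0.95) are illustrative choices, not prescriptions:

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    # Multiply the base rate by `drop` once per `every` epochs.
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, gamma=0.95):
    # Multiply by a constant factor gamma each epoch.
    return lr0 * (gamma ** epoch)

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    # Follow a half cosine from lr0 down to lr_min over the run.
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * t))

print(step_decay(0.1, 29), step_decay(0.1, 30))   # drops by 10x at epoch 30
print(cosine_annealing(0.1, 45, 90))              # halfway: half the base rate
```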
Warmup:
Starting with a very small learning rate for the first few epochs, then increasing to the target rate. Warmup prevents instability early in training when weights are far from optimal and gradients might be noisy or large.
Finding the right learning rate is often trial and error. A common heuristic: start with a small rate (e.g., $10^{-6}$), increase by 10x until the loss diverges, then back off to the largest stable rate.
Modern Optimizers: Momentum and Adam
Vanilla gradient descent updates weights directly using gradients. Modern optimizers enhance this with mechanisms that smooth out noise, adapt learning rates per parameter, and accelerate convergence.
Momentum:
Momentum accumulates a velocity vector that smooths gradient updates:

$v_t = \beta v_{t-1} + \nabla_w L, \qquad w \leftarrow w - \eta v_t$

Where $\beta$ (typically 0.9) controls how much past gradients influence current updates. Momentum reduces oscillations in directions with high variance and accelerates in consistent directions, like a ball rolling downhill and gaining speed.
Momentum helps when gradients are noisy (mini-batch training) or when the loss surface has valleys. It smooths out updates and often converges faster than vanilla SGD.
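As a quick check, here is one common formulation of the momentum update ($v \leftarrow \beta v + g$, then $w \leftarrow w - \eta v$; frameworks differ slightly in where $\eta$ enters) applied to a toy quadratic:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v accumulates an exponentially decaying sum of past gradients;
    # the weight moves along the smoothed direction -v.
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Minimize f(w) = w^2 (gradient 2w), starting far from the minimum at 0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)
print(w)  # close to 0
```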
Adam (Adaptive Moment Estimation):
Adam is the most widely used optimizer. It combines momentum with adaptive per-parameter learning rates:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$w \leftarrow w - \eta \, m_t / (\sqrt{v_t} + \epsilon)$

Where:
- $m_t$ is the first moment (mean of gradients, like momentum)
- $v_t$ is the second moment (variance of gradients)
- $\beta_1$ and $\beta_2$ are decay rates (typically 0.9 and 0.999)
- $\epsilon$ (e.g., $10^{-8}$) prevents division by zero
Adam adapts the learning rate for each parameter based on the history of gradients. Parameters with large, consistent gradients get smaller effective learning rates (to prevent overshooting). Parameters with small or noisy gradients get larger effective learning rates (to make progress).
This adaptation makes Adam robust to learning rate choices. A single learning rate (e.g., $10^{-3}$) often works well across different problems, whereas vanilla SGD requires careful tuning. This robustness is why Adam is the default optimizer in most deep learning work.
Bias Correction in Adam:
The formulas above are simplified. The full Adam algorithm includes bias correction. Early in training, $m$ and $v$ are initialized to zero. This means the exponential moving averages are biased toward zero, especially in the first few steps when most of the weight is still on the initial zero values.

Bias correction compensates for this initialization bias:

$\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)$

Where $\beta_1^t$ means $\beta_1$ raised to the power $t$ (the current timestep). These correction terms approach 1 as $t$ increases, so after many steps, the correction has no effect. But in the first few steps, they scale up the biased estimates to be unbiased.

Without bias correction, learning is too slow in the first few iterations: the effective learning rate is much smaller than intended because $m_t$ and $v_t$ are artificially small. With correction, Adam learns quickly from the start.
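The moment updates plus bias correction fit in a few lines of NumPy. As a sanity check, the very first corrected step has magnitude near the learning rate even for a huge gradient:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based timestep; m and v start at zero.
    m = beta1 * m + (1.0 - beta1) * g          # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment (squared grads)
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
w, m, v = adam_step(w, g=100.0, m=m, v=v, t=1)
print(w)  # about -0.001: step size ~lr, even for a gradient of 100
```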
When Not to Use Adam:
Adam is robust and widely applicable, but it’s not always the best choice:
- For problems with sparse gradients (e.g., embedding layers with sparse inputs), Adam’s second moment estimate can be inaccurate. AdaGrad or sparse-aware optimizers work better.
- When using L1 regularization (sparsity-inducing penalties), Adam’s momentum can interfere with the sparse solution. Use sign-based methods instead.
- For some vision models (CNNs on ImageNet), well-tuned SGD with momentum often achieves slightly better final accuracy than Adam, though Adam converges faster early in training. The difference is small and task-dependent.
AdamW (Adam with Decoupled Weight Decay):
Standard Adam applies weight decay incorrectly: it adds the L2 penalty to the gradient, which then interacts with the adaptive learning rates. This means weight decay strength varies per parameter, which isn’t what you want.
AdamW fixes this by decoupling weight decay from the gradient-based update:

$w \leftarrow w - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) - \eta \lambda w$
Now weight decay is applied uniformly to all parameters, independent of the adaptive learning rate. Empirically, AdamW generalizes better than Adam with L2 regularization. It’s the recommended variant for most production systems.
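The difference from plain Adam is a single term: the decay $-\eta \lambda w$ is applied directly to the weights instead of being folded into the gradient. A sketch (the constants are illustrative defaults):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decay is decoupled: it bypasses the adaptive rescaling entirely.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

# With a zero gradient, the weight still shrinks by the factor (1 - lr*wd):
w, m, v = adamw_step(1.0, g=0.0, m=0.0, v=0.0, t=1)
print(w)  # 0.99999
```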
Production Default: Use Adam (or AdamW) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, learning rate $10^{-3}$ or $3 \times 10^{-4}$. These defaults work well across a wide range of problems. Only tune if you observe specific issues.
The diagram shows different optimizers navigating a loss surface. SGD oscillates due to noisy gradients. Momentum smooths the path. Adam adapts its step size and converges efficiently.
Regularization: Preventing Memorization
Even with good optimization, deep networks can overfit—memorize training data instead of learning generalizable patterns. Regularization techniques constrain the model to prevent memorization.
Weight Decay (L2 Regularization):
Add a penalty to the loss that discourages large weights:

$L_{total} = L + \frac{\lambda}{2} \|w\|^2$

This encourages small, diffuse weights rather than large, sparse weights. It's equivalent to maximum a posteriori (MAP) estimation with a Gaussian prior. In practice, weight decay is implemented by modifying the weight update:

$w \leftarrow w - \eta (\nabla_w L + \lambda w)$

The $\eta \lambda w$ term shrinks weights slightly each update.
Dropout:
During training, randomly disable a fraction (e.g., 50%) of neurons each forward/backward pass. This prevents neurons from co-adapting—relying on specific other neurons being present. Each neuron must learn a robust feature that works even when random other neurons are missing.
At test time, all neurons are active, but their outputs are scaled by the keep probability so that expected activations match those seen during training. Dropout is effectively training an ensemble of networks (each dropout mask is a different subnetwork) and averaging them at test time.
Dropout is incredibly effective at preventing overfitting, especially in fully connected layers. It’s less commonly used in convolutional layers, where other regularization techniques (data augmentation, batch normalization) are more effective.
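Modern implementations usually use "inverted" dropout, which rescales at training time by the keep probability so that test time is a no-op; the expectation argument is the same. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop=0.5, training=True):
    if not training:
        return x                             # identity at test time
    keep = rng.random(x.shape) >= p_drop     # keep each unit with prob 1 - p_drop
    return x * keep / (1.0 - p_drop)         # rescale so E[output] = x

x = np.ones(10000)
y = dropout(x, p_drop=0.5)
print(y.mean())  # close to 1.0: zeros and doubled survivors average out
```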
Batch Normalization:
Normalize activations within each mini-batch:

$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

Where $\mu_B$ and $\sigma_B^2$ are the mean and variance of activations in the current batch. This keeps activations from growing too large or small, which helps with vanishing/exploding gradients.
Batch normalization also acts as a regularizer: the noise from batch statistics introduces randomness that prevents overfitting, similar to dropout. It’s now a standard component in most architectures.
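A forward-pass sketch over a batch, including the learnable scale $\gamma$ and shift $\beta$ that the full layer also carries:

```python
import numpy as np

def batchnorm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature using statistics of the current mini-batch,
    # then apply the learnable affine transform.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(64, 8))
y = batchnorm_forward(x)
print(y.mean(axis=0).max(), y.std(axis=0).min())  # means ~0, stds ~1
```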
Early Stopping:
Monitor validation loss during training. When it stops improving (or starts increasing), stop training, even if training loss is still decreasing. This prevents the model from continuing to fit training-specific noise after it has learned the generalizable patterns.
Early stopping is simple, effective, and widely used. It's the most practical regularization technique: almost nothing to tune beyond a patience threshold, just monitor and stop.
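The logic is a small loop; `patience` (how many non-improving epochs to tolerate) is the one knob:

```python
def early_stopping_epoch(val_losses, patience=3):
    # Return (best_epoch, stop_epoch): stop once validation loss has
    # failed to improve for `patience` consecutive epochs.
    best, best_epoch, wait = float("inf"), 0, 0
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

# Validation loss improves, bottoms out at epoch 3, then worsens:
losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.75]
print(early_stopping_epoch(losses))  # (3, 6)
```

In a real training loop you would also restore the weights saved at `best_epoch`.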
Modern Optimization Techniques
Beyond the core optimizers and regularization methods, several modern techniques make training more efficient and enable larger models.
Gradient Accumulation:
GPU memory limits the batch size you can use. A single V100 GPU might only fit batch size 32 for a large model, when you’d prefer batch size 128 for better gradient estimates and generalization.
Gradient accumulation simulates larger batches without extra memory: instead of updating weights after each mini-batch, accumulate gradients across multiple mini-batches, then update once.
Example: Run forward and backward pass 4 times with batch size 32, accumulating gradients without updating weights. After 4 accumulations (effective batch size 128), update weights with the accumulated gradients. Memory usage stays at batch 32, but the optimization dynamics match batch 128.
This is standard practice for training large language models where memory is the primary constraint.
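For a loss averaged over examples, accumulating micro-batch gradients and dividing by the number of micro-batches reproduces the full-batch gradient exactly. A NumPy check on a toy linear model (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def grad(w, X, y):
    # Gradient of mean squared error for a linear model.
    return 2.0 * X.T @ (X @ w - y) / len(X)

w = rng.normal(size=3)
X, y = rng.normal(size=(128, 3)), rng.normal(size=128)

g_full = grad(w, X, y)                       # one batch of 128

acc = np.zeros_like(w)                       # four micro-batches of 32
for i in range(4):
    acc += grad(w, X[i*32:(i+1)*32], y[i*32:(i+1)*32])
g_accum = acc / 4

print(np.allclose(g_full, g_accum))  # True
```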
Mixed Precision Training:
Neural networks typically use 32-bit floating point (FP32) numbers. Mixed precision training uses 16-bit floats (FP16) for most operations, falling back to FP32 only when necessary for numerical stability.
Benefits:
- 2× memory reduction: Activations and weights take half the space
- 2-3× speedup: Modern GPUs (V100, A100) have specialized Tensor Cores that compute FP16 much faster than FP32
- No accuracy loss: With careful implementation, FP16 training matches FP32 final accuracy
Challenge: FP16 has smaller range and precision than FP32. Gradients can underflow (become zero) or overflow (become infinity). Solution: loss scaling. Multiply the loss by a large constant (e.g., 1024) before backpropagation, then scale gradients back down before weight updates. This shifts gradient values into FP16’s representable range without changing the optimization dynamics.
Mixed precision requires hardware support (Tensor Cores on NVIDIA V100/A100 or later). On older GPUs, the overhead of FP16/FP32 conversions can negate the speedup.
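The underflow problem and the loss-scaling fix can be seen directly with NumPy's float16 type:

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~6e-8):
small_grad = np.float32(1e-8)
print(np.float16(small_grad))      # 0.0: the gradient is lost

# Loss scaling: scale up before casting to FP16, scale down in FP32.
scale = np.float32(1024.0)
scaled_fp16 = np.float16(small_grad * scale)   # now representable
recovered = np.float32(scaled_fp16) / scale
print(recovered)                   # back near 1e-8
```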
Learning Rate Finder:
Finding the right learning rate is critical but usually done by trial and error. The learning rate finder automates this:
- Start with a very small learning rate (e.g., $10^{-7}$)
- Train for a few hundred iterations, exponentially increasing the learning rate each iteration
- Plot loss vs learning rate
- The optimal learning rate is in the steepest descent region—just before the loss starts to increase or diverge
This technique, popularized by Leslie Smith and the fastai library, quickly identifies a good learning rate range without extensive hyperparameter sweeps. It’s especially useful when training new architectures or on new datasets where you have no prior intuition.
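A sketch of the sweep on a toy quadratic loss (the model, the sweep bounds, and the divergence multiplier are illustrative stand-ins for a real training loop):

```python
import numpy as np

def lr_finder(loss_and_grad, w0, lr_min=1e-7, lr_max=10.0, steps=100):
    # Exponentially sweep the learning rate, recording (lr, loss);
    # abort once the loss clearly diverges.
    lrs = np.geomspace(lr_min, lr_max, steps)
    w, history = np.array(w0, dtype=float), []
    for lr in lrs:
        loss, g = loss_and_grad(w)
        history.append((lr, loss))
        if loss > 4.0 * history[0][1]:       # diverged: end the sweep
            break
        w = w - lr * g
    return history

quad = lambda w: (float(w @ w), 2.0 * w)     # toy loss f(w) = ||w||^2
history = lr_finder(quad, [5.0])
# In practice, plot loss vs lr on a log x-axis and pick a rate just
# below where the curve turns upward.
```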
Large Batch Training and Learning Rate Scaling:
When you increase batch size, you must adjust the learning rate to maintain similar optimization dynamics. The standard rule: if you double the batch size, double the learning rate.
Why: Larger batches give more accurate gradient estimates, reducing noise. With less noise, you can take larger steps safely. The linear scaling rule (learning rate proportional to batch size) works well empirically up to batch sizes of a few thousand.
Caveat: Very large batches (> 8k) can hurt generalization even with learning rate scaling. The noise from small batches acts as a regularizer, and large batches remove this beneficial noise. Techniques like learning rate warmup and adjusted regularization can mitigate this, but there’s an upper limit to useful batch size for generalization.
Engineering Takeaway
Training deep networks is more engineering than science. Hyperparameters matter enormously. Understanding the failure modes—vanishing gradients, exploding gradients, overfitting—and the solutions—modern optimizers, regularization, careful initialization—separates models that work from models that fail.
Adam (or AdamW) is the default optimizer. Start with Adam with learning rate $10^{-3}$ or $3 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. It works across most problems with minimal tuning. For production, use AdamW instead of Adam: decoupled weight decay generalizes better. Only switch to SGD with momentum if you're tuning a well-established architecture (like ResNets on ImageNet) where practitioners have found SGD works better. Adam's robustness makes it the right starting point.
Learning rate is the most important hyperparameter. Too high causes divergence (loss becomes NaN). Too low wastes compute (training takes forever). Use a learning rate finder to identify the steepest descent region, or start with $3 \times 10^{-4}$ and adjust based on training curves. Add learning rate schedules (step decay, cosine annealing) for the final 20-30% of training to improve convergence. Use warmup (start with small learning rate for first few epochs) to prevent early instability. When you scale up batch size, scale up learning rate proportionally.
Regularization prevents overfitting and is not optional. Use weight decay ($10^{-4}$ or $10^{-5}$), dropout (0.5 for fully connected layers, 0.1-0.2 for convolutional layers), batch normalization, and early stopping. Without regularization, deep networks memorize training data. With too much regularization, they underfit. Tune regularization strength on a validation set. Weight decay is the single most important regularizer—always use it unless you have a specific reason not to.
Gradient clipping prevents exploding gradients. Clip gradient norms to a maximum value (1.0 for RNNs, 5.0 for transformers). This prevents rare large gradients from destabilizing training. It's a simple safeguard that costs almost nothing and prevents catastrophic failures. Monitor gradient norms during training: if you see sudden large spikes, reduce the clipping threshold or learning rate.
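Clipping by global norm treats all parameters' gradients as one long vector and rescales only when the combined norm exceeds the threshold:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient list so its combined L2 norm is at
    # most max_norm; small gradients pass through untouched.
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [g * scale for g in grads], total

big = [np.array([30.0, 40.0])]                 # norm 50: an exploding step
clipped, norm = clip_by_global_norm(big, max_norm=1.0)
print(norm, np.linalg.norm(clipped[0]))        # 50.0, then ~1.0 after clipping
```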
Mixed precision training enables larger models. Use FP16 with loss scaling for 2× memory reduction and 2-3× speedup on modern GPUs (V100/A100). This lets you train larger models or use larger batch sizes on the same hardware. Most frameworks (PyTorch, TensorFlow) provide automatic mixed precision with minimal code changes. The only downside: requires recent GPU hardware. If training on older GPUs, stick with FP32.
Batch size trades off memory, speed, and generalization. Larger batches (128-256) give more accurate gradients, better GPU utilization, and faster training per epoch. Smaller batches (32-64) use less memory, have more noise (which can improve generalization), and update weights more frequently. For most systems, batch size 32-128 is the practical sweet spot. Use gradient accumulation to simulate larger batches if memory is constrained. Scale learning rate linearly with batch size.
Debugging training requires systematic diagnosis. If loss doesn’t decrease: (1) Check gradients—are they non-zero and finite? Use gradient checking or print gradient norms. (2) Check data—is it normalized, correctly labeled, and shuffled? (3) Reduce learning rate by 10× and see if training improves. (4) Try a simpler model (fewer layers, fewer parameters) to verify the problem is learnable. (5) Check for bugs—off-by-one errors, incorrect loss functions, frozen layers. Understanding optimization internals (backprop, gradient flow) helps you identify the failure mode quickly.
The lesson: Optimization is the bottleneck for deep learning. Modern techniques—Adam, batch normalization, learning rate schedules, mixed precision—make training tractable, but training remains fragile. The difference between a model that converges and one that fails often comes down to hyperparameter tuning and understanding why training fails. Master these techniques, and you can train networks that actually work.
References and Further Reading
On the Importance of Initialization and Momentum in Deep Learning – Ilya Sutskever et al. (2013) http://proceedings.mlr.press/v28/sutskever13.html
This paper shows how momentum accelerates training and why initialization matters. Sutskever demonstrates that careful initialization and momentum enable training much deeper networks than vanilla SGD. Reading this will give you intuition for why these techniques are standard practice.
Adam: A Method for Stochastic Optimization – Diederik Kingma and Jimmy Ba (2015) https://arxiv.org/abs/1412.6980
This is the paper that introduced Adam, now the most popular optimizer. Kingma and Ba explain how adaptive per-parameter learning rates work and show empirical results across many tasks. Understanding Adam’s design principles helps you use it effectively and know when to switch to alternatives.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift – Sergey Ioffe and Christian Szegedy (2015) https://arxiv.org/abs/1502.03167
Batch normalization revolutionized training deep networks. Ioffe and Szegedy show that normalizing activations enables much higher learning rates, faster convergence, and acts as a regularizer. Reading this explains why batch norm is in almost every modern architecture.