Chapter 13: Backpropagation
What Went Wrong: Loss as Feedback
Training begins with error. The network makes a prediction, we compare it to the true label, and we measure how wrong it was using a loss function. For a classification task with true label $y$ and predicted probabilities $\hat{y}$, the cross-entropy loss is:

$\mathcal{L} = -\sum_i y_i \log \hat{y}_i$
This single number—the loss—quantifies the network’s failure on this example. During training, the goal is to adjust the network’s weights to reduce this loss.
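The loss computation above can be sketched in a few lines of NumPy (the labels and probabilities here are illustrative, not from a real model):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss for one example: L = -sum_i y_i * log(y_hat_i).
    eps guards against log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

# One-hot true label, and a confident vs. an uncertain prediction.
y = np.array([0.0, 1.0, 0.0])
confident = np.array([0.05, 0.90, 0.05])   # puts most mass on the true class
uncertain = np.array([0.40, 0.30, 0.30])   # spreads mass across classes
```

The more probability the network assigns to the true class, the lower the loss; a perfect prediction ($\hat{y}$ equal to the one-hot label) drives the loss to zero.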
But which weights should change, and by how much? The network has thousands or millions of parameters spread across many layers. Some weights contributed directly to the error; others had only indirect effects. How do we assign responsibility?
This is the credit assignment problem. Backpropagation solves it by computing how much each weight contributed to the error. It does this by propagating the error signal backward through the network, from output to input, calculating gradients at each layer.
The gradient $\partial \mathcal{L} / \partial w$ tells us how the loss changes if we increase weight $w$ by a small amount. If the gradient is positive, increasing $w$ increases the loss, so we should decrease $w$. If the gradient is negative, increasing $w$ decreases the loss, so we should increase $w$. The magnitude tells us how sensitive the loss is to this weight.
Once we have gradients for all weights, we can update them to reduce the loss:

$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$

Where $\eta$ is the learning rate. This is gradient descent applied to neural networks. Backpropagation is the algorithm that computes the gradients efficiently.
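A minimal sketch of this update rule, assuming a toy one-parameter loss $\mathcal{L}(w) = (w - 3)^2$ whose gradient we know in closed form:

```python
# Gradient descent on a single weight, assuming the illustrative loss
# L(w) = (w - 3)^2 with known gradient dL/dw = 2 * (w - 3).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

eta = 0.1   # learning rate
w = 0.0     # initial weight
for _ in range(100):
    w = w - eta * grad(w)   # w <- w - eta * dL/dw
```

Repeated steps move $w$ toward the minimum at $w = 3$; a larger $\eta$ takes bigger steps and can overshoot.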
Credit Assignment: Who Caused the Error?
Consider a three-layer network: input → hidden layer 1 → hidden layer 2 → output. The final prediction is wrong. Which weights should we blame?
Weights in the output layer directly influenced the final prediction. Their effect is straightforward to compute: slightly changing an output weight changes the prediction, which changes the loss. We can calculate $\partial \mathcal{L} / \partial w$ directly using the chain rule.
But what about weights in hidden layer 2? They don’t directly touch the output. They influence hidden layer 2’s activations, which influence the output layer’s inputs, which influence the predictions, which influence the loss. The chain of causality is longer, but we can still trace it using the chain rule.
And weights in hidden layer 1? They’re even further removed, influencing hidden layer 1’s activations, which influence hidden layer 2’s inputs, which influence hidden layer 2’s activations, which influence the output layer’s inputs, which influence the predictions, which influence the loss. Again, the chain rule lets us compute their gradients.
This is backpropagation: systematically applying the chain rule to compute how every weight in the network affects the loss, no matter how many layers away from the output it is.
Gradients as Blame: How Responsibility Flows Backward
The key insight is that gradients flow backward through the network in the reverse order of the forward pass. We compute gradients starting from the output layer and work backward, layer by layer, to the input.
Output layer gradients:
For the output layer, the gradient of the loss with respect to the pre-activations $z^{(L)}$ (where $L$ indexes the last layer) can be computed directly:

$\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial z^{(L)}}$
(For cross-entropy loss with softmax, this is simply $\delta^{(L)} = \hat{y} - y$: the predicted probabilities minus the true labels, a beautifully simple form.)
From this, we compute gradients for the output layer's weights and biases:

$\frac{\partial \mathcal{L}}{\partial W^{(L)}} = \delta^{(L)} \, (a^{(L-1)})^{\top} \qquad \frac{\partial \mathcal{L}}{\partial b^{(L)}} = \delta^{(L)}$

Where $a^{(L-1)}$ is the input to this layer (the previous layer's activations).
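These output-layer formulas can be sketched in NumPy, assuming softmax outputs with cross-entropy loss so that $\delta^{(L)} = \hat{y} - y$ (shapes and values are made up for illustration):

```python
import numpy as np

np.random.seed(0)
a_prev = np.random.randn(4, 1)           # previous layer's activations (input to output layer)
y_hat = np.array([[0.2], [0.7], [0.1]])  # softmax probabilities
y = np.array([[0.0], [1.0], [0.0]])      # one-hot true label

delta = y_hat - y        # dL/dz for the output layer (softmax + cross-entropy)
dW = delta @ a_prev.T    # dL/dW = delta * a_prev^T, shape (3, 4)
db = delta               # dL/db = delta
```

Note the shapes: `dW` has exactly the same shape as the weight matrix it will update, which is a useful sanity check in any backprop implementation.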
Hidden layer gradients:
To compute gradients for earlier layers, we propagate the error backward:

$\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$

Where $\odot$ is element-wise multiplication and $\sigma'$ is the derivative of the activation function.
This equation says: the gradient with respect to layer $l$'s pre-activations depends on:
- The gradient from the next layer ($\delta^{(l+1)}$)
- How the next layer's weights connect back to this layer ($(W^{(l+1)})^{\top}$)
- How much this layer's activations changed given its inputs ($\sigma'(z^{(l)})$)
This is the chain rule in action. The gradient flows backward, weighted by the connections and modulated by the activation derivatives.
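The backward-propagation step can be sketched as follows, assuming sigmoid activations and illustrative layer sizes:

```python
import numpy as np

# One step of error propagation: delta_l = (W_{l+1}^T delta_{l+1}) * sigma'(z_l).
# All shapes and values here are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)
W_next = np.random.randn(3, 5)      # weights of layer l+1 (3 units fed by 5)
delta_next = np.random.randn(3, 1)  # gradient at layer l+1's pre-activations
z = np.random.randn(5, 1)           # layer l's pre-activations

sigma_prime = sigmoid(z) * (1 - sigmoid(z))    # sigmoid derivative, at most 0.25
delta = (W_next.T @ delta_next) * sigma_prime  # backpropagated, then modulated element-wise
```

The transpose maps the 3-dimensional error back to the 5 units that fed it, and the element-wise factor $\sigma'(z^{(l)})$ shrinks each component, which is exactly the mechanism behind vanishing gradients discussed later in the chapter.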
The diagram shows bidirectional flow: data flows forward during the forward pass, producing predictions and loss. Gradients flow backward during the backward pass, computing how each weight should change to reduce loss.
Gradient Checking: Sanity Testing Backprop
Before trusting backpropagation, especially when implementing it from scratch or using custom layers, you should verify that gradients are correct. Gradient checking compares analytical gradients (from backprop) to numerical gradients (from finite differences).
Numerical Gradient: For any parameter $\theta$, the gradient can be approximated by computing the loss at slightly different values:

$\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \epsilon) - \mathcal{L}(\theta - \epsilon)}{2\epsilon}$

Where $\epsilon$ is a small value like $10^{-5}$. This is the central difference approximation; its error is $O(\epsilon^2)$. The idea is simple: perturb $\theta$ slightly in both directions, measure how much the loss changes, and estimate the slope.
Gradient Check Procedure:
- Implement forward pass and backpropagation
- Compute analytical gradients via backprop:
- For each parameter, compute numerical gradient:
- Compare: compute relative error
- If relative error < , gradients match. If > , there’s a bug.
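The procedure above can be sketched on a deliberately tiny model: a linear model with squared-error loss, chosen here because its analytical gradient is easy to write down by hand.

```python
import numpy as np

# Gradient check on L(w) = ((w . x) - t)^2, whose analytical gradient
# is 2 * ((w . x) - t) * x. Data and weights are synthetic.
np.random.seed(2)
x = np.random.randn(3)
t = 1.5
w = np.random.randn(3)

def loss(w):
    return (np.dot(w, x) - t) ** 2

g_analytic = 2.0 * (np.dot(w, x) - t) * x

eps = 1e-5
g_numeric = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    g_numeric[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)  # central difference

rel_error = np.abs(g_analytic - g_numeric) / (np.abs(g_analytic) + np.abs(g_numeric) + 1e-12)
```

In a real check you would loop over every parameter of every layer the same way, which is exactly why the procedure is too slow to leave enabled during training.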
When to Use: Gradient checking is expensive; it requires $O(n)$ forward passes for a network with $n$ parameters (two per parameter with central differences). Only use it during development on small networks with few examples (e.g., 10 parameters, 5 examples). Once verified, disable gradient checking and rely on backprop.
Common Bugs Caught: Off-by-one errors in indexing, incorrect transpose operations, wrong activation derivatives, forgetting to accumulate gradients across batch elements. Gradient checking catches these bugs that are invisible from loss curves alone.
Example: TensorFlow provides tf.test.compute_gradient_error for automated gradient checking. PyTorch has torch.autograd.gradcheck. Use these during unit tests for custom layers or loss functions.
Gradient checking is a sanity test—not a proof of correctness, but a strong signal that backprop is implemented correctly.
Vanishing and Exploding Gradients
When backpropagating through many layers, gradients can become problematically small (vanishing) or large (exploding), preventing effective learning.
Vanishing Gradients: Recall that gradients propagate backward via:

$\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$

At each layer, the gradient is multiplied by the weight matrix and the activation derivative. For sigmoid or tanh activations, $\sigma'(z)$ never exceeds 1 (the maximum is 0.25 for sigmoid and 1 for tanh, reached only at zero input). After backpropagating through 10 layers, gradients are multiplied by 10 such terms, each typically well below 1, causing exponential decay. Gradients in early layers approach zero, making those layers untrainable.
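A quick illustration of this decay, using the fact that the sigmoid derivative peaks at 0.25:

```python
import numpy as np

# Best-case vanishing: multiply the error signal by the sigmoid derivative
# at its maximum (z = 0, where sigma'(0) = 0.25) for 10 layers.
def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

signal = 1.0
for layer in range(10):
    signal *= sigmoid_prime(0.0)   # 0.25 per layer, and that's the BEST case
```

Even in this best case the signal shrinks to $0.25^{10} \approx 10^{-6}$; with realistic pre-activations away from zero, the per-layer factor is smaller and the decay is faster.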
How to Detect Vanishing:
- Early layer weights don’t change during training (check gradient norms)
- Gradients in the first few layers are orders of magnitude smaller (e.g., norms below $10^{-7}$) while output layer gradients are normal
- Training loss decreases slowly or plateaus early despite model having capacity
How to Fix Vanishing:
- Use ReLU activation: $\sigma'(z) = 1$ for $z > 0$, preventing gradient decay
- Better initialization: He initialization (Chapter 11) keeps gradients in a reasonable range
- Residual connections: Skip connections (Chapter 16) allow gradients to flow directly through layers
- Batch normalization: Normalizes activations, preventing them from saturating
Exploding Gradients: If weights are initialized too large or learning rates are too high, gradients can grow exponentially. After backpropagating through many layers, gradients become huge (e.g., norms above $10^{3}$), causing weight updates that overshoot and destabilize training.
How to Detect Exploding:
- Loss becomes NaN or inf during training
- Gradient norms in early layers are very large (e.g., above $10^{3}$)
- Weights oscillate wildly or grow unbounded
How to Fix Exploding:
- Gradient clipping: Cap gradient norm at a maximum value (e.g., clip to norm 1.0 for RNNs, 5.0 for transformers)
- Lower learning rate: Smaller steps prevent overshooting
- Batch normalization: Stabilizes activations and gradients
- Better initialization: Xavier/He initialization prevents initial gradients from being too large
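Of these fixes, gradient clipping is the most mechanical. Here is a sketch of clipping by global norm (the same scheme as utilities like PyTorch's `clip_grad_norm_`, though this standalone version is purely illustrative):

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """If the global norm of all gradients exceeds max_norm, rescale them
    so the global norm equals max_norm; otherwise leave them unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads], total_norm
    return grads, total_norm

# Synthetic "exploded" gradients for a weight matrix and a bias vector.
np.random.seed(3)
grads = [np.random.randn(4, 4) * 100.0, np.random.randn(4) * 100.0]
clipped, norm_before = clip_by_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Clipping preserves the gradient's direction and only caps its length, so the update still points downhill but can no longer destabilize training in a single step.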
Production Monitoring: Track gradient norms per layer during training. Log the ratio of the max gradient to the min gradient across layers. Set up alerts if gradient norms fall outside a healthy range (e.g., roughly $10^{-6}$ to $10^{2}$). Gradient statistics are essential for debugging training failures.
Computational Graph and Automatic Differentiation
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) don’t require you to implement backpropagation manually. They use automatic differentiation (autodiff) to compute gradients automatically from the forward pass definition.
Computational Graph: When you define a forward pass, the framework builds a computational graph: a directed acyclic graph (DAG) where nodes are operations (matrix multiply, ReLU, add) and edges are tensors. For example, $y = \mathrm{ReLU}(Wx + b)$ becomes:
- Node 1: Multiply $W$ and $x$ → intermediate tensor $Wx$
- Node 2: Add $b$ → pre-activation $z = Wx + b$
- Node 3: Apply ReLU → activation $y = \mathrm{ReLU}(z)$
Forward and Backward Rules: Each operation has a forward rule (compute output from inputs) and a backward rule (compute gradients of inputs given the gradient of the output). For matrix multiply $C = AB$:
- Forward: $C = AB$
- Backward: $\frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial C} B^{\top}$ and $\frac{\partial \mathcal{L}}{\partial B} = A^{\top} \frac{\partial \mathcal{L}}{\partial C}$
The framework stores these rules for every operation. During backprop, it traverses the graph in reverse, applying backward rules to compute gradients.
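The matrix-multiply backward rule can be checked directly. Here it is verified against a numerical derivative for the simple loss $\mathcal{L} = \sum_{ij} C_{ij}$, chosen because its gradient with respect to $C$ is all ones:

```python
import numpy as np

np.random.seed(4)
A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
dC = np.ones((2, 4))   # gradient of L = sum(A @ B) with respect to C

dA = dC @ B.T          # backward rule: dL/dA = dL/dC @ B^T, shape matches A
dB = A.T @ dC          # backward rule: dL/dB = A^T @ dL/dC, shape matches B

# Numerical check on one entry of A via central differences.
eps = 1e-6
A_pert = A.copy()
A_pert[0, 0] += eps
plus = np.sum(A_pert @ B)
A_pert[0, 0] -= 2 * eps
minus = np.sum(A_pert @ B)
numeric = (plus - minus) / (2 * eps)
```

The transposes are exactly what makes the shapes work out: each input's gradient has the same shape as the input itself, which the framework relies on when applying updates.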
Automatic Gradient Computation: You define the forward pass in code. The framework:
- Records all operations and builds the computational graph
- Executes the forward pass, storing intermediate activations
- Computes loss
- Traverses the graph backward, applying backward rules to compute $\partial \mathcal{L} / \partial \theta$ for all parameters
This is why you never write gradient code manually in PyTorch or TensorFlow. The framework computes gradients correctly for any differentiable operations you compose.
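To make the machinery concrete, here is a deliberately minimal reverse-mode autodiff sketch, nothing like a production framework: each operation records its inputs and local derivatives, and `backward()` replays the recorded graph in reverse, applying the chain rule (real frameworks traverse in topological order for efficiency; this version naively sums over paths).

```python
class Var:
    """A scalar value that records how it was computed, for reverse-mode autodiff."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (parent Var, local derivative)

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def relu(self):
        return Var(max(0.0, self.value),
                   [(self, 1.0 if self.value > 0 else 0.0)])

    def backward(self, upstream=1.0):
        # Accumulate the gradient, then push it to each parent,
        # scaled by that parent's local derivative (chain rule).
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

# y = ReLU(w * x + b): the graph is built implicitly by the operations.
w, x, b = Var(2.0), Var(3.0), Var(-1.0)
y = (w * x + b).relu()
y.backward()   # fills in w.grad, x.grad, b.grad
```

After `y.backward()`, each leaf holds $\partial y / \partial \theta$: for instance `w.grad` is $x = 3$, because the ReLU is active and $y = wx + b$ there.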
Memory vs Recomputation Tradeoff: Backprop requires intermediate activations from the forward pass. For deep networks, storing all activations requires huge memory (proportional to depth × batch size × layer width). Gradient checkpointing solves this: store activations only at certain layers (e.g., every 5th layer), and recompute the intermediate activations during backprop when needed. This trades compute (recompute activations) for memory (don't store them all), enabling training of 2-3× deeper networks on the same hardware.
Why You Don’t Implement Backprop Manually: Frameworks handle edge cases correctly (broadcasting, in-place operations, multiple paths in the graph). They optimize memory usage and parallelization. They support mixed precision and gradient accumulation. Unless you’re implementing a new low-level operation, trust the framework’s autodiff. But understanding how it works helps you debug when gradients are wrong or training fails.
Weight Updates: How Models Improve
Once we have gradients for all weights, updating is simple:

$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$

The learning rate $\eta$ controls the step size. Large learning rates change weights dramatically, potentially overshooting the minimum. Small learning rates change weights slowly, requiring many iterations to converge.
This update happens after computing gradients for a batch of examples. In stochastic gradient descent (SGD), we compute the average gradient over a mini-batch (e.g., 32 examples) and update once per batch. This balances computational efficiency (batches parallelize on GPUs) with gradient accuracy (averaging reduces noise).
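A sketch of this mini-batch loop, assuming a toy linear model on synthetic data (batch size 32, one update per batch):

```python
import numpy as np

# Mini-batch SGD for a linear model pred = x . w with squared-error loss.
# Data is synthetic: targets are generated from a known weight vector.
np.random.seed(5)
true_w = np.array([2.0, -3.0])
X = np.random.randn(256, 2)
t = X @ true_w

w = np.zeros(2)
eta = 0.1
batch_size = 32
for epoch in range(50):
    for start in range(0, len(X), batch_size):
        xb, tb = X[start:start + batch_size], t[start:start + batch_size]
        pred = xb @ w
        grad = 2.0 * xb.T @ (pred - tb) / len(xb)  # average gradient over the batch
        w -= eta * grad                            # one update per batch
```

Averaging over the batch smooths the gradient estimate, and updating once per batch (rather than once per example) lets the matrix operations parallelize on GPUs.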
Over many iterations, the weights slowly adjust to reduce the training loss. The network learns: patterns that reduce error get reinforced (weights increase), patterns that increase error get suppressed (weights decrease). The process is mechanical—just gradient descent—but the result is a network that generalizes.
Engineering Takeaway
Backpropagation is why neural networks can be trained at all. It’s an efficient algorithm for computing gradients in computational graphs, enabling gradient descent on arbitrarily deep networks.
Backprop is automatic in modern frameworks. You define the forward pass, and PyTorch/TensorFlow compute gradients automatically via autodiff. The framework builds a computational graph, records operations, and applies the chain rule in reverse during the backward pass. You don’t implement backprop manually—you just call loss.backward() and the framework handles it. But understanding how backprop works helps you debug when training fails or gradients are wrong.
Gradient checking catches implementation bugs. When implementing custom layers or loss functions, use numerical gradient checking to verify that analytical gradients match finite-difference approximations. Compare backprop gradients to numerical gradients; if the relative error is below roughly $10^{-7}$, you're correct. This is expensive (it requires $O(n)$ forward passes for $n$ parameters) but catches bugs that are invisible from loss curves. Use it during development, disable it in production.
Vanishing and exploding gradients are common in deep networks. Gradients are multiplied by weight matrices and activation derivatives at each layer. With sigmoid/tanh, derivatives below 1 cause gradients to decay exponentially (vanishing). With poor initialization or high learning rates, gradients grow exponentially (exploding). Monitor gradient norms per layer during training: log a warning if norms fall outside a healthy range (e.g., roughly $10^{-6}$ to $10^{2}$). Fix vanishing with ReLU activations and He initialization; fix exploding with gradient clipping and lower learning rates.
Memory scales with depth and batch size. Backprop requires storing all activations from the forward pass; memory usage is proportional to depth × batch size × layer width. Deep networks with large batches can require tens of GB. Gradient checkpointing trades compute for memory: store activations every $k$th layer, recompute intermediate activations during backprop. This enables training 2-3× deeper networks on the same hardware at the cost of 20-30% slower training.
Computational cost of backward pass ≈ 2× forward pass. The backward pass must compute gradients for every parameter, requiring matrix multiplies similar to the forward pass but with transposed weight matrices. Plus storing and retrieving activations. In practice, training (forward + backward) takes roughly 3× the time of inference (forward only). This is why inference is cheap but training is expensive—you’re doing 3× the work per example.
Gradient clipping prevents divergence. When gradients explode (norm > threshold), clip them to a maximum norm. For RNNs, clip to norm 1.0; for transformers, clip to 5.0. This prevents single large gradients from destabilizing training. Clipping is standard in production systems—it’s a form of regularization that improves stability without hurting convergence.
Automatic differentiation is a superpower. Autodiff enables rapid experimentation: define a new architecture, run forward pass, call .backward(), and gradients are computed correctly. This abstraction is why deep learning research moves so fast. You can compose operations arbitrarily (attention, convolution, recurrence), and autodiff computes gradients for the composition. Understanding this unlocks the ability to design custom architectures without deriving gradient equations by hand.
The lesson: Backpropagation is the engine of deep learning. It automates gradient computation, enabling gradient descent on deep networks. Understanding how gradients flow backward—how error signals propagate through layers, how they can vanish or explode, how frameworks compute them automatically—explains why training works and why it sometimes fails. Master backprop, and you understand how neural networks learn.
References and Further Reading
Learning Representations by Back-Propagating Errors – Rumelhart, Hinton, Williams (1986) https://www.nature.com/articles/323533a0
This is the paper that popularized backpropagation and sparked the first neural network renaissance. Rumelhart, Hinton, and Williams showed that backpropagation enables training multi-layer networks, overcoming limitations of single-layer perceptrons. The paper is remarkably readable for a foundational work—it explains the chain rule application step-by-step and demonstrates on concrete problems. Reading this gives historical context for why backpropagation was revolutionary—it made deep learning possible by solving the credit assignment problem in a computationally tractable way.
Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen http://neuralnetworksanddeeplearning.com/chap2.html
Nielsen derives backpropagation step-by-step with clear explanations and visualizations. He shows the matrix formulations, explains why the algorithm is efficient (O(n) instead of O(n²)), and provides interactive code. This is the best pedagogical treatment of backpropagation available—it assumes only basic calculus and linear algebra, building up from first principles. Reading this will solidify your understanding of how gradients are computed and why backprop is the only practical way to train deep networks.
Yes You Should Understand Backprop – Andrej Karpathy https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b
Karpathy argues that understanding backpropagation is essential for debugging neural networks. He walks through common issues—vanishing gradients, dead ReLUs, initialization problems—and shows how understanding backprop helps diagnose them. The post includes practical debugging tips: how to detect gradient problems, what gradient norms should look like, when to use gradient checking. Reading this will convince you that backpropagation is not just theory; it’s a practical debugging tool. When training fails, understanding backprop lets you identify whether the problem is in gradient computation, optimization, or architecture.