Chapter 12: The Forward Pass

Layers as Transformations

A neural network is a sequence of transformations. Each layer takes an input vector, transforms it, and passes the result to the next layer. The forward pass is the process of pushing data through these transformations, from raw input to final prediction.

Each layer performs two operations:

  1. Linear transformation: Compute weighted sums (multiply by weight matrix, add bias)
  2. Nonlinear activation: Apply activation function element-wise

If layer $l$ receives input $\mathbf{a}^{(l-1)}$, it computes:

\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}
\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})

Where:

  • $\mathbf{W}^{(l)}$ is the weight matrix for layer $l$ (rows = neurons in this layer, columns = neurons in previous layer)
  • $\mathbf{b}^{(l)}$ is the bias vector
  • $f$ is the activation function (applied element-wise)
  • $\mathbf{z}^{(l)}$ is the pre-activation (before nonlinearity)
  • $\mathbf{a}^{(l)}$ is the activation (after nonlinearity)

The output of one layer becomes the input to the next. Data flows forward: input → layer 1 → layer 2 → … → output. At each step, the representation changes. The input space is repeatedly stretched, rotated, folded, and warped by the sequence of linear and nonlinear transformations.

Geometrically, each layer performs an affine transformation (linear map plus shift) followed by a nonlinear warping. The affine transformation aligns the data with learned decision boundaries. The nonlinearity introduces bends and curves, allowing the network to partition the space in complex ways.
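
In code, one such layer is only a few lines. A minimal NumPy sketch, with made-up weights and $\tanh$ standing in for the activation function:

```python
import numpy as np

def dense_layer(a_prev, W, b, activation=np.tanh):
    """One layer: affine transformation followed by an element-wise nonlinearity."""
    z = W @ a_prev + b    # pre-activation: weighted sums plus bias
    return activation(z)  # activation: nonlinear warping, applied element-wise

# Hypothetical 3-input -> 2-neuron layer with fixed weights for illustration.
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 2.0, 3.0])

a = dense_layer(x, W, b)
print(a.shape)  # (2,)
```

Stacking calls to `dense_layer`, feeding each output into the next call, is the entire forward pass.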

Information Flow: What Is Preserved, What Is Lost

As data flows through a network, information is transformed and sometimes discarded. Early layers preserve most of the input information while extracting low-level features. Deeper layers discard irrelevant details and amplify task-relevant patterns.

Consider an image classification network:

  • Input: A 224×224×3 image (150,528 numbers representing RGB pixel values)
  • Layer 1: Applies 64 convolutional filters, producing 64 feature maps. The representation is now 64 channels instead of 3, encoding edges and textures.
  • Layer 2: Applies more filters and pooling, reducing spatial resolution. Details are discarded; patterns are preserved.
  • Later layers: Continue abstracting. By layer 10, the representation might be a 512-dimensional vector encoding high-level features like “furry texture,” “pointy ears,” “whiskers.”
  • Output layer: Maps the 512-dimensional feature vector to 1,000 class probabilities.

At each step, the network trades specificity for abstraction. Early representations can reconstruct the input with high fidelity—they preserve details. Late representations cannot reconstruct the input—they’ve thrown away pixel-level information—but they encode what matters for the task.

This progressive abstraction is lossy compression. The network discards information that doesn’t help prediction and amplifies information that does. A dog classifier doesn’t need to remember exact pixel colors or background details. It needs to remember “this looks like dog features.” The forward pass performs this compression hierarchically.

[Diagram: Information Flow: What Is Preserved, What Is Lost]

The diagram shows how information is transformed through layers. Raw information content decreases (the network cannot reconstruct the input from deep layers), but task-relevant information increases. The network compresses data into a form useful for prediction.

Intermediate Representations: Why Hidden Layers Exist

Hidden layers—layers between input and output—learn intermediate representations. These representations are internal feature spaces that make the final classification or prediction easier.

Without hidden layers, the network is just a linear model with a nonlinearity at the output. It can only learn linear decision boundaries (or simple nonlinear boundaries if the input features are good). With hidden layers, the network can learn hierarchical features that transform the problem into one that’s linearly separable.

Consider the XOR problem: predict 1 if inputs are (0,1) or (1,0), and predict 0 if inputs are (0,0) or (1,1). This is not linearly separable in the input space—no straight line separates the two classes. A single-layer network (perceptron) cannot solve it.

But a two-layer network can. The first layer transforms the inputs into a new space where XOR becomes linearly separable. The second layer (output) draws a linear boundary in this new space. The hidden layer creates the right representation.
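
This can be verified directly. The weights below are hand-picked (one of many possible solutions), not learned; the hidden ReLU layer remaps the four input points so that a single linear readout separates them:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Hand-picked weights for a 2-2-1 ReLU network that computes XOR.
W1 = np.array([[1.0, 1.0],    # h1 counts how many inputs are on
               [1.0, 1.0]])   # h2 fires only when both inputs are on
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])    # output: h1 - 2*h2

def xor_net(x):
    h = relu(W1 @ x + b1)  # hidden layer: XOR is linearly separable in h-space
    return float(W2 @ h)   # linear readout in the new representation

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = xor_net(np.array(x, dtype=float))
    print(x, "->", int(y > 0.5))  # 0, 1, 1, 0
```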

This is the power of deep learning: the network learns the features it needs. Classical machine learning required humans to engineer features that made problems linearly separable. Neural networks engineer features automatically by learning intermediate representations.

The depth of the network—the number of hidden layers—determines how complex these representations can be. Shallow networks learn simple transformations. Deep networks learn compositions of transformations, enabling them to represent far more complex functions with fewer total parameters.

Computational Cost: FLOPs and Memory

Every forward pass has a computational cost measured in floating-point operations (FLOPs) and memory usage. Understanding these costs is essential for deploying models in production.

FLOPs (Floating Point Operations): Matrix multiplication dominates the forward pass. For a dense layer transforming an $m$-dimensional input to an $n$-dimensional output, the weight matrix is $n \times m$, and computing the output requires $n \times m$ multiplications (one dot product per neuron) plus $n$ additions for bias. Total: approximately $2nm$ FLOPs (multiply-add pairs).

Consider the MNIST example network:

  • Layer 1: 784 inputs → 128 neurons: $2 \times 784 \times 128 \approx 200\text{k}$ FLOPs
  • Layer 2: 128 inputs → 64 neurons: $2 \times 128 \times 64 \approx 16\text{k}$ FLOPs
  • Output layer: 64 inputs → 10 neurons: $2 \times 64 \times 10 \approx 1.3\text{k}$ FLOPs
  • Total per forward pass: ~220k FLOPs
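
The bullet arithmetic above generalizes to a one-line helper, a sketch using the $2mn$ multiply-add approximation (bias additions and activations are negligible by comparison):

```python
def dense_flops(layer_dims):
    """Approximate forward-pass FLOPs for a stack of dense layers:
    ~2*m*n multiply-adds per layer mapping m inputs to n outputs."""
    return sum(2 * m * n for m, n in zip(layer_dims, layer_dims[1:]))

# The MNIST example network: 784 -> 128 -> 64 -> 10
total = dense_flops([784, 128, 64, 10])
print(total)  # 218368, i.e. ~220k FLOPs
```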

As a rule of thumb, a modern CPU sustains on the order of 1 GFLOP (a billion floating-point operations) per millisecond. At ~0.22 MFLOPs, the MNIST network’s raw arithmetic is negligible; memory access and framework overhead dominate, so a forward pass takes roughly 0.2-0.5 ms on a CPU depending on implementation efficiency. A GPU’s raw arithmetic throughput is about 100× higher, but for a network this small, kernel launch overhead rather than FLOPs sets the floor.

Memory Requirements: The forward pass requires storing:

  • Weights: Fixed per layer. A 784×128 weight matrix contains ~100k parameters. At 32-bit floats, that’s 400KB.
  • Activations: Depend on batch size. For batch size $b$ and layer output dimension $n$, activations require $b \times n$ values. With batch size 32 and a 128-dimensional hidden layer, you store 32 × 128 = 4,096 values (16KB at 32-bit).
  • Total for MNIST network: ~500KB for weights, ~50KB for activations with batch size 32.
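
These estimates are easy to reproduce. A rough calculator (counting biases along with the weight matrices, and assuming 32-bit values) lands near the figures above:

```python
def memory_bytes(layer_dims, batch_size, bytes_per_value=4):
    """Rough memory estimate: weights + biases are fixed per model;
    activations scale with batch size. Assumes 32-bit floats by default."""
    weights = sum(m * n + n for m, n in zip(layer_dims, layer_dims[1:]))
    activations = batch_size * sum(layer_dims[1:])
    return weights * bytes_per_value, activations * bytes_per_value

w_bytes, a_bytes = memory_bytes([784, 128, 64, 10], batch_size=32)
print(w_bytes // 1024, "KB weights,", a_bytes // 1024, "KB activations")
```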

Large networks (ResNet-50, GPT-2) have millions or billions of parameters. A 1 billion parameter model requires 4GB for weights alone (at 32-bit precision). Activations scale with batch size and depth—deep networks with large batches can require tens of GB of activation memory during inference.

Why Batch Size Matters: GPUs parallelize across the batch dimension. Processing 32 examples with batch size 32 takes almost the same time as processing 1 example with batch size 1—the GPU executes 32 operations simultaneously. But larger batches require more memory. There’s a tradeoff:

  • Batch size 1: Low latency (single example processed immediately), poor GPU utilization
  • Batch size 32-128: Good GPU utilization, moderate latency (must wait for batch to fill)
  • Batch size > 512: Excellent throughput, but high memory usage and high latency (long wait for batches)

Rule of Thumb: 1 GFLOP ≈ 1ms on CPU, ≈ 0.01ms on modern GPU. Actual performance depends on memory bandwidth, batch size, and framework optimizations. Use this to estimate whether your model meets latency requirements.

From Raw Input to Output: A Concrete Example

Let’s walk through a simple network classifying handwritten digits (MNIST):

Input: A 28×28 grayscale image (784 pixels, values 0-255)

Layer 1 (Hidden): 128 neurons with ReLU activation

  • Weight matrix: 128 × 784 (each neuron has 784 weights)
  • Compute: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$ (128-dimensional vector)
  • Activate: $\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)})$ (128 features, each detecting some pattern in the image)

Layer 2 (Hidden): 64 neurons with ReLU activation

  • Weight matrix: 64 × 128
  • Compute: $\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)}$ (64-dimensional vector)
  • Activate: $\mathbf{a}^{(2)} = \text{ReLU}(\mathbf{z}^{(2)})$ (64 higher-level features)

Output Layer: 10 neurons with softmax activation (one per digit class)

  • Weight matrix: 10 × 64
  • Compute: $\mathbf{z}^{(3)} = \mathbf{W}^{(3)} \mathbf{a}^{(2)} + \mathbf{b}^{(3)}$ (10 scores)
  • Activate: $\mathbf{a}^{(3)} = \text{softmax}(\mathbf{z}^{(3)})$ (10 probabilities summing to 1)

The forward pass takes 784 input values, transforms them through two hidden layers (creating 128-dimensional and 64-dimensional representations), and outputs 10 probabilities. The network has learned that certain patterns in pixels → certain mid-level features → certain high-level features → certain digit classifications.
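
The whole pipeline fits in a short NumPy sketch. The weights here are random placeholders standing in for trained parameters, so the output probabilities are meaningless; the point is the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Randomly initialized weights (shapes match the text's 784 -> 128 -> 64 -> 10 network).
W1, b1 = rng.normal(0, 0.05, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (64, 128)), np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (10, 64)), np.zeros(10)

def forward(x):
    a1 = relu(W1 @ x + b1)        # 784 -> 128 features
    a2 = relu(W2 @ a1 + b2)       # 128 -> 64 higher-level features
    return softmax(W3 @ a2 + b3)  # 64 -> 10 class probabilities

probs = forward(rng.random(784))  # a fake "image": 784 random pixel values
print(probs.shape)  # (10,)
```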

During inference (making predictions), this forward pass is all that happens. The network applies the learned transformations to produce a prediction. Training is what determines the weights; inference just uses them.

Softmax: Converting Scores to Probabilities

The output layer produces raw scores (logits)—unbounded real numbers indicating how strongly the network believes each class is correct. To interpret these as probabilities, we apply the softmax function:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Where $z_i$ is the score for class $i$ and $K$ is the number of classes.

Why Exponential: The exponential $e^{z_i}$ ensures all values are positive. It also amplifies differences—a score of 10 becomes $e^{10} \approx 22{,}000$, while a score of 5 becomes $e^{5} \approx 148$. This makes the highest-score class dominate the probability distribution.

Why Normalize: Dividing by the sum $\sum_j e^{z_j}$ ensures the probabilities sum to 1, making them a valid probability distribution. The output can be interpreted as: “the model is 85% confident this is a 3, 10% confident it’s an 8, and 5% distributed across other digits.”

Numerical Stability: Naively computing softmax can cause overflow when scores are large (e.g., $e^{1000}$ exceeds floating-point range). The standard trick is to subtract the maximum score before exponentiating:

\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_{j} e^{z_j - \max(z)}}

This doesn’t change the result (it’s mathematically equivalent) but keeps values in a safe range.
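
The difference is easy to demonstrate: the naive version overflows on large scores, while the shifted version gives the correct distribution:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows for large scores
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # shift scores so the largest is 0
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)  # exp(1000) overflows to inf -> inf/inf = nan
stable = softmax_stable(z)

print(np.isnan(naive).any())  # True
print(stable.sum())           # 1.0
```

Shifting by the max changes every exponent by the same factor $e^{-\max(z)}$, which cancels in the ratio, so the result is identical in exact arithmetic.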

Cross-Entropy Loss with Softmax: During training, the cross-entropy loss measures how well predicted probabilities match true labels:

L = -\sum_{i} y_i \log(p_i)

Where $y_i$ is the true label (1 for the correct class, 0 for others) and $p_i$ is the predicted probability. The loss is minimized when predicted probabilities concentrate on the correct class. The softmax + cross-entropy combination has a clean gradient (Chapter 13), making it the standard for classification.
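
A sketch of the loss itself, with a small epsilon guarding against $\log 0$ (a standard implementation detail). A confident correct prediction incurs a much smaller loss than an uncertain one:

```python
import numpy as np

def cross_entropy(p, y):
    """Cross-entropy between predicted probabilities p and a one-hot label y."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps))

y = np.array([0.0, 0.0, 1.0])  # true class is index 2
confident = np.array([0.05, 0.05, 0.90])
uncertain = np.array([0.40, 0.35, 0.25])

print(round(cross_entropy(confident, y), 3))  # -log(0.90) ~ 0.105
print(round(cross_entropy(uncertain, y), 3))  # -log(0.25) ~ 1.386
```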

In Production: Often you don’t need full probabilities—you just need the top-k predictions. For example, an image classifier might return the top-5 classes and their probabilities. You can skip computing probabilities for classes with very low scores, saving computation. For many applications, the argmax (highest-scoring class) is all you need.
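
One useful consequence: softmax is monotonic, so the argmax and the top-k classes can be read straight off the logits without exponentiating at all:

```python
import numpy as np

logits = np.array([2.1, -0.3, 5.7, 1.2, 4.9])  # example raw scores

# Softmax preserves ordering, so top-k comes straight from the logits.
k = 2
topk = np.argsort(logits)[-k:][::-1]  # indices of the k highest scores
print(topk)  # [2 4]

# If calibrated confidences are needed, normalize only at the end.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(int(np.argmax(probs)) == int(np.argmax(logits)))  # True
```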

Inference in Production

The forward pass is what runs in production systems. Optimizing inference means minimizing latency and maximizing throughput while staying within hardware constraints.

Concrete Latency Numbers: The MNIST network (784 → 128 → 64 → 10) runs at:

  • CPU (single core): ~0.5ms per example (batch size 1)
  • GPU (V100): ~0.05ms per example at batch size 32 (~1.6ms for the full batch of 32)

For real production systems:

  • Ad ranking: Needs < 10ms end-to-end latency (including feature extraction, network inference, post-processing). The network must run in < 2-3ms.
  • Recommendation systems: Can tolerate 50-100ms latency since recommendations are precomputed or cached.
  • Real-time speech recognition: Requires < 100ms latency for natural interaction, constraining model size severely.
  • Image classification: Offline batch processing can use large models; real-time applications (mobile AR) need models < 10-50ms on mobile CPUs.

Batch Size Tradeoff: Production systems face a latency-throughput tradeoff:

  • Batch size 1 (online): Process each request immediately. Low latency (~1ms), but GPU is underutilized (only 5-10% of theoretical throughput).
  • Batch size 32-128 (mini-batch): Buffer requests for a few milliseconds, process together. Moderate latency (~10-50ms including buffering), high throughput (80-95% GPU utilization).
  • Batch size 512+ (offline): Process large batches. High latency (seconds for batch to fill), maximum throughput.

Most production systems use mini-batching with dynamic batch sizes: collect requests for 5-10ms, process the batch, return results. This balances latency and throughput.

GPU vs CPU Tradeoff: GPUs win for batch sizes > 8-16. For single examples, CPUs are often faster because GPU kernel launch overhead (~0.1-0.5ms) dominates computation time. Choose based on deployment constraints:

  • CPU: Edge devices, low-cost servers, single-example inference
  • GPU: Cloud services, high-throughput batch processing, large models

Mobile and Edge Deployment: Smartphones and embedded devices lack GPUs powerful enough for large models. Solutions:

  • Quantization: Use 8-bit integers instead of 32-bit floats. Reduces memory 4×, speeds up inference 2-4× on mobile CPUs, with < 1% accuracy loss.
  • Pruning: Remove unimportant weights, reducing model size 3-10× with minimal accuracy loss.
  • Knowledge distillation: Train a small “student” model to mimic a large “teacher” model, achieving 80-90% of the performance with 10× fewer parameters.

A ResNet-50 model (25M parameters, ~4 GFLOPs per image) runs at ~100ms on a modern smartphone CPU. Quantized to 8-bit and pruned 50%, it runs at ~20-30ms.
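
The core idea of quantization is simple. A sketch of symmetric per-tensor int8 quantization (real toolchains add per-channel scales, calibration data, and quantized kernels, all omitted here):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(0, 0.1, (64, 128)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale         # dequantize to check the error

print(q.nbytes, "bytes vs", w.nbytes)        # int8 storage is 4x smaller
print(float(np.max(np.abs(w - w_hat))) < scale)  # rounding error stays below one step
```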

Serving Infrastructure: Production ML systems use model servers (TensorFlow Serving, TorchServe, Triton) that handle batching, load balancing, and hardware management automatically. These servers accept individual requests over HTTP/gRPC, batch them intelligently, run inference on GPU, and return results. This abstracts away the complexity of optimal batching and hardware utilization.

Engineering Takeaway

Understanding the forward pass is critical because it’s what runs in production. Inference is forward pass only—no training, no backpropagation, just matrix multiplies and activations. Optimizing inference means optimizing these operations.

Forward pass is just matrix multiplies and nonlinearities. The entire inference process reduces to: multiply by weight matrices, add biases, apply activations, repeat. These are simple operations, but billions of them. Modern hardware (GPUs, TPUs) is optimized specifically for matrix multiplication, which is why neural networks can make millions of predictions per second despite millions of parameters.

Computational cost scales with width and depth. Each dense layer adds $O(\text{width}^2)$ FLOPs (for convolutions, $O(\text{width} \times \text{channels} \times \text{kernel\_size}^2)$). Deeper networks have more sequential operations; wider networks have more operations per layer. Latency is roughly proportional to total FLOPs, so architecture design must balance expressiveness with speed.

Batch processing is critical for throughput. GPUs parallelize across the batch dimension. Processing 32 examples together is almost as fast as processing 1 example alone—the GPU executes 32 operations simultaneously. This is why production systems buffer requests and process them in batches. But batching adds latency (waiting for the batch to fill), so there’s a tradeoff between latency and throughput.

Inference latency depends on hardware. CPUs are better for batch size 1 (single-example, low-latency inference). GPUs are better for batch sizes > 16 (high-throughput inference). A100 GPUs are 100× faster than CPUs for large batches, but only 2-5× faster for single examples due to kernel launch overhead. Choose hardware based on your latency and throughput requirements.

Memory is often the bottleneck. Large models (billions of parameters) require tens of GB just for weights. Activations scale with batch size and depth, adding GB more. GPU memory is limited (16-80GB on modern GPUs), constraining model size and batch size. Techniques like gradient checkpointing (recompute activations instead of storing them) and model parallelism (split model across multiple GPUs) are necessary for very large models.

Quantization enables edge deployment. Converting from 32-bit floats to 8-bit integers reduces memory 4× and speeds up inference 2-4× on CPUs, with < 1% accuracy loss. This makes neural networks viable on smartphones and embedded devices. Combined with pruning (removing unimportant weights) and distillation (training smaller models to mimic larger ones), 8-bit models run on devices with < 1GB RAM.

Softmax converts logits to probabilities. Raw network outputs are unbounded scores (logits). Softmax normalizes them to a probability distribution: all positive, sum to 1. This is essential for classification tasks where outputs must be interpretable as confidence levels. Softmax is combined with cross-entropy loss during training, which has a clean gradient that makes optimization efficient.

The lesson: The forward pass is how neural networks make predictions. It’s a sequence of linear transformations and nonlinearities, transforming raw input into task-specific representations and finally into outputs. Understanding this flow—what happens at each layer, computational cost, memory usage—is essential for deploying models in production. Optimize the forward pass, and you optimize inference.


References and Further Reading

Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen http://neuralnetworksanddeeplearning.com/chap2.html

Nielsen walks through the forward pass step-by-step with concrete examples and visualizations. He shows how data flows through the network and how representations change at each layer. The interactive visualizations let you see activations propagate through the network in real-time. Reading this will solidify your understanding of what actually happens during a forward pass, making the abstract matrix operations concrete and intuitive.

Deep Learning, Chapter 6.1-6.2 – Goodfellow, Bengio, Courville https://www.deeplearningbook.org/contents/mlp.html

This covers the mathematics of feedforward networks rigorously, including the matrix formulations and computational graphs. It explains why deeper networks can represent functions more efficiently than shallow ones (the depth efficiency of composition). The theoretical analysis of representational power—what functions networks can and cannot learn—is essential for understanding the limits and capabilities of neural architectures. This is the authoritative reference for the theory behind layer composition.

Efficient Processing of Deep Neural Networks: A Tutorial and Survey – Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer (2017) https://arxiv.org/abs/1703.09039

This comprehensive survey covers all aspects of neural network inference optimization: computational costs, memory bandwidth, hardware accelerators, quantization, pruning, and architecture design for efficiency. It provides concrete FLOPs counts for common architectures and explains the hardware-software co-design principles that make modern inference fast. Essential reading for anyone deploying models in production or designing efficient architectures. The paper connects algorithmic choices (layer widths, activations) to hardware realities (GPU memory hierarchy, CPU cache), showing why certain designs are faster than others.