Chapter 11: Neurons as Math
A Neuron Is Not a Brain Cell
The term “neural network” invites biological metaphors. But artificial neurons are not brain cells, and neural networks do not work like brains. An artificial neuron is a mathematical function—nothing more, nothing less. It takes numbers as input, multiplies them by learned weights, adds a bias, and passes the result through a nonlinear function. That’s it.
This distinction matters because biological metaphors mislead. Brain cells are analog, adaptive, and interconnected in complex ways science barely understands. Artificial neurons are deterministic mathematical operations—addition, multiplication, and a simple nonlinearity. They’re called “neurons” for historical reasons, but the name is a distraction from what they actually do: they compute weighted sums and apply activations.
Understanding neural networks requires abandoning the biological framing and thinking geometrically. A neuron computes a function. A layer is a parallel collection of such functions. A network is a composition of layers. Training adjusts the parameters of these functions to minimize error. There’s no consciousness, no firing patterns, no synapses. Just functions, gradients, and optimization.
Weighted Sums: The Core Operation
A neuron’s computation begins with a weighted sum. Given input values x₁, x₂, …, xₙ, the neuron multiplies each input xᵢ by a learned weight wᵢ, then adds a bias term b:

z = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b

This is exactly the same computation as a linear model from Chapter 6. The weights encode the importance of each input. The bias shifts the result. The value z is called the pre-activation—it’s what the neuron computes before applying the nonlinearity.

Geometrically, z = 0 defines a hyperplane in the input space. Points on one side have z > 0; points on the other have z < 0. The neuron’s weights determine the orientation of this hyperplane, and the bias determines its position.
Consider a neuron with two inputs, x₁ and x₂:
- If w₁ = 1, w₂ = 1, b = −1, then z = x₁ + x₂ − 1
- When x₁ = 1 and x₂ = 1, we get z = 1 + 1 − 1 = 1 > 0
- When x₁ = 0 and x₂ = 0, we get z = 0 + 0 − 1 = −1 < 0
The neuron separates the input space into two regions. Inputs that produce z > 0 are on one side of the decision boundary; those that produce z < 0 are on the other.
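The decision-boundary behavior can be checked directly. This is a minimal sketch with illustrative weights (w₁ = 1, w₂ = 1, b = −1), not learned values:

```python
# A single neuron's pre-activation as a decision boundary.
# Illustrative weights and bias, not learned ones.
w1, w2, b = 1.0, 1.0, -1.0

def pre_activation(x1, x2):
    """Weighted sum z = w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b

# Points on opposite sides of the hyperplane w1*x1 + w2*x2 + b = 0:
print(pre_activation(1.0, 1.0))  # 1.0  (z > 0: one side of the boundary)
print(pre_activation(0.0, 0.0))  # -1.0 (z < 0: the other side)
```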
Multiple neurons in a layer compute multiple weighted sums in parallel, each with its own weights and bias. If a layer has 100 neurons and receives 50 inputs, it performs 100 weighted sums, producing 100 outputs. This parallelism is why neural networks scale efficiently on GPUs—all neurons in a layer compute simultaneously.
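A whole layer of weighted sums is a single matrix multiply, which is exactly what GPUs accelerate. A minimal NumPy sketch with the sizes from the text (100 neurons, 50 inputs; random values for illustration):

```python
import numpy as np

# A layer of 100 neurons over 50 inputs: each row of W holds one
# neuron's weights, so all 100 weighted sums happen in one matmul.
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 50))   # 100 neurons x 50 inputs
b = np.zeros(100)                # one bias per neuron
x = rng.normal(size=50)          # one input vector

z = W @ x + b                    # 100 pre-activations, computed in parallel
print(z.shape)                   # (100,)
```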
Activation Functions: Why Nonlinearity Matters
The weighted sum alone would make the neuron a linear function. A network of linear functions is still linear—stacking linear transformations produces another linear transformation. To enable neural networks to approximate nonlinear functions, we apply a nonlinear activation function to the weighted sum:

a = f(z)

Where f is the activation function and a is the neuron’s output (the activation). Common activation functions include:
ReLU (Rectified Linear Unit): f(z) = max(0, z)
ReLU is zero for negative inputs and identity for positive inputs. It’s simple, computationally cheap, and works well in practice. Most modern networks use ReLU or its variants. ReLU dominates because it’s fast (just a comparison and max operation), doesn’t saturate for positive values (gradient is 1), and empirically trains deep networks better than sigmoid or tanh.
Sigmoid: σ(z) = 1 / (1 + e^(−z))
Sigmoid squashes any input to the range (0, 1). It was historically popular but has fallen out of favor for hidden layers due to vanishing gradients (Chapter 14). It’s still used for output layers in binary classification where outputs should be probabilities.
Tanh (Hyperbolic Tangent): tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Tanh outputs values in (−1, 1), centered at zero. It’s smoother than ReLU but also suffers from vanishing gradients. The zero-centering makes it slightly better than sigmoid for hidden layers, but ReLU is still preferred.
GELU (Gaussian Error Linear Unit): GELU(z) = z · Φ(z)
Where Φ is the cumulative distribution function of the standard normal distribution. GELU is a smooth approximation to ReLU that has become standard in transformers (BERT, GPT). It’s smoother than ReLU, which helps optimization in very deep networks, and it allows small negative values to pass through (unlike ReLU which zeroes them completely).
Swish / SiLU (Sigmoid Linear Unit): Swish(z) = z · σ(z)
Swish multiplies the input by its sigmoid. Like GELU, it’s a smooth alternative to ReLU that can improve performance on some tasks. It’s used in EfficientNet and other modern vision architectures.
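All of these are a few lines of NumPy. A minimal sketch (the GELU here uses the common tanh approximation rather than the exact normal CDF):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in transformer codebases
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z * sigmoid(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # negatives zeroed, positives passed through unchanged
print(sigmoid(z))  # every value squashed into (0, 1)
print(np.tanh(z))  # every value squashed into (-1, 1)
```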
To visualize the three most common activations: ReLU is piecewise linear (zero for negatives, identity for positives). Sigmoid smoothly transitions from 0 to 1. Tanh transitions from −1 to +1.
Without nonlinearity, stacking layers does nothing—you just get a deeper linear function equivalent to a single layer. With nonlinearity, each layer can learn complex transformations. The network becomes a universal function approximator, capable of learning any continuous function given enough neurons and layers.
When to Use Which Activation:
- ReLU: Default for hidden layers in convolutional networks (CNNs). Fast, works well, but watch for dead neurons.
- GELU or Swish: Use in transformers and very deep networks where smoothness helps optimization.
- Sigmoid: Output layer for binary classification (returns probability).
- Tanh: Sometimes used in recurrent networks (RNNs) for centering, but ReLU variants often work better.
- Leaky ReLU: f(z) = max(αz, z) with a small α (e.g., 0.01); allows a small gradient for negative values, preventing dead neurons.
Dead ReLU Problem: A ReLU neuron “dies” when its weights shift such that it always outputs 0 (pre-activation always negative). Once dead, the neuron receives zero gradient and never recovers. This happens when learning rates are too high or initialization is poor. Solutions: use Leaky ReLU or ensure proper initialization (next section).
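Both the fix and the diagnostic are easy to sketch. The monitoring heuristic below (flagging neurons whose pre-activation is negative for an entire batch) and the synthetic negatively-biased pre-activations are illustrative assumptions, not a standard API:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for z < 0 keeps a nonzero gradient,
    # so a neuron stuck in the negative regime can recover.
    return np.where(z > 0, z, alpha * z)

# Monitoring sketch: fraction of neurons whose pre-activation is
# negative for every example in a batch -- candidate "dead" ReLU units.
rng = np.random.default_rng(0)
Z = rng.normal(loc=-2.0, size=(32, 100))  # toy pre-activations, biased negative
dead_fraction = np.mean((Z <= 0).all(axis=0))
print(f"possibly dead: {dead_fraction:.0%}")
```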
Initialization: Why Starting Points Matter
Before training, network weights are randomly initialized. This initialization determines whether training succeeds or fails. Poor initialization causes vanishing or exploding activations, which lead to vanishing or exploding gradients, which prevent learning.
Why Random Initialization: If all weights start at the same value (e.g., zero), all neurons in a layer compute the same function and receive the same gradient updates. They stay identical throughout training—the network never learns diverse features. Random initialization breaks this symmetry: each neuron starts with different weights, computes different functions, and learns different patterns.
Xavier / Glorot Initialization: Designed for sigmoid and tanh activations. Weights are sampled from a distribution with variance:

Var(w) = 1 / n_in

Where n_in is the number of inputs to the neuron. This keeps activations from growing or shrinking as they pass through layers. If weights are too large, activations explode; too small, they vanish. Xavier initialization keeps activations in a reasonable range.
He Initialization: Designed specifically for ReLU activations. Weights are sampled with variance:

Var(w) = 2 / n_in
The factor of 2 accounts for ReLU zeroing out half the inputs. He initialization prevents activations from vanishing in ReLU networks. Without it, deep ReLU networks often fail to train—activations shrink exponentially with depth until they become negligible.
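The vanishing-activation failure is easy to reproduce. A minimal NumPy sketch (the depth, width, and "too small" scale of 0.01 are illustrative choices, not from the text):

```python
import numpy as np

# Activation scale through a deep ReLU stack under two weight scales.
rng = np.random.default_rng(0)
n, depth = 256, 30
x0 = rng.normal(size=n)

def forward(std):
    """Push x0 through `depth` ReLU layers with weights ~ N(0, std^2)."""
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(scale=std, size=(n, n))
        x = np.maximum(0.0, W @ x)
    return np.std(x)

print(forward(np.sqrt(2.0 / n)))  # He scaling: activations keep a usable magnitude
print(forward(0.01))              # too-small weights: activations collapse toward zero
```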
What Happens with Bad Initialization:
- Weights too large: Activations grow exponentially through layers, causing numerical overflow (values become inf or NaN).
- Weights too small: Activations shrink exponentially, approaching zero. Gradients also shrink, making learning impossibly slow.
- All weights the same: Neurons in a layer remain identical, so the network can’t learn different features.
In Practice: Modern frameworks (PyTorch, TensorFlow) ship sensible defaults—variants of He initialization for ReLU layers and Xavier-style schemes elsewhere. Unless you have a good reason, use the defaults. Poor initialization is a common cause of training failures, especially in deep networks.
How Many Neurons Per Layer?
Layer width (number of neurons per layer) determines the network’s capacity—its ability to represent complex functions.
Too Few Neurons: If a layer has too few neurons, it cannot express the patterns needed for the task. A 10-neuron hidden layer trying to learn 100 distinct features will underfit. The network lacks the representational capacity to fit the training data, leading to high training error. This is high bias from Chapter 4—the model is too simple.
Too Many Neurons: If a layer has too many neurons relative to training data, the network can memorize the training set without learning generalizable patterns. A 1000-neuron layer trained on 100 examples will overfit. The network has too much capacity relative to the signal in the data. This is high variance—the model is too flexible.
Rule of Thumb: Start with powers of 2 (64, 128, 256, 512) because GPUs are optimized for these sizes. Matrix multiplications are fastest when dimensions are multiples of 32 or 64 due to hardware parallelism. A 128-dimensional hidden layer runs significantly faster than a 100-dimensional layer on a GPU.
Width vs Depth Tradeoff: Both theory and practice suggest that deep narrow networks are more parameter-efficient than shallow wide networks for many function classes. A network with 3 layers of 100 neurons each (300 neurons total) can represent more complex functions than a single layer with 300 neurons. Depth enables compositional learning: early layers learn simple features, later layers combine them into complex features. This hierarchical structure (Chapter 15) is why modern networks are deep rather than wide.
Connection to Universal Approximation Theorem: A single hidden layer can approximate any continuous function to arbitrary accuracy—given enough neurons. But “enough” can be impractically many. Deep networks achieve similar expressiveness with far fewer neurons by exploiting composition. Two hidden layers with 128 neurons each (256 neurons total) can represent functions that a single layer would need thousands of neurons to approximate.
In Practice: For tabular data, 2-3 hidden layers with 128-512 neurons per layer work well. For vision, use deep convolutional architectures (ResNet, EfficientNet). For language, use transformers with 512-4096 hidden dimensions. Start with standard architectures and adjust layer widths based on validation performance.
Neurons as Feature Detectors
A neuron doesn’t just compute—it detects patterns. The weights encode a pattern, and the neuron activates strongly when the input matches that pattern. The activation function determines how sharp the response is.
Consider a neuron in an image recognition network with weights that form an edge detector—positive weights in one region, negative in another. When the input image has an edge aligned with this pattern, the weighted sum is large and positive. When there’s no edge, the weighted sum is near zero. The ReLU activation passes the strong response through and suppresses the weak response.
This is pattern matching by dot product. High weights amplify inputs that align with the pattern; low or negative weights suppress irrelevant inputs. The bias determines the threshold—how strong the match must be before the neuron activates.
In deeper layers, neurons detect more abstract patterns. A neuron in layer 1 might detect a vertical edge. A neuron in layer 2 combines edge detectors to recognize a corner. A neuron in layer 3 combines corners to recognize shapes. Each neuron becomes a feature detector for increasingly complex patterns.
The power of neural networks comes from composing many simple feature detectors. Each neuron is just a weighted sum and a nonlinearity, but thousands of neurons arranged in layers can detect arbitrarily complex patterns. A cat detector is built from shape detectors, which are built from edge detectors, which are built from pixel patterns—all learned automatically during training.
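Pattern matching by dot product can be shown with a toy 1-D “edge detector.” The weights below are hand-picked for illustration, not learned:

```python
import numpy as np

# Positive weights on the left half, negative on the right:
# this neuron responds to a bright-to-dark step.
w = np.array([1.0, 1.0, -1.0, -1.0])
b = -1.0  # threshold: the match must be strong enough to activate

edge = np.array([1.0, 1.0, 0.0, 0.0])   # step input: matches the pattern
flat = np.array([0.5, 0.5, 0.5, 0.5])   # uniform input: no edge

def relu(z):
    return max(0.0, z)

print(relu(w @ edge + b))  # 1.0: strong response passed through
print(relu(w @ flat + b))  # 0.0: weak response suppressed
```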
A Production Example: Recommendation Ranking
Consider a real recommendation system ranking products for users. The input is a feature vector combining user features (age, location, purchase history) and item features (price, category, popularity). The network predicts a ranking score—higher scores mean the user is more likely to engage with the item.
Architecture:
- Input: 128 features (user: 64 dims, item: 64 dims)
- Hidden layer 1: 256 neurons, ReLU activation
- Hidden layer 2: 128 neurons, ReLU activation
- Hidden layer 3: 64 neurons, ReLU activation
- Output: 1 neuron, no activation (raw score)
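A forward pass through this funnel architecture is a few matrix multiplies. This is a minimal NumPy sketch with randomly initialized (He-scaled) weights, not a trained production model:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def init_layer(n_in, n_out, rng):
    # He initialization, matching the ReLU hidden layers
    W = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_out, n_in))
    return W, np.zeros(n_out)

# Funnel from the text: 128 -> 256 -> 128 -> 64 -> 1.
rng = np.random.default_rng(0)
sizes = [128, 256, 128, 64, 1]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def ranking_score(x):
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:   # ReLU on hidden layers, raw score at the output
            x = relu(x)
    return x[0]

features = rng.normal(size=128)   # user (64 dims) + item (64 dims), toy values
print(ranking_score(features))    # a single raw ranking score
```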
Why These Widths: The first hidden layer expands from 128 to 256 to create a richer representation space, allowing the network to learn complex interactions between user and item features. Subsequent layers compress this representation, distilling it into a single ranking score. The funnel shape (256 → 128 → 64 → 1) is common: expand to learn features, then compress to make decisions.
Inference Latency: On a CPU, this network takes ~1-2ms per example. On a GPU with batch size 32, it takes ~0.1ms per example (~3ms for the batch). Production systems batch requests to exploit GPU parallelism. Ranking systems often require < 10ms end-to-end latency (including feature extraction), so ~1-2ms for the network is acceptable.
Why Powers of 2: The widths 256, 128, 64 are powers of 2, optimizing GPU memory access patterns. A 250-neuron layer may be padded to 256 internally by GPU kernels, so the computation for those 6 padded units is wasted. Using explicit powers of 2 makes this padding intentional and ensures efficient hardware utilization.
This architecture is simple but effective: 3 hidden layers with ReLU activations can learn complex nonlinear relationships between user and item features, producing accurate ranking scores with low latency.
Engineering Takeaway
Neural networks are not mysterious. They’re differentiable programs—functions composed of simple operations that can be trained by gradient descent.
Neurons are weighted sums plus nonlinearity. Each neuron computes a = f(w · x + b), where f is an activation function. This simple computation, repeated thousands of times in parallel across layers, produces powerful function approximation. The geometry is clear: each neuron defines a hyperplane, and the activation determines behavior relative to that hyperplane.
Activation functions enable universal approximation. Without nonlinearity, stacking layers produces another linear function—the network collapses to the equivalent of linear regression. With nonlinearity (ReLU, sigmoid, tanh), networks can approximate any continuous function given sufficient neurons and depth. The universal approximation theorem guarantees this, but in practice, deep networks with ReLU are the most efficient architecture.
ReLU dominates practice for hidden layers. It’s fast (just max(0, z)), doesn’t saturate for positive values (gradient is 1), and trains deep networks reliably. Alternatives like GELU and Swish work well in transformers where smoothness helps optimization. Sigmoid is reserved for output layers when probabilities are needed. But for most hidden layers in most architectures, ReLU is the default.
Initialization breaks symmetry and enables training. Random initialization ensures neurons start with different weights and learn different features. He initialization (for ReLU) and Xavier initialization (for sigmoid/tanh) scale weights based on layer width to prevent vanishing or exploding activations. Poor initialization is a common failure mode—if training doesn’t converge, check your initialization scheme before adjusting learning rates or architectures.
Width determines expressiveness but has diminishing returns. More neurons per layer = more capacity to learn patterns. But doubling the width doesn’t double accuracy—gains diminish as width increases. A 256-neuron layer is much more powerful than a 64-neuron layer, but a 1024-neuron layer is only slightly better than 512. Width should match problem complexity: use wider layers for complex tasks (vision, language) and narrower layers for simpler tasks (tabular data).
Dead neurons are a real problem you must monitor. ReLU neurons can “die” during training if their pre-activations become permanently negative, causing them to always output 0. Once dead, they receive zero gradient and never recover. Monitor activation statistics (fraction of neurons with non-zero outputs) during training. If > 30% of neurons are dead, your learning rate is too high or initialization is poor. Use Leaky ReLU or adjust hyperparameters.
Foundation for all architectures. CNNs, RNNs, transformers, GANs—every neural network architecture is built from neurons. Convolutional layers are neurons with weight-sharing. Attention mechanisms are neurons with learned weights. Transformers are deep stacks of neurons with specific connectivity. Understand the neuron, and you understand the fundamental building block of all modern AI systems.
The lesson: Neurons are simple mathematical functions, not biological mysteries. Each neuron computes a weighted sum and applies a nonlinearity. The magic isn’t in individual neurons—it’s in how thousands of them compose through layers, how training adjusts their weights to minimize error, and how architectural choices (depth, width, activation functions) determine what functions the network can learn. Master the neuron, and you demystify neural networks.
References and Further Reading
Neural Networks and Deep Learning – Michael Nielsen http://neuralnetworksanddeeplearning.com/
This is one of the clearest introductions to neural networks ever written. Nielsen explains neurons, layers, backpropagation, and training from first principles with interactive visualizations. Chapter 1 covers what a neuron computes and why networks can learn complex functions. Reading this will give you a solid intuitive foundation before diving into deeper theory. Nielsen’s approach prioritizes understanding over formalism, making it perfect for engineers learning neural networks for the first time.
Deep Learning, Chapter 6 – Ian Goodfellow, Yoshua Bengio, Aaron Courville https://www.deeplearningbook.org/
This is the canonical textbook on deep learning. Chapter 6 covers feedforward networks, including the mathematical details of neurons, activations, and universal approximation theorems. It’s more rigorous than Nielsen but essential for understanding why neural networks work mathematically. The universal approximation theorem is proved, activation functions are analyzed formally, and the connection to computational complexity is explored. If you want the theoretical foundations, read this.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification – Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015) https://arxiv.org/abs/1502.01852
This paper introduces He initialization, now the standard for ReLU networks, and shows why proper initialization is critical for training very deep networks. He et al. demonstrate that Xavier initialization fails for ReLU activations because ReLU zeros out half the gradients. By scaling the initialization variance by 2 instead of 1, He initialization compensates for this and enables training of networks with 30+ layers—previously impossible. The paper also introduces the Parametric ReLU (PReLU), a learnable alternative to Leaky ReLU. Understanding initialization is essential for debugging training failures, and this paper is the definitive reference.