Chapter 15: Representation Learning

From Pixels to Meaning

A raw image is a grid of numbers representing pixel intensities. These numbers, taken literally, contain no semantic meaning. Pixel (142, 87) being red tells you nothing about whether the image contains a dog, a cat, or a car. Yet humans instantly recognize objects. How?

The human visual system doesn’t process pixels. It extracts features hierarchically: edges, then shapes, then parts, then objects. By the time visual information reaches higher brain areas, it’s represented as concepts (“dog,” “running,” “outdoors”) rather than photoreceptor activations. The brain learned these representations through experience.

Neural networks do the same thing automatically. Early layers learn low-level features (edges, colors, textures). Middle layers combine these into mid-level features (corners, patterns, parts). Late layers combine those into high-level features (objects, scenes, categories). By the final layer, the network has transformed pixels into a representation where the task—classification, detection, segmentation—is easy.

This is representation learning: automatically discovering features that make subsequent prediction simple. It’s why deep learning succeeded where classical machine learning struggled on perceptual tasks. Hand-engineering features for images, speech, or text is extraordinarily hard. Learning them automatically is what neural networks do best.

Distributed Representations: Why Neurons Don’t Map 1-to-1

In classical feature engineering, each feature represents a specific, interpretable property: “contains the word ‘dog’,” “has vertical edges,” “red color histogram peak.” Each feature is independent and interpretable.

Neural networks don’t work this way. Features in deep networks are distributed: each neuron participates in representing many concepts, and each concept is represented by many neurons. A single neuron in a late layer doesn’t encode “dog”—it responds to some combination of shapes, textures, and patterns that happen to correlate with dogs (and other things).

This distributed encoding is more efficient. Suppose you want to represent 1,000 concepts. With one-hot encoding (one neuron per concept), you need 1,000 neurons. With distributed representations, you might only need 100 neurons, where each concept is encoded as a pattern of activations across those neurons. A binary pattern of length 100 can represent 2^100 ≈ 10^30 distinct concepts—exponentially more than one-hot encoding.
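The arithmetic behind this capacity claim is easy to check directly (a plain-Python sanity check, nothing model-specific):

```python
# Capacity of 100 binary neurons: one-hot coding distinguishes 100
# concepts, while distributed binary patterns distinguish 2**100.
one_hot_capacity = 100
distributed_capacity = 2 ** 100

print(distributed_capacity)             # 1267650600228229401496703205376
print(distributed_capacity > 10 ** 30)  # True: roughly 1.27e30 patterns
```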

This exponential efficiency comes from composition. Features are reusable. A “pointy ear” feature is useful for cats, dogs, foxes, and rabbits. A “vertical line” feature is useful for buildings, trees, and text. By combining reusable features, the network can represent a vast number of concepts without needing a neuron for each one.

The cost is interpretability. You cannot point to a single neuron and say “this detects dogs.” Instead, “dog” is encoded as a distributed pattern across many neurons. This makes neural networks harder to understand but far more powerful.

Emergence: How Concepts Appear

Deep networks don’t start with meaningful representations. Initially, weights are random, and activations are noise. But through training—adjusting weights to minimize loss—the network organizes its internal representations to support the task.

As training progresses, structure emerges:

  • Early layers develop general low-level features (edges, blobs) useful across many tasks
  • Middle layers develop task-specific mid-level features (textures, patterns, parts)
  • Late layers develop task-specific high-level features (objects, categories, concepts)

This emergence is not programmed. The network is only told to minimize classification loss. The intermediate representations—what features to learn at each layer—are discovered automatically. The hierarchy emerges because it’s an efficient way to compress the mapping from inputs to outputs.

Why does hierarchy emerge? Because deep networks can represent hierarchical functions more efficiently than shallow ones. A function that combines low-level patterns into high-level concepts can be represented with O(n) parameters in a deep network but might require O(2^n) parameters in a shallow network. The exponential efficiency of depth encourages hierarchical organization.

This is why deep learning works: not because we told the network to learn hierarchical features, but because the optimization process discovers that hierarchical representations are efficient. The structure of the solution is shaped by the architecture, the data, and the task.

Diagram: how structure emerges during training. Before training, activations are random noise. After training, the network has organized into meaningful features at each layer, discovered automatically by gradient descent.

Why Emergence Happens: The Implicit Bias of SGD

Emergence isn’t magic. It happens because stochastic gradient descent has an implicit bias toward simple solutions—a form of Occam’s razor built into the optimization algorithm.

When there are multiple functions that fit the training data equally well (and in overparameterized networks, there are many such functions), SGD preferentially finds the “simplest” one. Simple here means something precise: the function with the smallest norm in parameter space, or equivalently, the function that compresses the data most efficiently.

Why does SGD prefer simple solutions? Because gradient descent follows the shortest path in weight space from the initialization to a solution. Starting from small random weights (near zero), gradient descent takes small steps, and the first solution it finds is the one that requires the smallest weight changes. This naturally favors low-complexity solutions—functions that can be expressed with small weights.

This is regularization through optimization. Even without explicit regularization (weight decay, dropout), SGD implicitly regularizes by preferring simple functions. This implicit bias prevents overfitting: among all the functions that perfectly fit training data, the network learns the one that generalizes.

The hierarchical structure that emerges is the simplest way to compress the mapping from inputs to outputs. Instead of memorizing every input-output pair, the network discovers reusable patterns. “Fur” is learned once and reused for cats, dogs, foxes. “Vertical edge” is learned once and reused for buildings, trees, letters. This compositional reuse is the compressed representation.

This connects back to the compression view of learning (Chapter 3). Networks that generalize well are those that compress training data into simple, reusable representations. The features that emerge during training are the compression: they’re the minimal description length of the patterns in the data. SGD’s implicit bias toward simplicity is why neural networks discover hierarchical features rather than memorizing examples.

Mathematical intuition: Gradient noise from mini-batch sampling acts as a regularizer. When gradients are noisy, optimization can only follow the strong, consistent signals—the generalizable patterns—because noise washes out the weak, idiosyncratic ones. This is why very large batch sizes (low noise) sometimes hurt generalization: removing the noise removes a beneficial regularizer that prevents memorization.

What Do Layers Learn? Concrete Examples

The hierarchy isn’t abstract—you can visualize it. Examining what neurons respond to at different depths reveals the progression from pixels to meaning.

Vision Networks (ResNet, VGG, AlexNet):

  • Layer 1: Simple features—edges at various orientations, color blobs, frequency gradients. These are universal: every image has edges, and every vision network learns similar Layer 1 features regardless of the task.
  • Layers 2-3: Intermediate features—corners, curves, simple patterns, texture repetitions (stripes, grids, dots). These combine edges into slightly more complex structures.
  • Layers 4-5: Object parts—wheels, eyes, fur, windows, faces, legs. These are recognizable components that appear in multiple object categories.
  • Final layers: Full objects and scenes—dogs, cars, buildings, outdoor scenes. By the final layer, the network has transformed pixels into semantic categories.

Language Models (BERT, GPT, Transformers):

  • Early layers: Syntax and grammar—part-of-speech tagging, syntactic dependencies, phrase structure. Early layers parse the structure of language.
  • Middle layers: Semantics—word meanings, coreference resolution (what “it” refers to), entity relationships. Middle layers understand what the text is about.
  • Late layers: Task-specific features—sentiment (positive/negative), named entities (person, organization, location), question-answering patterns. Late layers adapt to the specific prediction task.

How We Know This:

  • Activation maximization: Find inputs that maximize a neuron’s activation. For a Layer 1 neuron, you get simple edges. For a Layer 5 neuron, you get complex object parts.
  • Saliency maps: Compute gradient of output with respect to input. Shows which pixels most influence the prediction. Early layers have diffuse saliency; late layers focus on semantically meaningful regions.
  • GradCAM: Weighted combination of activations at a layer, visualized as a heatmap. Shows where the network is “looking” at different depths. Early layers attend everywhere; late layers attend to objects.
  • Linear probing: Freeze the network and train a linear classifier on its frozen representations. If linear probe accuracy is high, the representation is linearly separable—a sign of good features.

These techniques aren’t just research tools. In production, you can use them to debug why a network makes certain predictions and whether it has learned the right features.
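Activation maximization is easiest to see on a toy case where the answer is known in closed form. The numpy sketch below maximizes a linear "neuron's" activation w·x over unit-norm inputs by projected gradient ascent; for a linear neuron the optimum is w/‖w‖, so we can verify the procedure converges. (This is an illustrative sketch, not production tooling—on a real network you would run the same gradient ascent through the full model.)

```python
import numpy as np

# Activation maximization on a toy linear "neuron": find the unit-norm
# input that maximizes the activation w.x. Analytic answer: x* = w/||w||.

rng = np.random.default_rng(0)
w = rng.normal(size=16)           # the neuron's weight vector

x = rng.normal(size=16)
x /= np.linalg.norm(x)            # start from a random unit-norm input
for _ in range(200):
    x += 0.1 * w                  # gradient of w.x with respect to x is w
    x /= np.linalg.norm(x)        # project back onto the unit sphere

optimum = w / np.linalg.norm(w)
print(np.allclose(x, optimum, atol=1e-3))  # True: ascent found x* = w/||w||
```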

Representation Quality: How to Measure What’s Learned

Not all representations are equally good. How do you measure whether your network has learned useful features?

Linear Probing:

Freeze the network’s weights, remove the final classification layer, and train a simple linear classifier on the frozen representations (activations from the second-to-last layer). If the linear classifier achieves high accuracy, the representation is linearly separable—the network has done the hard work of transforming data into a space where a linear boundary works.

Linear probing is a diagnostic. If probe accuracy is low despite good end-to-end accuracy, the representation is poor and the final layer is doing too much work. If probe accuracy is high, the representation is good, and you can use it for transfer learning.
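The probing recipe can be sketched in a few lines of numpy. Here the `features` matrix is a stand-in for frozen penultimate-layer activations (synthetic, well-separated clusters—a hypothetical dataset, not from the text), and the probe is a closed-form ridge-regression classifier:

```python
import numpy as np

# Minimal linear-probe sketch: fit only a linear classifier on top of
# frozen "representations" and measure its accuracy. High probe accuracy
# means the representation is (nearly) linearly separable.

rng = np.random.default_rng(0)
n, d = 200, 32

# Stand-in for frozen representations: two well-separated clusters.
features = rng.normal(size=(n, d))
labels = (rng.random(n) < 0.5).astype(float)
features[labels == 1] += 2.0      # shift class-1 representations

# Closed-form ridge-regression probe (labels in {0,1}, threshold at 0.5).
X = np.hstack([features, np.ones((n, 1))])    # append a bias column
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d + 1), X.T @ labels)
preds = (X @ w > 0.5).astype(float)

probe_acc = (preds == labels).mean()
print(probe_acc > 0.95)   # True: separable features make the probe easy
```

With a real network you would replace `features` with cached penultimate-layer activations and keep everything else the same.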

Representation Similarity:

Compare representations across different models or layers. Centered Kernel Alignment (CKA) measures how similar two sets of representations are, even if they live in different dimensional spaces. High CKA means the networks have learned similar features; low CKA means they’ve learned different features.

This is useful for understanding when two architectures are fundamentally similar (ResNet vs VGG) vs when they’re genuinely different (CNN vs Transformer). It also helps track training dynamics: you can see when representations stabilize during training.
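Linear CKA (the simplest variant) fits in a few lines of numpy. The sketch below checks its defining properties on random data: a rotated copy of a representation scores exactly 1, while an unrelated representation scores near 0:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (examples x features)."""
    X = X - X.mean(axis=0)            # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# CKA is invariant to orthogonal transforms: a rotated copy scores 1.0,
# even though no individual feature dimension matches.
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
print(round(linear_cka(X, X @ Q), 6))   # 1.0

# Unrelated random representations score low.
Z = rng.normal(size=(100, 20))
print(linear_cka(X, Z) < 0.3)           # True
```

The rotation invariance is the point: two layers can use completely different neuron-level codes yet still represent the same information, and CKA sees through that.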

Transfer Learning Performance:

The ultimate test: fine-tune the representation on a downstream task. If the pretrained representation generalizes well to new tasks with minimal fine-tuning, it’s a good representation. If it requires extensive retraining, the representation is overfitted to the original task.

In production, evaluate representations on multiple downstream tasks simultaneously. A representation that works across many tasks (object detection, segmentation, classification) is more valuable than one tuned for a single task.

Practical Tip: When training a new model, log linear probe accuracy on a validation set every few epochs. If probe accuracy plateaus while training accuracy keeps increasing, you’re overfitting. The representation has stopped improving, and the network is just memorizing through the final layer.

Domain-Specific Representations

The structure of learned representations depends on the domain. Different modalities have different inductive biases built into their architectures.

Vision (Images, Video): Spatial hierarchy. Nearby pixels are related, and objects have spatial coherence. Convolutional networks (CNNs) encode this bias: they detect local patterns (edges, textures) and build up to global patterns (objects, scenes). The hierarchy mirrors the spatial structure: low-level features are local, high-level features are global.

Language (Text, Code): Sequential dependencies. Words depend on previous words, and meaning emerges from context. Recurrent networks (RNNs, LSTMs) and Transformers encode this bias: they process sequences left-to-right or attend to relevant context. The hierarchy mirrors linguistic structure: tokens → phrases → sentences → documents.

Audio (Speech, Music): Time-frequency patterns. Audio is both temporal (events over time) and spectral (frequencies present at each moment). Networks for audio (WaveNet, Conformer) combine convolutional layers (for frequency patterns) with recurrent or attention layers (for temporal dependencies). The learned features progress from spectral patterns to phonemes, words, and prosody.

Graphs (Social Networks, Molecules): Node and edge relationships. Information flows along edges, and structure determines function. Graph neural networks (GNNs) encode this bias: they aggregate information from neighbors and propagate messages. The learned features are node embeddings that capture both local structure (immediate neighbors) and global topology (community structure, paths).

The key insight: architecture choice encodes assumptions about the data structure. When the architecture’s inductive bias matches the domain structure, learning is efficient and representations generalize. When they mismatch, the network must work harder to learn basic patterns, and generalization suffers.

Why Deep Beats Shallow

A shallow network (one or two hidden layers) can approximate any continuous function—this is the universal approximation theorem. So why use deep networks?

Because deep networks are exponentially more efficient. A shallow network might require exponentially many neurons to approximate a function that a deep network represents with linearly many neurons. Depth enables compositional efficiency.

Consider a function that checks if an image contains specific combinations of patterns: “fur AND pointy ears” or “wheels AND windows.” A shallow network must have separate neurons for every possible combination. If there are n binary patterns, this requires 2^n neurons.

A deep network can compute the same function hierarchically:

  • Layer 1: Detect n individual patterns (edges, textures)
  • Layer 2: Detect n combinations of layer 1 patterns (fur, ears, wheels, windows)
  • Layer 3: Detect combinations of layer 2 patterns (cat, car)

This requires O(n) neurons per layer and a few layers—linear in n, not exponential. The depth allows the network to reuse computations: “fur” is computed once and used in multiple higher-level concepts.
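The counting argument can be written out directly (a back-of-envelope sketch of the bound, not a statement about any particular architecture):

```python
# Neurons needed to cover all combinations of n binary patterns:
# a shallow network dedicates one neuron per possible combination (2**n),
# while a 3-layer hierarchical network reuses about n detectors per layer.

def shallow_neurons(n):
    return 2 ** n           # one neuron per possible combination

def deep_neurons(n, layers=3):
    return layers * n       # O(n) reusable detectors per layer

for n in (10, 20, 30):
    print(n, shallow_neurons(n), deep_neurons(n))
# n=30 already needs over a billion shallow neurons versus 90 deep ones.
```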

This exponential advantage is why deep learning succeeded. Shallow networks need prohibitively many neurons for complex tasks. Deep networks achieve the same expressiveness with far fewer parameters, enabling better generalization and faster training.

Engineering Takeaway

Representation learning is the core insight of deep learning. Networks automatically discover hierarchical features that make prediction easy. Understanding what representations are learned, how to measure their quality, and how to leverage pretrained representations is essential for building effective systems.

Representations are the key to generalization. If the learned features are good, the final task is easy—even a linear classifier achieves high accuracy. If the features are poor, no amount of tuning the final layer helps. When performance plateaus, improving representations (more data, better architecture, better pretraining) often matters more than hyperparameter tuning. Use linear probing (freeze the network, train a linear classifier on frozen features) to diagnose whether the representation is the bottleneck. If linear probe accuracy is low, improve the representation; if it’s high, tune the final layer.

Transfer learning is standard practice, not a research trick. A network pretrained on ImageNet learns general visual features: edges, textures, object parts. You can reuse these features for different tasks (medical imaging, satellite analysis, industrial inspection) by fine-tuning the final layers or using the pretrained network as a fixed feature extractor. This works because early layers learn general features and late layers learn task-specific features. Transfer learning reduces data requirements by 10-100× and training time by similar factors. In production, almost no one trains vision or language models from scratch—they start with pretrained models (ResNet, BERT, GPT, CLIP) and adapt them.

Embeddings are the interface between models and systems. The intermediate representations—especially from the second-to-last layer—are useful features for downstream tasks. Word embeddings (Word2Vec, GloVe, BERT) capture semantic meaning and power search, recommendation, clustering. Image embeddings (ResNet, CLIP) capture visual concepts and enable similarity search, zero-shot classification, multimodal retrieval. Embeddings are how you integrate neural networks into larger systems: extract embeddings, store them in a vector database, use them for retrieval or ranking. This is the foundation of modern search and recommendation.
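The retrieval loop described above reduces to a matrix multiply once embeddings are stored. This numpy sketch uses random vectors as hypothetical item embeddings and ranks them by cosine similarity against a query (a vector database does the same thing with approximate indexing at scale):

```python
import numpy as np

# Minimal embedding-retrieval sketch: store item embeddings in a matrix,
# embed a query, and return the top-k items by cosine similarity.

def cosine_top_k(query, index, k=2):
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n            # cosine similarity to every item
    top = np.argsort(-scores)[:k]         # indices of the k best matches
    return top, scores[top]

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))       # 1000 stored item embeddings

# A query that is a slightly perturbed copy of item 42 should retrieve it.
query = index[42] + 0.01 * rng.normal(size=64)
top, scores = cosine_top_k(query, index)
print(top[0])    # 42: the near-duplicate item ranks first
```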

Visualization reveals what the network learned. Use activation maximization (find inputs that maximize neuron activations) to see what patterns neurons detect. Use t-SNE or UMAP on final-layer embeddings to see if the network has separated classes. Use GradCAM to see where the network is “looking” when making predictions. If visualizations look random or nonsensical, the network hasn’t learned meaningful features—you have a training problem (bad data, bad architecture, bad optimization). Visualization is essential for debugging and building trust in production systems.

Depth has diminishing returns beyond a point. Deeper networks are more expressive but harder to train, slower to run, and eventually plateau in performance. ResNet-50 is usually better than ResNet-18 but only marginally better than ResNet-34. Very deep networks (ResNet-152, ResNet-1000) show small gains and require careful engineering (skip connections, normalization, initialization) to train at all. The practical sweet spot for most vision tasks: 18-50 layers. For language: 6-24 transformer layers. Beyond that, gains are small unless you’re training on massive datasets (billions of examples).

Architecture choice encodes inductive biases. CNNs assume spatial locality (nearby pixels are related). Transformers assume flexible attention (any token can attend to any other). RNNs assume sequential dependencies (current state depends on previous state). Graph neural networks assume relational structure (nodes connected by edges). When the architecture’s bias matches the data structure, learning is efficient and generalizes. When they mismatch, the network must work harder and generalization suffers. Choose architectures that match your domain: CNNs for images, Transformers for language, GNNs for graphs.

Monitor representation quality during training. Log linear probe accuracy on a validation set every few epochs. If probe accuracy plateaus while training accuracy increases, you’re overfitting—the representation has stopped improving, and the network is memorizing through the final layer. If probe accuracy and training accuracy both increase, the representation is still learning. Use representation similarity metrics (CKA) to track when different layers stabilize. In production, continuously evaluate whether representations remain useful on downstream tasks—distribution shift can degrade representation quality even if training metrics look fine.

The lesson: Deep learning works because networks learn hierarchical representations automatically. Early layers learn general features, late layers learn task-specific features. This eliminates the bottleneck of feature engineering and enables end-to-end learning from raw data. Understanding representation learning—what features emerge at each layer, how to measure representation quality, how to leverage pretrained representations—is the key to using deep learning effectively in production.


References and Further Reading

Representation Learning: A Review and New Perspectives – Yoshua Bengio, Aaron Courville, Pascal Vincent (2013) https://arxiv.org/abs/1206.5538

This paper surveys representation learning and explains why it’s the key to deep learning’s success. Bengio explains distributed representations, hierarchical features, and why deep networks learn better representations than shallow ones. Reading this gives you the theoretical foundation for understanding what makes deep learning powerful.

Visualizing and Understanding Convolutional Networks – Matthew Zeiler and Rob Fergus (2013) https://arxiv.org/abs/1311.2901

Zeiler and Fergus visualize what CNNs learn at each layer, showing that early layers detect edges, middle layers detect textures and patterns, and late layers detect object parts. The visualizations make abstract concepts concrete: you can see the hierarchical features emerge. Reading this (and examining the figures) will give you intuition for what “representation learning” actually looks like.

How Transferable Are Features in Deep Neural Networks? – Jason Yosinski et al. (2014) https://arxiv.org/abs/1411.1792

This paper shows that features learned on one task transfer to others. Yosinski demonstrates that early layers learn general features and late layers learn task-specific features, and quantifies how much performance degrades when transferring between tasks. Reading this explains why transfer learning works and when it’s most effective.