Chapter 3: Models Are Compression Machines

What a Trained Model Really Is

A trained machine learning model is a compressed representation of the training data. It distills millions of training examples into a set of parameters—weights, biases, decision thresholds—that capture the essential patterns while discarding the noise and details. This compression is not a side effect of learning; it is learning.

Consider a linear regression model trained on 100,000 housing price examples. The dataset might be hundreds of megabytes: addresses, sale dates, square footages, prices. The trained model is just a handful of numbers—maybe 10 weights and a bias. These 11 numbers encode what the model learned from 100,000 examples.

This is radical compression. The model has gone from 100,000 specific facts (this house sold for this price) to 11 general rules (square footage contributes this much to price, location contributes that much). The model cannot reproduce the training data exactly—it has thrown away the details. But it can approximate the patterns well enough to make useful predictions on new examples.
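To make this concrete, here is a minimal sketch of the idea using one synthetic feature (square footage) and a closed-form least-squares fit. The data, price formula, and noise level are all invented for illustration; the point is that 1,000 stored examples collapse into just two learned numbers.

```python
import random

# Hypothetical illustration: fit a one-feature linear model to synthetic
# "housing" data. The true relationship is price ≈ 150 * sqft + 20,000
# plus Gaussian noise; these numbers are made up for the example.
random.seed(0)
data = [(sqft, 150.0 * sqft + 20_000 + random.gauss(0, 5_000))
        for sqft in (random.uniform(500, 3000) for _ in range(1000))]

# Closed-form least squares for y = w * x + b.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
    sum((x - mean_x) ** 2 for x, _ in data)
b = mean_y - w * mean_x

# 2,000 stored numbers have been compressed into two parameters.
print(f"w ≈ {w:.1f} dollars/sqft, b ≈ {b:.0f} dollars")
```

The fitted `w` and `b` land near the true generating values (150 and 20,000) because the noise averages out across many examples; that averaging-away of noise is the compression.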

The size of a model relative to its training data gives a rough compression ratio. A neural network with 1 million parameters trained on 1 billion examples has compressed the data by a factor of roughly 1,000. A decision tree with 50 leaf nodes trained on 10,000 examples has compressed it by a factor of 200. The compression ratio reflects how much generalization is happening: higher compression means more patterns are being abstracted away from specific examples.

Modern language models demonstrate extreme compression. GPT-3 has 175 billion parameters (each a 32-bit float, totaling roughly 700GB) and was trained on roughly 300 billion tokens, on the order of 1TB of raw text. The raw compression ratio is modest, roughly 2x: the model has distilled about a terabyte of text into 700GB of weights. But this understates the compression: GPT-3 can generate coherent text on topics not in its training data, meaning it has learned compressible patterns (grammar, facts, reasoning styles) rather than memorizing strings.

Different architectures impose different compression constraints. Convolutional neural networks (CNNs) for vision have built-in assumptions about spatial locality—nearby pixels are correlated. This architectural prior reduces the parameter count needed to model images compared to fully connected networks. Recurrent networks for sequences assume temporal dependencies. These architectural choices are forms of compression: they restrict the hypothesis space to functions that align with domain structure, enabling better generalization from less data.

This perspective—learning as compression—is not just a metaphor. It’s a formal framework from information theory. The Minimum Description Length (MDL) principle formalizes this: the best model is the one that minimizes the total description length of the model plus the data given the model. A simple model with poor fit requires many bits to encode the residual errors. A complex model that fits perfectly requires many bits to encode its parameters. The optimal model balances these: enough complexity to capture patterns, enough simplicity to avoid encoding noise.

Pattern Discovery as Compression

Compression works by finding regularities. A file compressor looks for repeated sequences—if “the” appears 1,000 times in a document, the compressor encodes it once and references it repeatedly. A machine learning model does something similar: it finds patterns that recur across examples and encodes them as parameters.

Consider learning to predict the next element in a sequence: 2, 4, 6, 8, 10, …

A naive encoding stores each number explicitly: 5 numbers, each requiring a few bytes. But if you recognize the pattern—“start at 2, add 2 each time”—you can encode the entire sequence with just two numbers: the starting value and the increment. This is compression through pattern discovery.
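A sketch of this encoding, with hypothetical helper names: compress the sequence down to a start value, an increment, and a length, then expand it back.

```python
# Compress an arithmetic sequence into (start, step, length).
def compress_arithmetic(seq):
    start, step = seq[0], seq[1] - seq[0]
    # The encoding is only valid if the pattern holds for every element.
    assert all(seq[i] == start + i * step for i in range(len(seq)))
    return start, step, len(seq)

# Expand the three-number code back into the full sequence.
def decompress_arithmetic(start, step, length):
    return [start + i * step for i in range(length)]

code = compress_arithmetic([2, 4, 6, 8, 10])
print(code, decompress_arithmetic(*code))
```

Five numbers become three, and the saving grows with sequence length: a million-element arithmetic sequence still compresses to the same three numbers.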

Now consider a less obvious sequence: 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, …

This is the digits of π. If you know the rule—compute π and extract digits—you can compress the sequence into an algorithm. But if you don’t recognize this pattern, you’re stuck storing each digit individually. The sequence looks random even though it’s not.

This notion is formalized by Kolmogorov complexity: the complexity of a string is the length of the shortest program that generates it. The sequence 2, 4, 6, 8, 10 has low Kolmogorov complexity (short program: for i in range(5): print(2*(i+1))). The digits of π also have low complexity (there’s a formula for computing π). A truly random sequence has high complexity—it cannot be compressed; the shortest program is “print these specific digits.”

Machine learning models do this automatically. They search for compressible patterns in data. When they find them, they encode them as parameters. A language model trained on English text learns that “the” is common, that “New York” often appears together, that verbs follow subjects. These patterns allow the model to compress text—to represent it with fewer parameters than the raw character sequence would require.

Language models perform next-word prediction, which is equivalent to compression. Given text “The cat sat on the ___”, a model assigns high probability to “mat” and low probability to “theorem.” This probability distribution compresses the data: common continuations are assigned short codes (high probability), rare continuations get long codes (low probability). The better the model predicts, the more it compresses. Lossless compression algorithms like gzip use this principle explicitly—they build a probability model and encode symbols with lengths inverse to their predicted probability.
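The probability-to-code-length correspondence can be sketched directly: under an optimal code, a symbol with probability p costs about -log2(p) bits. The probabilities below are invented for illustration, not taken from a real model.

```python
import math

# Hypothetical next-word probabilities for "The cat sat on the ___".
# Higher probability -> shorter code under an optimal (entropy) code.
continuations = {"mat": 0.6, "floor": 0.3, "theorem": 0.0001}
bits = {word: -math.log2(p) for word, p in continuations.items()}
for word, b in bits.items():
    print(f"{word!r}: {b:.1f} bits")
```

The likely continuation "mat" costs under one bit, while the wildly improbable "theorem" costs over thirteen; a model that predicts well therefore compresses well.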

Machine learning is lossy compression. Unlike gzip, which can perfectly reconstruct the original data, a trained model discards details. It cannot reproduce each training example exactly—it has forgotten the specifics and retained only the patterns. This is not a bug; it’s essential. Lossless compression memorizes. Lossy compression generalizes. By discarding instance-specific noise, the model retains only the transferable signal.

The quality of compression reflects the quality of learning. If the model compresses well—captures the patterns with few parameters—it has learned something general. If it compresses poorly—requires many parameters to fit the data—it’s memorizing rather than learning.

Occam’s Razor

Occam’s Razor is the principle that, among explanations that fit the evidence equally well, the simplest is most likely to be true. In machine learning, this translates to: simpler models generalize better. A model with fewer parameters is less likely to overfit because it cannot encode complex, dataset-specific idiosyncrasies. It’s forced to find broad patterns that transfer to new data.

This is why regularization works. Regularization adds a penalty to the loss function that punishes model complexity. For a linear model, L2 regularization (ridge regression) penalizes large weights:

$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_{i} w_i^2$$

Where $L_{\text{data}}$ is the error on training data and $\lambda$ controls the strength of the penalty. This forces the model to keep weights small unless they’re truly necessary to fit the data. The result is a simpler model that generalizes better.
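A minimal sketch of the shrinkage effect, assuming the simplest possible setting: a one-feature model y = w·x with no bias, where the ridge solution has the closed form w = Σxy / (Σx² + λ). The data points are made up.

```python
# Toy data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Closed-form ridge solution for y = w * x (no bias term):
# the penalty lam appears in the denominator and shrinks w toward zero.
def ridge_weight(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lam={lam:6.1f} -> w={ridge_weight(xs, ys, lam):.3f}")
```

With λ = 0 the fit recovers w ≈ 2; as λ grows the weight is pulled steadily toward zero, trading training fit for simplicity.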

L1 regularization (lasso) uses absolute values instead: $\lambda \sum_i |w_i|$. This encourages sparsity—many weights are driven to exactly zero, effectively removing features from the model. Sparse models are interpretable and fast: you can ignore zero-weight features entirely. L1 is useful when you have many features and suspect most are irrelevant.
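The sparsity mechanism can be seen in one function: in the simplest one-dimensional case, the lasso solution is a soft-threshold that sets small values exactly to zero, where ridge would only shrink them. A sketch:

```python
# Soft-thresholding: the building block of lasso solutions.
# Values inside [-lam, lam] are zeroed; values outside shrink by lam.
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print([soft_threshold(z, 1.0) for z in [3.0, 0.5, -0.2, -4.0]])
# -> [2.0, 0.0, 0.0, -3.0]: weak signals vanish, strong ones shrink
```

This is exactly the "feature removal" behavior described above: weights backed by weak evidence are driven to zero and drop out of the model entirely.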

Dropout, used in neural networks, randomly disables neurons during training. This prevents the network from relying on any single neuron—it must learn redundant representations. Dropout acts as regularization by reducing the effective capacity of the network. At test time, dropout is off, but the redundancy learned during training makes predictions robust.
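A sketch of the standard "inverted dropout" formulation: each activation is zeroed with probability p during training, and survivors are scaled by 1/(1-p) so the expected activation is unchanged; at test time the layer is the identity.

```python
import random

def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training;
    scale survivors by 1/(1-p) to keep the expected value constant."""
    if not training:
        return list(activations)  # identity at test time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10
train_out = dropout(acts, 0.5, training=True, rng=rng)
test_out = dropout(acts, 0.5, training=False, rng=rng)
print(train_out)
print(test_out)
```

During training, each surviving activation is 2.0 (scaled by 1/0.5) and the rest are 0.0; at test time the activations pass through unchanged.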

Early stopping is another form of regularization. Even without explicit penalties, training a model for too many iterations causes overfitting—it starts memorizing training examples rather than learning patterns. Early stopping monitors validation error and stops training when it stops improving, even if training error is still decreasing. This prevents the model from using its full capacity to overfit.
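The early-stopping rule can be sketched as a patience loop over validation losses. The loss values below are invented; real training would read them from a validation pass each epoch.

```python
# Stop when validation loss hasn't improved for `patience` checks in a row.
def early_stop_epoch(val_losses, patience=2):
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0  # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch  # patience exhausted: stop here
    return len(val_losses) - 1

# Validation loss improves, bottoms out at epoch 3, then drifts upward.
print(early_stop_epoch([1.0, 0.7, 0.5, 0.45, 0.46, 0.47, 0.48]))  # -> 5
```

Training halts at epoch 5, two non-improving epochs after the minimum, even though training error would have kept shrinking.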

Cross-validation provides a systematic way to tune regularization strength. You split data into folds, train on some folds, validate on others, and repeat. For each setting of $\lambda$, you measure average validation error across folds. The $\lambda$ that minimizes validation error balances underfitting (too much regularization) and overfitting (too little regularization). Cross-validation finds the right compression level: enough complexity to capture signal, enough simplicity to ignore noise.
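A sketch of this procedure for the one-feature ridge model above. The data, fold count, and λ grid are all made up; the structure (train on k-1 folds, validate on the held-out fold, average, pick the best λ) is the part that matters.

```python
# One-feature ridge solution (no bias): w = sum(x*y) / (sum(x^2) + lam).
def ridge_w(pairs, lam):
    return sum(x * y for x, y in pairs) / (sum(x * x for x, _ in pairs) + lam)

def cv_error(pairs, lam, k=4):
    """Mean squared validation error of ridge, averaged over k folds."""
    folds = [pairs[i::k] for i in range(k)]  # simple interleaved split
    total = 0.0
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        w = ridge_w(train, lam)
        total += sum((y - w * x) ** 2 for x, y in folds[i]) / len(folds[i])
    return total / k

# Toy data: y = 2x with small alternating "noise".
data = [(x, 2.0 * x + (-1) ** x * 0.3) for x in range(1, 13)]
grid = [0.0, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(data, lam))
print(best_lam, cv_error(data, best_lam))
```

The selected λ is whichever grid value generalizes best across held-out folds; on real data, too small a λ loses to overfitting and too large a λ loses to underfitting.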

Why does simplicity improve generalization? Because complex models can fit spurious patterns—patterns that happened to occur in the training data by chance but don’t reflect the underlying process. A complex model with many parameters can bend and twist to fit every quirk of the training set, including the noise. A simple model cannot. It’s constrained to find only the strongest, most consistent patterns.

Consider fitting a polynomial to data points. A degree-1 polynomial (a line) is simple but might underfit. A degree-10 polynomial is complex and can fit every training point exactly—but it will wildly oscillate between points, fitting noise rather than signal. A degree-3 polynomial might strike the right balance: flexible enough to capture the true curve, simple enough to ignore noise.

[Figure: Occam's Razor diagram]

The diagram shows three models: a simple linear model that underfits, a complex polynomial that overfits by passing through every point, and a moderate model that captures the true pattern without fitting noise. The best model achieves good compression—it captures the signal with reasonable complexity.

Occam’s Razor is not just a philosophical preference—it’s a statistical necessity. Given limited data, you must choose the simplest model consistent with the observations because simpler models make fewer assumptions and are thus more likely to transfer to unseen data.

Overfitting as Memorization

Overfitting occurs when a model memorizes the training data instead of learning general patterns. The model achieves perfect training accuracy but poor test accuracy because it has encoded dataset-specific details that don’t transfer.

Think of a student memorizing answers to practice problems without understanding the underlying concepts. They’ll ace the practice test but fail on new questions that require applying the concepts in unfamiliar ways. The student has compressed nothing—they’ve stored each example verbatim.

This is what happens when models are too complex relative to the amount of training data. A neural network with 1 million parameters trained on 100 examples will overfit catastrophically. It has enough capacity to memorize all 100 examples exactly, including every bit of noise, without learning anything generalizable.

Consider a decision tree trained without depth limits on a small dataset. The tree will grow until each leaf contains a single training example—perfect training accuracy, zero compression. Each leaf encodes a rule like “if feature1=0.5 and feature2=3.2 and feature3=1.1, then class=A.” These rules are utterly specific to the training data and won’t generalize. A shallower tree forced to group similar examples learns broader rules that transfer better.

Memorization happens when:

  1. Model capacity exceeds data size: Too many parameters, too few examples.
  2. Training runs too long: Even a well-sized model can overfit if trained until it perfectly fits every training example.
  3. Noise is present: If the data contains randomness or mislabeled examples, the model can memorize these errors.

The telltale sign of overfitting is divergence between training and validation performance. Training error keeps decreasing (the model fits the training data ever better), but validation error plateaus or starts increasing (the model is not generalizing). This divergence indicates memorization: the model is learning patterns specific to the training set.
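A crude version of this diagnostic can be sketched as a function of the two loss curves. The threshold and the loss histories below are invented; in practice you would tune the threshold per task.

```python
# Flag overfitting when validation loss has stopped improving
# while a sizeable train/validation gap has opened up.
def is_overfitting(train_losses, val_losses, gap_threshold=0.1):
    gap = val_losses[-1] - train_losses[-1]
    val_rising = len(val_losses) >= 2 and val_losses[-1] >= val_losses[-2]
    return gap > gap_threshold and val_rising

# Both curves falling together: healthy.
healthy = is_overfitting([0.9, 0.6, 0.5], [1.0, 0.7, 0.58])
# Training loss collapsing while validation loss rises: memorization.
memorizing = is_overfitting([0.9, 0.4, 0.1], [1.0, 0.8, 0.85])
print(healthy, memorizing)  # -> False True
```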

Interestingly, very large modern neural networks sometimes escape this pattern. The double descent phenomenon (Nakkiran et al., 2019) shows that test error can decrease again after the overfitting regime if you make the model large enough. The classic U-shaped bias-variance curve (small models underfit, large models overfit) becomes double-descent: small models underfit, medium models overfit, but very large models can generalize well again. This happens because overparameterized models have many solutions that fit the training data, and optimization implicitly finds solutions that generalize—a form of implicit regularization.

This doesn’t invalidate Occam’s Razor—it reveals that model “complexity” isn’t just parameter count. Very large models trained with SGD, dropout, and batch normalization have implicit constraints that enforce simplicity despite their size. The effective capacity (how complex functions the training procedure actually learns) is smaller than the nominal capacity (how complex functions the architecture could represent).

Preventing overfitting requires limiting model complexity relative to data:

  • Regularization: Penalize complexity in the loss function (L1, L2 penalties, weight decay).
  • Early stopping: Stop training when validation error stops improving, even if training error is still decreasing.
  • Data augmentation: Create more training examples by applying transformations (rotation, cropping for images; paraphrasing for text).
  • Dropout and noise injection: Force the model to learn robust representations that don’t rely on specific neurons or features.
  • Architecture choices: Use simpler models when data is limited.

The fundamental tradeoff is between fitting the training data and compressing it. Perfect fit means memorization. Imperfect but parsimonious fit means compression—and compression generalizes.

Engineering Takeaway

Understanding learning as compression changes how you approach model design and debugging.

Match model capacity to data size. Don’t use a 100-million-parameter neural network on 10,000 training examples; it will overfit. Use simpler models (linear, shallow trees) when data is limited, and scale model capacity as data scales. A common rule of thumb: without strong regularization, you want roughly ten times as many training examples as parameters.

Apply regularization systematically. Almost all production models use regularization—L2 penalties, dropout, weight decay, early stopping. These techniques prevent memorization by penalizing complexity. Tune regularization strength ($\lambda$) on a validation set: stronger regularization means simpler models that might underfit; weaker regularization means complex models that might overfit. Use cross-validation to find the right balance.

Monitor training vs validation loss continuously. If training loss is much lower than validation loss, you’re overfitting. The model is memorizing rather than compressing. Increase regularization, reduce model capacity, or collect more data. If both losses are high, you’re underfitting—the model is too simple. Increase capacity or use better features. The gap between training and validation loss is your diagnostic for compression quality.

Leverage compression for transfer learning. Models that compress well—that learn general patterns rather than memorizing specifics—transfer better to new domains. This is why pretraining works: a language model trained on billions of words compresses the patterns of language, and these patterns transfer to specific tasks like sentiment analysis or translation. Good compression is good representation. Pretrained models are compressed knowledge that you can fine-tune with little task-specific data.

Compress models for deployment. After training, many parameters contribute little to predictions. Model compression techniques reduce size and speed up inference without hurting accuracy:

  • Pruning: Remove weights with small magnitude (often 50–90% of weights can be removed without accuracy loss).
  • Quantization: Reduce precision from 32-bit floats to 8-bit integers (4x smaller, faster).
  • Knowledge distillation: Train a small model to mimic a large model’s outputs (compress the compressed representation).

These techniques make models practical for mobile devices and real-time systems.
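Of the three techniques, magnitude pruning is the simplest to sketch: keep only the largest-magnitude weights and zero the rest. The weight list and keep fraction below are made up for illustration.

```python
# Magnitude pruning sketch: zero all weights below the k-th largest
# magnitude, where k is set by the target keep fraction.
def prune(weights, keep_fraction):
    k = max(1, int(len(weights) * keep_fraction))
    threshold = sorted(map(abs, weights), reverse=True)[k - 1]
    # Note: ties at the threshold may keep slightly more than k weights.
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.01, -0.8, 0.02, 0.5, -0.03, 1.2, 0.0, -0.04]
pruned = prune(w, 0.25)  # keep the 2 largest-magnitude weights
print(pruned)  # -> [0.0, -0.8, 0.0, 0.0, 0.0, 1.2, 0.0, 0.0]
```

The zeroed weights can then be stored sparsely or skipped at inference time; in practice the pruned model is usually fine-tuned briefly to recover any lost accuracy.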

Use compression as a debugging tool. If your model has 1 million parameters but achieves the same performance as a 10,000 parameter model, it’s not compressing well—it’s learning redundant or irrelevant patterns. Simplify the architecture. If your model compresses well (good validation performance with few parameters), it has discovered meaningful structure. Inspect what it learned—visualize weights, analyze feature importance—to understand the patterns.

Expect implicit regularization in modern deep learning. Large neural networks trained with SGD, batch normalization, and dropout often generalize better than classical theory predicts. They have implicit biases toward simple solutions despite their nominal complexity. This is still an active research area, but practically: don’t be afraid to use large models if you have compute and data—implicit regularization provides compression.

The lesson: Learning is compression. Models that compress data well—capturing patterns with few parameters—generalize well. Models that memorize data—requiring many parameters to fit specific examples—do not. Design systems that favor compression over memorization, and you’ll build models that generalize.


References and Further Reading

Kolmogorov Complexity and Algorithmic Information Theory – Ming Li and Paul Vitányi https://homepages.cwi.nl/~paulv/papers/info.pdf

Kolmogorov complexity formalizes the idea that learning is compression. It defines the complexity of a dataset as the length of the shortest program that generates it. Learning means finding that program. This paper connects information theory, compression, and machine learning in a rigorous framework. Reading this will give you a theoretical foundation for why simpler models generalize better.

Occam’s Razor – Kevin Murphy, Section 1.2 in Machine Learning: A Probabilistic Perspective https://probml.github.io/pml-book/

Murphy’s textbook provides an accessible introduction to Occam’s Razor in the context of Bayesian machine learning. He explains how the principle of parsimony emerges naturally from probability theory: simpler models are preferred unless the data strongly justifies complexity. This connects philosophical intuition (simplicity) to mathematical formalism (Bayesian model selection).

Deep Double Descent: Where Bigger Models and More Data Hurt – Preetum Nakkiran et al. (2019) https://arxiv.org/abs/1912.02292

This paper documents the double descent phenomenon—test error decreases, then increases (classic overfitting), then decreases again as model size grows. This challenges conventional wisdom about the bias-variance tradeoff and reveals that very large models can generalize well despite having capacity to memorize. Understanding this phenomenon is essential for modern deep learning, where overparameterized models are standard. It shows that the relationship between model complexity and generalization is more nuanced than classical theory suggests.