Chapter 6: Linear Models
The Simplest Thinking Machine
A linear model is the most fundamental prediction machine in machine learning. It takes input data, multiplies each input by a learned weight, and adds them together to produce a prediction. Despite this simplicity, linear models power some of the most important systems in production: ad ranking, credit scoring, fraud detection, and pricing engines.
Understanding linear models is not just about learning a specific algorithm. It’s about understanding how machines convert data into decisions through weighted combinations—a pattern that appears everywhere in machine learning, including deep neural networks.
The Geometry of Linear Models
At its core, a linear model computes a weighted sum of input features. If you’re predicting house prices, you might have features like square footage, number of bedrooms, and age. The model learns a weight for each feature that captures how much that feature contributes to the final price.
Mathematically, a linear model computes:

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Where:
- ŷ is the predicted value
- x₁, …, xₙ are the input features
- w₁, …, wₙ are the learned weights
- b is the bias term (the prediction when all features are zero)
This is equivalent to computing a dot product between the weight vector w and the feature vector x, then adding the bias: ŷ = w·x + b.
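As a quick sketch, this prediction can be written directly in Python; the weights, features, and bias below are made-up numbers for illustration, not learned values:

```python
# Minimal sketch of a linear model's prediction: y_hat = w . x + b.
# All numbers here are arbitrary illustrative values.

def predict(weights, features, bias):
    """Weighted sum of features plus bias -- the entire linear model."""
    return sum(w * x for w, x in zip(weights, features)) + bias

weights = [2.0, -0.5, 1.5]   # one learned weight per feature
bias = 10.0
features = [3.0, 4.0, 2.0]   # one input example

print(predict(weights, features, bias))  # 2*3 - 0.5*4 + 1.5*2 + 10 = 17.0
```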
Consider a concrete example. Suppose we’re predicting apartment rent based on:
- Square footage: 850 sq ft
- Distance from downtown: 3 miles
- Number of bedrooms: 2
A trained linear model might have learned these weights:
- w₁ = 2.50 (dollars per square foot)
- w₂ = −150 (penalty per mile from downtown)
- w₃ = 200 (value per bedroom)
- b = 700 (base rent)
The prediction becomes:

ŷ = 2.50 × 850 + (−150) × 3 + 200 × 2 + 700 = 2,125 − 450 + 400 + 700 = 2,775
The model predicts $2,775 per month. Each weight encodes how much the model has learned that feature matters. The positive weight on square footage means bigger apartments cost more. The negative weight on distance means being farther from downtown reduces rent.
Why hyperplane separation works: Geometrically, a linear model defines a hyperplane in feature space. For classification, this hyperplane is the decision boundary. Points on one side are predicted as one class, points on the other side as another. In two dimensions, this is a line; in three dimensions, a plane; in higher dimensions, a hyperplane. The weights define the orientation of this hyperplane, and the bias shifts it away from the origin.
This geometric view reveals why linear models are so fast: prediction is just checking which side of a hyperplane a point falls on. The computation is a single dot product—w·x + b—which modern CPUs execute in nanoseconds. This makes linear models suitable for real-time systems where millions of predictions per second are required.
Connection to neural networks: A linear model is equivalent to a single-layer neural network with no activation function—a perceptron. Deep learning adds layers and nonlinearity, but each layer still computes weighted sums. Understanding linear models is understanding the building block of neural networks.
Feature space transformation: The power of linear models increases dramatically when you transform the input space. By adding polynomial features (x², x³), trigonometric features (sin x, cos x), or domain-specific transformations, you can make nonlinear problems linearly separable. The kernel trick (used in SVMs) implicitly maps data to very high-dimensional spaces where linear separation becomes possible.
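A minimal sketch of this idea with an invented quadratic relationship: a linear model in x alone cannot fit y = 3x² + 2x + 1, but adding an x² column makes the problem exactly linear in the transformed features.

```python
# Sketch: adding a squared feature lets a linear model fit a quadratic
# relationship exactly. Data and coefficients are invented for illustration.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 3.0 * x**2 + 2.0 * x + 1.0          # true relationship is quadratic

# Design matrix with a transformed feature column: [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: linear in the *transformed* features
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))  # recovers [1, 2, 3]: bias, linear, quadratic terms
```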
Regularization: Controlling Complexity
While linear models are simple, they can still overfit—especially in high dimensions where the number of features approaches or exceeds the number of training examples. Regularization prevents overfitting by penalizing model complexity.
L2 Regularization (Ridge Regression) adds a penalty term to the loss function that penalizes large weights:

Loss = Error(w) + λ Σᵢ wᵢ²

Where Error(w) is the error on training data (e.g., mean squared error) and λ controls the strength of the penalty. This forces the model to keep weights small unless they’re truly necessary to fit the data.
Geometrically, L2 regularization shrinks all weights toward zero proportionally. It prefers solutions where the predictive power is distributed across many features rather than concentrated in a few. This improves generalization because it prevents the model from relying too heavily on any single feature, which might be noisy or spurious.
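The shrinkage can be sketched with ridge regression's closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy; the bias term is omitted and the data is synthetic, for brevity:

```python
# Sketch of ridge regression via its closed form:
#   w = (X^T X + lambda * I)^-1 X^T y
# Bias term omitted for brevity; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, lam=0.01)   # nearly ordinary least squares
w_large = ridge(X, y, lam=100.0)  # heavily shrunk toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Note how the large-λ solution has a much smaller norm: every weight is pulled toward zero, but none lands exactly at zero, which is the characteristic L2 behavior.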
L1 Regularization (Lasso) uses absolute values instead:

Loss = Error(w) + λ Σᵢ |wᵢ|
L1 has a special property: it drives many weights to exactly zero, producing sparse models. Sparse models are faster (ignore zero-weight features) and more interpretable (focus on the few features that matter). L1 performs automatic feature selection: it keeps important features and discards irrelevant ones.
Why does L1 produce sparsity? Consider the geometry. L2 regularization constrains weights to lie within a circle (in 2D) or sphere (in higher dimensions). L1 constrains weights to lie within a diamond (in 2D) or cross-polytope (an octahedron in 3D, and its higher-dimensional analogues). The corners of the diamond lie on the axes, which means the constrained optimum often lands on an axis—where some weights are exactly zero.
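The mechanism can also be seen algebraically in the soft-thresholding operator, the per-weight update used inside many lasso solvers; the weights and threshold below are arbitrary illustrative numbers:

```python
# Sketch: the soft-thresholding operator -- the update behind lasso solvers.
# Weights with magnitude below the threshold become *exactly* zero, which is
# why L1 produces sparse models. Numbers are illustrative.
import numpy as np

def soft_threshold(w, threshold):
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

w = np.array([2.5, -0.3, 0.05, -1.8, 0.2])
print(soft_threshold(w, 0.5))  # small weights become exactly 0.0
```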
Elastic Net combines both:

Loss = Error(w) + λ₁ Σᵢ |wᵢ| + λ₂ Σᵢ wᵢ²
This gives you the sparsity of L1 and the stability of L2. Elastic net is useful when you have correlated features: L1 alone might arbitrarily select one feature and discard the other, while elastic net keeps both with small weights.
Tuning lambda with cross-validation: The regularization strength λ controls the bias-variance tradeoff. Small λ allows the model to fit training data closely (low bias, high variance). Large λ forces the model to be simple (high bias, low variance). The optimal λ is found using cross-validation: split data into folds, train with different λ values, and select the λ that minimizes validation error.
The regularization path shows how weights change as λ varies. Start with large λ (all weights near zero), gradually decrease λ, and watch which features gain weight first. Features that gain weight first, while λ is still large, are the most important.
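A minimal sketch of such a path, using the ridge closed form on synthetic data; the λ grid is arbitrary:

```python
# Sketch of a regularization path: solve ridge for decreasing lambda and
# watch the weights grow. Data is synthetic; the lambda grid is arbitrary.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

norms = []
for lam in [100.0, 10.0, 1.0, 0.1]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:>6}: ||w|| = {norms[-1]:.3f}")
# the weight norm grows monotonically as lambda shrinks
```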
Why regularization matters in high dimensions: When the number of features p approaches or exceeds the number of training examples n, unregularized linear models can perfectly fit the training data (interpolation). But this fit is meaningless—it’s overfitting to noise. Regularization adds inductive bias that favors simpler explanations, enabling generalization even when p > n.
Why Linear Models Generalize
Linear models generalize well because they make a strong simplifying assumption: the relationship between inputs and outputs is linear. This assumption is wrong for most real-world problems, but it’s productively wrong. By forcing the model to find the best linear approximation, we prevent it from memorizing noise in the training data.
This is a direct manifestation of Occam’s Razor from Part I. A linear model has only as many parameters as there are features, plus a bias term. It cannot encode complex interactions or memorize individual training examples without those patterns being consistent across the data.
Consider the bias-variance tradeoff. Linear models have high bias—they assume linearity even when the true relationship is nonlinear. But they have low variance—small changes in training data don’t drastically change the learned weights. For many problems, especially those with limited training data or noisy measurements, this tradeoff favors linear models over more complex alternatives.
The diagram shows a linear decision boundary separating two classes. The boundary is a straight line, which means the model can only capture linear patterns. This constraint is a feature, not a bug—it prevents overfitting.
When Linear Models Work
Linear models excel in several scenarios:
High-dimensional sparse data: When the number of features is large relative to the number of samples (p ≫ n), linear models with regularization often outperform complex models. This regime is common in text classification (thousands of word features), genomics (thousands of gene expression levels), and web-scale systems (millions of user/item features). The curse of dimensionality hurts complex models more than simple ones—in high dimensions, data is sparse, and complex decision boundaries overfit. Linear models with L1 regularization automatically select relevant features and ignore the rest.
Feature engineering makes problems linear: Many nonlinear problems become linear after appropriate feature engineering. Ad click prediction models use millions of features encoding user history, ad characteristics, and context. With these rich features, a linear model captures complex behavior. Similarly, polynomial features (x², x³) make quadratic problems linear. Fourier features make periodic problems linear. The investment in feature engineering pays off because linear models train and serve quickly.
Real-time systems with latency constraints: Ad auctions, fraud detection, and recommendation systems require predictions in milliseconds. Linear models can make millions of predictions per second on a single CPU core because prediction is a dot product. Complex models (deep networks, ensembles) are too slow for real-time serving without specialized hardware. Hybrid architectures use neural networks offline to learn features, then train a linear model on those features for fast online serving.
Explainability is required: In finance, healthcare, and legal applications, model decisions must be explainable. Linear models provide direct interpretability: the weight on each feature shows its contribution to the prediction. You can explain to a loan applicant why they were denied: “Your debt-to-income ratio (weight +0.7) was too high, which outweighed your good credit score (weight -0.3).” This level of transparency is impossible with black-box models.
When Linear Models Fail
Linear models fail when the relationship between inputs and outputs is fundamentally nonlinear in ways that feature engineering cannot fix.
Non-linearly separable data: The classic example is the XOR problem. Two classes arranged in a checkerboard pattern cannot be separated by any line, no matter how you orient it. A linear model cannot solve XOR without manually adding interaction features (such as x₁x₂). This generalizes: whenever the decision boundary is circular, curved, or otherwise non-linear, a linear model will underfit unless you transform the feature space.
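A small sketch using least squares on the four XOR points: with only x₁ and x₂ no linear model can separate them, but adding the x₁x₂ interaction column makes an exact separation possible.

```python
# Sketch: XOR is not linearly separable in (x1, x2), but adding the
# interaction feature x1*x2 makes it separable. Least squares on the
# augmented features fits the four XOR points exactly.
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1.0, 1.0, 1.0, -1.0])  # XOR, encoded as +/-1

# Augment with a bias column and the interaction term x1*x2
X = np.column_stack([np.ones(4), points, points[:, 0] * points[:, 1]])
w, *_ = np.linalg.lstsq(X, labels, rcond=None)

print(np.sign(X @ w))  # [-1, 1, 1, -1]: all four points classified correctly
```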
Perceptual data without feature engineering: Raw pixels, audio waveforms, and video frames are not linearly related to object identities, speech content, or actions. A linear model cannot learn that specific configurations of pixels represent a cat without hand-designed features (edges, textures, shapes). Deep learning succeeded on perceptual tasks precisely because it learns hierarchical features that make the problem more linearly separable in the final layer.
Complex feature interactions: Many problems involve interactions between features. The effect of “time of day” on ad clicks depends on “device type”—mobile users in the evening behave differently than desktop users in the morning. A linear model treats these independently unless you manually create interaction features. With many features, the number of possible interactions grows quadratically (n(n−1)/2 pairs for n features), making exhaustive feature engineering impractical. Decision trees and neural networks discover interactions automatically.
Extrapolation beyond training distribution: Linear models extrapolate their learned line infinitely. If all training data shows house prices below $1M, the model will still confidently predict prices for a 10,000 sq ft mansion, even though it has never seen anything remotely similar. The prediction might be wildly wrong, but the model has no mechanism to express uncertainty outside the training distribution. Tree-based models plateau at the maximum training value, which is often more sensible behavior for extrapolation.
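A sketch of this failure mode with invented numbers: a line fit on modest houses happily extends far beyond anything it was trained on.

```python
# Sketch: a line fit on modest houses extrapolates linearly to sizes it
# has never seen. All numbers are invented for illustration.
import numpy as np

sqft = np.array([800.0, 1200.0, 1600.0, 2000.0, 2400.0])
price = 200.0 * sqft + 50_000.0   # training prices top out well under $1M

X = np.column_stack([np.ones_like(sqft), sqft])
w, *_ = np.linalg.lstsq(X, price, rcond=None)

mansion = np.array([1.0, 10_000.0])  # 10,000 sq ft, far outside training data
print(w @ mansion)  # the model confidently extends the line: 2050000.0
```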
Regression vs Classification
Linear models can be used for both regression (predicting continuous values) and classification (predicting discrete categories), but the interpretation differs.
Linear Regression predicts a continuous output. The model directly outputs ŷ as a real number. This is used for pricing, forecasting, and any scenario where the target is a quantity. The model minimizes the squared error between predictions and actual values during training.
Linear Classification predicts a category. The weighted sum produces a score, and the sign of that score determines the predicted class. If the score is positive, predict class 1; if negative, predict class 0. The magnitude of the score indicates confidence—a score of +5 is more confident than +0.1.
For classification, the decision boundary is the hyperplane where w·x + b = 0. Points on one side are classified as one class, points on the other side as the other class. In two dimensions, this is a line. In three dimensions, it’s a plane. In higher dimensions, it’s a hyperplane—but the concept is the same.
The weights define the orientation of this boundary, and the bias shifts it. Training a linear classifier means finding the w and b that best separate the training data.
Engineering Takeaway
Linear models remain dominant in production systems despite their simplicity—or rather, because of it.
Speed and scalability dominate at web scale. Computing a dot product takes nanoseconds. Linear models can make millions of predictions per second on a single CPU core, with latency measured in microseconds. This matters for real-time systems like ad auctions, where you have milliseconds to rank thousands of ads, or fraud detection, where you must approve or block a transaction instantly. Scalability extends to training: stochastic gradient descent (SGD) enables online learning on billions of examples, updating weights incrementally as new data arrives.
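A minimal sketch of such an online SGD update for squared error; the stream is synthetic and noiseless, and the learning rate and step count are arbitrary:

```python
# Sketch of online SGD for a linear model with squared error:
#   w <- w - lr * (y_hat - y) * x,   b <- b - lr * (y_hat - y)
# Each incoming example updates the weights incrementally.
import numpy as np

rng = np.random.default_rng(2)
true_w, true_b = np.array([1.5, -2.0]), 0.5   # target to recover

w, b, lr = np.zeros(2), 0.0, 0.05
for _ in range(2000):                   # stream of examples, one at a time
    x = rng.normal(size=2)
    y = true_w @ x + true_b
    error = (w @ x + b) - y             # prediction error on this example
    w -= lr * error * x                 # gradient step on the weights
    b -= lr * error                     # gradient step on the bias

print(np.round(w, 2), round(b, 2))      # approaches [1.5, -2.0] and 0.5
```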
Interpretability is non-negotiable in regulated domains. The weights are the explanation. If the weight on debt-to-income ratio is +0.7 and the weight on credit score is −0.3, you know exactly how each feature affects the prediction. This transparency is critical in finance (loan approvals, credit scoring), healthcare (diagnostic models), and legal contexts (risk assessments). Regulators require explainable models, and linear models provide this by design. You can audit decisions, debug biases, and explain outcomes to stakeholders.
Feature engineering is the highest-leverage activity. With linear models, the quality of features determines success. Investing in feature engineering—polynomial features, interaction terms, domain-specific transformations—pays enormous dividends. Modern systems often use neural networks to learn features offline, then train a linear model on those learned features for fast online serving. This hybrid approach combines deep learning’s representational power with linear models’ speed and interpretability.
Regularization is essential in high dimensions. When features outnumber examples (p > n), unregularized linear models overfit catastrophically. L2 regularization shrinks weights and improves stability. L1 regularization performs automatic feature selection, keeping only relevant features and driving others to zero. In production, almost all linear models use regularization—it’s not optional. Tune λ with cross-validation to balance underfitting (too much regularization) and overfitting (too little).
Linear models are the ultimate baseline. Before trying complex models, fit a linear model. If you can’t beat it with good features, you probably have a data problem, not an algorithm problem. Linear models expose issues: if performance is poor, either you need better features, more data, or the problem is fundamentally nonlinear. Conversely, if a linear model works well, deploy it—simplicity is valuable. The best production system is the simplest one that solves the problem.
High-dimensional regime: linear models shine when p ≫ n. Counterintuitively, linear models excel when features outnumber samples. This is common in text (thousands of words), genomics (thousands of genes), and web-scale systems (millions of user-item interactions). Complex models overfit in high dimensions due to the curse of dimensionality. Linear models with L1 regularization navigate this regime by selecting sparse subsets of features, effectively reducing dimensionality.
Foundation for neural networks: understand this, understand deep learning. Every layer in a neural network computes Wx + b—a linear transformation. Nonlinearity (ReLU, sigmoid) is applied after. Deep learning is stacked linear models with nonlinearity between layers. The final layer is often a linear model (logistic regression for classification, linear regression for regression). Understanding linear models is understanding the building block of deep learning. When you debug a neural network, you’re debugging chains of weighted sums.
References and Further Reading
Elements of Statistical Learning, Chapter 3 – Trevor Hastie, Robert Tibshirani, and Jerome Friedman https://hastie.su.domains/ElemStatLearn/
This is the canonical reference for linear models in machine learning. Chapter 3 covers linear regression in depth, including the statistical foundations, geometric interpretations, and connections to other methods. Chapter 4 covers classification, and Chapter 6 covers regularization (ridge, lasso, elastic net). It’s mathematical but readable for engineers willing to engage with the equations. Reading this will give you a complete understanding of why linear models work and where they fit in the broader landscape of statistical learning.
An Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani https://www.statlearning.com/
This is a more accessible version of ESL, aimed at practitioners rather than statisticians. Chapter 3 (Linear Regression) and Chapter 6 (Linear Model Selection and Regularization) are excellent introductions to the core concepts without heavy mathematical prerequisites. The book includes R code and exercises, making it practical for hands-on learning. If ESL feels too dense, start here.
Large Scale Online Learning – Léon Bottou and Olivier Bousquet (2008) https://leon.bottou.org/publications/pdf/nips-2007.pdf
This paper explains why linear models scale to billions of examples better than any other approach. It introduces stochastic gradient descent in the context of large-scale learning and shows that for many web-scale problems, the bottleneck is data processing, not model complexity. Linear models trained with SGD can process data as fast as it arrives, enabling online learning on infinite data streams. This is why Google, Facebook, and other tech companies use linear models at the core of their ranking and recommendation systems.