Chapter 7: Logistic Regression

Turning Scores into Decisions

A linear model produces a score—a single number that represents how strongly the model believes something is true. For house prices, that score is the prediction. But for classification, the score needs to be converted into a probability. This is what logistic regression does.

Logistic regression is not a different model architecture from linear models—it’s the same weighted sum. The difference is in what happens after computing the score. Instead of using the raw score as the prediction, logistic regression passes it through the sigmoid function to produce a probability between 0 and 1. This probability can then be thresholded to make a binary decision.

This transformation from score to probability is fundamental to how modern systems make decisions under uncertainty. Every time you see “90% likely to click” or “high confidence prediction,” there’s a calibrated probability model behind it.

Why the Sigmoid?

The sigmoid function, also called the logistic function, squashes any real-valued score into the range (0, 1):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \mathbf{w} \cdot \mathbf{x} + b$ is the linear model’s output. The function has several useful properties:

  • When $z = 0$, $\sigma(z) = 0.5$ (maximum uncertainty)
  • When $z$ is large and positive, $\sigma(z) \to 1$ (confident prediction of class 1)
  • When $z$ is large and negative, $\sigma(z) \to 0$ (confident prediction of class 0)
  • The function is smooth and differentiable, which makes training with gradient-based optimization straightforward
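
As a concrete illustration, here is a minimal, numerically stable sigmoid in plain Python (the function name and test values are illustrative, not from any particular library):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + e^-z), computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For large negative z, exp(-z) would overflow; use the equivalent form.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))     # 0.5: maximum uncertainty
print(sigmoid(4.2))     # ~0.985: confident prediction of class 1
print(sigmoid(-2.1))    # ~0.109: confident prediction of class 0
```

The branch on the sign of $z$ avoids computing `exp` of a large positive number, which is a standard trick when scores can be extreme.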

Why this specific form? The sigmoid is not arbitrary. It arises naturally from modeling the log-odds (logit) of the probability. If $p = P(y = 1 \mid \mathbf{x})$ is the probability of class 1, the odds are $\frac{p}{1-p}$, and the log-odds (logit) are:

$$\text{logit}(p) = \log \frac{p}{1-p} = z = \mathbf{w} \cdot \mathbf{x} + b$$

The logistic regression model assumes that the log-odds are a linear function of the input features. This is equivalent to saying that the probability itself is:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Why is the log-odds useful? Because it maps probabilities from $[0, 1]$ to $(-\infty, \infty)$, allowing us to use a linear model. The sigmoid then inverts this transformation to recover a probability. This connection to odds is why logistic regression coefficients can be interpreted as multiplicative effects on the odds: a weight of $w_i = 1.2$ means increasing $x_i$ by 1 multiplies the odds by $e^{1.2} \approx 3.3$.
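
To make the odds interpretation concrete, a short sketch (the starting probability of 0.2 is an arbitrary illustration):

```python
import math

w_i = 1.2                       # weight on feature x_i
odds_ratio = math.exp(w_i)      # multiplicative effect on the odds per unit of x_i
print(round(odds_ratio, 2))     # ~3.32

# Starting from p = 0.2, a one-unit increase in x_i:
p = 0.2
odds = p / (1 - p)              # 0.25
new_odds = odds * odds_ratio    # 0.25 * e^1.2
new_p = new_odds / (1 + new_odds)
print(round(new_p, 3))          # probability rises to ~0.454
```

Note the effect is multiplicative on odds, not additive on probability: the same weight moves $p = 0.2$ much further than it would move $p = 0.95$.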

The sigmoid also connects logistic regression to the exponential family of distributions. Assuming a Bernoulli distribution for the target and applying maximum likelihood estimation yields the logistic regression model with the sigmoid transformation. This probabilistic foundation is why logistic regression produces calibrated probabilities—it’s not just a heuristic squashing function.

[Figure: the sigmoid function]

The sigmoid function transforms unbounded scores into calibrated probabilities. At $z = 0$, the model is maximally uncertain and outputs 0.5. As $z$ increases, confidence in class 1 increases. As $z$ decreases, confidence in class 0 increases.

Consider a spam classifier. The linear model computes a score based on features like “number of exclamation marks,” “contains word ‘free’,” and “sender reputation.” If the score is +4.2, the sigmoid maps this to $\sigma(4.2) \approx 0.985$: the model is 98.5% confident this is spam. If the score is −2.1, the sigmoid maps this to $\sigma(-2.1) \approx 0.109$: about an 11% probability of spam, or 89% probability of legitimate email.

Decision Boundaries

The decision boundary in logistic regression is the set of points where the model outputs exactly 0.5 probability—where it’s maximally uncertain. This occurs when $z = 0$, which means:

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

This is identical to the decision boundary of a linear classifier from Chapter 6. The sigmoid doesn’t change where the boundary is—it only changes how we interpret distances from the boundary. Points far from the boundary get mapped to probabilities close to 0 or 1. Points near the boundary get mapped to probabilities near 0.5.

This geometric interpretation is important. The model’s weights $\mathbf{w}$ define a direction in feature space. The dot product $\mathbf{w} \cdot \mathbf{x}$ measures how far a point projects along that direction. The bias $b$ shifts the threshold. Together, they define a hyperplane that splits the space into “likely class 1” and “likely class 0” regions.

In two dimensions, this is a line. In higher dimensions, it’s still a linear boundary—which means logistic regression has the same limitations as linear models. If the true decision boundary is nonlinear, logistic regression will underfit unless you engineer features to make the problem linearly separable.
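
A minimal sketch of this geometry, using made-up weights: points exactly on the hyperplane get probability 0.5, and points farther along $\mathbf{w}$ get more extreme probabilities.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -1.0]   # illustrative weights
b = 0.5           # illustrative bias

def predict_proba(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# A point exactly on the hyperplane w.x + b = 0 gets probability 0.5.
on_boundary = [0.25, 1.0]           # 2*0.25 - 1*1.0 + 0.5 = 0
print(predict_proba(on_boundary))   # 0.5

# Moving along w away from the boundary pushes probabilities toward 0 or 1.
print(predict_proba([2.25, 0.0]))   # z = 5.0 -> ~0.993
```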

Decision Thresholds and Cost-Sensitive Learning

Once you have a probability, you need to decide: at what threshold do we predict class 1? The default is 0.5: if $P(y = 1 \mid \mathbf{x}) > 0.5$, predict class 1. But this assumes that false positives and false negatives are equally costly, which is rarely true.

Medical diagnosis: A false negative (missing cancer) can be fatal. A false positive (flagging a healthy patient for follow-up tests) is stressful and expensive, but not life-threatening. The cost asymmetry is extreme: false negatives might cost lives, false positives cost time and money. In this scenario, you set a low threshold, maybe 0.1, so you flag anything that’s at least 10% likely to be cancer for further investigation. This maximizes sensitivity (recall) at the cost of specificity and precision. Follow-up tests with higher specificity then filter out the false positives.

Fraud detection: A false negative (missing a fraudulent transaction) can cost thousands of dollars and damage customer trust. A false positive (blocking a legitimate transaction) is annoying but reversible—you can unblock it. Here, you might set the threshold at 0.3: flag anything that’s more than 30% likely to be fraud for manual review or additional authentication. The exact threshold depends on the base rate of fraud and the costs of each error type.

Spam filtering: A false positive (marking legitimate email as spam) is worse than a false negative (letting spam through). Users tolerate spam but don’t tolerate missing important messages. So you set a high threshold—only mark as spam if you’re 90-95% confident. This prioritizes precision (high confidence when you do predict spam) over recall (catching all spam).

The choice of threshold is a business decision based on the cost matrix: the cost of each type of error. If false negatives cost $C_{FN}$ and false positives cost $C_{FP}$, the optimal threshold minimizes expected cost:

$$\text{Expected Cost} = C_{FP} \cdot P(FP) + C_{FN} \cdot P(FN)$$

In practice, you don’t know these costs precisely, but you can estimate them or use business metrics as proxies. You then perform a grid search over thresholds: for each threshold in [0, 1], compute precision, recall, F1, or the business metric on a validation set, and select the threshold that optimizes the metric.
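
A sketch of this grid search on a toy validation set (the probabilities, labels, and costs below are invented for illustration):

```python
# Toy validation set: (predicted probability, true label).
val = [(0.05, 0), (0.20, 0), (0.35, 1), (0.40, 0), (0.55, 1),
       (0.70, 1), (0.80, 0), (0.90, 1), (0.95, 1), (0.10, 0)]

C_FP, C_FN = 1.0, 5.0   # a false negative is 5x as costly as a false positive

def total_cost(threshold):
    """Sum the cost of every false positive and false negative at this threshold."""
    cost = 0.0
    for p, y in val:
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 0:
            cost += C_FP
        elif pred == 0 and y == 1:
            cost += C_FN
    return cost

# Grid search over thresholds in [0, 1]; min() returns the cheapest one.
thresholds = [i / 100 for i in range(101)]
best = min(thresholds, key=total_cost)
print(best, total_cost(best))
```

For well-calibrated probabilities, decision theory suggests predicting positive when $p > C_{FP} / (C_{FP} + C_{FN})$; on a tiny sample like this, the empirical optimum can land nearby but not exactly there.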

ROC curves and precision-recall curves visualize this tradeoff. An ROC curve plots true positive rate (recall) vs false positive rate as the threshold varies. The area under the ROC curve (AUC) measures the model’s discriminative ability—the probability that a random positive example is ranked higher than a random negative example. A perfect model has AUC = 1; a random model has AUC = 0.5.

For imbalanced datasets (e.g., fraud is 0.1% of transactions), precision-recall curves are more informative than ROC curves. They show the tradeoff between precision (how many flagged cases are actually fraud) and recall (how much fraud you catch). High recall with low precision means you’re flagging too many false positives. High precision with low recall means you’re missing too much fraud.

The key insight: the threshold is not a hyperparameter—it’s a business parameter. Don’t choose it by maximizing accuracy. Choose it by understanding the cost structure of errors in your application and optimizing for the metric that matters: customer satisfaction, revenue, regulatory compliance.

Calibration

A probability is only meaningful if it’s calibrated. Calibration means that when the model says “80% probability,” it should be correct 80% of the time across all predictions at that confidence level.

Logistic regression trained with maximum likelihood estimation naturally produces calibrated probabilities, assuming the training data is representative of the deployment distribution. This is a major advantage over other models. A support vector machine, for instance, produces scores that are not probabilities and require post-processing to be interpretable. Decision trees can produce probabilities, but they’re often poorly calibrated (too confident) without pruning or ensembling.

When calibration breaks: Calibration can degrade if the training distribution doesn’t match the deployment distribution. If the model is trained on data where spam is 50% of examples, but in production spam is 10%, the raw probabilities will be miscalibrated—the model will overestimate spam probability. This is a distribution shift problem.

Calibration also breaks if the model is too simple (high bias) or too complex (overfit). An underfit model can’t distinguish confidently between classes, so all probabilities cluster around 0.5. An overfit model is overconfident: it assigns probabilities near 0 or 1 even when the true uncertainty is higher.

Recalibration techniques:

  • Platt scaling: Train a logistic regression model on top of the model’s outputs. Use a held-out validation set to fit a sigmoid: $p_{\text{calibrated}} = \sigma(a \cdot z + b)$, where $z$ is the model’s raw score. This corrects for systematic over- or underconfidence. Platt scaling is fast and works well when the miscalibration is monotonic.

  • Isotonic regression: Fit a non-parametric, monotonically increasing function to map raw probabilities to calibrated probabilities. This is more flexible than Platt scaling and can correct for non-monotonic miscalibration. It requires more validation data and can overfit if the validation set is small.

  • Calibration plots: Visualize calibration by binning predictions (e.g., 0-10%, 10-20%, …, 90-100%) and plotting the mean predicted probability against the observed frequency of positives in each bin. A perfectly calibrated model has points along the diagonal. Deviations reveal miscalibration: points above the diagonal mean the model is underconfident; points below mean it is overconfident.
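
Such a reliability table takes only a few lines of plain Python; the predictions and labels below are invented for illustration:

```python
# Bin predictions, then compare mean predicted probability to the
# observed positive rate in each bin.
preds  = [0.1, 0.15, 0.2, 0.4, 0.45, 0.6, 0.65, 0.8, 0.85, 0.9]
labels = [0,   0,    1,   0,   1,    1,   0,    1,   1,    1]

n_bins = 5
bins = [[] for _ in range(n_bins)]
for p, y in zip(preds, labels):
    idx = min(int(p * n_bins), n_bins - 1)   # e.g. p = 0.43 -> bin 2 of 5
    bins[idx].append((p, y))

for i, bucket in enumerate(bins):
    if not bucket:
        continue
    mean_pred = sum(p for p, _ in bucket) / len(bucket)
    frac_pos  = sum(y for _, y in bucket) / len(bucket)
    print(f"bin {i}: mean predicted {mean_pred:.2f}, observed {frac_pos:.2f}")
```

Plotting `mean_pred` against `frac_pos` per bin gives the calibration plot; libraries such as scikit-learn offer the same computation ready-made.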

When calibration matters: Any time you use probabilities for decision-making rather than just ranking. Medical risk scoring, insurance pricing, credit scoring, and weather forecasting all require calibrated probabilities. If you’re only ranking (e.g., showing top-k recommendations), calibration is less critical—you care about relative ordering, not absolute probabilities.

Multi-Class Extension

Logistic regression naturally handles binary classification. For multi-class problems with $K$ classes, the extension is softmax regression (also called multinomial logistic regression).

Instead of a single linear model producing a score, you have $K$ linear models, one per class:

$$z_k = \mathbf{w}_k \cdot \mathbf{x} + b_k \quad \text{for } k = 1, \ldots, K$$

The softmax function converts these $K$ scores into probabilities that sum to 1:

$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

Each class gets a probability proportional to $e^{z_k}$. The class with the highest score gets the highest probability. Softmax is a generalization of the sigmoid: when $K = 2$, softmax reduces to binary logistic regression.
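
A minimal softmax sketch in plain Python, using the standard max-subtraction trick for numerical stability (which does not change the result):

```python
import math

def softmax(scores):
    """Convert K real scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, -1.0])
print([round(p, 3) for p in probs])       # highest score -> highest probability
print(sum(probs))                         # sums to 1

# With K = 2 and the second score fixed at 0, softmax reduces to the sigmoid.
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```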

Softmax is the final layer in many neural network classifiers. It’s also used standalone for text classification (categorizing documents into topics), image classification (multi-class object recognition), and any multi-class problem where you want calibrated probabilities for each class.

One-vs-rest and one-vs-one are alternative strategies for multi-class problems:

  • One-vs-rest (OvR): Train $K$ binary classifiers, each distinguishing one class from all others. At prediction time, run all $K$ classifiers and pick the class with the highest confidence. Simple and parallelizable, but probabilities across classes don’t sum to 1.

  • One-vs-one (OvO): Train $\binom{K}{2}$ binary classifiers, one for each pair of classes. At prediction time, run all classifiers and use voting to pick the final class. More classifiers to train, but each classifier sees a simpler problem (just two classes).

Softmax is generally preferred because it produces a single, coherent probability distribution and is more efficient to train. One-vs-rest is useful when classes are imbalanced or when you want to parallelize training.

Engineering Takeaway

Logistic regression is one of the most deployed models in production because it balances simplicity, speed, and interpretability while producing calibrated probabilities.

Calibrated probabilities are a killer feature. Unlike many other models, logistic regression naturally outputs probabilities that can be trusted (assuming the training data is representative). This makes it suitable for systems where decisions are based on confidence thresholds, risk scoring, or cost-sensitive classification. You can directly use these probabilities in expected value calculations, Bayesian decision theory, or downstream models.

Threshold tuning is as important as model tuning. In production, the threshold is often more important than the model itself. Teams spend significant effort determining the right operating point on the precision-recall curve based on business metrics. This means evaluation isn’t just about accuracy—it’s about understanding the full ROC curve and choosing where to operate. Use validation sets to grid-search over thresholds and optimize for the metric that matters: revenue, customer satisfaction, regulatory compliance.

Cost-sensitive learning requires explicit cost modeling. Don’t assume false positives and false negatives are equally bad. Quantify the costs (even approximately), and tune the threshold accordingly. In some cases, you can weight training examples by cost during training (class weighting) or adjust the threshold post-training. The former changes the model; the latter changes the decision boundary. Both are valid depending on the application.

Fast training and inference enable real-time deployment. Like linear models, logistic regression trains via convex optimization (gradient descent, L-BFGS), which converges reliably. Inference is a dot product plus a sigmoid, typically well under a microsecond per prediction. Many ranking and recommendation systems use logistic regression as the final scoring layer because it can handle millions of predictions per second on a single CPU core.

Regularization prevents overfitting and stabilizes training. In high dimensions, unregularized logistic regression can overfit. L2 regularization (ridge) is standard and ensures numerical stability. L1 regularization (lasso) performs feature selection, driving irrelevant weights to zero. Almost all production logistic regression models use regularization—it’s not optional.

Interpretability remains strong despite nonlinearity. The weights still tell you feature importance, though the interpretation is in terms of log-odds rather than raw probability. A weight of $w_i = 0.5$ means increasing $x_i$ by 1 multiplies the odds of class 1 by $e^{0.5} \approx 1.65$. This is less intuitive than linear regression but far more interpretable than neural networks. You can audit decisions, explain predictions to users, and debug biases by inspecting weights.

Scalable to billions of examples with stochastic gradient descent. Logistic regression scales to massive datasets using mini-batch SGD or online learning. You can update weights incrementally as new data arrives, enabling continuous learning. This is why web companies (Google, Facebook) use logistic regression at the core of click prediction, feed ranking, and ad targeting systems—it scales to billions of users and trillions of events.


References and Further Reading

StatQuest: Logistic Regression – Josh Starmer https://statquest.org/video-index/

StatQuest videos are among the clearest explanations of statistical and machine learning concepts available. The logistic regression content breaks down the sigmoid transformation, maximum likelihood estimation, and the connection to odds ratios in a way that’s both rigorous and intuitive. Watching these videos will give you a solid foundation in how logistic regression works and why it’s trained the way it is.

Pattern Recognition and Machine Learning, Section 4.3 – Christopher Bishop https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/

Bishop’s treatment of logistic regression covers the probabilistic foundations, including the derivation from Bernoulli likelihood and the connection to generalized linear models. It’s more mathematical than StatQuest but essential if you want to understand why logistic regression produces calibrated probabilities and how it connects to Bayesian inference. This chapter also covers multi-class extensions (softmax) and iterative reweighted least squares for training.

Predicting Good Probabilities With Supervised Learning – Alexandru Niculescu-Mizil and Rich Caruana (2005) https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf

This paper empirically evaluates the calibration of different machine learning models and shows that logistic regression produces well-calibrated probabilities, while other models (SVMs, decision trees) often do not. It also introduces Platt scaling and isotonic regression for recalibrating poorly calibrated models. If you’re deploying models where probability estimates matter (risk scoring, medical diagnosis, betting), this paper is essential reading.