Chapter 9: Ensembles

Why Many Weak Models Beat One Strong One

An ensemble is a collection of models that make predictions together. Rather than training a single powerful model, you train many weaker models and combine their predictions. This counterintuitive approach—that many mediocre predictors can outperform one sophisticated predictor—is one of the most important ideas in machine learning.

Ensembles dominate Kaggle competitions, fraud detection systems, recommendation engines, and risk scoring models. They achieve state-of-the-art performance on tabular data, often exceeding the accuracy of neural networks. Understanding why ensembles work requires understanding the bias-variance tradeoff from Part I and how averaging reduces variance without increasing bias.

Wisdom of Crowds

The core insight behind ensembles is that independent errors cancel out when averaged. If you have ten models that each make different mistakes, the average of their predictions will be more accurate than any individual prediction—assuming the errors are uncorrelated.

Consider predicting house prices. Model A might overestimate prices for houses near parks because it overfits to this pattern. Model B might underestimate prices for recently renovated houses. Model C might struggle with outliers in expensive neighborhoods. If these errors are independent, averaging the three predictions produces a more accurate estimate than any single model.

This is the same principle behind the “wisdom of crowds” phenomenon: asking 100 people to guess the number of jellybeans in a jar and averaging their guesses often produces a more accurate estimate than any individual guess. Individual guesses have random errors that cancel out in the average, leaving a more accurate central estimate.

Mathematically, if each model has error variance $\sigma^2$ and the errors are uncorrelated, the variance of the average of $n$ models is:

$$\text{Var}(\text{average}) = \frac{\sigma^2}{n}$$

This means the variance decreases in proportion to $1/n$. With 100 models, the variance is 1/100th of a single model’s variance. This is variance reduction without increasing bias—which directly improves generalization.

The critical requirement is that the models’ errors are uncorrelated. If all models make the same mistakes, averaging doesn’t help. This is why ensemble methods intentionally introduce diversity—by training on different subsets of data, using different features, or randomizing the training process.
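A quick simulation makes both points concrete: averaging independent errors shrinks variance by roughly $1/n$, while a shared (correlated) error component survives averaging. This is an illustrative sketch with synthetic Gaussian noise, not a model of any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 100, 10_000

# Independent errors: each model's prediction is the truth (0) plus N(0, 1) noise.
independent = rng.normal(size=(n_trials, n_models))
print(f"single model error variance:   {independent[:, 0].var():.3f}")
print(f"100-model average variance:    {independent.mean(axis=1).var():.3f}")

# Correlated errors: a shared noise component that averaging cannot remove.
shared = rng.normal(size=(n_trials, 1))
correlated = 0.7 * shared + 0.3 * rng.normal(size=(n_trials, n_models))
print(f"correlated average variance:   {correlated.mean(axis=1).var():.3f}")
```

The first variance is near 1, the second near 0.01 (the predicted $\sigma^2/n$), while the correlated ensemble stays near 0.49 because the shared component never cancels.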

Random Forests

A random forest is an ensemble of decision trees where each tree is trained on a different random subset of the training data and considers a random subset of features at each split. This randomness ensures that the trees are diverse enough that their errors are uncorrelated.

The algorithm works as follows:

  1. Bootstrap sampling: For each tree, sample the training data with replacement. This creates a dataset of the same size where some examples appear multiple times and some don’t appear at all. Each tree sees a slightly different view of the data.

  2. Random feature selection: At each split in each tree, only consider a random subset of features. If there are 100 features, each split might only evaluate 10 randomly chosen features. This prevents trees from all using the same strong features and forces them to explore different patterns.

  3. Grow deep trees: Each tree is grown deep without pruning, allowing it to fit the training data closely. Normally this would cause overfitting, but because we average many trees, the ensemble generalizes well even though each individual tree overfits.

  4. Average predictions: For regression, average the predictions of all trees. For classification, take a majority vote or average the predicted probabilities.
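The four steps above are what scikit-learn's `RandomForestClassifier` does internally. A minimal sketch on synthetic data (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 trees, each grown deep on a bootstrap sample, with sqrt(d) candidate
# features per split (scikit-learn's default for classification).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")
```

Classification predictions come from averaging the trees' predicted probabilities; for regression, `RandomForestRegressor` averages the trees' numeric outputs.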

[Figure: Random Forests diagram]

The diagram shows how random forests work: bootstrap sampling creates diverse training sets, each tree makes a prediction, and the average is the final output. Individual predictions vary, but the average is stable and accurate.

Why does this work so well? Decision trees are high-variance models—they overfit and are sensitive to small changes in data. But they have relatively low bias—they can approximate complex functions. Random forests reduce variance through averaging while preserving the low bias. This directly addresses the bias-variance tradeoff.

Out-of-Bag Error Estimation: Random forests provide a clever way to estimate test error without a separate validation set. During bootstrap sampling, each tree sees about 63% of the training data (the expected fraction of distinct examples in a bootstrap sample is $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 63.2\%$). The remaining 37%—the “out-of-bag” (OOB) examples—were not used to train that tree. You can use these OOB examples as a validation set for that tree. Aggregate the OOB predictions across all trees, and you get an unbiased estimate of test error. This is essentially free cross-validation—you get a validation estimate without holding out data.
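In scikit-learn, OOB estimation is a single flag. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

# oob_score=True scores each training example using only the trees that
# did not see it during bootstrap sampling--a free validation estimate.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print(f"out-of-bag accuracy estimate: {forest.oob_score_:.3f}")
```

The OOB estimate typically tracks held-out test accuracy closely, which is why it is a useful sanity check when data is scarce.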

Feature Importance: Random forests compute feature importance by measuring how much each feature reduces impurity (Gini or entropy) across all splits in all trees. Features used frequently at the top of trees (where they split large, impure nodes) have high importance. Features rarely used or used only in deep nodes have low importance. This aggregated importance score is more robust than single-tree importance because it averages across many trees with different structures.
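Impurity-based importances are exposed directly on fitted forests. A sketch where only 3 of 10 features carry signal, so the importance scores should concentrate on them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features out of 10; the rest are pure noise.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = forest.feature_importances_  # impurity-based, sums to 1
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: importance {importances[i]:.3f}")
```

The scores sum to 1 across features, and the informative features should dominate the ranking.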

Hyperparameters: Key hyperparameters for random forests include:

  • Number of trees ($n_{\text{estimators}}$): More trees almost always improve performance, with diminishing returns after 100-500 trees. Random forests rarely overfit from too many trees—more trees just reduce variance further.
  • Max features per split ($\text{max\_features}$): Typical values are $\sqrt{d}$ for classification and $d/3$ for regression, where $d$ is the number of features. Smaller values increase diversity but may miss important features.
  • Max depth and min samples per leaf: These control individual tree complexity. Random forests typically use deep trees (high max depth, low min samples per leaf) to reduce bias, relying on averaging to control variance.

Gradient Boosting

Boosting takes a different approach to ensembles. Instead of training many trees independently and averaging them, boosting trains trees sequentially, where each new tree focuses on the mistakes of the previous trees. This sequential error correction produces highly accurate models.

The most important boosting algorithm is gradient boosting, which works as follows:

  1. Initialize: Start with a simple prediction (e.g., the mean of the target variable).

  2. Compute residuals: Calculate the errors (residuals) of the current ensemble on the training data.

  3. Train a tree on residuals: Train a new tree to predict these residuals—not the original target, but the errors the ensemble currently makes.

  4. Add to ensemble: Add this new tree to the ensemble with a small weight (learning rate). The ensemble now predicts: old prediction + (learning rate × new tree).

  5. Repeat: Iterate, each time training a new tree to correct the remaining errors.

After many iterations, the ensemble consists of hundreds or thousands of trees, each one correcting the mistakes of its predecessors. The final prediction is the sum of all trees’ predictions, scaled by the learning rate.
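The five-step loop above is short enough to write from scratch. A minimal sketch using small regression trees as the base learner, fitting a noisy sine curve (the data and settings are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # step 1: initialize with the mean
trees = []

for _ in range(200):
    residuals = y - prediction                     # step 2: current errors
    tree = DecisionTreeRegressor(max_depth=3)      # step 3: fit a small tree
    tree.fit(X, residuals)                         #         to the residuals
    prediction += learning_rate * tree.predict(X)  # step 4: shrink and add
    trees.append(tree)                             # step 5: repeat

print(f"final training MSE: {np.mean((y - prediction) ** 2):.4f}")
```

Predicting on new data means summing `learning_rate * tree.predict(X_new)` over all trees, added to the initial mean, exactly as described above.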

Gradient boosting is called “gradient” boosting because it’s actually performing gradient descent in function space. The residuals are the negative gradient of the loss function with respect to the predictions. By fitting trees to residuals, we’re moving the predictions in the direction that reduces loss—just like gradient descent adjusts parameters to reduce loss.

This connection to optimization is fundamental. Boosting is not just an ensemble method—it’s an optimization algorithm that incrementally builds a model by fitting additive components that reduce loss.

Learning Rate (Shrinkage): The learning rate controls the contribution of each tree. A small learning rate (e.g., 0.01-0.1) means each tree has a small effect, requiring more trees but producing better generalization. A large learning rate (e.g., 0.3-1.0) means fewer trees are needed but the model may overfit. There’s a tradeoff between number of trees and learning rate: small learning rate + many trees ≈ large learning rate + few trees, but the former generalizes better.

Why Boosting Reduces Bias: Early trees capture the main patterns—the low-hanging fruit. Later trees capture subtler patterns, interactions, and edge cases that earlier trees missed. This sequential refinement reduces bias by allowing the model to fit increasingly complex functions. Each tree adds a small correction, and thousands of corrections sum to a highly accurate predictor.

Overfitting Risk: Unlike random forests, gradient boosting can overfit if you train too many trees. Each new tree fits the training residuals, and eventually, it starts fitting noise rather than signal. This is why boosting uses early stopping: monitor validation error and stop adding trees when validation error stops improving. Alternatively, use regularization techniques like tree depth limits, subsampling, or stronger learning rate decay.
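Early stopping is built into most implementations. A sketch with scikit-learn's `GradientBoostingRegressor`, which holds out an internal validation fraction and stops when its loss plateaus (the dataset and thresholds here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                       random_state=0)

# Hold out 10% internally; stop once validation loss has not improved
# for 20 consecutive boosting rounds, well before the 1000-tree budget.
gbr = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.05,
    validation_fraction=0.1, n_iter_no_change=20, random_state=0,
)
gbr.fit(X, y)
print(f"trees actually trained: {gbr.n_estimators_}")
```

`n_estimators_` reports how many trees were kept, which is usually far fewer than the budget when early stopping fires.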

Modern Variants: Modern gradient boosting implementations add optimizations and regularization:

  • XGBoost (Extreme Gradient Boosting): Adds L1 and L2 regularization on tree weights, handles sparse data efficiently, and includes advanced splitting algorithms. It’s highly optimized for speed and dominates Kaggle competitions.

  • LightGBM (Light Gradient Boosting Machine): Uses histogram-based splitting instead of sorting features at each split, making it much faster for large datasets. It also uses leaf-wise tree growth (versus XGBoost’s level-wise growth), which can produce more accurate trees with the same number of leaves.

  • CatBoost (Categorical Boosting): Handles categorical features natively without requiring one-hot encoding. It uses ordered boosting to reduce overfitting and target leakage. It’s particularly strong when your data has many categorical features.

All three are production-grade implementations with GPU support, distributed training, and extensive tuning options. They dominate tabular data problems in industry and competitions.

Other Ensemble Methods

Beyond random forests and gradient boosting, several other ensemble techniques are used in practice:

Bagging (Bootstrap Aggregating): Bagging is the general technique behind random forests. Train multiple models on bootstrap samples (sampling with replacement) and average their predictions. Bagging works with any base model—decision trees, neural networks, even linear models—but it’s most effective with high-variance models like deep trees. Bagging reduces variance without increasing bias, making unstable models more robust.

Voting Ensembles: Combine predictions from multiple different model types. For example, train a random forest, a gradient boosting model, and a logistic regression model, then:

  • Hard voting (classification): Each model votes for a class, and the majority wins.
  • Soft voting (classification): Average the predicted probabilities from each model and choose the class with the highest average probability.
  • Averaging (regression): Average the predictions from each model.

Voting works best when the base models are diverse—they make different types of errors. Combining a tree-based model with a linear model can capture both nonlinear patterns (trees) and linear trends (logistic regression).
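Combining a tree-based model with a linear model is a one-liner with scikit-learn's `VotingClassifier`. A sketch using soft voting on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities; "hard" counts votes
)
vote.fit(X_train, y_train)
print(f"soft-voting test accuracy: {vote.score(X_test, y_test):.3f}")
```

Switching `voting="soft"` to `"hard"` implements majority voting instead of probability averaging.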

Stacking (Stacked Generalization): Stacking trains a meta-learner on top of base models. First, train several base models (e.g., random forest, gradient boosting, SVM). Then, use their predictions as features to train a meta-learner (often logistic regression or a simple neural network) that learns how to best combine them. The meta-learner discovers which base models are reliable in which situations. Stacking can outperform simple averaging but requires careful cross-validation to avoid overfitting.

Blending: Blending is a simpler version of stacking. Split the training data into two parts. Train base models on the first part, get their predictions on the second part (holdout set), then train the meta-learner on these holdout predictions. Blending is easier to implement than full stacking but uses less data for training base models.
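Stacking can be sketched with scikit-learn's `StackingClassifier`, which handles the careful cross-validation internally: meta-features for each training example come from base models that never saw that example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("boost", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # meta-features come from 5-fold CV predictions, avoiding leakage
)
stack.fit(X_train, y_train)
print(f"stacking test accuracy: {stack.score(X_test, y_test):.3f}")
```

The `cv=5` setting is what separates proper stacking from naively training the meta-learner on in-sample base-model predictions, which would leak training labels.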

When to Use Each:

  • Bagging: When you have a high-variance model and want to stabilize it without changing the model type.
  • Voting: When you have multiple trained models and want a simple way to combine them.
  • Stacking: When you want the best possible accuracy and can afford the complexity of training a meta-learner.
  • Boosting: When you want state-of-the-art accuracy on tabular data and can tune hyperparameters carefully.

Hyperparameter Tuning for Ensembles

Ensembles have many hyperparameters that control the bias-variance tradeoff. Effective tuning is critical for production performance.

Random Forest Hyperparameters:

  • Number of trees ($n_{\text{estimators}}$): Start with 100, increase to 200-500 if you have compute. More trees rarely hurt.
  • Max features ($\text{max\_features}$): Default of $\sqrt{d}$ (classification) or $d/3$ (regression) works well. Decrease to increase diversity, increase to let trees find the best features.
  • Max depth: Default is unlimited (grow until leaves are pure). If overfitting, limit to 10-30.
  • Min samples per leaf: Default is 1. Increase to 5-20 if you have noisy data or want to prevent overfitting.

Random forests are robust to hyperparameters—defaults usually work well. The main tuning knob is the number of trees.

Gradient Boosting Hyperparameters:

  • Learning rate (shrinkage): Typical values: 0.01-0.3. Smaller is better but requires more trees. Use 0.01-0.05 for large datasets, 0.1-0.2 for small datasets.
  • Number of trees ($n_{\text{estimators}}$): Depends on learning rate. With learning rate 0.1, try 100-1000 trees. With learning rate 0.01, try 1000-5000 trees. Use early stopping to find the optimal number.
  • Max depth: Typical values: 3-10. Shallow trees (3-6) prevent overfitting and are faster. Deeper trees (7-10) capture more interactions.
  • Subsample: Fraction of training data to use per tree (stochastic gradient boosting). Typical: 0.5-1.0. Lower values (0.5-0.8) prevent overfitting by adding randomness.
  • Min child weight / min samples per leaf: Regularization to prevent overfitting. Typical: 1-10.

Gradient boosting requires more tuning than random forests. Start with learning rate 0.1, max depth 6, and 100-500 trees, then tune.

Tuning Strategies:

  • Grid search: Try all combinations of hyperparameters from predefined lists. Exhaustive but slow.
  • Random search: Sample random combinations of hyperparameters. Often finds good solutions faster than grid search because not all hyperparameters matter equally.
  • Bayesian optimization: Use a surrogate model to predict which hyperparameters will perform well, then test those. More efficient than random search but requires more setup.

Always use cross-validation to evaluate hyperparameters. K-fold cross-validation (typically k=5 or k=10) ensures your tuning doesn’t overfit to a single train/validation split. For time series data, use time-based splitting (train on past, validate on future).
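Random search with cross-validation can be sketched with scikit-learn's `RandomizedSearchCV`; the parameter ranges below mirror the typical values listed above (the dataset and budget are illustrative, kept small for speed):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Distributions to sample from; uniform(a, b) covers the interval [a, a + b).
param_distributions = {
    "learning_rate": uniform(0.01, 0.29),  # [0.01, 0.30)
    "max_depth": randint(3, 11),           # 3..10
    "subsample": uniform(0.5, 0.5),        # [0.5, 1.0)
    "n_estimators": randint(50, 201),      # 50..200
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=8,   # evaluate 8 random configurations
    cv=5,       # 5-fold cross-validation per configuration
    random_state=0,
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Because every configuration is scored by 5-fold cross-validation rather than a single split, the selected hyperparameters are less likely to overfit the validation data.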

Bias-Variance Reduction

Why do ensembles work? Because they reduce variance without significantly increasing bias.

Random forests reduce variance by averaging many high-variance, low-bias trees. Each tree can overfit, but the average cannot—random errors cancel out, leaving the systematic patterns. This is pure variance reduction.

Boosting reduces bias by sequentially adding trees that correct errors. Early trees capture the main patterns (low bias). Later trees capture subtler patterns and interactions (further reducing bias). The small learning rate and early stopping prevent overfitting, keeping variance controlled.

Both methods exploit the bias-variance tradeoff from Chapter 4, but in different ways:

  • Random forests: low bias (flexible trees), reduced variance (averaging)
  • Boosting: reduced bias (sequential correction), controlled variance (small learning rate, regularization)

This is why ensembles dominate tabular data. Neural networks are powerful for images, text, and audio because they learn hierarchical representations. But for structured data—database tables with mixed features—ensembles of trees are often more accurate, more robust, and faster to train.

Engineering Takeaway

Ensembles are the workhorses of production machine learning for structured data. Random forests and gradient boosting (especially implementations like XGBoost, LightGBM, and CatBoost) power fraud detection, credit scoring, recommendation systems, and ranking models at companies like Google, Amazon, and financial institutions.

State-of-the-art on tabular data. Ensembles consistently win Kaggle competitions involving structured data. When the input is rows and columns rather than images or text, ensembles are the default choice. They handle missing values, mixed data types, and noisy features better than neural networks. In production systems processing transactional data, user behavior logs, or sensor readings, gradient boosting is often the go-to algorithm.

Feature importance and interpretation. Ensembles provide reliable feature importance scores, which help with feature selection, model debugging, and explaining predictions to stakeholders. SHAP (SHapley Additive exPlanations) values extend this further by attributing each prediction to specific feature contributions, making ensembles viable in regulated domains like lending and healthcare. You can explain individual predictions while maintaining high accuracy.

Robustness to hyperparameters. Random forests work well with default settings—they’re remarkably forgiving. Gradient boosting requires more tuning (learning rate, number of trees, max depth) but is far more forgiving than neural networks, which have dozens of architectural and optimization hyperparameters. This robustness reduces iteration time in production: you can get a strong baseline quickly, then tune incrementally.

Fast parallel training. Random forests parallelize perfectly—each tree is independent. Modern implementations (scikit-learn, XGBoost, LightGBM) train hundreds of trees in parallel across CPU cores. Gradient boosting is inherently sequential (each tree depends on previous residuals), but modern implementations parallelize the tree-growing step and support GPU acceleration. You can train models on millions of rows in minutes.

Handles complex interactions naturally. Trees discover feature interactions automatically through sequential splits. Ensembles aggregate these interactions across hundreds of trees, capturing complex nonlinear relationships without manual feature engineering. This makes ensembles powerful when you don’t fully understand which features interact or when the interaction structure is too complex to engineer manually.

Overfitting resistance varies by method. Random forests rarely overfit—adding more trees almost always helps. Gradient boosting can overfit if you train too many trees, but early stopping prevents this by monitoring validation error. Both methods are more resistant to overfitting than deep neural networks on small-to-medium tabular datasets (< 100k examples), where neural networks tend to memorize.

Scalability with modern implementations. XGBoost, LightGBM, and CatBoost scale to billions of examples and millions of features through algorithmic optimizations: histogram-based splitting, sparsity-aware algorithms, and distributed training. LightGBM’s leaf-wise growth strategy trains faster than XGBoost’s level-wise strategy on large datasets. CatBoost’s categorical feature handling eliminates expensive one-hot encoding. These implementations make ensembles production-ready at scale.

The lesson: If your data is tabular—features in columns, examples in rows—your first attempt should be gradient boosting or random forests. Reserve neural networks for domains where hierarchical feature learning is critical. Ensembles are not just competitive; they’re often the best choice for structured data.


References and Further Reading

Random Forests – Leo Breiman (2001) https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

This is the original paper that introduced random forests. Breiman explains bootstrap aggregating (bagging), random feature selection, and why the method works so well. He also introduces out-of-bag error estimation and feature importance measures. Reading this paper gives you the foundational intuition for why diversity in ensembles leads to variance reduction. It’s remarkably accessible for a machine learning paper and shows the thought process behind one of ML’s most successful algorithms.

Greedy Function Approximation: A Gradient Boosting Machine – Jerome Friedman (2001) https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

This is the paper that formalized gradient boosting as an optimization procedure. Friedman shows that boosting is equivalent to gradient descent in function space and introduces regularization techniques (shrinkage, subsampling) that make boosting practical. This paper is more technical than Breiman’s random forests paper, but it’s essential for understanding why boosting works and how modern implementations like XGBoost are designed. It reveals the deep connection between ensemble learning and optimization.

XGBoost: A Scalable Tree Boosting System – Tianqi Chen and Carlos Guestrin (2016) https://arxiv.org/abs/1603.02754

XGBoost is one of the most widely used machine learning libraries in production and Kaggle competitions. This paper explains the engineering optimizations that make gradient boosting fast and scalable: sparsity-aware learning, cache-aware algorithms, parallel tree construction, and distributed training. Reading this shows how algorithmic ideas become production systems. XGBoost’s dominance in ML competitions demonstrates that engineering and optimization matter as much as algorithmic novelty.