Chapter 8: Decision Trees
Learning Rules from Data
A decision tree is a model that learns a hierarchy of yes/no questions to make predictions. Unlike linear models that compute weighted sums, decision trees partition the input space by recursively splitting data based on feature thresholds. The result is a set of learned rules that are both interpretable and powerful.
Decision trees represent a fundamentally different approach to learning. Instead of finding smooth boundaries defined by equations, they learn discrete decision logic directly from data. This makes them particularly well-suited for problems where human reasoning follows a similar branching structure—credit approval, medical diagnosis, fault detection—and where interpretability is as important as accuracy.
Splitting Data
The core operation in a decision tree is the split: choosing a feature and threshold that divides the data into two subsets that are more homogeneous with respect to the target variable. Homogeneity means that within each subset, most examples have the same label.
Consider a dataset of loan applications with features like income, credit score, and debt-to-income ratio. The root split might ask: “Is credit score ≥ 700?” This divides applicants into two groups. One group (credit score ≥ 700) is mostly approved loans. The other group (credit score < 700) is more mixed. We continue recursively, asking new questions of each subset.
The algorithm measures homogeneity using impurity metrics. Two common measures are:
Gini Impurity: Measures the probability of misclassifying a randomly chosen example if it were labeled according to the distribution in the node. For a node with C classes, where p_k is the proportion of examples in class k:

Gini = 1 − Σ_k p_k²
If all examples are the same class, p_k = 1 for one class (and 0 for all others) and Gini = 0 (perfect purity). If classes are evenly distributed, Gini is at its maximum (maximum impurity).
Entropy: Measures the average amount of information needed to specify the class of an example drawn from the node:

Entropy = − Σ_k p_k log₂ p_k
Entropy is 0 when the node is pure and maximum when classes are uniformly distributed. It’s derived from information theory and has a slightly different penalty structure than Gini, but in practice, the two produce similar trees.
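Both impurity measures are a few lines of code. The sketch below is illustrative (the function names are not from any library) and uses only the standard library:

```python
# Minimal implementations of the two impurity measures, computed from a
# list of class labels. Names here are illustrative, not from any library.
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

mixed = ["approve"] * 4 + ["deny"] * 4  # evenly split node

print(gini(mixed))     # 0.5: maximum Gini for two balanced classes
print(entropy(mixed))  # 1.0: maximum entropy (one full bit) for two classes
```

A pure node (all labels identical) scores 0 under both measures, matching the definitions above.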
The algorithm evaluates every possible split—every feature and every threshold value—and chooses the one that maximally reduces impurity. This greedy approach is computationally efficient and works well despite not being globally optimal.
Computational Complexity: For a dataset with n samples and d features, finding the best split at a node requires O(d · n log n) time. For each feature, the algorithm sorts the values (O(n log n)), then evaluates all possible thresholds (O(n)). Building the entire tree to depth D takes approximately O(d · n log n · D). This makes trees efficient for moderately sized datasets but expensive for high-dimensional data with millions of features.
Handling Continuous Features: For continuous features, the algorithm identifies candidate split points by sorting the feature values and considering midpoints between consecutive values that differ in their labels. If income values are [25k, 30k, 35k, 50k] and the first two are denied while the last two are approved, the algorithm tests the threshold 32.5k, the midpoint between 30k and 35k, where the labels change. This is efficient because you only need to test thresholds where the class distribution changes.
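The midpoint rule for continuous features can be sketched in a few lines (the function name is illustrative):

```python
# Candidate-threshold generation for one continuous feature, following the
# rule described in the text: sort the values, then take midpoints only
# where the label changes between neighbors.
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    return [
        (pairs[i - 1][0] + pairs[i][0]) / 2
        for i in range(1, len(pairs))
        if pairs[i - 1][1] != pairs[i][1]  # class distribution changes here
    ]

incomes = [25_000, 30_000, 35_000, 50_000]
decisions = ["deny", "deny", "approve", "approve"]
print(candidate_thresholds(incomes, decisions))  # [32500.0]
```

Only one candidate survives because the labels change just once in sorted order; midpoints between same-label neighbors are skipped entirely.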
Handling Missing Values: Decision trees can handle missing data gracefully through surrogate splits. When a feature value is missing, the algorithm uses a backup feature that correlates with the primary split. If credit score is missing but income is correlated with credit score, the tree can route the example using income instead. Alternatively, some implementations send examples with missing values down both branches and combine predictions.
Why Greedy Splitting Works: The greedy strategy—choosing the best split at each step without looking ahead—is not globally optimal. A split that looks poor now might enable excellent splits later. But exhaustive search over all possible tree structures is computationally intractable (the number of possible trees grows exponentially with the number of examples). Greedy splitting is a practical compromise that produces good trees efficiently.
Tree Growth
Building a decision tree is a recursive algorithm:
- Start with all training data at the root node
- Find the best split (feature + threshold) that reduces impurity the most
- Create two child nodes and partition the data based on the split
- Recursively apply the same process to each child node
- Stop when a stopping criterion is met
Stopping criteria include:
- The node is pure (all examples have the same label)
- The node has fewer than some minimum number of examples
- The tree has reached a maximum depth
- No split reduces impurity below some threshold
Without stopping criteria, the tree grows until every leaf contains exactly one training example. This results in perfect training accuracy—the tree memorizes every training example—but it will not generalize. This is the classic overfitting problem from Part I.
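The recursive algorithm above fits in a short, self-contained sketch. This is a teaching implementation, not production code; all names are illustrative, and the greedy search follows the description in the previous sections:

```python
# Recursive tree growth with Gini impurity and two stopping criteria
# (max depth, minimum samples). Illustrative sketch, not production code.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search: every feature, every midpoint threshold."""
    best, best_score = None, gini(labels)  # must improve on the parent node
    for feat in range(len(rows[0])):
        values = sorted(set(r[feat] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[feat] <= t]
            right = [l for r, l in zip(rows, labels) if r[feat] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (feat, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3, min_samples=2):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: purity, minimum samples, maximum depth.
    if len(set(labels)) == 1 or len(rows) < min_samples or depth >= max_depth:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # no split reduces impurity: stop here
        return {"leaf": majority}
    feat, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feat] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[feat] > t]
    return {"feature": feat, "threshold": t,
            "left": grow([r for r, _ in left], [l for _, l in left],
                         depth + 1, max_depth, min_samples),
            "right": grow([r for r, _ in right], [l for _, l in right],
                          depth + 1, max_depth, min_samples)}

# Toy loan data: (credit_score, income); only the last applicant is approved.
rows = [(650, 20_000), (680, 60_000), (720, 25_000), (750, 80_000)]
labels = ["deny", "deny", "deny", "approve"]
tree = grow(rows, labels)
print(tree)
```

On this toy data the root splits on credit score at 735 (the midpoint between 720 and 750), and both children are pure, so growth stops after one split.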
The diagram shows a decision tree for loan approval. Each internal node asks a question about a feature. Each leaf node makes a prediction. The path from root to leaf represents the learned rule for that prediction. For example: “If credit ≥ 700 and income ≥ 50k, then approve.”
Interpretability
Decision trees are interpretable in a way that linear models are not. A linear model tells you how much each feature contributes on average across all examples. A decision tree tells you exactly which features matter for a specific prediction and in what order.
Consider the loan approval tree. If an applicant is denied, you can trace the path through the tree and explain: “You were denied because your credit score is below 700 and your income is below $30,000.” This is a complete explanation, not a statistical correlation.
This interpretability makes decision trees valuable in domains where users need to understand and contest decisions—credit lending, medical diagnosis, hiring, criminal justice. Regulatory requirements like GDPR’s “right to explanation” favor models that can provide concrete reasoning.
But interpretability has limits. As trees grow deeper with more splits, they become harder to understand. A tree with 20 levels and hundreds of leaves is not interpretable, even though each path is technically a rule. In practice, shallow trees (depth 3-5) are interpretable; deeper trees are not.
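A shallow tree's rules can be printed directly. A quick sketch, assuming scikit-learn is available (the dataset choice is arbitrary):

```python
# Fit a depth-limited tree and dump its decision logic as nested
# if/else rules using scikit-learn's export_text.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)  # each root-to-leaf path is one human-readable rule
```

At depth 3 the printout fits on one screen; rerun with max_depth=20 and the same printout becomes thousands of lines, which is the interpretability limit described above.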
Feature Interactions and Nonlinearity
Decision trees automatically discover feature interactions without requiring explicit feature engineering. A linear model treats features independently—it learns a weight for income and a separate weight for credit score, then adds them together. To capture interactions like “high income compensates for moderate credit score,” you must manually create interaction terms like income × credit_score.
Trees discover these interactions naturally through sequential splits. The loan approval tree splits first on credit score, then on income within each credit score bucket. This creates different income thresholds for different credit score ranges—exactly the interaction a linear model would miss.
Consider predicting whether a customer will buy a product. The pattern might be: “Young customers with high income buy, regardless of location. Older customers only buy if they live in urban areas.” This is a three-way interaction between age, income, and location. A linear model cannot represent this without manually engineering interaction features. A decision tree discovers it automatically by splitting on age, then on income for young customers and on location for older customers.
Trees also handle nonlinearity without transformation. If the relationship between a feature and the outcome is a complex curve—house price increases logarithmically with square footage—a linear model requires you to manually apply a log transform. A tree learns the nonlinearity by creating multiple splits at different thresholds, effectively approximating the curve with a step function.
This power comes from axis-aligned splits. Each split divides the feature space into rectangular regions. The decision boundary is a combination of vertical and horizontal lines (or hyperplanes in higher dimensions). This differs from linear models, which create a single smooth boundary, and from neural networks, which can create arbitrarily shaped boundaries. Trees occupy a middle ground: more flexible than linear models, more structured than neural networks.
The downside: axis-aligned splits are inefficient for diagonal boundaries. If the true boundary is “x + y = 1,” a tree needs many splits to approximate it, while a linear model captures it perfectly with a single hyperplane. Trees excel when the true boundary aligns with feature axes and involves complex interactions.
Why Trees Overfit
Decision trees overfit aggressively. Without constraints, they grow until they memorize the training data. Every training example ends up in its own leaf, resulting in 100% training accuracy and poor test performance.
Why does this happen? Trees are extremely flexible. They can approximate any function by adding enough splits. This flexibility means they have high variance—small changes in training data can produce completely different trees. If one noisy example causes a different split at the root, the entire downstream tree structure changes. This instability is the hallmark of overfitting.
Consider training a tree on 1,000 loan applications. One example is a misclassified approval—someone with low credit score who got a loan due to special circumstances not captured in the features. If this example happens to fall near a split boundary, it might force the tree to create a new branch specifically to accommodate it. That branch is based on noise, not signal, and won’t generalize.
To visualize variance, imagine training trees on different random subsets of the same dataset. Each tree will have a different structure—different splits, different thresholds, different leaf predictions. The trees agree on general patterns (high credit score is good) but disagree on specifics (exactly what income threshold to use). This disagreement is variance: the model’s predictions fluctuate with the training sample.
Compare this to linear models, which have low variance. A linear model trained on different subsets will learn similar weight vectors—the training data constrains the model to a narrow hypothesis space. Trees have a vast hypothesis space (all possible tree structures), so different training samples can lead to vastly different hypotheses.
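One way to observe this instability, sketched here with scikit-learn and synthetic data (both assumptions, not from the original text): train the same unconstrained tree on two bootstrap resamples and compare the resulting structures.

```python
# Train identical unconstrained trees on two bootstrap resamples of the
# same dataset and inspect how the structures fluctuate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(2):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# The trees typically differ in node count and split thresholds even though
# both samples come from the same distribution; that fluctuation is variance.
print([t.tree_.node_count for t in trees])
```

Averaging predictions across many such trees is exactly the random-forest idea previewed at the end of this chapter.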
Preventing overfitting requires limiting this flexibility through regularization. Without regularization, trees will always memorize. Even a single deep tree on a small dataset will overfit catastrophically.
Pruning Strategies
Pruning constrains tree complexity to prevent overfitting. There are two main approaches: pre-pruning (stop growing early) and post-pruning (grow fully, then remove branches).
Pre-Pruning (Early Stopping)
Pre-pruning sets stopping criteria during tree growth. Once a criterion is met, the algorithm stops splitting that node, making it a leaf. Common pre-pruning criteria:
- Max depth: Limit tree depth to D levels. Shallow trees are less likely to memorize. Typical values: 3-10 for interpretable trees, 10-30 for ensembles.
- Min samples per leaf: Require each leaf to contain at least m examples. This prevents creating leaves for individual outliers. Typical values: 5-20 for classification, 1-10 for regression.
- Min samples per split: Require at least m examples to consider splitting a node. Similar to min samples per leaf but applies before the split.
- Min impurity decrease: Only split if the reduction in impurity exceeds some threshold ε. This prevents splits that barely improve purity but add complexity.
Pre-pruning is efficient—you avoid growing branches that will be removed later. But it’s greedy: you might stop too early. A split that looks useless now might enable great splits later. If max depth is 3, the tree never discovers a pattern that requires 4 levels of splits.
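The four pre-pruning criteria map directly onto scikit-learn's DecisionTreeClassifier constructor arguments. A sketch, assuming scikit-learn and synthetic data:

```python
# Each pre-pruning criterion from the text becomes one constructor argument.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,                 # max depth
    min_samples_leaf=10,         # min samples per leaf
    min_samples_split=20,        # min samples per split
    min_impurity_decrease=0.01,  # min impurity decrease
    random_state=0,
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```

Tightening any one of these arguments shrinks the tree; the values shown are illustrative starting points, not recommendations.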
Post-Pruning (Cost-Complexity Pruning)
Post-pruning grows the tree to maximum depth (or until leaves are pure), then removes branches that don’t improve validation performance. This overcomes pre-pruning’s greediness—you see the fully grown tree before deciding what to remove.
Cost-complexity pruning is the most principled approach. It defines a cost function that balances training accuracy and tree size:

Cost_α(T) = Error(T) + α · |T|

where Error(T) is the tree's training error, |T| is the number of leaves, and α is a regularization parameter controlling the tradeoff. For each subtree, the algorithm computes whether collapsing it (removing the split and making the parent a leaf) reduces the cost. It prunes the least useful branches first, continuing until the cost stops decreasing.
By varying α from 0 (no pruning) to large values (prune everything, leaving just a root node), you get a sequence of trees of different complexities. Use cross-validation to choose the α that minimizes validation error. This is analogous to tuning regularization strength in linear models.
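In scikit-learn this procedure is built in: cost_complexity_pruning_path computes the α sequence from the full tree, and ccp_alpha applies the pruning. A sketch on synthetic data:

```python
# Compute the cost-complexity alpha sequence, then cross-validate over it
# to pick the pruning strength with the best validation accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[scores.index(max(scores))]
print(best_alpha)
```

The smallest α in the sequence corresponds to the unpruned tree and the largest to a single root leaf; cross-validation picks a point between the two extremes.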
Reduced Error Pruning is simpler. Grow a full tree on training data. For each internal node, try replacing the subtree with a leaf (predicting the majority class at that node). Measure accuracy on a separate validation set. If accuracy improves or stays the same, prune the subtree. This is less principled than cost-complexity pruning but easier to implement.
Tradeoff: Interpretability vs Accuracy
Pruning forces a choice between simplicity and performance. A tree with 3 splits is interpretable but may miss important patterns. A tree with 100 splits captures more detail but is not human-readable. In production, the right tradeoff depends on the domain:
- High interpretability required (lending, medical diagnosis): Use aggressive pruning, max depth 3-5, accept lower accuracy.
- Accuracy matters most (fraud detection, ranking): Use deeper trees, light pruning, or skip individual trees entirely and use ensembles.
Even with optimal pruning, single decision trees have high variance. This is why ensembles—random forests and gradient boosting—are almost always better than single trees. Ensembles average away variance while preserving the benefits of tree-based learning.
Engineering Takeaway
Decision trees remain widely used in production for several reasons:
Interpretability is unmatched. In domains where decisions must be explained—lending, insurance, healthcare—decision trees provide human-readable rules. You can visualize the entire decision logic. You can trace any prediction back to the specific conditions that caused it. Regulatory compliance and user trust often outweigh small gains in accuracy from black-box models. When a loan application is denied, “your credit score is below 700 and your income is below $30k” is far more acceptable than “the neural network assigned you a low score.”
No preprocessing needed. Trees naturally handle both continuous and categorical features without requiring encoding schemes or normalization. They can split on “country = USA” as easily as “age > 30.” They’re robust to outliers—an extreme value just creates one split, rather than skewing the entire model. They handle missing values through surrogate splits or by sending examples down both branches. This simplifies feature engineering and makes trees resilient to messy real-world data.
Discovers feature interactions automatically. Unlike linear models, trees automatically capture interactions and nonlinear relationships. They don’t assume any functional form—they learn the structure directly from data. A tree can discover that “high income compensates for low credit score” without you creating an income × credit_score interaction term. This makes trees powerful when feature interactions are complex and unknown.
Feature importance is built-in. Trees provide natural feature importance scores based on how much each feature reduces impurity across all splits. Frequently used features with large impurity reductions are important; rarely used features are not. This helps with feature selection, debugging, and understanding what the model has learned. Feature importance from trees is more reliable than linear model coefficients when features are correlated or interactions exist.
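Reading these scores out of a fitted tree takes one attribute access. A sketch with scikit-learn (the dataset choice is arbitrary):

```python
# Impurity-based importances: each score is the total impurity reduction
# attributed to that feature across all splits, normalized to sum to 1.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

ranked = sorted(zip(data.feature_names, clf.feature_importances_),
                key=lambda p: p[1], reverse=True)
for name, score in ranked[:5]:  # the five most important features
    print(f"{name}: {score:.3f}")
```

Features the tree never splits on get a score of exactly 0, which makes the attribute a quick first pass at feature selection.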
Overfits without regularization. Single decision trees are high-variance models that memorize training data without constraints. You must limit depth, enforce minimum samples per leaf, or use pruning to prevent overfitting. Even with regularization, single trees are unstable—small changes in data cause large changes in structure. This limits their standalone use in production, where stability matters.
High variance makes single trees unreliable. Retrain a tree on a slightly different sample, and you get a completely different structure. This instability is why single trees are rarely deployed in modern systems. But variance is exactly what ensembles exploit: by averaging many high-variance trees, random forests and gradient boosting reduce variance without increasing bias. Understanding single-tree variance is essential to understanding why ensembles work.
Foundation for random forests and gradient boosting. The real power of decision trees comes from ensembles. Random forests and gradient boosting machines (covered in Chapter 9) are the dominant methods for tabular data in Kaggle competitions and production systems. These ensembles rely on trees as the base learner. Understanding individual trees—how they split, why they overfit, what makes them interpretable—is essential to understanding why ensembles are so effective.
The lesson: Decision trees trade smoothness for interpretability and flexibility. They overfit easily but provide the building blocks for some of the most powerful models in machine learning. Master single trees, and you understand the foundation of modern ensemble methods.
References and Further Reading
Induction of Decision Trees – J.R. Quinlan (1986) https://link.springer.com/article/10.1007/BF00116251
This is the original paper that introduced the ID3 algorithm, the foundation of modern decision tree learning. Quinlan describes how trees are grown using information gain (entropy reduction) and explores when trees overfit. Reading this paper gives you the historical context and core intuition behind recursive partitioning. ID3 was later extended to C4.5 and CART, but the fundamental ideas originated here.
Classification and Regression Trees (CART) – Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone (1984) https://www.taylorfrancis.com/books/mono/10.1201/9781315139470/classification-regression-trees-leo-breiman-jerome-friedman-richard-olshen-charles-stone
CART is the canonical reference for decision trees. It covers both classification and regression trees, introduces Gini impurity, and describes pruning strategies for regularization. This book is mathematical and thorough—it’s the definitive treatment of tree-based learning. Most modern tree implementations (scikit-learn, XGBoost) are based on CART. If you want to deeply understand how trees work, this is the authoritative source.
C4.5: Programs for Machine Learning – J.R. Quinlan (1993) Morgan Kaufmann Publishers
C4.5 is the successor to ID3 and one of the most influential decision tree algorithms. Quinlan addresses ID3’s limitations: handling continuous features efficiently, dealing with missing values, and pruning to prevent overfitting. C4.5 introduces gain ratio (normalized information gain) and post-pruning via error-based pruning. This book is practical—it describes a working system with code. C4.5 was the dominant tree algorithm until CART-based implementations became widespread. Understanding C4.5 shows how trees evolved from theoretical constructs to production-ready tools.