Decision Trees
Decision trees are among the most intuitive machine learning models ever devised. A trained decision tree reads like a flowchart: ask a question about a feature, follow one branch or another based on the answer, ask another question, and eventually arrive at a prediction. A doctor diagnosing a condition by ruling out possibilities one at a time is roughly following a decision tree. That intuitive structure is both the algorithm's greatest strength and the source of its principal weakness.
Structure and Mechanics
A decision tree is a hierarchical model consisting of: - A root node: the starting point, representing the entire training set. - Internal nodes: each applies a test to one feature. For numerical features the test is of the form 'xⱼ ≤ threshold.' For categorical features it might be 'xⱼ ∈ {value_a, value_b}.' - Branches: the possible outcomes of each test. - Leaf nodes: the terminal nodes that output a prediction — a class label for classification, a numerical value for regression. A concrete worked example — classifying whether a mushroom is edible: Root: Is the cap color red? Yes → Leaf: Poisonous (prediction based on training data majority) No → Internal: Does it have an unpleasant odor? Yes → Leaf: Poisonous No → Leaf: Edible This three-node tree (one root, one internal, three leaves) represents the learned rule. Given a new mushroom — blue cap, no unpleasant odor — we follow: Is cap red? No. Unpleasant odor? No. Prediction: Edible. For numerical predictions (regression trees), leaf nodes output the mean of the training examples that reach that leaf, rather than a class label.
A decision tree's prediction can always be explained as a sequence of human-readable rules. For high-stakes domains — medicine, law, credit scoring — this interpretability is not merely convenient, it is sometimes legally required. Regulators may demand that an institution explain why a loan was denied. A decision tree provides that explanation directly. A 200-layer neural network does not.
How does a tree decide where to split? It uses an impurity measure to quantify how mixed the classes are at each node, then chooses the split that reduces impurity the most. Gini impurity for a node with K classes: Gini = 1 - Σ pₖ² where pₖ is the fraction of examples in the node belonging to class k. A pure node (all one class) has Gini = 0. A maximally mixed two-class node (50/50) has Gini = 0.5. Information gain (entropy-based): uses entropy H = -Σ pₖ log₂(pₖ). A split is chosen to maximize the decrease in entropy. A worked example with Gini: Suppose a node contains 10 examples: 7 edible, 3 poisonous. p_edible = 0.7, p_poisonous = 0.3 Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42 After splitting on 'has ring on stem,' suppose one child has 6 edible / 0 poisonous (pure, Gini=0) and the other has 1 edible / 3 poisonous (Gini = 1-(0.25²+0.75²) = 1-0.625 = 0.375). The weighted average Gini after the split is: (6/10)(0) + (4/10)(0.375) = 0 + 0.15 = 0.15 Gini improvement = 0.42 - 0.15 = 0.27. This is the information gain of that split.
A fully grown decision tree will partition the training data until every leaf is pure — it will memorize every training example, including noise. On the test set, performance collapses. The cure is pruning: either stop growing early (pre-pruning, using a maximum depth or minimum samples per leaf) or grow fully then remove branches that add little information (post-pruning). In practice, maximum depth between 3 and 10 is common. Deeper trees are almost always overfit.
Flashcards — click each card to reveal the answer
Strengths, Weaknesses, and the Overfitting Problem
Decision trees are non-parametric — they do not assume a fixed functional form for the relationship between features and target. This makes them highly flexible and capable of capturing complex, nonlinear relationships. They handle both numerical and categorical features without preprocessing, require no feature scaling, and naturally perform feature selection — splits on uninformative features are simply never chosen. Their weaknesses are equally significant: High variance: small changes in training data can produce a completely different tree structure. Two models trained on slightly different samples of the same population may arrive at opposite splitting strategies. Axis-aligned boundaries: every split is perpendicular to one feature axis. A diagonal boundary in the feature space requires many splits to approximate, making trees inefficient for that geometry. Instability: a single noisy training example near a boundary can change the root split, cascading into an entirely different tree. These weaknesses motivate ensemble methods — the insight that combining many imperfect trees produces a far more robust model. That is the topic of Lesson 7.
A decision tree trained to classify emails achieves 100% accuracy on training data but 62% on the test set. A tree with maximum depth 4 achieves 87% training accuracy and 81% test accuracy. What is the correct conclusion?
You have a node with 12 examples: 9 in class A, 3 in class B. What is its Gini impurity?
Grow a Tree by Hand
- You have eight training examples with two features: study_hours (x₁) and slept_well (x₂, 0=no 1=yes), and label passed (y, 0=fail 1=pass).
- x₁=2, x₂=0, y=0
- x₁=3, x₂=1, y=1
- x₁=1, x₂=0, y=0
- x₁=4, x₂=0, y=1
- x₁=2, x₂=1, y=1
- x₁=1, x₂=1, y=0
- x₁=5, x₂=0, y=1
- x₁=3, x₂=0, y=0
- Step 1: Compute the Gini impurity of the full root node (4 pass, 4 fail).
- Step 2: Consider splitting on x₁ ≤ 2 (group 1: x₁≤2; group 2: x₁>2). Compute the Gini for each child group, then compute the weighted Gini after the split.
- Step 3: Consider splitting on x₂ = 1 (group 1: slept well; group 2: did not sleep well). Compute the weighted Gini after this split.
- Step 4: Which split gives greater information gain? Choose it as the root split.
- Step 5: For the larger child node, apply one more split of your choice. Draw the resulting tree.
- Step 6: Test your tree on this new example: x₁=4, x₂=1. What does it predict?