1 Decision Tree Learning
Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

2 Decision Trees
•  Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
•  Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.
•  Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
   –  red ∧ circle → pos
   –  red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C
   [Figures: two example trees, each testing color at the root (branches red, blue, green) and shape under the red branch (branches circle, square, triangle); the first labels its leaves pos/neg, the second A/B/C, matching the rules above.]

3 Properties of Decision Tree Learning
•  Continuous (real-valued) features can be handled by allowing nodes to split a real-valued feature into two ranges based on a threshold (e.g. length < 3 and length ≥ 3).
•  Classification trees have discrete class labels at the leaves; regression trees allow real-valued outputs at the leaves.
•  Algorithms for finding consistent trees are efficient for processing large amounts of training data for data-mining tasks.
•  Methods have been developed for handling noisy training data (both class and feature noise).
•  Methods have been developed for handling missing feature values.

4 Top-Down Decision Tree Induction
•  Recursively build a tree top-down by divide and conquer.
   [Figure: a small set of positive and negative instances is split on color (red, blue, green).]

5 Top-Down Decision Tree Induction
•  Recursively build a tree top-down by divide and conquer.
   [Figure: the red subset is split further on shape (circle, square, triangle); the resulting subsets are pure and become pos/neg leaves.]

6 Decision Tree Induction Pseudocode
DTree(examples, features) returns a tree
   If all examples are in one category, return a leaf node with that category label.
   Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
   Else pick a feature F and create a node R for it.
      For each possible value vi of F:
         Let examplesi be the subset of examples that have value vi for F.
         Add an out-going edge E to node R labeled with the value vi.
         If examplesi is empty, then attach a leaf node to edge E labeled with the category that is the most common in examples.
         Else call DTree(examplesi, features – {F}) and attach the resulting tree as the subtree under edge E.
      Return the subtree rooted at R.
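The following is a minimal Python sketch of the DTree pseudocode above (not Mooney's code). It assumes each example is a dict mapping feature names to values, labels is a parallel list of class labels, values maps each feature to its possible values, and choose_feature is any heuristic for picking the test feature; the information-gain heuristic introduced on the following slides is the usual choice.

from collections import Counter

def majority_class(labels):
    """Most common class label in a list of labels."""
    return Counter(labels).most_common(1)[0][0]

def dtree(examples, labels, features, values, choose_feature):
    """examples: list of dicts (feature -> value); labels: parallel list of class labels;
    features: set of features still available for testing; values: dict feature -> possible values."""
    if len(set(labels)) == 1:             # all examples are in one category
        return labels[0]                  # leaf labeled with that category
    if not features:                      # no features left to test
        return majority_class(labels)     # leaf labeled with the most common category
    f = choose_feature(examples, labels, features)
    node = {"feature": f, "branches": {}}
    for v in values[f]:                   # one out-going edge per possible value of f
        idx = [i for i, ex in enumerate(examples) if ex[f] == v]
        if not idx:                       # examples_i is empty
            node["branches"][v] = majority_class(labels)
        else:
            node["branches"][v] = dtree([examples[i] for i in idx],
                                        [labels[i] for i in idx],
                                        features - {f}, values, choose_feature)
    return node

A tree is represented here as nested dicts with class labels at the leaves; classifying a new instance just walks the dicts from the root down to a leaf.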
7 Picking a Good Split Feature
•  The goal is to have the resulting tree be as small as possible, per Occam's razor.
•  Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
•  The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
   –  General lesson in ML: "Greed is good."
•  We want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
•  There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with the ID3 system of Quinlan (1979).

8 Entropy
•  Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:
      Entropy(S) = −p1·log2(p1) − p0·log2(p0)
   where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
•  If all examples are in one category, entropy is zero (we define 0·log(0) = 0).
•  If examples are equally mixed (p1 = p0 = 0.5), entropy is at its maximum of 1.
•  Entropy can be viewed as the number of bits required on average to encode the class of an example in S, where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
•  For multi-class problems with c categories, entropy generalizes to:
      Entropy(S) = −Σ_{i=1..c} pi·log2(pi)

9 Entropy Plot for Binary Classification
[Figure: Entropy(S) plotted against the fraction of positive examples p1, from 0 at p1 = 0 up to 1 at p1 = 0.5 and back to 0 at p1 = 1.]

10 Information Gain
•  The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:
      Gain(S, F) = Entropy(S) − Σ_{v ∈ Values(F)} (|Sv|/|S|)·Entropy(Sv)
   where Sv is the subset of S having value v for feature F.
•  The entropy of each resulting subset is weighted by its relative size.
•  Example with four instances, 2+ and 2−, so Entropy(S) = 1:
   –  size: big → 1+,1− (E = 1); small → 1+,1− (E = 1); Gain = 1 − (0.5·1 + 0.5·1) = 0
   –  color: red → 2+,1− (E = 0.918); blue → 0+,1− (E = 0); Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
   –  shape: circle → 2+,1− (E = 0.918); square → 0+,1− (E = 0); Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311

11 Hypothesis Space Search
•  Performs batch learning that processes all training instances at once, rather than incremental learning that updates a hypothesis after each example.
•  Performs hill-climbing (greedy search) that may only find a locally optimal solution. Guaranteed to find a tree consistent with any conflict-free training set (i.e. identical feature vectors are always assigned the same class), but not necessarily the simplest tree.
•  Finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.

12 Bias in Decision-Tree Induction
•  Information gain gives a bias for trees with minimal depth.
•  Implements a search (preference) bias instead of a language (restriction) bias.

13 History of Decision-Tree Research
•  Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
•  In the late 1970s, Quinlan developed ID3 with the information-gain heuristic to learn expert systems from examples.
•  Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
•  In the 1980s a variety of improvements were introduced to handle noise, continuous features, and missing features, and splitting criteria were improved. Various expert-system development tools resulted.
•  Quinlan's updated decision-tree package (C4.5) was released in 1993.
•  Weka includes a Java version of C4.5 called J48.

14 Computational Complexity
•  The worst case builds a complete tree where every path tests every feature. Assume n examples and m features.
•  At each level i of the tree, the remaining m − i features must be examined for each instance at that level to calculate information gains:
      Σ_{i=1..m} n·i = O(n·m²)
   (a maximum of n examples are spread across all the nodes at each of the m levels).
•  However, the learned tree is rarely complete (the number of leaves is ≤ n). In practice, complexity is linear in both the number of features (m) and the number of training examples (n).
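As a concrete illustration of the Entropy and Gain formulas on slides 8 and 10, here is a short Python sketch. The four instances are an assumption chosen to reproduce the counts of the slide 10 example (2+, 2−; Gain(size) = 0, Gain(color) = Gain(shape) = 0.311), not necessarily the exact instances used in the lecture.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i), with the convention 0*log2(0) = 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, feature):
    """Gain(S, F) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[feature] for ex in examples):
        sub = [y for ex, y in zip(examples, labels) if ex[feature] == v]
        remainder += (len(sub) / n) * entropy(sub)
    return entropy(labels) - remainder

examples = [{"size": "big",   "color": "red",  "shape": "circle"},   # +
            {"size": "small", "color": "red",  "shape": "circle"},   # +
            {"size": "small", "color": "red",  "shape": "square"},   # -
            {"size": "big",   "color": "blue", "shape": "circle"}]   # -
labels = ["+", "+", "-", "-"]

for f in ("size", "color", "shape"):
    print(f, round(gain(examples, labels, f), 3))   # size 0.0, color 0.311, shape 0.311

Plugged into the earlier dtree sketch as choose_feature = lambda ex, y, feats: max(feats, key=lambda f: gain(ex, y, f)), this gives the greedy maximum-information-gain selection used by ID3.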
15 Overfitting
•  Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
   –  There may be noise in the training data that the tree is erroneously fitting.
   –  The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
•  A hypothesis h is said to overfit the training data if there exists another hypothesis h′ such that h has less error than h′ on the training data but greater error on independent test data.
   [Figure: accuracy on training data and accuracy on test data plotted against hypothesis complexity.]

16 Overfitting Example
•  Testing Ohm's Law: V = I·R (I = (1/R)·V).
•  Experimentally measure 10 points and fit a curve to the resulting data.
•  A 9th-degree polynomial fits the training data perfectly (a polynomial of degree n − 1 can fit n points exactly). "Ohm was wrong, we have found a more accurate function!"
   [Figure: current (I) vs. voltage (V) with the fitted 9th-degree polynomial.]

17 Overfitting Example
•  Testing Ohm's Law: V = I·R (I = (1/R)·V).
•  Better generalization is obtained with a linear function that fits the training data less accurately.
   [Figure: current (I) vs. voltage (V) with the linear fit.]

18 Bias-variance tradeoff
•  Another example.

19 Bias-variance tradeoff
•  A linear function works well, but a cubic seems better.
•  Is there a general view of the problem, preferably with a theoretical background?

20–25 Bias-variance tradeoff
•  y = f(x) + e, where E[e] = 0 and Var[e] = σ².
•  f′(·) … the estimate of f(·) learned by a classifier.
•  E[(y − f′(x))²] = Bias[f′(x)]² + Var[f′(x)] + σ²
   [Slides 20–25 repeat this decomposition over a sequence of accompanying plots.]

26 Overfitting Noise in Decision Trees
•  Category or feature noise can easily cause overfitting.
   –  Add a noisy instance labeled pos (but really neg).
   [Figure: the tree learned before the noisy instance is added: color at the root, with the red branch split on shape (circle, square, triangle).]

27 Overfitting Noise in Decision Trees
•  Category or feature noise can easily cause overfitting.
   –  Add a noisy instance labeled pos (but really neg).
   [Figure: to fit the noisy instance, the blue branch must now be split further on size (small, medium, big), adding structure just to accommodate one mislabeled example.]
•  Noise can also cause different instances of the same feature vector to have different classes. It is impossible to fit such data, and the leaf must be labeled with the majority class.
   –  e.g. an instance with the same feature vector as an existing positive example, but labeled neg (while really pos).
•  Conflicting examples can also arise if the features are incomplete and inadequate to determine the class, or if the target concept is non-deterministic.

28 Overfitting Prevention (Pruning) Methods
•  Two basic approaches for decision trees:
   –  Prepruning: stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
   –  Postpruning: grow the full tree, then remove subtrees that do not have sufficient evidence.
•  Label the leaf resulting from pruning with the majority class of the remaining data, or with a class probability distribution.
•  Methods for determining which subtrees to prune:
   –  Cross-validation: reserve some training data as a hold-out set (validation set, tuning set) to evaluate the utility of subtrees.
   –  Statistical test: use a statistical test on the training data to determine whether an observed regularity can be dismissed as likely due to random chance.
   –  Minimum description length (MDL): keep a subtree only if its additional complexity is smaller than the cost of explicitly remembering the exceptions that would result from pruning it.
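To make the hold-out (validation set) approach above concrete, here is a rough Python sketch of reduced-error postpruning. It assumes the nested-dict tree representation and the majority_class helper from the earlier sketches; it illustrates one simple pruning rule and is not the pruning method C4.5 actually uses.

def classify(tree, example, default):
    """Walk the nested-dict tree; fall back to default for unseen feature values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["feature"]], default)
    return tree

def accuracy(tree, examples, labels, default):
    if not labels:
        return 1.0                        # no validation data reaches this subtree
    hits = sum(classify(tree, ex, default) == y for ex, y in zip(examples, labels))
    return hits / len(labels)

def prune(tree, train_ex, train_y, val_ex, val_y, default):
    """Bottom-up: replace a subtree by its majority-class leaf whenever doing so
    does not reduce accuracy on the hold-out set."""
    if not isinstance(tree, dict):
        return tree                       # already a leaf
    f = tree["feature"]
    for v in list(tree["branches"]):
        t = [i for i, ex in enumerate(train_ex) if ex[f] == v]
        w = [i for i, ex in enumerate(val_ex) if ex[f] == v]
        tree["branches"][v] = prune(tree["branches"][v],
                                    [train_ex[i] for i in t], [train_y[i] for i in t],
                                    [val_ex[i] for i in w], [val_y[i] for i in w],
                                    default)
    leaf = majority_class(train_y) if train_y else default
    if accuracy(leaf, val_ex, val_y, default) >= accuracy(tree, val_ex, val_y, default):
        return leaf                       # pruning did not hurt: cut the subtree
    return tree

Ties favor the smaller tree, and subtrees reached by no validation example collapse to their training-majority leaf; both are design choices of this sketch, not requirements of the method.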
29 Additional Decision Tree Issues
•  Better splitting criteria
   –  Information gain prefers features with many values.
•  Continuous features
•  Predicting a real-valued function (regression trees)
•  Missing feature values
•  Features with costs
•  Misclassification costs
•  Incremental learning
•  Mining large databases that do not fit in main memory

30 C4.5
•  Based on the ID3 algorithm; author Ross Quinlan.
•  Included in all (or most) non-commercial and commercial data-mining tools.
•  Weka: C4.5 ver. 8 -> J48.
•  Scheme of the C4.5 algorithm: run several times and choose the best tree.
   –  Inner: take L% of the learning data randomly.
   –  Call ID3 (pre-pruning, see the -m parameter).
   –  Prune the tree (post-pruning, -cf).
   –  Take T% of unseen learning data for validation.
   –  If the validation criterion holds, exit.
   –  Otherwise add an increment to L and go to Inner.
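Below is a rough Python sketch of the training loop outlined on this slide, under several assumptions: build_and_prune stands for the ID3-plus-postpruning steps (e.g. the dtree and prune sketches above), evaluate returns validation accuracy, and L_START, L_INCREMENT, T_FRACTION and TARGET_ACCURACY are made-up parameters rather than C4.5 defaults. It also reads "T% of unseen learning data" as a fixed hold-out split.

import random

L_START, L_INCREMENT = 0.2, 0.1          # assumed initial window size and increment
T_FRACTION, TARGET_ACCURACY = 0.2, 0.9   # assumed validation share and criterion

def train_with_growing_window(examples, labels, build_and_prune, evaluate):
    data = list(zip(examples, labels))
    random.shuffle(data)                           # "take L% ... randomly"
    n_val = int(T_FRACTION * len(data))
    val, pool = data[:n_val], data[n_val:]         # unseen data kept for validation
    L = L_START
    while True:
        window = pool[:max(1, int(L * len(pool)))] # inner: take L% of the learning data
        tree = build_and_prune([ex for ex, _ in window], [y for _, y in window])
        acc = evaluate(tree, [ex for ex, _ in val], [y for _, y in val])
        if acc >= TARGET_ACCURACY or L >= 1.0:     # validation criterion holds (or data exhausted)
            return tree, acc
        L = min(1.0, L + L_INCREMENT)              # otherwise grow L and repeat the inner loop

The outer "run several times and choose the best tree" step would simply call this function repeatedly and keep the tree with the highest validation accuracy.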