Decision Tree Learning
Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
  – red ∧ circle → pos
  – red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C

Properties of Decision Tree Learning
• Continuous (real-valued) features can be handled by allowing nodes to split a real-valued feature into two ranges based on a threshold (e.g. length < 3 and length ≥ 3).
• Classification trees have discrete class labels at the leaves; regression trees allow real-valued outputs at the leaves.
• Algorithms for finding consistent trees are efficient for processing large amounts of training data for data mining tasks.
• Methods have been developed for handling noisy training data (both class and feature noise).
• Methods have been developed for handling missing feature values.

Top-Down Decision Tree Induction
• Recursively build a tree top-down by divide and conquer.

Decision Tree Induction Pseudocode
• See the sketch following the Information Gain slide below.

Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple tree but does not guarantee finding the smallest.
  – General lesson in ML: "Greed is good."
• Want to pick a feature that creates subsets of examples that are relatively "pure" in a single class so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with the ID3 system of Quinlan (1979).

Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:
  Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)
  where p_1 is the fraction of positive examples in S and p_0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0·log(0) = 0).
• If examples are equally mixed (p_1 = p_0 = 0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S, where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:
  Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)

Entropy Plot for Binary Classification

Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:
  Gain(S, F) = Entropy(S) - \sum_{v \in Values(F)} \frac{|S_v|}{|S|} Entropy(S_v)
  where S_v is the subset of S having value v for feature F.
• Entropy of each resulting subset is weighted by its relative size.
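The pseudocode slide and the two formulas above can be tied together in one small sketch. The following Python code is a minimal ID3-style illustration, not the exact pseudocode from the original slides: the function names (entropy, information_gain, induce_tree, classify), the dict-based tree representation, and the toy color/shape data set are my own choices. It performs greedy top-down divide-and-conquer induction for discrete features, using information gain as the split heuristic.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Expected reduction in entropy from splitting on `feature`.

    `examples` is a list of dicts mapping feature names to discrete values.
    """
    n = len(examples)
    # Partition the labels by the value each example has for `feature`.
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[feature], []).append(y)
    after = sum((len(sub) / n) * entropy(sub) for sub in partitions.values())
    return entropy(labels) - after

def induce_tree(examples, labels, features):
    """Greedy top-down (ID3-style) induction for discrete features.

    Returns either a class label (leaf) or a dict
    {"feature": f, "branches": {value: subtree, ...}, "default": majority_class}.
    """
    majority = Counter(labels).most_common(1)[0][0]
    # Base cases: node is pure, or no features are left to test.
    if len(set(labels)) == 1 or not features:
        return majority
    # Pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for v in {x[best] for x in examples}:
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        branches[v] = induce_tree([examples[i] for i in idx],
                                  [labels[i] for i in idx],
                                  remaining)
    return {"feature": best, "branches": branches, "default": majority}

def classify(tree, x):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["feature"]], tree["default"])
    return tree

# Tiny example mirroring the color/shape features used in the slides.
if __name__ == "__main__":
    X = [{"color": "red", "shape": "circle"},
         {"color": "red", "shape": "square"},
         {"color": "blue", "shape": "circle"},
         {"color": "green", "shape": "triangle"}]
    y = ["pos", "neg", "neg", "neg"]
    tree = induce_tree(X, y, ["color", "shape"])
    print(tree)
    print(classify(tree, {"color": "red", "shape": "circle"}))  # -> "pos"
```

On this toy set the learned tree is equivalent to the rule red ∧ circle → pos from the Decision Trees slide; all other combinations fall through to neg leaves.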
Hypothesis Space Search
• Performs batch learning that processes all training instances at once, rather than incremental learning that updates a hypothesis after each example.
• Performs hill-climbing (greedy search) that may only find a locally-optimal solution. Guaranteed to find a tree consistent with any conflict-free training set (i.e. identical feature vectors are always assigned the same class), but not necessarily the simplest tree.
• Finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.

Bias in Decision-Tree Induction
• Information gain gives a bias for trees with minimal depth.
• Implements a search (preference) bias instead of a language (restriction) bias.

History of Decision-Tree Research
• Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960's.
• In the late 70's, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
• In the 1980's a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
• Quinlan's updated decision-tree package (C4.5) was released in 1993.
• Weka includes a Java version of C4.5 called J48.

Computational Complexity
• Worst case builds a complete tree where every path tests every feature. Assume n examples and m features.
• At each level i in the tree, must examine the remaining m − i features for each instance at that level to calculate info gains.
• However, the learned tree is rarely complete (number of leaves is ≤ n). In practice, complexity is linear in both the number of features (m) and the number of training examples (n).

Overfitting
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
  – There may be noise in the training data that the tree is erroneously fitting.
  – The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
• A hypothesis, h, is said to overfit the training data if there exists another hypothesis, h′, such that h has less error than h′ on the training data but greater error on independent test data.

Overfitting Example

Overfitting Noise in Decision Trees
• Category or feature noise can easily cause overfitting.
  – Add a noisy instance labeled pos (but really neg).

Overfitting Prevention (Pruning) Methods
• Two basic approaches for decision trees:
  – Prepruning: Stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
  – Postpruning: Grow the full tree, then remove subtrees that do not have sufficient evidence.
• Label the leaf resulting from pruning with the majority class of the remaining data, or a class probability distribution.
• Methods for determining which subtrees to prune (a sketch of the hold-out-set approach follows this slide):
  – Cross-validation: Reserve some training data as a hold-out set (validation set, tuning set) to evaluate the utility of subtrees.
  – Statistical test: Use a statistical test on the training data to determine if any observed regularity can be dismissed as likely due to random chance.
  – Minimum description length (MDL): Determine whether the additional complexity of the hypothesis is less than the cost of explicitly remembering any exceptions that result from pruning.
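The hold-out-set option above can be made concrete with reduced-error pruning, a standard postpruning method; the slides do not specify this exact procedure, so treat the following as one plausible sketch. It assumes the dict-based tree representation from the induction sketch earlier (internal nodes store "feature", "branches", and a majority-class "default"), and the classify helper is repeated here so the block stands on its own.

```python
def classify(tree, x):
    """Leaf-following routine, same representation as the induction sketch above."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["feature"]], tree["default"])
    return tree

def reduced_error_prune(tree, val_examples, val_labels):
    """Bottom-up reduced-error postpruning against a hold-out validation set.

    Each internal node is tentatively collapsed to a leaf predicting its
    majority class ("default"); the collapse is kept if it does not reduce
    accuracy on the validation examples that reach the node. Nodes reached
    by no validation example are collapsed. Mutates and returns the tree.
    """
    if not isinstance(tree, dict):
        return tree  # already a leaf
    feature, default = tree["feature"], tree["default"]
    # Route validation examples down each branch and prune the children first.
    for v, subtree in tree["branches"].items():
        idx = [i for i, x in enumerate(val_examples) if x[feature] == v]
        tree["branches"][v] = reduced_error_prune(
            subtree,
            [val_examples[i] for i in idx],
            [val_labels[i] for i in idx])
    # Tentatively replace this node with a majority-class leaf.
    leaf_correct = sum(y == default for y in val_labels)
    tree_correct = sum(classify(tree, x) == y
                       for x, y in zip(val_examples, val_labels))
    return default if leaf_correct >= tree_correct else tree
```

In practice the training data would be split, e.g. a larger portion passed to induce_tree and the remainder used as the validation set passed here.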
Additional Decision Tree Issues (continued)
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
• Mining large databases that do not fit in main memory

C4.5
• Based on the ID3 algorithm; author: Ross Quinlan.
• Included in all (or most) non-commercial and commercial data mining tools.
• In Weka: Trees -> J48.

Algorithms
• AQ
  – Ryszard Michalski
  – Bottom-up

Regression trees
• Linear and multinomial regression
• Regression tree
• CART
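The continuous-features and regression-tree bullets above can be illustrated with a CART-style threshold split: candidate thresholds are midpoints between consecutive sorted feature values, the split minimizing the total squared error around each side's mean is chosen, and leaves predict the mean target. This is a minimal, generic sketch, not the actual CART implementation (CART grows a full tree recursively and prunes it); the function names (sse, best_threshold, regression_stump) and the toy length/target data are illustrative choices.

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean -- the impurity used for regression splits."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_threshold(xs, ys):
    """Best binary split (x <= t vs. x > t) for one continuous feature.

    Candidate thresholds are midpoints between consecutive sorted values;
    returns (threshold, split_sse), or (None, sse(ys)) if no split helps.
    """
    pairs = sorted(zip(xs, ys))
    best_t, best_score = None, sse(ys)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = sse(left) + sse(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

def regression_stump(xs, ys):
    """One-level regression tree: a threshold test with mean-value leaves."""
    t, _ = best_threshold(xs, ys)
    if t is None:
        return {"threshold": None, "predict": lambda x: mean(ys)}
    left_mean = mean([y for x, y in zip(xs, ys) if x <= t])
    right_mean = mean([y for x, y in zip(xs, ys) if x > t])
    return {"threshold": t,
            "predict": lambda x: left_mean if x <= t else right_mean}

# Example: a length feature whose target jumps around length = 3,
# echoing the length < 3 / length >= 3 threshold mentioned earlier.
lengths = [1.0, 2.0, 2.5, 3.5, 4.0, 5.0]
targets = [1.1, 0.9, 1.0, 4.2, 3.9, 4.1]
stump = regression_stump(lengths, targets)
print(stump["threshold"])     # 3.0 -- midpoint between 2.5 and 3.5
print(stump["predict"](2.0))  # ~1.0
print(stump["predict"](4.5))  # ~4.07
```

A full regression tree would apply this split search recursively to each side until a stopping (prepruning) or pruning (postpruning) criterion is met.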