Decision Tree Learning
Original slides by Raymond J. Mooney, University of Texas at Austin

Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
– red ∧ circle → pos
– red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C

Properties of Decision Tree Learning
• Continuous (real-valued) features can be handled by allowing nodes to split a real-valued feature into two ranges based on a threshold (e.g. length < 3 and length ≥ 3).
• Classification trees have discrete class labels at the leaves; regression trees allow real-valued outputs at the leaves.
• Algorithms for finding consistent trees are efficient enough to process large amounts of training data for data-mining tasks.
• Methods have been developed for handling noisy training data (both class and feature noise).
• Methods have been developed for handling missing feature values.

Top-Down Decision Tree Induction
• Recursively build a tree top-down by divide and conquer.

Decision Tree Induction Pseudocode
• (See the runnable sketch following the Information Gain slides below.)

Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
– General lesson in ML: "Greed is good."
• Want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain and originated with the ID3 system of Quinlan (1979).

Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:
  Entropy(S) = − p_1 log₂(p_1) − p_0 log₂(p_0)
  where p_1 is the fraction of positive examples in S and p_0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0 × log(0) = 0).
• If examples are equally mixed (p_1 = p_0 = 0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S when data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:
  Entropy(S) = − Σ_{i=1..c} p_i log₂(p_i)

Entropy Plot for Binary Classification
• (Plot of Entropy(S) as a function of p_1: zero at p_1 = 0 and p_1 = 1, with a maximum of 1 bit at p_1 = 0.5.)

Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:
  Gain(S, F) = Entropy(S) − Σ_{v ∈ Values(F)} (|S_v| / |S|) · Entropy(S_v)
  where S_v is the subset of S having value v for feature F.
• The entropy of each resulting subset is weighted by its relative size.
• Example: computing the gain of each feature on a small set of positive and negative training examples.
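To make the induction pseudocode and the entropy/information-gain definitions above concrete, here is a minimal Python sketch. It is not the slides' original pseudocode: the function names, the dict-based example representation, and the nested-dict tree representation are illustrative assumptions; it handles only discrete features and does no pruning.

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Expected reduction in entropy from splitting on `feature`.
    `examples` is a list of dicts mapping feature names to discrete values."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def induce_tree(examples, labels, features):
    """Greedy top-down induction; returns a nested dict {feature: {value: subtree}}
    with class labels at the leaves."""
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not features:                   # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = induce_tree([examples[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [f for f in features if f != best])
    return tree

For example, on three toy instances:

examples = [{"color": "red", "shape": "circle"},
            {"color": "red", "shape": "square"},
            {"color": "blue", "shape": "circle"}]
labels = ["pos", "neg", "neg"]
print(induce_tree(examples, labels, ["color", "shape"]))
# -> a nested dict such as
# {'color': {'red': {'shape': {'circle': 'pos', 'square': 'neg'}}, 'blue': 'neg'}}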
Hypothesis Space Search
• Performs batch learning that processes all training instances at once, rather than incremental learning that updates a hypothesis after each example.
• Performs hill-climbing (greedy search) that may only find a locally optimal solution. It is guaranteed to find a tree consistent with any conflict-free training set (i.e. identical feature vectors are always assigned the same class), but not necessarily the simplest such tree.
• Finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.

Bias in Decision-Tree Induction
• Information gain gives a bias for trees with minimal depth.
• Implements a search (preference) bias instead of a language (restriction) bias.

History of Decision-Tree Research
• In the 1960s, Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning.
• In the late 1970s, Quinlan developed ID3 with the information-gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
• In the 1980s a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
• Quinlan's updated decision-tree package (C4.5) was released in 1993.
• Weka includes a Java version of C4.5 called J48.

Overfitting Prevention (Pruning) Methods
• Two basic approaches for decision trees:
– Prepruning: Stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
– Postpruning: Grow the full tree, then remove subtrees that do not have sufficient evidence.
• Label the leaf resulting from pruning with the majority class of the remaining data, or with a class probability distribution.
• Methods for determining which subtrees to prune:
– Cross-validation: Reserve some training data as a hold-out set (validation set, tuning set) to evaluate the utility of subtrees.
– Statistical test: Use a statistical test on the training data to determine whether any observed regularity can be dismissed as likely due to random chance.
– Minimum description length (MDL): Keep a subtree only if encoding it is simpler than explicitly remembering the exceptions that would result from pruning it.

Reduced Error Pruning
• A post-pruning, cross-validation approach.
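As a rough illustration of this idea (grow the full tree, then greedily replace subtrees with majority-class leaves whenever accuracy on a held-out validation set does not get worse), here is a sketch that operates on the nested-dict trees produced by the induce_tree sketch above. It is an assumption-laden illustration, not code from the slides; in particular, the training data passed in is assumed to be the data the tree was grown from, so that every branch has at least one training example.

from collections import Counter

def classify(tree, example, default=None):
    """Walk a nested-dict tree (as built by induce_tree) to predict a class."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(example[feature], default)
    return tree

def accuracy(tree, examples, labels, default):
    """Fraction of examples the tree classifies correctly."""
    return sum(classify(tree, ex, default) == lab
               for ex, lab in zip(examples, labels)) / len(labels)

def reduced_error_prune(tree, train_ex, train_lab, val_ex, val_lab):
    """Bottom-up reduced-error pruning against a held-out validation set:
    prune each subtree first, then replace this node with a majority-class
    leaf if that does not lower validation accuracy."""
    if not isinstance(tree, dict):        # already a leaf
        return tree
    majority = Counter(train_lab).most_common(1)[0][0]
    feature = next(iter(tree))
    for v in list(tree[feature]):
        tr = [i for i, ex in enumerate(train_ex) if ex[feature] == v]
        va = [i for i, ex in enumerate(val_ex) if ex[feature] == v]
        tree[feature][v] = reduced_error_prune(
            tree[feature][v],
            [train_ex[i] for i in tr], [train_lab[i] for i in tr],
            [val_ex[i] for i in va], [val_lab[i] for i in va])
    if not val_lab:                       # no validation data reaches this node: keep subtree
        return tree
    keep = accuracy(tree, val_ex, val_lab, majority)
    prune = sum(lab == majority for lab in val_lab) / len(val_lab)
    return majority if prune >= keep else tree

This single bottom-up pass is one common formulation; a fuller treatment iterates, at each step pruning the subtree whose replacement most improves validation accuracy, and stops when no replacement helps.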