Instance-Based Learning
Original slides: Raymond J. Mooney, University of Texas at Austin

Instance-Based Learning
• Unlike most other learning algorithms, does not involve construction of an explicit abstract generalization, but classifies new instances based on direct comparison and similarity to known training instances.
• Training can be very easy: just memorize the training instances.
• Testing can be very expensive, requiring detailed comparison to all past training instances.
• Also known as:
  – Case-based
  – Exemplar-based
  – Nearest Neighbor
  – Memory-based
  – Lazy Learning

Similarity/Distance Metrics
• Instance-based methods assume a function for determining the similarity or distance between any two instances.
• For continuous feature vectors, Euclidean distance is the generic choice (sketched in code at the end of these notes):
  d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Other Distance Metrics
• Mahalanobis distance
  – Scale-invariant metric that normalizes for variance.
• Cosine similarity
  – Cosine of the angle between the two vectors.
  – Used in text and other high-dimensional data.
• Pearson correlation
  – Standard statistical correlation coefficient.
  – Used for bioinformatics data.
• Edit distance
  – Used to measure the distance between strings of unbounded length.
  – Used in text and bioinformatics.

K-Nearest Neighbor
• Calculate the distance between a test point and every training instance.
• Pick the k closest training examples and assign the test instance to the most common category among these nearest neighbors.
• Voting among multiple neighbors helps decrease susceptibility to noise.
• Usually use an odd value for k to avoid ties.
• A minimal implementation is sketched at the end of these notes.

5-Nearest Neighbor Example
  (Figure illustrating 5-NN classification; not reproduced here.)

Implicit Classification Function
• Although it is not necessary to calculate it explicitly, the learned classification rule is based on the regions of the feature space closest to each training example.
• For 1-nearest neighbor with Euclidean distance, the Voronoi diagram gives the complex polyhedra that segment the space into the regions closest to each point.

Feature Relevance and Weighting
• Standard distance metrics weight each feature equally when determining similarity.
  – Problematic if many features are irrelevant, since similarity along many irrelevant features could mislead the classification.
• Features can be weighted by some measure that indicates their ability to discriminate the category of an example, such as information gain (see the weighted-distance sketch at the end of these notes).
• Overall, instance-based methods favor global similarity over concept simplicity.

Conclusions
• IBL methods classify test instances based on similarity to specific training instances rather than forming explicit generalizations.
• They typically trade decreased training time for increased testing time.
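
Code Sketch: Distance Metrics
To make the metrics above concrete, here is a minimal Python sketch of Euclidean distance, cosine similarity, and (Levenshtein) edit distance, using only the standard library. The function names are illustrative choices, not from the original slides.

import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def edit_distance(s, t):
    """Levenshtein edit distance between two strings, via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,         # delete from s
                            curr[j - 1] + 1,     # insert into s
                            prev[j - 1] + cost)) # substitute
        prev = curr
    return prev[-1]

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # -> 5.0
print(cosine_similarity((1.0, 0.0), (1.0, 1.0)))   # -> ~0.707
print(edit_distance("kitten", "sitting"))          # -> 3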
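
Code Sketch: K-Nearest Neighbor
A minimal sketch of the k-NN procedure from the K-Nearest Neighbor slide: compute the distance from the test point to every training instance, take the k closest, and vote. The tiny two-class data set in the usage example is made up for illustration.

import math
from collections import Counter

def knn_classify(test_point, training_data, k=5):
    """training_data: list of (feature_vector, label) pairs."""
    def dist(x, y):  # Euclidean distance, as in the previous sketch
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    # Sort all training instances by distance to the test point...
    neighbors = sorted(training_data,
                       key=lambda pair: dist(test_point, pair[0]))
    # ...and take a majority vote among the k closest.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Usage on a made-up two-class data set:
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
print(knn_classify((1.1, 1.0), train, k=3))  # -> A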
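
Code Sketch: Feature Weighting
One simple way to realize the feature-weighting idea from the Feature Relevance and Weighting slide is a weighted Euclidean distance that scales each feature's squared difference by a per-feature weight (e.g., an information-gain score computed beforehand). The weights here are assumed inputs; computing information gain itself is not shown.

import math

def weighted_euclidean_distance(x, y, weights):
    """Euclidean distance with a per-feature weight; a weight near zero
    effectively removes an irrelevant feature from the comparison."""
    return math.sqrt(sum(w * (xi - yi) ** 2
                         for w, xi, yi in zip(weights, x, y)))

# Usage: the second feature is weighted as twice as informative as the first.
print(weighted_euclidean_distance((1.0, 2.0), (2.0, 4.0), (0.5, 1.0)))  # -> ~2.12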