Machine Learning: The Art and Science of Algorithms that Make Sense of Data
Peter A. Flach
Intelligent Systems Laboratory, University of Bristol, United Kingdom
February 19, 2020

These slides accompany the above book published by Cambridge University Press in 2012, and are made freely available for teaching purposes (the copyright remains with the author, however). The material is divided into four difficulty levels, A (basic) to D (advanced); this PDF includes all material up to level B, with advanced material indicated up to level D.

cs.bris.ac.uk/~flach/mlbook/ — Machine Learning: Making Sense of Data — February 19, 2020

Assassinating spam e-mail

SpamAssassin is a widely used open-source spam filter. It calculates a score for an incoming e-mail, based on a number of built-in rules or ‘tests’ in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary report to the e-mail’s headers if the score is 5 or more.

-0.1  RCVD_IN_MXRATE_WL         RBL: MXRate recommends allowing [123.45.6.789 listed in sub.mxrate.net]
 0.6  HTML_IMAGE_RATIO_02       BODY: HTML has a low ratio of text to image area
 1.2  TVD_FW_GRAPHIC_NAME_MID   BODY: TVD_FW_GRAPHIC_NAME_MID
 0.0  HTML_MESSAGE              BODY: HTML included in message
 0.6  HTML_FONx_FACE_BAD        BODY: HTML font face is not a word
 1.4  SARE_GIF_ATTACH           FULL: Email has a inline gif
 0.1  BOUNCE_MESSAGE            MTA bounce message
 0.1  ANY_BOUNCE_MESSAGE        Message is some kind of bounce message
 1.4  AWL                       AWL: From: address is in the auto white-list

From left to right you see the score attached to a particular test, the test identifier, and a short description including a reference to the relevant part of the e-mail. As you can see, scores for individual tests can be negative (indicating evidence suggesting the e-mail is ham rather than spam) as well as positive. The overall score of 5.3 suggests the e-mail might be spam.
Example 1, p.2: Linear classification

Suppose we have only two tests and four training e-mails, one of which is spam (see Table 1). Both tests succeed for the spam e-mail; for one ham e-mail neither test succeeds, for another the first test succeeds and the second doesn’t, and for the third ham e-mail the first test fails and the second succeeds. It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’ these four e-mails into spam and ham. In the mathematical notation introduced in Background 1 we could describe this classifier as 4x1 + 4x2 > 5 or (4,4)·(x1,x2) > 5. In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is only exceeded when both tests succeed. We could even consider assigning different weights to the tests – as long as each weight is less than 5 and their sum exceeds 5 – although it is hard to see how this could be justified by the training data.

Table 1, p.3: Spam filtering as a classification task

E-mail   x1   x2   Spam?   4x1 + 4x2
  1       1    1     1         8
  2       0    0     0         0
  3       1    0     0         4
  4       0    1     0         4

The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5, we can separate spam from ham.

Figure 1, p.5: Linear classification in two dimensions

[Figure: positives (+) and negatives (–) on either side of a straight line, with points x1 and x2 on the boundary, the point x0, and the weight vector w perpendicular to the boundary.]

The straight line separates the positives from the negatives. It is defined by w·xi = t, where w is a vector perpendicular to the decision boundary and pointing in the direction of the positives, t is the decision threshold, and xi points to a point on the decision boundary.
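The thresholded linear function of Table 1 can be sketched in a few lines of Python (the weights and threshold are those of Example 1; the helper name `is_spam` is ours):

```python
# Linear classifier from Example 1: score an e-mail by its two test
# results and flag it as spam when the weighted sum exceeds the threshold.
weights = (4, 4)   # one weight per test
threshold = 5

def is_spam(x1, x2):
    """Return True iff 4*x1 + 4*x2 > 5."""
    score = weights[0] * x1 + weights[1] * x2
    return score > threshold

# The four training e-mails from Table 1:
emails = [(1, 1), (0, 0), (1, 0), (0, 1)]
print([is_spam(x1, x2) for x1, x2 in emails])  # [True, False, False, False]
```

Only the first e-mail (both tests succeed) exceeds the threshold, reproducing the right-most column of Table 1.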
In particular, x0 points in the same direction as w, from which it follows that w·x0 = ||w|| ||x0|| = t (||x|| denotes the length of the vector x).

Background 1, p.4: Homogeneous coordinates

It is sometimes convenient to simplify notation further by introducing an extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t. The extended data point is then x◦ = (1, x1, ..., xn) and the extended weight vector is w◦ = (−t, w1, ..., wn), leading to the decision rule w◦·x◦ > 0 and the decision boundary w◦·x◦ = 0. Thanks to these so-called homogeneous coordinates the decision boundary passes through the origin of the extended coordinate system, at the expense of needing an additional dimension. Note that this doesn’t really affect the data, as all data points and the ‘real’ decision boundary live in the plane x0 = 1.

Important point to remember: Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.

Figure 2, p.5: Machine learning for spam filtering

[Figure: e-mails are turned into data by the SpamAssassin tests and fed to a linear classifier, which outputs the ‘spam?’ decision; in blue, a ‘learn weights’ step maps training data to the classifier’s weights.]

At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin’s built-in tests, and a linear classifier is applied to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit that is done by machine learning.
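The equivalence between the original decision rule w·x > t and its homogeneous form w◦·x◦ > 0 can be checked with a small sketch (a hypothetical illustration reusing the weights from Example 1, not code from the book):

```python
# Homogeneous coordinates: fold the threshold t into the weight vector by
# prepending w0 = -t to the weights and x0 = 1 to every data point.
def extend(x):
    """Map a data point (x1, ..., xn) to (1, x1, ..., xn)."""
    return (1,) + tuple(x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

t = 5
w = (4, 4)
w_ext = (-t,) + w   # extended weight vector (-t, w1, ..., wn)

# The rule w_ext . extend(x) > 0 agrees with w . x > t on every point:
for x in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    assert (dot(w_ext, extend(x)) > 0) == (dot(w, x) > t)
```

Expanding the extended dot product shows why: w◦·x◦ = −t·1 + w·x, which is positive exactly when w·x > t.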
A Bayesian classifier I

Bayesian spam filters maintain a vocabulary of words and phrases – potential spam or ham indicators – for which statistics are collected from a training set. For instance, suppose that the word ‘Viagra’ occurred in four spam e-mails and in one ham e-mail. If we then encounter a new e-mail that contains the word ‘Viagra’, we might reason that the odds that this e-mail is spam are 4:1, or: the probability of it being spam is 0.80 and the probability of it being ham is 0.20.

The situation is slightly more subtle because we have to take into account the prevalence of spam. Suppose that I receive on average one spam e-mail for every six ham e-mails. This means that I would estimate the odds of an unseen e-mail being spam as 1:6, i.e., non-negligible but not very high either.

A Bayesian classifier II

If I then learn that the e-mail contains the word ‘Viagra’, which occurs four times as often in spam as in ham, I need to combine these two odds. As we shall see later, Bayes’ rule tells us that we should simply multiply them: 1:6 times 4:1 is 4:6, corresponding to a spam probability of 0.4. In this way you are combining two independent pieces of evidence, one concerning the prevalence of spam and the other concerning the occurrence of the word ‘Viagra’, pulling in opposite directions. The nice thing about this ‘Bayesian’ classification scheme is that it can be repeated if you have further evidence.
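Multiplying odds in this way can be sketched as follows (a minimal illustration using the numbers from the running example; odds are represented as (for, against) pairs, and the helper names are ours):

```python
from fractions import Fraction

def combine(*odds):
    """Multiply a sequence of odds, each given as a (for, against) pair."""
    num = den = 1
    for f, a in odds:
        num *= f
        den *= a
    return num, den

def to_probability(odds):
    """Convert odds f:a to the probability f / (f + a)."""
    f, a = odds
    return Fraction(f, f + a)

prior = (1, 6)    # one spam e-mail per six ham e-mails
viagra = (4, 1)   # 'Viagra' occurs four times as often in spam as in ham

posterior = combine(prior, viagra)                   # 1:6 times 4:1 = 4:6
print(posterior, float(to_probability(posterior)))   # (4, 6) 0.4
```

Because the scheme just multiplies pairwise, adding a further piece of evidence is one more argument to `combine`.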
For instance, suppose that the odds in favour of spam associated with the phrase ‘blue pill’ are estimated at 3:1, and suppose our e-mail contains both ‘Viagra’ and ‘blue pill’; then the combined odds are 4:1 times 3:1, which is 12:1 – ample to outweigh the 1:6 odds associated with the low prevalence of spam (total odds are 2:1, or a spam probability of 0.67, up from 0.40 without the ‘blue pill’).

A rule-based classifier

if the e-mail contains the word ‘Viagra’ then estimate the odds of spam as 4:1;
otherwise, if it contains the phrase ‘blue pill’ then estimate the odds of spam as 3:1;
otherwise, estimate the odds of spam as 1:6.

The first rule covers all e-mails containing the word ‘Viagra’, regardless of whether they contain the phrase ‘blue pill’, so no overcounting occurs. The second rule only covers e-mails containing the phrase ‘blue pill’ but not the word ‘Viagra’, by virtue of the ‘otherwise’ clause. The third rule covers all remaining e-mails: those that contain neither ‘Viagra’ nor ‘blue pill’.

Figure 3, p.11: How machine learning helps to solve a task

[Figure: domain objects are described by features to give data; a learning algorithm solves the learning problem (blue box) by turning training data into a model; the model maps data to output, addressing the task (red box).]

An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).

Important point to remember: Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
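The three-rule classifier from the earlier slide translates directly into a first-match if/elif/else chain (a sketch; the e-mail is represented simply as its text, and the function name is ours):

```python
def spam_odds(text):
    """Estimate the odds of spam as a (for, against) pair, applying the
    first matching rule only - so no evidence is counted twice."""
    if 'Viagra' in text:
        return (4, 1)
    elif 'blue pill' in text:
        return (3, 1)
    else:
        return (1, 6)

print(spam_odds('Cheap Viagra and blue pill'))  # (4, 1): first rule fires
print(spam_odds('Try the blue pill'))           # (3, 1)
print(spam_odds('Meeting at noon'))             # (1, 6)
```

Note the contrast with the Bayesian scheme: an e-mail containing both indicators gets odds 4:1 from the first matching rule, not the 12:1 obtained by multiplying.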
Important point to remember: Machine learning is concerned with using the right features to build the right models that achieve the right tasks.