Machine Learning: The Art and Science of Algorithms that Make Sense of Data
Peter A. Flach
Intelligent Systems Laboratory, University of Bristol, United Kingdom
February 19, 2020

These slides accompany the above book published by Cambridge University Press in 2012, and are made freely available for teaching purposes (the copyright remains with the author, however). The material is divided into four difficulty levels, A (basic) to D (advanced); this PDF includes all material up to level B, with advanced material indicated up to level D.

cs.bris.ac.uk/~flach/mlbook/ — Machine Learning: Making Sense of Data — February 19, 2020

Assassinating spam e-mail

SpamAssassin is a widely used open-source spam filter. It calculates a score for an incoming e-mail, based on a number of built-in rules or ‘tests’ in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary report to the e-mail’s headers if the score is 5 or more.

-0.1  RCVD_IN_MXRATE_WL         RBL: MXRate recommends allowing [123.45.6.789 listed in sub.mxrate.net]
 0.6  HTML_IMAGE_RATIO_02       BODY: HTML has a low ratio of text to image area
 1.2  TVD_FW_GRAPHIC_NAME_MID   BODY: TVD_FW_GRAPHIC_NAME_MID
 0.0  HTML_MESSAGE              BODY: HTML included in message
 0.6  HTML_FONx_FACE_BAD        BODY: HTML font face is not a word
 1.4  SARE_GIF_ATTACH           FULL: Email has a inline gif
 0.1  BOUNCE_MESSAGE            MTA bounce message
 0.1  ANY_BOUNCE_MESSAGE        Message is some kind of bounce message
 1.4  AWL                       AWL: From: address is in the auto white-list

From left to right you see the score attached to a particular test, the test identifier, and a short description including a reference to the relevant part of the e-mail. As you can see, scores for individual tests can be negative (indicating evidence suggesting the e-mail is ham rather than spam) as well as positive. The overall score of 5.3 suggests the e-mail might be spam.
Example 1, p.2: Linear classification

Suppose we have only two tests and four training e-mails, one of which is spam (see Table 1). Both tests succeed for the spam e-mail; for one ham e-mail neither test succeeds, for another the first test succeeds and the second doesn’t, and for the third ham e-mail the first test fails and the second succeeds. It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’ these four e-mails into spam and ham. In the mathematical notation introduced in Background 1 we could describe this classifier as 4x1 + 4x2 > 5 or (4,4)·(x1,x2) > 5. In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is only exceeded when both tests succeed. We could even consider assigning different weights to the tests – as long as each weight is less than 5 and their sum exceeds 5 – although it is hard to see how this could be justified by the training data.

Table 1, p.3: Spam filtering as a classification task

E-mail   x1   x2   Spam?   4x1 + 4x2
  1       1    1     1         8
  2       0    0     0         0
  3       1    0     0         4
  4       0    1     0         4

The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5, we can separate spam from ham.

Figure 1, p.5: Linear classification in two dimensions

[Figure: positives (+) and negatives (–) on either side of a straight line, with points x1 and x2 on the boundary, the point x0, and the weight vector w perpendicular to the boundary.]

The straight line separates the positives from the negatives. It is defined by w·xi = t, where w is a vector perpendicular to the decision boundary and pointing in the direction of the positives, t is the decision threshold, and xi points to a point on the decision boundary.
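The thresholded linear function of Table 1 can be sketched in a few lines of Python (the weights and threshold are those of Example 1; the helper name `is_spam` is ours):

```python
# Linear classifier from Example 1: score an e-mail by its two test
# results and flag it as spam when the weighted sum exceeds the threshold.
weights = (4, 4)   # one weight per test
threshold = 5

def is_spam(x1, x2):
    """Return True iff 4*x1 + 4*x2 > 5."""
    score = weights[0] * x1 + weights[1] * x2
    return score > threshold

# The four training e-mails from Table 1:
emails = [(1, 1), (0, 0), (1, 0), (0, 1)]
print([is_spam(x1, x2) for x1, x2 in emails])  # [True, False, False, False]
```

Only the first e-mail (both tests succeed) exceeds the threshold, reproducing the right-most column of Table 1.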
In particular, x0 points in the same direction as w, from which it follows that w·x0 = ||w|| ||x0|| = t (||x|| denotes the length of the vector x).

Background 1, p.4: Homogeneous coordinates

It is sometimes convenient to simplify notation further by introducing an extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t. The extended data point is then x◦ = (1, x1, ..., xn) and the extended weight vector is w◦ = (−t, w1, ..., wn), leading to the decision rule w◦·x◦ > 0 and the decision boundary w◦·x◦ = 0. Thanks to these so-called homogeneous coordinates the decision boundary passes through the origin of the extended coordinate system, at the expense of needing an additional dimension. Note that this doesn’t really affect the data, as all data points and the ‘real’ decision boundary live in the plane x0 = 1.

Important point to remember: Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.

Figure 2, p.5: Machine learning for spam filtering

[Figure: e-mails are turned into data by the SpamAssassin tests and fed to a linear classifier, which outputs the ‘spam?’ decision; in blue, a ‘learn weights’ step maps training data to the classifier’s weights.]

At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin’s built-in tests, and a linear classifier is applied to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit that is done by machine learning.
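The equivalence between the original decision rule w·x > t and its homogeneous form w◦·x◦ > 0 can be checked with a small sketch (a hypothetical illustration reusing the weights from Example 1, not code from the book):

```python
# Homogeneous coordinates: fold the threshold t into the weight vector by
# prepending w0 = -t to the weights and x0 = 1 to every data point.
def extend(x):
    """Map a data point (x1, ..., xn) to (1, x1, ..., xn)."""
    return (1,) + tuple(x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

t = 5
w = (4, 4)
w_ext = (-t,) + w   # extended weight vector (-t, w1, ..., wn)

# The rule w_ext . extend(x) > 0 agrees with w . x > t on every point:
for x in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    assert (dot(w_ext, extend(x)) > 0) == (dot(w, x) > t)
```

Expanding the extended dot product shows why: w◦·x◦ = −t·1 + w·x, which is positive exactly when w·x > t.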
A Bayesian classifier I

Bayesian spam filters maintain a vocabulary of words and phrases – potential spam or ham indicators – for which statistics are collected from a training set. For instance, suppose that the word ‘Viagra’ occurred in four spam e-mails and in one ham e-mail. If we then encounter a new e-mail that contains the word ‘Viagra’, we might reason that the odds that this e-mail is spam are 4:1, or: the probability of it being spam is 0.80 and the probability of it being ham is 0.20.

The situation is slightly more subtle because we have to take into account the prevalence of spam. Suppose that I receive on average one spam e-mail for every six ham e-mails. This means that I would estimate the odds of an unseen e-mail being spam as 1:6, i.e., non-negligible but not very high either.

A Bayesian classifier II

If I then learn that the e-mail contains the word ‘Viagra’, which occurs four times as often in spam as in ham, I need to combine these two odds. As we shall see later, Bayes’ rule tells us that we should simply multiply them: 1:6 times 4:1 is 4:6, corresponding to a spam probability of 0.4. In this way you are combining two independent pieces of evidence, one concerning the prevalence of spam and the other concerning the occurrence of the word ‘Viagra’, pulling in opposite directions. The nice thing about this ‘Bayesian’ classification scheme is that it can be repeated if you have further evidence.
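Multiplying odds in this way can be sketched as follows (a minimal illustration using the numbers from the running example; odds are represented as (for, against) pairs, and the helper names are ours):

```python
from fractions import Fraction

def combine(*odds):
    """Multiply a sequence of odds, each given as a (for, against) pair."""
    num = den = 1
    for f, a in odds:
        num *= f
        den *= a
    return num, den

def to_probability(odds):
    """Convert odds f:a to the probability f / (f + a)."""
    f, a = odds
    return Fraction(f, f + a)

prior = (1, 6)    # one spam e-mail per six ham e-mails
viagra = (4, 1)   # 'Viagra' occurs four times as often in spam as in ham

posterior = combine(prior, viagra)                   # 1:6 times 4:1 = 4:6
print(posterior, float(to_probability(posterior)))   # (4, 6) 0.4
```

Because the scheme just multiplies pairwise, adding a further piece of evidence is one more argument to `combine`.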
For instance, suppose that the odds in favour of spam associated with the phrase ‘blue pill’ are estimated at 3:1, and suppose our e-mail contains both ‘Viagra’ and ‘blue pill’; then the combined odds are 4:1 times 3:1, which is 12:1 – ample to outweigh the 1:6 odds associated with the low prevalence of spam (total odds are 2:1, or a spam probability of 0.67, up from 0.40 without the ‘blue pill’).

A rule-based classifier

if the e-mail contains the word ‘Viagra’ then estimate the odds of spam as 4:1;
otherwise, if it contains the phrase ‘blue pill’ then estimate the odds of spam as 3:1;
otherwise, estimate the odds of spam as 1:6.

The first rule covers all e-mails containing the word ‘Viagra’, regardless of whether they contain the phrase ‘blue pill’, so no overcounting occurs. The second rule only covers e-mails containing the phrase ‘blue pill’ but not the word ‘Viagra’, by virtue of the ‘otherwise’ clause. The third rule covers all remaining e-mails: those that contain neither ‘Viagra’ nor ‘blue pill’.

Figure 3, p.11: How machine learning helps to solve a task

[Figure: domain objects are described by features to give data; a learning algorithm solves the learning problem (blue box) by turning training data into a model; the model maps data to output, addressing the task (red box).]

An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).

Important point to remember: Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
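The three-rule classifier from the earlier slide translates directly into a first-match if/elif/else chain (a sketch; the e-mail is represented simply as its text, and the function name is ours):

```python
def spam_odds(text):
    """Estimate the odds of spam as a (for, against) pair, applying the
    first matching rule only - so no evidence is counted twice."""
    if 'Viagra' in text:
        return (4, 1)
    elif 'blue pill' in text:
        return (3, 1)
    else:
        return (1, 6)

print(spam_odds('Cheap Viagra and blue pill'))  # (4, 1): first rule fires
print(spam_odds('Try the blue pill'))           # (3, 1)
print(spam_odds('Meeting at noon'))             # (1, 6)
```

Note the contrast with the Bayesian scheme: an e-mail containing both indicators gets odds 4:1 from the first matching rule, not the 12:1 obtained by multiplying.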
Important point to remember: Machine learning is concerned with using the right features to build the right models that achieve the right tasks.