Probabilistic Classification

Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Probabilistic Classification – Idea

Imagine that I look out of a window and see a bird: it is black, approx. 25 cm long, and has a rather yellow beak. My daughter asks: What kind of bird is this? My usual answer: It is probably a kind of blackbird (kos černý in Czech).

Here "probably" means that out of my extensive catalogue of four kinds of birds that I am able to recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say instead that the largest proportion of birds with similar features I have ever seen were blackbirds.

The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory

A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
Experiment: roll one die once. Sample space: Ω = {1, . . . , 6}.

Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
    f(ω) ∈ [0, 1] for all ω ∈ Ω,
    Σ_{ω∈Ω} f(ω) = 1.
If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, . . . , 6}.

An event is any subset E of Ω. The probability of a given event E ⊆ Ω is defined as
    P(E) = Σ_{ω∈E} f(ω)
Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.

Basic laws: P(Ω) = 1, P(∅) = 0, for disjoint sets A, B we have P(A ∪ B) = P(A) + P(B), and P(Ω \ A) = 1 − P(A).

Conditional Probability and Independence

P(A | B) is the probability of A given B (assuming P(B) > 0), defined by
    P(A | B) = P(A ∩ B) / P(B)
(We assume that B is all and only the information known.)

A fair die: what is the probability that 3 is rolled, assuming that an odd number is rolled? ... and assuming that an even number is rolled?

The law of total probability: let A be an event and B_1, . . . , B_n pairwise disjoint events such that Ω = B_1 ∪ · · · ∪ B_n. Then
    P(A) = Σ_{i=1}^n P(A ∩ B_i) = Σ_{i=1}^n P(A | B_i) · P(B_i)

A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A and B are independent iff P(A | B) = P(A).

Random Variables

A random variable X is a function X : Ω → R.
A die: X : {1, . . . , 6} → {0, 1} such that X(n) = n mod 2.
A probability mass function (pmf) of X is the function p defined by
    p(x) := P(X = x)
Often P(X) is used to denote the pmf of X.

Random Vectors

A random vector is a function X : Ω → R^d. We write X = (X_1, . . . , X_d), where X_i is the random variable returning the i-th component of X.
The joint probability mass function of X is
    p_X(x_1, . . . , x_d) := P(X_1 = x_1 ∧ · · · ∧ X_d = x_d)
I.e., p_X gives the probability of every combination of values.
Often, P(X_1, . . . , X_d) denotes the joint pmf of X_1, . . . , X_d. That is, P(X_1, . . . , X_d) stands for the probabilities P(X_1 = x_1 ∧ · · · ∧ X_d = x_d) for all possible combinations of x_1, . . . , x_d.
The probability mass function p_{X_i} of each X_i is called the marginal probability mass function. We have
    p_{X_i}(x_i) = P(X_i = x_i) = Σ_{(x_1,...,x_{i−1},x_{i+1},...,x_d)} p_X(x_1, . . . , x_d)
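Before moving on to an example with random vectors, here is a minimal Python sketch (not part of the original lecture; the function names are mine) that encodes the fair-die pmf from the definitions above and answers the conditional-probability question posed earlier:

```python
from fractions import Fraction

# pmf of a fair die: every outcome has probability 1/6
pmf = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    """P(E) = sum of f(w) over all outcomes w in the event E."""
    return sum(pmf[w] for w in event)

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B), assuming P(B) > 0."""
    return prob(a & b) / prob(b)

odd, even = {1, 3, 5}, {2, 4, 6}
print(prob(odd))               # 1/2
print(cond_prob({3}, odd))     # 1/3
print(cond_prob({3}, even))    # 0
```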
Random Vectors – Example

Let Ω be a space of colored geometric shapes that are divided into two categories (positive and negative). Assume a random vector X = (X_color, X_shape, X_cat) where
    X_color : Ω → {red, blue},
    X_shape : Ω → {circle, square},
    X_cat : Ω → {pos, neg}.
The joint pmf is given by the following tables:

    positive:      circle    square
        red         0.20      0.02
        blue        0.02      0.01

    negative:      circle    square
        red         0.05      0.30
        blue        0.20      0.20

The probability of every possible event can be calculated by summing the appropriate probabilities:
    P(red ∧ circle) = P(X_color = red ∧ X_shape = circle)
                    = P(red ∧ circle ∧ positive) + P(red ∧ circle ∧ negative)
                    = 0.2 + 0.05 = 0.25
    P(red) = 0.2 + 0.02 + 0.05 + 0.3 = 0.57
Thus all conditional probabilities can be computed as well:
    P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.2 / 0.25 = 0.8

Conditional Probability Mass Functions

We often have to deal with the pmf of a random vector X conditioned on the values of a random vector Y, i.e., we are interested in P(X = x | Y = y) for all x and y.
We write P(X | Y) to denote the pmf of X conditioned on Y. Technically, P(X | Y) is a function which takes a possible value x of X and a possible value y of Y and returns P(X = x | Y = y).
This allows us to say, e.g., that two variables X_1 and X_2 are independent conditioned on Y by writing
    P(X_1, X_2 | Y) = P(X_1 | Y) · P(X_2 | Y)
Technically this means that for all possible values x_1 of X_1, all possible values x_2 of X_2, and all possible values y of Y we have
    P(X_1 = x_1 ∧ X_2 = x_2 | Y = y) = P(X_1 = x_1 | Y = y) · P(X_2 = x_2 | Y = y)

Bayesian Classification

Let Ω be a sample space (a universe) of all objects that can be classified. We assume a probability P on Ω. A training set will be a subset of Ω randomly sampled according to P.
Let Y be the random variable for the category, taking values in {y_1, . . . , y_m}. Let X be the random vector describing the n features of a given instance, i.e., X = (X_1, . . . , X_n). Denote by x_k the possible values of X, and by x_ij the possible values of X_i.
The Bayes classifier: given a vector of feature values x_k,
    C_Bayes(x_k) := arg max_{i∈{1,...,m}} P(Y = y_i | X = x_k)
Intuitively, C_Bayes assigns x_k to the most probable category it might be in.

Bayesian Classification – Example

Imagine a conveyor belt with apples and apricots. A machine is supposed to correctly distinguish apples from apricots based on their weight and diameter. That is, Y takes values in {apple, apricot} and X = (X_weight, X_diam).
Assume that we are given a fruit that weighs 40 g and has a 5 cm diameter. The Bayes classifier compares
    P(Y = apple | X = (40 g, 5 cm))    with    P(Y = apricot | X = (40 g, 5 cm))
and selects the more probable category given the features.

Optimality of the Bayes Classifier

Let C be an arbitrary classifier, that is, a function that assigns to every x_k a class out of {y_1, . . . , y_m}. Slightly abusing notation, we use C to also denote the random variable which assigns a category to every instance. (Technically this is the composition C ◦ X, which first determines the features using X and then classifies according to C.)
Define the error of the classifier C by
    E_C = P(Y ≠ C)

Theorem
The Bayes classifier C_Bayes minimizes E_C, that is,
    E_{C_Bayes} = min_C E_C, where the minimum is taken over all classifiers C.

Proof.
    E_C = Σ_{i=1}^m P(Y = y_i ∧ C ≠ y_i)
        = 1 − Σ_{i=1}^m P(Y = y_i ∧ C = y_i)
        = 1 − Σ_{i=1}^m Σ_{x_k} P(Y = y_i ∧ C = y_i | X = x_k) · P(X = x_k)
        = 1 − Σ_{x_k} P(X = x_k) · Σ_{i=1}^m P(Y = y_i ∧ C = y_i | X = x_k)
        = 1 − Σ_{x_k} P(X = x_k) · P(Y = C(x_k) | X = x_k)
(Here the last equality follows from the fact that C is determined by x_k.)
Choosing C(x_k) = C_Bayes(x_k) = arg max_{i∈{1,...,m}} P(Y = y_i | X = x_k) maximizes each term P(Y = C(x_k) | X = x_k) and thus minimizes E_C.
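When the feature space is tiny and the full joint pmf is available, the Bayes classifier can be computed directly. A minimal Python sketch, assuming the colored-shapes tables from the example above (the dictionary encoding and function names are mine, not from the lecture):

```python
# Joint pmf from the colored-shapes example, keyed by (color, shape, category).
joint = {
    ('red',  'circle', 'pos'): 0.20, ('red',  'square', 'pos'): 0.02,
    ('blue', 'circle', 'pos'): 0.02, ('blue', 'square', 'pos'): 0.01,
    ('red',  'circle', 'neg'): 0.05, ('red',  'square', 'neg'): 0.30,
    ('blue', 'circle', 'neg'): 0.20, ('blue', 'square', 'neg'): 0.20,
}

def posterior(category, color, shape):
    """P(category | color ∧ shape), computed directly from the joint pmf."""
    evidence = sum(p for (c, s, _), p in joint.items() if (c, s) == (color, shape))
    return joint[(color, shape, category)] / evidence

def bayes_classify(color, shape):
    """C_Bayes picks the category with the highest posterior probability."""
    return max(('pos', 'neg'), key=lambda cat: posterior(cat, color, shape))

print(posterior('pos', 'red', 'circle'))   # ≈ 0.8
print(bayes_classify('red', 'circle'))     # 'pos'
```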
Practical Use of the Bayes Classifier

The crucial problem: how can we compute P(Y = y_i | X = x_k)?
Given no other assumptions, this requires a table giving the probability of each category for each possible vector of feature values, which is impossible to estimate accurately from a reasonably sized training set.
Concretely, if Y and all X_1, . . . , X_n are binary, we need 2^n numbers to specify P(Y = 0 | X = x_k) for each possible x_k. (Note that we do not need to specify P(Y = 1 | X = x_k) = 1 − P(Y = 0 | X = x_k).) This is a bit better than the 2^{n+1} − 1 entries needed to specify the complete joint pmf P(Y, X_1, . . . , X_n), but it is still too large for most classification problems.

Let's Look at It the Other Way Round

Theorem (Bayes, 1764)
    P(A | B) = P(B | A) · P(A) / P(B)
Proof. Assuming P(A), P(B) > 0,
    P(A | B) = P(A ∩ B) / P(B) = [P(A ∩ B) / P(A)] · [P(A) / P(B)] = P(B | A) · P(A) / P(B)

Bayesian Classification

Determine the category for x_k by finding y_i maximizing
    P(Y = y_i | X = x_k) = P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
So in order to build the classifier we need to compute:
- the prior P(Y = y_i) for every y_i,
- the conditionals P(X = x_k | Y = y_i) for every x_k and y_i.

Estimating the Prior and Conditionals

P(Y = y_i) can easily be estimated from data: given a set of p training examples of which n_i are in the category y_i, we set
    P̄(Y = y_i) = n_i / p
If the dimension of the features is small, P(X = x_k | Y = y_i) can be estimated from data in the same way. Unfortunately, for higher-dimensional data too many examples are needed to estimate all P(X = x_k | Y = y_i) (there are too many x_k's).
So where is the advantage of using Bayes' theorem? We introduce independence assumptions about the features!

Generative Probabilistic Models

Assume a simple (usually unrealistic) probabilistic method by which the data was generated. For classification, assume that each category y_i has a different parametrized generative model for P(X = x_k | Y = y_i).
Maximum Likelihood Estimation (MLE): set the parameters so as to maximize the probability that the model produced the given training data.
More concretely: if M_λ denotes a model with parameter values λ, and D_k is the training data for the k-th category, find the model parameters λ_k for category k that maximize the likelihood of D_k:
    λ_k = arg max_λ P(D_k | M_λ)

Generative Probabilistic Models – Simple Example

First, let us illustrate the generative model on a simple example. Consider two binary features
    X_color : Ω → {red, blue},
    X_shape : Ω → {circle, square},
and two classes {pos, neg}. There are 2^3 = 8 possible combinations of features and classes.
We assume that for each category, the features are distributed independently:
    P(X_color, X_shape | pos) = P(X_color | pos) · P(X_shape | pos)
    P(X_color, X_shape | neg) = P(X_color | neg) · P(X_shape | neg)
So we have to estimate four numbers (parameters): P(red | pos), P(circle | pos), P(red | neg), P(circle | neg). (As opposed to six if we want to specify the joint conditional pmfs completely.)

Given p training examples, assume that p_+ of them are positive and p_− of them are negative, and that the color is red in +_red of the positive examples, the shape is circle in +_circle of the positive examples, the color is red in −_red of the negative examples, and the shape is circle in −_circle of the negative examples. Then the MLE estimate P̄ of P is
    P̄(red | pos) = +_red / p_+        P̄(circle | pos) = +_circle / p_+
    P̄(red | neg) = −_red / p_−        P̄(circle | neg) = −_circle / p_−
Now, e.g., P̄(red ∧ circle | neg) = (−_red / p_−) · (−_circle / p_−).
Note that if in reality the features are dependent, then the joint pmf cannot be obtained by such an estimate!
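The count-based MLE estimates above are straightforward to compute. A minimal Python sketch, assuming a tiny made-up training set of (color, shape, class) triples (the data and helper names are illustrative, not from the lecture):

```python
# Hypothetical training data: (color, shape, class) triples, made up for illustration.
data = [
    ('red', 'circle', 'pos'), ('red', 'circle', 'pos'), ('blue', 'circle', 'pos'),
    ('red', 'square', 'neg'), ('blue', 'square', 'neg'), ('red', 'circle', 'neg'),
]

def mle(feature_index, value, cls):
    """MLE estimate of P(feature = value | class) = count(value, class) / count(class)."""
    in_class = [row for row in data if row[2] == cls]
    return sum(row[feature_index] == value for row in in_class) / len(in_class)

p_red_pos    = mle(0, 'red', 'pos')      # = 2/3
p_circle_pos = mle(1, 'circle', 'pos')   # = 3/3 = 1.0
p_red_neg    = mle(0, 'red', 'neg')      # = 2/3
p_circle_neg = mle(1, 'circle', 'neg')   # = 1/3

# Under the per-class independence assumption:
p_red_circle_neg = p_red_neg * p_circle_neg   # P̄(red ∧ circle | neg) = 2/9
```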
Naive Bayes

We assume that the features of an instance are (conditionally) independent given the category:
    P(X | Y) = P(X_1, . . . , X_n | Y) = Π_{i=1}^n P(X_i | Y)
Therefore, we only need to specify P(X_i | Y), that is, P(X_i = x_ij | Y = y_k) for each possible pair of a feature value x_ij and a class y_k.
Note that if Y and all X_i are binary (values in {0, 1}), this requires specifying only 2n parameters:
    P(X_i = 1 | Y = 1) and P(X_i = 1 | Y = 0) for each X_i,
since P(X_i = 0 | Y) = 1 − P(X_i = 1 | Y). Compare this with the 2^n parameters needed without any independence assumptions.

Naive Bayes – Example

                      positive   negative
    P(Y)                0.5        0.5
    P(small | Y)        0.4        0.4
    P(medium | Y)       0.1        0.2
    P(large | Y)        0.5        0.4
    P(red | Y)          0.9        0.3
    P(blue | Y)         0.05       0.3
    P(green | Y)        0.05       0.4
    P(square | Y)       0.05       0.4
    P(triangle | Y)     0.05       0.3
    P(circle | Y)       0.9        0.3

Is (medium, red, circle) positive? The relevant rows are:

                      positive   negative
    P(Y)                0.5        0.5
    P(medium | Y)       0.1        0.2
    P(red | Y)          0.9        0.3
    P(circle | Y)       0.9        0.3

Denote x_k = (medium, red, circle). Then
    P(pos | X = x_k) = P(pos) · P(medium | pos) · P(red | pos) · P(circle | pos) / P(X = x_k)
                     = 0.5 · 0.1 · 0.9 · 0.9 / P(X = x_k) = 0.0405 / P(X = x_k)
    P(neg | X = x_k) = P(neg) · P(medium | neg) · P(red | neg) · P(circle | neg) / P(X = x_k)
                     = 0.5 · 0.2 · 0.3 · 0.3 / P(X = x_k) = 0.009 / P(X = x_k)
Since 0.0405 / P(X = x_k) > 0.009 / P(X = x_k), we have P(pos | X = x_k) > P(neg | X = x_k), so we classify x_k as positive.

Estimating Probabilities (In General)

Normally, probabilities are estimated from the observed frequencies in the training data (see the previous example). Suppose we have n_k training examples in class y_k, and n_ijk of these n_k examples have the value of X_i equal to x_ij. Then we put
    P̄(X_i = x_ij | Y = y_k) = n_ijk / n_k
A problem: if, by chance, a rare value x_ij of a feature X_i never occurs in the training data, we get
    P̄(X_i = x_ij | Y = y_k) = 0 for all k ∈ {1, . . . , m}.
But then P̄(X = x_k) = 0 for every x_k containing the value x_ij for X_i, and thus P̄(Y = y_k | X = x_k) is not well defined. Moreover, P̄(Y = y_k) · P̄(X = x_k | Y = y_k) = 0 (for all y_k), so even this cannot be used for classification.

Probability Estimation Example

Training data:

    Size    Color   Shape      Class
    small   red     circle     pos
    large   red     circle     pos
    small   red     triangle   neg
    large   blue    circle     neg

Learned probabilities:

                       positive   negative
    P̄(Y)                0.5        0.5
    P̄(small | Y)        0.5        0.5
    P̄(medium | Y)       0          0
    P̄(large | Y)        0.5        0.5
    P̄(red | Y)          1          0.5
    P̄(blue | Y)         0          0.5
    P̄(green | Y)        0          0
    P̄(square | Y)       0          0
    P̄(triangle | Y)     0          0.5
    P̄(circle | Y)       1          0.5

Note that P̄(medium ∧ red ∧ circle) = 0. So what is P̄(pos | medium ∧ red ∧ circle)?

Smoothing

To account for estimation from small samples, probability estimates are adjusted, or smoothed. Laplace smoothing using an m-estimate works as if each feature value had been observed with a prior probability p in a virtual sample of size m (here m is taken to be the number of classes). We get
    P̄(X_i = x_ij | Y = y_k) = (n_ijk + m·p) / (n_k + m)
(Recall that n_k is the number of training examples of class y_k, and n_ijk is the number of training examples of class y_k for which the i-th feature X_i has the value x_ij.)

Laplace Smoothing Example

Assume the training set contains 10 positive examples:
    4 small, 0 medium, 6 large.
Estimate the parameters as follows (with m = 2 and p = 1/3):
    P̄(small | positive) = (4 + 2/3) / (10 + 2) = 0.389
    P̄(medium | positive) = (0 + 2/3) / (10 + 2) = 0.056
    P̄(large | positive) = (6 + 2/3) / (10 + 2) = 0.556
(We get P̄(small ∨ medium ∨ large | positive) = 0.389 + 0.056 + 0.556 = 1, exactly 1 before rounding.)
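A minimal Python sketch of the m-estimate above, reproducing the numbers of this example (the function and variable names are mine):

```python
def m_estimate(n_ijk, n_k, p=1/3, m=2):
    """Smoothed estimate P̄(X_i = x_ij | Y = y_k) = (n_ijk + m*p) / (n_k + m)."""
    return (n_ijk + m * p) / (n_k + m)

counts = {'small': 4, 'medium': 0, 'large': 6}   # 10 positive examples in total
n_pos = sum(counts.values())

smoothed = {size: m_estimate(n, n_pos) for size, n in counts.items()}
print(smoothed)                 # {'small': ≈0.389, 'medium': ≈0.056, 'large': ≈0.556}
print(sum(smoothed.values()))   # ≈ 1 (the smoothed estimates still sum to one)
```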
Continuous Features

Ω may be (potentially) continuous, and X_i may take a continuum of values in R. Probabilities are then computed using a probability density instead of a pmf.
A random variable X : Ω → R has a density p : R → [0, ∞) if for every interval [a, b] we have
    P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Usually, P(X_i | Y = y_k) is used to denote the density of X_i conditioned on Y = y_k. The densities P(X_i | Y = y_k) are usually estimated using Gaussian densities as follows: estimate the mean µ_ik and the standard deviation σ_ik from the training data, then put
    P̄(X_i = x | Y = y_k) = (1 / (σ_ik √(2π))) · exp( −(x − µ_ik)² / (2σ_ik²) )
(A short sketch of this estimate is given at the end of this text.)

Comments on Naive Bayes

- Tends to work well despite the rather strong assumption of conditional independence of the features; experiments show it to be quite competitive with other classification methods.
- Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
- Directly constructs a hypothesis from parameter estimates calculated from the training data; consistency with the training data is not guaranteed.
- Typically handles noise well.
- Missing values are easy to deal with: simply sum over all possible values of the missing features (the corresponding factors then drop out of the product).

Bayes Classifier vs MAP vs MLE

Recall that the Bayes classifier chooses the category as follows:
    C_Bayes(x_k) = arg max_{i∈{1,...,m}} P(Y = y_i | X = x_k)
                 = arg max_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
As the denominator P(X = x_k) does not depend on i, the Bayes classifier is equivalent to the Maximum A Posteriori (MAP) rule:
    C_MAP(x_k) = arg max_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i)
If we do not care about the prior (or assume it uniform), we may use the Maximum Likelihood Estimate (MLE) rule:
    C_MLE(x_k) = arg max_{i∈{1,...,m}} P(X = x_k | Y = y_i)
(Intuitively, we maximize the probability that the data x_k have been generated by the model of category y_i.)

Bayesian Networks (Basic Information)

In naive Bayes we have assumed that all features X_1, . . . , X_n are independent given the category. This is usually not realistic; e.g., the variables "rain" and "grass wet" are (usually) strongly dependent. What if we put some dependencies back, but now in a well-defined sense?
Bayesian networks are graphical models that use a directed acyclic graph to specify dependencies among variables.

Bayesian Networks – Example

Consider a network over five binary variables C, S, W, B, A (shown as a graph in the original slides) in which C and S have no parents, W has the single parent C, B has the parents C and S, and A has the single parent B. Then
    P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now we may, e.g., infer the probability P(C = T | A = T) that we sit in a bad chair given that our back aches. We have to store only 10 numbers, as opposed to 2^5 − 1 = 31 if the whole joint pmf were stored.

Bayesian Networks – Learning & Naive Bayes

Many algorithms have been developed for learning
- the structure of the graph of the network,
- the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express naive Bayes for Y, X_1, . . . , X_n using a Bayesian network?
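As referenced in the Continuous Features part above, here is a minimal Python sketch of the Gaussian estimate of P̄(X_i | Y = y_k); the sample of apple weights is made up for illustration and the function names are mine:

```python
import math

def fit_gaussian(values):
    """Estimate the mean and standard deviation of a feature within one class."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

def gaussian_density(x, mu, sigma):
    """P̄(X_i = x | Y = y_k) modelled by a Gaussian density with parameters mu, sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical weights (in grams) of the "apple" class from the conveyor-belt example.
apple_weights = [150.0, 170.0, 160.0, 155.0, 165.0]
mu, sigma = fit_gaussian(apple_weights)
print(gaussian_density(158.0, mu, sigma))   # density of weight 158 g under the apple model
```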