Bayesian Learning: Naïve Bayes
Original slides: Raymond J. Mooney, University of Texas at Austin

Axioms of Probability Theory
• All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
• The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by: P(A | B) = P(A ∧ B) / P(B)

Independence
• A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B)
• Therefore, if A and B are independent: P(A ∧ B) = P(A) · P(B)

Joint Distribution
• The joint probability distribution for a set of random variables X_1,…,X_n gives the probability of every combination of values: P(X_1,…,X_n). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
• The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution.
• Therefore, all conditional probabilities can also be calculated.

Probabilistic Classification
• Let Y be the random variable for the class, which takes values {y_1, y_2,…, y_m}.
• Let X be the random variable describing an instance, consisting of a vector of values for n features X_1,…,X_n; let x_k be a possible value for X and x_ij a possible value for X_i.
• For classification, we need to compute P(Y=y_i | X=x_k) for i = 1…m.
• However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to estimate accurately from a reasonably sized training set.
  – Assuming Y and all X_i are binary, we need 2^n entries to specify P(Y=pos | X=x_k) for each of the 2^n possible x_k's, since P(Y=neg | X=x_k) = 1 - P(Y=pos | X=x_k).
  – Compare this to the 2^(n+1) - 1 entries needed for the full joint distribution P(Y, X_1, X_2,…, X_n).

Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
• Simple proof from the definition of conditional probability:
  P(H | E) = P(H ∧ E) / P(E) and P(E | H) = P(H ∧ E) / P(H),
  so P(H ∧ E) = P(E | H) P(H), and therefore P(H | E) = P(E | H) P(H) / P(E).

Bayesian Categorization
• Determine the category of x_k by determining, for each y_i:
  P(Y=y_i | X=x_k) = P(Y=y_i) P(X=x_k | Y=y_i) / P(X=x_k)
• P(X=x_k) can be determined since the categories are complete and disjoint:
  P(X=x_k) = Σ_i P(Y=y_i) P(X=x_k | Y=y_i)

Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(Y=y_i)
  – Conditionals: P(X=x_k | Y=y_i)
• P(Y=y_i) are easily estimated from data.
  – If n_i of the examples in D are in category y_i, then P(Y=y_i) = n_i / |D|.
• There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=x_k | Y=y_i).
• We still need to make some sort of independence assumption about the features to make learning tractable.

Generative Probabilistic Models
• Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
• For categorization, each category has a different parameterized generative model that characterizes that category.
• Training: Use the data for each category to estimate the parameters of the generative model for that category.
  – Maximum Likelihood Estimation (MLE): Set the parameters to maximize the probability that the model produced the given training data.
  – If M_λ denotes a model with parameter values λ and D_k is the training data for the kth class, find the model parameters for class k (λ_k) that maximize the likelihood of D_k:
    λ_k = argmax_λ P(D_k | M_λ)
    (see the code sketch after this slide)
• Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.
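As a concrete illustration of the MLE step (not part of the original slides), the sketch below fits a per-class Bernoulli model to binary features: for each class it estimates the prior P(Y=y_k) as a class fraction and each P(X_i=true | Y=y_k) as the observed fraction of class-k examples with that feature true. The function name mle_bernoulli_params and the tiny data set are hypothetical, chosen only for the example.

```python
import numpy as np

def mle_bernoulli_params(X, y):
    """MLE for a per-class Bernoulli generative model.

    X: (num_examples, num_features) array of 0/1 feature values.
    y: (num_examples,) array of class labels.
    Returns {class_label: (prior, per-feature P(X_i = true | class))}.
    """
    params = {}
    n_total = len(y)
    for k in np.unique(y):
        X_k = X[y == k]                 # D_k: the training data for class k
        prior = len(X_k) / n_total      # MLE of P(Y = y_k)
        theta = X_k.mean(axis=0)        # MLE of P(X_i = true | Y = y_k)
        params[k] = (prior, theta)
    return params

# Hypothetical data set: 4 examples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["pos", "pos", "neg", "neg"])
print(mle_bernoulli_params(X, y))
```

These per-class Bernoulli parameters are exactly the quantities the naïve Bayes model below needs, which is why MLE training reduces to counting.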
Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category (conditionally independent):
  P(X | Y) = P(X_1, X_2,…, X_n | Y) = ∏_i P(X_i | Y)
• Therefore, we then only need to know P(X_i | Y) for each possible pair of a feature value and a category.
• If Y and all X_i are binary, this requires specifying only 2n parameters:
  – P(X_i=true | Y=true) and P(X_i=true | Y=false) for each X_i
  – P(X_i=false | Y) = 1 - P(X_i=true | Y)
• Compare this to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Example

Estimating Probabilities
• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains n_k examples in category y_k, and n_ijk of these n_k examples have the jth value x_ij for feature X_i, then:
  P(X_i=x_ij | Y=y_k) = n_ijk / n_k
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature X_i is always false in the training data, then ∀y_k: P(X_i=true | Y=y_k) = 0.
• If X_i=true then occurs in a test example X, the result is that ∀y_k: P(X | Y=y_k) = 0, and therefore ∀y_k: P(Y=y_k | X) = 0.

Probability Estimation Example

Smoothing
• To account for estimation from small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature value is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:
  P(X_i=x_ij | Y=y_k) = (n_ijk + m·p) / (n_k + m)
• For binary features, p is simply assumed to be 0.5.

Laplace Smoothing Example
• Assume the training set contains 10 positive examples:
  – 4: small
  – 0: medium
  – 6: large
• Estimate the parameters as follows (with m = 1, p = 1/3; see the code sketch at the end of these notes):
  – P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
  – P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
  – P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
  – P(small ∨ medium ∨ large | positive) = 1.0

Continuous Attributes
• If X_i is a continuous feature rather than a discrete one, we need another way to calculate P(X_i | Y).
• Assume that X_i has a Gaussian distribution whose mean and variance depend on Y.
• During training, for each combination of a continuous feature X_i and a class value y_k for Y, estimate a mean μ_ik and a standard deviation σ_ik based on the values of feature X_i in class y_k in the training data.
• During testing, estimate P(X_i | Y=y_k) for a given example using the Gaussian distribution defined by μ_ik and σ_ik (also sketched at the end of these notes):
  P(X_i=x | Y=y_k) = (1 / (σ_ik √(2π))) · exp(-(x - μ_ik)² / (2σ_ik²))

Comments on Naïve Bayes
• Tends to work well despite the strong assumption of conditional independence.
• Experiments show it to be quite competitive with other classification methods on standard UCI datasets.
• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
  – Able to learn conjunctive concepts in any case.
• Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
  – Strong bias
• Not guaranteed to be consistent with the training data.
• Typically handles noise well, since it does not even focus on completely fitting the training data.
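To make the m-estimate concrete, here is a minimal sketch (not from the slides) of Laplace-smoothed parameter estimation for discrete features, reproducing the numbers from the Laplace smoothing example above (4 small, 0 medium, 6 large out of 10 positives). The helper names laplace_estimate and nb_log_score and the tiny data set are assumptions made for illustration.

```python
from collections import Counter
import math

def laplace_estimate(count, total, num_values, m=1.0):
    """m-estimate of P(value | class): (n_ijk + m*p) / (n_k + m), with uniform prior p = 1/num_values."""
    p = 1.0 / num_values
    return (count + m * p) / (total + m)

# The slide's example: 10 positive examples, with sizes 4 small, 0 medium, 6 large.
sizes = ["small"] * 4 + ["large"] * 6
counts = Counter(sizes)
for value in ["small", "medium", "large"]:
    print(value, round(laplace_estimate(counts[value], 10, 3), 3))
# -> small 0.394, medium 0.03, large 0.576

def nb_log_score(example, prior, cond):
    """Naïve Bayes score log P(Y=y) + sum_i log P(X_i = x_i | Y=y).

    example: sequence of feature values; cond[i] maps feature i's value to its smoothed probability.
    """
    score = math.log(prior)
    for i, x_i in enumerate(example):
        score += math.log(cond[i][x_i])
    return score
```

Working in log space, as nb_log_score does, is a standard way to avoid floating-point underflow when many small conditional probabilities are multiplied together.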
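Similarly, a short sketch of the Gaussian conditional from the Continuous Attributes slide: fit μ_ik and σ_ik on the class-k values of a feature, then evaluate the density at a test value. The feature values and the name gaussian_likelihood are hypothetical, and the returned value is a probability density, so it can exceed 1.

```python
import math
import statistics

def gaussian_likelihood(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as P(X_i = x | Y = y_k)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Training: estimate mu_ik and sigma_ik from the class-k values of one continuous feature.
values = [5.1, 4.9, 5.4, 5.0, 5.2]   # hypothetical feature values for one class
mu = statistics.mean(values)
sigma = statistics.stdev(values)      # sample standard deviation

# Testing: plug a test value into the fitted Gaussian.
print(gaussian_likelihood(5.3, mu, sigma))
```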