Bayesian Learning: Naïve Bayes
Original slides: Raymond J. Mooney, University of Texas at Austin

Axioms of Probability Theory
• All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
• The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by: P(A | B) = P(A ∧ B) / P(B)

Independence
• A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B)
• Therefore, if A and B are independent: P(A ∧ B) = P(A) · P(B)

Joint Distribution
• The joint probability distribution for a set of random variables X_1,…,X_n gives the probability of every combination of values: P(X_1,…,X_n). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
• The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution.
• Therefore, all conditional probabilities can also be calculated.

Probabilistic Classification
• Let Y be the random variable for the class, which takes values {y_1, y_2,…, y_m}.
• Let X be the random variable describing an instance, consisting of a vector of values for n features X_1,…,X_n; let x_k be a possible value for X and x_ij a possible value for X_i.
• For classification, we need to compute P(Y=y_i | X=x_k) for i = 1…m.
• However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to estimate accurately from a reasonably sized training set.
  – Assuming Y and all X_i are binary, we need 2^n entries to specify P(Y=pos | X=x_k) for each of the 2^n possible x_k's, since P(Y=neg | X=x_k) = 1 - P(Y=pos | X=x_k).
  – Compare this to the 2^(n+1) - 1 entries needed for the full joint distribution P(Y, X_1, X_2,…, X_n).

Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
• Simple proof from the definition of conditional probability:
  P(H | E) = P(H ∧ E) / P(E) and P(E | H) = P(H ∧ E) / P(H),
  so P(H ∧ E) = P(E | H) P(H), and therefore P(H | E) = P(E | H) P(H) / P(E).

Bayesian Categorization
• Determine the category of x_k by determining, for each y_i:
  P(Y=y_i | X=x_k) = P(Y=y_i) P(X=x_k | Y=y_i) / P(X=x_k)
• P(X=x_k) can be determined since the categories are complete and disjoint:
  P(X=x_k) = Σ_i P(Y=y_i) P(X=x_k | Y=y_i)

Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(Y=y_i)
  – Conditionals: P(X=x_k | Y=y_i)
• P(Y=y_i) are easily estimated from data.
  – If n_i of the examples in D are in category y_i, then P(Y=y_i) = n_i / |D|.
• There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=x_k | Y=y_i).
• We still need to make some sort of independence assumption about the features to make learning tractable.

Generative Probabilistic Models
• Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
• For categorization, each category has a different parameterized generative model that characterizes that category.
• Training: Use the data for each category to estimate the parameters of the generative model for that category.
  – Maximum Likelihood Estimation (MLE): Set the parameters to maximize the probability that the model produced the given training data.
  – If M_λ denotes a model with parameter values λ and D_k is the training data for the kth class, find the model parameters for class k (λ_k) that maximize the likelihood of D_k:
    λ_k = argmax_λ P(D_k | M_λ)
    (see the code sketch after this slide)
• Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.
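As a concrete illustration of the MLE step (not part of the original slides), the sketch below fits a per-class Bernoulli model to binary features: for each class it estimates the prior P(Y=y_k) as a class fraction and each P(X_i=true | Y=y_k) as the observed fraction of class-k examples with that feature true. The function name mle_bernoulli_params and the tiny data set are hypothetical, chosen only for the example.

```python
import numpy as np

def mle_bernoulli_params(X, y):
    """MLE for a per-class Bernoulli generative model.

    X: (num_examples, num_features) array of 0/1 feature values.
    y: (num_examples,) array of class labels.
    Returns {class_label: (prior, per-feature P(X_i = true | class))}.
    """
    params = {}
    n_total = len(y)
    for k in np.unique(y):
        X_k = X[y == k]                 # D_k: the training data for class k
        prior = len(X_k) / n_total      # MLE of P(Y = y_k)
        theta = X_k.mean(axis=0)        # MLE of P(X_i = true | Y = y_k)
        params[k] = (prior, theta)
    return params

# Hypothetical data set: 4 examples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["pos", "pos", "neg", "neg"])
print(mle_bernoulli_params(X, y))
```

These per-class Bernoulli parameters are exactly the quantities the naïve Bayes model below needs, which is why MLE training reduces to counting.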
Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category (conditionally independent):
  P(X | Y) = P(X_1, X_2,…, X_n | Y) = ∏_i P(X_i | Y)
• Therefore, we then only need to know P(X_i | Y) for each possible pair of a feature value and a category.
• If Y and all X_i are binary, this requires specifying only 2n parameters:
  – P(X_i=true | Y=true) and P(X_i=true | Y=false) for each X_i
  – P(X_i=false | Y) = 1 - P(X_i=true | Y)
• Compare this to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Example

Estimating Probabilities
• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains n_k examples in category y_k, and n_ijk of these n_k examples have the jth value x_ij for feature X_i, then:
  P(X_i=x_ij | Y=y_k) = n_ijk / n_k
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature X_i is always false in the training data, then ∀y_k: P(X_i=true | Y=y_k) = 0.
• If X_i=true then occurs in a test example X, the result is that ∀y_k: P(X | Y=y_k) = 0, and therefore ∀y_k: P(Y=y_k | X) = 0.

Probability Estimation Example

Smoothing
• To account for estimation from small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature value is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:
  P(X_i=x_ij | Y=y_k) = (n_ijk + m·p) / (n_k + m)
• For binary features, p is simply assumed to be 0.5.

Laplace Smoothing Example
• Assume the training set contains 10 positive examples:
  – 4: small
  – 0: medium
  – 6: large
• Estimate the parameters as follows (with m = 1, p = 1/3; see the code sketch at the end of these notes):
  – P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
  – P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
  – P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
  – P(small ∨ medium ∨ large | positive) = 1.0

Continuous Attributes
• If X_i is a continuous feature rather than a discrete one, we need another way to calculate P(X_i | Y).
• Assume that X_i has a Gaussian distribution whose mean and variance depend on Y.
• During training, for each combination of a continuous feature X_i and a class value y_k for Y, estimate a mean μ_ik and a standard deviation σ_ik based on the values of feature X_i in class y_k in the training data.
• During testing, estimate P(X_i | Y=y_k) for a given example using the Gaussian distribution defined by μ_ik and σ_ik (also sketched at the end of these notes):
  P(X_i=x | Y=y_k) = (1 / (σ_ik √(2π))) · exp(-(x - μ_ik)² / (2σ_ik²))

Comments on Naïve Bayes
• Tends to work well despite the strong assumption of conditional independence.
• Experiments show it to be quite competitive with other classification methods on standard UCI datasets.
• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
  – Able to learn conjunctive concepts in any case.
• Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
  – Strong bias
• Not guaranteed to be consistent with the training data.
• Typically handles noise well, since it does not even focus on completely fitting the training data.
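To make the m-estimate concrete, here is a minimal sketch (not from the slides) of Laplace-smoothed parameter estimation for discrete features, reproducing the numbers from the Laplace smoothing example above (4 small, 0 medium, 6 large out of 10 positives). The helper names laplace_estimate and nb_log_score and the tiny data set are assumptions made for illustration.

```python
from collections import Counter
import math

def laplace_estimate(count, total, num_values, m=1.0):
    """m-estimate of P(value | class): (n_ijk + m*p) / (n_k + m), with uniform prior p = 1/num_values."""
    p = 1.0 / num_values
    return (count + m * p) / (total + m)

# The slide's example: 10 positive examples, with sizes 4 small, 0 medium, 6 large.
sizes = ["small"] * 4 + ["large"] * 6
counts = Counter(sizes)
for value in ["small", "medium", "large"]:
    print(value, round(laplace_estimate(counts[value], 10, 3), 3))
# -> small 0.394, medium 0.03, large 0.576

def nb_log_score(example, prior, cond):
    """Naïve Bayes score log P(Y=y) + sum_i log P(X_i = x_i | Y=y).

    example: sequence of feature values; cond[i] maps feature i's value to its smoothed probability.
    """
    score = math.log(prior)
    for i, x_i in enumerate(example):
        score += math.log(cond[i][x_i])
    return score
```

Working in log space, as nb_log_score does, is a standard way to avoid floating-point underflow when many small conditional probabilities are multiplied together.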
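Similarly, a short sketch of the Gaussian conditional from the Continuous Attributes slide: fit μ_ik and σ_ik on the class-k values of a feature, then evaluate the density at a test value. The feature values and the name gaussian_likelihood are hypothetical, and the returned value is a probability density, so it can exceed 1.

```python
import math
import statistics

def gaussian_likelihood(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as P(X_i = x | Y = y_k)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Training: estimate mu_ik and sigma_ik from the class-k values of one continuous feature.
values = [5.1, 4.9, 5.4, 5.0, 5.2]   # hypothetical feature values for one class
mu = statistics.mean(values)
sigma = statistics.stdev(values)      # sample standard deviation

# Testing: plug a test value into the fitted Gaussian.
print(gaussian_likelihood(5.3, mu, sigma))
```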