PA196: Pattern Recognition
06. Support vector machines
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Basic statistical learning theory
  Introduction
  VC bounds on generalization error
  The VC dimension
2 Support vector machines
  Linear SVMs
  Nonlinear SVM
  The VC dimension of SVM
  Posterior probabilities for SVM
3 Other kernel methods

Basic statistical learning theory: Introduction

SLT (statistical learning theory)
• views the problem of "learning" from a statistical perspective
• aim (as for any theory): model some phenomena so that we can make predictions about them
• other equally valid theories exist: Bayesian inference, inductive inference, statistical physics, "traditional" statistical analysis, etc.
• some assumptions need to be made, and these assumptions determine which approach is more suitable in a given context

In SLT:
• we assume the data is generated by some underlying (unknown) distribution P(x, y)
• a sample of n i.i.d. observations drawn from P is available to the learner: S = {(x_i, y_i) ∈ R^d × {±1} | i = 1, ..., n}
• a learning algorithm A chooses a function f = A_F(S) from a function space F as a result of training on S
• generalization error (expected error): ε(S, A, F) = E_{(x,y)}[l(A_F(S), x, y)], where l is a loss function
• we are interested not only in E_S[ε(S, A, F)], but also in the distribution of ε(S, A, F)
• classifier consistency: lim_{n→∞} E_S[ε(S, A, F)] = ε_Bayes, where ε_Bayes is the Bayes risk
• the distribution of ε(S, A, F) depends on the algorithm, on F and on n
• classical statistics: investigates mostly the mean of the distribution of ε
• SLT: looks also at the tails; derives probabilistic bounds on the generalization error
• hence PAC: probably approximately correct - bound the probability of being "deceived" and set it equal to some δ

What is the probability of being deceived by a "bad" function f? That is, what is the probability of having perfect training error while the true error is above some ε?

P_S{Err_S(f) = 0, Err(f) > ε} = (1 − Err(f))^n ≤ (1 − ε)^n ≤ exp(−εn)

Taking ε = (1/n) ln(1/δ) leads to

P_S{Err_S(f) = 0, Err(f) > (1/n) ln(1/δ)} ≤ δ
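As a quick illustration of this single-function bound, here is a minimal simulation sketch (not part of the slides; the values of n, ε and the number of trials are arbitrary choices). For a fixed classifier with true error ε, it estimates the probability of observing zero training error on n i.i.d. points and compares it with (1 − ε)^n ≤ exp(−εn).

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_zero_training_error(true_err, n, n_trials=100_000):
    """Monte Carlo estimate of P{Err_S(f) = 0} for a fixed classifier
    whose true error is `true_err`, over samples of size n."""
    # each training point is misclassified independently with probability true_err
    errors = rng.random((n_trials, n)) < true_err
    return np.mean(errors.sum(axis=1) == 0)

n, eps = 100, 0.05
empirical = prob_zero_training_error(eps, n)
exact = (1 - eps) ** n
bound = np.exp(-eps * n)
print(f"simulated P(zero training error) = {empirical:.4f}")
print(f"(1 - eps)^n                      = {exact:.4f}")
print(f"exp(-eps * n)                    = {bound:.4f}")
```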
Now consider a (countable) set of functions F = {f_1, ..., f_k, ...} and require that the probability of being misled by f_k be less than q_k δ (with Σ_k q_k ≤ 1). Then

P_S{∃ f_k : Err_S(f_k) = 0, Err(f_k) > (1/n) ln(1/(q_k δ))} ≤ δ

Theorem
Given a countable set of functions F and weights q_k with Σ_k q_k ≤ 1, with probability at least 1 − δ over random samples of size n, the generalization error of a function f_k ∈ F with zero training error is bounded by

Err(f_k) ≤ (1/n) (ln(1/q_k) + ln(1/δ))

Notes:
• ln(1/q_k) can be thought of as a "complexity" (description length) of the function f_k

Basic statistical learning theory: VC bounds on generalization error

• use the 0-1 loss: (1/2)|y_i − f(x_i, α)| ∈ {0, 1}
• the expected error (expected risk or actual risk) is R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y)
• the empirical risk is measured over an observed set (here of size n): R_emp(α) = (1/(2n)) Σ_{i=1}^n |y_i − f(x_i, α)|
• for such losses, the following bound holds (Vapnik, 1995): for η ∈ [0, 1], with probability 1 − η,

R(α) ≤ R_emp(α) + sqrt( [h(log(2n/h) + 1) − log(η/4)] / n )

• h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension and is a measure of the capacity of the set of functions f
• the second term on the right-hand side (the square root) is called the VC confidence
• notes: the bound is independent of P(x, y); if we knew h, we could compute the right-hand side

Basic statistical learning theory: The VC dimension

• the VC dimension is a characteristic of the set of functions F = {f(x, α)}
• we restrict the analysis to functions taking values in {±1}
• n points can be labeled in 2^n distinct ways
• if for any labeling of the set of points a function f(x, α) can be found in F that realizes it, then we say that F shatters the set of points
• the VC dimension (h) of F is the maximum number of points that can be shattered by F
• if the VC dimension of F is h, it means that there exists at least one set of h points that can be shattered, not that all such sets can be shattered

Shattering points with oriented hyperplanes in R^d
The VC dimension of the set of oriented hyperplanes in R^d is d + 1.

Notes:
• h does not depend on the number of parameters a family of functions has
• of two machines with null empirical risk, the one with the smaller h has better generalization guarantees
• a k-NN classifier with k = 1 has h = ∞ and null empirical risk → the bound becomes useless
• h depends on the class of functions F, while R and R_emp depend on the particular function selected by the learning machine

Structural risk minimization
[figure: nested subsets of functions with VC dimensions h_1 < h_2 < h_3 < ... < h_k]
• we introduce a structure over the set of functions, such that h_1 < h_2 < ... < h_k < ...
• idea: find the subset of functions which minimizes the empirical risk, while controlling the complexity

Support vector machines: Linear SVMs

(Reminder) [figure: linear SVM margin geometry, with labels 2/‖w‖, −w_0/‖w‖, w and ξ/‖w‖]

minimize_{w, w_0, ξ}  (1/2)⟨w, w⟩ + Ω(ξ)
subject to  y_i(⟨w, x_i⟩ + w_0) ≥ 1 − ξ_i,  i = 1, ..., n
            ξ_i ≥ 0,  i = 1, ..., n
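To connect the primal formulation above with practice, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data and the value C = 1.0 are arbitrary choices) that fits a linear soft-margin SVM and reads off w, w_0 and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy two-class data in R^2
X = np.vstack([rng.normal([-2, 0], 1.0, (20, 2)),
               rng.normal([+2, 0], 1.0, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # weight vector w
w0 = clf.intercept_[0]    # offset w_0
sv = clf.support_         # indices of the support vectors (alpha_i > 0)

print("w  =", w)
print("w0 =", w0)
print("number of support vectors:", len(sv))
# the discriminant h(x) = <w, x> + w_0 agrees with the library's score
assert np.allclose(X @ w + w0, clf.decision_function(X))
```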
• Ω(ξ) = C Σ_i ξ_i^p; p = 1 → 1-norm (L1) soft margin SVM, p = 2 → 2-norm (L2) soft margin SVM
• w_0 can be computed from w_0 = y_i − ⟨w, x_i⟩, and a more stable solution is obtained by averaging over all support vectors (SVs): w_0 = (1/|SV|) Σ_{i∈SV} (y_i − ⟨w, x_i⟩)

L1 SVM
Dual optimization problem (from the KKT conditions):

maximize_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
subject to  Σ_{i=1}^n y_i α_i = 0
            0 ≤ α_i ≤ C (box constraints),  i = 1, ..., n

Notes:
• if α_i = 0 then ξ_i = 0 and it follows that x_i is correctly classified
• if 0 < α_i < C then y_i(⟨w, x_i⟩ + w_0) − 1 + ξ_i = 0 and ξ_i = 0, meaning that x_i is an unbounded support vector
• if α_i = C then y_i(⟨w, x_i⟩ + w_0) − 1 + ξ_i = 0 and ξ_i > 0, meaning that x_i is a bounded support vector; moreover, if ξ_i < 1 then x_i is correctly classified, otherwise it is misclassified
• w_0 is obtained as before, but averaging only over the unbounded SVs
• the discriminant function is h(x) = Σ_{i∈SV} α_i y_i ⟨x_i, x⟩ + w_0: predict y = +1 if h(x) > 0 and y = −1 if h(x) < 0

L2 SVM
For convenience, we take Ω(ξ) = (C/2) Σ_i ξ_i^2, which leads to the dual optimization problem

maximize_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j ( ⟨x_i, x_j⟩ + δ_ij/C )
subject to  Σ_{i=1}^n y_i α_i = 0
            α_i ≥ 0,  i = 1, ..., n

where δ_ij is Kronecker's delta.

Notes:
• w_0 is computed by averaging over terms of the form y_i − Σ_{j=1}^n α_j y_j ( ⟨x_i, x_j⟩ + δ_ij/C )
• the decision function remains the same

Support vector machines: Nonlinear SVM

The kernel trick
• the SVM problem was formulated in terms of inner products
• let Φ : R^d → H be a mapping from the input space into a feature space, and suppose that there exists a "kernel function" such that K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
• H may be infinite-dimensional, e.g. K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
• if we replace ⟨x_i, x_j⟩ with K(x_i, x_j) in the linear SVM, we obtain a nonlinear SVM!

[figure: the mapping Φ(x) from the input space into the feature space]

Discriminant function: h(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + w_0

Which functions can be used as kernels?
For some kernels, it is easy to find the corresponding mapping Φ: for example, K(x_i, x_j) = ⟨x_i, x_j⟩² corresponds to Φ : R² → R³, Φ(x) = (x_1², √2 x_1 x_2, x_2²)ᵀ.
In general, for a given kernel there may exist several possible mappings Φ.
[figure from Burges: A tutorial on support vector machines for pattern recognition]

(Theoretical conditions for kernels) Mercer's condition
There exist a mapping Φ and an expansion K(x, z) = Σ_i Φ(x)_i Φ(z)_i if and only if, for any g(x) such that ∫ g(x)² dx < ∞,

∫∫ K(x, z) g(x) g(z) dx dz ≥ 0

• if Mercer's condition is not satisfied, there might exist cases for which the optimization problem has no solution
• the space generated by the kernel is called a Reproducing Kernel Hilbert Space (RKHS)
• kernel matrix (Gram matrix): K_ij = K(x_i, x_j); Hessian matrix: H_ij = y_i y_j K(x_i, x_j)
• K is positive semi-definite
• in the L2 SVM, the diagonal of K is augmented by 1/C, thus potentially transforming K into a positive definite matrix
• all the information about the data is concentrated in K
• K can be seen as defining a similarity between samples

Commonly used kernels:
• linear kernel: K(x, z) = ⟨x, z⟩
• polynomial kernel: K(x, z) = (⟨x, z⟩ + 1)^p
• radial basis function (RBF) kernel: K(x, z) = exp( −‖x − z‖² / (2σ²) )
• sigmoid kernel: K(x, z) = tanh(κ⟨x, z⟩ − δ); this kernel does not always satisfy Mercer's condition!
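As a quick numerical check of the kernel trick (a sketch, not from the slides; the sample points are arbitrary), the following compares K(x, z) = ⟨x, z⟩² with the explicit feature map Φ(x) = (x_1², √2 x_1 x_2, x_2²)ᵀ given above, and verifies that the resulting Gram matrix is positive semi-definite, as Mercer's condition requires.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k_quad(x, z):
    """Quadratic kernel K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# the kernel value equals the inner product in the feature space
for x in X:
    for z in X:
        assert np.isclose(k_quad(x, z), np.dot(phi(x), phi(z)))

# the Gram matrix is symmetric positive semi-definite
K = np.array([[k_quad(x, z) for z in X] for x in X])
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue of K:", eigvals.min())  # >= 0 up to round-off
```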
Kernels - closure properties
If K_1 and K_2 are kernels, a ∈ R_+, f a real-valued function, φ : R^d → R^m, and B a symmetric positive semi-definite d × d matrix, then the following are also kernels:
• K_1(x, z) + K_2(x, z)
• aK_1(x, z)
• K_1(x, z) K_2(x, z)
• K(x, z) = f(x) f(z)
• K_1(φ(x), φ(z))
• K(x, z) = xᵀ B z

The solution of the optimization problem is:
• global: any local solution of a convex optimization problem is also a global solution
• unique: if the Hessian matrix is positive definite, the solution is guaranteed to be unique

In case the solution is not unique:
• it is still global!
• if w_1 and w_2 are solutions, then there exists a path w(τ) = τ w_1 + (1 − τ) w_2, with 0 ≤ τ ≤ 1, such that w(τ) is also a solution

Support vector machines: The VC dimension of SVM

• for a Mercer kernel K, the VC dimension of the SVM is dim(H) + 1
• the VC dimension of the RKHS generated by the polynomial kernel is the binomial coefficient (d+p−1 choose p), where p is the degree of the polynomial
• the VC dimension in the case of an RBF kernel is infinite

How come SVMs can have very good generalization performance, even in the case of an infinite VC dimension?
Hint: it has to do with the large margin...

Another bound on the generalization error:

E[P(error)] ≤ E[number of SVs] / n

where E[number of SVs] is the expected number of support vectors over all training sets of size n

Support vector machines: Posterior probabilities for SVM

Platt scaling
Idea: apply a logistic transformation to the classifier score (margin):

P(y = +1 | x) = 1 / (1 + exp(αh(x) + β))

The parameters α and β are found by optimization (typically maximum likelihood on a separate calibration set).

Some remarks
• SVMs have a good overall performance on a large number of problems, but they are not the "Swiss army knife" of pattern recognition
• one key ingredient: choosing the right kernel
• another key ingredient: choosing the right formulation of the problem
• in general, there are a number of parameters (e.g. C and p or σ) that need to be tuned
• C can be used to re-balance the classes: replace the single C by class-specific penalties C_+ and C_−, assigning a different weight to each class
• there exist extensions such as support vector regression and support vector density estimation

Other kernel methods

• why not replace the inner product with kernels in other methods as well?
• apply the same reasoning in the case of regression...
• this leads to Kernel LDA, Kernel PCA, the Kernel Perceptron, etc.

Kernel LDA (Mika et al., Fisher Discriminant Analysis with Kernels, 1999)
Fisher criterion: w* = arg max_w (wᵀ S_b w) / (wᵀ S_w w)
Suppose now that this is carried out in the feature space: the means and scatter matrices are computed on the transformed data.
(Sketch) This can still be expressed in terms of operations in the input space. Let µ_Φ = (1/n) Σ_i Φ(x_i) be the mean in the feature space (each class has a similar mean). The weight vector has the form w = Σ_i α_i Φ(x_i). Then the product ⟨w, µ_Φ⟩ takes the form

⟨w, µ_Φ⟩ = (1/n) Σ_{i,j} α_j K(x_i, x_j)
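To make this last step concrete, here is a small sketch (not from the slides; the data, the α coefficients and the reuse of the quadratic kernel from earlier are arbitrary choices) verifying that ⟨w, µ_Φ⟩ can be computed purely from the kernel matrix, without ever forming Φ explicitly.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = <x, z>^2 on R^2 (used only for checking)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    return np.dot(x, z) ** 2

rng = np.random.default_rng(1)
n = 6
X = rng.normal(size=(n, 2))
alpha = rng.normal(size=n)           # coefficients defining w = sum_i alpha_i Phi(x_i)

# explicit computation in the feature space
Phi = np.array([phi(x) for x in X])
w = Phi.T @ alpha                    # w as a vector in the feature space
mu = Phi.mean(axis=0)                # mu_Phi = (1/n) sum_i Phi(x_i)
explicit = np.dot(w, mu)

# kernel-only computation: <w, mu_Phi> = (1/n) sum_{i,j} alpha_j K(x_i, x_j)
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
kernel_only = (alpha @ K).sum() / n

assert np.isclose(explicit, kernel_only)
print(explicit, kernel_only)
```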