Kernel Methods & SVM
Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to Linear Classifier (Slightly Modified)
A linear classifier $h[w]$ is determined by a vector of weights $w = (w_0, w_1, \dots, w_n) \in \mathbb{R}^{n+1}$ as follows: given $x = (x_1, \dots, x_n) \in X \subseteq \mathbb{R}^n$,
$h[w](x) := 1$ if $w_0 + \sum_{i=1}^n w_i \cdot x_i \ge 0$, and $h[w](x) := -1$ if $w_0 + \sum_{i=1}^n w_i \cdot x_i < 0$.
For convenience, we use the values $\{-1, 1\}$ instead of $\{0, 1\}$. Note that this is not a fundamental modification; the linear classifier works exactly as the original one.
Recall that given $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the augmented feature vector is $\tilde{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:
$h[w](x) = \mathrm{sig}(w \cdot \tilde{x})$, where $\mathrm{sig}(y) = 1$ if $y \ge 0$ and $\mathrm{sig}(y) = -1$ if $y < 0$.

Perceptron Learning Revisited
Given a training set $D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$, where $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$. We write $y_k$ instead of $y(x_k)$. Note that $\tilde{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn})$ where $x_{k0} = 1$.
A weight vector $w \in \mathbb{R}^{n+1}$ is consistent with $D$ if $h[w](x_k) = \mathrm{sig}(w \cdot \tilde{x}_k) = y_k$ for all $k = 1, \dots, p$.
$D$ is linearly separable if there is a vector $w \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Perceptron Learning Revisited
The perceptron learning algorithm (slightly modified): consider the training examples cyclically and compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$
$w^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. (This is a slight but harmless modification of the traditional algorithm.)
In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \tilde{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \tilde{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.
(Note that this algorithm corresponds to perceptron learning with learning rate $\varepsilon = 1$.)
We know: if $D$ is linearly separable, then there is $t^*$ such that $w^{(t^*)}$ is consistent with $D$. But what can we do if $D$ is not linearly separable?

Quadratic Decision Boundary
(Figure.) Left: the original data set. Right: the same data transformed using the squares of the features; the green line is the decision boundary learned by the perceptron algorithm (the red boundary corresponds to another learning algorithm). In the left picture, the green ellipse maps exactly to the green line.
How to classify (in the original space): first transform a given feature vector by squaring its features, then use the linear classifier.

Do We Need to Map Explicitly?
In general, mapping to a (much) higher-dimensional feature space helps: there are more "degrees of freedom", so linear separability gets a chance. However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.
To avoid the explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited
The perceptron learning algorithm once more: compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$, with $w^{(0)}$ initialized to $\mathbf{0} = (0, \dots, 0)$.
In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \tilde{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \tilde{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.
Crucial observation: $w^{(t)} = \sum_{\ell=1}^p n_\ell^{(t)} \cdot y_\ell \cdot \tilde{x}_\ell$ for suitable $n_1^{(t)}, \dots, n_p^{(t)} \in \mathbb{N}$. Intuitively, $n_\ell^{(t)}$ counts how many times $\tilde{x}_\ell$ was added to (if $y_\ell = 1$), or subtracted from (if $y_\ell = -1$), the weights.

Dual Perceptron Learning
The dual perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$, where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$. $n^{(0)}$ is initialized to $(0, \dots, 0)$.
In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\left(\sum_{\ell=1}^p n_\ell^{(t)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_k\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$; else $n_k^{(t+1)} := n_k^{(t)}$. In both cases $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.
If $D$ is linearly separable, there exists $t^*$ such that $\sum_{\ell=1}^p n_\ell^{(t^*)} \cdot y_\ell \cdot \tilde{x}_\ell$ is consistent with $D$. The algorithm stops at such $t^*$ and returns $(n_1^{(t^*)}, \dots, n_p^{(t^*)})$, so that $\sum_{\ell=1}^p n_\ell^{(t^*)} \cdot y_\ell \cdot \tilde{x}_\ell$ is consistent with $D$.
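To make the dual formulation concrete, here is a minimal sketch of dual perceptron learning in Python with numpy. The function names, the epoch cap, and the full-pass stopping test are my own choices, not part of the lecture; the update rule itself is the one above.

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron: learn counts n_l such that
    w = sum_l n_l * y_l * x_aug_l is consistent with the data."""
    p = len(X)
    X_aug = np.hstack([np.ones((p, 1)), X])    # augmented feature vectors (x0 = 1)
    G = X_aug @ X_aug.T                        # all scalar products x_aug_l . x_aug_k
    n = np.zeros(p, dtype=int)                 # n_l = number of updates on example l
    for _ in range(max_epochs):
        changed = False
        for k in range(p):                     # examples considered cyclically
            s = np.sum(n * y * G[:, k])        # sum_l n_l * y_l * (x_l . x_k)
            if (1 if s >= 0 else -1) != y[k]:  # misclassified -> increment n_k
                n[k] += 1
                changed = True
        if not changed:                        # a full pass with no update: consistent
            break
    return n                                   # may be inconsistent if D is not separable

# Training set of the worked example below; the result is n = (0, 1, 1),
# i.e. the weight vector x2_aug - x3_aug.
X = np.array([[2, -1], [2, 1], [1, 3]])
y = np.array([1, 1, -1])
print(dual_perceptron(X, y))
```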
Example
Training set: $D = \{((2,-1), 1), ((2,1), 1), ((1,3), -1)\}$. That is,
$x_1 = (2,-1)$, $x_2 = (2,1)$, $x_3 = (1,3)$,
$\tilde{x}_1 = (1,2,-1)$, $\tilde{x}_2 = (1,2,1)$, $\tilde{x}_3 = (1,1,3)$,
$y_1 = 1$, $y_2 = 1$, $y_3 = -1$.
The initial values are $n_1^{(0)} = n_2^{(0)} = n_3^{(0)} = 0$.
$\sum_{\ell=1}^3 n_\ell^{(0)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = 0$, thus $\mathrm{sig}\left(\sum_{\ell=1}^3 n_\ell^{(0)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1\right) = 1 = y_1$. Hence $n^{(1)} = (0, 0, 0)$.
$\sum_{\ell=1}^3 n_\ell^{(1)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_2$. Hence $n^{(2)} = (0, 0, 0)$.
$\sum_{\ell=1}^3 n_\ell^{(2)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_3 = 0$, thus $\mathrm{sig}(\cdot) = 1 \ne y_3$. Hence $n^{(3)} = (0, 0, 1)$.
$\sum_{\ell=1}^3 n_\ell^{(3)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = -1 \cdot \tilde{x}_3 \cdot \tilde{x}_1 = -1 \cdot (1,1,3) \cdot (1,2,-1) = -1 \cdot 0 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(4)} = (0, 0, 1)$.
$\sum_{\ell=1}^3 n_\ell^{(4)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = -1 \cdot \tilde{x}_3 \cdot \tilde{x}_2 = -1 \cdot (1,1,3) \cdot (1,2,1) = -1 \cdot 6 = -6$, thus $\mathrm{sig}(\cdot) = -1 \ne y_2$. Hence $n^{(5)} = (0, 1, 1)$.
$\sum_{\ell=1}^3 n_\ell^{(5)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_3 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_3 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_3 = -5$, thus $n^{(6)} = (0, 1, 1)$. This is OK.
$\sum_{\ell=1}^3 n_\ell^{(6)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_1 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_1 = 4$, thus $n^{(7)} = (0, 1, 1)$. This is OK.
$\sum_{\ell=1}^3 n_\ell^{(7)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_2 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_2 = 0$, thus $n^{(8)} = (0, 1, 1)$. This is OK.
The result: $\tilde{x}_2 - \tilde{x}_3$.

Dual Perceptron Learning – Output
Let $w = \sum_{\ell=1}^p n_\ell \cdot y_\ell \cdot \tilde{x}_\ell$ result from the dual perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ at which the algorithm found a consistent vector.
This vector of weights determines a linear classifier that, for a given $x \in \mathbb{R}^n$, gives
$h[w](x) = \mathrm{sig}\left(\sum_{\ell=1}^p n_\ell \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}\right)$.
(Here $\tilde{x}$ is the augmented feature vector obtained from $x$.)
Crucial observation: the (augmented) feature vectors $\tilde{x}$ and $\tilde{x}_\ell$ occur only in scalar products!

Kernel Trick
For simplicity, assume bivariate data: $\tilde{x}_k = (1, x_{k1}, x_{k2})$. The corresponding instance in the quadratic feature space is $(1, x_{k1}^2, x_{k2}^2)$.
Consider two instances $\tilde{x}_k = (1, x_{k1}, x_{k2})$ and $\tilde{x}_\ell = (1, x_{\ell 1}, x_{\ell 2})$. The scalar product of their corresponding instances $(1, x_{k1}^2, x_{k2}^2)$ and $(1, x_{\ell 1}^2, x_{\ell 2}^2)$ in the quadratic feature space is
$1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2$,
which resembles (but is not equal to)
$(\tilde{x}_k \cdot \tilde{x}_\ell)^2 = (1 + x_{k1} x_{\ell 1} + x_{k2} x_{\ell 2})^2 = 1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2 + 2 x_{k1} x_{\ell 1} x_{k2} x_{\ell 2} + 2 x_{k1} x_{\ell 1} + 2 x_{k2} x_{\ell 2}$.
But now consider a mapping $\varphi$ to $\mathbb{R}^6$ defined by
$\varphi(\tilde{x}_k) = (1, x_{k1}^2, x_{k2}^2, \sqrt{2} x_{k1} x_{k2}, \sqrt{2} x_{k1}, \sqrt{2} x_{k2})$.
Then $\varphi(\tilde{x}_k) \cdot \varphi(\tilde{x}_\ell) = (\tilde{x}_k \cdot \tilde{x}_\ell)^2$.
The key idea: define a kernel $\kappa(\tilde{x}_k, \tilde{x}_\ell) = (\tilde{x}_k \cdot \tilde{x}_\ell)^2$ and replace $\tilde{x}_k \cdot \tilde{x}_\ell$ in the dual perceptron algorithm with $\kappa(\tilde{x}_k, \tilde{x}_\ell)$.

Kernel Perceptron Learning
The kernel perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$, where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$. $n^{(0)}$ is initialized to $(0, \dots, 0)$.
In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\left(\sum_{\ell=1}^p n_\ell^{(t)} \cdot y_\ell \cdot \kappa(\tilde{x}_k, \tilde{x}_\ell)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$; else $n_k^{(t+1)} := n_k^{(t)}$. In both cases $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, the examples are considered cyclically.
Intuition: with $\kappa(\tilde{x}_k, \tilde{x}_\ell) = (\tilde{x}_k \cdot \tilde{x}_\ell)^2$, the algorithm computes a linear classifier in $\mathbb{R}^6$ for the training examples transformed using $\varphi$. The trick is that the transformation $\varphi$ itself never has to be computed explicitly!
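A minimal sketch of kernel perceptron learning, again in Python/numpy and under the same assumptions as the previous sketch (names and stopping test are mine). The kernel passed in is the quadratic kernel $\kappa(u, v) = (u \cdot v)^2$ from the kernel-trick slide.

```python
import numpy as np

def quadratic_kernel(u, v):
    """kappa(u, v) = (u . v)^2 on augmented vectors; equals phi(u) . phi(v)
    for the explicit 6-dimensional mapping phi shown above."""
    return float(np.dot(u, v)) ** 2

def kernel_perceptron(X, y, kernel=quadratic_kernel, max_epochs=100):
    p = len(X)
    X_aug = np.hstack([np.ones((p, 1)), X])                        # augmented vectors
    K = np.array([[kernel(a, b) for b in X_aug] for a in X_aug])   # kernel matrix
    n = np.zeros(p, dtype=int)
    for _ in range(max_epochs):
        changed = False
        for k in range(p):                        # cyclic pass over the examples
            s = np.sum(n * y * K[:, k])           # sum_l n_l * y_l * kappa(x_l, x_k)
            if (1 if s >= 0 else -1) != y[k]:     # misclassified -> increment n_k
                n[k] += 1
                changed = True
        if not changed:
            break

    def classify(x):
        """The resulting non-linear classifier, usable on new points x."""
        x_aug = np.concatenate(([1.0], x))
        s = sum(n[l] * y[l] * kernel(x_aug, X_aug[l]) for l in range(p))
        return 1 if s >= 0 else -1

    return n, classify
```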
Kernel Perceptron Learning – Output
Let $n = (n_1, \dots, n_p)$ result from the kernel perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ such that $\mathrm{sig}\left(\sum_{\ell=1}^p n_\ell^{(t^*)} \cdot y_\ell \cdot \kappa(\tilde{x}_k, \tilde{x}_\ell)\right) = y_k$ for all $k = 1, \dots, p$.
We obtain a non-linear classifier that, for a given $x \in \mathbb{R}^n$, gives
$h[w](x) = \mathrm{sig}\left(\sum_{\ell=1}^p n_\ell \cdot y_\ell \cdot \kappa(\tilde{x}, \tilde{x}_\ell)\right)$.
(Here $\tilde{x}$ is the augmented feature vector obtained from $x$.)
A note on connections: observe that $h[w](x)$ resembles a function computed by a two-layer neural network, where the top layer consists of a single neuron with the activation function sig, and the lower layer contains $p$ neurons, each computing a function $\kappa(\tilde{x}, \tilde{x}_k)$ with a fixed $\tilde{x}_k$.
Are there other kernels that correspond to the scalar product in higher-dimensional spaces?

Kernels
Given a (potential) kernel $\kappa(x_\ell, x_k)$, we need to check whether $\kappa(x_\ell, x_k) = \varphi(x_\ell) \cdot \varphi(x_k)$ for some function $\varphi$. This may be very difficult.
Theorem (Mercer). $\kappa$ is a kernel if the corresponding Gram matrix $K$ of the training set $D$, whose $\ell k$-th element is $\kappa(x_\ell, x_k)$, is positive semi-definite for all possible choices of the training set $D$.
Kernels can be constructed from existing kernels by several operations: linear combination (i.e. multiplication by a constant, or sum), multiplication, exponentiation, multiplication by a polynomial with non-negative coefficients, ... (see e.g. "Pattern Recognition and Machine Learning" by Bishop).

Examples of Kernels
Linear: $\kappa(x_\ell, x_k) = x_\ell \cdot x_k$. The corresponding mapping $\varphi(x) = x$ is the identity (no transformation).
Polynomial of degree $m$: $\kappa(x_\ell, x_k) = (1 + x_\ell \cdot x_k)^m$. The corresponding mapping assigns to $x \in \mathbb{R}^n$ a vector $\varphi(x)$ in $\mathbb{R}^{\binom{n+m}{m}}$.
Gaussian (radial basis function): $\kappa(x_\ell, x_k) = e^{-\frac{\|x_\ell - x_k\|^2}{2\sigma^2}}$. The corresponding mapping $\varphi$ maps $x$ to an infinite-dimensional vector $\varphi(x)$ which is, in fact, a Gaussian function; a combination of such functions for the support vectors then forms the separating hypersurface.
Choosing kernels remains the black magic of kernel methods. They are usually chosen by trial and error (of course, experience and additional insight into the data help).
Now let us move on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea – Which Linear Classifier is the Best?
The one with the maximum margin.
Benefits of the maximum margin: intuitively, a maximum margin is good with respect to generalization, and only the support vectors (those on the margin) matter; the other examples can, in principle, be ignored.

Support Vector Machines (SVM)
Notation:
$w = (w_0, w_1, \dots, w_n)$ — a vector of weights,
$\bar{w} = (w_1, \dots, w_n)$ — the vector of all weights except $w_0$,
$x = (x_1, \dots, x_n)$ — a (generic) feature vector.
Consider a linear classifier:
$h[w](x) := 1$ if $w_0 + \sum_{i=1}^n w_i \cdot x_i = w_0 + \bar{w} \cdot x \ge 0$, and $h[w](x) := -1$ if $w_0 + \bar{w} \cdot x < 0$.
The signed distance of $x$ from the decision boundary determined by $w$ is
$d[w](x) = \frac{w_0 + \bar{w} \cdot x}{\|\bar{w}\|}$.
Here $\|\bar{w}\| = \sqrt{\sum_{i=1}^n w_i^2}$ is the Euclidean norm of $\bar{w}$.
$|d[w](x)|$ is the distance of $x$ from the decision boundary. $d[w](x)$ is positive for $x$ on the side to which $\bar{w}$ points and negative on the opposite side.
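A tiny numeric sketch of the signed distance, assuming Python/numpy. The boundary $-1 - 2x_1 + 2x_2 = 0$ is the one reused in the margin figure below; the test point is an arbitrary choice of mine.

```python
import numpy as np

def signed_distance(w0, w_bar, x):
    """d[w](x) = (w0 + w_bar . x) / ||w_bar||, positive on the side w_bar points to."""
    return (w0 + np.dot(w_bar, x)) / np.linalg.norm(w_bar)

def classify(w0, w_bar, x):
    return 1 if w0 + np.dot(w_bar, x) >= 0 else -1

# boundary -1 - 2*x1 + 2*x2 = 0, i.e. w0 = -1, w_bar = (-2, 2)
w0, w_bar = -1.0, np.array([-2.0, 2.0])
x = np.array([2.0, 1.0])              # arbitrary test point
print(signed_distance(w0, w_bar, x), classify(w0, w_bar, x))
```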
Support Vectors & Margin
Given a training set $D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$, where $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$; we write $y_k$ instead of $y(x_k)$.
Assume that $D$ is linearly separable and let $w$ be consistent with $D$ such that the distance of the decision boundary from the nearest examples on both sides is the same (if not, it suffices to adjust $w_0$).
Support vectors are those $x_k$ that minimize $|d[w](x_k)|$.
The margin $\rho$ of $w$ is twice the distance between the support vectors and the decision boundary.
Our goal is to find a classifier that maximizes the margin.

Maximizing the Margin
Let $x_\ell$ with $y_\ell = 1$ and $x_k$ with $y_k = -1$ be support vectors, i.e. points of minimum distance to the boundary in their respective classes. Note that the width of the margin $\rho$ satisfies
$\rho = \frac{\bar{w} \cdot (x_\ell - x_k)}{\|\bar{w}\|}$
(see also the picture on the next slide).
Indeed, recall that $\bar{w} \cdot (x_\ell - x_k) = \|\bar{w}\| \, \|x_\ell - x_k\| \cos\alpha$, where $\alpha$ is the angle spanned by $\bar{w}$ and $x_\ell - x_k$. This means that
$\rho = \frac{\bar{w} \cdot (x_\ell - x_k)}{\|\bar{w}\|} = \|x_\ell - x_k\| \cos\alpha$,
which is the length of the projection of $x_\ell - x_k$ onto the line parallel with $\bar{w}$ (and hence perpendicular to the decision boundary). Since $x_\ell$ and $x_k$ have the same minimum distance to the boundary, $\rho$ is exactly the width of the margin.

Maximizing the Margin
(Figure.)
$x_1$ = a support vector with $y_1 = -1$,
$x_2$ = a support vector with $y_2 = 1$,
the blue line = the decision boundary determined by $-1 - 2x_1 + 2x_2 = 0$,
$\rho$ = the length of the projection of $x_2 - x_1$ onto the line parallel with $\bar{w}$
($x_3$ is not a support vector).

Maximizing the Margin
We show that we may safely assume that for all support vectors $x_r$ we have $w_0 + \bar{w} \cdot x_r = y_r \in \{-1, 1\}$.
Indeed, let $\gamma > 0$ and consider a rescaling of $w$ by $\gamma$ (i.e. we rescale both $w_0$ and $\bar{w}$ to $\gamma w_0$ and $\gamma\bar{w}$, respectively). Note that $\mathrm{sig}(w_0 + \bar{w} \cdot x_r) = \mathrm{sig}(\gamma w_0 + \gamma\bar{w} \cdot x_r)$ and that
$\rho = \frac{\bar{w} \cdot (x_\ell - x_k)}{\|\bar{w}\|} = \frac{\gamma\bar{w} \cdot (x_\ell - x_k)}{\gamma\|\bar{w}\|} = \frac{\gamma\bar{w} \cdot (x_\ell - x_k)}{\|\gamma\bar{w}\|}$,
which means that rescaling affects neither the classifier nor the width of the margin.
Note that $|w_0 + \bar{w} \cdot x_r|$ is the same for all support vectors $x_r$, since the distance $|w_0 + \bar{w} \cdot x_r| / \|\bar{w}\|$ of every support vector $x_r$ to the boundary is the same.
Now let $\gamma = \frac{1}{|w_0 + \bar{w} \cdot x_r|}$ for an arbitrary support vector $x_r$; then
$\gamma w_0 + \gamma\bar{w} \cdot x_r = \frac{w_0 + \bar{w} \cdot x_r}{|w_0 + \bar{w} \cdot x_r|} = \mathrm{sig}(w_0 + \bar{w} \cdot x_r) = y_r$.

Maximizing the Margin
So we have $w$ such that for all support vectors $x_r$ we have $w_0 + \bar{w} \cdot x_r = y_r$. Now the width of the margin satisfies
$\rho = \frac{\bar{w} \cdot (x_\ell - x_k)}{\|\bar{w}\|} = \frac{\bar{w} \cdot x_\ell - \bar{w} \cdot x_k}{\|\bar{w}\|} = \frac{(\bar{w} \cdot x_\ell + w_0) - (\bar{w} \cdot x_k + w_0)}{\|\bar{w}\|} = \frac{1 - (-1)}{\|\bar{w}\|} = \frac{2}{\|\bar{w}\|}$.
(Recall that $x_\ell$ is a support vector with $y_\ell = 1$ and $x_k$ is a support vector with $y_k = -1$.)
Thus, to maximize the margin, it suffices to minimize $\|\bar{w}\|$.

SVM – Optimization
Margin maximization can be formulated as a quadratic optimization problem:
Find $w = (w_0, \dots, w_n)$ such that $\rho = \frac{2}{\|\bar{w}\|}$ is maximized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \bar{w} \cdot x_k) \ge 1$.
This can be reformulated as:
Find $w$ such that $\Phi(w) = \|\bar{w}\|^2 = \bar{w} \cdot \bar{w}$ is minimized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \bar{w} \cdot x_k) \ge 1$.

SVM – Optimization
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
The solution usually involves the construction of a dual problem in which a Lagrange multiplier $\alpha_\ell$ is associated with every inequality (constraint) of the original problem:
Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$\Psi(\alpha) = \sum_{\ell=1}^p \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^p \sum_{k=1}^p \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot x_\ell \cdot x_k$
is maximized subject to the constraints
$\sum_{\ell=1}^p \alpha_\ell y_\ell = 0$ and $\alpha_\ell \ge 0$ for all $1 \le \ell \le p$.
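As a rough illustration (not the lecture's algorithm), the dual problem above can be handed to a general-purpose solver; the sketch below uses scipy's SLSQP method under the hard-margin assumptions of this section. Real SVM packages use specialized decomposition methods such as SMO, discussed at the end of this section.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_hard_margin(X, y):
    """Maximize Psi(alpha) s.t. sum_l alpha_l*y_l = 0 and alpha_l >= 0,
    by minimizing -Psi(alpha) with a general-purpose solver (SLSQP)."""
    p = len(X)
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                   # G[l,k] = y_l*y_k*(x_l . x_k)

    def neg_psi(a):
        return -(np.sum(a) - 0.5 * a @ G @ a)

    def neg_psi_grad(a):
        return -(np.ones(p) - G @ a)

    cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_l alpha_l * y_l = 0
    bounds = [(0.0, None)] * p                       # alpha_l >= 0; a soft-margin SVM
                                                     # would add an upper bound C here
    res = minimize(neg_psi, np.zeros(p), jac=neg_psi_grad,
                   bounds=bounds, constraints=cons, method="SLSQP")
    return res.x                                     # the Lagrange multipliers alpha
```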
The Optimization Problem Solution
Given a solution $\alpha_1, \dots, \alpha_p$ of the dual problem, the solution $w = (w_0, w_1, \dots, w_n)$ of the original problem is
$\bar{w} = (w_1, \dots, w_n) = \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot x_\ell$,
$w_0 = y_k - \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x_k$ for an arbitrary $k$ with $\alpha_k > 0$.
Note that $\alpha_k > 0$ iff $x_k$ is a support vector. Hence it does not matter which $\alpha_k > 0$ is chosen in the above definition of $w_0$.
The classifier is then
$h(x) = \mathrm{sig}(w_0 + \bar{w} \cdot x) = \mathrm{sig}\left(y_k - \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x_k + \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x\right)$.
Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products. We may apply the kernel trick!

Kernel SVM
Choose your favourite kernel $\kappa$ and solve the dual problem with kernel $\kappa$:
Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$\Psi(\alpha) = \sum_{\ell=1}^p \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^p \sum_{k=1}^p \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot \kappa(x_\ell, x_k)$
is maximized subject to the constraints
$\sum_{\ell=1}^p \alpha_\ell y_\ell = 0$ and $\alpha_\ell \ge 0$ for all $1 \le \ell \le p$.
Then use the classifier
$h(x) = \mathrm{sig}\left(y_k - \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x_k) + \sum_{\ell=1}^p \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x)\right)$
(for an arbitrary $k$ with $\alpha_k > 0$).
Note that the optimization techniques remain the same as for the linear SVM without kernels!

Comments on Algorithms
The main bottleneck of SVMs is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
For small problems, any general-purpose optimization algorithm can be used. For large problems this is usually not possible, and many methods avoiding the direct solution have been devised. These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively: start with a (smaller) subset of training examples and find an optimal solution using any solver. Afterwards, only the support vectors matter in the solution, so keep only them in the training set and add new training examples. This iterative procedure decreases the (general) cost function.

SVM in Applications (Mooney's lecture)
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the $\alpha_\ell$'s at a time, e.g. SMO [Platt '99] and [Joachims '99].
Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a trial-and-error manner.
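As a practical coda to the "black art" remark, here is a short tuning sketch using scikit-learn; the lecture does not prescribe any library, so this is only an illustration. sklearn's SVC is backed by libsvm, an SMO-style decomposition solver of the kind mentioned above, and its parameter C comes from the soft-margin extension, which goes beyond the hard-margin derivation in this section.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# toy data: two noisy classes (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# grid search over the kernel and its parameters -- the "try and see" part
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10],                 # soft-margin penalty
    "gamma": ["scale", 0.1, 1.0],      # used by the poly/rbf kernels
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# the fitted model exposes the support vectors and the dual coefficients alpha_l * y_l
clf = search.best_estimator_
print(len(clf.support_vectors_), clf.dual_coef_.shape)
```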