Kernel Methods & SVM

Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to Linear Classifier (Slightly Modified)

A linear classifier $h[w]$ is determined by a vector of weights $w = (w_0, w_1, \dots, w_n) \in \mathbb{R}^{n+1}$ as follows: given $x = (x_1, \dots, x_n) \in X \subseteq \mathbb{R}^n$,
$$h[w](x) := \begin{cases} 1 & w_0 + \sum_{i=1}^n w_i \cdot x_i \ge 0 \\ -1 & w_0 + \sum_{i=1}^n w_i \cdot x_i < 0 \end{cases}$$
For convenience, we use the values $\{-1, 1\}$ instead of $\{0, 1\}$. Note that this is not a fundamental modification; the linear classifier works exactly as the original one.

Recall that given $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the augmented feature vector is $\vec{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:
$$h[w](x) = \mathrm{sig}(w \cdot \vec{x}), \qquad \text{where } \mathrm{sig}(y) = \begin{cases} 1 & y \ge 0 \\ -1 & y < 0 \end{cases}$$

Perceptron Learning Revisited

Given a training set
$$D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$$
Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$. We write $y_k$ instead of $y(x_k)$. Note that $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn})$ where $x_{k0} = 1$.

A weight vector $w \in \mathbb{R}^{n+1}$ is consistent with $D$ if
$$h[w](x_k) = \mathrm{sig}(w \cdot \vec{x}_k) = y_k \quad \text{for all } k = 1, \dots, p.$$
$D$ is linearly separable if there is a vector $w \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Perceptron Learning Revisited

Perceptron learning algorithm (slightly modified): consider the training examples cyclically and compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$.

$w^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. (This is a slight but harmless modification of the traditional algorithm.)

In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \vec{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \vec{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically. (Note that this algorithm corresponds to perceptron learning with learning rate $\varepsilon = 1$.)

We know: if $D$ is linearly separable, then there is $t^*$ such that $w^{(t^*)}$ is consistent with $D$. But what can we do if $D$ is not linearly separable?

Quadratic Decision Boundary

[Figure: left, the original data set; right, the same set transformed using the squares of the features. Right: the green line is the decision boundary learned using the perceptron algorithm (the red boundary corresponds to another learning algorithm). Left: the green ellipse maps exactly to the green line.]

How to classify (in the original space): first transform a given feature vector by squaring the features, then use the linear classifier.

Do We Need to Map Explicitly?

In general, mapping to a (much) higher-dimensional feature space helps: there are more "degrees of freedom", so linear separability gets a chance. However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.

To avoid explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited

Perceptron learning algorithm once more: compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$.

$w^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \vec{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \vec{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Crucial observation: note that
$$w^{(t)} = \sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \vec{x}_\ell \qquad \text{for suitable } n_1^{(t)}, \dots, n_p^{(t)} \in \mathbb{N}.$$
Intuitively, $n_\ell^{(t)}$ counts how many times $\vec{x}_\ell$ was added to (if $y_\ell = 1$), or subtracted from (if $y_\ell = -1$), the weights.
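To make the crucial observation concrete, here is a minimal sketch, not from the lecture, that runs the update rule above while also tracking the counts $n_\ell$ and checks that $w = \sum_{\ell} n_\ell \, y_\ell \, \vec{x}_\ell$ holds at the end. It assumes NumPy; the function name perceptron and the max_sweeps cap are my own choices.

```python
import numpy as np

def sig(z):
    # sig(y) = 1 if y >= 0, -1 otherwise (as defined above)
    return 1 if z >= 0 else -1

def perceptron(X, y, max_sweeps=100):
    """X: (p, n) matrix of feature vectors, y: labels in {-1, +1}."""
    p, n = X.shape
    Xa = np.hstack([np.ones((p, 1)), X])   # augmented vectors ~x_k with x_k0 = 1
    w = np.zeros(n + 1)                    # w^(0) = (0, ..., 0)
    counts = np.zeros(p, dtype=int)        # n_l: times ~x_l was added/subtracted
    for _ in range(max_sweeps):            # cap the passes; the loop need not
        changed = False                    # terminate if D is not separable
        for k in range(p):                 # examples considered cyclically
            if sig(w @ Xa[k]) != y[k]:     # misclassified -> update
                w += y[k] * Xa[k]
                counts[k] += 1
                changed = True
        if not changed:                    # w is consistent with D
            break
    assert np.allclose(w, (counts * y) @ Xa)   # w = sum_l n_l * y_l * ~x_l
    return w, counts

# The training set from the worked example that follows: x1=(2,-1), x2=(2,1), x3=(1,3)
w, counts = perceptron(np.array([[2., -1.], [2., 1.], [1., 3.]]),
                       np.array([1, 1, -1]))
```

On this set the run should end with counts = (0, 1, 1), i.e. $w = \vec{x}_2 - \vec{x}_3$, matching the dual-perceptron trace worked out below.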
Dual Perceptron Learning

Dual perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.

$n^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot (\vec{x}_\ell \cdot \vec{x}_k)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$, else $n_k^{(t+1)} := n_k^{(t)}$; and $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

If $D$ is linearly separable, there exists $t^*$ such that $\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \vec{x}_\ell$ is consistent with $D$. The algorithm stops at such a $t^*$ and returns $(n_1^{(t^*)}, \dots, n_p^{(t^*)})$, so that $\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \vec{x}_\ell$ is consistent with $D$.

Example

Training set:
$$D = \{((2,-1), 1),\ ((2,1), 1),\ ((1,3), -1)\}$$
That is,
$x_1 = (2,-1)$, $x_2 = (2,1)$, $x_3 = (1,3)$;
$\vec{x}_1 = (1,2,-1)$, $\vec{x}_2 = (1,2,1)$, $\vec{x}_3 = (1,1,3)$;
$y_1 = 1$, $y_2 = 1$, $y_3 = -1$.
The initial values are $n_1^{(0)} = n_2^{(0)} = n_3^{(0)} = 0$.

$\sum_{\ell=1}^{3} n_\ell^{(0)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(1)} = (0,0,0)$.

$\sum_{\ell=1}^{3} n_\ell^{(1)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_2$. Hence $n^{(2)} = (0,0,0)$.

$\sum_{\ell=1}^{3} n_\ell^{(2)} y_\ell (\vec{x}_\ell \cdot \vec{x}_3) = 0$, thus $\mathrm{sig}(\cdot) = 1 \ne y_3$. Hence $n^{(3)} = (0,0,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(3)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = -1 \cdot (\vec{x}_3 \cdot \vec{x}_1) = -1 \cdot ((1,1,3) \cdot (1,2,-1)) = -1 \cdot 0 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(4)} = (0,0,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(4)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = -1 \cdot (\vec{x}_3 \cdot \vec{x}_2) = -1 \cdot ((1,1,3) \cdot (1,2,1)) = -1 \cdot 6 = -6$, thus $\mathrm{sig}(\cdot) = -1 \ne y_2$. Hence $n^{(5)} = (0,1,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(5)} y_\ell (\vec{x}_\ell \cdot \vec{x}_3) = \vec{x}_2 \cdot \vec{x}_3 - \vec{x}_3 \cdot \vec{x}_3 = -5$, thus $n^{(6)} = (0,1,1)$. This is OK.

$\sum_{\ell=1}^{3} n_\ell^{(6)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = \vec{x}_2 \cdot \vec{x}_1 - \vec{x}_3 \cdot \vec{x}_1 = 4$, thus $n^{(7)} = (0,1,1)$. This is OK.

$\sum_{\ell=1}^{3} n_\ell^{(7)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = \vec{x}_2 \cdot \vec{x}_2 - \vec{x}_3 \cdot \vec{x}_2 = 0$, thus $n^{(8)} = (0,1,1)$. This is OK.

The result: $\vec{x}_2 - \vec{x}_3 = (0, 1, -2)$.

Dual Perceptron Learning – Output

Let $\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \vec{x}_\ell$ result from the dual perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ in which the algorithm found a consistent vector. This vector of weights determines a linear classifier that for a given $x \in \mathbb{R}^n$ gives
$$h[w](x) = \mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot (\vec{x}_\ell \cdot \vec{x})\right)$$
(Here $\vec{x}$ is the augmented feature vector obtained from $x$.)

Crucial observation: the (augmented) feature vectors $\vec{x}_\ell$ and $\vec{x}$ occur only in scalar products!

Kernel Trick

For simplicity, assume bivariate data: $\vec{x}_k = (1, x_{k1}, x_{k2})$. The corresponding instance in the quadratic feature space is $(1, x_{k1}^2, x_{k2}^2)$.

Consider two instances $\vec{x}_k = (1, x_{k1}, x_{k2})$ and $\vec{x}_\ell = (1, x_{\ell 1}, x_{\ell 2})$. Then the scalar product of their corresponding instances $(1, x_{k1}^2, x_{k2}^2)$ and $(1, x_{\ell 1}^2, x_{\ell 2}^2)$, resp., in the quadratic feature space is
$$1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2$$
which resembles (but is not equal to)
$$(\vec{x}_k \cdot \vec{x}_\ell)^2 = (1 + x_{k1} x_{\ell 1} + x_{k2} x_{\ell 2})^2 = 1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2 + 2 x_{k1} x_{\ell 1} x_{k2} x_{\ell 2} + 2 x_{k1} x_{\ell 1} + 2 x_{k2} x_{\ell 2}$$
But now consider the mapping $\varphi$ to $\mathbb{R}^6$ defined by
$$\varphi(\vec{x}_k) = (1,\ x_{k1}^2,\ x_{k2}^2,\ \sqrt{2}\, x_{k1} x_{k2},\ \sqrt{2}\, x_{k1},\ \sqrt{2}\, x_{k2})$$
Then $\varphi(\vec{x}_k) \cdot \varphi(\vec{x}_\ell) = (\vec{x}_k \cdot \vec{x}_\ell)^2$.

THE idea: define a kernel $\kappa(\vec{x}_k, \vec{x}_\ell) = (\vec{x}_k \cdot \vec{x}_\ell)^2$ and replace $\vec{x}_k \cdot \vec{x}_\ell$ in the dual perceptron algorithm with $\kappa(\vec{x}_k, \vec{x}_\ell)$.

Kernel Perceptron Learning

Kernel perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.

$n^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \kappa(\vec{x}_k, \vec{x}_\ell)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$, else $n_k^{(t+1)} := n_k^{(t)}$; and $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Intuition: the algorithm computes a linear classifier in $\mathbb{R}^6$ for the training examples transformed using $\varphi$. The trick is that the transformation $\varphi$ itself never has to be computed explicitly!
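Below is a minimal sketch of the kernel perceptron update, for illustration only. It assumes NumPy; the names kernel_perceptron, quadratic_kernel and max_sweeps are my own. It precomputes the matrix of kernel values and never forms $\varphi$ explicitly.

```python
import numpy as np

def sig(z):
    return 1 if z >= 0 else -1

def quadratic_kernel(xa, xb):
    return (xa @ xb) ** 2                      # kappa(~x_k, ~x_l) = (~x_k . ~x_l)^2

def kernel_perceptron(X, y, kernel=quadratic_kernel, max_sweeps=100):
    """X: (p, n) matrix of feature vectors, y: labels in {-1, +1}."""
    y = np.asarray(y)
    p = X.shape[0]
    Xa = np.hstack([np.ones((p, 1)), X])       # augmented vectors ~x_k
    K = np.array([[kernel(Xa[k], Xa[l]) for l in range(p)] for k in range(p)])
    counts = np.zeros(p, dtype=int)            # n^(0) = (0, ..., 0)
    for _ in range(max_sweeps):                # cap: the loop need not terminate if the
        changed = False                        # transformed data are not separable
        for k in range(p):                     # examples considered cyclically
            if sig(np.sum(counts * y * K[k])) != y[k]:
                counts[k] += 1                 # n_k := n_k + 1
                changed = True
        if not changed:
            break
    def classify(x):                           # h(x) = sig(sum_l n_l y_l kappa(~x, ~x_l))
        xa = np.concatenate([[1.0], np.asarray(x, dtype=float)])
        return sig(sum(c * yl * kernel(xa, xl) for c, yl, xl in zip(counts, y, Xa)))
    return counts, classify
```

The returned closure classify is exactly the non-linear classifier made explicit on the next slide.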
Kernel Perceptron Learning – Output

Let $n = (n_1, \dots, n_p)$ result from the kernel perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ such that
$$\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \kappa(\vec{x}_k, \vec{x}_\ell)\right) = y_k \quad \text{for all } k = 1, \dots, p.$$
We obtain a non-linear classifier that for a given $x \in \mathbb{R}^n$ gives
$$h(x) = \mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \kappa(\vec{x}, \vec{x}_\ell)\right)$$
(Here $\vec{x}$ is the augmented feature vector obtained from $x$.)

Are there other kernels that correspond to scalar products in higher-dimensional spaces?

Kernels

Given a (potential) kernel $\kappa(x_\ell, x_k)$ we need to check whether $\kappa(x_\ell, x_k) = \varphi(x_\ell) \cdot \varphi(x_k)$ for some function $\varphi$. This might be very difficult.

Theorem (Mercer). $\kappa$ is a kernel if the corresponding Gram matrix $K$ of the training set $D$, whose $(\ell, k)$-th element is $\kappa(x_\ell, x_k)$, is positive semi-definite for all possible choices of the training set $D$.

Kernels can be constructed from existing kernels by several operations: linear combination (i.e. multiplication by a constant, or sum), multiplication, exponentiation, multiplication by a polynomial with non-negative coefficients, ... (see e.g. "Pattern Recognition and Machine Learning" by Bishop).

Examples of Kernels

Linear: $\kappa(x_\ell, x_k) = x_\ell \cdot x_k$. The corresponding mapping $\varphi(x) = x$ is the identity (no transformation).

Polynomial of degree $m$: $\kappa(x_\ell, x_k) = (1 + x_\ell \cdot x_k)^m$. The corresponding mapping assigns to $x \in \mathbb{R}^n$ a vector $\varphi(x)$ in $\mathbb{R}^{\binom{n+m}{m}}$.

Gaussian (radial basis function): $\kappa(x_\ell, x_k) = e^{-\frac{\|x_\ell - x_k\|^2}{2\sigma^2}}$. The corresponding mapping $\varphi$ maps $x$ to an infinite-dimensional vector $\varphi(x)$ which is, in fact, a Gaussian function; a combination of such functions for the support vectors then forms the separating hypersurface.

Choosing a kernel remains the black magic of kernel methods; kernels are usually chosen by trial and error (of course, experience and additional insight into the data help).

Now let's move on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea – Which Linear Classifier is the Best?

Benefits of maximum margin: intuitively, a maximum margin is good w.r.t. generalization. Only the support vectors (those on the margin) matter; the others can, in principle, be ignored.

Support Vector Machines (SVM)

Notation:
$w = (w_0, w_1, \dots, w_n)$ — a vector of weights,
$\mathbf{w} = (w_1, \dots, w_n)$ — the vector of all weights except $w_0$,
$x = (x_1, \dots, x_n)$ — a (generic) feature vector.

Consider a linear classifier:
$$h[w](x) := \begin{cases} 1 & w_0 + \sum_{i=1}^n w_i \cdot x_i = w_0 + \mathbf{w} \cdot x \ge 0 \\ -1 & w_0 + \sum_{i=1}^n w_i \cdot x_i = w_0 + \mathbf{w} \cdot x < 0 \end{cases}$$
The signed distance of $x$ from the decision boundary determined by $w$ is
$$d[w](x) = \frac{w_0 + \mathbf{w} \cdot x}{\|\mathbf{w}\|}$$
Here $\|\mathbf{w}\| = \sqrt{\sum_{i=1}^n w_i^2}$ is the Euclidean norm of $\mathbf{w}$.

$|d[w](x)|$ is the distance of $x$ from the decision boundary. $d[w](x)$ is positive for $x$ on the side to which $\mathbf{w}$ points and negative on the opposite side.

Support Vectors & Margin

Given a training set
$$D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$$
Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$; we write $y_k$ instead of $y(x_k)$.

Assume that $D$ is linearly separable, and let $w$ be consistent with $D$ such that the distance of the decision boundary from the nearest examples on both sides is the same (if not, it suffices to adjust $w_0$).

Support vectors are those $x_k$ that minimize $|d[w](x_k)|$.

The margin $\rho$ of $w$ is twice the distance between the support vectors and the decision boundary. Our goal is to find a classifier that maximizes the margin.
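To make these definitions concrete, here is a small sketch (assuming NumPy; the function names and the tie tolerance tol are my own) that, for a weight vector $w$ consistent with $D$, computes the signed distances $d[w](x_k)$, picks out the support vectors, and returns the margin $\rho$.

```python
import numpy as np

def signed_distances(w0, w, X):
    """d[w](x) = (w0 + w.x) / ||w|| for every row x of X, with w = (w1, ..., wn)."""
    return (w0 + X @ w) / np.linalg.norm(w)

def margin_and_support_vectors(w0, w, X, y, tol=1e-9):
    d = signed_distances(w0, w, X)
    # consistent with D and no example exactly on the boundary
    assert np.all(y * d > 0), "w is not consistent with D"
    nearest = np.min(np.abs(d))
    support = np.where(np.abs(d) <= nearest + tol)[0]   # the x_k minimizing |d[w](x_k)|
    rho = 2 * nearest                                    # margin: twice the nearest distance
    return rho, support
```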
Maximizing the Margin

For $w$ consistent with $D$ (such that no $x_k$ lies on the decision boundary) we have
$$\rho = \frac{2 \cdot |w_0 + \mathbf{w} \cdot x_k|}{\|\mathbf{w}\|} = \frac{2 \cdot y_k \cdot (w_0 + \mathbf{w} \cdot x_k)}{\|\mathbf{w}\|} > 0$$
where $x_k$ is a support vector.

We may safely consider only $w$ such that $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$ for the support vectors: just rescale $w$ so that $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$; the denominator $\|\mathbf{w}\|$ will compensate. Then maximizing $\rho$ is equivalent to maximizing $2 / \|\mathbf{w}\|$.

(In what follows we use a slightly looser constraint: $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$ for all $x_k$. However, the result is the same, since even with this looser condition the support vectors always satisfy $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$ whenever $2 / \|\mathbf{w}\|$ is maximal.)

SVM – Optimization

Margin maximization can be formulated as a quadratic optimization problem:

Find $w = (w_0, \dots, w_n)$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$.

This can be reformulated as:

Find $w$ such that $\Phi(w) = \|\mathbf{w}\|^2 = \mathbf{w} \cdot \mathbf{w}$ is minimized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$.

SVM – Optimization

We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.

The solution usually involves constructing a dual problem in which a Lagrange multiplier $\alpha_\ell$ is associated with every inequality (constraint) of the original problem:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot (x_\ell \cdot x_k)$$
is maximized subject to the constraints
$$\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0, \qquad \alpha_\ell \ge 0 \ \text{ for all } 1 \le \ell \le p.$$

The Optimization Problem Solution

Given a solution $\alpha_1, \dots, \alpha_p$ to the dual problem, the solution $w = (w_0, w_1, \dots, w_n)$ of the original one is:
$$\mathbf{w} = (w_1, \dots, w_n) = \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot x_\ell$$
$$w_0 = y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x_k) \qquad \text{for an arbitrary } k \text{ with } \alpha_k > 0.$$
Note that $\alpha_k > 0$ iff $x_k$ is a support vector. Hence it does not matter which $\alpha_k > 0$ is chosen in the above definition of $w_0$.

The classifier is then
$$h(x) = \mathrm{sig}(w_0 + \mathbf{w} \cdot x) = \mathrm{sig}\!\left(y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x_k) + \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x)\right)$$
Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products! We may apply the kernel trick!

Kernel SVM

Choose your favourite kernel $\kappa$. Solve the dual problem with kernel $\kappa$:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot \kappa(x_\ell, x_k)$$
is maximized subject to the constraints
$$\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0, \qquad \alpha_\ell \ge 0 \ \text{ for all } 1 \le \ell \le p.$$
Then use the classifier
$$h(x) = \mathrm{sig}\!\left(y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x_k) + \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x)\right)$$
(with $k$ again chosen so that $\alpha_k > 0$). Note that the optimization techniques remain the same as for the linear SVM without kernels!

Comments on Algorithms

The main bottleneck of SVMs is the complexity of quadratic programming (QP); a naive QP solver has cubic complexity.

For small problems any general-purpose optimization algorithm can be used. For large problems this is usually not feasible, and many methods that avoid solving the full problem directly have been devised. These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively: start with a (smaller) subset of the training examples and find an optimal solution using any solver. Afterwards, only the support vectors matter in the solution, so keep only them in the training set and add new training examples. This iterative procedure decreases the overall cost function.
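For small problems, the dual can indeed be handed to a general-purpose QP solver. Below is a hedged sketch of that route for the hard-margin kernel SVM dual above; it is not the lecture's code, it assumes NumPy and the third-party cvxopt package, and the name train_kernel_svm and the 1e-6 support-vector threshold are my own choices.

```python
import numpy as np
from cvxopt import matrix, solvers   # generic QP: min 1/2 a'Pa + q'a  s.t.  Ga <= h, Aa = b

def train_kernel_svm(X, y, kernel):
    """Hard-margin kernel SVM. X: (p, n) array, y: labels in {-1, +1}.
    The dual QP is bounded only if the data are separable in the kernel's feature space."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(p)] for i in range(p)])
    P = matrix(np.outer(y, y) * K)                   # P_lk = y_l y_k kappa(x_l, x_k)
    q = matrix(-np.ones(p))                          # maximize sum(alpha) = minimize -sum(alpha)
    G = matrix(-np.eye(p)); h = matrix(np.zeros(p))  # -alpha_l <= 0, i.e. alpha_l >= 0
    A = matrix(y.reshape(1, -1)); b = matrix(0.0)    # sum_l alpha_l y_l = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                                # alpha_k > 0  <=>  x_k is a support vector
    k = int(np.argmax(sv))                           # any support vector works for w_0
    w0 = y[k] - np.sum(alpha[sv] * y[sv] * K[sv, k])
    def classify(x):                                 # h(x) = sig(w_0 + sum_l alpha_l y_l kappa(x_l, x))
        x = np.asarray(x, dtype=float)
        s = sum(a * yl * kernel(xl, x) for a, yl, xl in zip(alpha[sv], y[sv], X[sv]))
        return 1 if w0 + s >= 0 else -1
    return classify

# Example call with a Gaussian kernel (sigma = 1):
# clf = train_kernel_svm(X, y, lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))
```

Large training sets would instead use a decomposition method such as SMO, mentioned below.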
SVM in Applications (Mooney's lecture)

SVMs were originally proposed by Boser, Guyon, and Vapnik in 1992 and gained increasing popularity in the late 1990s.

SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.

SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.

SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.

The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the $\alpha_\ell$'s at a time, e.g. SMO [Platt '99] and [Joachims '99].

Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.