Kernel Methods & SVM

Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to Linear Classifier (Slightly Modified)

A linear classifier $h[w]$ is determined by a vector of weights $w = (w_0, w_1, \dots, w_n) \in \mathbb{R}^{n+1}$ as follows: given $x = (x_1, \dots, x_n) \in X \subseteq \mathbb{R}^n$,
$$h[w](x) := \begin{cases} 1 & w_0 + \sum_{i=1}^n w_i \cdot x_i \ge 0 \\ -1 & w_0 + \sum_{i=1}^n w_i \cdot x_i < 0 \end{cases}$$
For convenience, we use the values $\{-1, 1\}$ instead of $\{0, 1\}$. Note that this is not a fundamental modification; the linear classifier works exactly as the original one.

Recall that given $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the augmented feature vector is $\vec{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:
$$h[w](x) = \mathrm{sig}(w \cdot \vec{x}), \qquad \text{where } \mathrm{sig}(y) = \begin{cases} 1 & y \ge 0 \\ -1 & y < 0 \end{cases}$$

Perceptron Learning Revisited

Given a training set
$$D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$$
Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$. We write $y_k$ instead of $y(x_k)$. Note that $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn})$ where $x_{k0} = 1$.

A weight vector $w \in \mathbb{R}^{n+1}$ is consistent with $D$ if
$$h[w](x_k) = \mathrm{sig}(w \cdot \vec{x}_k) = y_k \quad \text{for all } k = 1, \dots, p.$$
$D$ is linearly separable if there is a vector $w \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Perceptron Learning Revisited

Perceptron learning algorithm (slightly modified): consider the training examples cyclically and compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$.

$w^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. (This is a slight but harmless modification of the traditional algorithm.)

In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \vec{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \vec{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically. (Note that this algorithm corresponds to perceptron learning with learning rate $\varepsilon = 1$.)

We know: if $D$ is linearly separable, then there is $t^*$ such that $w^{(t^*)}$ is consistent with $D$. But what can we do if $D$ is not linearly separable?

Quadratic Decision Boundary

[Figure: left, the original data set; right, the same set transformed using the squares of the features. Right: the green line is the decision boundary learned using the perceptron algorithm (the red boundary corresponds to another learning algorithm). Left: the green ellipse maps exactly to the green line.]

How to classify (in the original space): first transform a given feature vector by squaring the features, then use the linear classifier.

Do We Need to Map Explicitly?

In general, mapping to a (much) higher-dimensional feature space helps: there are more "degrees of freedom", so linear separability gets a chance. However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.

To avoid explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited

Perceptron learning algorithm once more: compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$.

$w^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows: if $\mathrm{sig}(w^{(t)} \cdot \vec{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \vec{x}_k$; otherwise $w^{(t+1)} = w^{(t)}$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Crucial observation: note that
$$w^{(t)} = \sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \vec{x}_\ell \qquad \text{for suitable } n_1^{(t)}, \dots, n_p^{(t)} \in \mathbb{N}.$$
Intuitively, $n_\ell^{(t)}$ counts how many times $\vec{x}_\ell$ was added to (if $y_\ell = 1$), or subtracted from (if $y_\ell = -1$), the weights.
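To make the crucial observation concrete, here is a minimal sketch, not from the lecture, that runs the update rule above while also tracking the counts $n_\ell$ and checks that $w = \sum_{\ell} n_\ell \, y_\ell \, \vec{x}_\ell$ holds at the end. It assumes NumPy; the function name perceptron and the max_sweeps cap are my own choices.

```python
import numpy as np

def sig(z):
    # sig(y) = 1 if y >= 0, -1 otherwise (as defined above)
    return 1 if z >= 0 else -1

def perceptron(X, y, max_sweeps=100):
    """X: (p, n) matrix of feature vectors, y: labels in {-1, +1}."""
    p, n = X.shape
    Xa = np.hstack([np.ones((p, 1)), X])   # augmented vectors ~x_k with x_k0 = 1
    w = np.zeros(n + 1)                    # w^(0) = (0, ..., 0)
    counts = np.zeros(p, dtype=int)        # n_l: times ~x_l was added/subtracted
    for _ in range(max_sweeps):            # cap the passes; the loop need not
        changed = False                    # terminate if D is not separable
        for k in range(p):                 # examples considered cyclically
            if sig(w @ Xa[k]) != y[k]:     # misclassified -> update
                w += y[k] * Xa[k]
                counts[k] += 1
                changed = True
        if not changed:                    # w is consistent with D
            break
    assert np.allclose(w, (counts * y) @ Xa)   # w = sum_l n_l * y_l * ~x_l
    return w, counts

# The training set from the worked example that follows: x1=(2,-1), x2=(2,1), x3=(1,3)
w, counts = perceptron(np.array([[2., -1.], [2., 1.], [1., 3.]]),
                       np.array([1, 1, -1]))
```

On this set the run should end with counts = (0, 1, 1), i.e. $w = \vec{x}_2 - \vec{x}_3$, matching the dual-perceptron trace worked out below.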
Dual Perceptron Learning

Dual perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.

$n^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot (\vec{x}_\ell \cdot \vec{x}_k)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$, else $n_k^{(t+1)} := n_k^{(t)}$; and $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

If $D$ is linearly separable, there exists $t^*$ such that $\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \vec{x}_\ell$ is consistent with $D$. The algorithm stops at such a $t^*$ and returns $(n_1^{(t^*)}, \dots, n_p^{(t^*)})$, so that $\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \vec{x}_\ell$ is consistent with $D$.

Example

Training set:
$$D = \{((2,-1), 1),\ ((2,1), 1),\ ((1,3), -1)\}$$
That is,
$x_1 = (2,-1)$, $x_2 = (2,1)$, $x_3 = (1,3)$;
$\vec{x}_1 = (1,2,-1)$, $\vec{x}_2 = (1,2,1)$, $\vec{x}_3 = (1,1,3)$;
$y_1 = 1$, $y_2 = 1$, $y_3 = -1$.
The initial values are $n_1^{(0)} = n_2^{(0)} = n_3^{(0)} = 0$.

$\sum_{\ell=1}^{3} n_\ell^{(0)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(1)} = (0,0,0)$.

$\sum_{\ell=1}^{3} n_\ell^{(1)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_2$. Hence $n^{(2)} = (0,0,0)$.

$\sum_{\ell=1}^{3} n_\ell^{(2)} y_\ell (\vec{x}_\ell \cdot \vec{x}_3) = 0$, thus $\mathrm{sig}(\cdot) = 1 \ne y_3$. Hence $n^{(3)} = (0,0,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(3)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = -1 \cdot (\vec{x}_3 \cdot \vec{x}_1) = -1 \cdot ((1,1,3) \cdot (1,2,-1)) = -1 \cdot 0 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(4)} = (0,0,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(4)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = -1 \cdot (\vec{x}_3 \cdot \vec{x}_2) = -1 \cdot ((1,1,3) \cdot (1,2,1)) = -1 \cdot 6 = -6$, thus $\mathrm{sig}(\cdot) = -1 \ne y_2$. Hence $n^{(5)} = (0,1,1)$.

$\sum_{\ell=1}^{3} n_\ell^{(5)} y_\ell (\vec{x}_\ell \cdot \vec{x}_3) = \vec{x}_2 \cdot \vec{x}_3 - \vec{x}_3 \cdot \vec{x}_3 = -5$, thus $n^{(6)} = (0,1,1)$. This is OK.

$\sum_{\ell=1}^{3} n_\ell^{(6)} y_\ell (\vec{x}_\ell \cdot \vec{x}_1) = \vec{x}_2 \cdot \vec{x}_1 - \vec{x}_3 \cdot \vec{x}_1 = 4$, thus $n^{(7)} = (0,1,1)$. This is OK.

$\sum_{\ell=1}^{3} n_\ell^{(7)} y_\ell (\vec{x}_\ell \cdot \vec{x}_2) = \vec{x}_2 \cdot \vec{x}_2 - \vec{x}_3 \cdot \vec{x}_2 = 0$, thus $n^{(8)} = (0,1,1)$. This is OK.

The result: $\vec{x}_2 - \vec{x}_3 = (0, 1, -2)$.

Dual Perceptron Learning – Output

Let $\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \vec{x}_\ell$ result from the dual perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ in which the algorithm found a consistent vector. This vector of weights determines a linear classifier that for a given $x \in \mathbb{R}^n$ gives
$$h[w](x) = \mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot (\vec{x}_\ell \cdot \vec{x})\right)$$
(Here $\vec{x}$ is the augmented feature vector obtained from $x$.)

Crucial observation: the (augmented) feature vectors $\vec{x}_\ell$ and $\vec{x}$ occur only in scalar products!

Kernel Trick

For simplicity, assume bivariate data: $\vec{x}_k = (1, x_{k1}, x_{k2})$. The corresponding instance in the quadratic feature space is $(1, x_{k1}^2, x_{k2}^2)$.

Consider two instances $\vec{x}_k = (1, x_{k1}, x_{k2})$ and $\vec{x}_\ell = (1, x_{\ell 1}, x_{\ell 2})$. Then the scalar product of their corresponding instances $(1, x_{k1}^2, x_{k2}^2)$ and $(1, x_{\ell 1}^2, x_{\ell 2}^2)$, resp., in the quadratic feature space is
$$1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2$$
which resembles (but is not equal to)
$$(\vec{x}_k \cdot \vec{x}_\ell)^2 = (1 + x_{k1} x_{\ell 1} + x_{k2} x_{\ell 2})^2 = 1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2 + 2 x_{k1} x_{\ell 1} x_{k2} x_{\ell 2} + 2 x_{k1} x_{\ell 1} + 2 x_{k2} x_{\ell 2}$$
But now consider the mapping $\varphi$ to $\mathbb{R}^6$ defined by
$$\varphi(\vec{x}_k) = (1,\ x_{k1}^2,\ x_{k2}^2,\ \sqrt{2}\, x_{k1} x_{k2},\ \sqrt{2}\, x_{k1},\ \sqrt{2}\, x_{k2})$$
Then $\varphi(\vec{x}_k) \cdot \varphi(\vec{x}_\ell) = (\vec{x}_k \cdot \vec{x}_\ell)^2$.

THE idea: define a kernel $\kappa(\vec{x}_k, \vec{x}_\ell) = (\vec{x}_k \cdot \vec{x}_\ell)^2$ and replace $\vec{x}_k \cdot \vec{x}_\ell$ in the dual perceptron algorithm with $\kappa(\vec{x}_k, \vec{x}_\ell)$.

Kernel Perceptron Learning

Kernel perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.

$n^{(0)}$ is initialized to $\mathbf{0} = (0, \dots, 0)$. In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows: if $\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \kappa(\vec{x}_k, \vec{x}_\ell)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$, else $n_k^{(t+1)} := n_k^{(t)}$; and $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$. Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Intuition: the algorithm computes a linear classifier in $\mathbb{R}^6$ for the training examples transformed using $\varphi$. The trick is that the transformation $\varphi$ itself never has to be computed explicitly!
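Below is a minimal sketch of the kernel perceptron update, for illustration only. It assumes NumPy; the names kernel_perceptron, quadratic_kernel and max_sweeps are my own. It precomputes the matrix of kernel values and never forms $\varphi$ explicitly.

```python
import numpy as np

def sig(z):
    return 1 if z >= 0 else -1

def quadratic_kernel(xa, xb):
    return (xa @ xb) ** 2                      # kappa(~x_k, ~x_l) = (~x_k . ~x_l)^2

def kernel_perceptron(X, y, kernel=quadratic_kernel, max_sweeps=100):
    """X: (p, n) matrix of feature vectors, y: labels in {-1, +1}."""
    y = np.asarray(y)
    p = X.shape[0]
    Xa = np.hstack([np.ones((p, 1)), X])       # augmented vectors ~x_k
    K = np.array([[kernel(Xa[k], Xa[l]) for l in range(p)] for k in range(p)])
    counts = np.zeros(p, dtype=int)            # n^(0) = (0, ..., 0)
    for _ in range(max_sweeps):                # cap: the loop need not terminate if the
        changed = False                        # transformed data are not separable
        for k in range(p):                     # examples considered cyclically
            if sig(np.sum(counts * y * K[k])) != y[k]:
                counts[k] += 1                 # n_k := n_k + 1
                changed = True
        if not changed:
            break
    def classify(x):                           # h(x) = sig(sum_l n_l y_l kappa(~x, ~x_l))
        xa = np.concatenate([[1.0], np.asarray(x, dtype=float)])
        return sig(sum(c * yl * kernel(xa, xl) for c, yl, xl in zip(counts, y, Xa)))
    return counts, classify
```

The returned closure classify is exactly the non-linear classifier made explicit on the next slide.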
Kernel Perceptron Learning – Output

Let $n = (n_1, \dots, n_p)$ result from the kernel perceptron learning algorithm, i.e. each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ such that
$$\mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \kappa(\vec{x}_k, \vec{x}_\ell)\right) = y_k \quad \text{for all } k = 1, \dots, p.$$
We obtain a non-linear classifier that for a given $x \in \mathbb{R}^n$ gives
$$h(x) = \mathrm{sig}\!\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \kappa(\vec{x}, \vec{x}_\ell)\right)$$
(Here $\vec{x}$ is the augmented feature vector obtained from $x$.)

Are there other kernels that correspond to scalar products in higher-dimensional spaces?

Kernels

Given a (potential) kernel $\kappa(x_\ell, x_k)$ we need to check whether $\kappa(x_\ell, x_k) = \varphi(x_\ell) \cdot \varphi(x_k)$ for some function $\varphi$. This might be very difficult.

Theorem (Mercer). $\kappa$ is a kernel if the corresponding Gram matrix $K$ of the training set $D$, whose $(\ell, k)$-th element is $\kappa(x_\ell, x_k)$, is positive semi-definite for all possible choices of the training set $D$.

Kernels can be constructed from existing kernels by several operations: linear combination (i.e. multiplication by a constant, or sum), multiplication, exponentiation, multiplication by a polynomial with non-negative coefficients, ... (see e.g. "Pattern Recognition and Machine Learning" by Bishop).

Examples of Kernels

Linear: $\kappa(x_\ell, x_k) = x_\ell \cdot x_k$. The corresponding mapping $\varphi(x) = x$ is the identity (no transformation).

Polynomial of degree $m$: $\kappa(x_\ell, x_k) = (1 + x_\ell \cdot x_k)^m$. The corresponding mapping assigns to $x \in \mathbb{R}^n$ a vector $\varphi(x)$ in $\mathbb{R}^{\binom{n+m}{m}}$.

Gaussian (radial basis function): $\kappa(x_\ell, x_k) = e^{-\frac{\|x_\ell - x_k\|^2}{2\sigma^2}}$. The corresponding mapping $\varphi$ maps $x$ to an infinite-dimensional vector $\varphi(x)$ which is, in fact, a Gaussian function; a combination of such functions for the support vectors then forms the separating hypersurface.

Choosing a kernel remains the black magic of kernel methods; kernels are usually chosen by trial and error (of course, experience and additional insight into the data help).

Now let's move on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea – Which Linear Classifier is the Best?

Benefits of maximum margin: intuitively, a maximum margin is good w.r.t. generalization. Only the support vectors (those on the margin) matter; the others can, in principle, be ignored.

Support Vector Machines (SVM)

Notation:
$w = (w_0, w_1, \dots, w_n)$ — a vector of weights,
$\mathbf{w} = (w_1, \dots, w_n)$ — the vector of all weights except $w_0$,
$x = (x_1, \dots, x_n)$ — a (generic) feature vector.

Consider a linear classifier:
$$h[w](x) := \begin{cases} 1 & w_0 + \sum_{i=1}^n w_i \cdot x_i = w_0 + \mathbf{w} \cdot x \ge 0 \\ -1 & w_0 + \sum_{i=1}^n w_i \cdot x_i = w_0 + \mathbf{w} \cdot x < 0 \end{cases}$$
The signed distance of $x$ from the decision boundary determined by $w$ is
$$d[w](x) = \frac{w_0 + \mathbf{w} \cdot x}{\|\mathbf{w}\|}$$
Here $\|\mathbf{w}\| = \sqrt{\sum_{i=1}^n w_i^2}$ is the Euclidean norm of $\mathbf{w}$.

$|d[w](x)|$ is the distance of $x$ from the decision boundary. $d[w](x)$ is positive for $x$ on the side to which $\mathbf{w}$ points and negative on the opposite side.

Support Vectors & Margin

Given a training set
$$D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$$
Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$; we write $y_k$ instead of $y(x_k)$.

Assume that $D$ is linearly separable, and let $w$ be consistent with $D$ such that the distance of the decision boundary from the nearest examples on both sides is the same (if not, it suffices to adjust $w_0$).

Support vectors are those $x_k$ that minimize $|d[w](x_k)|$.

The margin $\rho$ of $w$ is twice the distance between the support vectors and the decision boundary. Our goal is to find a classifier that maximizes the margin.
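To make these definitions concrete, here is a small sketch (assuming NumPy; the function names and the tie tolerance tol are my own) that, for a weight vector $w$ consistent with $D$, computes the signed distances $d[w](x_k)$, picks out the support vectors, and returns the margin $\rho$.

```python
import numpy as np

def signed_distances(w0, w, X):
    """d[w](x) = (w0 + w.x) / ||w|| for every row x of X, with w = (w1, ..., wn)."""
    return (w0 + X @ w) / np.linalg.norm(w)

def margin_and_support_vectors(w0, w, X, y, tol=1e-9):
    d = signed_distances(w0, w, X)
    # consistent with D and no example exactly on the boundary
    assert np.all(y * d > 0), "w is not consistent with D"
    nearest = np.min(np.abs(d))
    support = np.where(np.abs(d) <= nearest + tol)[0]   # the x_k minimizing |d[w](x_k)|
    rho = 2 * nearest                                    # margin: twice the nearest distance
    return rho, support
```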
Maximizing the Margin

For $w$ consistent with $D$ (such that no $x_k$ lies on the decision boundary) we have
$$\rho = \frac{2 \cdot |w_0 + \mathbf{w} \cdot x_k|}{\|\mathbf{w}\|} = \frac{2 \cdot y_k \cdot (w_0 + \mathbf{w} \cdot x_k)}{\|\mathbf{w}\|} > 0$$
where $x_k$ is a support vector.

We may safely consider only $w$ such that $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$ for the support vectors: just rescale $w$ so that $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$; the denominator $\|\mathbf{w}\|$ will compensate. Then maximizing $\rho$ is equivalent to maximizing $2 / \|\mathbf{w}\|$.

(In what follows we use a slightly looser constraint: $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$ for all $x_k$. However, the result is the same, since even with this looser condition the support vectors always satisfy $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) = 1$ whenever $2 / \|\mathbf{w}\|$ is maximal.)

SVM – Optimization

Margin maximization can be formulated as a quadratic optimization problem:

Find $w = (w_0, \dots, w_n)$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$.

This can be reformulated as:

Find $w$ such that $\Phi(w) = \|\mathbf{w}\|^2 = \mathbf{w} \cdot \mathbf{w}$ is minimized and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \mathbf{w} \cdot x_k) \ge 1$.

SVM – Optimization

We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.

The solution usually involves constructing a dual problem in which a Lagrange multiplier $\alpha_\ell$ is associated with every inequality (constraint) of the original problem:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot (x_\ell \cdot x_k)$$
is maximized subject to the constraints
$$\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0, \qquad \alpha_\ell \ge 0 \ \text{ for all } 1 \le \ell \le p.$$

The Optimization Problem Solution

Given a solution $\alpha_1, \dots, \alpha_p$ to the dual problem, the solution $w = (w_0, w_1, \dots, w_n)$ of the original one is:
$$\mathbf{w} = (w_1, \dots, w_n) = \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot x_\ell$$
$$w_0 = y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x_k) \qquad \text{for an arbitrary } k \text{ with } \alpha_k > 0.$$
Note that $\alpha_k > 0$ iff $x_k$ is a support vector. Hence it does not matter which $\alpha_k > 0$ is chosen in the above definition of $w_0$.

The classifier is then
$$h(x) = \mathrm{sig}(w_0 + \mathbf{w} \cdot x) = \mathrm{sig}\!\left(y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x_k) + \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot (x_\ell \cdot x)\right)$$
Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products! We may apply the kernel trick!

Kernel SVM

Choose your favourite kernel $\kappa$. Solve the dual problem with kernel $\kappa$:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot \kappa(x_\ell, x_k)$$
is maximized subject to the constraints
$$\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0, \qquad \alpha_\ell \ge 0 \ \text{ for all } 1 \le \ell \le p.$$
Then use the classifier
$$h(x) = \mathrm{sig}\!\left(y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x_k) + \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x)\right)$$
(with $k$ again chosen so that $\alpha_k > 0$). Note that the optimization techniques remain the same as for the linear SVM without kernels!

Comments on Algorithms

The main bottleneck of SVMs is the complexity of quadratic programming (QP); a naive QP solver has cubic complexity.

For small problems any general-purpose optimization algorithm can be used. For large problems this is usually not feasible, and many methods that avoid solving the full problem directly have been devised. These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively: start with a (smaller) subset of the training examples and find an optimal solution using any solver. Afterwards, only the support vectors matter in the solution, so keep only them in the training set and add new training examples. This iterative procedure decreases the overall cost function.
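For small problems, the dual can indeed be handed to a general-purpose QP solver. Below is a hedged sketch of that route for the hard-margin kernel SVM dual above; it is not the lecture's code, it assumes NumPy and the third-party cvxopt package, and the name train_kernel_svm and the 1e-6 support-vector threshold are my own choices.

```python
import numpy as np
from cvxopt import matrix, solvers   # generic QP: min 1/2 a'Pa + q'a  s.t.  Ga <= h, Aa = b

def train_kernel_svm(X, y, kernel):
    """Hard-margin kernel SVM. X: (p, n) array, y: labels in {-1, +1}.
    The dual QP is bounded only if the data are separable in the kernel's feature space."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(p)] for i in range(p)])
    P = matrix(np.outer(y, y) * K)                   # P_lk = y_l y_k kappa(x_l, x_k)
    q = matrix(-np.ones(p))                          # maximize sum(alpha) = minimize -sum(alpha)
    G = matrix(-np.eye(p)); h = matrix(np.zeros(p))  # -alpha_l <= 0, i.e. alpha_l >= 0
    A = matrix(y.reshape(1, -1)); b = matrix(0.0)    # sum_l alpha_l y_l = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                                # alpha_k > 0  <=>  x_k is a support vector
    k = int(np.argmax(sv))                           # any support vector works for w_0
    w0 = y[k] - np.sum(alpha[sv] * y[sv] * K[sv, k])
    def classify(x):                                 # h(x) = sig(w_0 + sum_l alpha_l y_l kappa(x_l, x))
        x = np.asarray(x, dtype=float)
        s = sum(a * yl * kernel(xl, x) for a, yl, xl in zip(alpha[sv], y[sv], X[sv]))
        return 1 if w0 + s >= 0 else -1
    return classify

# Example call with a Gaussian kernel (sigma = 1):
# clf = train_kernel_svm(X, y, lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))
```

Large training sets would instead use a decomposition method such as SMO, mentioned below.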
SVM in Applications (Mooney's lecture)

SVMs were originally proposed by Boser, Guyon, and Vapnik in 1992 and gained increasing popularity in the late 1990s.

SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.

SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.

SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.

The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the $\alpha_\ell$'s at a time, e.g. SMO [Platt '99] and [Joachims '99].

Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.