PA196: Pattern Recognition
06. Support vector machines
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Basic statistical learning theory
  Introduction
  VC bounds on generalization error
  The VC dimension
2 Support vector machines
  Linear SVMs
  Nonlinear SVM
  The VC dimension of SVM
  Posterior probabilities for SVM
3 Other kernel methods

Basic statistical learning theory: Introduction

SLT (statistical learning theory)
• views the problem of "learning" from a statistical perspective
• aim (as for any theory): model some phenomena so that we can make predictions about them
• other equally valid theories exist: Bayesian inference, inductive inference, statistical physics, "traditional" statistical analysis, etc.
• some assumptions need to be made, and these assumptions determine which approach is more suitable in a given context

In SLT:
• we assume the data is generated by some underlying (unknown) distribution P(x, y)
• a sample of n i.i.d. observations drawn from P is available to the learner: S = {(x_i, y_i) ∈ R^d × {±1} | i = 1, ..., n}
• a learning algorithm A chooses a function f = A_F(S) from a function space F as a result of training on S
• generalization error (expected error): ε(S, A, F) = E_{(x,y)}[l(A_F(S), x, y)], where l is a loss function
• we are interested not only in E_S[ε(S, A, F)], but also in the distribution of ε(S, A, F)
• classifier consistency: lim_{n→∞} E_S[ε(S, A, F)] = ε_Bayes, where ε_Bayes is the Bayes risk
• the distribution of ε(S, A, F) depends on the algorithm, on F and on n
• classical statistics: investigates mostly the mean of the distribution of ε
• SLT: looks also at the tails; derives probabilistic bounds on the generalization error
• hence PAC: probably approximately correct - bound the probability of being "deceived" and set it equal to some δ

What is the probability of being deceived by a "bad" function f? That is, what is the probability of having perfect training error while the true error is above some ε?

P_S{Err_S(f) = 0, Err(f) > ε} = (1 − Err(f))^n ≤ (1 − ε)^n ≤ exp(−εn)

Taking ε = (1/n) ln(1/δ) leads to

P_S{Err_S(f) = 0, Err(f) > (1/n) ln(1/δ)} ≤ δ
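As a quick illustration of this single-function bound, here is a minimal simulation sketch (not part of the slides; the values of n, ε and the number of trials are arbitrary choices). For a fixed classifier with true error ε, it estimates the probability of observing zero training error on n i.i.d. points and compares it with (1 − ε)^n ≤ exp(−εn).

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_zero_training_error(true_err, n, n_trials=100_000):
    """Monte Carlo estimate of P{Err_S(f) = 0} for a fixed classifier
    whose true error is `true_err`, over samples of size n."""
    # each training point is misclassified independently with probability true_err
    errors = rng.random((n_trials, n)) < true_err
    return np.mean(errors.sum(axis=1) == 0)

n, eps = 100, 0.05
empirical = prob_zero_training_error(eps, n)
exact = (1 - eps) ** n
bound = np.exp(-eps * n)
print(f"simulated P(zero training error) = {empirical:.4f}")
print(f"(1 - eps)^n                      = {exact:.4f}")
print(f"exp(-eps * n)                    = {bound:.4f}")
```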
Now consider a (countable) set of functions F = {f_1, ..., f_k, ...} and require that the probability of being misled by f_k be less than q_k δ (with Σ_k q_k ≤ 1). Then

P_S{∃ f_k : Err_S(f_k) = 0, Err(f_k) > (1/n) ln(1/(q_k δ))} ≤ δ

Theorem
Given a countable set of functions F and weights q_k with Σ_k q_k ≤ 1, with probability at least 1 − δ over random samples of size n, the generalization error of a function f_k ∈ F with zero training error is bounded by

Err(f_k) ≤ (1/n) (ln(1/q_k) + ln(1/δ))

Notes:
• ln(1/q_k) can be thought of as a "complexity" (description length) of the function f_k

Basic statistical learning theory: VC bounds on generalization error

• use the 0-1 loss: (1/2)|y_i − f(x_i, α)| ∈ {0, 1}
• the expected error (expected risk or actual risk) is R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y)
• the empirical risk is measured over an observed set (here of size n): R_emp(α) = (1/(2n)) Σ_{i=1}^n |y_i − f(x_i, α)|
• for such losses, the following bound holds (Vapnik, 1995): for η ∈ [0, 1], with probability 1 − η,

R(α) ≤ R_emp(α) + sqrt( [h(log(2n/h) + 1) − log(η/4)] / n )

• h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension and is a measure of the capacity of the set of functions f
• the second term on the right-hand side (the square root) is called the VC confidence
• notes: the bound is independent of P(x, y); if we knew h, we could compute the right-hand side

Basic statistical learning theory: The VC dimension

• the VC dimension is a characteristic of the set of functions F = {f(x, α)}
• we restrict the analysis to functions taking values in {±1}
• n points can be labeled in 2^n distinct ways
• if for any labeling of the set of points a function f(x, α) can be found in F that realizes it, then we say that F shatters the set of points
• the VC dimension (h) of F is the maximum number of points that can be shattered by F
• if the VC dimension of F is h, it means that there exists at least one set of h points that can be shattered, not that all such sets can be shattered

Shattering points with oriented hyperplanes in R^d
The VC dimension of the set of oriented hyperplanes in R^d is d + 1.

Notes:
• h does not depend on the number of parameters a family of functions has
• of two machines with null empirical risk, the one with the smaller h has better generalization guarantees
• a k-NN classifier with k = 1 has h = ∞ and null empirical risk → the bound becomes useless
• h depends on the class of functions F, while R and R_emp depend on the particular function selected by the learning machine

Structural risk minimization
[figure: nested subsets of functions with VC dimensions h_1 < h_2 < h_3 < ... < h_k]
• we introduce a structure over the set of functions, such that h_1 < h_2 < ... < h_k < ...
• idea: find the subset of functions which minimizes the empirical risk, while controlling the complexity

Support vector machines: Linear SVMs

(Reminder) [figure: linear SVM margin geometry, with labels 2/‖w‖, −w_0/‖w‖, w and ξ/‖w‖]

minimize_{w, w_0, ξ}  (1/2)⟨w, w⟩ + Ω(ξ)
subject to  y_i(⟨w, x_i⟩ + w_0) ≥ 1 − ξ_i,  i = 1, ..., n
            ξ_i ≥ 0,  i = 1, ..., n
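To connect the primal formulation above with practice, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data and the value C = 1.0 are arbitrary choices) that fits a linear soft-margin SVM and reads off w, w_0 and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy two-class data in R^2
X = np.vstack([rng.normal([-2, 0], 1.0, (20, 2)),
               rng.normal([+2, 0], 1.0, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # weight vector w
w0 = clf.intercept_[0]    # offset w_0
sv = clf.support_         # indices of the support vectors (alpha_i > 0)

print("w  =", w)
print("w0 =", w0)
print("number of support vectors:", len(sv))
# the discriminant h(x) = <w, x> + w_0 agrees with the library's score
assert np.allclose(X @ w + w0, clf.decision_function(X))
```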
• Ω(ξ) = C Σ_i ξ_i^p; p = 1 → 1-norm (L1) soft margin SVM, p = 2 → 2-norm (L2) soft margin SVM
• w_0 can be computed from w_0 = y_i − ⟨w, x_i⟩, and a more stable solution is obtained by averaging over all support vectors (SVs): w_0 = (1/|SV|) Σ_{i∈SV} (y_i − ⟨w, x_i⟩)

L1 SVM
Dual optimization problem (from the KKT conditions):

maximize_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
subject to  Σ_{i=1}^n y_i α_i = 0
            0 ≤ α_i ≤ C (box constraints),  i = 1, ..., n

Notes:
• if α_i = 0 then ξ_i = 0 and it follows that x_i is correctly classified
• if 0 < α_i < C then y_i(⟨w, x_i⟩ + w_0) − 1 + ξ_i = 0 and ξ_i = 0, meaning that x_i is an unbounded support vector
• if α_i = C then y_i(⟨w, x_i⟩ + w_0) − 1 + ξ_i = 0 and ξ_i > 0, meaning that x_i is a bounded support vector; moreover, if ξ_i < 1 then x_i is correctly classified, otherwise it is misclassified
• w_0 is obtained as before, but averaging only over the unbounded SVs
• the discriminant function is h(x) = Σ_{i∈SV} α_i y_i ⟨x_i, x⟩ + w_0: predict y = +1 if h(x) > 0 and y = −1 if h(x) < 0

L2 SVM
For convenience, we take Ω(ξ) = (C/2) Σ_i ξ_i^2, which leads to the dual optimization problem

maximize_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j ( ⟨x_i, x_j⟩ + δ_ij/C )
subject to  Σ_{i=1}^n y_i α_i = 0
            α_i ≥ 0,  i = 1, ..., n

where δ_ij is Kronecker's delta.

Notes:
• w_0 is computed by averaging over terms of the form y_i − Σ_{j=1}^n α_j y_j ( ⟨x_i, x_j⟩ + δ_ij/C )
• the decision function remains the same

Support vector machines: Nonlinear SVM

The kernel trick
• the SVM problem was formulated in terms of inner products
• let Φ : R^d → H be a mapping from the input space into a feature space, and suppose that there exists a "kernel function" such that K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
• H may be infinite-dimensional, e.g. K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
• if we replace ⟨x_i, x_j⟩ with K(x_i, x_j) in the linear SVM, we obtain a nonlinear SVM!

[figure: the mapping Φ(x) from the input space into the feature space]

Discriminant function: h(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + w_0

Which functions can be used as kernels?
For some kernels, it is easy to find the corresponding mapping Φ: for example, K(x_i, x_j) = ⟨x_i, x_j⟩² corresponds to Φ : R² → R³, Φ(x) = (x_1², √2 x_1 x_2, x_2²)ᵀ.
In general, for a given kernel there may exist several possible mappings Φ.
[figure from Burges: A tutorial on support vector machines for pattern recognition]

(Theoretical conditions for kernels) Mercer's condition
There exist a mapping Φ and an expansion K(x, z) = Σ_i Φ(x)_i Φ(z)_i if and only if, for any g(x) such that ∫ g(x)² dx < ∞,

∫∫ K(x, z) g(x) g(z) dx dz ≥ 0

• if Mercer's condition is not satisfied, there might exist cases for which the optimization problem has no solution
• the space generated by the kernel is called a Reproducing Kernel Hilbert Space (RKHS)
• kernel matrix (Gram matrix): K_ij = K(x_i, x_j); Hessian matrix: H_ij = y_i y_j K(x_i, x_j)
• K is positive semi-definite
• in the L2 SVM, the diagonal of K is augmented by 1/C, thus potentially transforming K into a positive definite matrix
• all the information about the data is concentrated in K
• K can be seen as defining a similarity between samples

Commonly used kernels:
• linear kernel: K(x, z) = ⟨x, z⟩
• polynomial kernel: K(x, z) = (⟨x, z⟩ + 1)^p
• radial basis function (RBF) kernel: K(x, z) = exp( −‖x − z‖² / (2σ²) )
• sigmoid kernel: K(x, z) = tanh(κ⟨x, z⟩ − δ); this kernel does not always satisfy Mercer's condition!
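As a quick numerical check of the kernel trick (a sketch, not from the slides; the sample points are arbitrary), the following compares K(x, z) = ⟨x, z⟩² with the explicit feature map Φ(x) = (x_1², √2 x_1 x_2, x_2²)ᵀ given above, and verifies that the resulting Gram matrix is positive semi-definite, as Mercer's condition requires.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k_quad(x, z):
    """Quadratic kernel K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# the kernel value equals the inner product in the feature space
for x in X:
    for z in X:
        assert np.isclose(k_quad(x, z), np.dot(phi(x), phi(z)))

# the Gram matrix is symmetric positive semi-definite
K = np.array([[k_quad(x, z) for z in X] for x in X])
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue of K:", eigvals.min())  # >= 0 up to round-off
```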
Kernels - closure properties
If K_1 and K_2 are kernels, a ∈ R_+, f a real-valued function, φ : R^d → R^m, and B a symmetric positive semi-definite d × d matrix, then the following are also kernels:
• K_1(x, z) + K_2(x, z)
• aK_1(x, z)
• K_1(x, z) K_2(x, z)
• K(x, z) = f(x) f(z)
• K_1(φ(x), φ(z))
• K(x, z) = xᵀ B z

The solution of the optimization problem is:
• global: any local solution of a convex optimization problem is also a global solution
• unique: if the Hessian matrix is positive definite, the solution is guaranteed to be unique

In case the solution is not unique:
• it is still global!
• if w_1 and w_2 are solutions, then there exists a path w(τ) = τ w_1 + (1 − τ) w_2, with 0 ≤ τ ≤ 1, such that w(τ) is also a solution

Support vector machines: The VC dimension of SVM

• for a Mercer kernel K, the VC dimension of the SVM is dim(H) + 1
• the VC dimension of the RKHS generated by the polynomial kernel is the binomial coefficient (d+p−1 choose p), where p is the degree of the polynomial
• the VC dimension in the case of an RBF kernel is infinite

How come SVMs can have very good generalization performance, even in the case of an infinite VC dimension?
Hint: it has to do with the large margin...

Another bound on the generalization error:

E[P(error)] ≤ E[number of SVs] / n

where E[number of SVs] is the expected number of support vectors over all training sets of size n

Support vector machines: Posterior probabilities for SVM

Platt scaling
Idea: apply a logistic transformation to the classifier score (margin):

P(y = +1 | x) = 1 / (1 + exp(αh(x) + β))

The parameters α and β are found by optimization (typically maximum likelihood on a separate calibration set).

Some remarks
• SVMs have a good overall performance on a large number of problems, but they are not the "Swiss army knife" of pattern recognition
• one key ingredient: choosing the right kernel
• another key ingredient: choosing the right formulation of the problem
• in general, there are a number of parameters (e.g. C and p or σ) that need to be tuned
• C can be used to re-balance the classes: replace the single C by class-specific penalties C_+ and C_−, assigning a different weight to each class
• there exist extensions such as support vector regression and support vector density estimation

Other kernel methods

• why not replace the inner product with kernels in other methods as well?
• apply the same reasoning in the case of regression...
• this leads to Kernel LDA, Kernel PCA, the Kernel Perceptron, etc.

Kernel LDA (Mika et al., Fisher Discriminant Analysis with Kernels, 1999)
Fisher criterion: w* = arg max_w (wᵀ S_b w) / (wᵀ S_w w)
Suppose now that this is carried out in the feature space: the means and scatter matrices are computed on the transformed data.
(Sketch) This can still be expressed in terms of operations in the input space. Let µ_Φ = (1/n) Σ_i Φ(x_i) be the mean in the feature space (each class has a similar mean). The weight vector has the form w = Σ_i α_i Φ(x_i). Then the product ⟨w, µ_Φ⟩ takes the form

⟨w, µ_Φ⟩ = (1/n) Σ_{i,j} α_j K(x_i, x_j)
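To make this last step concrete, here is a small sketch (not from the slides; the data, the α coefficients and the reuse of the quadratic kernel from earlier are arbitrary choices) verifying that ⟨w, µ_Φ⟩ can be computed purely from the kernel matrix, without ever forming Φ explicitly.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = <x, z>^2 on R^2 (used only for checking)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    return np.dot(x, z) ** 2

rng = np.random.default_rng(1)
n = 6
X = rng.normal(size=(n, 2))
alpha = rng.normal(size=n)           # coefficients defining w = sum_i alpha_i Phi(x_i)

# explicit computation in the feature space
Phi = np.array([phi(x) for x in X])
w = Phi.T @ alpha                    # w as a vector in the feature space
mu = Phi.mean(axis=0)                # mu_Phi = (1/n) sum_i Phi(x_i)
explicit = np.dot(w, mu)

# kernel-only computation: <w, mu_Phi> = (1/n) sum_{i,j} alpha_j K(x_i, x_j)
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
kernel_only = (alpha @ K).sum() / n

assert np.isclose(explicit, kernel_only)
print(explicit, kernel_only)
```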