Numerical Features

Throughout this lecture we assume that all features are numerical, i.e. feature vectors belong to $\mathbb{R}^n$. Most non-numerical features can be conveniently transformed into numerical ones. For example:
- Colors {blue, red, yellow} can be represented by {0, 1, 2} (or {−1, 0, 1}, ...).
- A black-and-white picture of $x \times y$ pixels can be encoded as a vector of $xy$ numbers that capture the shades of gray of the pixels.

Basic Problems

We consider two basic problems:
- (Binary) classification. Our goal: classify inputs into two categories.
- Function approximation (regression). Our goal: find a (hypothesized) functional dependency in the data.

Binary Classification in $\mathbb{R}^n$

Assume a set of instances $X \subseteq \mathbb{R}^n$ and an unknown categorization function $c : X \to \{0, 1\}$.

Our goal: Given a set $D$ of training examples of the form $(\mathbf{x}, c(\mathbf{x}))$ where $\mathbf{x} \in X$, construct a hypothesized categorization function $h \in H$ that is consistent with $c$ on the training examples, i.e.,
$$h(\mathbf{x}) = c(\mathbf{x}) \quad \text{for all training examples } (\mathbf{x}, c(\mathbf{x})) \in D.$$

Comments:
- In practice we often do not strictly demand $h(\mathbf{x}) = c(\mathbf{x})$ for all training examples $(\mathbf{x}, c(\mathbf{x})) \in D$ (often it is impossible).
- We are more interested in good generalization, that is, how well $h$ classifies new instances that do not belong to $D$.
- Recall that we usually evaluate the accuracy of the resulting hypothesized function $h$ on a test set.

Hypothesis Spaces

We consider two kinds of hypothesis spaces:
- Linear (affine) classifiers (this lecture).
- Classifiers based on combinations of linear and sigmoidal functions (classical neural networks) (next lecture).

Length and Scalar Product of Vectors

We consider vectors $\mathbf{x} = (x_1, \ldots, x_m) \in \mathbb{R}^m$. Typically, we use the Euclidean metric on vectors:
$$|\mathbf{x}| = \sqrt{\sum_{i=1}^m x_i^2}.$$
The distance between two vectors (points) $\mathbf{x}, \mathbf{y}$ is $|\mathbf{x} - \mathbf{y}|$.

We use the scalar product $\mathbf{x} \cdot \mathbf{y}$ of vectors $\mathbf{x} = (x_1, \ldots, x_m)$ and $\mathbf{y} = (y_1, \ldots, y_m)$ defined by
$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^m x_i y_i.$$
Recall that $\mathbf{x} \cdot \mathbf{y} = |\mathbf{x}||\mathbf{y}| \cos\theta$ where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. That is, $\mathbf{x} \cdot \mathbf{y}$ is the length of the projection of $\mathbf{y}$ onto $\mathbf{x}$ multiplied by $|\mathbf{x}|$. Note that $\mathbf{x} \cdot \mathbf{x} = |\mathbf{x}|^2$.

Linear Classifier – Example

[Figure: classification in the plane using a linear classifier; points labeled 0 and 1 separated by a line.] If a point is incorrectly classified, the learning algorithm turns the line (hyperplane) to improve the classification.

Linear Classifier

A linear classifier $h[\mathbf{w}]$ is determined by a vector of weights $\mathbf{w} = (w_0, w_1, \ldots, w_n) \in \mathbb{R}^{n+1}$ as follows: Given $\mathbf{x} = (x_1, \ldots, x_n) \in X \subseteq \mathbb{R}^n$,
$$h[\mathbf{w}](\mathbf{x}) := \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^n w_i x_i \ge 0 \\ 0 & \text{if } w_0 + \sum_{i=1}^n w_i x_i < 0 \end{cases}$$
More succinctly: $h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}\!\left(w_0 + \sum_{i=1}^n w_i x_i\right)$ where
$$\mathrm{sgn}(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{if } y < 0 \end{cases}$$

Linear Classifier – Notation

Given $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ we define the augmented feature vector $\tilde{\mathbf{x}} = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:
$$h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}}).$$

Perceptron Learning

Given a training set
$$D = \{(\mathbf{x}_1, c(\mathbf{x}_1)), (\mathbf{x}_2, c(\mathbf{x}_2)), \ldots, (\mathbf{x}_p, c(\mathbf{x}_p))\}$$
Here $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $c(\mathbf{x}_k) \in \{0, 1\}$. We write $c_k$ instead of $c(\mathbf{x}_k)$. Note that $\tilde{\mathbf{x}}_k = (x_{k0}, x_{k1}, \ldots, x_{kn})$ where $x_{k0} = 1$.

A weight vector $\mathbf{w} \in \mathbb{R}^{n+1}$ is consistent with $D$ if
$$h[\mathbf{w}](\mathbf{x}_k) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}}_k) = c_k \quad \text{for all } k = 1, \ldots, p.$$
$D$ is linearly separable if there is a vector $\mathbf{w} \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Our goal is to find a consistent $\mathbf{w}$, assuming that $D$ is linearly separable.
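To make the definition concrete, here is a minimal NumPy sketch of the classifier $h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}})$ with the lecture's convention $\mathrm{sgn}(y) = 1$ for $y \ge 0$ and $0$ otherwise. The function names (`augment`, `h`) are illustrative, not part of the lecture.

```python
import numpy as np

def augment(x):
    """Return the augmented feature vector ~x = (1, x1, ..., xn)."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def h(w, x):
    """Linear classifier h[w](x) = sgn(w . ~x) with sgn(y) = 1 iff y >= 0."""
    return 1 if np.dot(w, augment(x)) >= 0 else 0

# Example: w = (w0, w1, w2) = (0, -1, 1) classifies points in the plane.
w = np.array([0.0, -1.0, 1.0])
print(h(w, (2, -1)))  # 0, since 0 - 2 - 1 = -3 < 0
print(h(w, (1, 3)))   # 1, since 0 - 1 + 3 =  2 >= 0
```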
Perceptron – Learning Algorithm

Online learning algorithm:

Idea: Cyclically go through the training examples in $D$ and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example is closer to its correct half-space.

Compute a sequence of weight vectors $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$
- $\mathbf{w}^{(0)}$ is initialized randomly close to $\mathbf{0} = (0, \ldots, 0)$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \left(h[\mathbf{w}^{(t)}](\mathbf{x}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \left(\mathrm{sgn}(\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k$$
Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically, and $0 < \varepsilon \le 1$ is the learning rate.

Theorem (Rosenblatt). If $D$ is linearly separable, then there is $t^*$ such that $\mathbf{w}^{(t^*)}$ is consistent with $D$.

Example

Training set: $D = \{((2, -1), 1), ((2, 1), 1), ((1, 3), 0)\}$. That is,
$\mathbf{x}_1 = (2, -1)$, $\mathbf{x}_2 = (2, 1)$, $\mathbf{x}_3 = (1, 3)$,
$\tilde{\mathbf{x}}_1 = (1, 2, -1)$, $\tilde{\mathbf{x}}_2 = (1, 2, 1)$, $\tilde{\mathbf{x}}_3 = (1, 1, 3)$,
$c_1 = 1$, $c_2 = 1$, $c_3 = 0$.
Assume that the initial vector is $\mathbf{w}^{(0)} = (0, -1, 1)$ and consider $\varepsilon = 1$.

Example: Separating by $\mathbf{w}^{(0)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ in the plane together with the separating line given by $\mathbf{w}^{(0)}$.]
Denoting $\mathbf{w}^{(0)} = (w_0, w_1, w_2) = (0, -1, 1)$, the blue separating line is given by $w_0 + w_1 x_1 + w_2 x_2 = 0$. The red vector normal to the blue line is $(w_1, w_2)$. The points on the side of $(w_1, w_2)$ are assigned 1 by the classifier, the others 0. (In this case $\mathbf{x}_3$ is assigned 1 and $\mathbf{x}_1, \mathbf{x}_2$ are assigned 0, all of which is inconsistent with $c_1, c_2, c_3$.)

Example: $\mathbf{w}^{(1)}$

We have $\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1 = (0, -1, 1) \cdot (1, 2, -1) = 0 - 2 - 1 = -3$, thus $\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) = 0$ and thus $\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) - c_1 = 0 - 1 = -1$. (This means that $\mathbf{x}_1$ is not well classified, and $\mathbf{w}^{(0)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(1)} = \mathbf{w}^{(0)} - \left(\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) - c_1\right) \cdot \tilde{\mathbf{x}}_1 = \mathbf{w}^{(0)} + \tilde{\mathbf{x}}_1 = (0, -1, 1) + (1, 2, -1) = (1, 1, 0).$$

Example

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(1)}$.]

Example: Separating by $\mathbf{w}^{(1)}$

We have $\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2 = (1, 1, 0) \cdot (1, 2, 1) = 1 + 2 = 3$, thus $\mathrm{sgn}(\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2) - c_2 = 1 - 1 = 0$. (This means that $\mathbf{x}_2$ is currently well classified by $\mathbf{w}^{(1)}$. However, as we will see, $\mathbf{x}_3$ is not well classified.) Hence, $\mathbf{w}^{(2)} = \mathbf{w}^{(1)} = (1, 1, 0)$.

Example: $\mathbf{w}^{(3)}$

We have $\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3 = (1, 1, 0) \cdot (1, 1, 3) = 1 + 1 = 2$, thus $\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) - c_3 = 1 - 0 = 1$. (This means that $\mathbf{x}_3$ is not well classified, and $\mathbf{w}^{(2)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(3)} = \mathbf{w}^{(2)} - \left(\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) - c_3\right) \cdot \tilde{\mathbf{x}}_3 = \mathbf{w}^{(2)} - \tilde{\mathbf{x}}_3 = (1, 1, 0) - (1, 1, 3) = (0, 0, -3).$$

Example: Separating by $\mathbf{w}^{(3)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(3)}$.]

Example: $\mathbf{w}^{(4)}$

We have $\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1 = (0, 0, -3) \cdot (1, 2, -1) = 3$, thus $\mathrm{sgn}(\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1) - c_1 = 1 - 1 = 0$. (This means that $\mathbf{x}_1$ is currently well classified by $\mathbf{w}^{(3)}$. However, as we will see, $\mathbf{x}_2$ is not.) Hence, $\mathbf{w}^{(4)} = \mathbf{w}^{(3)} = (0, 0, -3)$.

Example: $\mathbf{w}^{(5)}$

We have $\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2 = (0, 0, -3) \cdot (1, 2, 1) = -3$, thus $\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) = 0$ and thus $\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) - c_2 = 0 - 1 = -1$. (This means that $\mathbf{x}_2$ is not well classified, and $\mathbf{w}^{(4)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(5)} = \mathbf{w}^{(4)} - \left(\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) - c_2\right) \cdot \tilde{\mathbf{x}}_2 = \mathbf{w}^{(4)} + \tilde{\mathbf{x}}_2 = (0, 0, -3) + (1, 2, 1) = (1, 2, -2).$$

Example: Separating by $\mathbf{w}^{(5)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(5)}$.]

Example: The Result

The vector $\mathbf{w}^{(5)}$ is consistent with $D$:
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_1) = \mathrm{sgn}((1, 2, -2) \cdot (1, 2, -1)) = \mathrm{sgn}(7) = 1 = c_1$$
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_2) = \mathrm{sgn}((1, 2, -2) \cdot (1, 2, 1)) = \mathrm{sgn}(3) = 1 = c_2$$
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_3) = \mathrm{sgn}((1, 2, -2) \cdot (1, 1, 3)) = \mathrm{sgn}(-3) = 0 = c_3$$
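The whole run above can be reproduced with a short script. This is a minimal sketch of the online perceptron rule under the same setup as the example (ε = 1 and the fixed initial vector w(0) = (0, −1, 1)); names such as `perceptron_online` are illustrative.

```python
import numpy as np

def sgn(y):
    """sgn as defined in the lecture: 1 if y >= 0, else 0."""
    return 1 if y >= 0 else 0

def perceptron_online(X, c, w0, eps=1.0, max_steps=1000):
    """Online perceptron: cycle through the examples, update on each visit."""
    Xa = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # rows are ~x_k
    w = np.asarray(w0, dtype=float)
    for t in range(max_steps):
        k = t % len(X)
        w = w - eps * (sgn(w @ Xa[k]) - c[k]) * Xa[k]
        # stop as soon as w is consistent with the whole training set
        if all(sgn(w @ xk) == ck for xk, ck in zip(Xa, c)):
            return w, t + 1
    return w, max_steps

# Training set from the example: D = {((2,-1),1), ((2,1),1), ((1,3),0)}
X = [(2, -1), (2, 1), (1, 3)]
c = [1, 1, 0]
w, steps = perceptron_online(X, c, w0=(0, -1, 1))
print(w, steps)  # [ 1.  2. -2.] after 5 steps, matching w(5) above
```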
Perceptron – Learning Algorithm

Batch learning algorithm:

Compute a sequence of weight vectors $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$
- $\mathbf{w}^{(0)}$ is initialized randomly close to $\mathbf{0} = (0, \ldots, 0)$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p \left(h[\mathbf{w}^{(t)}](\mathbf{x}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p \left(\mathrm{sgn}(\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k$$
Here all training examples are considered in every step, and $0 < \varepsilon \le 1$ is the learning rate.

Function Approximation – Oaks in Wisconsin

[Figure: oak growth data.] This example is from How to Lie with Statistics by Darrell Huff (1954).

Function Approximation

Assume a set $X \subseteq \mathbb{R}^n$ of instances and an unknown function $f : X \to \mathbb{R}$.

Our goal: Given a set $D$ of training examples of the form $(\mathbf{x}, f(\mathbf{x}))$ where $\mathbf{x} \in X$, construct a hypothesized function $h \in H$ such that
$$h(\mathbf{x}) \approx f(\mathbf{x}) \quad \text{for all training examples } (\mathbf{x}, f(\mathbf{x})) \in D.$$
Here $\approx$ means that the values are somewhat close to each other with respect to an appropriate error function $E$. In what follows we use the least squares error defined by
$$E = \frac{1}{2} \sum_{(\mathbf{x}, f(\mathbf{x})) \in D} (f(\mathbf{x}) - h(\mathbf{x}))^2.$$
Our goal is to minimize $E$. The main reason is that this function has nice mathematical properties (as opposed, e.g., to $\sum_{(\mathbf{x}, f(\mathbf{x})) \in D} |f(\mathbf{x}) - h(\mathbf{x})|$).

Least Squares – Oaks in Wisconsin

[Figure: least-squares fit of the oak data.]

Linear Function Approximation

Given a set $D$ of training examples:
$$D = \{(\mathbf{x}_1, f(\mathbf{x}_1)), (\mathbf{x}_2, f(\mathbf{x}_2)), \ldots, (\mathbf{x}_p, f(\mathbf{x}_p))\}$$
Here $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^n$ and $f(\mathbf{x}_k) \in \mathbb{R}$. We write $f_k$ instead of $f(\mathbf{x}_k)$. Recall that $\tilde{\mathbf{x}}_k = (x_{k0}, x_{k1}, \ldots, x_{kn})$ where $x_{k0} = 1$.

Our goal: Find $\mathbf{w}$ so that $h[\mathbf{w}](\mathbf{x}) = \mathbf{w} \cdot \tilde{\mathbf{x}}$ approximates the function $f$, some of whose values are given by the training set.

Least squares error function:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^p (\mathbf{w} \cdot \tilde{\mathbf{x}}_k - f_k)^2 = \frac{1}{2} \sum_{k=1}^p \left(\sum_{i=0}^n w_i x_{ki} - f_k\right)^2$$

Gradient of the Error Function

Consider the gradient of the error function:
$$\nabla E(\mathbf{w}) = \left(\frac{\partial E}{\partial w_0}(\mathbf{w}), \ldots, \frac{\partial E}{\partial w_n}(\mathbf{w})\right) = \sum_{k=1}^p (\mathbf{w} \cdot \tilde{\mathbf{x}}_k - f_k) \cdot \tilde{\mathbf{x}}_k$$
What is the gradient $\nabla E(\mathbf{w})$? It is a vector in $\mathbb{R}^{n+1}$ which points in the direction of the steepest ascent of $E$ (its length corresponds to the steepness). Note that here the vectors $\tilde{\mathbf{x}}_k$ are fixed parameters of $E$!

Fact. If $\nabla E(\mathbf{w}) = \mathbf{0} = (0, \ldots, 0)$, then $\mathbf{w}$ is a global minimum of $E$. This follows from the fact that $E$ is a convex paraboloid with a unique extreme, which is a minimum.

Gradient – Illustration

[Figure: illustration of the gradient on a convex paraboloid.]

Function Approximation – Learning

Gradient descent:
- Weights $\mathbf{w}^{(0)}$ are initialized randomly close to $\mathbf{0}$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \nabla E(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p (\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k - f_k) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p (h[\mathbf{w}^{(t)}](\mathbf{x}_k) - f_k) \cdot \tilde{\mathbf{x}}_k$$
Here $0 < \varepsilon \le 1$ is the learning rate. Note that the algorithm is almost identical to the batch perceptron algorithm!

Proposition. For sufficiently small $\varepsilon > 0$ the sequence $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$ converges (component-wise) to the global minimum of $E$.

Finding the Minimum in Dimension One

Assume $n = 1$. Then the error function $E$ is
$$E(w_0, w_1) = \frac{1}{2} \sum_{k=1}^p (w_0 + w_1 x_k - f_k)^2$$
Minimize $E$ with respect to $w_0$ and $w_1$:
$$\frac{\partial E}{\partial w_0} = 0 \;\Leftrightarrow\; w_0 = \bar{f} - w_1 \bar{x} \;\Leftrightarrow\; \bar{f} = w_0 + w_1 \bar{x}$$
where $\bar{x} = \frac{1}{p} \sum_{k=1}^p x_k$ and $\bar{f} = \frac{1}{p} \sum_{k=1}^p f_k$, and
$$\frac{\partial E}{\partial w_1} = 0 \;\Leftrightarrow\; w_1 = \frac{\frac{1}{p} \sum_{k=1}^p (f_k - \bar{f})(x_k - \bar{x})}{\frac{1}{p} \sum_{k=1}^p (x_k - \bar{x})^2}$$
i.e. $w_1 = \mathrm{cov}(f, x)/\mathrm{var}(x)$.

Finding the Minimum in Arbitrary Dimension

Let $A$ be the $p \times (n+1)$ matrix ($p$ rows, $n+1$ columns) whose $k$-th row is the vector $\tilde{\mathbf{x}}_k$. Let $\mathbf{f} = (f_1, \ldots, f_p)$ be the column vector formed by the values of $f$ in the training set. Then
$$\nabla E(\mathbf{w}) = \mathbf{0} \;\Leftrightarrow\; \mathbf{w} = (A^\top A)^{-1} A^\top \mathbf{f}$$
if $(A^\top A)^{-1}$ exists. (Then $(A^\top A)^{-1} A^\top$ is the so-called Moore–Penrose pseudoinverse of $A$.)

Normal Distribution – Reminder

Distribution of continuous random variables. Density (one-dimensional, that is, over $\mathbb{R}$):
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) =: N[\mu, \sigma^2](x)$$
$\mu$ is the expected value (the mean), $\sigma^2$ is the variance.
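The two ways of minimizing $E(\mathbf{w})$ described above can be compared numerically: batch gradient descent and the closed-form solution $\mathbf{w} = (A^\top A)^{-1} A^\top \mathbf{f}$. The following is a small sketch; the data, the learning rate, and the iteration count are made up for illustration and are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D training data: f_k is roughly 2 + 3*x_k plus noise.
x = rng.uniform(0, 5, size=20)
f = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=20)

A = np.column_stack([np.ones_like(x), x])      # k-th row is ~x_k = (1, x_k)

# Closed-form minimum: solve (A^T A) w = A^T f.
w_closed = np.linalg.solve(A.T @ A, A.T @ f)

# Batch gradient descent on E(w) = 1/2 * sum_k (w . ~x_k - f_k)^2.
w = np.zeros(2)
eps = 0.005                                    # small enough to converge here
for _ in range(20000):
    grad = A.T @ (A @ w - f)                   # = sum_k (w . ~x_k - f_k) * ~x_k
    w = w - eps * grad

print(w_closed)   # close to (2, 3)
print(w)          # converges to the same minimum
```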
Maximum Likelihood vs Least Squares

Fix a training set $D = \{(x_1, f_1), (x_2, f_2), \ldots, (x_p, f_p)\}$. Assume that each $f_k$ has been generated randomly by
$$f_k = \mathbf{w} \cdot \tilde{\mathbf{x}}_k + \varepsilon_k$$
(in dimension one, $\tilde{\mathbf{x}}_k = (1, x_k)$, so $\mathbf{w} \cdot \tilde{\mathbf{x}}_k = w_0 + w_1 x_k$), where
- $w_0, w_1$ are unknown constants,
- the $\varepsilon_k$ are normally distributed with mean 0 and an unknown variance $\sigma^2$.
Assume that $\varepsilon_1, \ldots, \varepsilon_p$ have been generated independently.

Denote by $p(f_1, \ldots, f_p \mid w_0, w_1, \sigma^2)$ the probability density according to which the values $f_1, \ldots, f_p$ were generated, assuming fixed $w_0, w_1, x_1, \ldots, x_p$. (For the interested: the independence and normality imply
$$p(f_1, \ldots, f_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^p N[w_0 + w_1 x_k, \sigma^2](f_k)$$
where $N[w_0 + w_1 x_k, \sigma^2](f_k)$ is the normal density with mean $w_0 + w_1 x_k$ and variance $\sigma^2$.)

Maximum Likelihood vs Least Squares

Our goal is to find $\mathbf{w}$ that maximizes the likelihood that the training set $D$ with the fixed values $f_1, \ldots, f_p$ has been generated:
$$L(\mathbf{w}, \sigma^2) := p(f_1, \ldots, f_p \mid \mathbf{w}, \sigma^2)$$

Theorem. $\mathbf{w}$ maximizes $L(\mathbf{w}, \sigma^2)$ for an arbitrary $\sigma^2$ iff $\mathbf{w}$ minimizes $E(\mathbf{w})$.

Note that the maximizing/minimizing $\mathbf{w}$ does not depend on $\sigma^2$. The maximizing $\sigma^2$ satisfies $\sigma^2 = \frac{1}{p} \sum_{k=1}^p (f_k - \mathbf{w} \cdot \tilde{\mathbf{x}}_k)^2$. A numerical check of this correspondence appears at the end of this section.

Comments on Linear Models

- Linear models are parametric, i.e. they have a fixed form with a small number of parameters that need to be learned from data (as opposed, e.g., to decision trees, where the structure is not fixed in advance).
- Linear models are stable, i.e. small variations in the training data have only a limited impact on the learned model (tree models typically vary more with the training data).
- Linear models are less likely to overfit the training data (low variance) but sometimes tend to underfit (high bias).
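As a numerical check of the maximum-likelihood claim above: the sketch below generates data according to the assumed model, fits $\mathbf{w}$ by least squares, estimates $\sigma^2$ from the residuals, and verifies on a few perturbed weight vectors that the log-likelihood is maximized at the least-squares solution. The data and helper names (`log_likelihood`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following the model: f_k = w0 + w1*x_k + eps_k, eps_k ~ N(0, sigma^2).
w_true, sigma_true = np.array([2.0, 3.0]), 0.5
x = rng.uniform(0, 5, size=50)
f = w_true[0] + w_true[1] * x + rng.normal(0, sigma_true, size=50)

A = np.column_stack([np.ones_like(x), x])          # rows are ~x_k = (1, x_k)

def log_likelihood(w, sigma2):
    """log L(w, sigma^2) = sum_k log N[w . ~x_k, sigma^2](f_k)."""
    r = f - A @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - r**2 / (2 * sigma2))

# Least-squares solution and the corresponding maximum-likelihood variance estimate.
w_ls = np.linalg.solve(A.T @ A, A.T @ f)
sigma2_ml = np.mean((f - A @ w_ls) ** 2)

# The log-likelihood at w_ls is at least as large as at any perturbed w
# (checked here on a few random perturbations), as the theorem states.
for _ in range(5):
    w_other = w_ls + rng.normal(0, 0.2, size=2)
    assert log_likelihood(w_ls, sigma2_ml) >= log_likelihood(w_other, sigma2_ml)
print(w_ls, sigma2_ml)
```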