What about classification?

Binary classification: desired outputs are 0 and 1. Ideally, we would like to capture the probability distribution of the classes. A linear model does not capture probability well (its output is not a probability at all); the logistic sigmoid $\frac{1}{1+e^{-\vec{w}\cdot\vec{x}}}$ is much better!

Logistic regression

(Figure: a single neuron with formal unit input $x_0 = 1$, inputs $x_1, \dots, x_n$, weights $w_0, w_1, \dots, w_n$, and output $y$.)

$\vec{w} = (w_0, w_1, \dots, w_n)$ and $\vec{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$.

Activity:
- inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
- activation function: $\sigma(\xi) = \frac{1}{1+e^{-\xi}}$
- network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \frac{1}{1+e^{-\vec{w}\cdot\vec{x}}}$

Intuition: the output $y$ is now interpreted as the probability of the class 1 given the input $\vec{x}$.

But what is the meaning of the sigmoid?

The model gives a probability $y$ of the class 1 given an input $\vec{x}$. But why do we model such a probability using $1/(1 + e^{-\vec{w}\cdot\vec{x}})$??

Let $\hat{y}$ be the "true" probability of the class 1 to be modeled.
- What about the odds of the class 1? $\mathrm{odds}(\hat{y}) = \hat{y}/(1 - \hat{y})$. Resembles an exponential function ...
- What about the log odds (aka logit) of the class 1? $\mathrm{logit}(\hat{y}) = \log(\hat{y}/(1 - \hat{y}))$. Looks almost linear ...

Assume that $\hat{y}$ is the probability of the class 1, and put

$$\log(\hat{y}/(1 - \hat{y})) = \vec{w} \cdot \vec{x}$$

Then $\log((1 - \hat{y})/\hat{y}) = -\vec{w} \cdot \vec{x}$, hence $(1 - \hat{y})/\hat{y} = e^{-\vec{w}\cdot\vec{x}}$, and thus

$$\hat{y} = \frac{1}{1 + e^{-\vec{w}\cdot\vec{x}}}$$

That is, if we model the log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function.

Logistic regression

Learning: Given a training dataset
$$T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \{0, 1\}$ is the expected output.

What error function? The (binary) cross-entropy:

$$E(\vec{w}) = \sum_{k=1}^{p} -\big( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \big)$$

What?!?
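To make the model concrete, here is a minimal NumPy sketch of the logistic-regression network function and the cross-entropy error above; the function names (`sigmoid`, `predict`, `cross_entropy`) are mine, and the inputs are assumed to already carry the formal unit input $x_0 = 1$:

```python
import numpy as np

def sigmoid(xi):
    """Logistic sigmoid sigma(xi) = 1 / (1 + e^(-xi))."""
    return 1.0 / (1.0 + np.exp(-xi))

def predict(w, x):
    """Network function y[w](x) = sigma(w . x); x[0] is the formal unit input 1."""
    return sigmoid(np.dot(w, x))

def cross_entropy(w, X, d):
    """Binary cross-entropy E(w) over inputs X (one row per x_k, x_k0 = 1) and labels d."""
    y = sigmoid(X @ w)
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))
```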
Log likelihood is your friend!

Let's have a "coin" (sides 0 and 1). The probability of 1 is $\hat{y}$ and is unknown!

You have tossed the coin 5 times and got a training dataset:
$$T = \{1, 1, 0, 0, 1\} = \{d_1, \dots, d_5\}$$
Consider this to be a very special case where the input dimension is 0.

What is the best model $y$ of $\hat{y}$ based on the data? Answer: the one that generates the data with maximum probability!

Assume that the data was generated by independent trials. Then the probability of getting exactly $T$ from our model is

$$L = y \cdot y \cdot (1 - y) \cdot (1 - y) \cdot y$$

How to maximize this w.r.t. $y$? Maximize the log likelihood

$$LL = \log(L) = \log(y) + \log(y) + \log(1 - y) + \log(1 - y) + \log(y)$$

But then

$$-LL = -1 \cdot \log(y) - 1 \cdot \log(y) - (1 - 0) \cdot \log(1 - y) - (1 - 0) \cdot \log(1 - y) - 1 \cdot \log(y)$$

i.e. $-LL$ is the cross-entropy.

Let the coin depend on the input

Consider our model:
$$y = \frac{1}{1 + e^{-\vec{w}\cdot\vec{x}}}$$

The training dataset is now standard:
$$T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \{0, 1\}$ is the expected output.

The likelihood:
$$L = \prod_{k=1}^{p} y_k^{d_k} \cdot (1 - y_k)^{(1 - d_k)}$$
and
$$LL = \log(L) = \sum_{k=1}^{p} \big( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \big)$$
and thus $-LL$ = the cross-entropy. Minimizing the cross-entropy maximizes the log likelihood (and vice versa).

Normal distribution

Distribution of continuous random variables. Density (one-dimensional, that is over $\mathbb{R}$):

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) =: N[\mu, \sigma^2](x)$$

$\mu$ is the expected value (the mean), $\sigma^2$ is the variance.
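As a quick numeric check of the coin example (a sketch, not part of the original derivation): scanning candidate models $y$ shows that the cross-entropy $-LL$ is minimized at the empirical frequency $3/5$ of side 1.

```python
import numpy as np

d = np.array([1, 1, 0, 0, 1])        # the five coin tosses
ys = np.linspace(0.01, 0.99, 981)    # candidate models y of the unknown y-hat

# -LL(y) = cross-entropy of the constant model y on the dataset d
neg_ll = np.array([-np.sum(d * np.log(y) + (1 - d) * np.log(1 - y)) for y in ys])

print(ys[np.argmin(neg_ll)])         # ~0.6, i.e. the empirical frequency 3/5
```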
Maximum Likelihood vs Least Squares (Dim 1)

Fix a training set $D = \{(x_1, d_1), (x_2, d_2), \dots, (x_p, d_p)\}$.

Assume that each $d_k$ has been generated randomly by

$$d_k = (w_0 + w_1 \cdot x_k) + \epsilon_k$$

where
- $w_0, w_1$ are unknown numbers,
- $\epsilon_k$ are normally distributed with mean 0 and an unknown variance $\sigma^2$.

Assume that $\epsilon_1, \dots, \epsilon_p$ were generated independently.

Denote by $p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2)$ the probability density according to which the values $d_1, \dots, d_p$ were generated, assuming fixed $w_0, w_1, \sigma^2, x_1, \dots, x_p$.

The independence and normality imply

$$p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^{p} N[w_0 + w_1 x_k, \sigma^2](d_k) = \prod_{k=1}^{p} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(d_k - w_0 - w_1 x_k)^2}{2\sigma^2} \right)$$

Our goal is to find $(w_0, w_1)$ that maximizes the likelihood that the training set $D$ with fixed values $d_1, \dots, d_p$ has been generated:

$$L(w_0, w_1, \sigma^2) := p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2)$$

Theorem: $(w_0, w_1)$ maximizes $L(w_0, w_1, \sigma^2)$ for arbitrary $\sigma^2$ iff $(w_0, w_1)$ minimizes the squared error $E(w_0, w_1) = \sum_{k=1}^{p} (d_k - w_0 - w_1 x_k)^2$.

Note that the maximizing/minimizing $(w_0, w_1)$ does not depend on $\sigma^2$. The maximizing $\sigma^2$ satisfies

$$\sigma^2 = \frac{1}{p} \sum_{k=1}^{p} (d_k - w_0 - w_1 \cdot x_k)^2$$

MLP training – theory

Architecture – Multilayer Perceptron (MLP)

(Figure: an MLP with input neurons $x_1, x_2$, one hidden layer, and output neurons $y_1, y_2$.)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).
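The theorem can be illustrated numerically. The following sketch (all names and data are my own illustration) generates data from the assumed model, fits $(w_0, w_1)$ by least squares via `np.polyfit`, and recovers the maximizing $\sigma^2$ as the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following the assumed model d_k = w0 + w1*x_k + eps_k
w0_true, w1_true, sigma_true = 1.0, 2.0, 0.5
x = rng.uniform(-1, 1, size=100)
d = w0_true + w1_true * x + rng.normal(0.0, sigma_true, size=100)

# Least squares fit = maximum likelihood estimate of (w0, w1), by the theorem
w1_hat, w0_hat = np.polyfit(x, d, deg=1)

# The maximizing sigma^2 is the mean squared residual
sigma2_hat = np.mean((d - w0_hat - w1_hat * x) ** 2)
print(w0_hat, w1_hat, sigma2_hat)    # close to 1.0, 2.0, 0.25
```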
MLP – architecture

Notation:
- Denote by $X$ the set of input neurons, by $Y$ the set of output neurons, and by $Z$ the set of all neurons ($X, Y \subseteq Z$).
- Individual neurons are denoted by indices $i, j$ etc.
- $\xi_j$ is the inner potential of the neuron $j$ after the computation stops.
- $y_j$ is the output of the neuron $j$ after the computation stops (define $y_0 = 1$ as the value of the formal unit input).
- $w_{ji}$ is the weight of the connection from $i$ to $j$ (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron $j$).
- $j^{\leftarrow}$ is the set of all $i$ such that $j$ is adjacent from $i$ (i.e. there is an arc to $j$ from $i$).
- $j^{\rightarrow}$ is the set of all $i$ such that $j$ is adjacent to $i$ (i.e. there is an arc from $j$ to $i$).

MLP – activity

Activity:
- inner potential of neuron $j$: $\xi_j = \sum_{i \in j^{\leftarrow}} w_{ji} y_i$
- activation function $\sigma_j$ for neuron $j$ (an arbitrary differentiable function) [e.g. the logistic sigmoid $\sigma_j(\xi) = \frac{1}{1+e^{-\lambda_j \xi}}$]
- state of a non-input neuron $j \in Z \setminus X$ after the computation stops: $y_j = \sigma_j(\xi_j)$ ($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes write $y_j(\vec{w}, \vec{x})$)

The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$. Layer-wise computation: first, all input neurons are assigned the values of the input; in the $\ell$-th step, all neurons of the $\ell$-th layer are evaluated.

MLP – learning

Learning: Given a training set $T$ of the form
$$T = \{(\vec{x}_k, \vec{d}_k) \mid k = 1, \dots, p\}$$
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron $j$ for a given network input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
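Below is a minimal sketch of the layer-wise computation in matrix form, assuming a fully connected MLP with logistic sigmoid activations; `weights[l][j, i]` plays the role of $w_{ji}$ between consecutive layers, with column 0 holding the bias weights $w_{j0}$. The names and the 2-4-3-2 example are mine:

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * xi))

def forward(weights, x):
    """Layer-wise forward pass of a fully connected MLP.

    weights: list of matrices; weights[l][j, i] is w_ji between layers l and l+1,
             with column 0 holding the bias weights w_j0 (formal unit input).
    x:       input vector (values of the input neurons).
    Returns the list of layer outputs; the last entry is the network output.
    """
    ys = [x]
    for W in weights:
        y_ext = np.concatenate(([1.0], ys[-1]))   # prepend the formal unit input y_0 = 1
        ys.append(sigmoid(W @ y_ext))             # xi_j = sum_i w_ji y_i, then y_j = sigma(xi_j)
    return ys

# Example: a 2-4-3-2 architecture with random weights close to 0
rng = np.random.default_rng(1)
sizes = [2, 4, 3, 2]
weights = [0.1 * rng.standard_normal((m, n + 1)) for n, m in zip(sizes, sizes[1:])]
print(forward(weights, np.array([0.5, -0.3]))[-1])
```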
Error function:
$$E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w}) \quad \text{where} \quad E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \big( y_j(\vec{w}, \vec{x}_k) - d_{kj} \big)^2$$

MLP – learning algorithm

Batch algorithm (gradient descent): The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$$
is the weight update of $w_{ji}$ in the step $t+1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.

Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.

MLP – error function gradient

For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \dots, p$ it holds that
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$ we get
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)

If $\sigma_j(\xi) = \frac{1}{1+e^{-\lambda_j \xi}}$ for all $j \in Z$, then
$$\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$$
and thus for all $j \in Z \setminus X$:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$

If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then
$$\sigma_j'(\xi_j) = \frac{b}{a}(a - y_j)(a + y_j)$$
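A quick sanity check of the identity $\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$ against a central finite difference (a sketch; the numbers are arbitrary):

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * xi))

lam, xi, h = 2.0, 0.7, 1e-6
y = sigmoid(xi, lam)
analytic = lam * y * (1 - y)                                  # lambda * y * (1 - y)
numeric = (sigmoid(xi + h, lam) - sigmoid(xi - h, lam)) / (2 * h)
print(analytic, numeric)                                      # the two values agree closely
```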
MLP – computing the gradient

Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:

Initialize $\mathcal{E}_{ji} := 0$. (By the end of the computation: $\mathcal{E}_{ji} = \frac{\partial E}{\partial w_{ji}}$.)

For every $k = 1, \dots, p$ do:
1. forward pass: compute $y_j = y_j(\vec{w}, \vec{x}_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see below)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
4. $\mathcal{E}_{ji} := \mathcal{E}_{ji} + \frac{\partial E_k}{\partial w_{ji}}$

The resulting $\mathcal{E}_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.

MLP – backpropagation

Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
- if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
- if $j \in Z \setminus (Y \cup X)$, then, assuming that $j$ is in the $\ell$-th layer and that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the $(\ell+1)$-st layer, compute
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$$
(This works because all neurons $r \in j^{\rightarrow}$ belong to the $(\ell+1)$-st layer.)
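Putting the forward pass, backpropagation, and the accumulation steps together, here is a compact matrix-form sketch of the batch gradient computation for logistic sigmoid activations with $\lambda_j = 1$ and the squared error above; all names are mine, and the per-neuron sums are vectorized per layer:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def forward(weights, x):
    """Forward pass; returns the outputs of all layers (input layer first)."""
    ys = [x]
    for W in weights:
        y_ext = np.concatenate(([1.0], ys[-1]))
        ys.append(sigmoid(W @ y_ext))
    return ys

def gradient(weights, xs, ds):
    """Batch gradient dE/dw_ji of E(w) = sum_k 1/2 * ||y(w, x_k) - d_k||^2."""
    grads = [np.zeros_like(W) for W in weights]          # the accumulators E_ji
    for x, d in zip(xs, ds):
        ys = forward(weights, x)                         # 1. forward pass
        dE_dy = ys[-1] - d                               # 2. backward pass: output layer
        for l in reversed(range(len(weights))):
            delta = dE_dy * ys[l + 1] * (1 - ys[l + 1])  # dE_k/dy_j * sigma'(xi_j) = dE_k/dy_j * y_j(1-y_j)
            y_ext = np.concatenate(([1.0], ys[l]))
            grads[l] += np.outer(delta, y_ext)           # 3.-4. accumulate dE_k/dw_ji = delta_j * y_i
            dE_dy = weights[l][:, 1:].T @ delta          # backpropagate dE_k/dy_j to layer l
    return grads

# Tiny usage example with a 2-3-1 network
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [0.1 * rng.standard_normal((m, n + 1)) for n, m in zip(sizes, sizes[1:])]
xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
ds = [np.array([1.0]), np.array([1.0])]
print(gradient(weights, xs, ds)[0])
```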
Complexity of the batch algorithm

The computation of $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set (assuming unit cost of operations, including the computation of $\sigma_r'(\xi_r)$ for a given $\xi_r$).

Proof sketch: The algorithm does the following $p$ times:
1. the forward pass, i.e. computes $y_j(\vec{w}, \vec{x}_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $\mathcal{E}_{ji}$ (a constant-time operation in the unit cost framework)

The steps 1.-3. take linear time.

Note that the speed of convergence of the gradient descent cannot be estimated ...

Illustration of the gradient descent – XOR

Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork

MLP – learning algorithm

Online algorithm: The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$$
is the weight update of $w_{ji}$ in the step $t+1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.

There are other variants determined by the selection of the training examples used for the error computation (more on this later).

SGD

- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
  - Choose (randomly) a set of training examples $T \subseteq \{1, \dots, p\}$.
  - Compute $\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta\vec{w}^{(t)}$ where $\Delta\vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)})$.
  - $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.
  - $\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example $k$.

Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
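A sketch of the shuffle-then-slice minibatch scheme described in the note above, with a generic per-example gradient callback; `sgd`, `grad_Ek`, and the toy problem are my own illustration:

```python
import numpy as np

def sgd(w, grad_Ek, p, epochs=10, batch_size=8, eps=0.1, seed=0):
    """Minibatch SGD: shuffle all example indices, then take minibatches sequentially.

    w:        initial weight vector (NumPy array)
    grad_Ek:  function (w, k) -> gradient of the error of example k
    p:        number of training examples
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        perm = rng.permutation(p)                  # random shuffle of all data
        for start in range(0, p, batch_size):
            T = perm[start:start + batch_size]     # next minibatch
            step = sum(grad_Ek(w, k) for k in T)   # sum of per-example gradients over T
            w = w - eps * step                     # w(t+1) = w(t) - eps(t) * sum_k grad E_k
    return w

# Toy usage: least squares on d_k = 2*x_k, where E_k = 1/2 (w*x_k - d_k)^2
x = np.linspace(-1, 1, 50)
d = 2 * x
w = sgd(np.array([0.0]), lambda w, k: (w * x[k] - d[k]) * x[k], p=50)
print(w)   # approaches [2.]
```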