PV021: Neural Networks
Tomáš Brázdil

Course organization

Course materials:
Main:
- The lecture.
- Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/ (an extremely well written, modern online textbook).
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, http://www.deeplearningbook.org/ (a very good overview of the state of the art in neural networks).
Suggested:
- deeplearning.ai courses by Andrew Ng.

Evaluation:
- Project
  - teams of two students
  - implementation of a selected model + analysis of given data
  - implementation in C, C++, or Java, without any specialized libraries for data analysis and machine learning
  - you need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations)
- Oral exam
  - I may ask about anything from the lecture, including some proofs that occur only on the whiteboard!
- Application of any deep learning toolset to given (difficult) data
  - We prefer TensorFlow, but you may use another library (CNTK, Caffe, DeepLearning4j, ...).
  - The goal is to get the best results on increasingly difficult datasets.

FAQ

Q: Why can't we use specialized libraries in the projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you are confronted with rounding errors and numerical instabilities.

Q: Why should you attend this course when there are infinitely many great resources elsewhere?
A: There are at least two reasons:
- You may discuss issues in person.
- I will make you truly learn the fundamentals by heart.

Notable features of the course
- Use of mathematical notation and reasoning (the course contains several proofs that are mandatory for the exam).
- Sometimes goes deeper into the statistical underpinnings of neural network learning.
- The project demands a complete working solution which must satisfy a prescribed performance specification.
- An unusual exam system!
You can repeat the oral exam as many times as needed (only the best grade goes into IS).

An example of an instruction email (from another course with the same system): "It is typically not sufficient to devote a single afternoon to the preparation for the exam. You have to know _everything_ (which means every single thing) starting with slide 42 and ending with slide 245, with the notable exceptions of slides 121-123, 137-140, 165 and 167. Proofs presented on the whiteboard are also mandatory."

Machine learning in general

Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed). For example:
- A spam filter
  - learns to recognize spam from a database of "labelled" emails,
  - and is consequently able to distinguish spam from ham.
- A handwritten text reader
  - learns from a database of handwritten letters (or texts) labelled by their correct meaning,
  - and is consequently able to recognize text.
- ... and lots of much more sophisticated applications.

Basic attributes of learning algorithms:
- representation: the ability to capture the inner structure of the training data,
- generalization: the ability to work properly on new data.

Machine learning algorithms typically construct mathematical models of given data. The models may subsequently be applied to fresh data.
There are many types of models:
- decision trees
- support vector machines
- hidden Markov models
- Bayesian networks and other graphical models
- neural networks
- ...
Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks

An artificial neuron is a rough mathematical approximation of a biological neuron. An (artificial) neural network (NN) consists of a number of interconnected artificial neurons. The "behavior" of the network is encoded in the connections between neurons.
Image source: http://tulane.edu/sse/cmb/people/schrader/

Why artificial neural networks?

Modelling of biological neural networks (computational neuroscience):
- Simplified mathematical models help to identify important mechanisms:
  - How does a brain receive information?
  - How is the information stored?
  - How does a brain develop?
  - ...
- Neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments.
I will not spend much time on this area!

Neural networks in machine learning:
- Typically primitive models, far from their biological counterparts (but often inspired by biology).
- Strongly oriented towards concrete application domains:
  - decision making and control – autonomous vehicles, manufacturing processes, control of natural resources
  - games – backgammon, poker, Go, Starcraft, ...
  - finance – stock prices, risk analysis
  - medicine – diagnosis, signal processing (EKG, EEG, ...), image processing (MRI, X-ray, WSI, ...)
  - text and speech processing – automatic translation, text generation, speech recognition
  - other signal processing – filtering, radar tracking, noise reduction
  - ...
I will concentrate on this area!
Important features of neural networks
- Massive parallelism: many slow (and "dumb") computational elements work in parallel on several levels of abstraction.
- Learning: a kid learns to recognize a rabbit after seeing several rabbits.
- Generalization: a kid is able to recognize a new rabbit after seeing several (old) rabbits.
- Robustness: a blurred photo of a rabbit may still be classified as an image of a rabbit.
- Graceful degradation: experiments have shown that a damaged neural network may still work quite well; a damaged network may re-adapt, with the remaining neurons taking over the functionality of the damaged ones.

The aim of the course

We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn:
- Basic models (multilayer perceptron, convolutional networks, recurrent networks (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks).
- Standard applications of these models (image processing, a little bit of speech and text processing).
- Basic learning algorithms (gradient descent & backpropagation, Hebb's rule).
- Basic practical training techniques (data preparation, setting various parameters, control of learning).
- Basic information about current implementations (TensorFlow, Keras).

Biological neural network
- The human neural network consists of approximately 10^11 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons.
- Each neuron is connected with approximately 10^4 other neurons.
- Neurons themselves are very complex systems.

Rough description of the nervous system:
- An external stimulus is received by sensory receptors (e.g. eye cells).
- The information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated), and subsequently an output signal is produced.
- Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).
Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN 080537146X

Summation (figure)

Biological and mathematical neurons (figure)

Formal neuron (without bias)
(Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n feed a neuron with inner potential ξ and activation function σ, producing the output y.)
- x_1, ..., x_n ∈ R are inputs.
- w_1, ..., w_n ∈ R are weights.
- ξ is the inner potential; almost always ξ = ∑_{i=1}^n w_i x_i.
- y is the output, given by y = σ(ξ), where σ is an activation function; e.g. the unit step function σ(ξ) = 1 if ξ ≥ h, else 0, where h ∈ R is a threshold.
Formal neuron (with bias)
(Diagram: as above, extended with the bias input x_0 = 1, whose weight w_0 = −h plays the role of the threshold.)
- x_0 = 1, x_1, ..., x_n ∈ R are inputs.
- w_0, w_1, ..., w_n ∈ R are weights.
- ξ is the inner potential; almost always ξ = w_0 + ∑_{i=1}^n w_i x_i.
- y is the output, given by y = σ(ξ), where σ is an activation function; e.g. the unit step function σ(ξ) = 1 if ξ ≥ 0, else 0.
(The threshold h has been substituted by the new input x_0 = 1 and the weight w_0 = −h.)
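Taken literally, the definition is only a few lines of code. A minimal sketch in Python (the function names are mine, not from any library):

```python
def unit_step(xi):
    """Unit step activation: 1 if the inner potential is >= 0, else 0."""
    return 1 if xi >= 0 else 0

def neuron_output(weights, inputs, activation=unit_step):
    """Formal neuron with bias: weights = (w0, w1, ..., wn),
    inputs = (x1, ..., xn); the bias input x0 = 1 is implicit."""
    xi = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return activation(xi)

# A neuron computing AND of two binary inputs: w0 = -2, w1 = w2 = 1.
for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(x, neuron_output((-2, 1, 1), x))  # fires only on (1, 1)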
Neuron and linear separation
The inner potential ξ = w_0 + ∑_{i=1}^n w_i x_i determines a separation hyperplane ξ = 0 in the n-dimensional input space:
- in 2D a line,
- in 3D a plane,
- ...

Example: a single neuron classifying images of the letters A and B (output 1/0 for A/B). Here n = 8 · 8, i.e. the number of pixels in the images, and the inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
- The red line w_0 + ∑_{i=1}^n w_i x_i = 0 classifies incorrectly.
- The green line w̄_0 + ∑_{i=1}^n w̄_i x_i = 0 classifies correctly (it may be a result of a correction by a learning algorithm).

Neuron and linear separation (XOR)
Label the points (0,0) and (1,1) with 0 and the points (0,1) and (1,0) with 1. No line separates the ones from the zeros.

Neural networks
A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons. In order to describe a particular type of neural network we need to specify:
- Architecture: how the neurons are connected.
- Activity: how the network transforms inputs to outputs.
- Learning: how the weights are changed during training.

Architecture
The network architecture is given as a digraph whose nodes are neurons and whose edges are connections. We distinguish several categories of neurons:
- output neurons,
- hidden neurons,
- input neurons.
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input nor output.)

Architecture – Cycles
A network is cyclic (recurrent) if its architecture contains a directed cycle; otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)
- Neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Activity
Consider a network with n neurons, k of them input and ℓ of them output.
- The state of a network is a vector of the output values of all neurons. (The states of a network with n neurons are vectors of R^n.) The state space of a network is the set of all states.
- A network input is a vector of k real numbers, i.e. an element of R^k. The network input space is the set of all network inputs. (Sometimes we restrict ourselves to a proper subset of R^k.)
- Initial state: the input neurons are set to the values of the network input (each component of the network input corresponds to one input neuron); the values of the remaining neurons are set to 0.
Activity – computation of a network
The computation (typically) proceeds in discrete steps. In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.)

A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x.

The network output is the vector of values of all output neurons in the network (i.e. an element of R^ℓ). Note that the network output keeps changing throughout the computation!

MLP uses the following selection rule: in the i-th step, evaluate all neurons in the i-th layer.

Activity – semantics of a network
Definition: Consider a network with n neurons, k of them input and ℓ of them output. Let A ⊆ R^k and B ⊆ R^ℓ. Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops.
Example (figure): the network shown computes a function from R^2 to R.

Activity – inner potential and activation functions
In order to specify the activity of a network, we need to specify how the inner potentials ξ are computed and what the activation functions σ are. We assume (unless otherwise specified) that
ξ = w_0 + ∑_{i=1}^n w_i · x_i
where x = (x_1, ..., x_n) are the inputs of the neuron and w = (w_1, ..., w_n) are the weights.
There are special types of neural networks where the inner potential is computed differently, e.g. as a "distance" of the input from the weight vector:
ξ = ||x − w||
where ||·|| is a vector norm, typically Euclidean.

There are many activation functions; typical examples (compared in the sketch below):
- Unit step function: σ(ξ) = 1 if ξ ≥ 0, else 0.
- (Logistic) sigmoid: σ(ξ) = 1 / (1 + e^{−λ·ξ}), where λ ∈ R is a steepness parameter.
- Hyperbolic tangent: σ(ξ) = (1 − e^{−ξ}) / (1 + e^{−ξ}).
- ReLU: σ(ξ) = max(ξ, 0).
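A small side-by-side sketch of the four activation functions (plain Python, in the spirit of the project rules that forbid specialized libraries; the names are illustrative):

```python
import math

def unit_step(xi):
    return 1 if xi >= 0 else 0

def logistic_sigmoid(xi, lam=1.0):
    return 1.0 / (1.0 + math.exp(-lam * xi))

def hyperbolic_tangent(xi):
    # (1 - e^{-xi}) / (1 + e^{-xi}); note this equals tanh(xi / 2)
    return (1.0 - math.exp(-xi)) / (1.0 + math.exp(-xi))

def relu(xi):
    return max(xi, 0.0)

for f in (unit_step, logistic_sigmoid, hyperbolic_tangent, relu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```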
Activity – XOR
Consider a two-layer network where every neuron has the unit step activation function σ(ξ) = 1 if ξ ≥ 0, else 0: two hidden neurons with weights (bias first) (−1, 2, 2) and (3, −2, −2), and an output neuron with weights (−2, 1, 1). Evaluating the network layer by layer for each input shows that it computes XOR(x_1, x_2):

x_1  x_2  y
 1    1   0
 1    0   1
 0    1   1
 0    0   0

Activity – MLP and linear separation
- The first hidden neuron corresponds to the line P1 given by −1 + 2x_1 + 2x_2 = 0.
- The second hidden neuron corresponds to the line P2 given by 3 − 2x_1 − 2x_2 = 0.
- The output neuron fires exactly on the points in the intersection of the two half-planes, which contains (0,1) and (1,0) but not (0,0) and (1,1).
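The XOR network can be checked mechanically. A sketch using the weights above (hidden neuron 1 realizes the line P1, hidden neuron 2 the line P2, and the output neuron takes their conjunction):

```python
def step(xi):
    return 1 if xi >= 0 else 0

def neuron(weights, inputs):
    # weights = (w0, w1, ..., wn); the bias input x0 = 1 is implicit
    return step(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))

def xor_net(x1, x2):
    h1 = neuron((-1, 2, 2), (x1, x2))    # the half-plane above the line P1
    h2 = neuron((3, -2, -2), (x1, x2))   # the half-plane below the line P2
    return neuron((-2, 1, 1), (h1, h2))  # AND of the two half-planes

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # reproduces the XOR table
```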
Activity – example
(Animation: a small network with unit step activations is evaluated step by step on the input 1; the states of the neurons keep changing from step to step.)

Learning
Consider a network with n neurons, k of them input and ℓ of them output.
- A configuration of a network is a vector of all weight values. (The configurations of a network with m connections are elements of R^m.) The weight space of a network is the set of all configurations.
- Initial configuration: the weights can be initialized randomly or using some sophisticated algorithm.

Learning algorithms
A learning rule is used for weight adaptation (the goal is to find a configuration in which the network computes a desired function).
- Supervised learning
  - The desired function is described using training examples, i.e. pairs of the form (input, output).
  - The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
- Unsupervised learning
  - The training set contains only inputs.
  - The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.).

Supervised learning – illustration
Classification in the plane using a single neuron:
- The training examples are of the form (point, value), where the value is 1 or 0 depending on whether the point belongs to class A or class B.
- The algorithm considers the examples one after another; whenever an incorrectly classified point is considered, the learning algorithm turns the separating line in the direction of the point (see the sketch below).
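The procedure just described is essentially the classical perceptron learning rule. A minimal sketch, assuming unit-step activation, a bias input x_0 = 1, labels in {0, 1}, and an invented toy dataset:

```python
def perceptron(examples, epochs=100, eps=0.1):
    """examples: list of (x, d) with the bias input x[0] = 1 and d in {0, 1}."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, d in examples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            # on a misclassified point, turn the line toward that point
            w = [wi + eps * (d - y) * xi for wi, xi in zip(w, x)]
    return w

# class A (label 1) lies to the right of class B (label 0) in the plane
data = [((1, 0.0, 0.2), 0), ((1, 0.3, 0.9), 0),
        ((1, 1.0, 0.1), 1), ((1, 0.9, 0.8), 1)]
print(perceptron(data))  # weights of a line separating A from B
```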
Summary – advantages of neural networks
- Massive parallelism: neurons can be evaluated in parallel.
- Learning: many sophisticated learning algorithms are used to "program" neural networks.
- Generalization and robustness: information is encoded in a distributed manner in the weights; "close" inputs typically get similar values.
- Graceful degradation: damage typically causes only a decrease in the precision of results.

Expressive power of neural networks

Formal neuron (with bias)
Recall: x_0 = 1, x_1, ..., x_n ∈ R are inputs; w_0, w_1, ..., w_n ∈ R are weights; the inner potential is ξ = w_0 + ∑_{i=1}^n w_i x_i; the output is y = σ(ξ).

Boolean functions
Activation function: the unit step function σ(ξ) = 1 if ξ ≥ 0, else 0. Then:
- y = AND(x_1, ..., x_n) is computed by a neuron with weights w_1 = ... = w_n = 1 and bias w_0 = −n.
- y = OR(x_1, ..., x_n) is computed by a neuron with weights w_1 = ... = w_n = 1 and bias w_0 = −1.
- y = NOT(x_1) is computed by a neuron with weight w_1 = −1 and bias w_0 = 0.

Theorem: Let σ be the unit step function. Two-layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}^n → {0, 1}.

Proof: Given a vector v = (v_1, ..., v_n) ∈ {0, 1}^n, consider a neuron N_v whose output is 1 iff its input is v:
- w_i = 1 if v_i = 1, and w_i = −1 if v_i = 0,
- w_0 = −∑_{i=1}^n v_i w_i.
Now connect the outputs of all neurons N_v satisfying F(v) = 1 to a neuron implementing OR (see the transcription below). □
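The proof is constructive, so it can be transcribed directly into code. A sketch building the two-layer network for an arbitrary F : {0,1}^n → {0,1} (the names are illustrative):

```python
from itertools import product

def step(xi):
    return 1 if xi >= 0 else 0

def boolean_net(F, n):
    """Two-layer network computing a given F: {0,1}^n -> {0,1}."""
    ones = [v for v in product((0, 1), repeat=n) if F(v) == 1]

    def net(x):
        hidden = []
        for v in ones:  # one neuron N_v per vector with F(v) = 1
            w = [1 if vi == 1 else -1 for vi in v]  # w_i = 1 iff v_i = 1
            w0 = -sum(v)                            # w_0 = -(number of ones in v)
            hidden.append(step(w0 + sum(wi * xi for wi, xi in zip(w, x))))
        return step(-1 + sum(hidden))               # output neuron: OR of all N_v
    return net

xor = boolean_net(lambda v: v[0] ^ v[1], 2)
print([xor(x) for x in product((0, 1), repeat=2)])  # [0, 1, 1, 0]
```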
Non-linear separation
Consider a three-layer network where each neuron has the unit step activation function. The network divides the input space into two subspaces according to the output (0 or 1):
- The first (hidden) layer divides the input space into half-spaces.
- The second layer may e.g. make intersections of the half-spaces ⇒ convex sets.
- The third layer may e.g. make unions of some of the convex sets.

Non-linear separation – illustration
Three-layer networks (with unit step activations) are capable of "approximating" any "reasonable" subset A of the input space R^k:
- Cover A with hypercubes (squares in 2D, cubes in 3D, ...).
- Each hypercube K can be separated using a two-layer network N_K (i.e. the function computed by N_K gives 1 for points in K and 0 for the rest); a concrete instance follows below.
- Finally, connect the outputs of the networks N_K satisfying K ∩ A ≠ ∅ to a neuron implementing OR.
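For a concrete instance of the hypercube step, a sketch of a two-layer network N_K that outputs 1 exactly on the square K = [0.25, 0.75]² (the bounds are arbitrary choices for illustration):

```python
def step(xi):
    return 1 if xi >= 0 else 0

def inside_square(x1, x2, lo=0.25, hi=0.75):
    """Two-layer network N_K for the square K = [lo, hi] x [lo, hi]."""
    h = [step(x1 - lo),   # half-plane x1 >= lo
         step(hi - x1),   # half-plane x1 <= hi
         step(x2 - lo),   # half-plane x2 >= lo
         step(hi - x2)]   # half-plane x2 <= hi
    return step(sum(h) - 4)  # AND neuron: 1 iff all four half-planes fire

print(inside_square(0.5, 0.5), inside_square(0.9, 0.5))  # 1 0
```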
Non-linear separation – sigmoid
Theorem (Cybenko 1989, informal version): Let σ be a continuous function which is sigmoidal, i.e. satisfies
σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞.
For every "reasonable" set A ⊆ [0, 1]^n there is a two-layer network, where each hidden neuron has the activation function σ (the output neurons are linear), that satisfies the following: for "most" vectors v ∈ [0, 1]^n we have that v ∈ A iff the network output is > 0 for the input v.
For the mathematically oriented:
- "reasonable" means Lebesgue measurable,
- "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given ε > 0.

Non-linear separation – practical illustration
ALVINN drives a car:
- The net has 30 × 32 = 960 inputs (the input space is thus R^960).
- The input values correspond to the shades of gray of pixels.
- The output neurons "classify" images of the road based on their "curvature".
Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

Function approximation – three-layer networks
Let σ be the logistic sigmoid, i.e. σ(ξ) = 1 / (1 + e^{−ξ}). For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a three-layer network computing a function F : [0, 1]^n → [0, 1] such that
- the output layer has a linear activation, i.e. the value of the output neuron is its inner potential ξ,
- the remaining neurons have the logistic sigmoid σ as their activation,
- for every v ∈ [0, 1]^n we have |F(v) − f(v)| < ε.
(Figure: the approximation is a weighted sum of localized "spikes", each spike built from sigmoid neurons and its 90-degree rotations.)

Function approximation – two-layer networks
Theorem (Cybenko 1989): Let σ be a continuous function which is sigmoidal, i.e. increasing with
σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞.
For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a function F : [0, 1]^n → [0, 1] computed by a two-layer network, where each hidden neuron has the activation function σ (the output neurons are linear), such that
|f(v) − F(v)| < ε for every v ∈ [0, 1]^n.

Neural networks and computability
Consider recurrent networks (i.e. networks containing cycles):
- with real weights (in general);
- with one input neuron and one output neuron (the network computes a function F : A → R, where A ⊆ R contains all inputs on which the network stops);
- with the parallel activity rule (the output values of all neurons are recomputed in every step);
- with the activation function σ(ξ) = 1 if ξ ≥ 1; ξ if 0 ≤ ξ ≤ 1; 0 if ξ < 0.

We encode words ω ∈ {0, 1}^+ into numbers as follows:
δ(ω) = ∑_{i=1}^{|ω|} ω(i) / 2^i + 1 / 2^{|ω|+1}
E.g. ω = 11001 gives δ(ω) = 1/2 + 1/2^2 + 1/2^5 + 1/2^6 (= 0.110011 in binary form).
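The encoding δ is a simple finite sum. A sketch, assuming the word is given as a Python string of 0s and 1s:

```python
def delta(word):
    """delta(w) = sum_{i=1}^{|w|} w(i) / 2^i  +  1 / 2^(|w|+1)."""
    value = sum(int(bit) / 2 ** (i + 1) for i, bit in enumerate(word))
    return value + 1 / 2 ** (len(word) + 1)

print(delta("11001"))  # 0.796875, i.e. 0.110011 in binary
```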
A network recognizes a language L ⊆ {0, 1}^+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.

Recurrent networks with rational weights are equivalent to Turing machines:
- For every recursively enumerable language L ⊆ {0, 1}^+ there is a recurrent network with rational weights and fewer than 1000 neurons which recognizes L.
- The halting problem is undecidable for networks with at least 25 neurons and rational weights.
- There is a "universal" network (the equivalent of the universal Turing machine).

Recurrent networks with real weights are super-Turing powerful:
- For every language L ⊆ {0, 1}^+ there is a recurrent network with fewer than 1000 neurons which recognizes L.

Summary of theoretical results
Neural networks are very strong from the point of view of theory:
- All Boolean functions can be expressed using two-layer networks.
- Two-layer networks may approximate any continuous function.
- Recurrent networks are at least as strong as Turing machines.
These results are purely theoretical! "Theoretical" networks are extremely huge, and it is very difficult to handcraft them even for the simplest problems. From the practical point of view, the most important advantages of neural networks are learning, generalization and robustness.
Neural networks vs classical computers
- Data: implicitly in the weights vs. explicitly.
- Computation: naturally parallel vs. sequential, localized.
- Robustness: robust w.r.t. input corruption and damage vs. changing one bit may completely crash the computation.
- Precision: imprecise, the network recalls a training example "similar" to the input, vs. (typically) precise.
- Programming: learning vs. manual.

History & implementations

History of neurocomputers
- 1951: SNARC (Minsky et al.)
  - the first implementation of a neural network,
  - a rat strives to exit a maze,
  - 40 artificial neurons (300 vacuum tubes, engines, etc.).
- 1957: Mark I Perceptron (Rosenblatt et al.) – the first successful network for image recognition
  - a single-layer network,
  - images represented by 20 × 20 photocells,
  - the intensity of pixels was treated as the input to a perceptron (basically the formal neuron), which recognized figures,
  - weights were implemented using potentiometers, each set by its own engine,
  - it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability.
- 1960: ADALINE (Widrow & Hoff)
  - a single-layer neural network,
  - weights stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance;
  - Widrow founded the Memistor Corporation, which sold implementations of neural networks.
- 1960-66: several companies concerned with neural networks were founded.
- 1967-82: dead still after the publication of the book Perceptrons by Minsky & Papert (1969).
- 1983-end of the 90s: revival of neural networks
  - many attempts at hardware implementations: application-specific chips (ASIC), programmable hardware (FPGA);
  - hardware implementations were typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production, etc.).
- end of the 90s-ca. 2005: neural networks suppressed by other machine learning methods (e.g. support vector machines, SVM).
- 2006-now: the boom of neural networks!
  - deep networks – often better than any other method,
  - GPU implementations, plus some specialized hardware implementations (Google's TPU).

Some highlights
- Breakthrough in image recognition: the accuracy of image recognition improved by an order of magnitude in 5 years.
- Breakthrough in game playing: superhuman results in Go and chess almost without any human intervention; master level in Starcraft, poker, etc.
- Breakthrough in machine translation: switching to deep learning produced a 60% increase in translation accuracy compared to the phrase-based approach previously used in Google Translate (in human evaluation).
- Breakthrough in speech processing.

History in waves
Figure: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).

Current hardware – What do we face?
Increasing dataset size ...
... and thus increasing size of neural networks, e.g. (numbering from the figure):
2. ADALINE
4. early back-propagation network (Rumelhart et al., 1986)
8. image recognition: LeNet-5 (LeCun et al., 1998)
10. dimensionality reduction: deep belief network (Hinton et al., 2006) – here the third "wave" of neural networks started
15. digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. image recognition (AlexNet): multi-GPU convolutional network (Krizhevsky et al., 2012)
20. image recognition: GoogLeNet (Szegedy et al., 2014)

... and as a reward we get this:
Figure: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014) and He et al. (2015).

Current hardware
- In 2012, Google trained a large network with 1.7 billion weights and 9 layers.
  - The task was image recognition (10 million YouTube video frames).
  - The hardware comprised a network of 1000 computers (16,000 cores); the computation took three days.
- In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI.
  - Able to train networks with 1 billion parameters on just 3 machines in a couple of days.
  - Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.

Current hardware – NVIDIA DGX-1 Station
- 8× GPU (Tesla V100), 1000 TFLOPS
- GPU memory: 256 GB total
- NVIDIA Tensor Cores: 5,120
- NVIDIA CUDA Cores: 40,960
- System memory: 512 GB
- Network: dual 10 Gb LAN
- NVIDIA Deep Learning SDK

Deep learning in clouds
Several companies offer cloud services for deep learning: Amazon Web Services, Google Cloud, Deep Cognition, ...
- Advantages: you do not have to care (too much) about technical problems; you do not have to buy and optimize high-end hardware/software, networks, etc.; scaling and virtually limitless storage.
- Disadvantages: you do not have full control; performance can vary, and there may be connectivity problems; you have to pay for the services; privacy issues.

Current software
- TensorFlow (Google)
  - an open-source software library for numerical computation using data-flow graphs,
  - allows implementation of most current neural networks,
  - allows computation on multiple devices (CPUs, GPUs, ...),
  - Python API,
  - Keras: a part of TensorFlow that allows easy description of most modern neural networks.
- PyTorch (Facebook)
  - similar to TensorFlow,
  - object-oriented.
- Theano (dead): the "academic" granddaddy of deep-learning frameworks, written in Python. It strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow).
- There are others: Caffe, Deeplearning4j, ...
(The slides show code examples in Keras, the Keras functional API, TensorFlow and PyTorch.)

Other software implementations
Most "mathematical" software packages contain some support of neural networks: MATLAB, R, STATISTICA, Weka, ... The implementations are typically not on par with the dedicated deep-learning libraries mentioned above.
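The Keras, TensorFlow and PyTorch slides are code screenshots; for flavor, here is a minimal Keras model in the same spirit. The 784-100-10 architecture and the 28 × 28 input shape are my illustrative choices, not taken from the lecture:

```python
import tensorflow as tf

# a small multilayer perceptron for ten-class image classification
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5)  # x_train, y_train loaded elsewhere
```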
Training linear models

Linear regression (ADALINE)
Architecture: a single neuron with weights w = (w_0, w_1, ..., w_n) and input x = (x_0, x_1, ..., x_n), where x_0 = 1.
Activity:
- inner potential: ξ = w_0 + ∑_{i=1}^n w_i x_i = ∑_{i=0}^n w_i x_i = w · x,
- activation function: σ(ξ) = ξ,
- network function: y[w](x) = σ(ξ) = w · x.

Learning: Given a training dataset
T = { (x_1, d_1), (x_2, d_2), ..., (x_p, d_p) }
where x_k = (x_{k0}, x_{k1}, ..., x_{kn}) ∈ R^{n+1}, x_{k0} = 1, is the k-th input and d_k ∈ R is the expected output.
Intuition: the network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set.

Oaks in Wisconsin (figure)

Error function:
E(w) = (1/2) ∑_{k=1}^p (w · x_k − d_k)^2 = (1/2) ∑_{k=1}^p (∑_{i=0}^n w_i x_{ki} − d_k)^2
The goal is to find w which minimizes E(w).

Gradient of the error function
Consider the gradient of the error function:
∇E(w) = (∂E/∂w_0 (w), ..., ∂E/∂w_n (w))
Intuition: ∇E(w) is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors x_k are just parameters of the function E and are thus fixed!
Fact: If ∇E(w) = 0 = (0, ..., 0), then w is a global minimum of E. For ADALINE, the error function E(w) is a convex paraboloid and thus has a unique global minimum.
(Caution: the gradient illustration on the slides just illustrates the notion of a gradient; it is not the convex paraboloid E(w)!)

First, consider n = 1. Then the model is y = w_0 + w_1 · x. Consider the concrete training set
T = { ((1, 2), 1), ((1, 3), 2), ((1, 4), 5) } = { ((x_{10}, x_{11}), d_1), ((x_{20}, x_{21}), d_2), ((x_{30}, x_{31}), d_3) }
Then
E(w_0, w_1) = (1/2) [ (w_0 + w_1·2 − 1)^2 + (w_0 + w_1·3 − 2)^2 + (w_0 + w_1·4 − 5)^2 ]
∂E/∂w_0 = (w_0 + w_1·2 − 1)·1 + (w_0 + w_1·3 − 2)·1 + (w_0 + w_1·4 − 5)·1
∂E/∂w_1 = (w_0 + w_1·2 − 1)·2 + (w_0 + w_1·3 − 2)·3 + (w_0 + w_1·4 − 5)·4
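Both partial derivatives can be checked numerically against finite differences. A small sketch at the (arbitrarily chosen) point (w_0, w_1) = (0, 1):

```python
def E(w0, w1, data=((2, 1), (3, 2), (4, 5))):
    """E(w0, w1) for the model y = w0 + w1*x on the training set above."""
    return 0.5 * sum((w0 + w1 * x - d) ** 2 for x, d in data)

def grad(w0, w1, data=((2, 1), (3, 2), (4, 5))):
    dE_dw0 = sum(w0 + w1 * x - d for x, d in data)
    dE_dw1 = sum((w0 + w1 * x - d) * x for x, d in data)
    return dE_dw0, dE_dw1

h = 1e-6
print(grad(0.0, 1.0))  # closed form: (1.0, 1.0)
print((E(h, 1.0) - E(0.0, 1.0)) / h,
      (E(0.0, 1.0 + h) - E(0.0, 1.0)) / h)  # finite differences agree
```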
Gradient of the error function – in general
For every weight w_j:
∂E/∂w_j (w) = (1/2) ∑_{k=1}^p ∂/∂w_j (∑_{i=0}^n w_i x_{ki} − d_k)^2
            = (1/2) ∑_{k=1}^p 2 (∑_{i=0}^n w_i x_{ki} − d_k) · ∂/∂w_j (∑_{i=0}^n w_i x_{ki} − d_k)
            = ∑_{k=1}^p (w · x_k − d_k) · x_{kj}
(since ∂/∂w_j (w_i x_{ki}) is x_{kj} for i = j and 0 otherwise, and ∂/∂w_j (d_k) = 0). Thus
∇E(w) = (∂E/∂w_0 (w), ..., ∂E/∂w_n (w)) = ∑_{k=1}^p (w · x_k − d_k) · x_k

Linear regression – learning
Batch algorithm (gradient descent):
Idea: in every step, "move" the weights in the direction opposite to the gradient.
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), ...:
- The weights in w^(0) are randomly initialized to values close to 0.
- In step t+1, the weights w^(t+1) are computed as follows:
  w^(t+1) = w^(t) − ε · ∇E(w^(t)) = w^(t) − ε · ∑_{k=1}^p (w^(t) · x_k − d_k) · x_k
  where 0 < ε ≤ 1 is a learning rate.

Proposition: For sufficiently small ε > 0 the sequence w^(0), w^(1), w^(2), ... converges (componentwise) to the global minimum of E (i.e. to the vector w satisfying ∇E(w) = 0).
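The batch algorithm in code; a minimal sketch run on the three-point training set from the example (the learning rate and the step count are arbitrary choices):

```python
def gradient_descent(data, eps=0.01, steps=10000):
    """Batch gradient descent; data: list of (x, d), x = (x0, ..., xn), x0 = 1."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for x, d in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - d  # w . x_k - d_k
            for i in range(n):
                grad[i] += err * x[i]
        w = [wi - eps * gi for wi, gi in zip(w, grad)]
    return w

data = [((1, 2), 1), ((1, 3), 2), ((1, 4), 5)]
print(gradient_descent(data))  # approaches the minimum w = (-10/3, 2)
```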
Linear regression – animation
(a sequence of animation frames showing the fitted line and the error changing during gradient descent)

ADALINE – learning
Online algorithm (Delta-rule, Widrow-Hoff rule):
  weights in w^{(0)} initialized randomly close to 0
  in the step t+1, weights w^{(t+1)} are computed as follows:
    w^{(t+1)} = w^{(t)} - \varepsilon(t) \cdot (w^{(t)} \cdot x_k - d_k) \cdot x_k
  Here k = (t mod p) + 1 and 0 < \varepsilon(t) \le 1 is a learning rate in the step t+1.
Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example.
Theorem (Widrow & Hoff): If \varepsilon(t) = \frac{1}{t}, then w^{(0)}, w^{(1)}, w^{(2)}, ... converges to the global minimum of E.
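The online rule differs from the batch one only in that each step uses a single, cyclically chosen example and a decaying rate. A minimal C sketch under the \varepsilon(t) = 1/t schedule of the theorem; the step count is an illustrative choice, and convergence under 1/t is slow.

#include <stdio.h>

#define P 3
#define N 2

int main(void) {
    /* the same dataset as in the batch sketch above */
    double x[P][N] = {{1, 2}, {1, 3}, {1, 4}};
    double d[P] = {1, 2, 5};
    double w[N] = {0.0, 0.0};

    for (int t = 1; t <= 100000; t++) {
        int k = (t - 1) % P;          /* current example, k = (t mod p) + 1 */
        double eps = 1.0 / t;         /* Widrow-Hoff learning rate */
        double err = w[0]*x[k][0] + w[1]*x[k][1] - d[k];
        for (int i = 0; i < N; i++) w[i] -= eps * err * x[k][i];
    }
    printf("w0 = %f, w1 = %f (slowly approaching -10/3 and 2)\n", w[0], w[1]);
    return 0;
}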
What about classification?
Binary classification: Desired outputs 0 and 1. Ideally, we would like to capture the probability distribution of the classes. The linear function w \cdot x does not capture probability well (it is not a probability at all) ... the logistic sigmoid \frac{1}{1 + e^{-(w \cdot x)}} is much better!

Logistic regression
(diagram: inputs x_1, \dots, x_n, the formal unit input x_0 = 1, weights w_0, \dots, w_n, one output y)
w = (w_0, w_1, \dots, w_n) and x = (x_0, x_1, \dots, x_n) where x_0 = 1.
Activity:
  inner potential: \xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = w \cdot x
  activation function: \sigma(\xi) = \frac{1}{1 + e^{-\xi}}
  network function: y[w](x) = \sigma(\xi) = \frac{1}{1 + e^{-(w \cdot x)}}
Intuition: The output y is now the probability of the class 1 given the input x.

But what is the meaning of the sigmoid?
The model gives a probability y of the class 1 given an input x. But why do we model such a probability using 1/(1 + e^{-w \cdot x})?
What about the odds of the class 1? odds(y) = \frac{y}{1-y} ... resembles an exponential function.
What about the log odds (aka logit) of the class 1? logit(y) = \log(y / (1-y)) ... looks almost linear.
Assume that y is the probability of the class 1. Put
  \log(y / (1-y)) = w \cdot x
Then
  \log((1-y)/y) = -w \cdot x
and
  (1-y)/y = e^{-w \cdot x}
and
  y = \frac{1}{1 + e^{-w \cdot x}}
That is, if we model the log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function.

Logistic regression
Learning: Given a training dataset
  T = { (x_1, d_1), (x_2, d_2), \dots, (x_p, d_p) }
Here x_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in R^{n+1}, x_{k0} = 1, is the k-th input, and d_k \in \{0, 1\} is the expected output.
What error function? (Binary) cross-entropy:
  E(w) = \sum_{k=1}^{p} -\left( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \right)
What?!?
Log likelihood is your friend!
Let's have a "coin" (sides 0 and 1). The probability of 1 is p and is unknown!
You have tossed the coin 5 times and got a training dataset:
  T = {1, 1, 0, 0, 1} = {d_1, \dots, d_5}
(Consider this to be a very special case where the input dimension is 0.)
What is the best model y of p based on the data?
Answer: The one that generates the data with maximum probability!
Assume that the data was generated by independent trials. Then the probability of getting exactly T is
  L = y \cdot y \cdot (1-y) \cdot (1-y) \cdot y
How to maximize this w.r.t. y? Maximize
  LL = \log(L) = \log(y) + \log(y) + \log(1-y) + \log(1-y) + \log(y)
But then
  -LL = -1 \cdot \log(y) - 1 \cdot \log(y) - (1-0) \cdot \log(1-y) - (1-0) \cdot \log(1-y) - 1 \cdot \log(y)
and thus -LL is exactly the cross-entropy for this dataset.

Let the coin depend on the input
Consider our model:
  y = \frac{1}{1 + e^{-(w \cdot x)}}
The training dataset is now standard:
  T = { (x_1, d_1), (x_2, d_2), \dots, (x_p, d_p) }
Here x_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in R^{n+1}, x_{k0} = 1, is the k-th input, and d_k \in \{0, 1\} is the expected output.
The likelihood:
  L = \prod_{k=1}^{p} y_k^{d_k} \cdot (1 - y_k)^{(1 - d_k)}
and
  LL = \log(L) = \sum_{k=1}^{p} \left( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \right)
and thus -LL = the cross-entropy. Minimizing the cross-entropy maximizes the log-likelihood (and vice versa).
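A C sketch of gradient descent on the cross-entropy for logistic regression follows. The gradient \sum_k (y_k - d_k) x_k used here is the standard result of differentiating the cross-entropy through the sigmoid; that derivation is not spelled out on these slides, and the toy dataset, learning rate and step count are illustrative.

#include <stdio.h>
#include <math.h>

#define P 4
#define N 2

int main(void) {
    double x[P][N] = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};  /* x_{k0} = 1 */
    double d[P] = {0, 0, 1, 1};                           /* class labels */
    double w[N] = {0, 0};
    double eps = 0.1;

    for (int t = 0; t < 10000; t++) {
        double grad[N] = {0, 0};
        for (int k = 0; k < P; k++) {
            double y = 1.0 / (1.0 + exp(-(w[0]*x[k][0] + w[1]*x[k][1])));
            /* gradient of the cross-entropy: (y_k - d_k) x_k */
            for (int i = 0; i < N; i++) grad[i] += (y - d[k]) * x[k][i];
        }
        for (int i = 0; i < N; i++) w[i] -= eps * grad[i];
    }
    for (int k = 0; k < P; k++) {
        double y = 1.0 / (1.0 + exp(-(w[0]*x[k][0] + w[1]*x[k][1])));
        printf("P(class 1 | x = %g) = %.3f\n", x[k][1], y);
    }
    return 0;
}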
Normal distribution
Distribution of continuous random variables. Density (one-dimensional, that is, over R):
  p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) =: N[\mu, \sigma^2](x)
\mu is the expected value (the mean), \sigma^2 is the variance.

Maximum likelihood vs least squares (dim 1)
Fix a training set D = { (x_1, d_1), (x_2, d_2), \dots, (x_p, d_p) }.
Assume that each d_k has been generated randomly by
  d_k = (w_0 + w_1 \cdot x_k) + \epsilon_k
where
  w_0, w_1 are unknown numbers,
  \epsilon_k are normally distributed with mean 0 and an unknown variance \sigma^2.
Assume that \epsilon_1, \dots, \epsilon_p were generated independently.
Denote by p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2) the probability density according to which the values d_1, \dots, d_p were generated, assuming fixed w_0, w_1, \sigma^2, x_1, \dots, x_p.
The independence and normality imply
  p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^{p} N[w_0 + w_1 x_k, \sigma^2](d_k) = \prod_{k=1}^{p} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(d_k - w_0 - w_1 x_k)^2}{2\sigma^2} \right)

Maximum likelihood vs least squares
Our goal is to find (w_0, w_1) that maximizes the likelihood that the training set D with fixed values d_1, \dots, d_p has been generated:
  L(w_0, w_1, \sigma^2) := p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2)
Theorem: (w_0, w_1) maximizes L(w_0, w_1, \sigma^2) for an arbitrary \sigma^2 iff (w_0, w_1) minimizes the squared error E(w_0, w_1) = \sum_{k=1}^{p} (d_k - w_0 - w_1 x_k)^2.
Note that the maximizing/minimizing (w_0, w_1) does not depend on \sigma^2. Moreover, the maximizing \sigma^2 satisfies
  \sigma^2 = \frac{1}{p} \sum_{k=1}^{p} (d_k - w_0 - w_1 \cdot x_k)^2
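Both the theorem and the remark fall out of one computation with the negative log-likelihood; the slides state them without proof, so here is a sketch:

-\ln L(w_0, w_1, \sigma^2)
  = p \ln\left( \sigma \sqrt{2\pi} \right)
    + \frac{1}{2\sigma^2} \sum_{k=1}^{p} (d_k - w_0 - w_1 x_k)^2

For every fixed \sigma^2, the first term does not depend on (w_0, w_1) and the second is a positive multiple of the squared error E(w_0, w_1); hence maximizing L over (w_0, w_1) is exactly minimizing E. Differentiating with respect to \sigma^2 and setting the derivative to zero then yields \sigma^2 = \frac{1}{p} \sum_{k=1}^{p} (d_k - w_0 - w_1 x_k)^2, the value quoted above.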
MLP training – theory

Architecture – multilayer perceptron (MLP)
(diagram: input layer x_1, x_2; hidden layers; output layer y_1, y_2)
  Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
  Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
  Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
  The architecture of an MLP is typically described by the numbers of neurons in individual layers (e.g. 2-4-3-2).

MLP – architecture
Notation: Denote
  X a set of input neurons
  Y a set of output neurons
  Z a set of all neurons (X, Y \subseteq Z)
  individual neurons are denoted by indices i, j etc.
  \xi_j is the inner potential of the neuron j after the computation stops
  y_j is the output of the neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input)
  w_{ji} is the weight of the connection from i to j (in particular, w_{j0} is the weight of the connection from the formal unit input, i.e. w_{j0} = -b_j where b_j is the bias of the neuron j)
  j_{\leftarrow} is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
  j_{\rightarrow} is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)

MLP – activity
Activity:
  inner potential of neuron j: \xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i
  activation function \sigma_j for neuron j (arbitrary differentiable) [e.g. the logistic sigmoid \sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}]
  state of a non-input neuron j \in Z \setminus X after the computation stops: y_j = \sigma_j(\xi_j)
  (y_j depends on the configuration w and the input x, so we sometimes write y_j(w, x))
The network computes a function from R^{|X|} to R^{|Y|}. Layer-wise computation: First, all input neurons are assigned the values of the input. In the \ell-th step, all neurons of the \ell-th layer are evaluated.

MLP – learning
Learning: Given a training set T of the form
  { (x_k, d_k) | k = 1, \dots, p }
Here, every x_k \in R^{|X|} is an input vector and every d_k \in R^{|Y|} is the desired network output. For every j \in Y, denote by d_{kj} the desired output of the neuron j for a given network input x_k (the vector d_k can be written as (d_{kj})_{j \in Y}).
Error function:
  E(w) = \sum_{k=1}^{p} E_k(w)   where   E_k(w) = \frac{1}{2} \sum_{j \in Y} \left( y_j(w, x_k) - d_{kj} \right)^2

MLP – learning algorithm
Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors w^{(0)}, w^{(1)}, w^{(2)}, ...
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
    w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}
    where \Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(w^{(t)}) is the weight update of w_{ji} in step t+1 and 0 < \varepsilon(t) \le 1 is a learning rate in step t+1.
Note that \frac{\partial E}{\partial w_{ji}}(w^{(t)}) is a component of the gradient \nabla E, i.e. the weight update can be written as w^{(t+1)} = w^{(t)} - \varepsilon(t) \cdot \nabla E(w^{(t)}).
MLP – error function gradient
For every w_{ji} we have
  \frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}
where for every k = 1, \dots, p it holds that
  \frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i
and for every j \in Z \setminus X we get
  \frac{\partial E_k}{\partial y_j} = y_j - d_{kj}   for j \in Y
  \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}   for j \in Z \setminus (Y \cup X)
(Here all y_j are in fact y_j(w, x_k).)

MLP – error function gradient
If \sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}} for all j \in Z, then
  \sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)
and thus for all j \in Z \setminus X:
  \frac{\partial E_k}{\partial y_j} = y_j - d_{kj}   for j \in Y
  \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj}   for j \in Z \setminus (Y \cup X)
If \sigma_j(\xi) = a \cdot \tanh(b \cdot \xi) for all j \in Z, then
  \sigma_j'(\xi_j) = \frac{b}{a} (a - y_j)(a + y_j)

MLP – computing the gradient
Compute \frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}} as follows:
Initialize E_{ji} := 0. (By the end of the computation: E_{ji} = \frac{\partial E}{\partial w_{ji}}.)
For every k = 1, \dots, p do:
1. forward pass: compute y_j = y_j(w, x_k) for all j \in Z
2. backward pass: compute \frac{\partial E_k}{\partial y_j} for all j \in Z using backpropagation (see below)
3. compute \frac{\partial E_k}{\partial w_{ji}} for all w_{ji} using \frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i
4. E_{ji} := E_{ji} + \frac{\partial E_k}{\partial w_{ji}}
The resulting E_{ji} equals \frac{\partial E}{\partial w_{ji}}.

MLP – backpropagation
Compute \frac{\partial E_k}{\partial y_j} for all j \in Z as follows:
  if j \in Y, then \frac{\partial E_k}{\partial y_j} = y_j - d_{kj}
  if j \in Z \setminus (Y \cup X), then assuming that j is in the \ell-th layer, and assuming that \frac{\partial E_k}{\partial y_r} has already been computed for all neurons of the (\ell+1)-st layer, compute
    \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}
  (This works because all neurons r \in j_{\rightarrow} belong to the (\ell+1)-st layer.)
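Putting the forward pass, backpropagation and the weight updates together: a minimal C sketch of the batch algorithm for a 2-4-1 MLP with logistic sigmoids (\lambda_j = 1) and squared error, trained on XOR as in the illustration that follows. The architecture, seed, learning rate and epoch count are illustrative; with an unlucky seed the descent may end in a local minimum (see the issues discussed later), in which case a different seed helps.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NIN 2
#define NH 4
#define P 4

static double sigmoid(double xi) { return 1.0 / (1.0 + exp(-xi)); }
static double rnd(void) { return ((double)rand() / RAND_MAX - 0.5) * 0.2; }

int main(void) {
    double x[P][NIN] = {{0,0},{0,1},{1,0},{1,1}};
    double d[P] = {0, 1, 1, 0};
    double wh[NH][NIN+1], wo[NH+1];      /* index 0 = formal unit input */
    srand(42);
    for (int j = 0; j < NH; j++)
        for (int i = 0; i <= NIN; i++) wh[j][i] = rnd();
    for (int j = 0; j <= NH; j++) wo[j] = rnd();

    double eps = 0.5;
    for (int t = 0; t < 20000; t++) {
        double gh[NH][NIN+1] = {{0}}, go[NH+1] = {0};
        for (int k = 0; k < P; k++) {
            /* 1. forward pass */
            double h[NH], y, xi;
            for (int j = 0; j < NH; j++) {
                xi = wh[j][0];
                for (int i = 0; i < NIN; i++) xi += wh[j][i+1] * x[k][i];
                h[j] = sigmoid(xi);
            }
            xi = wo[0];
            for (int j = 0; j < NH; j++) xi += wo[j+1] * h[j];
            y = sigmoid(xi);
            /* 2.+3. backward pass: dE_k/dy = y - d_k, sigma' = y(1 - y) */
            double dy = (y - d[k]) * y * (1.0 - y);
            go[0] += dy;
            for (int j = 0; j < NH; j++) go[j+1] += dy * h[j];
            for (int j = 0; j < NH; j++) {
                double dh = dy * wo[j+1] * h[j] * (1.0 - h[j]);
                gh[j][0] += dh;
                for (int i = 0; i < NIN; i++) gh[j][i+1] += dh * x[k][i];
            }
        }
        /* 4. batch weight update */
        for (int j = 0; j < NH; j++)
            for (int i = 0; i <= NIN; i++) wh[j][i] -= eps * gh[j][i];
        for (int j = 0; j <= NH; j++) wo[j] -= eps * go[j];
    }
    for (int k = 0; k < P; k++) {        /* report the learned mapping */
        double h[NH], xi, y;
        for (int j = 0; j < NH; j++) {
            xi = wh[j][0];
            for (int i = 0; i < NIN; i++) xi += wh[j][i+1] * x[k][i];
            h[j] = sigmoid(xi);
        }
        xi = wo[0];
        for (int j = 0; j < NH; j++) xi += wo[j+1] * h[j];
        y = sigmoid(xi);
        printf("%g XOR %g -> %.3f\n", x[k][0], x[k][1], y);
    }
    return 0;
}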
Complexity of the batch algorithm
Computation of \nabla E(w^{(t-1)}) stops in time linear in the size of the network multiplied by the size of the training set (assuming unit cost of operations, including computation of \sigma_r'(\xi_r) for a given \xi_r).
Proof sketch: The algorithm does the following p times:
1. forward pass, i.e. computes y_j(w, x_k)
2. backpropagation, i.e. computes \frac{\partial E_k}{\partial y_j}
3. computes \frac{\partial E_k}{\partial w_{ji}} and adds it to E_{ji} (a constant-time operation per weight in the unit-cost framework)
Steps 1-3 take time linear in the size of the network.
Note that the speed of convergence of the gradient descent cannot be estimated ...

Illustration of the gradient descent – XOR
(figure; source: Pattern Classification (2nd edition); Richard O. Duda, Peter E. Hart, David G. Stork)

MLP – learning algorithm
Online algorithm:
The algorithm computes a sequence of weight vectors w^{(0)}, w^{(1)}, w^{(2)}, ...
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
    w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}
    where \Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(w^{(t)}) is the weight update of w_{ji} in the step t+1 and 0 < \varepsilon(t) \le 1 is the learning rate in the step t+1.
There are other variants determined by the selection of the training examples used for the error computation (more on this later).

SGD
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
    Choose (randomly) a set of training examples T \subseteq \{1, \dots, p\}
    Compute w^{(t+1)} = w^{(t)} + \Delta w^{(t)} where \Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)})
    0 < \varepsilon(t) \le 1 is a learning rate in step t+1
    \nabla E_k(w^{(t)}) is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all the data and then choosing minibatches sequentially (a sketch of this bookkeeping follows below).
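A C sketch of that shuffling bookkeeping (indices only; p, the minibatch size and the epoch count are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define P 10
#define BATCH 3

static void shuffle(int *idx, int n) {
    for (int i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
    }
}

int main(void) {
    int idx[P];
    for (int i = 0; i < P; i++) idx[i] = i;
    srand(1);
    for (int epoch = 0; epoch < 2; epoch++) {
        shuffle(idx, P);                     /* one shuffle per epoch */
        for (int start = 0; start < P; start += BATCH) {
            printf("minibatch:");
            for (int i = start; i < start + BATCH && i < P; i++)
                printf(" %d", idx[i]);       /* here: accumulate grad E_k */
            printf("\n");
        }
    }
    return 0;
}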
MLP training – practical issues

MLP – mse gradient
For the mean squared error, for every w_{ji} we have
  \frac{\partial E}{\partial w_{ji}} = \frac{1}{p} \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}
where for every k = 1, \dots, p it holds that
  \frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i
and for every j \in Z \setminus X we get (for the squared error)
  \frac{\partial E_k}{\partial y_j} = y_j - d_{kj}   for j \in Y
  \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}   for j \in Z \setminus (Y \cup X)
(Here all y_j are in fact y_j(w, x_k).)

(Some) error functions
  squared error: E(w) = \sum_{k=1}^{p} E_k(w) where E_k(w) = \frac{1}{2} \sum_{j \in Y} (y_j(w, x_k) - d_{kj})^2
  mean squared error (mse): E(w) = \frac{1}{p} \sum_{k=1}^{p} E_k(w)
  (categorical) cross-entropy: E(w) = -\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} d_{kj} \ln(y_j)

Practical issues of gradient descent
Training efficiency:
  What size of a minibatch?
  How to choose the learning rate \varepsilon(t) and control SGD?
  How to pre-process the inputs?
  How to initialize weights?
  How to choose the desired output values of the network?
Quality of the resulting model:
  When to stop training?
  Regularization techniques.
  How large a network?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.

Issues in gradient descent
Small networks: Lots of local minima where the descent gets stuck.
  The model identifiability problem: Swapping incoming weights of neurons i and j leaves the same network topology – weight-space symmetry.
  Recent studies show that for sufficiently large networks all local minima have low values of the error function.
Saddle points: One can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima.

Issues in gradient descent – too slow descent
Flat regions: e.g. if the inner potentials are too large (in absolute value), then their derivative is extremely small.

Issues in gradient descent – too fast descent
Steep cliffs: the gradient is extremely large, and the descent skips important weight vectors.

Issues in gradient descent – local vs global structure
What if we initialize on the left?
(figure: a local valley separated from the global minimum)

Gradient descent in large networks
Theorem. Assume (roughly):
  activation functions: "smooth" ReLU (softplus) \sigma(z) = \log(1 + \exp(z)); in general: smooth, non-polynomial, analytic, Lipschitz,
  inputs x_k of Euclidean norm equal to 1,
  desired values d_k satisfying |d_k| \in O(1),
  the number of hidden neurons per layer sufficiently large (polynomial in certain numerical characteristics of the inputs roughly measuring their similarity, and exponential in the depth of the network),
  the learning rate constant and sufficiently small.
Then the gradient descent converges (with high probability) to a global minimum with zero error at a linear rate.
Later we get to a special type of networks called ResNet where the above result demands only polynomially many neurons per layer (w.r.t. depth).

Issues in computing the gradient
Vanishing and exploding gradients:
  \frac{\partial E_k}{\partial y_j} = y_j - d_{kj}   for j \in Y
  \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}   for j \in Z \setminus (Y \cup X)
Inexact gradient computation: The minibatch gradient is only an estimate of the true gradient. Its standard error scales (roughly) as \sigma / \sqrt{m}, where m is the size of the minibatch and \sigma is the standard deviation of the gradient for a single training example. (E.g. a minibatch of size 10 000 means 100 times more computation than size 100 but gives only 10 times smaller standard error.)
Minibatch size
  Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  Multicore architectures are usually underutilized by extremely small batches.
  If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  It is common (especially when using GPUs) for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", Keskar et al., ICLR'17)

Momentum
Issue in the gradient descent: \nabla E(w^{(t)}) constantly changes direction (but the error steadily decreases).
Solution: In every step, add the change made in the previous step (weighted by a factor \alpha):
  \Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)}) + \alpha \cdot \Delta w^{(t-1)}
where 0 < \alpha < 1.

Momentum – illustration
(figure: descent with and without momentum)

SGD with momentum
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
    Choose (randomly) a set of training examples T \subseteq \{1, \dots, p\}
    Compute w^{(t+1)} = w^{(t)} + \Delta w^{(t)} where \Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)}) + \alpha \cdot \Delta w^{(t-1)}
    0 < \varepsilon(t) \le 1 is a learning rate in step t+1
    0 < \alpha < 1 measures the "influence" of the momentum
    \nabla E_k(w^{(t)}) is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all the data and then choosing minibatches sequentially.
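A minimal C sketch of the momentum update on a single weight; the stand-in gradient (of (w - 3)^2), \varepsilon and \alpha are illustrative choices.

#include <stdio.h>

int main(void) {
    double w = 0.0, delta = 0.0;
    double eps = 0.1, alpha = 0.9;
    for (int t = 0; t < 100; t++) {
        double grad = 2.0 * (w - 3.0);        /* stand-in gradient */
        delta = -eps * grad + alpha * delta;  /* momentum update rule */
        w += delta;
    }
    printf("w = %f (minimum at 3)\n", w);
    return 0;
}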
Learning rate

Search for the learning rate
  Use settings from a successful solution of a similar problem as a baseline.
  Search for the learning rate using learning monitoring: search through values from small (e.g. 0.001) to large (e.g. 0.1), possibly multiplying by 2; train for several epochs and observe the learning curves (see cross-validation later).

Adaptive learning rate
Power scheduling: Set \varepsilon(t) = \varepsilon_0 / (1 + t/s) where \varepsilon_0 is an initial learning rate and s a number of steps (after s steps the learning rate is \varepsilon_0/2, after 2s it is \varepsilon_0/3, etc.).
Exponential scheduling: Set \varepsilon(t) = \varepsilon_0 \cdot 0.1^{t/s}. (The learning rate decays faster than in power scheduling.)
Piecewise constant scheduling: A constant learning rate for a number of steps/epochs, then a smaller learning rate, and so on.
1cycle scheduling: Start by increasing the learning rate from \varepsilon_0 linearly to \varepsilon_1 (approx. \varepsilon_1 = 10 \varepsilon_0) halfway through training. Then decrease from \varepsilon_1 linearly back to \varepsilon_0. Finish by dropping the learning rate by several orders of magnitude (still linearly). According to a 2018 paper by Leslie Smith this may converge much faster (100 epochs vs 800 epochs on the CIFAR10 dataset).
For a comparison of some methods see: An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition, Senior et al.
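The first three schedules as plain C functions of the step t; \varepsilon_0 = 0.1 and s = 1000 are illustrative settings.

#include <stdio.h>
#include <math.h>

static double power_sched(double eps0, double t, double s) {
    return eps0 / (1.0 + t / s);      /* eps0/2 after s steps, eps0/3 after 2s */
}

static double exp_sched(double eps0, double t, double s) {
    return eps0 * pow(0.1, t / s);    /* decays faster than power scheduling */
}

static double piecewise_sched(double t) {
    if (t < 1000) return 0.1;         /* constant for a while, then smaller */
    if (t < 5000) return 0.01;
    return 0.001;
}

int main(void) {
    for (int t = 0; t <= 2000; t += 500)
        printf("t=%4d power=%.4f exp=%.4f piecewise=%.4f\n",
               t, power_sched(0.1, t, 1000), exp_sched(0.1, t, 1000),
               piecewise_sched(t));
    return 0;
}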
AdaGrad
So far we have considered fixed schedules for learning rates. It is better to have larger rates for weights with smaller updates and smaller rates for weights with larger updates. AdaGrad uses an individually adapting learning rate for each weight.

SGD with AdaGrad
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), compute w^{(t+1)}:
    Choose (randomly) a minibatch T \subseteq \{1, \dots, p\}
    Compute w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}
    where
      \Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})
    and
      r_{ji}^{(t)} = r_{ji}^{(t-1)} + \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)}) \right)^2
  \eta is a constant expressing the influence of the learning rate, typically 0.01.
  \delta > 0 is a smoothing term (typically 1e-8) avoiding division by 0.

RMSProp
The main disadvantage of AdaGrad is the accumulation of the gradient throughout the whole learning process. In case the learning needs to get over several "hills" before settling in a deep "valley", the weight updates get far too small before getting to it. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

SGD with RMSProp
  weights in w^{(0)} are randomly initialized to values close to 0
  in the step t+1 (here t = 0, 1, 2, ...), compute w^{(t+1)}:
    Choose (randomly) a minibatch T \subseteq \{1, \dots, p\}
    Compute w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}
    where
      \Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})
    and
      r_{ji}^{(t)} = \rho r_{ji}^{(t-1)} + (1 - \rho) \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)}) \right)^2
  \eta is a constant expressing the influence of the learning rate (Hinton suggests \rho = 0.9 and \eta = 0.001).
  \delta > 0 is a smoothing term (typically 1e-8) avoiding division by 0.

Other optimization methods
There are more methods, such as AdaDelta, Adam (roughly RMSProp combined with momentum), etc.
A natural question: Which algorithm should one choose? Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly; no single best algorithm has emerged. Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm.
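A C sketch contrasting the two per-weight updates on a single weight with a stand-in gradient (of (w - 3)^2); \eta, \rho and \delta follow the values quoted above, while the toy objective and step count are illustrative.

#include <stdio.h>
#include <math.h>

int main(void) {
    double w_ada = 0.0, r_ada = 0.0;
    double w_rms = 0.0, r_rms = 0.0;
    double eta_ada = 0.01, eta_rms = 0.001, rho = 0.9, delta = 1e-8;

    for (int t = 0; t < 5000; t++) {
        double g_ada = 2.0 * (w_ada - 3.0);   /* stand-in gradients */
        double g_rms = 2.0 * (w_rms - 3.0);

        r_ada += g_ada * g_ada;               /* AdaGrad: full history */
        w_ada -= eta_ada / (sqrt(r_ada) + delta) * g_ada;

        r_rms = rho * r_rms + (1.0 - rho) * g_rms * g_rms;  /* decaying avg */
        w_rms -= eta_rms / (sqrt(r_rms) + delta) * g_rms;
    }
    printf("AdaGrad: w = %f, RMSProp: w = %f (minimum at 3)\n", w_ada, w_rms);
    return 0;
}

On this toy run AdaGrad is visibly slowed down by its accumulated history (the disadvantage motivating RMSProp above), while RMSProp gets to the minimum.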
Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability (to do gradient descent)
2. non-linearity (linear multi-layer networks are equivalent to single-layer ones)
3. monotonicity (local extrema of activation functions induce local extrema of the error function)
4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input preprocessing and the initial choice of weights. I will illustrate the reasoning on sigmoidal functions and say a few words about other activation functions later.

Activation functions – tanh
  \sigma(\xi) = 1.7159 \cdot \tanh\left( \frac{2}{3} \cdot \xi \right)
We have \lim_{\xi \to \infty} \sigma(\xi) = 1.7159 and \lim_{\xi \to -\infty} \sigma(\xi) = -1.7159; \sigma is almost linear on [-1, 1].
(figures: the function and its first derivative)

Input preprocessing
Some inputs may be much larger than others. E.g.: height vs weight of a person, maximum speed of a car (in km/h) vs its price (in CZK), etc.
Large inputs have a greater influence on the training than the small ones. In addition, too large inputs may slow down learning (saturation of activation functions).
Typical standardization:
  average = 0 (subtract the mean)
  variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated from the data (the training set).

Initial weights (for tanh)
Assume weights are chosen uniformly at random from an interval [-w, w], where w depends on the number of inputs of a given neuron. Consider the activation function \sigma(\xi) = 1.7159 \cdot \tanh(\frac{2}{3} \cdot \xi) for all neurons: \sigma is almost linear on [-1, 1] and saturates out of the interval [-4, 4] (i.e. there it is close to its limit values and its derivative is close to 0).
Thus
  for too small w we may get an (almost) linear model,
  for too large w (i.e. much larger than 1) the activations may get saturated and the learning will be very slow.
Hence, we want to choose w so that the inner potentials of neurons will be roughly in the interval [-1, 1].

Initial weights (for tanh)
Standardization gives mean = 0 and variance = 1 of the input data. Consider a neuron j from the first layer with n inputs. Assume that its weights are chosen uniformly from [-w, w].
The rule: choose w so that the standard deviation of \xi_j (denote it o_j) is close to the border of the interval on which \sigma_j is linear. In our case: o_j \approx 1.
Our assumptions imply: o_j = \sqrt{n/3} \cdot w. Thus we put w = \frac{\sqrt{3}}{\sqrt{n}}.
The same works for higher layers; n then corresponds to the number of neurons in the layer one level lower.
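A C sketch of both steps: standardizing one input feature and drawing initial weights uniformly from [-w, w] with w = \sqrt{3}/\sqrt{n}, the rule derived above; the sample data and n are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define P 5

static double uniform(double lo, double hi) {
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

int main(void) {
    double x[P] = {150.0, 160.0, 170.0, 180.0, 190.0};   /* raw feature */
    double mean = 0.0, var = 0.0;
    for (int k = 0; k < P; k++) mean += x[k] / P;
    for (int k = 0; k < P; k++) var += (x[k] - mean) * (x[k] - mean) / P;
    double sd = sqrt(var);
    for (int k = 0; k < P; k++) x[k] = (x[k] - mean) / sd;  /* standardize */

    int n = 100;                        /* number of inputs of the neuron */
    double w = sqrt(3.0) / sqrt((double)n);
    srand(7);
    printf("weight range [%f, %f], sample weight %f\n", -w, w, uniform(-w, w));
    return 0;
}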
Glorot & Bengio initialization
The previous heuristic for weight initialization ignores the variance of the gradient (i.e. it is concerned only with the "size" of activations in the forward pass).
Glorot & Bengio (2010) presented a normalized initialization, choosing w uniformly from the interval
  \left[ -\frac{\sqrt{6}}{\sqrt{m + n}}, \frac{\sqrt{6}}{\sqrt{m + n}} \right] = \left[ -\frac{\sqrt{3}}{\sqrt{(m + n)/2}}, \frac{\sqrt{3}}{\sqrt{(m + n)/2}} \right]
Here n is the number of inputs to the layer and m is the number of outputs of the layer (i.e. the number of neurons in the layer).
This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no non-linearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its non-linear counterparts.

Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with piecewise-linear activation functions. Most prominent is ReLU:
  \sigma(\xi) = \max\{0, \xi\}
THE default activation function recommended for use with most feedforward neural networks: as close to a linear function as possible; very simple; does not saturate for large potentials. Dead for negative potentials.

More modern activation functions
Leaky ReLU (greenboard): Generalizes ReLU; not dead for negative potentials. Experimentally not much better than ReLU.
ELU: "Smoothed" ReLU:
  \sigma(\xi) = \alpha(\exp(\xi) - 1) for \xi < 0,   \sigma(\xi) = \xi for \xi \ge 0
Here \alpha is a parameter; ELU converges to -\alpha as \xi \to -\infty. As opposed to ReLU: smooth, always non-zero gradient (but saturates), slower to compute.
SELU: Scaled variant of ELU:
  \sigma(\xi) = \lambda \alpha (\exp(\xi) - 1) for \xi < 0,   \sigma(\xi) = \lambda \xi for \xi \ge 0
Self-normalizing, i.e. the output of each layer will tend to preserve a mean (close to) 0 and a standard deviation (close to) 1 for \lambda \approx 1.050 and \alpha \approx 1.673, properly initialized weights (see below) and normalized inputs (zero mean, standard deviation 1).
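The activations above as C one-liners; the leaky-ReLU slope 0.01 is a common choice (the slides leave the exact slope to the greenboard), and the ELU parameter \alpha = 1 is an illustrative value.

#include <stdio.h>
#include <math.h>

static double relu(double xi)       { return xi > 0 ? xi : 0.0; }
static double leaky_relu(double xi) { return xi > 0 ? xi : 0.01 * xi; }

static double elu(double xi) {
    double alpha = 1.0;                 /* parameter; limit is -alpha */
    return xi >= 0 ? xi : alpha * (exp(xi) - 1.0);
}

static double selu(double xi) {
    double alpha = 1.673, lambda = 1.050;   /* self-normalizing constants */
    return lambda * (xi >= 0 ? xi : alpha * (exp(xi) - 1.0));
}

int main(void) {
    for (double xi = -2.0; xi <= 2.0; xi += 1.0)
        printf("xi=%5.1f relu=%6.3f leaky=%6.3f elu=%6.3f selu=%6.3f\n",
               xi, relu(xi), leaky_relu(xi), elu(xi), selu(xi));
    return 0;
}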
Initializing with the normal distribution
Denote by n the number of inputs to the initialized layer and by m the number of neurons in the layer.
  Glorot & Bengio (2010): Choose weights randomly from the normal distribution with mean 0 and variance 2/(n + m). Suitable for activation functions: none, tanh, logistic, softmax.
  He (2015): Choose weights randomly from the normal distribution with mean 0 and variance 2/n. Designed for ReLU and leaky ReLU.
  LeCun (1990): Choose weights randomly from the normal distribution with mean 0 and variance 1/n. Suitable for SELU.

How to choose activations of hidden neurons
Default is ReLU. According to Aurélien Géron: SELU > ELU > leaky ReLU > ReLU > tanh > logistic.
For a discussion see: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Aurélien Géron.

Output neurons
The choice of activation functions for output units depends on the concrete application.
For regression (function approximation), the output is typically linear.
For classification, the current activation functions of choice are
  logistic sigmoid – binary classification
  softmax – multi-class classification: given an output neuron j \in Y,
    y_j = \sigma_j(\xi_j) = \frac{e^{\xi_j}}{\sum_{i \in Y} e^{\xi_i}}
The error function used with softmax (assuming that the target values d_{kj} are from \{0, 1\}) is typically the cross-entropy
  -\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} d_{kj} \ln(y_j)
... which somewhat corresponds to the maximum likelihood principle.

Sigmoidal outputs with cross-entropy – in detail
Consider
  binary classification, two classes \{0, 1\},
  one output neuron j with the logistic sigmoid activation
    \sigma_j(\xi_j) = \frac{1}{1 + e^{-\xi_j}}
The output of the network is y = \sigma_j(\xi_j).
For a training set
  T = { (x_k, d_k) | k = 1, \dots, p }
(here x_k \in R^{|X|} and d_k \in R), the cross-entropy looks like this:
  E_{cross} = -\frac{1}{p} \sum_{k=1}^{p} \left[ d_k \ln(y_k) + (1 - d_k) \ln(1 - y_k) \right]
where y_k is the output of the network for the k-th training input x_k, and d_k is the k-th desired output.
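A C sketch of a softmax output layer with the categorical cross-entropy for one training example; the inner potentials and the one-hot target are illustrative, and subtracting the maximal potential before exponentiation is a standard numerical-stability trick not shown on the slides.

#include <stdio.h>
#include <math.h>

#define NOUT 3

int main(void) {
    double xi[NOUT] = {2.0, 1.0, 0.1};   /* inner potentials of output neurons */
    double d[NOUT]  = {1.0, 0.0, 0.0};   /* one-hot desired output */
    double y[NOUT], sum = 0.0, max = xi[0];

    for (int j = 1; j < NOUT; j++) if (xi[j] > max) max = xi[j];
    for (int j = 0; j < NOUT; j++) { y[j] = exp(xi[j] - max); sum += y[j]; }
    for (int j = 0; j < NOUT; j++) y[j] /= sum;        /* softmax */

    double ce = 0.0;
    for (int j = 0; j < NOUT; j++)
        if (d[j] > 0) ce -= d[j] * log(y[j]);          /* cross-entropy */

    for (int j = 0; j < NOUT; j++) printf("y[%d] = %.4f\n", j, y[j]);
    printf("cross-entropy = %.4f\n", ce);
    return 0;
}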
Generalization
Intuition: Generalization = the ability to cope with new, unseen instances.
Data are mostly noisy, so it is not a good idea to fit them exactly. In the case of function approximation, the network should not return the exact results from the training set.
More formally: It is typically assumed that the training set has been generated as follows:
  d_{kj} = g_j(x_k) + \Theta_{kj}
where g_j is the "underlying" function corresponding to the output neuron j \in Y and \Theta_{kj} is random noise. The network should fit g_j, not the noise.
Methods improving generalization are called regularization methods.

Regularization
Regularization is a big issue in neural networks, as they typically use a huge number of parameters and thus are very susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
... and I ask you, prof. von Neumann: What can you fit with 40 GB of parameters??

Early stopping
Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop?
In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit, not to minimize (strange) error functions designed to be easily differentiable. Also, as noted before, minimizing E completely is not good for generalization.
For a start: we may employ the standard approach of training on one set and stopping on another one.
Divide your dataset into several subsets:
  training set (e.g. 60%) – train the network here
  validation set (e.g. 20%) – use to stop the training
  (possibly) test set (e.g. 20%) – use to compare trained models
What to use as a stopping rule?
  You may observe E (or any other function of interest) on the validation set; if it does not improve for the last k steps, stop (a sketch of this rule follows below).
  Alternatively, you may observe the gradient; if it is small for some time, stop. (Recent studies have shown that this traditional rule is not too good: it may happen that the gradient is larger close to minimum values; on the other hand, E does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as various types of cross-validation, etc.
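The "no improvement in the last k steps" rule from above, sketched in C over a made-up sequence of validation errors; the patience k = 3 is an illustrative value.

#include <stdio.h>

int main(void) {
    double val_err[] = {0.90, 0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44};
    int n = sizeof(val_err) / sizeof(val_err[0]);
    int k = 3, since_best = 0;
    double best = 1e9;

    for (int t = 0; t < n; t++) {
        if (val_err[t] < best) { best = val_err[t]; since_best = 0; }
        else if (++since_best >= k) {       /* no improvement for k steps */
            printf("stop at step %d, best validation error %.2f\n", t, best);
            return 0;
        }
    }
    printf("training ended without triggering early stopping\n");
    return 0;
}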
Size of the network
A similar problem as in the case of the training duration:
  Too small a network is not able to capture the intrinsic properties of the training set.
  Large networks overfit faster.
Solution: Optimal number of neurons :-)
  there are some (useless) theoretical bounds
  there are algorithms dynamically adding/removing neurons (not much used nowadays)
In practice:
  start using a rule of thumb: the number of neurons ≈ ten times less than the number of training instances
  experiment, experiment, experiment.

Feature extraction
Consider a two-layer network. Hidden neurons are supposed to represent "patterns" in the inputs.
Example: a network 64-2-3 for letter classification.
(figure: the hidden-neuron weights as patterns)

Ensemble methods
Techniques for reducing the generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set.
Idea: Train several different models separately, then have all of the models vote on the output for test examples.
Bagging:
  Generate k training sets T_1, ..., T_k by sampling from T uniformly with replacement. If the number of samples is |T|, then each T_i contains on average (1 - 1/e)|T| distinct examples of T.
  For each i, train a model M_i on T_i.
  Combine the outputs of the models: for regression by averaging, for classification by (majority) voting.

Dropout
The algorithm: In every step of the gradient descent,
  choose randomly a set N of neurons; each neuron is included in N independently with probability 1/2 (in practice, different probabilities are used as well),
  do the forward and backward propagations only using the selected neurons (i.e. leave the weights of the other neurons unchanged).
Dropout resembles bagging: a large ensemble of neural networks is trained "at once" on parts of the data. Dropout is not exactly the same as bagging: the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set; this would be infeasible for large networks/training sets.
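A C sketch of the dropout selection for one gradient-descent step; only the mask bookkeeping is shown, the network itself is elided.

#include <stdio.h>
#include <stdlib.h>

#define NH 8

int main(void) {
    int mask[NH];
    srand(3);
    for (int j = 0; j < NH; j++)
        mask[j] = rand() % 2;            /* include neuron j with prob. 1/2 */

    printf("neurons used in this step:");
    for (int j = 0; j < NH; j++)
        if (mask[j]) printf(" %d", j);   /* forward/backward only over these;
                                            weights of dropped neurons stay
                                            unchanged in this step */
    printf("\n");
    return 0;
}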
More optimization, regularization ...
There are many more practical tips, optimization methods, regularization methods, etc. For a very nice survey see http://www.deeplearningbook.org/ ... and also the infinitely many other URLs concerned with deep learning. 163

Some applications 164

ALVINN (history) 165

ALVINN
Architecture: MLP, 960−4−30 (also 960−5−30); the inputs correspond to pixels.
Activity: activation function: logistic sigmoid. The steering wheel position is determined by the "center of mass" of the neuron values. 166

ALVINN
Learning: trained during a (live) drive. The front window view is captured by a camera, 25 images per second. Training samples are of the form (xk, dk), where xk = an image of the road and dk = the corresponding position of the steering wheel. The position of the steering wheel is "blurred" by a Gaussian distribution:

  dki = e^(−Di²/10)

where Di is the distance of the i-th output from the one which corresponds to the correct position of the wheel. (The authors claim that this was better than a binary output.) 167

ALVINN – Selection of training samples
Naive approach: take images directly from the camera and adapt accordingly. Problems:
  If the driver is gentle enough, the car never learns how to get out of dangerous situations. A solution might be to turn learning off for a moment, let the driver drive as if insane, then suddenly switch learning back on and let the net catch up (dangerous, possibly expensive).
  The real view out of the front window is repetitive and boring; the net would overfit on few examples. 168

ALVINN – Selection of training examples
The problem with a "good" driver is solved as follows: 15 distorted copies of each image, with the desired output generated for each copy.
"Boring" images are solved as follows: a buffer of 200 images (including the 15 copies of the original); in every step the system trains on the buffer; after several updates a new image is captured, 15 copies are made, and they substitute 15 images in the buffer (5 chosen randomly, 10 with the smallest error). 169

ALVINN – learning
Pure backpropagation; constant learning rate; momentum, slowly increasing.
Results: trained for 5 minutes at a speed of 4 miles per hour, ALVINN was able to drive well on a new road it had never seen (in different weather conditions). The maximum speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm. 170

ALVINN – weight development
(Figure: development of the weights of the hidden neurons h1, ..., h5 at rounds 0, 10, 20 and 50.) 171
MNIST – handwritten digits recognition
A database of labelled images of handwritten digits: 60,000 training examples, 10,000 testing. Dimensions: 28 × 28; digits are centered to the "center of gravity" of the pixel values and normalized to a fixed size. More at http://yann.lecun.com/exdb/mnist/
The database is used as a standard benchmark in lots of publications, which allows comparison of various methods. 172

MNIST
One of the best "old" results is the following: 6-layer NN 784-2500-2000-1500-1000-500-10 (on GPU) (Ciresan et al. 2010). Abstract: "Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning."
A famous application of a learning convolutional network: LeNet-1 in 1998. 173

MNIST – LeNet1 (architecture figure) 174

MNIST – LeNet1
Interpretation of the output: the output neuron with the highest value identifies the digit; alternatively the same, but if the two largest neuron values are too close together, the input is rejected (i.e. no answer).
Learning: trained on 7291 samples, tested on 2007 samples.
Results: error on the test set without rejection: 5%; with rejection: 1% (12% rejected). Compare with a dense MLP with 40 hidden neurons: error 1% (19.4% rejected). 175

Modern convolutional networks
The rest of the lecture is based on the online book Neural Networks and Deep Learning by Michael Nielsen. http://neuralnetworksanddeeplearning.com/index.html
Convolutional networks are currently the best networks for image classification. Their common ancestor is LeNet-5 (and other LeNets) from the nineties.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 176

AlexNet
In 2012 this network made a breakthrough in the ILSVRC competition, taking the classification error from around 28% to 16%. A convolutional network, trained on two GPUs. 177

Convolutional networks – local receptive fields
Every neuron is connected with a field of k × k (in this case 5 × 5) neurons in the lower layer (this field is called its receptive field). The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function. 178

Convolutional networks – stride length
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron. The "size" of the slide is called the stride length. The group of all such neurons is a feature map. All these neurons share weights and biases! 179

Feature maps
Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer. 180

Pooling
Neurons in the pooling layer compute functions of their receptive fields:
  Max-pooling: maximum of the inputs
  L2-pooling: square root of the sum of squares
  Average-pooling: mean
  · · · 181

Trained feature maps
(Figure: 20 feature maps, receptive fields 5 × 5.) 182

Trained feature maps (figure) 183

Simple convolutional network
28 × 28 input image; 3 feature maps, each feature map has its own max-pooling (field 5 × 5, stride = 1); 10 output neurons. Each neuron in the output layer gets input from each neuron in the pooling layer. Trained using backprop, which can be easily adapted to convolutional networks. A sketch of the convolutional forward pass follows. 184
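To make the convolution and pooling operations concrete, here is a minimal NumPy sketch of a single feature map (one shared weight kernel slid over the image) followed by max-pooling over disjoint fields; the ReLU activation and all the sizes are illustrative choices, not the exact configuration above.

```python
import numpy as np

def conv2d(image, kernel, bias, stride=1):
    """One feature map: slide a shared k-by-k weight kernel (the
    receptive field) over the image, compute the weighted sum + bias at
    every position, and apply an activation (here ReLU)."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            field = image[r*stride:r*stride+k, c*stride:c*stride+k]
            out[r, c] = np.sum(field * kernel) + bias   # shared weights
    return np.maximum(out, 0.0)                         # ReLU

def max_pool(fmap, size=2):
    """Max-pooling over disjoint size-by-size fields of a feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Tiny usage example: a random 28x28 "image", one 5x5 feature map.
rng = np.random.default_rng(0)
fmap = conv2d(rng.random((28, 28)), rng.standard_normal((5, 5)), 0.0)
pooled = max_pool(fmap)            # 24x24 -> 12x12
print(fmap.shape, pooled.shape)
```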
Convolutional network (figure) 185

Simple convolutional network vs MNIST
Two convolutional-pooling layers, the first with 20 and the second with 40 feature maps; two dense (MLP) layers (1000-1000); 10 outputs.
Activation functions of the feature maps and dense layers: ReLU; max-pooling; output layer: softmax. Error function: negative log-likelihood (= cross-entropy).
Training: SGD, mini-batch size 10, learning rate 0.03, L2 regularization with "weight" λ = 0.1 + dropout with probability 1/2, training for 40 epochs (i.e. every training example is considered 40 times). Expanded dataset: displacement by one pixel in an arbitrary direction. Committee voting of 5 networks. 186

MNIST
Out of the 10,000 images in the test set, only these 33 have been incorrectly classified: (figure) 187

More complex convolutional networks
Convolutional networks have been used for classification of images from the ImageNet database (16 million color images, 20 thousand classes). 188

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
A competition in classification over a subset of images from ImageNet. Started in 2010; assisted in the breakthrough in image recognition. Training set: 1.2 million images, 1000 classes. Validation set: 50,000 images; test set: 150,000 images.
Many images contain more than one object ⇒ the model is allowed to choose five classes, and the correct label must be among the five (the top-5 criterion). 189

AlexNet
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012). Trained on two GPUs (NVIDIA GeForce GTX 580).
Results: accuracy 84.7% in top-5 (the second best algorithm at the time: 73.8%); 63.3% "perfect" (top-1) classification. 190

ILSVRC 2014
The same dataset as in 2012, top-5 criterion. GoogLeNet: a deep convolutional network, 22 layers. Results: accuracy 93.33% top-5. 191

ILSVRC 2015
A deep convolutional network; various numbers of layers, the winner has 152 layers; skip connections implementing residual learning. Error 3.57% in top-5. 192

ILSVRC 2016
Trimps-Soushen (The Third Research Institute of the Ministry of Public Security). There is no new innovative technology or novelty by Trimps-Soushen: an ensemble of the pretrained models Inception-v3, Inception-v4, Inception-ResNet-v2, Pre-Activation ResNet-200, and Wide ResNet (WRN-68-2). Each of the models is strong at classifying some categories, but also weak at classifying others. Test error: 2.99%. 193

Top-k accuracy analyzed
(Several figure slides analyzing top-k accuracy and the top-20 typical errors, out of 1458 misclassified images in top-20. Source: https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd) 194-199
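For concreteness, the top-k criterion from the ILSVRC slide can be computed as below; a minimal NumPy sketch with made-up scores, unrelated to the systems above.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """scores: (n_examples, n_classes) class scores; labels: (n_examples,).
    An example counts as correct if its true label is among the k classes
    with the highest scores (the ILSVRC top-5 criterion for k = 5)."""
    topk = np.argsort(scores, axis=1)[:, -k:]        # k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Made-up example: 4 examples, 10 classes.
rng = np.random.default_rng(1)
scores = rng.random((4, 10))
labels = np.array([3, 1, 7, 0])
print(top_k_accuracy(scores, labels, k=5))
```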
Superhuman convolutional nets?!
Andrej Karpathy: "... the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard: people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better." 200

Does it really work? 201

Convolutional networks – theory 202

Convolutional network (figure) 203

Convolutional layers
Every neuron is connected with a (typically small) receptive field of neurons in the lower layer. The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function. 204

Convolutional layers
Neurons are grouped into feature maps sharing weights. 205

Convolutional layers
Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer. 206

Pooling layers
Neurons in the pooling layer compute simple functions of their receptive fields (the fields are typically disjoint): max-pooling (maximum of the inputs), L2-pooling (square root of the sum of squares), average-pooling (mean), · · · 207

Convolutional networks – architecture
Neurons are organized in layers L0, L1, ..., Ln, with connections (typically) only from Lm to Lm+1. Several types of layers:
  input layer L0
  dense layer Lm: each neuron of Lm is connected with each neuron of Lm−1
  convolutional & pooling layer Lm: contains two sub-layers:
    convolutional layer: neurons are organized into disjoint feature maps; all neurons of a given feature map share weights (but have different inputs)
    pooling layer: each (convolutional) feature map F has a corresponding pooling map P; neurons of P have inputs only from F (typically few of them), compute a simple aggregate function (such as max), and have disjoint inputs. 208
Convolutional networks – architecture
Notation:
  X is the set of input neurons, Y the set of output neurons, Z the set of all neurons (X, Y ⊆ Z); individual neurons are denoted by indices i, j, etc.
  ξj is the inner potential of neuron j after the computation stops; yj is the output of neuron j after the computation stops (define y0 = 1, the value of the formal unit input).
  wji is the weight of the connection from i to j (in particular, wj0 is the weight of the connection from the formal unit input, i.e. wj0 = −bj where bj is the bias of neuron j).
  j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i); j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i).
  [ji] is the set of all connections (i.e. pairs of neurons) sharing the weight wji. 209

Convolutional networks – activity
Neurons of dense and convolutional layers: the inner potential of neuron j is

  ξj = ∑_{i∈j←} wji yi

and the activation function σj of neuron j (arbitrary differentiable) gives yj = σj(ξj).
Neurons of pooling layers apply the "pooling" function:
  max-pooling: yj = max_{i∈j←} yi
  avg-pooling: yj = (∑_{i∈j←} yi) / |j←|
A convolutional network is evaluated layer-wise (as an MLP); for each j ∈ Y, yj(w, x) is the value of the output neuron j after evaluating the network with weights w and input x. 210

Convolutional networks – learning
Learning: given a training set T of the form (xk, dk), k = 1, ..., p. Here, every xk ∈ R^|X| is an input vector and every dk ∈ R^|Y| is the desired network output. For every j ∈ Y, denote by dkj the desired output of neuron j for the network input xk (the vector dk can be written as (dkj)_{j∈Y}).
Error function – mean squared error (for example):

  E(w) = (1/p) ∑_{k=1}^{p} Ek(w)   where   Ek(w) = (1/2) ∑_{j∈Y} (yj(w, xk) − dkj)² 211

Convolutional networks – SGD
The algorithm computes a sequence of weight vectors w(0), w(1), w(2), ....
  Weights in w(0) are randomly initialized to values close to 0.
  In step t + 1 (here t = 0, 1, 2, ...), weights w(t+1) are computed as follows: choose (randomly) a set of training examples T ⊆ {1, ..., p} and compute

    w(t+1) = w(t) + Δw(t)   where   Δw(t) = −ε(t) · (1/|T|) ∑_{k∈T} ∇Ek(w(t))

Here T is a minibatch (of a fixed size), 0 < ε(t) ≤ 1 is the learning rate in step t + 1, and ∇Ek(w(t)) is the gradient of the error of example k. Note that the random choice of the minibatch is typically implemented by randomly shuffling all the data and then taking minibatches sequentially. An epoch consists of one round through all the data. A sketch of this loop follows. 212
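A minimal sketch of the SGD loop just described, assuming a caller-supplied function grad_E(w, x, d) returning ∇Ek (e.g. computed by the backpropagation derived next); all names and parameters are illustrative.

```python
import numpy as np

def sgd(w, xs, ds, grad_E, eps=0.01, batch_size=10, epochs=40):
    """Minibatch SGD: shuffle the data, walk through it in minibatches,
    and update w against the averaged gradient of the per-example error.
    `grad_E(w, x, d)` is assumed to return the gradient of E_k at w."""
    rng = np.random.default_rng(0)
    n = len(xs)
    for _ in range(epochs):                  # one epoch = one pass over the data
        order = rng.permutation(n)           # random shuffle
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            g = np.mean([grad_E(w, xs[k], ds[k]) for k in batch], axis=0)
            w = w - eps * g                  # Δw = −ε · averaged gradient
    return w
```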
Backprop
Recall that ∇Ek(w(t)) is the vector of all partial derivatives of the form ∂Ek/∂wji. How to compute ∂Ek/∂wji?
First, switch from derivatives w.r.t. wji to derivatives w.r.t. yj. For every wji where j is in a dense layer (i.e. does not share weights):

  ∂Ek/∂wji = ∂Ek/∂yj · σj′(ξj) · yi

Now for every wji where j is in a convolutional layer, sum over all connections (r, ℓ) sharing the weight wji:

  ∂Ek/∂wji = ∑_{(r,ℓ)∈[ji]} ∂Ek/∂yr · σr′(ξr) · yℓ

Neurons of pooling layers do not have weights. 213

Backprop
Now compute the derivatives w.r.t. yj:
  For every j ∈ Y:
    ∂Ek/∂yj = yj − dkj
  (This holds for the squared error; for other error functions the derivative w.r.t. the outputs will be different.)
  For every j ∈ Z ∖ Y such that j→ is in either a dense or a convolutional layer:
    ∂Ek/∂yj = ∑_{r∈j→} ∂Ek/∂yr · σr′(ξr) · wrj
  For every j ∈ Z ∖ Y such that j→ is max-pooling: then j→ = {i} for a single "max" neuron, and
    ∂Ek/∂yj = ∂Ek/∂yi if j = arg max_{r∈i←} yr, and 0 otherwise.
I.e. the gradient can be propagated from the output layer downwards as in an MLP. 214

Convolutional networks – summary
Convolutional networks are nowadays the most used networks in image processing (and also in other areas where the input has some local, "spatially" invariant properties). Typically trained using backpropagation. Due to the weight sharing they allow (very) deep architectures. Typically extended with more adjustments and tricks in their topologies. 215

Recurrent Neural Networks – LSTM 216

RNN
Input: x = (x1, ..., xM); hidden: h = (h1, ..., hH); output: y = (y1, ..., yN). 217

RNN example
Activation function: σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.
  inputs:  x1 = (0, 0), x2 = (1, 0), x3 = (1, 1), · · ·
  hidden:  h0 = (0, 0), h1 = (1, 1), h2 = (1, 0), h3 = (0, 1), · · ·
  outputs: y1 = 1, y2 = 0, y3 = 1, · · · 218

RNN – formally
M inputs: x = (x1, ..., xM); H hidden neurons: h = (h1, ..., hH); N output neurons: y = (y1, ..., yN).
Weights: Ukk′ from input xk′ to hidden hk; Wkk′ from hidden hk′ to hidden hk; Vkk′ from hidden hk′ to output yk. 219

RNN – formally
Input sequence: x = x1, ..., xT with xt = (xt1, ..., xtM).
Hidden sequence: h = h0, h1, ..., hT with ht = (ht1, ..., htH). We have h0 = (0, ..., 0) and

  htk = σ( ∑_{k′=1}^{M} Ukk′ xtk′ + ∑_{k′=1}^{H} Wkk′ h(t−1)k′ )

Output sequence: y = y1, ..., yT with yt = (yt1, ..., ytN), where

  ytk = σ( ∑_{k′=1}^{H} Vkk′ htk′ ) 220

RNN – in matrix form
Input sequence: x = x1, ..., xT.
Hidden sequence: h = h0, h1, ..., hT, where h0 = (0, ..., 0) and ht = σ(U·xt + W·ht−1).
Output sequence: y = y1, ..., yT, where yt = σ(V·ht). 221
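A minimal NumPy sketch of the matrix-form forward pass ht = σ(U·xt + W·ht−1), yt = σ(V·ht); the logistic sigmoid and all the dimensions are illustrative assumptions.

```python
import numpy as np

def sigma(xi):
    return 1.0 / (1.0 + np.exp(-xi))          # logistic sigmoid

def rnn_forward(U, W, V, xs):
    """xs: sequence of input vectors x_1..x_T.
    Returns hidden states h_1..h_T and outputs y_1..y_T, computed as
    h_t = sigma(U x_t + W h_{t-1}) with h_0 = 0 and y_t = sigma(V h_t).
    U, W, V are shared across all time steps (weight sharing in time)."""
    h = np.zeros(W.shape[0])                   # h_0 = (0, ..., 0)
    hs, ys = [], []
    for x in xs:
        h = sigma(U @ x + W @ h)               # new hidden state
        hs.append(h)
        ys.append(sigma(V @ h))                # output at time t
    return hs, ys

# Usage with random weights: M=3 inputs, H=4 hidden, N=2 outputs, T=5.
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))
W = rng.standard_normal((4, 4))
V = rng.standard_normal((2, 4))
hs, ys = rnn_forward(U, W, V, [rng.standard_normal(3) for _ in range(5)])
```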
RNN – comments
ht is the memory of the network; it captures what happened in all previous steps (with decaying quality). The RNN shares the weights U, V, W along the sequence. Note the similarity to convolutional networks, where the weights were shared spatially over images; here they are shared temporally over sequences. An RNN can deal with sequences of variable length; compare with an MLP, which accepts only fixed-dimension vectors on input. 222

RNN – training
Training set T = { (x1, d1), ..., (xp, dp) }; here each x = x1, ..., xT is an input sequence and each d = d1, ..., dT is the expected output sequence. Each xt = (xt1, ..., xtM) is an input vector and each dt = (dt1, ..., dtN) is an expected output vector. 223

Error function
In what follows, consider a training set with a single element (x, d), i.e. drop the example index and have x = x1, ..., xT where xt = (xt1, ..., xtM), and d = d1, ..., dT where dt = (dt1, ..., dtN). The squared error of (x, d) is defined by

  E(x,d) = ∑_{t=1}^{T} ∑_{k=1}^{N} (1/2)(ytk − dtk)²

Recall that we have a sequence of network outputs y = y1, ..., yT, and thus ytk is the k-th component of yt. 224

Gradient descent (single training example)
Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows:
  Initialize all weights randomly close to 0.
  In step ℓ + 1 (here ℓ = 0, 1, 2, ...), compute the "new" weights U(ℓ+1), V(ℓ+1), W(ℓ+1) from the "old" weights U(ℓ), V(ℓ), W(ℓ) as follows:

    Ukk′(ℓ+1) = Ukk′(ℓ) − ε(ℓ) · ∂E(x,d)/∂Ukk′
    Vkk′(ℓ+1) = Vkk′(ℓ) − ε(ℓ) · ∂E(x,d)/∂Vkk′
    Wkk′(ℓ+1) = Wkk′(ℓ) − ε(ℓ) · ∂E(x,d)/∂Wkk′

The above is THE learning algorithm that modifies the weights! 225

Backpropagation
Backpropagation computes the derivatives of E; no weights are modified!

  ∂E(x,d)/∂Ukk′ = ∑_{t=1}^{T} ∂E(x,d)/∂htk · σ′ · xtk′    for k′ = 1, ..., M
  ∂E(x,d)/∂Vkk′ = ∑_{t=1}^{T} ∂E(x,d)/∂ytk · σ′ · htk′    for k′ = 1, ..., H
  ∂E(x,d)/∂Wkk′ = ∑_{t=1}^{T} ∂E(x,d)/∂htk · σ′ · h(t−1)k′    for k′ = 1, ..., H

(Here σ′ denotes the derivative of the activation function at the inner potential of the corresponding neuron at time t.)
Backpropagation:

  ∂E(x,d)/∂ytk = ytk − dtk   (assuming the squared error)
  ∂E(x,d)/∂htk = ∑_{k′=1}^{N} ∂E(x,d)/∂ytk′ · σ′ · Vk′k + ∑_{k′=1}^{H} ∂E(x,d)/∂h(t+1)k′ · σ′ · Wk′k 226

Long-term dependencies
Consider the backpropagation rule for ∂E(x,d)/∂htk above. Unless ∑_{k′=1}^{H} σ′ · Wk′k ≈ 1, the gradient either vanishes or explodes: for a large T (a long-term dependency), the gradient "deeper" in the past tends to be too small (too large).
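The vanishing/exploding behaviour can be demonstrated numerically: each backward step multiplies the hidden-state gradient by roughly diag(σ′)·Wᵀ. A toy NumPy sketch, with random weights and the logistic sigmoid's maximal derivative 1/4 as a stand-in for σ′:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 20
W = rng.standard_normal((H, H)) * 0.5     # recurrent weights (toy choice)
sigma_prime = 0.25                        # max derivative of logistic sigmoid

g = rng.standard_normal(H)                # gradient w.r.t. the last hidden state
for t in range(50):                       # propagate 50 steps into the past
    g = sigma_prime * (W.T @ g)           # one step of the backprop recurrence
    if t % 10 == 9:
        print(f"{t+1:3d} steps back: |gradient| = {np.linalg.norm(g):.3e}")
# With these toy weights the norm shrinks geometrically (vanishes); scaling W
# up sufficiently makes it blow up (explode) instead.
```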
A solution: LSTM. 227

LSTM
  ht = ot ◦ σh(Ct)                     (output)
  Ct = ft ◦ Ct−1 + it ◦ C̃t             (memory)
  C̃t = σh(WC · ht−1 + UC · xt)         (new memory contents)
  ot = σg(Wo · ht−1 + Uo · xt)         (output gate)
  ft = σg(Wf · ht−1 + Uf · xt)         (forget gate)
  it = σg(Wi · ht−1 + Ui · xt)         (input gate)
Here ◦ is the component-wise product of vectors, · is the matrix-vector product, σh is the hyperbolic tangent (applied component-wise), and σg is the logistic sigmoid (applied component-wise). 228

RNN vs LSTM
(Figure.) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 229

LSTM
(A sequence of figure slides walking through the equations above one by one. Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/) 230

LSTM – summary
LSTM (almost) solves the vanishing gradient problem w.r.t. the "internal" state of the network. It learns to control its own memory (via the forget gate). A revolution in machine translation and text processing. 231
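A minimal NumPy sketch of one LSTM step, transcribing the equations above directly; biases are omitted, matching the slide's formulation, and all dimensions are illustrative.

```python
import numpy as np

def sigma_g(x):                              # logistic sigmoid (gates)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, C_prev, Wi, Ui, Wf, Uf, Wo, Uo, WC, UC):
    """One step of the LSTM recurrence from the slides (no biases)."""
    i = sigma_g(Wi @ h_prev + Ui @ x)        # input gate
    f = sigma_g(Wf @ h_prev + Uf @ x)        # forget gate
    o = sigma_g(Wo @ h_prev + Uo @ x)        # output gate
    C_tilde = np.tanh(WC @ h_prev + UC @ x)  # new memory contents
    C = f * C_prev + i * C_tilde             # memory: forget old, add new
    h = o * np.tanh(C)                       # output
    return h, C

# Usage with random weights: 3-dimensional input, 4-dimensional state.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) for _ in range(4)]   # Wi, Wf, Wo, WC
Us = [rng.standard_normal((4, 3)) for _ in range(4)]   # Ui, Uf, Uo, UC
h, C = np.zeros(4), np.zeros(4)
h, C = lstm_step(rng.standard_normal(3), h, C,
                 Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2], Ws[3], Us[3])
```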
Convolutions & LSTM in action – cancer research 232

Colorectal cancer outcome prediction
The problem: predict the 5-year survival probability from an image of a small region of tumour tissue (1 mm diameter).
Input: digitized haematoxylin-eosin-stained tumour tissue microarray samples. Output: the estimated survival probability.
Data: Training set: 420 patients of Helsinki University Centre Hospital, diagnosed with colorectal cancer, who underwent primary surgery. Test set: 182 patients. The follow-up time and outcome are known for each patient.
Human expert comparison: the histological grade assessed at the time of diagnosis, and a Visual Risk Score: three pathologists classified the samples into high/low-risk categories (by majority vote).
Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018. 233

Colorectal cancer outcome prediction (figures) 234

Data & workflow
Input images: 3500 px × 3500 px, cut into tiles of 224 px × 224 px ⇒ 256 tiles. Each tile is passed to a convolutional network (CNN); the output of the CNN is a 4096-dimensional vector. The "string" of 256 vectors (each of dimension 4096) is passed into an LSTM, which outputs the probability of 5-year survival.
The authors also tried to substitute the LSTM on top of the CNN with logistic regression, naive Bayes, and support vector machines. 235

CNN architecture – VGG-16
(Pre)trained on ImageNet (cats, dogs, chairs, etc.). 236

LSTM architecture
The LSTM has three layers (264, 128, 64 cells). 237

LSTM – training
L1 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the absolute values of the weights is added to the error. L2 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the squared values of the weights is added to the error. Dropout of 5% at the input and the last hidden layer of the LSTM.
Datasets: training 220 samples, validation 60 samples, test 140 samples. 238

Colorectal cancer outcome prediction
(Results figure.) Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018. 239

Feed-forward networks summary
Architectures: the multi-layer perceptron (MLP): dense connections between layers; convolutional networks (CNN): local receptive fields, feature maps, pooling; recurrent networks (RNN, LSTM): self-loops, but still feed-forward through time.
Training: the gradient descent algorithm + heuristics. 240

Generative models 241

Restricted Boltzmann machine (RBM)
Architecture: a neural network with cycles and symmetric connections; the neurons are divided into two disjoint sets: V (visible) and H (hidden). Connections: V × H (a complete bipartite graph). N is the set of all neurons.
Denote by ξj the inner potential and by yj the output (i.e. the state) of neuron j. The state of the machine: y ∈ {0, 1}^|N|. Denote by wji ∈ R the weight of the connection from i to j (and thus also from j to i). Consider biases: wj0 is the weight between j and a neuron 0 whose value y0 is always 1. 242

RBM – activity
The states of the neurons are initially set to values of {0, 1}, i.e. yj(0) ∈ {0, 1} for j ∈ N. In step t + 1 do the following:
  t even: randomly choose new values of all hidden neurons; for every j ∈ H,
    P(yj(t+1) = 1) = 1 / (1 + exp(−wj0 − ∑_{i∈V} wji yi(t)))
  t odd: randomly choose new values of all visible neurons; for every j ∈ V,
    P(yj(t+1) = 1) = 1 / (1 + exp(−wj0 − ∑_{i∈H} wji yi(t))) 243

Equilibrium
Theorem: For every γ* ∈ {0, 1}^|N| we have

  lim_{t→∞} P(y(t) = γ*) = (1/Z) e^(−E(γ*))

where Z = ∑_{γ∈{0,1}^|N|} e^(−E(γ)) and

  E(γ) = −∑_{i∈V, j∈H} wji yjγ yiγ − ∑_{i∈V} wi0 yiγ − ∑_{j∈H} wj0 yjγ

Here yiγ is the value of neuron i in the state γ. 244
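A minimal NumPy sketch of this alternating (Gibbs) sampling dynamics, with the bias wj0 written as a separate vector; the random weights are a stand-in for a trained machine.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_on(states, W, b):
    """P(y_j = 1) = 1 / (1 + exp(-w_j0 - sum_i w_ji y_i)), for every j."""
    return 1.0 / (1.0 + np.exp(-(W @ states + b)))

def gibbs_chain(v, W, bv, bh, steps=1000):
    """Alternate: resample all hidden neurons given the visible ones
    (even steps), then all visible ones given the hidden (odd steps).
    W[j, i] is the symmetric weight between hidden j and visible i."""
    for _ in range(steps):
        h = (rng.random(len(bh)) < p_on(v, W, bh)).astype(float)
        v = (rng.random(len(bv)) < p_on(h, W.T, bv)).astype(float)
    return v, h

# Toy machine: 6 visible, 3 hidden neurons, random weights, zero biases.
W = rng.standard_normal((3, 6)) * 0.5
v0 = (rng.random(6) < 0.5).astype(float)
v, h = gibbs_chain(v0, W, bv=np.zeros(6), bh=np.zeros(3))
```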
RBM – probability distribution
The RBM defines the following probability distribution on {0, 1}^|N| (recall that N is the set of all neurons):

  pN(γ*) := lim_{t→∞} P(y(t) = γ*)   for every γ* ∈ {0, 1}^|N|

We obtain a distribution on the states of the visible neurons by marginalization:

  pV(α) = ∑_{β∈{0,1}^|H|} pN(αβ)   for every α ∈ {0, 1}^|V|

Here αβ ∈ {0, 1}^|N| is the vector of values of all states obtained by concatenating the values α of the visible neurons and the values β of the hidden neurons. 245

RBM – learning
Let pd be a probability distribution on the states of the visible neurons, i.e. on {0, 1}^|V|. Our goal is to find a configuration of the network W such that pV ≈ pd. A suitable measure of the difference between the probability distributions pV and pd is the relative entropy weighted by the probabilities of the states (the Kullback-Leibler divergence):

  E(W) = ∑_{α∈{0,1}^|V|} pd(α) ln( pd(α) / pV(α) )

E is minimized using the gradient descent algorithm. 246

RBM – learning
Minimize E(W) using gradient descent, i.e. compute a sequence of weight matrices W(0), W(1), ...:
  Initialise W(0) randomly, close to 0.
  In step t + 1 compute W(t+1) as follows:

    Wji(t+1) = Wji(t) + ΔWji(t)   where   ΔWji(t) = −ε(t) · ∂E/∂wji(W(t))

is the update of the weight wji in step t + 1, and 0 < ε(t) ≤ 1 is the learning rate in step t + 1.
It remains to compute ∂E/∂wji(W) (skipped). 247

Deep MLP
Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers. Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer. Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer. The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2). 248

Why deep networks
... if one hidden layer is able to represent an arbitrary (reasonable) function? One hidden layer may be very inefficient, i.e. a huge number of neurons may be needed. One can show that the number of hidden neurons may be exponential w.r.t. the dimension of the input; networks with multiple layers may be exponentially more succinct than networks with a single hidden layer.
... ok, so let's try to train deep networks ... using backpropagation? Problems: the gradient may vanish/explode when backpropagated through many layers, and deep networks (with many neurons) overfit very easily. 249
Deep MLP – pretraining
Assume k layers. Denote by Wi the weight matrix between layers i − 1 and i, and by Fi the function computed by the "lower" part of the MLP consisting of layers 0, 1, ..., i. (F1 is the function computed by the part consisting of the input and the first hidden layer, which is now considered as the output layer.)
Crucial observation: for every i, the layers i − 1 and i together with the matrix Wi can be considered as an RBM. Denote such an RBM by Bi. 250

Deep MLP – pretraining
For now, consider only input vectors x1, ..., xp where xk ∈ {0, 1}^n for all k = 1, ..., p.
Unsupervised pretraining: gradually, for every i = 1, ..., k, train the RBM Bi on randomly selected inputs from the training set Fi−1(x1), ..., Fi−1(xp) using the training algorithm for RBMs (here F0(xi) = xi). (Thus Bi learns from training samples transformed by the already pretrained layers 0, ..., i − 1.)
We obtain a deep belief network D representing a distribution given by x1, ..., xp. (Recall that in such a distribution the probability of a given x is equal to the relative frequency of x in x1, ..., xp.) 251

Deep belief network
The network D can be used to sample from the distribution as follows:
  Simulate the topmost RBM for some steps (ideally to thermal equilibrium); this gives the values of the neurons in the two topmost layers.
  Propagate the values downwards by always simulating one step of the corresponding RBM. That is, having computed the values of the neurons in layers k and k − 1, compute the values of the neurons in layer k − 2 by simulating one step of the RBM Bk−1: sample the values of the neurons in layer k − 2 using the RBM dynamics of Bk−1 with the values of layer k − 1 fixed. Similarly, compute the values of layer k − 3 by simulating Bk−2 ... etc. ... finally obtaining the values of the input neurons.
The probability with which a concrete input x is sampled by the above procedure is the probability of x in the distribution represented by D. 252

Deep MLP – training with pretraining
Now consider supervised learning with a training set T = {(xk, dk) | k = 1, ..., p}. Still assume that xk ∈ {0, 1}^n.
Unsupervised pretraining: gradually, for every i = 1, ..., k, train the RBM Bi on randomly selected inputs from the training set Fi−1(x1), ..., Fi−1(xp) using the training algorithm for RBMs (here F0(xi) = xi). (Thus Bi learns from training samples transformed by the already pretrained layers 0, ..., i − 1.) Obtain D.
Add one (or more) layer on top of D and consider the result to be an MLP (i.e. forget the RBM dynamics and start considering the network as an MLP with sigmoidal activations).
Supervised fine-tuning: train in supervised mode (on the training set T) using e.g. gradient descent + backprop. 253

Autoencoders
An autoencoder consists of two parts:
  φ : R^n → R^m, the encoder
  ψ : R^m → R^n, the decoder
The goal is to find φ, ψ so that ψ ∘ φ is (almost) the identity. The value h = φ(x) is called the latent representation of x. 254

Autoencoders – training
Assume T = {x1, ..., xp} where xi ∈ R^n for all i ∈ {1, ..., p}. Minimize the reconstruction error

  E = ∑_{i=1}^{p} ‖xi − ψ(φ(xi))‖² 255

Autoencoders – neural networks
Both φ and ψ can be represented using MLPs Mφ and Mψ, respectively; Mφ and Mψ can be connected into a single network. 256

Autoencoders – usage
Compression: from x to h. Dimensionality reduction: the latent representation h has a smaller dimension. Pretraining (next slides). Generative versions: (roughly) generate h from a known distribution and let Mψ generate realistic inputs x. 257

Autoencoder – compression – a historical implementation
Architecture: MLP 64−16−64. Activity: activation function: hyperbolic tangent with limits −1 and 1.
Data: images 256 × 256, 8 bits per pixel. Samples: the input and output is a frame of 8 × 8 pixels, randomly selected in the image. Inputs normalized to [−1, 1]. The goal was to compress images to a smaller data size. 258

Autoencoder – compression – a historical implementation
A frame of 8 × 8 pixels passes through the 256 × 256 image (no overlap). (Figure: (A) the original; (B) compression; (C) compression + rounding to 6 bits (1.5 bits per pixel); (D) compression + rounding to 4 bits (1 bit per pixel).) 259

Dimensionality reduction – compression
A new image (with the network trained on the previous one): (Figure: (A) the original; (B) compression; (C) compression + rounding to 6 bits (1.5 bits per pixel); (D) compression + rounding to 4 bits (1 bit per pixel).) 260

Application – dimensionality reduction
Dimensionality reduction: a mapping R from R^n to R^m where m < n such that for every example x, x can be "reconstructed" from R(x). The standard method: PCA (there are many linear as well as non-linear variants). 261

Reconstruction – PCA
(Figure: 1024 pixels compressed to 100 dimensions, i.e. 100 numbers.) 262

PCA vs autoencoders (figure) 263

Autoencoders – pretraining
An autoencoder is (pre)trained on input data xi without desired outputs (unsupervised), typically on much larger datasets of unlabelled data. The encoder Mφ computes a latent representation for every input vector; it is supposed to extract important features (controversial). A new part of the model Mtop is added on top of Mφ (e.g. an MLP taking the output of Mφ as its input). Subsequently, labels are added and the whole model (composed of Mφ and Mtop) is trained on the labelled data. 264

Autoencoders – pretraining (figure) 265
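A minimal NumPy sketch of a 64-16-64 autoencoder in the spirit of the historical implementation above, with a tanh encoder, a linear decoder (a simplifying assumption), and plain gradient descent on the reconstruction error; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 16                              # frame size -> latent size (64-16-64)
W1 = rng.standard_normal((m, n)) * 0.1     # encoder weights: h = tanh(W1 x)
W2 = rng.standard_normal((n, m)) * 0.1     # decoder weights: x_hat = W2 h

def step(x, eps=0.01):
    """One gradient-descent step on the reconstruction error |x_hat - x|^2."""
    global W1, W2
    h = np.tanh(W1 @ x)                    # latent representation h = phi(x)
    x_hat = W2 @ h                         # reconstruction psi(h)
    g_out = 2.0 * (x_hat - x)              # d error / d x_hat
    g_h = (W2.T @ g_out) * (1.0 - h**2)    # backprop through tanh
    W2 -= eps * np.outer(g_out, h)
    W1 -= eps * np.outer(g_h, x)
    return np.sum((x_hat - x) ** 2)

data = [rng.uniform(-1, 1, n) for _ in range(100)]   # stand-in 8x8 frames
for epoch in range(200):
    err = sum(step(x) for x in data)
print("reconstruction error:", err)
```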
Deep MLP – dimensionality reduction
Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started the whole deep learning craze ... 266

Deep MLP – dimensionality reduction (figure) 267

Images – pretraining
Data: 165,600 black-and-white images, 25 × 25, mean intensity 0, variance 1. The images were obtained from the Olivetti Faces database of 64 × 64 images using standard transformations. Split: 103,500 training, 20,700 validation, 41,400 test.
Network: 2000-1000-500-30, trained using layered RBMs.
Notes: Training of the lowest layer (2000 neurons): the pixel values are distorted using Gaussian noise; a low learning rate (0.001); 200 iterations. Training of all hidden layers: the values of the neurons are binary. Training of the output layer: the values are computed directly using the sigmoid activation functions + noise; that is, the values of the output neurons are from the interval [0, 1]. 268

Images – fine-tuning
The stochastic activation is substituted with a deterministic one: the value of a hidden neuron is not chosen randomly but computed directly by applying the sigmoid to the inner potential (this gives the mean activation). Backpropagation. Error function: the cross-entropy

  −∑_i pi ln p̂i − ∑_i (1 − pi) ln(1 − p̂i)

where pi is the intensity of the i-th pixel of the input and p̂i of the reconstruction. 269

Results
(Figure: 1. the original; 2. the reconstruction using deep networks (reduction to 30 dimensions); 3. the reconstruction using PCA (reduction to 30 dimensions).) 270

Generative adversarial networks
Generative Adversarial Nets, Goodfellow et al., NIPS 2014. An unsupervised generative model. Two networks:
  Generator: a network computing a function G : R^k → R^n which takes a random input z with a distribution pz (e.g. a multivariate normal distribution) and returns G(z), which should follow the target probability distribution. E.g. G(z) could be realistically looking faces.
  Discriminator: a network computing a function D : R^n → [0, 1] that, given x ∈ R^n, gives the probability D(x) that x is not "generated" by G. E.g. x can be an image and D(x) the probability that it is a true face of an existing person.
What error function will "motivate" G to generate realistically and D to discriminate appropriately? 271

Generative adversarial networks – error function
Let T = {x1, ..., xp} be a training multiset (or a minibatch). Intuition: G should produce outputs similar to the elements of T; D should recognize that its input is not from T. Generate a multiset of noise samples F = {z1, ..., zp} from the distribution pz and define

  E_{T,F}(G, D) = −(1/p) ∑_{i=1}^{p} [ ln D(xi) + ln(1 − D(G(zi))) ]

This is just the binary cross-entropy error of D, which classifies its input as either real or fake. The problem can be seen as a game: the discriminator wants to minimize E, the generator wants to maximize E! 272
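A minimal NumPy sketch of the error E_{T,F}(G, D), with G and D supplied as plain functions; the toy generator and discriminator below are illustrative stand-ins, not trained networks.

```python
import numpy as np

def gan_error(xs, zs, G, D, eps=1e-12):
    """E_{T,F}(G, D) = -(1/p) sum_i [ln D(x_i) + ln(1 - D(G(z_i)))].
    D wants this small (classify real vs. fake well); G wants it large."""
    real = np.array([np.log(D(x) + eps) for x in xs])
    fake = np.array([np.log(1.0 - D(G(z)) + eps) for z in zs])
    return -(real + fake).mean()

# Toy stand-ins: "data" in R^2, a crude generator and a logistic discriminator.
rng = np.random.default_rng(0)
xs = [rng.normal(loc=2.0, size=2) for _ in range(64)]        # real samples
zs = [rng.standard_normal(2) for _ in range(64)]             # noise samples
G = lambda z: z + 1.0                                        # crude generator
D = lambda x: 1.0 / (1.0 + np.exp(-(x.sum() - 3.0)))         # crude discriminator
print(gan_error(xs, zs, G, D))
```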
The learning algorithm
Denote by WG and WD the weights of G and D, respectively. In every iteration of the training, modify the weights of the discriminator and the generator as follows:
  For k steps (here k is a hyperparameter), update the discriminator: sample a minibatch T = {x1, ..., xm} from the training set T, sample a minibatch F = {z1, ..., zm} from the distribution pz, and update WD using gradient descent w.r.t. E:

    WD := WD − α · ∇_{WD} E_{T,F}(G, D)

  Now update the generator: sample a minibatch F = {z1, ..., zm} from the distribution pz and update the generator by gradient descent:

    WG := WG − α · ∇_{WG} [ (1/m) ∑_{i=1}^{m} ln(1 − D(G(zi))) ]

(The updates may also use momentum, an adaptive learning rate, etc.) 273

GAN MNIST (figure) 274

GAN faces
... from the original paper. (Figure.) 275

GAN refined
... after some refinements. ... none of these people ever lived. (Figure.) 276

Kohonen's Map 277

Vector quantization
Assume we are given a probability density function p(x) on input vectors x ∈ R^n, i.e. assume that the inputs are randomly generated according to p(x). Our goal is to approximate p(x) using finitely many centres wi ∈ R^n, i = 1, ..., h. Roughly speaking: we want more centres in areas of higher density and fewer in areas of low density.
Formally: to every input x we assign its closest centre w_{c(x)}:

  c(x) = arg min_{i=1,...,h} ‖x − wi‖

and then minimize the error

  E = ∫ ‖x − w_{c(x)}‖² p(x) dx

Caution! c(x) depends on x. 278

Vector quantization
In practice, p(x) is approximated by sampling uniformly from a given training (multi)set T = {xj ∈ R^n | j = 1, ..., ℓ}. The error then corresponds to

  E = (1/ℓ) ∑_{j=1}^{ℓ} ‖xj − w_{c(xj)}‖²

(keep in mind that c(xj) = arg min_{i=1,...,h} ‖xj − wi‖). If T has been randomly selected according to p(x) and is large enough, then

  (1/ℓ) ∑_{j=1}^{ℓ} ‖xj − w_{c(xj)}‖² ≈ ∫ ‖x − w_{c(x)}‖² p(x) dx 279

Example – image compression
Every pixel has 256 shades of grey; each pair of neighbouring pixels is a two-dimensional vector from {0, ..., 255} × {0, ..., 255}. The compression finds a small set of centres that encode the shades of grey of pairs of pixels; the image is then encoded by a simple substitution of pairs of pixels with their centres. 280

Example – image compression
(Figure: the pair distribution; a naive quantization; a smart quantization.) 281

k-means clustering algorithm
Assume a finite training set T = {xj ∈ R^n | j = 1, ..., ℓ}. The algorithm moves the centres closer to the centres of mass of their closest points.
In step t it computes w1(t), ..., wh(t) as follows:
  For every k = 1, ..., h, compute the set Tk of all vectors of T to which wk(t−1) is the closest centre:

    Tk = { xj ∈ T | k = arg min_{i=1,...,h} ‖xj − wi(t−1)‖ }

  Compute wk(t) as the centroid of Tk:

    wk(t) = (1/|Tk|) ∑_{x∈Tk} x

We may stop the computation when, e.g., the error E is sufficiently small. 282

Kohonen's learning
The k-means algorithm is not online. The following Kohonen's algorithm is online (i.e. the inputs may be generated one by one and the centres are adapted online). In step t, consider the input xt and compute wk(t) as follows:
  If wk(t−1) is the closest centre to xt, i.e. k = arg min_i ‖xt − wi(t−1)‖, then

    wk(t) = wk(t−1) + θ · (xt − wk(t−1))

  else wk(t) = wk(t−1).
Here 0 < θ ≤ 1 determines how much to move the centre towards the input. Let us formulate this algorithm in the language of neural networks. 283

Kohonen's learning – neural network
Architecture: a single layer of neurons y1, ..., yh, each fully connected to the inputs x1, ..., xn; neuron yk has the weight vector wk = (wk1, ..., wkn).
Activity: for an input x ∈ R^n and k = 1, ..., h:

  yk = 1 if k = arg min_{i=1,...,h} ‖x − wi‖, and 0 otherwise. 284

Kohonen's learning
In step t, consider the input xt and compute wk(t) as follows: if wk(t−1) is the closest neuron to xt, i.e. k = arg min_i ‖xt − wi(t−1)‖, then wk(t) = wk(t−1) + θ · (xt − wk(t−1)); else wk(t) = wk(t−1). Again, 0 < θ ≤ 1 determines how much to move the neuron towards the input. A sketch of both algorithms follows. 285
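A minimal NumPy sketch of both algorithms: one k-means step and the online Kohonen (winner-take-all) update; the data and parameters are illustrative.

```python
import numpy as np

def kmeans_step(W, X):
    """One k-means step: assign every x to its closest centre, then move
    each centre to the centroid of the points assigned to it."""
    closest = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2),
                        axis=1)
    for k in range(len(W)):
        if np.any(closest == k):               # empty clusters stay in place
            W[k] = X[closest == k].mean(axis=0)
    return W

def kohonen_step(W, x, theta=0.1):
    """Online update: only the winning centre moves towards the input."""
    c = np.argmin(np.linalg.norm(W - x, axis=1))   # winner c(x)
    W[c] += theta * (x - W[c])
    return W

rng = np.random.default_rng(0)
X = rng.random((200, 2))                       # inputs in the unit square
W = rng.random((5, 2))                         # 5 centres
for _ in range(10):
    W = kmeans_step(W, X)                      # batch variant
for x in X:
    W = kohonen_step(W, x)                     # online variant
```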
Kohonen's learning – efficiency
It works well if most input vectors are evenly distributed in a convex area. In the case of two (or more) separated clusters, the resulting density may not correspond to p(x) at all. Example: two separated areas with the same density. Assume that the centres are initially in one of the areas. The second area then "drags" only one of the centres (which always wins the competition). Result: one of the areas will be covered by a single centre even though it contains half of the mass of the input examples.
Solution: we tie the centres together so that they have to move together. 286

Kohonen's map
Architecture: a single layer, as before. Topological structure: the neurons are connected by edges so that they form the nodes of an undirected graph. In most cases, this structure is either a one-dimensional sequence or a two-dimensional grid. 287

Kohonen's map – illustration (figure) 288

Kohonen's map – bio motivation
(Figure.) Source: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996. 289

Kohonen's map
Activity: given an input vector x ∈ R^n and k = 1, ..., h:

  yk = 1 if k = arg min_{i=1,...,h} ‖x − wi‖, and 0 otherwise.

Learning: we use the topological structure. Denote by d(c, k) the length of the shortest path from neuron c to neuron k in the topological structure. For every neuron c and a given s ∈ N0, define the topological neighbourhood of the neuron c of size s:

  Ns(c) = { k | d(c, k) ≤ s }

In step t, given a training example xt, adapt wk as follows:

  wk(t) = wk(t−1) + θ · (xt − wk(t−1))   if k ∈ Ns(c(xt)),   and wk(t) = wk(t−1) otherwise,

where c(xt) = arg min_{i=1,...,h} ‖xt − wi(t−1)‖, and θ ∈ R and s ∈ N0 are parameters that may change during training. 290

Kohonen's map – learning
A more general version:

  wk(t) = wk(t−1) + Θ(c(xt), k) · (xt − wk(t−1))

where c(xt) = arg min_{i=1,...,h} ‖xt − wi(t−1)‖. The previous case then corresponds to Θ(c(xt), k) = θ for k ∈ Ns(c(xt)) and 0 otherwise. A smoother version:

  Θ(c(xt), k) = θ0 · exp( −d(c(xt), k)² / σ² )

where θ0 ∈ R is a learning rate and σ ∈ R is the width (both parameters may change during training). 291

Example 1: inputs uniformly distributed in a rectangle (figure). 292
Example 2: inputs uniformly distributed in a triangle (figure). 293
Example 3: inputs uniformly distributed in a cuboid (figure). 294
Example 4: inputs uniformly distributed in a cactus (figure). 295
(Image source for Examples 1-4: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996.)
Example – defect
A topological defect: a twisted network (figure). Image source: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996. 296

Kohonen's map – theory
Convergence to an "ordered" state has been proved only for one-dimensional maps and special cases of the distribution p(x) (uniform), fixed neighbourhoods of size 1, and a fixed learning rate. There are simple counterexamples disproving convergence in case these assumptions are not satisfied. In more than one dimension there are no guarantees at all; convergence depends on several factors: the initial distribution of the neurons (centres), the size of the neighbourhood, and the learning rate.
What dimension to choose? Typically a one- or two-dimensional map is used (as a coarse version of dimensionality reduction). 297
LVQ
Iterate over the training examples. For (x_t, d_t), find the closest neuron c:
c = arg min_{i=1,...,h} ‖x_t − w_i‖
Adjust the weights of c as follows:
w_c^(t) = w_c^(t−1) + α(x_t − w_c^(t−1)) if d_t = v_c,
w_c^(t) = w_c^(t−1) − α(x_t − w_c^(t−1)) if d_t ≠ v_c.
The parameter α should be small right from the beginning (approx. 0.01–0.02) and should steadily decrease to 0.
By Kohonen: the border between the classes should then be a good approximation of the Bayes decision boundary. What is that?
300
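One LVQ fine-tuning step, as an illustrative sketch continuing the previous two: it assumes W and the labels v_c come from the training and labelling sketches above, and that the caller decays α towards 0.

import numpy as np

def lvq_step(W, labels, x, d, alpha):
    # Closest neuron: c = arg min_i ||x_t - w_i||
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    if labels[c] == d:
        W[c] += alpha * (x - W[c])  # correct class: attract w_c towards x_t
    else:
        W[c] -= alpha * (x - W[c])  # wrong class: repel w_c away from x_t
    return W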
Bayes classifier
For simplicity, consider two classes C0 and C1 (e.g. A and O). Let P(Ci | x) be the probability that an object belongs to Ci given that it has features x. (E.g. P(A | (a, b)) is the probability that a fruit with weight a and diameter b is an apple.)
The Bayes classifier assigns to x the class Ci satisfying P(Ci | x) ≥ P(C1−i | x). Denote by R0 the set of all x satisfying P(C0 | x) ≥ P(C1 | x), and let R1 = ℝⁿ ∖ R0.
The Bayes classifier minimizes the error probability
P(x ∈ R0 ∧ C1) + P(x ∈ R1 ∧ C0).
The Bayes decision boundary is the boundary between the sets R0 and R1.
301

Bayes decision boundary vs LVQ
[figure] Image source: The Self-Organizing Map, Teuvo Kohonen, IEEE, 1990
302

Oceanographic data
Source: Patterns of ocean current variability on the West Florida Shelf using the self-organizing map. Y. Liu and R. H. Weisberg, Journal of Geophysical Research, 2005
Investigates currents in the ocean around Florida.
303

Oceanographic data
11 measuring stations, 3 depths (surface, bottom, in between).
Data: 2D velocity vectors of the current, measured every hour over 25,585 hours.
Thus we have 25,585 data samples of dimension 66 (11 stations × 3 depths × 2 velocity components).
Kohonen’s map: a 3 × 4 grid, neighbourhoods given by the Gaussian function Θ(c, k) = θ₀ · exp(−d(c, k)² / σ²) with shrinking width σ, and a linearly decreasing learning rate.
304

Oceanographic data
[figure]
305

Oceanographic data
[figures; crosses mark the winning neurons] Influenced by local fluctuations, but with an observable trend: winter – neurons 1–6 (south-east); summer – neurons 10–12 (north-west).
306

Grimm’s fairy tales
Source: Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map. T. Kohonen, T. Honkela and V. Pulkki, ICANN, 1995
Our goal is to visualize syntactic and semantic categories of words in fairy tales (depending on context).
Input: Grimm’s fairy tales (encoded, understandably, as a stream of 270-dimensional vectors):
triples of words (predecessor, key, successor), with every word in a triple encoded by a randomly generated 90-dimensional real vector.
Network: Kohonen’s map, 42 × 36 neurons, weights of the form w = (wp, wk, wn) where wp, wk, wn ∈ ℝ⁹⁰.
307

Grimm’s fairy tales
Learning: trained on triples of successive words in the fairy tales. The training set consisted of the 150 most common words, each with its "average" context. Training took approx. 1,000,000 iterations.
In the end, the 150 most common words were used to label the neurons: a word u labels a neuron with weights w = (wp, wk, wn) when wk is closest to the code of u.
308

Grimm’s fairy tales
[figure]
309

Course Summary
310

Great summary – models
We have considered several models of neural networks:
ADALINE (aka linear regression)
Multilayer Perceptron
Convolutional Networks
Recurrent Networks (LSTM)
Restricted Boltzmann Machines and Deep Belief Networks
Kohonen’s Maps
311

Great summary – algorithms
Gradient descent! The only exception were Kohonen’s maps (Kohonen learning). The gradient is computed using backpropagation: MLP, convolutional networks, recurrent networks (LSTM). (Simulations: RBM.)
312

Deeper thoughts
Most neural network models are universal approximators (i.e. capable of approximating any reasonable function), but it is difficult to find the appropriate configuration by hand; such a configuration can, however, be learned efficiently from data (without guarantees, of course).
Depth is stronger than size: deep networks represent functions more succinctly, but are harder to train. Do not forget the vanishing/exploding gradient problem!
Weight tying = the single most effective trick in the history of neural networks!
313