(Primitive) Mathematical Model of Neuron
σ
ξ
x1 x2 xn
y
1
Formal neuron
σ
ξ
x1 x2 xn
x0 = 1
y
w1 w2
· · ·
wn
w0
x1, . . . , xn real inputs
x0 special input, always 1
w0, w1, . . . , wn real weights
ξ = w0 + n
i=1 wi xi inner potential;
In general, other potentials are considered
(e.g. Gaussian), more on this in PV021.
y output deﬁned by y = σ(ξ)
where σ is an activation function.
We consider several activation functions.
e.g. linear threshold function
σ(ξ) = sgn(ξ) =
1 ξ ≥ 0 ;
0 ξ < 0.
2
Formal Neuron vs Linear Models
Both linear classiﬁer and linear (aﬃne) function are special cases of
the formal neuron.
If σ is a linear threshold function
σ(ξ) =
1 ξ ≥ 0 ;
0 ξ < 0.
we obtain a linear classiﬁer.
If σ is identity, i.e. σ(ξ) = ξ, we obtain a linear (aﬃne)
function.
Many more activation functions are used in neural networks!
3
Sigmoid Functions
Logistic sigmoid σ(ξ) =
1
1 + e−ξ
Others ...
4
Multilayer Perceptron (MLP)
Input
Hidden
Output
· · ·
· · ·
Neurons are organized in layers
(input layer, output layer, possibly
several hidden layers)
Layers are numbered from 0;
the input is 0-th
Neurons in the -th layer are connected
with all neurons in the + 1-th layer
Intuition: The network computes a function as follows: Assign input
values to the input neurons and 0 to the rest. Proceed upwards through
the layers, one layer per step. In the -th step consider output values of
neurons in − 1-th layer as inputs to neurons of the -th layer. Compute
output values of neurons in the -th layer.
5
Example
1010 1001
σ 01000110111 σ0001011011 1
σ
0101
1
−22 2 −2
1
−1
1
3
−2
Activation function: linear
threshold
σ(ξ) =
1 ξ ≥ 0 ;
0 ξ < 0.
6
Classical Example – ALVINN
One of the ﬁrst autonomous car
driving systems (in 90s)
ALVINN drives a car
The net has 30 × 32 = 960 input
neurons (the input space is R960
).
The value of each input captures
the shade of gray of
the corresponding pixel.
Output neurons indicate where to
turn (to the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html 7
A Bit of History
Perceptron (Rosenblatt et al, 1957)
Single layer (i.e. no hidden layers), the activation function
linear threshold
(i.e., a bit more general linear classiﬁer)
Perceptron learning algorithm
Used to recognize numbers
Adaline (Widrow & Hof, 1960)
Single layer, the activation function identity
(i.e., a bit more linear function)
Online version of the gradient descent
Used a new circuitry element called memristor which was able
to "remember"history of current in form of resistance
In both cases, the expressive power is rather limited – can express only
linear decision boundaries and linear (aﬃne) functions. 8
A Bit of History
One of the famous (counter)-examples: XOR
0
(0, 0)
1
(0, 1)
1
(0, 1)
0
(1, 1)
x1
x2
No perceptron can distinguish between ones and zeros.
9
XOR vs Multilayer Perceptron
0
(0, 0)
1
(0, 1)
1
(0, 1)
0
(1, 1)
P1 P2
x1
x2
σ1 σ 1
σ1
−22 2 −2
1
−1
1
3
−2
(Here σ is a linear threshold function.)
P1 : −1 + 2x1 + 2x2 = 0 P2 : 3 − 2x1 − 2x2 = 0
The output neuron performs an intersection of half-spaces.
10
Expressive Power of MLP
Cybenko’s theorem:
Two layer networks with a single output neuron and a single layer of
hidden neurons (with the logistic sigmoid as the activation function)
are able to
approximate with arbitrarily small error any "reasonable"
boundary
a given input is classiﬁed as 1 iﬀ the output value of the network is
≥ 1/2.
approximate with arbitrarily small error any "reasonable"
function with range (0, 1).
Here "reasonable" means that it is pretty tough to ﬁnd a function that is
not reasonable.
So multi-layer perceptrons are suﬀuciently powerful for any application.
But for a long time, at least throughout 60s and 70s, nobody well-known
knew any eﬃcient method for training multilayer networks!
... then the backpropagation appeared in 1986!
11
MLP – Notation
X set of input neurons
Y set of output neurons
Z set of all neurons (tedy X, Y ⊆ Z)
individual neurons are denoted by indices, e.g. i, j.
ξj is the inner potential of the neuron j when the computation
is ﬁnished.
yj is the output value of the neuron j when the computation is
ﬁnished.
(we formally assume y0 = 1)
wji is the weight of the arc from the neuron i to the neuron j.
j← is the set of all neurons from which there are edges to j
(i.e. j← is the layer directly below j)
j→ is the set of all neurons to which there are edges from j.
(i.e. j→
is the layer directly above j)
12
MLP – Notation
Inner potential of a neuron j:
ξj =
i∈j←
wji yi
for simplicity, the activation function of every neuron will be
the logistic sigmoid σ(ξ) = 1
1+e−ξ .
(We may of course consider logistic sigmoids with diﬀerent steepness
paramaters, or other sigmoidal functions, more in PV021.)
A value of a non-input neuron j ∈ Z \ X when the computation is
ﬁnished is yj = σ(ξj )
(yj is determined by weights w and a given input x, so it’s sometimes
written as yj [w](x) )
Fixing weights of all neurons, the network computes a function
F[w] : R|X|
→ R|Y |
as follows: Assign values of a given vector
x ∈ R|X|
to the input neurons, evaluate the network, then F[w](x)
is the vector of values of the output neurons.
Here we implicitly assume a ﬁxed orderings on input and output vectors.
13
MLP – Learning
Given a set D of training examples:
D = xk, dk k = 1, . . . , p
Here xk ∈ R|X| and dk ∈ R|Y |. We write dkj to denote the
value in dk corresponding to the output neuron j.
Least Squares Error Function: Let w be a vector of all
weights in the network.
E(w) =
p
k=1
Ek(w)
where
Ek(w) =
1
2
j∈Y
(yj [w](xk) − dkj )2
14
MLP – Learning Algorithm
Batch Learning – Gradient Descent:
The algorithm computes a sequence of weights w(0), w(1), . . ..
weights w(0) are initialized randomly close to 0
in the step t + 1 (here t = 0, 1, 2 . . .) is w(t+1) computed as
follows:
w
(t+1)
ji = w
(t)
ji + ∆w
(t)
ji
where
∆w
(t)
ji = −ε(t) ·
∂E
∂wji
(w(t)
)
is the weight change wji and 0 < ε(t) ≤ 1 is the learning
speed in the step t + 1.
Note that ∂E
∂wji
(w(t)
) is a component of E, i.e. the weight change in the step
t + 1 can be written as follows: w(t+1)
= w(t)
− ε(t) · E(w(t)
).
15
MLP – Gradient Computation
For every weight wji we have (obviously)
∂E
∂wji
=
p
k=1
∂Ek
∂wji
So now it suﬃces to compute ∂Ek
∂wji
, that is the error for a ﬁxed training
example (xk , dk ).
It holds that
∂Ek
∂wji
= δj · yj (1 − yj ) · yi
where
δj = yj − dkj pro j ∈ Y
δj =
r∈j→
δr · yr (1 − yr ) · wrj pro j ∈ Z (Y ∪ X)
(Here yr = y[w](xk ) where w are the current weights and xk is the input of the
k-th training example.)
16
Multilayer Perceptron – Backpropagation
So to compute all ∂E
∂wji
=
p
k=1
∂Ek
∂wji
:
Compute all ∂Ek
∂wji
= δj · yj (1 − yj ) · yi for every training example (xk , dk ):
Evaluate all values yi of neurons using the standard bottom-up
procedure with the input xk .
Compute δj using backpropagation through layers top-down :
Assign δj = yj − dkj for all j ∈ Y
In the layer , assuming that δr has been computed for all
neurons r in the layer + 1, compute
δj =
r∈j→
δr · yr (1 − yr ) · wrj
for all j from the -th layer.
Compute ∂Ek
∂wji
= δj · yj (1 − yj ) · yi for every weight wji .
17
Example
Assume w
(0)
30 = w
(0)
50 = w
(0)
41 = w
(0)
42 = w
(0)
54 = 1 and
w
(0)
40 = w
(0)
31 = w
(0)
32 = w
(0)
53 = −1. Consider a training set {((1, 0), 1)}.
Then
y1 = 1,
y2 = 0,
y3 = σ(w30 + w
(0)
31 · y1 + w
(0)
32 · y2) = 0.5,
y4 = 0.5,
y5 = 0.7310586.
δ5 = y5 − 1 = −0.268941,
δ4 = d5 · y5 · (1 − y5) · w
(0)
54 = −0.052877,
δ3 = 0.052877.
y1 y2
σ y31 σy4 1
σ y51
w41
w31 w32
w42
w53
w30
w54
w40
w50
∂E1
∂w53
= δ5 · y5 · (1 − y5) · y3 = −0.026438,
∂E1
∂w54
= δ5 · y5 · (1 − y5) · y4 = −0.026438,
∂E1
∂w31
= δ3 · y3 · (1 − y3) · y1 = 0.013219,
.... 18
Illustration of Gradient Descent – XOR
Source: Pattern Classiﬁcation (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork
19
Comments on Training Algorithm
Not guaranteed to converge to zero training error, may converge to
local optima or oscillate indeﬁnitely.
In practice, does converge to low error for many large networks on
real data.
Many epochs (thousands) may be required, hours or days of training
for large networks.
To avoid local-minima problems, run several trials starting with
diﬀerent random weights (random restarts).
Take results of trial with lowest training set error.
Build a committee of results from multiple trials (possibly
weighting votes by training set accuracy).
There are many more issues concerning learning eﬃciency (data
normalization, selection of activation functions, weight initialization,
training speed, eﬃciency of the gradient descent itself etc.) – see PV021.
20
Hidden Neurons Representations
Trained hidden neurons can be seen as newly constructed features.
E.g., in a two layer network used for classiﬁcation, the hidden layer transforms
the input so that important features become explicit (and hence the result may
become linearly separable).
Consider a two-layer MLP, 64-2-3 for classiﬁcation of letters (three
output neurons, each corresponds to one of the letters).
21
Over-Training
Due to their expressive power, neural networks are quite
sensitive to over-training.
Keep a hold-out validation set and test accuracy on it after
every epoch. Stop training when additional epochs actually
increase the validation error.
22
Over-Fitting – The Number of Hidden Neurons
Too few hidden neurons prevent the network from adequately
ﬁtting the data.
Too many hidden units can result in over-ﬁtting.
(There are advanced methods that prevent over-ﬁtting even for rich
models, such as regularization, where the error function penalizes
over-ﬁtting – see PV021.)
Use cross-validation to empirically determine an optimal
number of hidden units.
There are methods that automatically construct the structure of the
network based on data, they are not much used though.
23
Applications
Text to Speech
Fraud detection
ﬁnance & business predictions
HNC software company acquired by Fair Isaac for $3/4 billion
Game playing (backgammon is a classical example)
Handwriting recognition and image recognition in general.
This is the main area in which the current state-of-the-art deep networks
excel.
(artiﬁcial brain and intelligence)
...
24
ALVINN
25
ALVINN
Two layer MLP, 960 − 4 − 30 (sometimes 960 − 5 − 30)
Inputs correspond to pixels.
Sigmoidal activation function (logistic sigmoid).
Direction corresponds to the center of gravity.
I.e., output neurons are considered as points of mass evenly distributed
along a line. Weight of each neuron corresponds to its value.
26
ALVINN – Training
Trained while driving.
A camera captured the road from the front window, approx. 25
pictures per second
Training examples (xk, dk) where
xk = image of the road
dk ≈ corresponding direction of the steering wheel set by the
driver
the values dk computed using Gaussian distribution:
dki = e−D2
i /10
where Di is the distance between the i-th output from the one
that corresponds to the real direction of the steering wheel.
(This is better than the binary output because similar road
directions induce similar reaction of the driver.)
27
Selection of Training Examples
Naive approach: just take images from the camera.
Problems:
A too good driver never teaches the network how to solve
deviations from the right track. Couple of harsh solutions:
turn the learning oﬀ for a moment, deviate from the right
track, then turn on the learning and let the network learn how
to solve the situation.
let the driver go crazy! (a bit dangerous, expensive, unreliable)
Images are very similar (the network basically sees the road
from the right lane), may be overtrained.
28
Selection of Training Examples
Problem with too good driver were solved as follows:
every image of the road has been has been transformed to 15
slightly diﬀerent copies
Repetitiveness of images was solved as follows:
the system has a buﬀer of 200 images (including the 15 copies of
the current one), in every round trains on these images
afterwards, a new image is captured, 15 copies made, and these new
15 substitute 15 selected from the buﬀer (10 with the smallest
training error, 5 randomly)
29
ALVINN – Training
standard backpropagation
constant speed of learning (possibly diﬀerent for each neuron –
see PV021)
some other optimizations (see PV021)
Výsledek:
Training took 5 minutes, the speed was 4 miles per hour
(The speed was limited by the hydraulic controller of the steering wheel
not the learning algorithm.)
ALVINN was able to go through roads it never "seen" and in
diﬀerent weather
30
ALVINN – Weight Learning
round 0
round 10
round 20
round 50
h1 h2 h3 h4 h5
Here h1, . . . , h5 are values of hidden neurons.
31
Extensions and Directions (PV021)
Other types of learning inspired by neuroscience – Hebbian
learning
More biologically plausible models of neural networks – spiking
neurons
This goes into the direction of HUGE area of (computational)
neuroscience, only very lightly touched in PV021.
Unsupervised learning – Self-Organizing Maps
Reinforcement learning
learning to make decisions, or play games, sequentially
neural networks have been used – temporal diﬀerence learning
32
Deep Learning
Cybenko’s theorem shows that two-layer networks are omnipotent –
such results nearly killed NN when support vector machines were
found to be easier to train in 00’s.
Later, it has been shown (experimentally) that deep networks (with
many layers) have better represenational properties.
... but how to train them? The backpropagation suﬀers from
so-called vanishing gradient, intuitively, updates of weights in lower
layers are very slow.
In 2006 a solution was found by Hinton et al:
Use unsupervised methods to initialize the weights so that they
capture important features in data.
More precisely: The lowest hidden layer learns patterns in data,
second lowest learns patterns in data transformed through the ﬁrst
layer, and so on.
Then use a supervised learning algorithm to only ﬁne tune
the weights to the desired input-output behavior.
A rather heavy machinery is needed to develop this, but you will be
rewarded by insight into a very modern and expensive technology. 33
Deeper Insight into the Logistic Sigmoid
Consider a perceptron (that is a linear classiﬁer):
ξ = w0 +
n
i=1
wi · xi
and y = sgn(ξ) =
1 ξ ≥ 0
0 ξ < 0
.
Recall, that the signed distance from the decision boundary determined
by ξ = 0 is (here x = (x1, . . . , xn) and w = (w1, . . . , wn))
w0 +
n
i=1 wi · xi
n
i=1 w2
i
=
ξ
n
i=1 w2
i
This value is positive for x on the side of w and negative on the opposite.
For simplicity, assume that
n
i=1 w2
i = 1, and thus that the potential ξ
is equal to the signed distance of x from the boundary.
34
Deeper Insight into the Logistic Sigmoid
Assume that training examples (x, c(x)) are randomly generated.
Denote:
ξ1
mean signed distance from the boundary of points classiﬁed 1.
ξ0
mean signed distance from the boundary of points classiﬁed 0.
It is not unreasonable to assume that
conditioned on c = 1, the signed distance ξ is normally distributed
with the mean ξ1
and variance (for simplicity) 1,
conditioned on c = 0, the signed distance ξ is normally distributed
with the mean ξ0
and variance (for simplicity) 1.
(Notice that ξ may be negative, which means that such point is on the wrong
side of the boundary (the same for ξ > 0).)
Now, can we decide what is the probability of c = 1 given a distance?
35
Deeper Insight into the Logistic Sigmoid
For simplicity, assume that ξ1
= −ξ0
= 1/2.
P(1 | ξ) =
p(ξ | 1)P(1)
p(ξ | 1)P(1) + p(ξ | 0)P(0)
=
LR
LR + 1/clr
where
LR =
p(ξ | 1)
p(ξ | 0)
=
exp(−(ξ − 1/2)2
/2)
exp(−(ξ + 1/2)2/2)
= exp(ξ)
and
clr =
P(1)
P(0)
which we assume (for simplicity) = 1
So
P(1 | ξ) =
exp(ξ)
exp(ξ) + 1
=
1
1 + e−ξ
Thus the logistic sigmoid applied to ξ = w0 +
n
i=1 wi · xi gives
the probability of c = 1 given the input!
36
Deeper Insight into the Logistic Sigmoid
So if we use the logistic sigmoid as an activation function,
and turn the neuron into a classiﬁer as follows:
classify a given input x as 1 iﬀ y ≥ 1/2
Then the neuron basically works as the Bayes classiﬁer!
This is the basis of logistic regression.
Given training data, we may compute the weights w that maximize the
likelihood of the training data (w.r.t. the probabilities returned by the neuron).
An extremely interesting observation is that such w maximizing the likelihood
coincides with the minimum of least squares for the corresponding linear
function (that is the same neuron but with identity as the activation function).
37