Restricted Boltzmann machine (RBM)

Architecture: a neural network with cycles and symmetric connections; neurons are divided into two disjoint sets:
- $V$ – visible
- $H$ – hidden
Connections: $V \times H$ (complete bipartite graph).
$N$ is the set of all neurons. Denote by $\xi_j$ the inner potential and by $y_j$ the output (i.e. state) of neuron $j$.
State of the machine: $y \in \{0,1\}^{|N|}$.
Denote by $w_{ji} \in \mathbb{R}$ the weight of the connection from $i$ to $j$ (and thus also from $j$ to $i$).
Consider bias: $w_{j0}$ is the weight between $j$ and a neuron $0$ whose value $y_0$ is always $1$.

RBM – activity

Activity: states of neurons are initially set to values of $\{0,1\}$, i.e. $y_j^{(0)} \in \{0,1\}$ for $j \in N$.
In the step $t+1$ do the following:
- $t$ even: randomly choose new values of all hidden neurons; for every $j \in H$,
  $$P\left(y_j^{(t+1)} = 1\right) = \frac{1}{1 + \exp\left(-w_{j0} - \sum_{i \in V} w_{ji} y_i^{(t)}\right)}$$
- $t$ odd: randomly choose new values of all visible neurons; for every $j \in V$,
  $$P\left(y_j^{(t+1)} = 1\right) = \frac{1}{1 + \exp\left(-w_{j0} - \sum_{i \in H} w_{ji} y_i^{(t)}\right)}$$

Thermal equilibrium

Fix a temperature $T$ (i.e. $T(t) = T$ for $t = 1, 2, \ldots$).

Theorem. For every $\gamma^* \in \{0,1\}^{|N|}$ we have
$$\lim_{t \to \infty} P\left(y^{(t)} = \gamma^*\right) = \frac{1}{Z} e^{-E(\gamma^*)/T}$$
where
$$Z = \sum_{\gamma \in \{0,1\}^{|N|}} e^{-E(\gamma)/T}
\quad\text{and}\quad
E(\gamma) = -\sum_{i \in V,\, j \in H} w_{ji}\, y_j^{\gamma} y_i^{\gamma} - \sum_{i \in V} w_{i0}\, y_i^{\gamma} - \sum_{j \in H} w_{j0}\, y_j^{\gamma}.$$

Define $p_N(\gamma^*) := \lim_{t \to \infty} P\left(y^{(t)} = \gamma^*\right)$ for every $\gamma^* \in \{0,1\}^{|N|}$.

RBM – learning

Learning: let $p_d$ be a probability distribution on states of visible neurons, i.e. on $\{0,1\}^{|V|}$.
Our goal is to find a configuration of the network $W$ such that $p_V \approx p_d$. (Here $p_V$ is the distribution on states of visible neurons obtained from $p_N$ by summing out the hidden neurons.)
A suitable measure of the difference between the probability distributions $p_V$ and $p_d$ is the relative entropy weighted by the probabilities of states (Kullback-Leibler divergence):
$$E(W) = \sum_{\alpha \in \{0,1\}^{|V|}} p_d(\alpha) \ln \frac{p_d(\alpha)}{p_V(\alpha)}$$

RBM – learning

Minimize $E(W)$ using gradient descent, i.e. compute a sequence of weight matrices $W^{(0)}, W^{(1)}, \ldots$:
- initialize $W^{(0)}$ randomly, close to $0$
- in the step $t+1$ compute $W^{(t+1)}$ as follows:
  $$W_{ji}^{(t+1)} = W_{ji}^{(t)} + \Delta W_{ji}^{(t)}
  \quad\text{where}\quad
  \Delta W_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}\left(W^{(t)}\right)$$
  is the update of the weight $w_{ji}$ in the step $t+1$, and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.
It remains to compute $\frac{\partial E}{\partial w_{ji}}(W)$.

RBM – learning

For sufficiently large $t^*$ (i.e. in thermal equilibrium) we have
$$\frac{\partial E}{\partial w_{ji}} \approx -\frac{1}{T}\left(\left\langle y_j^{(t^*)} y_i^{(t^*)}\right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)}\right\rangle_{\text{free}}\right)$$
- $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{fixed}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the thermal equilibrium assuming that the values of visible neurons are fixed at the beginning of the computation according to $p_d$.
- $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the thermal equilibrium (no values fixed).
Problem: computation of $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ takes a long time.
$\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ can be estimated by $\langle y_j y_i\rangle_{\text{recon}}$, the expectation of $y_j^{(3)} y_i^{(3)}$ when the values of visible neurons are chosen according to $p_d$.

RBM – learning

Thus
$$\Delta w_{ji}^{(t)} = \varepsilon(t) \cdot \left(\langle y_j y_i\rangle_{\text{fixed}} - \langle y_j y_i\rangle_{\text{recon}}\right)$$
Compute $\langle y_j y_i\rangle_{\text{fixed}}$ as follows: let $Y := 0$ and repeat the following $q$ times:
- fix the values of visible neurons randomly according to $p_d$
- simulate one step of the computation, add $y_j y_i$ to $Y$
For a suitable $q$, $Y/q$ estimates $\langle y_j y_i\rangle_{\text{fixed}}$ well.
Compute $\langle y_j y_i\rangle_{\text{recon}}$ as follows: let $Y := 0$ and repeat $q$ times:
- choose the initial values of visible neurons according to $p_d$
- simulate three steps, add $y_j y_i$ to $Y$ (i.e. compute the values of the hidden neurons, then of the visible ones (reconstruction of the input), and then of the hidden neurons again)
For a suitable $q$, $Y/q$ estimates $\langle y_j y_i\rangle_{\text{recon}}$ well.
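The alternating update of hidden and visible neurons above is blockwise Gibbs sampling and can be made concrete in a few lines of NumPy. This is an illustrative sketch, not code from the lecture: the weight matrix W of shape (|H|, |V|), the bias vectors b_h and b_v (holding the weights $w_{j0}$), and the implicit temperature $T = 1$ are assumptions chosen to match the formulas above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """'Even' step: P(y_j = 1) = sigma(w_j0 + sum_{i in V} w_ji * y_i) for j in H."""
    p = sigmoid(b_h + W @ v)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, b_v, rng):
    """'Odd' step: P(y_j = 1) = sigma(w_j0 + sum_{i in H} w_ji * y_i) for j in V."""
    p = sigmoid(b_v + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p

Iterating the two steps is exactly the activity described above; by the theorem, after sufficiently many steps the state of the machine is distributed approximately according to $p_N$.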
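The update rule $\Delta w_{ji} = \varepsilon\,(\langle y_j y_i\rangle_{\text{fixed}} - \langle y_j y_i\rangle_{\text{recon}})$ together with the two estimation loops is one-step contrastive divergence. A possible sketch of one update over a batch of $q$ visible vectors drawn from $p_d$, reusing the sampling helpers from the previous sketch; updating W and the biases in place, and applying the same rule to the bias weights $w_{j0}$ (with $y_0 = 1$), are implementation choices, not something fixed by the slides.

def cd1_update(V_batch, W, b_h, b_v, eps, rng):
    """One contrastive-divergence (CD-1) update estimated from a batch of q
    visible vectors drawn from p_d.  W has shape (|H|, |V|)."""
    q = len(V_batch)
    dW = np.zeros_like(W)
    db_h = np.zeros_like(b_h)
    db_v = np.zeros_like(b_v)
    for v0 in V_batch:
        h0, _ = sample_hidden(v0, W, b_h, rng)   # one step with visible units fixed by p_d
        v1, _ = sample_visible(h0, W, b_v, rng)  # second step: reconstruction of the input
        h1, _ = sample_hidden(v1, W, b_h, rng)   # third step: hidden units again
        dW += np.outer(h0, v0) - np.outer(h1, v1)   # <y_j y_i>_fixed - <y_j y_i>_recon
        db_h += h0 - h1                             # bias weights w_j0: same rule with y_0 = 1
        db_v += v0 - v1
    W += eps * dW / q
    b_h += eps * db_h / q
    b_v += eps * db_v / q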
Deep MLP

(Figure: an MLP with an input layer $x_1, x_2$, hidden layers, and an output layer $y_1, y_2$.)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the $i$-th layer are connected with all neurons in the $(i+1)$-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Deep MLP

Denote
- $X$ the set of input neurons
- $Y$ the set of output neurons
- $Z$ the set of all neurons ($X, Y \subseteq Z$)
- individual neurons are denoted by indices $i$, $j$, etc.
- $\xi_j$ is the inner potential of the neuron $j$ after the computation stops
- $y_j$ is the output of the neuron $j$ after the computation stops (define $y_0 = 1$ as the value of the formal unit input)
- $w_{ji}$ is the weight of the connection from $i$ to $j$ (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron $j$)
- $j_{\leftarrow}$ is the set of all $i$ such that $j$ is adjacent from $i$ (i.e. there is an arc to $j$ from $i$)
- $j_{\rightarrow}$ is the set of all $i$ such that $j$ is adjacent to $i$ (i.e. there is an arc from $j$ to $i$)

Deep MLP – activity

- inner potential of neuron $j$: $\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$
- activation function $\sigma_j$ for neuron $j$ (arbitrary differentiable): $y_j = \sigma_j(\xi_j)$
A deep MLP is evaluated layer-wise; for each $j \in Y$, $y_j(w, x)$ is the value of the output neuron $j$ after evaluating the network with weights $w$ and input $x$ (a code sketch of this evaluation is given after the SGD part below).

Deep MLP – learning

Given a training set $\mathcal{T}$ of the form $\{(x_k, d_k) \mid k = 1, \ldots, p\}$. Here, every $x_k \in \mathbb{R}^{|X|}$ is an input vector and every $d_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron $j$ for a given network input $x_k$ (the vector $d_k$ can be written as $(d_{kj})_{j \in Y}$).

Error function – mean squared error (for example):
$$E(w) = \frac{1}{p} \sum_{k=1}^{p} E_k(w) \quad\text{where}\quad E_k(w) = \frac{1}{2} \sum_{j \in Y} \left(y_j(w, x_k) - d_{kj}\right)^2$$
Other error functions such as cross-entropy are possible.

Deep MLP – SGD

The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
- weights in $w^{(0)}$ are randomly initialized to values close to 0
- in the step $t+1$ (here $t = 0, 1, 2, \ldots$), weights $w^{(t+1)}$ are computed as follows:
  - choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
  - compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where
    $$\Delta w^{(t)} = -\varepsilon(t) \cdot \frac{1}{|T|} \sum_{k \in T} \nabla E_k\left(w^{(t)}\right)$$
  - here $T$ is a minibatch (of a fixed size), $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$, and $\nabla E_k(w^{(t)})$ is the gradient of the error of the example $k$
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. An epoch consists of one pass through all data.
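The layer-wise evaluation from the activity part ($\xi_j = \sum_i w_{ji} y_i$, $y_j = \sigma_j(\xi_j)$) can be sketched directly. Illustrative NumPy code, not from the lecture: it assumes the logistic sigmoid for every neuron and stores the bias weights $w_{j0}$ in column 0 of each weight matrix (the formal unit input $y_0 = 1$).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(Ws, x):
    """Evaluate an MLP layer by layer.

    Ws[l] has shape (n_{l+1}, n_l + 1); column 0 holds the bias weights w_j0,
    i.e. the weights from the formal unit input y_0 = 1.
    Returns the list of layer outputs (y values), input first, output last.
    """
    ys = [np.asarray(x, dtype=float)]
    for W in Ws:
        y_prev = np.concatenate(([1.0], ys[-1]))   # prepend the formal unit input
        xi = W @ y_prev                            # inner potentials of the next layer
        ys.append(sigmoid(xi))                     # y_j = sigma(xi_j)
    return ys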
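The SGD scheme above (shuffle the data, slice it into minibatches, average the per-example gradients) amounts to a short skeleton. The function name sgd and the parameter layout are assumptions; grad_Ek stands for a routine computing $\nabla E_k$ by backpropagation (a sketch of such a routine is given later, after the pretraining part), and the learning rate is kept constant for simplicity instead of the general $\varepsilon(t)$.

def sgd(Ws, data, grad_Ek, eps, minibatch_size, epochs, rng):
    """Minibatch SGD over a list of (x_k, d_k) pairs.

    grad_Ek(Ws, x_k, d_k) is assumed to return the gradient of E_k as a list
    of arrays with the same shapes as Ws (e.g. computed by backpropagation).
    """
    for _ in range(epochs):                       # one epoch = one pass through the data
        order = rng.permutation(len(data))        # random shuffle of all data
        for start in range(0, len(order), minibatch_size):
            batch = [data[k] for k in order[start:start + minibatch_size]]
            grads = [grad_Ek(Ws, x, d) for x, d in batch]
            for l in range(len(Ws)):              # w <- w - eps * average gradient
                Ws[l] -= eps * sum(g[l] for g in grads) / len(batch)
    return Ws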
Why deep networks

... if one hidden layer is able to represent an arbitrary (reasonable) function?
- One hidden layer may be very inefficient, i.e. a huge number of neurons may be needed. One can show that the number of hidden neurons may be exponential w.r.t. the dimension of the input; networks with multiple layers may be exponentially more succinct than networks with a single hidden layer.
... ok, so let's try to train deep networks ... using backpropagation?
Problems:
- The gradient may vanish/explode when backpropagated through many layers.
- Deep networks (with many neurons) overfit very easily.

Deep MLP – vanishing gradient

For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \ldots, p$:
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
The factor $\sigma_r'(\xi_r) \cdot w_{rj}$ is less than one for the standard logistic sigmoid and weights initialized close to 0, so the gradient shrinks with every layer through which it is backpropagated.

Deep MLP – pretraining

Assume $k$ layers. Denote by
- $W_i$ the weight matrix between layers $i-1$ and $i$
- $F_i$ the function computed by the "lower" part of the MLP consisting of layers $0, 1, \ldots, i$
  ($F_1$ is the function computed by the input layer and the first hidden layer, which is now considered as the output layer)

Crucial observation: for every $i$, the layers $i-1$ and $i$ together with the matrix $W_i$ can be considered as an RBM (assume $T = 1$). Denote such an RBM by $B_i$.

Deep MLP – pretraining

For now, consider only input vectors $x_1, \ldots, x_p$ where $x_k \in \{0,1\}^n$ for all $k = 1, \ldots, p$.

Unsupervised pretraining: gradually, for every $i = 1, \ldots, k$, train the RBM $B_i$ on randomly selected inputs from the training set
$$F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$$
using the training algorithm for RBMs (here $F_0(x_k) = x_k$). Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i-1$. (A sketch of this loop is given at the end of this part.)

We obtain a deep belief network $D$ representing a distribution given by $x_1, \ldots, x_p$. (Recall that in such a distribution the probability of a given $x$ is equal to the relative frequency of $x$ in $x_1, \ldots, x_p$.)

Deep belief network

The network $D$ can be used to sample from the distribution as follows:
- Simulate the topmost RBM for some steps (ideally to thermal equilibrium); this gives the values of the neurons in the two topmost layers.
- Propagate the values downwards by always simulating one step of the corresponding RBM. That is, once the values of the neurons in layers $k$ and $k-1$ have been computed, compute the values of the neurons in layer $k-2$ by simulating one step of the RBM $B_{k-1}$, i.e. sample the values of the neurons in layer $k-2$ using the RBM dynamics of $B_{k-1}$ with the values of layer $k-1$ fixed. Similarly, compute the values of layer $k-3$ by simulating $B_{k-2}$, etc., and finally obtain the values of the input neurons.
The probability with which a concrete input $x$ is sampled by the above procedure is the probability of $x$ in the distribution represented by $D$.

Deep MLP – training with pretraining

Now consider supervised learning with a training set
$$\mathcal{T} = \{(x_k, d_k) \mid k = 1, \ldots, p\}.$$
Still assume that $x_k \in \{0,1\}^n$.
- Unsupervised pretraining: gradually, for every $i = 1, \ldots, k$, train the RBM $B_i$ on randomly selected inputs from the training set $F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$ using the training algorithm for RBMs (here $F_0(x_k) = x_k$). (Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i-1$.) Obtain $D$.
- Add one (or more) layer to the top of $D$ and consider the result to be an MLP (i.e. forget the RBM dynamics and start considering the network as an MLP with sigmoidal activations).
- Supervised fine-tuning: train in supervised mode (on the training set $\mathcal{T}$) using e.g. gradient descent + backpropagation.
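The greedy layer-wise procedure above (train $B_1$ on the raw inputs, then train each $B_i$ on the data transformed by the already pretrained layers) can be sketched with the RBM helpers from the first part (sample_hidden, cd1_update). This is only a sketch under those assumptions; the number of CD iterations rbm_epochs, the batch size q, and the small random initialisation are illustrative choices, not values from the lecture.

def pretrain_stack(layer_sizes, X, q, eps, rbm_epochs, rng):
    """Greedy layer-wise pretraining of RBMs B_1, ..., B_k.

    layer_sizes = [n_0, n_1, ..., n_k]; X has shape (p, n_0) with entries in {0, 1}.
    Returns the trained (W, b_h, b_v) triple of each RBM B_i.
    """
    rbms, H = [], X
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_hid, n_vis))   # initialized close to 0
        b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)
        for _ in range(rbm_epochs):
            batch = H[rng.choice(len(H), size=q)]        # random samples F_{i-1}(x_k)
            cd1_update(batch, W, b_h, b_v, eps, rng)
        rbms.append((W, b_h, b_v))
        # transform the training inputs by the freshly trained layer: F_i(x_k)
        H = np.stack([sample_hidden(v, W, b_h, rng)[0] for v in H])
    return rbms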
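For the supervised fine-tuning step, the backpropagation formulas from the vanishing-gradient part translate into a gradient routine compatible with the forward() and sgd() sketches above. Again an illustrative sketch: the mean squared error $E_k$ and logistic sigmoids are assumed, and the variable names are my own.

def sigmoid_prime(y):
    """Derivative of the logistic sigmoid expressed through its output y."""
    return y * (1.0 - y)

def grad_Ek(Ws, x, d):
    """Gradient of E_k = 1/2 * sum_j (y_j - d_kj)^2 computed by backpropagation."""
    ys = forward(Ws, x)
    dE_dy = ys[-1] - np.asarray(d, dtype=float)     # dE_k/dy_j = y_j - d_kj on the output layer
    grads = []
    for l in range(len(Ws) - 1, -1, -1):
        delta = dE_dy * sigmoid_prime(ys[l + 1])    # dE_k/dy_j * sigma'(xi_j)
        y_prev = np.concatenate(([1.0], ys[l]))
        grads.append(np.outer(delta, y_prev))       # dE_k/dw_ji = delta_j * y_i
        # the backward step multiplies by sigma'(xi_r) * w_rj -- the factor that is
        # less than one for the logistic sigmoid with small weights (vanishing gradient)
        dE_dy = Ws[l][:, 1:].T @ delta              # dE_k/dy_j for the previous layer
    return grads[::-1]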
Application – dimensionality reduction

Dimensionality reduction: a mapping $R$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ where $m < n$, such that for every example $x$ the original $x$ can be "reconstructed" from $R(x)$.
Standard method: PCA (there are many linear as well as non-linear variants).

Reconstruction – PCA

(Figure: 1024 pixels compressed to 100 dimensions, i.e. 100 numbers.)

Autoencoders

Dimensionality reduction using an MLP. Consider an MLP $n$-$m$-$n$ where $m \ll n$, with the same vector on the input as well as on the output.
Dimensionality reduction:
- Encoding: compute the values of the hidden neurons.
- Reconstruction: compute the values of the output neurons given the values of the hidden neurons.
Can also be used for compression (in communication). One can show that if linear neurons are used, the method implements PCA.

Autoencoder – historical implementation

Architecture: MLP 64-16-64
Activity: activation function: hyperbolic tangent with limits $-1$ and $1$
Learning:
- Data: images $256 \times 256$, 8 bits per pixel.
- Samples: input and output is a frame $8 \times 8$, randomly selected in the image. Inputs normalized to $[-1, 1]$.
- Learning: backpropagation; learning rate 0.01 for the hidden layer, 0.1 for the output layer; trained for 50,000-100,000 iterations.
The goal was to compress images to a smaller data size.

Dimensionality reduction – compression

A frame $8 \times 8$ passes through the image $256 \times 256$ (no overlap).
(Figure: (A) original, (B) compression, (C) compression + rounding to 6 bits (1.5 bits per pixel), (D) compression + rounding to 4 bits (1 bit per pixel).)

Dimensionality reduction – compression

New image (the network trained on the previous one):
(Figure: (A) original, (B) compression, (C) compression + rounding to 6 bits (1.5 bits per pixel), (D) compression + rounding to 4 bits (1 bit per pixel).)

Deep MLP – dimensionality reduction

Hinton, G. E., Osindero, S., and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started the whole deep learning craze ...

Images – pretraining

Data: 165,600 grayscale images, $25 \times 25$, mean intensity 0, variance 1. Images obtained from the Olivetti Faces database of $64 \times 64$ images using standard transformations. 103,500 training, 20,700 validation, 41,400 test images.
Network: 2000-1000-500-30, trained as layered RBMs.
Notes:
- Training of the lowest layer (2000 neurons): values of pixels distorted using Gaussian noise, low learning rate 0.001, 200 iterations.
- Training of all hidden layers: values of neurons are binary.
- Training of the output layer (30 neurons): values computed directly using the sigmoid activation function + noise; that is, the values of the output neurons are from the interval $[0, 1]$.

Images – fine-tuning

Stochastic activation is substituted with a deterministic one: the value of a hidden neuron is not chosen randomly but computed directly by applying the sigmoid to the inner potential (this gives the mean activation). Training by backpropagation.
Error function: cross-entropy
$$-\sum_i p_i \ln \hat{p}_i - \sum_i (1 - p_i) \ln(1 - \hat{p}_i)$$
where $p_i$ is the intensity of the $i$-th pixel of the input and $\hat{p}_i$ of the reconstruction.

Results

(Figure: 1. original, 2. reconstruction using deep networks (reduction to 30 dimensions), 3. reconstruction using PCA (reduction to 30 dimensions).)
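To make the fine-tuning objective concrete, the cross-entropy reconstruction error and the encode/reconstruct split of an autoencoder can be sketched on top of the forward() helper from the MLP part. The function names, the bottleneck index code_layer, and the clipping constant are illustrative assumptions, not part of the lecture.

def cross_entropy(p, p_hat, tiny=1e-12):
    """Reconstruction error  -sum_i p_i ln p_hat_i - sum_i (1 - p_i) ln(1 - p_hat_i)."""
    p_hat = np.clip(p_hat, tiny, 1.0 - tiny)       # guard against log(0)
    return -np.sum(p * np.log(p_hat) + (1.0 - p) * np.log(1.0 - p_hat))

def encode_decode(Ws, x, code_layer):
    """Evaluate an n-m-...-n autoencoder and return (code, reconstruction)."""
    ys = forward(Ws, x)            # layer-wise evaluation as in the MLP sketch
    return ys[code_layer], ys[-1]

In the face experiment above, Ws would correspond to an encoder 625-2000-1000-500-30 ($25 \times 25 = 625$ input pixels) followed by a mirrored decoder producing the reconstruction $\hat{p}$, and fine-tuning would minimise cross_entropy(x, reconstruction) by backpropagation.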