Restricted Boltzmann machine (RBM)

Architecture: a neural network with cycles and symmetric connections; neurons are divided into two disjoint sets:
- $V$ – visible
- $H$ – hidden
Connections: $V \times H$ (complete bipartite graph).
$N$ is the set of all neurons. Denote by $\xi_j$ the inner potential and by $y_j$ the output (i.e. state) of neuron $j$.
State of the machine: $y \in \{0,1\}^{|N|}$.
Denote by $w_{ji} \in \mathbb{R}$ the weight of the connection from $i$ to $j$ (and thus also from $j$ to $i$).
Consider bias: $w_{j0}$ is the weight between $j$ and a neuron $0$ whose value $y_0$ is always $1$.

RBM – activity

Activity: states of neurons are initially set to values of $\{0,1\}$, i.e. $y_j^{(0)} \in \{0,1\}$ for $j \in N$.
In the step $t+1$ do the following:
- $t$ even: randomly choose new values of all hidden neurons; for every $j \in H$,
  $$P\left(y_j^{(t+1)} = 1\right) = \frac{1}{1 + \exp\left(-w_{j0} - \sum_{i \in V} w_{ji} y_i^{(t)}\right)}$$
- $t$ odd: randomly choose new values of all visible neurons; for every $j \in V$,
  $$P\left(y_j^{(t+1)} = 1\right) = \frac{1}{1 + \exp\left(-w_{j0} - \sum_{i \in H} w_{ji} y_i^{(t)}\right)}$$

Thermal equilibrium

Fix a temperature $T$ (i.e. $T(t) = T$ for $t = 1, 2, \ldots$).

Theorem. For every $\gamma^* \in \{0,1\}^{|N|}$ we have
$$\lim_{t \to \infty} P\left(y^{(t)} = \gamma^*\right) = \frac{1}{Z} e^{-E(\gamma^*)/T}$$
where
$$Z = \sum_{\gamma \in \{0,1\}^{|N|}} e^{-E(\gamma)/T}
\quad\text{and}\quad
E(\gamma) = -\sum_{i \in V,\, j \in H} w_{ji}\, y_j^{\gamma} y_i^{\gamma} - \sum_{i \in V} w_{i0}\, y_i^{\gamma} - \sum_{j \in H} w_{j0}\, y_j^{\gamma}.$$

Define $p_N(\gamma^*) := \lim_{t \to \infty} P\left(y^{(t)} = \gamma^*\right)$ for every $\gamma^* \in \{0,1\}^{|N|}$.

RBM – learning

Learning: let $p_d$ be a probability distribution on states of visible neurons, i.e. on $\{0,1\}^{|V|}$.
Our goal is to find a configuration of the network $W$ such that $p_V \approx p_d$. (Here $p_V$ is the distribution on states of visible neurons obtained from $p_N$ by summing out the hidden neurons.)
A suitable measure of the difference between the probability distributions $p_V$ and $p_d$ is the relative entropy weighted by the probabilities of states (Kullback-Leibler divergence):
$$E(W) = \sum_{\alpha \in \{0,1\}^{|V|}} p_d(\alpha) \ln \frac{p_d(\alpha)}{p_V(\alpha)}$$

RBM – learning

Minimize $E(W)$ using gradient descent, i.e. compute a sequence of weight matrices $W^{(0)}, W^{(1)}, \ldots$:
- initialize $W^{(0)}$ randomly, close to $0$
- in the step $t+1$ compute $W^{(t+1)}$ as follows:
  $$W_{ji}^{(t+1)} = W_{ji}^{(t)} + \Delta W_{ji}^{(t)}
  \quad\text{where}\quad
  \Delta W_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}\left(W^{(t)}\right)$$
  is the update of the weight $w_{ji}$ in the step $t+1$, and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.
It remains to compute $\frac{\partial E}{\partial w_{ji}}(W)$.

RBM – learning

For sufficiently large $t^*$ (i.e. in thermal equilibrium) we have
$$\frac{\partial E}{\partial w_{ji}} \approx -\frac{1}{T}\left(\left\langle y_j^{(t^*)} y_i^{(t^*)}\right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)}\right\rangle_{\text{free}}\right)$$
- $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{fixed}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the thermal equilibrium assuming that the values of visible neurons are fixed at the beginning of the computation according to $p_d$.
- $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the thermal equilibrium (no values fixed).
Problem: computation of $\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ takes a long time.
$\langle y_j^{(t^*)} y_i^{(t^*)}\rangle_{\text{free}}$ can be estimated by $\langle y_j y_i\rangle_{\text{recon}}$, the expectation of $y_j^{(3)} y_i^{(3)}$ when the values of visible neurons are chosen according to $p_d$.

RBM – learning

Thus
$$\Delta w_{ji}^{(t)} = \varepsilon(t) \cdot \left(\langle y_j y_i\rangle_{\text{fixed}} - \langle y_j y_i\rangle_{\text{recon}}\right)$$
Compute $\langle y_j y_i\rangle_{\text{fixed}}$ as follows: let $Y := 0$ and repeat the following $q$ times:
- fix the values of visible neurons randomly according to $p_d$
- simulate one step of the computation, add $y_j y_i$ to $Y$
For a suitable $q$, $Y/q$ estimates $\langle y_j y_i\rangle_{\text{fixed}}$ well.
Compute $\langle y_j y_i\rangle_{\text{recon}}$ as follows: let $Y := 0$ and repeat $q$ times:
- choose the initial values of visible neurons according to $p_d$
- simulate three steps, add $y_j y_i$ to $Y$ (i.e. compute the values of the hidden neurons, then of the visible ones (reconstruction of the input), and then of the hidden neurons again)
For a suitable $q$, $Y/q$ estimates $\langle y_j y_i\rangle_{\text{recon}}$ well.
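The alternating update of hidden and visible neurons above is blockwise Gibbs sampling and can be made concrete in a few lines of NumPy. This is an illustrative sketch, not code from the lecture: the weight matrix W of shape (|H|, |V|), the bias vectors b_h and b_v (holding the weights $w_{j0}$), and the implicit temperature $T = 1$ are assumptions chosen to match the formulas above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """'Even' step: P(y_j = 1) = sigma(w_j0 + sum_{i in V} w_ji * y_i) for j in H."""
    p = sigmoid(b_h + W @ v)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, b_v, rng):
    """'Odd' step: P(y_j = 1) = sigma(w_j0 + sum_{i in H} w_ji * y_i) for j in V."""
    p = sigmoid(b_v + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p

Iterating the two steps is exactly the activity described above; by the theorem, after sufficiently many steps the state of the machine is distributed approximately according to $p_N$.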
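The update rule $\Delta w_{ji} = \varepsilon\,(\langle y_j y_i\rangle_{\text{fixed}} - \langle y_j y_i\rangle_{\text{recon}})$ together with the two estimation loops is one-step contrastive divergence. A possible sketch of one update over a batch of $q$ visible vectors drawn from $p_d$, reusing the sampling helpers from the previous sketch; updating W and the biases in place, and applying the same rule to the bias weights $w_{j0}$ (with $y_0 = 1$), are implementation choices, not something fixed by the slides.

def cd1_update(V_batch, W, b_h, b_v, eps, rng):
    """One contrastive-divergence (CD-1) update estimated from a batch of q
    visible vectors drawn from p_d.  W has shape (|H|, |V|)."""
    q = len(V_batch)
    dW = np.zeros_like(W)
    db_h = np.zeros_like(b_h)
    db_v = np.zeros_like(b_v)
    for v0 in V_batch:
        h0, _ = sample_hidden(v0, W, b_h, rng)   # one step with visible units fixed by p_d
        v1, _ = sample_visible(h0, W, b_v, rng)  # second step: reconstruction of the input
        h1, _ = sample_hidden(v1, W, b_h, rng)   # third step: hidden units again
        dW += np.outer(h0, v0) - np.outer(h1, v1)   # <y_j y_i>_fixed - <y_j y_i>_recon
        db_h += h0 - h1                             # bias weights w_j0: same rule with y_0 = 1
        db_v += v0 - v1
    W += eps * dW / q
    b_h += eps * db_h / q
    b_v += eps * db_v / q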
Deep MLP

(Figure: an MLP with an input layer $x_1, x_2$, hidden layers, and an output layer $y_1, y_2$.)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the $i$-th layer are connected with all neurons in the $(i+1)$-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Deep MLP

Denote
- $X$ the set of input neurons
- $Y$ the set of output neurons
- $Z$ the set of all neurons ($X, Y \subseteq Z$)
- individual neurons are denoted by indices $i$, $j$, etc.
- $\xi_j$ is the inner potential of the neuron $j$ after the computation stops
- $y_j$ is the output of the neuron $j$ after the computation stops (define $y_0 = 1$ as the value of the formal unit input)
- $w_{ji}$ is the weight of the connection from $i$ to $j$ (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron $j$)
- $j_{\leftarrow}$ is the set of all $i$ such that $j$ is adjacent from $i$ (i.e. there is an arc to $j$ from $i$)
- $j_{\rightarrow}$ is the set of all $i$ such that $j$ is adjacent to $i$ (i.e. there is an arc from $j$ to $i$)

Deep MLP – activity

- inner potential of neuron $j$: $\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$
- activation function $\sigma_j$ for neuron $j$ (arbitrary differentiable): $y_j = \sigma_j(\xi_j)$
A deep MLP is evaluated layer-wise; for each $j \in Y$, $y_j(w, x)$ is the value of the output neuron $j$ after evaluating the network with weights $w$ and input $x$ (a code sketch of this evaluation is given after the SGD part below).

Deep MLP – learning

Given a training set $\mathcal{T}$ of the form $\{(x_k, d_k) \mid k = 1, \ldots, p\}$. Here, every $x_k \in \mathbb{R}^{|X|}$ is an input vector and every $d_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron $j$ for a given network input $x_k$ (the vector $d_k$ can be written as $(d_{kj})_{j \in Y}$).

Error function – mean squared error (for example):
$$E(w) = \frac{1}{p} \sum_{k=1}^{p} E_k(w) \quad\text{where}\quad E_k(w) = \frac{1}{2} \sum_{j \in Y} \left(y_j(w, x_k) - d_{kj}\right)^2$$
Other error functions such as cross-entropy are possible.

Deep MLP – SGD

The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
- weights in $w^{(0)}$ are randomly initialized to values close to 0
- in the step $t+1$ (here $t = 0, 1, 2, \ldots$), weights $w^{(t+1)}$ are computed as follows:
  - choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
  - compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where
    $$\Delta w^{(t)} = -\varepsilon(t) \cdot \frac{1}{|T|} \sum_{k \in T} \nabla E_k\left(w^{(t)}\right)$$
  - here $T$ is a minibatch (of a fixed size), $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$, and $\nabla E_k(w^{(t)})$ is the gradient of the error of the example $k$
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. An epoch consists of one pass through all data.
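The layer-wise evaluation from the activity part ($\xi_j = \sum_i w_{ji} y_i$, $y_j = \sigma_j(\xi_j)$) can be sketched directly. Illustrative NumPy code, not from the lecture: it assumes the logistic sigmoid for every neuron and stores the bias weights $w_{j0}$ in column 0 of each weight matrix (the formal unit input $y_0 = 1$).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(Ws, x):
    """Evaluate an MLP layer by layer.

    Ws[l] has shape (n_{l+1}, n_l + 1); column 0 holds the bias weights w_j0,
    i.e. the weights from the formal unit input y_0 = 1.
    Returns the list of layer outputs (y values), input first, output last.
    """
    ys = [np.asarray(x, dtype=float)]
    for W in Ws:
        y_prev = np.concatenate(([1.0], ys[-1]))   # prepend the formal unit input
        xi = W @ y_prev                            # inner potentials of the next layer
        ys.append(sigmoid(xi))                     # y_j = sigma(xi_j)
    return ys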
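The SGD scheme above (shuffle the data, slice it into minibatches, average the per-example gradients) amounts to a short skeleton. The function name sgd and the parameter layout are assumptions; grad_Ek stands for a routine computing $\nabla E_k$ by backpropagation (a sketch of such a routine is given later, after the pretraining part), and the learning rate is kept constant for simplicity instead of the general $\varepsilon(t)$.

def sgd(Ws, data, grad_Ek, eps, minibatch_size, epochs, rng):
    """Minibatch SGD over a list of (x_k, d_k) pairs.

    grad_Ek(Ws, x_k, d_k) is assumed to return the gradient of E_k as a list
    of arrays with the same shapes as Ws (e.g. computed by backpropagation).
    """
    for _ in range(epochs):                       # one epoch = one pass through the data
        order = rng.permutation(len(data))        # random shuffle of all data
        for start in range(0, len(order), minibatch_size):
            batch = [data[k] for k in order[start:start + minibatch_size]]
            grads = [grad_Ek(Ws, x, d) for x, d in batch]
            for l in range(len(Ws)):              # w <- w - eps * average gradient
                Ws[l] -= eps * sum(g[l] for g in grads) / len(batch)
    return Ws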
Why deep networks

... if one hidden layer is able to represent an arbitrary (reasonable) function?
- One hidden layer may be very inefficient, i.e. a huge number of neurons may be needed. One can show that the number of hidden neurons may be exponential w.r.t. the dimension of the input; networks with multiple layers may be exponentially more succinct than networks with a single hidden layer.
... ok, so let's try to train deep networks ... using backpropagation?
Problems:
- The gradient may vanish/explode when backpropagated through many layers.
- Deep networks (with many neurons) overfit very easily.

Deep MLP – vanishing gradient

For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \ldots, p$:
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
The factor $\sigma_r'(\xi_r) \cdot w_{rj}$ is less than one for the standard logistic sigmoid and weights initialized close to 0, so the gradient shrinks with every layer through which it is backpropagated.

Deep MLP – pretraining

Assume $k$ layers. Denote by
- $W_i$ the weight matrix between layers $i-1$ and $i$
- $F_i$ the function computed by the "lower" part of the MLP consisting of layers $0, 1, \ldots, i$
  ($F_1$ is the function computed by the input layer and the first hidden layer, which is now considered as the output layer)

Crucial observation: for every $i$, the layers $i-1$ and $i$ together with the matrix $W_i$ can be considered as an RBM (assume $T = 1$). Denote such an RBM by $B_i$.

Deep MLP – pretraining

For now, consider only input vectors $x_1, \ldots, x_p$ where $x_k \in \{0,1\}^n$ for all $k = 1, \ldots, p$.

Unsupervised pretraining: gradually, for every $i = 1, \ldots, k$, train the RBM $B_i$ on randomly selected inputs from the training set
$$F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$$
using the training algorithm for RBMs (here $F_0(x_k) = x_k$). Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i-1$. (A sketch of this loop is given at the end of this part.)

We obtain a deep belief network $D$ representing a distribution given by $x_1, \ldots, x_p$. (Recall that in such a distribution the probability of a given $x$ is equal to the relative frequency of $x$ in $x_1, \ldots, x_p$.)

Deep belief network

The network $D$ can be used to sample from the distribution as follows:
- Simulate the topmost RBM for some steps (ideally to thermal equilibrium); this gives the values of the neurons in the two topmost layers.
- Propagate the values downwards by always simulating one step of the corresponding RBM. That is, once the values of the neurons in layers $k$ and $k-1$ have been computed, compute the values of the neurons in layer $k-2$ by simulating one step of the RBM $B_{k-1}$, i.e. sample the values of the neurons in layer $k-2$ using the RBM dynamics of $B_{k-1}$ with the values of layer $k-1$ fixed. Similarly, compute the values of layer $k-3$ by simulating $B_{k-2}$, etc., and finally obtain the values of the input neurons.
The probability with which a concrete input $x$ is sampled by the above procedure is the probability of $x$ in the distribution represented by $D$.

Deep MLP – training with pretraining

Now consider supervised learning with a training set
$$\mathcal{T} = \{(x_k, d_k) \mid k = 1, \ldots, p\}.$$
Still assume that $x_k \in \{0,1\}^n$.
- Unsupervised pretraining: gradually, for every $i = 1, \ldots, k$, train the RBM $B_i$ on randomly selected inputs from the training set $F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$ using the training algorithm for RBMs (here $F_0(x_k) = x_k$). (Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i-1$.) Obtain $D$.
- Add one (or more) layer to the top of $D$ and consider the result to be an MLP (i.e. forget the RBM dynamics and start considering the network as an MLP with sigmoidal activations).
- Supervised fine-tuning: train in supervised mode (on the training set $\mathcal{T}$) using e.g. gradient descent + backpropagation.
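The greedy layer-wise procedure above (train $B_1$ on the raw inputs, then train each $B_i$ on the data transformed by the already pretrained layers) can be sketched with the RBM helpers from the first part (sample_hidden, cd1_update). This is only a sketch under those assumptions; the number of CD iterations rbm_epochs, the batch size q, and the small random initialisation are illustrative choices, not values from the lecture.

def pretrain_stack(layer_sizes, X, q, eps, rbm_epochs, rng):
    """Greedy layer-wise pretraining of RBMs B_1, ..., B_k.

    layer_sizes = [n_0, n_1, ..., n_k]; X has shape (p, n_0) with entries in {0, 1}.
    Returns the trained (W, b_h, b_v) triple of each RBM B_i.
    """
    rbms, H = [], X
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_hid, n_vis))   # initialized close to 0
        b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)
        for _ in range(rbm_epochs):
            batch = H[rng.choice(len(H), size=q)]        # random samples F_{i-1}(x_k)
            cd1_update(batch, W, b_h, b_v, eps, rng)
        rbms.append((W, b_h, b_v))
        # transform the training inputs by the freshly trained layer: F_i(x_k)
        H = np.stack([sample_hidden(v, W, b_h, rng)[0] for v in H])
    return rbms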
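For the supervised fine-tuning step, the backpropagation formulas from the vanishing-gradient part translate into a gradient routine compatible with the forward() and sgd() sketches above. Again an illustrative sketch: the mean squared error $E_k$ and logistic sigmoids are assumed, and the variable names are my own.

def sigmoid_prime(y):
    """Derivative of the logistic sigmoid expressed through its output y."""
    return y * (1.0 - y)

def grad_Ek(Ws, x, d):
    """Gradient of E_k = 1/2 * sum_j (y_j - d_kj)^2 computed by backpropagation."""
    ys = forward(Ws, x)
    dE_dy = ys[-1] - np.asarray(d, dtype=float)     # dE_k/dy_j = y_j - d_kj on the output layer
    grads = []
    for l in range(len(Ws) - 1, -1, -1):
        delta = dE_dy * sigmoid_prime(ys[l + 1])    # dE_k/dy_j * sigma'(xi_j)
        y_prev = np.concatenate(([1.0], ys[l]))
        grads.append(np.outer(delta, y_prev))       # dE_k/dw_ji = delta_j * y_i
        # the backward step multiplies by sigma'(xi_r) * w_rj -- the factor that is
        # less than one for the logistic sigmoid with small weights (vanishing gradient)
        dE_dy = Ws[l][:, 1:].T @ delta              # dE_k/dy_j for the previous layer
    return grads[::-1]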
Application – dimensionality reduction

Dimensionality reduction: a mapping $R$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ where $m < n$, such that for every example $x$ the original $x$ can be "reconstructed" from $R(x)$.
Standard method: PCA (there are many linear as well as non-linear variants).

Reconstruction – PCA

(Figure: 1024 pixels compressed to 100 dimensions, i.e. 100 numbers.)

Autoencoders

Dimensionality reduction using an MLP. Consider an MLP $n$-$m$-$n$ where $m \ll n$, with the same vector on the input as well as on the output.
Dimensionality reduction:
- Encoding: compute the values of the hidden neurons.
- Reconstruction: compute the values of the output neurons given the values of the hidden neurons.
Can also be used for compression (in communication). One can show that if linear neurons are used, the method implements PCA.

Autoencoder – historical implementation

Architecture: MLP 64-16-64
Activity: activation function: hyperbolic tangent with limits $-1$ and $1$
Learning:
- Data: images $256 \times 256$, 8 bits per pixel.
- Samples: input and output is a frame $8 \times 8$, randomly selected in the image. Inputs normalized to $[-1, 1]$.
- Learning: backpropagation; learning rate 0.01 for the hidden layer, 0.1 for the output layer; trained for 50,000-100,000 iterations.
The goal was to compress images to a smaller data size.

Dimensionality reduction – compression

A frame $8 \times 8$ passes through the image $256 \times 256$ (no overlap).
(Figure: (A) original, (B) compression, (C) compression + rounding to 6 bits (1.5 bits per pixel), (D) compression + rounding to 4 bits (1 bit per pixel).)

Dimensionality reduction – compression

New image (the network trained on the previous one):
(Figure: (A) original, (B) compression, (C) compression + rounding to 6 bits (1.5 bits per pixel), (D) compression + rounding to 4 bits (1 bit per pixel).)

Deep MLP – dimensionality reduction

Hinton, G. E., Osindero, S., and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started the whole deep learning craze ...

Images – pretraining

Data: 165,600 grayscale images, $25 \times 25$, mean intensity 0, variance 1. Images obtained from the Olivetti Faces database of $64 \times 64$ images using standard transformations. 103,500 training, 20,700 validation, 41,400 test images.
Network: 2000-1000-500-30, trained as layered RBMs.
Notes:
- Training of the lowest layer (2000 neurons): values of pixels distorted using Gaussian noise, low learning rate 0.001, 200 iterations.
- Training of all hidden layers: values of neurons are binary.
- Training of the output layer (30 neurons): values computed directly using the sigmoid activation function + noise; that is, the values of the output neurons are from the interval $[0, 1]$.

Images – fine-tuning

Stochastic activation is substituted with a deterministic one: the value of a hidden neuron is not chosen randomly but computed directly by applying the sigmoid to the inner potential (this gives the mean activation). Training by backpropagation.
Error function: cross-entropy
$$-\sum_i p_i \ln \hat{p}_i - \sum_i (1 - p_i) \ln(1 - \hat{p}_i)$$
where $p_i$ is the intensity of the $i$-th pixel of the input and $\hat{p}_i$ of the reconstruction.

Results

(Figure: 1. original, 2. reconstruction using deep networks (reduction to 30 dimensions), 3. reconstruction using PCA (reduction to 30 dimensions).)
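To make the fine-tuning objective concrete, the cross-entropy reconstruction error and the encode/reconstruct split of an autoencoder can be sketched on top of the forward() helper from the MLP part. The function names, the bottleneck index code_layer, and the clipping constant are illustrative assumptions, not part of the lecture.

def cross_entropy(p, p_hat, tiny=1e-12):
    """Reconstruction error  -sum_i p_i ln p_hat_i - sum_i (1 - p_i) ln(1 - p_hat_i)."""
    p_hat = np.clip(p_hat, tiny, 1.0 - tiny)       # guard against log(0)
    return -np.sum(p * np.log(p_hat) + (1.0 - p) * np.log(1.0 - p_hat))

def encode_decode(Ws, x, code_layer):
    """Evaluate an n-m-...-n autoencoder and return (code, reconstruction)."""
    ys = forward(Ws, x)            # layer-wise evaluation as in the MLP sketch
    return ys[code_layer], ys[-1]

In the face experiment above, Ws would correspond to an encoder 625-2000-1000-500-30 ($25 \times 25 = 625$ input pixels) followed by a mirrored decoder producing the reconstruction $\hat{p}$, and fine-tuning would minimise cross_entropy(x, reconstruction) by backpropagation.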