(Primitive) Mathematical Model of Neuron σ ξ x1 x2 xn y 1 Formal neuron σ ξ x1 x2 xn x0 = 1 y w1 w2 · · · wn w0 x1, . . . , xn real inputs x0 special input, always 1 w0, w1, . . . , wn real weights ξ = w0 + n i=1 wi xi inner potential; In general, other potentials are considered (e.g. Gaussian), more on this in PV021. y output defined by y = σ(ξ) where σ is an activation function. We consider several activation functions. e.g. linear threshold function σ(ξ) = sgn(ξ) = 1 ξ ≥ 0 ; 0 ξ < 0. 2 Formal Neuron vs Linear Models Both linear classifier and linear (affine) function are special cases of the formal neuron. If σ is a linear threshold function σ(ξ) = 1 ξ ≥ 0 ; 0 ξ < 0. we obtain a linear classifier. If σ is identity, i.e. σ(ξ) = ξ, we obtain a linear (affine) function. Many more activation functions are used in neural networks! 3 Sigmoid Functions Logistic sigmoid σ(ξ) = 1 1 + e−ξ Others ... 4 Multilayer Perceptron (MLP) Input Hidden Output · · · · · · Neurons are organized in layers (input layer, output layer, possibly several hidden layers) Layers are numbered from 0; the input is 0-th Neurons in the -th layer are connected with all neurons in the + 1-th layer Intuition: The network computes a function as follows: Assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the -th step consider output values of neurons in − 1-th layer as inputs to neurons of the -th layer. Compute output values of neurons in the -th layer. 5 Example 1010 1001 σ 01000110111 σ0001011011 1 σ 0101 1 −22 2 −2 1 −1 1 3 −2 Activation function: linear threshold σ(ξ) = 1 ξ ≥ 0 ; 0 ξ < 0. 6 Classical Example – ALVINN One of the first autonomous car driving systems (in 90s) ALVINN drives a car The net has 30 × 32 = 960 input neurons (the input space is R960 ). The value of each input captures the shade of gray of the corresponding pixel. Output neurons indicate where to turn (to the center of gravity). Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html 7 A Bit of History Perceptron (Rosenblatt et al, 1957) Single layer (i.e. no hidden layers), the activation function linear threshold (i.e., a bit more general linear classifier) Perceptron learning algorithm Used to recognize numbers Adaline (Widrow & Hof, 1960) Single layer, the activation function identity (i.e., a bit more linear function) Online version of the gradient descent Used a new circuitry element called memristor which was able to "remember"history of current in form of resistance In both cases, the expressive power is rather limited – can express only linear decision boundaries and linear (affine) functions. 8 A Bit of History One of the famous (counter)-examples: XOR 0 (0, 0) 1 (0, 1) 1 (0, 1) 0 (1, 1) x1 x2 No perceptron can distinguish between ones and zeros. 9 XOR vs Multilayer Perceptron 0 (0, 0) 1 (0, 1) 1 (0, 1) 0 (1, 1) P1 P2 x1 x2 σ1 σ 1 σ1 −22 2 −2 1 −1 1 3 −2 (Here σ is a linear threshold function.) P1 : −1 + 2x1 + 2x2 = 0 P2 : 3 − 2x1 − 2x2 = 0 The output neuron performs an intersection of half-spaces. 10 Expressive Power of MLP Cybenko’s theorem: Two layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) are able to approximate with arbitrarily small error any "reasonable" boundary a given input is classified as 1 iff the output value of the network is ≥ 1/2. approximate with arbitrarily small error any "reasonable" function with range (0, 1). Here "reasonable" means that it is pretty tough to find a function that is not reasonable. So multi-layer perceptrons are suffuciently powerful for any application. But for a long time, at least throughout 60s and 70s, nobody well-known knew any efficient method for training multilayer networks! ... then the backpropagation appeared in 1986! 11 MLP – Notation X set of input neurons Y set of output neurons Z set of all neurons (tedy X, Y ⊆ Z) individual neurons are denoted by indices, e.g. i, j. ξj is the inner potential of the neuron j when the computation is finished. yj is the output value of the neuron j when the computation is finished. (we formally assume y0 = 1) wji is the weight of the arc from the neuron i to the neuron j. j← is the set of all neurons from which there are edges to j (i.e. j← is the layer directly below j) j→ is the set of all neurons to which there are edges from j. (i.e. j→ is the layer directly above j) 12 MLP – Notation Inner potential of a neuron j: ξj = i∈j← wji yi for simplicity, the activation function of every neuron will be the logistic sigmoid σ(ξ) = 1 1+e−ξ . (We may of course consider logistic sigmoids with different steepness paramaters, or other sigmoidal functions, more in PV021.) A value of a non-input neuron j ∈ Z \ X when the computation is finished is yj = σ(ξj ) (yj is determined by weights w and a given input x, so it’s sometimes written as yj [w](x) ) Fixing weights of all neurons, the network computes a function F[w] : R|X| → R|Y | as follows: Assign values of a given vector x ∈ R|X| to the input neurons, evaluate the network, then F[w](x) is the vector of values of the output neurons. Here we implicitly assume a fixed orderings on input and output vectors. 13 MLP – Learning Given a set D of training examples: D = xk, dk k = 1, . . . , p Here xk ∈ R|X| and dk ∈ R|Y |. We write dkj to denote the value in dk corresponding to the output neuron j. Least Squares Error Function: Let w be a vector of all weights in the network. E(w) = p k=1 Ek(w) where Ek(w) = 1 2 j∈Y (yj [w](xk) − dkj )2 14 MLP – Learning Algorithm Batch Learning – Gradient Descent: The algorithm computes a sequence of weights w(0), w(1), . . .. weights w(0) are initialized randomly close to 0 in the step t + 1 (here t = 0, 1, 2 . . .) is w(t+1) computed as follows: w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = −ε(t) · ∂E ∂wji (w(t) ) is the weight change wji and 0 < ε(t) ≤ 1 is the learning speed in the step t + 1. Note that ∂E ∂wji (w(t) ) is a component of E, i.e. the weight change in the step t + 1 can be written as follows: w(t+1) = w(t) − ε(t) · E(w(t) ). 15 MLP – Gradient Computation For every weight wji we have (obviously) ∂E ∂wji = p k=1 ∂Ek ∂wji So now it suffices to compute ∂Ek ∂wji , that is the error for a fixed training example (xk , dk ). It holds that ∂Ek ∂wji = δj · yj (1 − yj ) · yi where δj = yj − dkj pro j ∈ Y δj = r∈j→ δr · yr (1 − yr ) · wrj pro j ∈ Z (Y ∪ X) (Here yr = y[w](xk ) where w are the current weights and xk is the input of the k-th training example.) 16 Multilayer Perceptron – Backpropagation So to compute all ∂E ∂wji = p k=1 ∂Ek ∂wji : Compute all ∂Ek ∂wji = δj · yj (1 − yj ) · yi for every training example (xk , dk ): Evaluate all values yi of neurons using the standard bottom-up procedure with the input xk . Compute δj using backpropagation through layers top-down : Assign δj = yj − dkj for all j ∈ Y In the layer , assuming that δr has been computed for all neurons r in the layer + 1, compute δj = r∈j→ δr · yr (1 − yr ) · wrj for all j from the -th layer. Compute ∂Ek ∂wji = δj · yj (1 − yj ) · yi for every weight wji . 17 Example Assume w (0) 30 = w (0) 50 = w (0) 41 = w (0) 42 = w (0) 54 = 1 and w (0) 40 = w (0) 31 = w (0) 32 = w (0) 53 = −1. Consider a training set {((1, 0), 1)}. Then y1 = 1, y2 = 0, y3 = σ(w30 + w (0) 31 · y1 + w (0) 32 · y2) = 0.5, y4 = 0.5, y5 = 0.7310586. δ5 = y5 − 1 = −0.268941, δ4 = d5 · y5 · (1 − y5) · w (0) 54 = −0.052877, δ3 = 0.052877. y1 y2 σ y31 σy4 1 σ y51 w41 w31 w32 w42 w53 w30 w54 w40 w50 ∂E1 ∂w53 = δ5 · y5 · (1 − y5) · y3 = −0.026438, ∂E1 ∂w54 = δ5 · y5 · (1 − y5) · y4 = −0.026438, ∂E1 ∂w31 = δ3 · y3 · (1 − y3) · y1 = 0.013219, .... 18 Illustration of Gradient Descent – XOR Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork 19 Comments on Training Algorithm Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely. In practice, does converge to low error for many large networks on real data. Many epochs (thousands) may be required, hours or days of training for large networks. To avoid local-minima problems, run several trials starting with different random weights (random restarts). Take results of trial with lowest training set error. Build a committee of results from multiple trials (possibly weighting votes by training set accuracy). There are many more issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, training speed, efficiency of the gradient descent itself etc.) – see PV021. 20 Hidden Neurons Representations Trained hidden neurons can be seen as newly constructed features. E.g., in a two layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable). Consider a two-layer MLP, 64-2-3 for classification of letters (three output neurons, each corresponds to one of the letters). 21 Over-Training Due to their expressive power, neural networks are quite sensitive to over-training. Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase the validation error. 22 Over-Fitting – The Number of Hidden Neurons Too few hidden neurons prevent the network from adequately fitting the data. Too many hidden units can result in over-fitting. (There are advanced methods that prevent over-fitting even for rich models, such as regularization, where the error function penalizes over-fitting – see PV021.) Use cross-validation to empirically determine an optimal number of hidden units. There are methods that automatically construct the structure of the network based on data, they are not much used though. 23 Applications Text to Speech Fraud detection finance & business predictions HNC software company acquired by Fair Isaac for $3/4 billion Game playing (backgammon is a classical example) Handwriting recognition and image recognition in general. This is the main area in which the current state-of-the-art deep networks excel. (artificial brain and intelligence) ... 24 ALVINN 25 ALVINN Two layer MLP, 960 − 4 − 30 (sometimes 960 − 5 − 30) Inputs correspond to pixels. Sigmoidal activation function (logistic sigmoid). Direction corresponds to the center of gravity. I.e., output neurons are considered as points of mass evenly distributed along a line. Weight of each neuron corresponds to its value. 26 ALVINN – Training Trained while driving. A camera captured the road from the front window, approx. 25 pictures per second Training examples (xk, dk) where xk = image of the road dk ≈ corresponding direction of the steering wheel set by the driver the values dk computed using Gaussian distribution: dki = e−D2 i /10 where Di is the distance between the i-th output from the one that corresponds to the real direction of the steering wheel. (This is better than the binary output because similar road directions induce similar reaction of the driver.) 27 Selection of Training Examples Naive approach: just take images from the camera. Problems: A too good driver never teaches the network how to solve deviations from the right track. Couple of harsh solutions: turn the learning off for a moment, deviate from the right track, then turn on the learning and let the network learn how to solve the situation. let the driver go crazy! (a bit dangerous, expensive, unreliable) Images are very similar (the network basically sees the road from the right lane), may be overtrained. 28 Selection of Training Examples Problem with too good driver were solved as follows: every image of the road has been has been transformed to 15 slightly different copies Repetitiveness of images was solved as follows: the system has a buffer of 200 images (including the 15 copies of the current one), in every round trains on these images afterwards, a new image is captured, 15 copies made, and these new 15 substitute 15 selected from the buffer (10 with the smallest training error, 5 randomly) 29 ALVINN – Training standard backpropagation constant speed of learning (possibly different for each neuron – see PV021) some other optimizations (see PV021) Výsledek: Training took 5 minutes, the speed was 4 miles per hour (The speed was limited by the hydraulic controller of the steering wheel not the learning algorithm.) ALVINN was able to go through roads it never "seen" and in different weather 30 ALVINN – Weight Learning round 0 round 10 round 20 round 50 h1 h2 h3 h4 h5 Here h1, . . . , h5 are values of hidden neurons. 31 Extensions and Directions (PV021) Other types of learning inspired by neuroscience – Hebbian learning More biologically plausible models of neural networks – spiking neurons This goes into the direction of HUGE area of (computational) neuroscience, only very lightly touched in PV021. Unsupervised learning – Self-Organizing Maps Reinforcement learning learning to make decisions, or play games, sequentially neural networks have been used – temporal difference learning 32 Deep Learning Cybenko’s theorem shows that two-layer networks are omnipotent – such results nearly killed NN when support vector machines were found to be easier to train in 00’s. Later, it has been shown (experimentally) that deep networks (with many layers) have better represenational properties. ... but how to train them? The backpropagation suffers from so-called vanishing gradient, intuitively, updates of weights in lower layers are very slow. In 2006 a solution was found by Hinton et al: Use unsupervised methods to initialize the weights so that they capture important features in data. More precisely: The lowest hidden layer learns patterns in data, second lowest learns patterns in data transformed through the first layer, and so on. Then use a supervised learning algorithm to only fine tune the weights to the desired input-output behavior. A rather heavy machinery is needed to develop this, but you will be rewarded by insight into a very modern and expensive technology. 33 Deeper Insight into the Logistic Sigmoid Consider a perceptron (that is a linear classifier): ξ = w0 + n i=1 wi · xi and y = sgn(ξ) = 1 ξ ≥ 0 0 ξ < 0 . Recall, that the signed distance from the decision boundary determined by ξ = 0 is (here x = (x1, . . . , xn) and w = (w1, . . . , wn)) w0 + n i=1 wi · xi n i=1 w2 i = ξ n i=1 w2 i This value is positive for x on the side of w and negative on the opposite. For simplicity, assume that n i=1 w2 i = 1, and thus that the potential ξ is equal to the signed distance of x from the boundary. 34 Deeper Insight into the Logistic Sigmoid Assume that training examples (x, c(x)) are randomly generated. Denote: ξ1 mean signed distance from the boundary of points classified 1. ξ0 mean signed distance from the boundary of points classified 0. It is not unreasonable to assume that conditioned on c = 1, the signed distance ξ is normally distributed with the mean ξ1 and variance (for simplicity) 1, conditioned on c = 0, the signed distance ξ is normally distributed with the mean ξ0 and variance (for simplicity) 1. (Notice that ξ may be negative, which means that such point is on the wrong side of the boundary (the same for ξ > 0).) Now, can we decide what is the probability of c = 1 given a distance? 35 Deeper Insight into the Logistic Sigmoid For simplicity, assume that ξ1 = −ξ0 = 1/2. P(1 | ξ) = p(ξ | 1)P(1) p(ξ | 1)P(1) + p(ξ | 0)P(0) = LR LR + 1/clr where LR = p(ξ | 1) p(ξ | 0) = exp(−(ξ − 1/2)2 /2) exp(−(ξ + 1/2)2/2) = exp(ξ) and clr = P(1) P(0) which we assume (for simplicity) = 1 So P(1 | ξ) = exp(ξ) exp(ξ) + 1 = 1 1 + e−ξ Thus the logistic sigmoid applied to ξ = w0 + n i=1 wi · xi gives the probability of c = 1 given the input! 36 Deeper Insight into the Logistic Sigmoid So if we use the logistic sigmoid as an activation function, and turn the neuron into a classifier as follows: classify a given input x as 1 iff y ≥ 1/2 Then the neuron basically works as the Bayes classifier! This is the basis of logistic regression. Given training data, we may compute the weights w that maximize the likelihood of the training data (w.r.t. the probabilities returned by the neuron). An extremely interesting observation is that such w maximizing the likelihood coincides with the minimum of least squares for the corresponding linear function (that is the same neuron but with identity as the activation function). 37