(Primitive) Mathematical Model of Neuron

Formal neuron
► $x_1, \dots, x_n$ real inputs
► $x_0$ special input, always 1
► $w_0, w_1, \dots, w_n$ real weights
► $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$ is the inner potential; in general, other potentials are considered (e.g. Gaussian), more on this in PV021.
► $y$ output defined by $y = \sigma(\xi)$ where $\sigma$ is an activation function. We consider several activation functions, e.g. the linear threshold function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$

Formal Neuron vs Linear Models
Both the linear classifier and the linear (affine) function are special cases of the formal neuron.
► If $\sigma$ is the linear threshold function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0, \end{cases}$$
we obtain a linear classifier.
► If $\sigma$ is the identity, i.e. $\sigma(\xi) = \xi$, we obtain a linear (affine) function.
Many more activation functions are used in neural networks!

Sigmoid Functions
[Figure: examples of sigmoidal activation functions.]

Multilayer Perceptron (MLP)
[Figure: a layered network with input, hidden and output layers.]
► Neurons are organized in layers (input layer, output layer, possibly several hidden layers)
► Layers are numbered from 0; the input layer is the 0-th
► Neurons in the $\ell$-th layer are connected with all neurons in the $(\ell+1)$-th layer

Intuition: The network computes a function as follows: Assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the $\ell$-th step, consider the output values of neurons in the $(\ell-1)$-th layer as inputs to neurons of the $\ell$-th layer and compute the output values of neurons in the $\ell$-th layer.
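A minimal sketch (not from the slides) of the formal neuron and of the layer-by-layer evaluation just described; the logistic sigmoid and the 2-2-1 weights below are made up for illustration.

```python
# Formal neuron and layer-by-layer MLP evaluation, as described above.
import math

def sigmoid(xi):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-xi))

def neuron(weights, inputs, activation=sigmoid):
    """Formal neuron: weights = (w0, w1, ..., wn), inputs = (x1, ..., xn).
    Inner potential xi = w0 + sum_i w_i * x_i, output y = activation(xi)."""
    xi = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return activation(xi)

def mlp_forward(layers, x):
    """Evaluate an MLP given as a list of layers; each layer is a list of
    weight vectors (one per neuron). Proceeds upwards one layer per step."""
    values = list(x)
    for layer in layers:
        values = [neuron(w, values) for w in layer]
    return values

# Example: a made-up 2-2-1 network evaluated on the input (1, 0).
hidden = [(0.0, 1.0, 1.0), (0.0, -1.0, 1.0)]
output = [(0.0, 1.0, 1.0)]
print(mlp_forward([hidden, output], (1.0, 0.0)))
```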
Example
[Figure: step-by-step evaluation of a small MLP.]

Classical Example - ALVINN
[Figure: Sharp Left / Straight Ahead / Sharp Right; 4 hidden units, 30 output units, 30x32 sensor input retina.]
► One of the first autonomous car driving systems (in the 90s)
► ALVINN drives a car
► The net has $30 \times 32 = 960$ input neurons (the input space is $\mathbb{R}^{960}$).
► The value of each input captures the shade of gray of the corresponding pixel.
► Output neurons indicate where to turn (to the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

A Bit of History
► Perceptron (Rosenblatt et al., 1957)
  ► Single layer (i.e. no hidden layers), the activation function is the linear threshold (i.e., a bit more general linear classifier)
  ► Perceptron learning algorithm
  ► Used to recognize numbers
► Adaline (Widrow & Hoff, 1960)
  ► Single layer, the activation function is the identity (i.e., a bit more general linear function)
  ► Online version of the gradient descent
  ► Used a new circuitry element called the memristor, which was able to "remember" the history of current in the form of resistance
In both cases, the expressive power is rather limited - they can express only linear decision boundaries and linear (affine) functions.

A Bit of History
One of the famous (counter)examples: XOR
[Figure: the four points (0,0), (0,1), (1,0), (1,1) in the plane with axes x1, x2; (0,1) and (1,0) are labeled 1, (0,0) and (1,1) are labeled 0.]
No perceptron can distinguish between the ones and the zeros.

XOR vs Multilayer Perceptron
(Here $\sigma$ is the linear threshold function.)
$$P_1: -1 + 2x_1 + 2x_2 = 0 \qquad P_2: 3 - 2x_1 - 2x_2 = 0$$
The output neuron performs an intersection of half-spaces.
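A hedged sketch of the XOR construction above: two hidden threshold neurons implement the half-planes bounded by $P_1$ and $P_2$, and the output neuron intersects them. The output-neuron weights (-3, 2, 2) are my own choice of an AND gate over the two hidden outputs.

```python
def threshold(xi):
    """Linear threshold activation: 1 if xi >= 0, else 0."""
    return 1 if xi >= 0 else 0

def neuron(weights, inputs):
    """weights = (w0, w1, ..., wn); inner potential w0 + sum_i wi * xi."""
    xi = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return threshold(xi)

def xor_net(x1, x2):
    h1 = neuron((-1, 2, 2), (x1, x2))    # half-plane -1 + 2*x1 + 2*x2 >= 0 (above P1)
    h2 = neuron((3, -2, -2), (x1, x2))   # half-plane  3 - 2*x1 - 2*x2 >= 0 (below P2)
    return neuron((-3, 2, 2), (h1, h2))  # fires iff both half-planes contain (x1, x2)

for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p, xor_net(*p))   # prints 0, 1, 1, 0
```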
Expressive Power of MLP
Cybenko's theorem:
► Two-layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) are able to
  ► approximate with arbitrarily small error any "reasonable" decision boundary: a given input is classified as 1 iff the output value of the network is $\ge 1/2$;
  ► approximate with arbitrarily small error any "reasonable" function with range (0,1).
Here "reasonable" means that it is pretty tough to find a function that is not reasonable.

So multilayer perceptrons are sufficiently powerful for any application.

But for a long time, at least throughout the 60s and 70s, nobody well-known knew any efficient method for training multilayer networks!

... then backpropagation appeared in 1986!

MLP - Notation
► $X$ is the set of input neurons
► $Y$ is the set of output neurons
► $Z$ is the set of all neurons (i.e. $X, Y \subseteq Z$)
► individual neurons are denoted by indices, e.g. $i, j$
► $\xi_j$ is the inner potential of the neuron $j$ when the computation is finished
► $y_j$ is the output value of the neuron $j$ when the computation is finished (we formally assume $y_0 = 1$)
► $w_{ji}$ is the weight of the arc from the neuron $i$ to the neuron $j$
► $j_{\leftarrow}$ is the set of all neurons from which there are edges to $j$ (i.e. $j_{\leftarrow}$ is the layer directly below $j$)
► $j^{\rightarrow}$ is the set of all neurons to which there are edges from $j$ (i.e. $j^{\rightarrow}$ is the layer directly above $j$)
MLP - Notation
► Inner potential of a neuron $j$:
$$\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$$
► For simplicity, the activation function of every neuron will be the logistic sigmoid $\sigma(\xi) = \frac{1}{1 + e^{-\xi}}$. (We may of course consider logistic sigmoids with different steepness parameters, or other sigmoidal functions, more in PV021.)
► The value of a non-input neuron $j \in Z \setminus X$ when the computation is finished is $y_j = \sigma(\xi_j)$. ($y_j$ is determined by the weights $w$ and a given input $x$, so it is sometimes written as $y_j[w](x)$.)
► Fixing the weights of all neurons, the network computes a function $F[w] : \mathbb{R}^{|X|} \to \mathbb{R}^{|Y|}$ as follows: Assign the values of a given vector $x \in \mathbb{R}^{|X|}$ to the input neurons, evaluate the network; then $F[w](x)$ is the vector of values of the output neurons. Here we implicitly assume fixed orderings on the input and output vectors.

MLP - Learning
► Given a set $D$ of training examples:
$$D = \left\{ (x_k, d_k) \mid k = 1, \dots, p \right\}$$
Here $x_k \in \mathbb{R}^{|X|}$ and $d_k \in \mathbb{R}^{|Y|}$. We write $d_{kj}$ to denote the value in $d_k$ corresponding to the output neuron $j$.
► Least Squares Error Function: Let $w$ be a vector of all weights in the network.
$$E(w) = \sum_{k=1}^{p} E_k(w) \qquad \text{where} \qquad E_k(w) = \frac{1}{2} \sum_{j \in Y} \left( y_j[w](x_k) - d_{kj} \right)^2$$

MLP - Learning Algorithm
Batch Learning - Gradient Descent:
The algorithm computes a sequence of weights $w^{(0)}, w^{(1)}, \dots$
► weights $w^{(0)}$ are initialized randomly close to 0
► in the step $t+1$ (here $t = 0, 1, 2, \dots$), $w^{(t+1)}$ is computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}\left(w^{(t)}\right)$$
is the change of the weight $w_{ji}$ and $0 < \varepsilon(t) \le 1$ is the learning speed in the step $t+1$.

Note that $\frac{\partial E}{\partial w_{ji}}(w^{(t)})$ is a component of $\nabla E$, i.e. the weight change in the step $t+1$ can be written as follows:
$$w^{(t+1)} = w^{(t)} - \varepsilon(t) \cdot \nabla E\left(w^{(t)}\right).$$
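A minimal sketch (my own, not from the slides) of $E(w)$ and one step of batch gradient descent on a tiny 2-2-1 network. The gradient is estimated numerically here to keep the sketch self-contained; the following slides derive the same gradient analytically via backpropagation. The initial weights are chosen to match the worked example a few slides below.

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(w, x):
    """F[w](x) for a fixed 2-2-1 architecture; w is a flat list of 9 weights:
    (w30, w31, w32, w40, w41, w42, w50, w53, w54)."""
    y1, y2 = x
    y3 = sigmoid(w[0] + w[1] * y1 + w[2] * y2)
    y4 = sigmoid(w[3] + w[4] * y1 + w[5] * y2)
    y5 = sigmoid(w[6] + w[7] * y3 + w[8] * y4)
    return [y5]

def error(w, data):
    """E(w) = sum_k 1/2 * sum_{j in Y} (y_j[w](x_k) - d_kj)^2."""
    return sum(0.5 * sum((yj - dj) ** 2 for yj, dj in zip(forward(w, xk), dk))
               for xk, dk in data)

def gradient_descent_step(w, data, eps=0.5, h=1e-6):
    """w^(t+1) = w^(t) - eps * grad E(w^(t)); gradient by central differences."""
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h
        wm[i] -= h
        grad.append((error(wp, data) - error(wm, data)) / (2 * h))
    return [wi - eps * gi for wi, gi in zip(w, grad)]

data = [((1.0, 0.0), (1.0,))]   # the training set used in the worked example below
w = [0.0] * 7 + [1.0, 1.0]      # w53 = w54 = 1, all other weights 0
print(error(w, data))                                 # error before the update
print(error(gradient_descent_step(w, data), data))    # error after one step (smaller)
```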
MLP - Gradient Computation
For every weight $w_{ji}$ we have (obviously)
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
So now it suffices to compute $\frac{\partial E_k}{\partial w_{ji}}$, where $E_k$ is the error for a fixed training example $(x_k, d_k)$.

It holds that
$$\frac{\partial E_k}{\partial w_{ji}} = \delta_j \cdot y_j (1 - y_j) \cdot y_i$$
where
$$\delta_j = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\delta_j = \sum_{r \in j^{\rightarrow}} \delta_r \cdot y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here $y_r = y_r[w](x_k)$ where $w$ are the current weights and $x_k$ is the input of the $k$-th training example.)

Multilayer Perceptron - Backpropagation
So to compute all $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$:

Compute all $\frac{\partial E_k}{\partial w_{ji}} = \delta_j \cdot y_j (1 - y_j) \cdot y_i$ for every training example $(x_k, d_k)$:
► Evaluate all values $y_i$ of neurons using the standard bottom-up procedure with the input $x_k$.
► Compute $\delta_j$ using backpropagation through layers, top-down:
  ► Assign $\delta_j = y_j - d_{kj}$ for all $j \in Y$.
  ► In the layer $\ell$, assuming that $\delta_r$ has been computed for all neurons $r$ in the layer $\ell + 1$, compute
  $$\delta_j = \sum_{r \in j^{\rightarrow}} \delta_r \cdot y_r (1 - y_r) \cdot w_{rj}$$
  for all $j$ from the $\ell$-th layer.

Example
Consider a network with input neurons 1, 2, hidden neurons 3, 4 and output neuron 5, with weights
$$w_{30}^{(0)} = w_{31}^{(0)} = w_{32}^{(0)} = w_{40}^{(0)} = w_{41}^{(0)} = w_{42}^{(0)} = w_{50}^{(0)} = 0 \qquad \text{and} \qquad w_{53}^{(0)} = w_{54}^{(0)} = 1.$$
Consider a training set $\{((1,0), 1)\}$.
Then $y_1 = 1$, $y_2 = 0$, $y_3 = \sigma(w_{30} + w_{31} y_1 + w_{32} y_2) = 0.5$, $y_4 = 0.5$, $y_5 = 0.731058$.
$$\delta_5 = y_5 - 1 = -0.268942$$
$$\delta_3 = \delta_4 = \delta_5 \cdot y_5 (1 - y_5) \cdot w_{53}^{(0)} = -0.052877$$
$$\frac{\partial E_1}{\partial w_{53}} = \delta_5 \cdot y_5 (1 - y_5) \cdot y_3 = -0.026438, \qquad \frac{\partial E_1}{\partial w_{54}} = \delta_5 \cdot y_5 (1 - y_5) \cdot y_4 = -0.026438$$
$$\frac{\partial E_1}{\partial w_{30}} = \delta_3 \cdot y_3 (1 - y_3) \cdot 1 = -0.01321925$$

Illustration of Gradient Descent - XOR
[Figure: gradient descent trajectories on the XOR task.]
Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork
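A minimal sketch (my own, not from the slides) of the backpropagation computation just described, on the 2-2-1 network of the worked example: forward pass bottom-up, deltas top-down, then the partial derivatives.

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def backprop_example(w, x, d):
    """w = dict of weights w[(j, i)], with i = 0 denoting the bias input y0 = 1."""
    y = {0: 1.0, 1: x[0], 2: x[1]}
    # Forward pass (bottom-up): hidden neurons 3, 4, then output neuron 5.
    for j, pred in [(3, [0, 1, 2]), (4, [0, 1, 2]), (5, [0, 3, 4])]:
        y[j] = sigmoid(sum(w[(j, i)] * y[i] for i in pred))
    # Backward pass (top-down): delta_5 = y5 - d, then propagate to 3 and 4.
    delta = {5: y[5] - d}
    for j in (3, 4):
        delta[j] = delta[5] * y[5] * (1 - y[5]) * w[(5, j)]
    # Partial derivatives dE_k/dw_ji = delta_j * y_j * (1 - y_j) * y_i.
    grads = {(j, i): delta[j] * y[j] * (1 - y[j]) * y[i] for (j, i) in w}
    return y, delta, grads

w = {(3, 0): 0.0, (3, 1): 0.0, (3, 2): 0.0,
     (4, 0): 0.0, (4, 1): 0.0, (4, 2): 0.0,
     (5, 0): 0.0, (5, 3): 1.0, (5, 4): 1.0}
y, delta, grads = backprop_example(w, (1.0, 0.0), 1.0)
print(y[5], delta[5])                # agrees with y5 and delta_5 above, up to rounding
print(grads[(5, 3)], grads[(3, 0)])  # agrees with dE1/dw53 and dE1/dw30, up to rounding
```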
Comments on Training Algorithm
► Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely.
► In practice, it does converge to low error for many large networks on real data.
► Many epochs (thousands) may be required, hours or days of training for large networks.
► To avoid local-minima problems, run several trials starting with different random weights (random restarts).
  ► Take the results of the trial with the lowest training set error.
  ► Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).
There are many more issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, training speed, efficiency of the gradient descent itself, etc.) - see PV021.

Hidden Neurons Representations
Trained hidden neurons can be seen as newly constructed features. E.g., in a two-layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable).
Consider a two-layer MLP, 64-2-3, for classification of letters (three output neurons, each corresponding to one of the letters).
[Figure: sample training patterns and the learned input-to-hidden weights.]

Overfitting
► Due to their expressive power, neural networks are quite sensitive to overfitting.
[Figure: error on training data and on test data as a function of the number of training epochs.]
► Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase the validation error.

Overfitting - The Number of Hidden Neurons
► Too few hidden neurons prevent the network from adequately fitting the data.
► Too many hidden units can result in overfitting. (There are advanced methods that prevent overfitting even for rich models, such as regularization, where the error function penalizes overfitting - see PV021.)
[Figure: error on training data and on test data as a function of the number of hidden units.]
► Use cross-validation to empirically determine an optimal number of hidden units.
There are methods that automatically construct the structure of the network based on data; they are not much used, though.

Applications
► Text to speech and vice versa
► Fraud detection
► Finance & business predictions
► Game playing (backgammon is the classical example, AlphaGo is the modern one)
► Image recognition. This is the main area in which the current state-of-the-art deep networks excel.
► (artificial brain and intelligence)
► ...

ALVINN
[Figure: the ALVINN vehicle and its network.]
► Two-layer MLP, 960 - 4 - 30 (sometimes 960 - 5 - 30)
► Inputs correspond to pixels.
► Sigmoidal activation function (logistic sigmoid).
► Direction corresponds to the center of gravity, i.e., the output neurons are considered as points of mass evenly distributed along a line, and the mass of each neuron corresponds to its value.
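A hedged sketch (my own) of the center-of-gravity decoding just described: the 30 output neurons are treated as points of mass evenly spaced along a line of steering directions, and the predicted direction is their weighted average. The direction range [-1, 1] (sharp left to sharp right) is an assumption made for illustration.

```python
def steering_direction(outputs, leftmost=-1.0, rightmost=1.0):
    """Return the center of gravity of the output activations."""
    n = len(outputs)
    positions = [leftmost + i * (rightmost - leftmost) / (n - 1) for i in range(n)]
    total = sum(outputs)
    return sum(p * o for p, o in zip(positions, outputs)) / total

# Example: a bump of activation slightly right of center among 30 outputs.
outputs = [0.05] * 14 + [0.3, 0.8, 0.9, 0.6, 0.2] + [0.05] * 11
print(steering_direction(outputs))   # a small positive value: steer slightly right
```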
ALVINN - Training
Trained while driving.
► A camera captured the road from the front window, approx. 25 pictures per second.
► Training examples $(x_k, d_k)$ where
  ► $x_k$ = image of the road
  ► $d_k$ ≈ corresponding direction of the steering wheel set by the driver
► The values $d_{kj}$ are computed using a Gaussian distribution, where $D_j$ is the distance of the $j$-th output from the one that corresponds to the real direction of the steering wheel. (This is better than a binary output because similar road directions induce similar reactions of the driver.)
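A hedged sketch (my own) of the Gaussian target encoding just described. The exact formula and width used by ALVINN are not given on the slide, so the form exp(-D_j**2 / width) and the width 10 below are assumptions for illustration only.

```python
import math

def gaussian_targets(true_unit, n_outputs=30, width=10.0):
    """Target vector d_k: a Gaussian bump centered at the output neuron that
    matches the driver's real steering direction (assumed form and width)."""
    return [math.exp(-((j - true_unit) ** 2) / width) for j in range(n_outputs)]

d_k = gaussian_targets(true_unit=17)
print([round(v, 3) for v in d_k[14:21]])  # values near the peak at output 17
```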
Selection of Training Examples
Naive approach: just take images from the camera.
Problems:
► A too good driver never teaches the network how to solve deviations from the right track. A couple of harsh solutions:
  ► turn the learning off for a moment, deviate from the right track, then turn the learning on and let the network learn how to solve the situation;
  ► let the driver go crazy! (a bit dangerous, expensive, unreliable)
► Images are very similar (the network basically sees the road from the right lane), so the network may be overtrained.

Selection of Training Examples
The problem with the too good driver was solved as follows:
► every image of the road was transformed into 15 slightly different copies
[Figure: the original image and its transformed copies.]
The repetitiveness of images was solved as follows:
► the system keeps a buffer of 200 images (including the 15 copies of the current one); in every round it trains on these images
► afterwards, a new image is captured, 15 copies are made, and these new 15 substitute 15 images selected from the buffer (10 with the smallest training error, 5 randomly)
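A hedged sketch (my own) of the buffer replacement strategy just described: 15 new copies replace the 10 buffered images with the smallest training error plus 5 further images chosen at random. The per-image training errors are assumed to be supplied by the network.

```python
import random

def update_buffer(buffer, errors, new_copies):
    """buffer: list of images; errors: training error of each buffered image;
    new_copies: the 15 transformed copies of the freshly captured image."""
    order = sorted(range(len(buffer)), key=lambda i: errors[i])
    easiest = order[:10]                      # 10 images with the smallest error
    randomly_chosen = random.sample(order[10:], 5)   # 5 further images at random
    for slot, image in zip(easiest + randomly_chosen, new_copies):
        buffer[slot] = image
    return buffer

# Example with placeholder "images" (integer ids) and made-up errors.
buffer = list(range(200))
errors = [random.random() for _ in buffer]
buffer = update_buffer(buffer, errors, new_copies=[f"new_{i}" for i in range(15)])
```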
ALVINN - Training
► standard backpropagation
► constant speed of learning (possibly different for each neuron - see PV021)
► some other optimizations (see PV021)
Result:
► Training took 5 minutes; the speed was 4 miles per hour. (The speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.)
► ALVINN was able to drive on roads it had never "seen" and in different weather.

ALVINN - Weight Learning
[Figure: weights of the hidden neurons h1, ..., h5 after round 0, round 10, round 20 and round 50. Here h1, ..., h5 are values of hidden neurons.]

Extensions and Directions (PV021)
► Other types of learning inspired by neuroscience - Hebbian learning
► More biologically plausible models of neural networks - spiking neurons. This goes in the direction of the HUGE area of (computational) neuroscience, only very lightly touched in PV021.
► Unsupervised learning - Self-Organizing Maps
► Reinforcement learning
  ► learning to make decisions, or play games, sequentially
  ► neural networks have been used - temporal difference learning

Deep Learning
► Cybenko's theorem shows that two-layer networks are omnipotent - such results nearly killed neural networks when support vector machines were found to be easier to train in the 00s.
► Later, it was shown (experimentally) that deep networks (with many layers) have better representational properties.
► ... but how to train them? Backpropagation suffers from the so-called vanishing gradient; intuitively, updates of weights in lower layers are very slow.
► In 2006 a solution was found by Hinton et al.:
  ► Use unsupervised methods to initialize the weights so that they capture important features in the data. More precisely: the lowest hidden layer learns patterns in the data, the second lowest learns patterns in the data transformed through the first layer, and so on.
  ► Then use a supervised learning algorithm to only fine-tune the weights to the desired input-output behavior.
A rather heavy machinery is needed to develop this, but you will be rewarded by insight into a very modern and expensive technology.

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
ImageNet database (16,000,000 color images, 20,000 categories)
Competition in classification over a subset of images from ImageNet.
In 2012: Training set 1,200,000 images, 1000 categories. Validation set 50,000, test set 150,000.
Many images contain several objects, so the typical evaluation rule is top-5: a prediction counts as correct if the true category is among the 5 categories assigned the highest probability by the net.

KSH Network
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).
[Figure: the network architecture (convolutional layers, max pooling, dense layers).]
Trained on two GPUs (NVIDIA GeForce GTX 580).
Results:
► Accuracy 84.7% in top-5 (second best algorithm at the time: 73.8%)
► 63.3% in "perfect" classification (top-1)

ILSVRC 2014
The same set of images as in 2012, top-5 criterion.
GoogLeNet: deep convolutional net, 22 layers.
Results:
► 93.33% in top-5
Superhuman power?

Superhuman GoogLeNet?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better.

ILSVRC 2015
► Microsoft network ResNet: 152 layers, complex architecture
► Trained on 8 GPUs
► 96.43% accuracy in top-5
[Figure: comparison of a 34-layer plain network and a 34-layer residual network.]

ILSVRC
[Figure: ilsvrc.png, an overview of ILSVRC results over the years.]
Deeper Insight into the Logistic Sigmoid
Consider a perceptron (that is, a linear classifier):
$$\xi = w_0 + \sum_{i=1}^{n} w_i x_i \qquad \text{and} \qquad y = \operatorname{sgn}(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$

Assume that training examples $(x, c(x))$ are randomly generated. Denote:
► $\xi^1$ the mean signed distance from the boundary of points classified 1,
► $\xi^0$ the mean signed distance from the boundary of points classified 0.
It is not unreasonable to assume that
► conditioned on $c = 1$, the signed distance $\xi$ is normally distributed with mean $\xi^1$ and variance (for simplicity) 1,
► conditioned on $c = 0$, the signed distance $\xi$ is normally distributed with mean $\xi^0$ and variance (for simplicity) 1.
(Notice that $\xi$ may be negative for a point classified 1, which means that such a point is on the wrong side of the boundary; similarly, $\xi$ may be positive for a point classified 0.)

Now, can we decide what is the probability of $c = 1$ given a distance?

For simplicity, assume that $\xi^1 = -\xi^0 = 1/2$. By Bayes' rule,
$$P(1 \mid \xi) = \frac{p(\xi \mid 1) P(1)}{p(\xi \mid 1) P(1) + p(\xi \mid 0) P(0)} = \frac{LR}{LR + 1/CLR}$$
where
$$LR = \frac{p(\xi \mid 1)}{p(\xi \mid 0)} = \frac{\exp\!\left(-(\xi - 1/2)^2 / 2\right)}{\exp\!\left(-(\xi + 1/2)^2 / 2\right)}$$
and $CLR = \frac{P(1)}{P(0)}$, which we assume (for simplicity) to be 1.

So
$$P(1 \mid \xi) = \frac{\exp(\xi)}{\exp(\xi) + 1} = \frac{1}{1 + e^{-\xi}}$$
Thus the logistic sigmoid applied to $\xi = w_0 + \vec{w} \cdot \vec{x}$ gives the probability of $c = 1$ given the input!

So if we use the logistic sigmoid as the activation function and turn the neuron into a classifier as follows:
classify a given input $x$ as 1 iff $y \ge 1/2$,
then the neuron basically works as the Bayes classifier!

This is the basis of logistic regression.

Given training data, we may compute the weights $w$ that maximize the likelihood of the training data (w.r.t. the probabilities returned by the neuron). An extremely interesting observation is that such $w$ maximizing the likelihood coincides with the minimum of least squares for the corresponding linear function (that is, the same neuron but with the identity as the activation function).
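A small numeric check (my own, not from the slides) of the derivation above: with two unit-variance Gaussians centered at +1/2 and -1/2 and equal class priors, the Bayes posterior P(c = 1 | ξ) coincides with the logistic sigmoid of ξ.

```python
import math

def gaussian_density(x, mean, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_posterior(xi, mean1=0.5, mean0=-0.5, prior1=0.5):
    """P(1 | xi) computed directly from Bayes' rule."""
    p1 = gaussian_density(xi, mean1) * prior1
    p0 = gaussian_density(xi, mean0) * (1 - prior1)
    return p1 / (p1 + p0)

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

for xi in [-2.0, -0.3, 0.0, 0.7, 3.0]:
    print(xi, bayes_posterior(xi), sigmoid(xi))   # the two columns agree
```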