PV021: Neural networks
Tomáš Brázdil

Course organization

Course materials:
- Main: the lecture.
- Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/ (an extremely well written modern online textbook).
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, http://www.deeplearningbook.org/ (a very good overview of the state of the art in neural networks).

Evaluation:
- Project
  - teams of two students
  - implementation of a selected model + analysis of given data
  - implementation in C, C++, or Java, without use of any specialized libraries for data analysis and machine learning
  - need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations)
- Oral exam
  - I may ask about anything from the lecture, including some proofs that occur only on the whiteboard!
- Application of any deep learning toolset on given (difficult) data
  - We prefer TensorFlow, but you may use another library (CNTK, Caffe, DeepLearning4j, ...).
  - The goal is to get the best results on increasingly difficult datasets.

FAQ

Q: Why English?
A: A couple of reasons.
First, all resources about modern neural nets are in English, and it is rather cumbersome to translate everything to Czech (a combination of Czech and English is ugly). Second, to attract non-Czech-speaking students to the course.

Q: Why can't we use specialized libraries in projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities.

Machine learning in general

Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed).
- A spam filter
  - learns to recognize spam from a database of "labelled" emails
  - consequently is able to distinguish spam from ham
- A handwritten text reader
  - learns from a database of handwritten letters (or text) labelled by their correct meaning
  - consequently is able to recognize text
- ... and lots of much more sophisticated applications ...

Basic attributes of learning algorithms:
- representation: ability to capture the inner structure of training data
- generalization: ability to work properly on new data

Machine learning algorithms typically construct mathematical models of given data. The models may be subsequently applied to fresh data.
There are many types of models:
- decision trees
- support vector machines
- hidden Markov models
- Bayes networks and other graphical models
- neural networks
- ...

Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks

- An artificial neuron is a rough mathematical approximation of a biological neuron.
- An (artificial) neural network (NN) consists of a number of interconnected artificial neurons.
- The "behavior" of the network is encoded in the connections between neurons.

[Figure: an artificial neuron with inputs $x_1, x_2, \dots, x_n$, inner potential $\xi$, activation function $\sigma$ and output $y$. Image source: http://tulane.edu/sse/cmb/people/schrader/]

Why artificial neural networks?

Modelling of biological neural networks (computational neuroscience):
- simplified mathematical models help to identify important mechanisms
  - How does a brain receive information?
  - How is the information stored?
  - How does a brain develop?
  - ...
- neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments

I will not spend much time on this area!

Neural networks in machine learning:
- Typically primitive models, far from their biological counterparts (but often inspired by biology).
- Strongly oriented towards concrete application domains:
  - decision making and control: autonomous vehicles, manufacturing processes, control of natural resources
  - games: backgammon, poker, Go
  - finance: stock prices, risk analysis
  - medicine: diagnosis, signal processing (EKG, EEG, ...), image processing (MRI, X-ray, ...)
  - text and speech processing: automatic translation, text generation, speech recognition
  - other signal processing: filtering, radar tracking, noise reduction
  - ...

I will concentrate on this area!

Important features of neural networks

- Massive parallelism: many slow (and "dumb") computational elements work in parallel on several levels of abstraction.
- Learning: a kid learns to recognize a rabbit after seeing several rabbits.
- Generalization: a kid is able to recognize a new rabbit after seeing several (old) rabbits.
- Robustness: a blurred photo of a rabbit may still be classified as a picture of a rabbit.
- Graceful degradation: experiments have shown that a damaged neural network is still able to work quite well; a damaged network may re-adapt, and the remaining neurons may take on the functionality of the damaged ones.

The aim of the course

We will concentrate on basic techniques and principles of
neural networks, fundamental models of neural networks and their applications.
You should learn:
- basic models (multilayer perceptron, convolutional networks, recurrent networks (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets)
- standard applications of these models (image processing, speech and text processing)
- basic learning algorithms (gradient descent & backpropagation, Hebb's rule)
- basic practical training techniques (data preparation, setting various parameters, control of learning)
- basic information about current implementations (TensorFlow, Keras)

Biological neural network

- The human neural network consists of approximately $10^{11}$ (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons.
- Each neuron is connected with approx. $10^4$ neurons.
- Neurons themselves are very complex systems.

Rough description of the nervous system:
- An external stimulus is received by sensory receptors (e.g. eye cells).
- Information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated), and subsequently an output signal is produced.
- Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).

[Figure: biological neural network. Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X]

[Figure: Summation]

[Figure: Biological and mathematical neurons]

Formal neuron (without bias)

[Figure: a neuron with inputs $x_1, x_2, \dots, x_n$, weights $w_1, w_2, \dots, w_n$, inner potential $\xi$, activation function $\sigma$ and output $y$.]
- $x_1, \dots, x_n \in \mathbb{R}$ are inputs
- $w_1, \dots, w_n \in \mathbb{R}$ are weights
- $\xi$ is the inner potential; almost always $\xi = \sum_{i=1}^{n} w_i x_i$
- $y$ is the output given by $y = \sigma(\xi)$, where $\sigma$ is an activation function; e.g. a unit step function
  $$\sigma(\xi) = \begin{cases} 1 & \xi \ge h; \\ 0 & \xi < h, \end{cases}$$
  where $h \in \mathbb{R}$ is a threshold.

Formal neuron (with bias)

[Figure: the same neuron with an additional input $x_0 = 1$ (bias) weighted by $w_0 = -h$ (threshold).]
- $x_0 = 1, x_1, \dots, x_n \in \mathbb{R}$ are inputs
- $w_0, w_1, \dots, w_n \in \mathbb{R}$ are weights
- $\xi$ is the inner potential; almost always $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$
- $y$ is the output given by $y = \sigma(\xi)$, where $\sigma$ is an activation function; e.g. a unit step function
  $$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
  (The threshold $h$ has been substituted with the new input $x_0 = 1$ and the weight $w_0 = -h$.)

Neuron and linear separation

[Figure: a hyperplane $\xi = 0$ separating the region $\xi > 0$ from the region $\xi < 0$.]
- The inner potential $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$ determines a separation hyperplane in the $n$-dimensional input space:
  - in 2D a line
  - in 3D a plane
  - ...

[Figure: a single neuron with inputs $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$ (and, in the version with bias, the input $x_0 = 1$ with weight $w_0$) whose output is 1/0 for A/B.]
- $n = 8 \cdot 8$, i.e. the number of pixels in the images.
- Inputs are binary vectors of dimension $n$ (black pixel ≈ 1, white pixel ≈ 0).

[Figure: points labelled A and B in the plane with two lines, $\bar{w}_0 + \sum_{i=1}^{n} \bar{w}_i x_i = 0$ and $w_0 + \sum_{i=1}^{n} w_i x_i = 0$.]
- The red line classifies incorrectly.
- The green line classifies correctly (it may be the result of a correction by a learning algorithm).

Neuron and linear separation (XOR)

[Figure: the points (0, 0) and (1, 1) labelled 0 and the points (0, 1) and (1, 0) labelled 1 in the $x_1 x_2$-plane.]
No line separates ones from zeros.

Neural networks

A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons.
In order to describe a particular type of neural network we need to specify:
- Architecture: how the neurons are connected.
- Activity: how the network transforms inputs to outputs.
- Learning: how the weights are changed during training.

Architecture

Network architecture is given as a digraph whose nodes are neurons and edges are connections.
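As a rough illustration of the digraph view, the sketch below stores an architecture as a mapping from each neuron to the neurons it feeds and checks whether the digraph contains a directed cycle (the property that separates recurrent from feed-forward architectures). The neuron names (`x1`, `h1`, `y`, ...), the dictionary representation and the function `has_directed_cycle` are assumptions made for this example only; they are not part of the lecture.

```python
def has_directed_cycle(edges):
    """Return True iff the digraph given as {neuron: [successors]} has a directed cycle.

    Hypothetical helper: a depth-first search with three colours.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    nodes = set(edges) | {v for succ in edges.values() for v in succ}
    colour = {v: WHITE for v in nodes}

    def visit(u):
        colour[u] = GREY
        for v in edges.get(u, []):
            if colour[v] == GREY:  # edge back to the current path: a cycle
                return True
            if colour[v] == WHITE and visit(v):
                return True
        colour[u] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in nodes)


# A 2-2-1 layered architecture (assumed example): inputs x1, x2, hidden h1, h2, output y.
feed_forward = {
    "x1": ["h1", "h2"],
    "x2": ["h1", "h2"],
    "h1": ["y"],
    "h2": ["y"],
}
# The same architecture with one extra connection from y back to h1.
recurrent = dict(feed_forward, y=["h1"])

print(has_directed_cycle(feed_forward))  # prints False
print(has_directed_cycle(recurrent))     # prints True
```

An adjacency mapping is only one possible encoding; a weight matrix with zeros for missing connections would serve equally well.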
We distinguish several categories of neurons:
- output neurons
- hidden neurons
- input neurons
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input nor output.)

Architecture – Cycles

- A network is cyclic (recurrent) if its architecture contains a directed cycle.
- Otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)

[Figure: a layered network with an input layer ($x_1$, $x_2$), hidden layers and an output layer ($y_1$, $y_2$).]
- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the $i$-th layer are connected with all neurons in the $(i+1)$-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Activity

Consider a network with $n$ neurons, $k$ input and $\ell$ output.
- The state of a network is a vector of output values of all neurons.
  (States of a network with $n$ neurons are vectors of $\mathbb{R}^n$.)
- The state-space of a network is the set of all states.
- A network input is a vector of $k$ real numbers, i.e. an element of $\mathbb{R}^k$.
- The network input space is the set of all network inputs (sometimes we restrict ourselves to a proper subset of $\mathbb{R}^k$).
- Initial state: input neurons are set to values from the network input (each component of the network input corresponds to an input neuron); values of the remaining neurons are set to 0.

Activity – computation of a network

Computation (typically) proceeds in discrete steps. In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated).
   (If a neuron does not have any inputs, its value remains constant.)

A computation is finite on a network input $x$ if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on $x$.

The network output is a vector of values of all output neurons in the network (i.e. an element of $\mathbb{R}^\ell$). Note that the network output keeps changing throughout the computation!

An MLP uses the following selection rule: in the $i$-th step evaluate all neurons in the $i$-th layer.

Activity – semantics of a network

Definition. Consider a network with $n$ neurons, $k$ input and $\ell$ output. Let $A \subseteq \mathbb{R}^k$ and $B \subseteq \mathbb{R}^\ell$.
Suppose that the network stops on every input of $A$. Then we say that the network computes a function $F : A \to B$ if for every network input $x \in A$ the vector $F(x) \in B$ is the output of the network after the computation on $x$ stops.

Example 1. [Figure: example network.] This network computes a function from $\mathbb{R}^2$ to $\mathbb{R}$.

Activity – inner potential and activation functions

In order to specify the activity of the network, we need to specify how the inner potentials $\xi$ are computed and what the activation functions $\sigma$ are.

We assume (unless otherwise specified) that
$$\xi = w_0 + \sum_{i=1}^{n} w_i \cdot x_i,$$
where $x = (x_1, \dots, x_n)$ are the inputs of the neuron and $w = (w_1, \dots, w_n)$ are the weights.

There are special types of neural networks where the inner potential is computed differently, e.g.
as a "distance" of an input from the weight vector:
$$\xi = \lVert x - w \rVert,$$
where $\lVert \cdot \rVert$ is a vector norm, typically Euclidean.

There are many activation functions; typical examples:
- Unit step function:
  $$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
- (Logistic) sigmoid:
  $$\sigma(\xi) = \frac{1}{1 + e^{-\lambda \cdot \xi}},$$
  where $\lambda \in \mathbb{R}$ is a steepness parameter.
- Hyperbolic tangent:
  $$\sigma(\xi) = \frac{1 - e^{-\xi}}{1 + e^{-\xi}} = \tanh(\xi / 2)$$

Activity – XOR

[Figure: a network with inputs $x_1$, $x_2$, two hidden neurons and one output neuron, each neuron with a bias input 1. Read off the edge labels, the first hidden neuron has inner potential $-1 + 2x_1 + 2x_2$, the second hidden neuron has $3 - 2x_1 - 2x_2$, and the output neuron has $-2 + h_1 + h_2$, where $h_1, h_2$ are the hidden outputs. The frames evaluate the network layer by layer on the inputs (1, 1), (0, 0), (1, 0) and (0, 1).]

The activation function is the unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
The network computes XOR($x_1$, $x_2$):

x1  x2  y
1   1   0
1   0   1
0   1   1
0   0   0

Activity – MLP and linear separation

[Figure: the points (0, 0) and (1, 1) labelled 0 and the points (0, 1) and (1, 0) labelled 1 in the $x_1 x_2$-plane, together with the lines $P_1$ and $P_2$ and the XOR network.]
- The line $P_1$ is given by $-1 + 2x_1 + 2x_2 = 0$.
- The line $P_2$ is given by $3 - 2x_1 - 2x_2 = 0$.

Activity – example

[Figure: a network with one input $x_1$ and three unit-step neurons, each with a bias input 1; edge labels in the order they appear in the extracted text: 1 2 −5 1 −2 11 −2 −1. Successive frames show the neuron values (0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1).]

The activation function is the unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
The input is equal to 1.

Learning

Consider a network with $n$ neurons, $k$ input and $\ell$ output.
- A configuration of a network is a vector of all values of weights.
  (Configurations of a network with $m$ connections are elements of $\mathbb{R}^m$.)
- The weight-space of a network is the set of all configurations.
- Initial configuration: weights can be initialized randomly or using some sophisticated algorithm.

Learning algorithms

Learning rule for weight adaptation (the goal is to find a configuration in which the network computes a desired function).
- Supervised learning
  - The desired function is described using training examples that are pairs of the form (input, output).
  - The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
- Unsupervised learning
  - The training set contains only inputs.
  - The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.).
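To make supervised learning concrete, here is a minimal sketch for a single unit-step neuron trained on (input, output) pairs. It uses the classic perceptron update rule, which this text does not state explicitly, so treat the rule itself, the names `train_perceptron`, `lr` and `epochs`, and the toy points as assumptions chosen for illustration. Whenever an example is misclassified, the weights are shifted so that the separating line moves towards classifying that point correctly.

```python
def step(xi):
    """Unit step activation: 1 if xi >= 0, else 0."""
    return 1 if xi >= 0 else 0


def predict(w, x):
    """Output of a formal neuron with bias; w[0] is w0 (the input x0 = 1 is implicit)."""
    xi = w[0] + sum(wi * xv for wi, xv in zip(w[1:], x))
    return step(xi)


def train_perceptron(examples, n_inputs, lr=1.0, epochs=100):
    """Perceptron rule (assumed here): w <- w + lr * (d - y) * x on every mistake."""
    w = [0.0] * (n_inputs + 1)  # initial configuration: all weights zero
    for _ in range(epochs):
        mistakes = 0
        for x, d in examples:          # consider examples one after another
            y = predict(w, x)
            if y != d:                 # incorrectly classified point
                mistakes += 1
                delta = lr * (d - y)
                w[0] += delta          # bias input is x0 = 1
                for i, xv in enumerate(x):
                    w[i + 1] += delta * xv
        if mistakes == 0:              # every training example is classified correctly
            break
    return w


# Toy training set in the plane (made up): value 1 for class A, value 0 for class B.
examples = [
    ((2.0, 3.0), 1), ((3.0, 3.5), 1), ((2.5, 4.0), 1),
    ((0.0, 0.5), 0), ((1.0, 0.0), 0), ((0.5, 1.0), 0),
]
w = train_perceptron(examples, n_inputs=2)
print(all(predict(w, x) == d for x, d in examples))  # prints True
```

The loop stops early only because the toy data is linearly separable; on data that no line separates, it simply runs for `epochs` passes and returns the last configuration.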

Supervised learning – illustration

[Figure: points labelled A and B in the plane]

classification in the plane using a single neuron
training examples are of the form (point, value), where the value is either 1 or 0 depending on whether the point is A or B
the algorithm considers examples one after another
whenever an incorrectly classified point is considered, the learning algorithm turns the line in the direction of the point

Summary – Advantages of neural networks

Massive parallelism
  neurons can be evaluated in parallel
Learning
  many sophisticated learning algorithms used to "program" neural networks
Generalization and robustness
  information is encoded in a distributed manner in weights
  "close" inputs typically get similar values
Graceful degradation
  damage typically causes only a decrease in precision of results

Expressive power of neural networks

Formal neuron (with bias)

[Figure: a neuron with inputs $x_1, \ldots, x_n$ weighted by $w_1, \ldots, w_n$, the bias input $x_0 = 1$ weighted by $w_0 = -h$ (h is the threshold), inner potential $\xi$, activation $\sigma$ and output $y$]

$x_0 = 1, x_1, \ldots, x_n \in \mathbb{R}$ are inputs
$w_0, w_1, \ldots, w_n \in \mathbb{R}$ are weights
$\xi$ is an inner potential; almost always $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$
$y$ is an output given by $y = \sigma(\xi)$, where $\sigma$ is an activation function; e.g. a unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0 \\ 0 & \xi < 0 \end{cases}$$

Boolean functions

Activation function: the unit step function $\sigma$ defined above.

Single neurons computing basic Boolean functions (inputs $x_1, \ldots, x_n$, bias input $x_0 = 1$):
  $y = \mathrm{AND}(x_1, \ldots, x_n)$: $w_1 = \cdots = w_n = 1$, $w_0 = -n$
  $y = \mathrm{OR}(x_1, \ldots, x_n)$: $w_1 = \cdots = w_n = 1$, $w_0 = -1$
  $y = \mathrm{NOT}(x_1)$: $w_1 = -1$, $w_0 = 0$

Theorem
Let $\sigma$ be the unit step function. Two layer MLPs, where each neuron has $\sigma$ as the activation function, are able to compute all functions of the form $F : \{0, 1\}^n \to \{0, 1\}$.

Proof.
Given a vector $v = (v_1, \ldots, v_n) \in \{0, 1\}^n$, consider a neuron $N_v$ whose output is 1 iff the input is v:
$$w_0 = -\sum_{i=1}^{n} v_i, \qquad w_i = \begin{cases} 1 & v_i = 1 \\ -1 & v_i = 0 \end{cases}$$
Now let us connect all outputs of all neurons $N_v$ satisfying $F(v) = 1$ using a neuron implementing OR.
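
The single-neuron gates and the two-layer construction from the Boolean functions proof can be checked with a few lines of code. The following is only a minimal sketch in Python: the function names (`neuron`, `detector_weights`, `two_layer_net`) and the choice of 3-input parity as the example function are assumptions made for illustration, not part of the lecture.

```python
from itertools import product


def step(xi):
    """Unit step activation: 1 if xi >= 0, otherwise 0."""
    return 1 if xi >= 0 else 0


def neuron(weights, x):
    """Formal neuron; weights[0] is the bias weight w0 (its input is x0 = 1)."""
    xi = weights[0] + sum(w * xv for w, xv in zip(weights[1:], x))
    return step(xi)


def and_weights(n):
    return [-n] + [1] * n      # w0 = -n, w1..wn = 1


def or_weights(n):
    return [-1] + [1] * n      # w0 = -1, w1..wn = 1


NOT_WEIGHTS = [0, -1]          # w0 = 0, w1 = -1


def detector_weights(v):
    """Weights of N_v from the proof: w0 = -sum(v), wi = 1 if vi = 1 else -1."""
    return [-sum(v)] + [1 if vi == 1 else -1 for vi in v]


def two_layer_net(F, n):
    """Two-layer network for F: {0,1}^n -> {0,1}: one N_v per v with F(v) = 1, then OR."""
    hidden = [detector_weights(v) for v in product((0, 1), repeat=n) if F(v) == 1]

    def net(x):
        h = [neuron(w, x) for w in hidden]
        # With no hidden neurons the OR neuron has potential -1 and outputs 0.
        return neuron(or_weights(len(h)), h)

    return net


def xor3(v):
    """Hypothetical example function: parity of three bits."""
    return sum(v) % 2


net = two_layer_net(xor3, 3)
for v in product((0, 1), repeat=3):
    print(v, net(v), xor3(v))
```

Note that the hidden layer built this way has one neuron per input vector with F(v) = 1, so it can grow exponentially with n; the sketch illustrates expressibility, not an efficient design.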

Non-linear separation

[Figure: a three layer network with inputs $x_1, x_2$ and output $y$]

Consider a three layer network; each neuron has the unit step activation function.
The network divides the input space into two subspaces according to the output (0 or 1).
The first (hidden) layer divides the input space into half-spaces.
The second layer may e.g. make intersections of the half-spaces ⇒ convex sets.
The third layer may e.g. make unions of some convex sets.

Non-linear separation – illustration

[Figure: a three layer network with inputs $x_1, \ldots, x_k$ and output $y$]

Consider three layer networks; each neuron has the unit step activation function.
Three layer nets are capable of "approximating" any "reasonable" subset A of the input space $\mathbb{R}^k$.
Cover A with hypercubes (in 2D squares, in 3D cubes, ...).
Each hypercube K can be separated using a two layer network $N_K$ (i.e. the function computed by $N_K$ gives 1 for points in K and 0 for the rest).
Finally, connect outputs of the nets $N_K$ satisfying $K \cap A \neq \emptyset$ using a neuron implementing OR.

Non-linear separation - sigmoid

Theorem (Cybenko 1989 - informal version)
Let $\sigma$ be a continuous function which is sigmoidal, i.e. satisfies
$$\sigma(x) \to \begin{cases} 1 & \text{for } x \to +\infty \\ 0 & \text{for } x \to -\infty \end{cases}$$
For every "reasonable" set $A \subseteq [0, 1]^n$, there is a two layer network where each hidden neuron has the activation function $\sigma$ (output neurons are linear), that satisfies the following:
For "most" vectors $v \in [0, 1]^n$ we have that $v \in A$ iff the network output is > 0 for the input v.

For the mathematically oriented:
  "reasonable" means Lebesgue measurable
  "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given $\varepsilon > 0$

Non-linear separation - practical illustration

ALVINN drives a car
The net has 30 × 32 = 960 inputs (the input space is thus $\mathbb{R}^{960}$).
Input values correspond to shades of gray of pixels.
Output neurons "classify" images of the road based on their "curvature".
Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

Function approximation - three layers

Let $\sigma$ be a logistic sigmoid, i.e.
$$\sigma(\xi) = \frac{1}{1 + e^{-\xi}}$$
For every continuous function $f : [0, 1]^n \to [0, 1]$ and $\varepsilon > 0$ there is a three-layer network computing a function $F : [0, 1]^n \to [0, 1]$ such that
  there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential $\xi$,
  the remaining neurons have the logistic sigmoid $\sigma$ as their activation,
  for every $v \in [0, 1]^n$ we have that $|F(v) - f(v)| < \varepsilon$.

Function approximation – three layer networks

[Figure: a three layer network with inputs $x_1, x_2$, hidden neurons with activation $\sigma$ and an output neuron $y$ (marked ζ); the plots show the inner potential and the value of a neuron, a "spike" built from such neurons (... + the other two 90 degree rotations), and the output as a weighted sum of "spikes"]

Function approximation - two-layer networks

Theorem (Cybenko 1989)
Let $\sigma$ be a continuous function which is sigmoidal, i.e. is increasing and satisfies
$$\sigma(x) \to \begin{cases} 1 & \text{for } x \to +\infty \\ 0 & \text{for } x \to -\infty \end{cases}$$
For every continuous function $f : [0, 1]^n \to [0, 1]$ and every $\varepsilon > 0$ there is a function $F : [0, 1]^n \to [0, 1]$ computed by a two layer network where each hidden neuron has the activation function $\sigma$ (output neurons are linear), that satisfies the following:
$$|f(v) - F(v)| < \varepsilon \quad \text{for every } v \in [0, 1]^n.$$

Neural networks and computability

Consider recurrent networks (i.e. containing cycles)
  with real weights (in general);
  one input neuron and one output neuron (the network computes a function $F : A \to \mathbb{R}$ where $A \subseteq \mathbb{R}$ contains all inputs on which the network stops);
  parallel activity rule (output values of all neurons are recomputed in every step);
  activation function
  $$\sigma(\xi) = \begin{cases} 1 & \xi \ge 1 \\ \xi & 0 \le \xi \le 1 \\ 0 & \xi < 0 \end{cases}$$

We encode words $\omega \in \{0, 1\}^+$ into numbers as follows:
$$\delta(\omega) = \sum_{i=1}^{|\omega|} \frac{\omega(i)}{2^i} + \frac{1}{2^{|\omega|+1}}$$
E.g. $\omega = 11001$ gives $\delta(\omega) = \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^5} + \frac{1}{2^6}$ (= 0.110011 in binary form).

A network recognizes a language $L \subseteq \{0, 1\}^+$ if it computes a function $F : A \to \mathbb{R}$ ($A \subseteq \mathbb{R}$) such that $\omega \in L$ iff $\delta(\omega) \in A$ and $F(\delta(\omega)) > 0$.
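
The word encoding $\delta$ and the saturated-linear activation used for these recurrent networks are easy to reproduce. The following is a minimal sketch in Python using exact rational arithmetic; the function names `delta` and `saturated_linear` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def delta(word):
    """Encode a non-empty binary word: sum_i word(i)/2^i + 1/2^(|word|+1)."""
    if not word or any(c not in "01" for c in word):
        raise ValueError("word must be a non-empty string over {0, 1}")
    value = sum(Fraction(int(c), 2 ** i) for i, c in enumerate(word, start=1))
    # The extra term appends a final binary digit 1, which marks the end of the word.
    return value + Fraction(1, 2 ** (len(word) + 1))


def saturated_linear(xi):
    """Activation from the slide: 0 below 0, identity on [0, 1], 1 above 1."""
    return min(max(xi, 0), 1)


print(delta("11001"))           # 51/64, i.e. 0.110011 in binary
print(delta("0"), delta("00"))  # 1/4 and 1/8: trailing zeros are not lost
```

The terminating 1 is what keeps words such as 0 and 00 apart; without it both would be encoded as the number 0.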

Recurrent networks with rational weights are equivalent to Turing machines
  For every recursively enumerable language $L \subseteq \{0, 1\}^+$ there is a recurrent network with rational weights and fewer than 1000 neurons which recognizes L.
  The halting problem is undecidable for networks with at least 25 neurons and rational weights.
  There is a "universal" network (an equivalent of the universal Turing machine).

Recurrent networks with real weights are super-Turing powerful
  For every language $L \subseteq \{0, 1\}^+$ there is a recurrent network with fewer than 1000 neurons which recognizes L.

Summary of theoretical results

Neural networks are very strong from the point of view of theory:
  All Boolean functions can be expressed using two-layer networks.
  Two-layer networks may approximate any continuous function.
  Recurrent networks are at least as strong as Turing machines.

These results are purely theoretical!
  "Theoretical" networks are extremely huge.
  It is very difficult to handcraft them even for the simplest problems.

From a practical point of view, the most important advantages of neural networks are: learning, generalization, robustness.

Neural networks vs classical computers

              | Neural networks | "Classical" computers
  Data        | implicitly in weights | explicitly
  Computation | naturally parallel | sequential, localized
  Robustness  | robust w.r.t. input corruption & damage | changing one bit may completely crash the computation
  Precision   | imprecise, network recalls a training example "similar" to the input | (typically) precise
  Programming | learning | manual

History & implementations

History of neurocomputers

1951: SNARC (Minsky et al.)
  the first implementation of a neural network
  a rat strives to exit a maze
  40 artificial neurons (300 vacuum tubes, motors, etc.)

1957: Mark I Perceptron (Rosenblatt et al.) - the first successful network for image recognition
  single layer network
  image represented by 20 × 20 photocells
  intensity of pixels was treated as the input to a perceptron (basically the formal neuron), which recognized figures
  weights were implemented using potentiometers, each set by its own motor
  it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability

1960: ADALINE (Widrow & Hoff)
  single layer neural network
  weights stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance.
  Widrow founded a company, Memistor Corporation, which sold implementations of neural networks.

1960-66: several companies concerned with neural networks were founded.

1967-82: standstill after the publication of the book Perceptrons by Minsky & Papert (published 1969)

1983-end of 90s: revival of neural networks
  many attempts at hardware implementations
    application specific chips (ASIC)
    programmable hardware (FPGA)
  hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production etc.)

end of 90s-circa 2005: NN overshadowed by other machine learning methods (support vector machines (SVM))

2006-now: The boom of neural networks!
  deep networks – often better than any other method
  GPU implementations
  ... some specialized hw implementations (Google's TPU)

History in waves ...

Figure: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).

Current hardware – What do we face?

Increasing dataset size ...

... and thus increasing size of neural networks ...

[Figure legend, selected networks:]
  2. ADALINE
  4. Early back-propagation network (Rumelhart et al., 1986b)
  8. Image recognition: LeNet-5 (LeCun et al., 1998b)
  10. Dimensionality reduction: Deep belief network (Hinton et al., 2006) ...
      here the third "wave" of neural networks started
  15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
  18. Image recognition (AlexNet): Multi-GPU convolutional network (Krizhevsky et al., 2012)
  20. Image recognition: GoogLeNet (Szegedy et al., 2014a)

... as a reward we get this ...

Figure: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).

Current hardware

In 2012, Google trained a large network of 1.7 billion weights and 9 layers.
  The task was image recognition (10 million YouTube video frames).
  The hw comprised a 1000 computer network (16 000 cores); the computation took three days.

In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI.
  Able to train 1 billion parameter networks on just 3 machines in a couple of days.
  Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.

Current hardware – NVIDIA DGX Station

  4x GPU (Tesla V100)
  TFLOPS = 480
  GPU memory: 64 GB total
  NVIDIA Tensor Cores: 2,560
  NVIDIA CUDA Cores: 20,480
  System memory: 256 GB
  Network: Dual 10 Gb LAN
  NVIDIA Deep Learning SDK

Current software

TensorFlow (Google)
  open source software library for numerical computation using data flow graphs
  allows implementation of most current neural networks
  allows computation on multiple devices (CPUs, GPUs, ...)
  Python API
  Keras: a library on top of TensorFlow that allows easy description of most modern neural networks
CNTK (Microsoft)
  functionality similar to TensorFlow
  special input language called BrainScript
Theano (dead):
  The "academic" grand-daddy of deep-learning frameworks, written in Python. Strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow).
There are others: Caffe, Torch (Facebook), Deeplearning4j, ...

Current software – Keras

[Figure omitted]

Other software implementations

Most "mathematical" software packages contain some support of neural networks:
  MATLAB
  R
  STATISTICA
  Weka
  ...
The implementations are typically not on par with the previously mentioned dedicated deep-learning libraries.

Training linear models

Linear regression (ADALINE)

Architecture:
[Figure: a single neuron with inputs $x_0 = 1, x_1, \ldots, x_n$, weights $w_0, w_1, \ldots, w_n$ and output $y$]
$w = (w_0, w_1, \ldots, w_n)$ and $x = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$.

Activity:
  inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = w \cdot x$
  activation function: $\sigma(\xi) = \xi$
  network function: $y[w](x) = \sigma(\xi) = w \cdot x$

Learning:
Given a training dataset
$$T = \{(x_1, d_1), (x_2, d_2), \ldots, (x_p, d_p)\}$$
Here $x_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \mathbb{R}$ is the expected output.
Intuition: The network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set.

Oaks in Wisconsin

[Figure omitted]

Error function:
$$E(w) = \frac{1}{2} \sum_{k=1}^{p} (w \cdot x_k - d_k)^2 = \frac{1}{2} \sum_{k=1}^{p} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2$$
The goal is to find w which minimizes E(w).

Error function

[Figure omitted]

Gradient of the error function

Consider the gradient of the error function:
$$\nabla E(w) = \left( \frac{\partial E}{\partial w_0}(w), \ldots, \frac{\partial E}{\partial w_n}(w) \right)$$
Intuition: $\nabla E(w)$ is a vector in the weight space which points in the direction of the steepest ascent of the error function.
Note that the vectors $x_k$ are just parameters of the function E, and are thus fixed!
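
The intuition that the gradient points in the direction of steepest ascent can be probed numerically before any formula for it is derived. The following is a minimal sketch in Python that estimates the partial derivatives of E by central differences; the tiny dataset and the names `error` and `numerical_gradient` are hypothetical and serve only as an illustration.

```python
def error(w, data):
    """E(w) = 1/2 * sum_k (w . x_k - d_k)^2 for data = [(x_k, d_k), ...]."""
    return 0.5 * sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - d) ** 2 for x, d in data
    )


def numerical_gradient(w, data, h=1e-6):
    """Central-difference estimate of (dE/dw_0, ..., dE/dw_n) at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((error(up, data) - error(down, data)) / (2 * h))
    return grad


# Hypothetical toy data, not from the lecture; x_k0 = 1 is the bias input.
data = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 4.0)]
w = [0.0, 0.0]
g = numerical_gradient(w, data)

# A small move along the gradient increases E, a move against it decreases E.
eta = 1e-3
uphill = [wi + eta * gi for wi, gi in zip(w, g)]
downhill = [wi - eta * gi for wi, gi in zip(w, g)]
print(g, error(downhill, data), error(w, data), error(uphill, data))
```

Only the weights are perturbed; the training inputs stay fixed, in line with the note that the $x_k$ are parameters of E.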

Fact
If $\nabla E(w) = 0 = (0, \ldots, 0)$, then w is a global minimum of E.
For ADALINE, the error function E(w) is a convex paraboloid and thus has a unique global minimum.

Gradient - illustration

[Figure omitted]
Caution! This picture just illustrates the notion of gradient ... it is not the convex paraboloid E(w)!

Gradient of the error function

First, consider n = 1. Then the model is $y = w_0 + w_1 \cdot x$.
Consider a concrete training set:
$$T = \{((1, 2), 1), ((1, 3), 2), ((1, 4), 5)\} = \{((x_{10}, x_{11}), d_1), ((x_{20}, x_{21}), d_2), ((x_{30}, x_{31}), d_3)\}$$
$$E(w_0, w_1) = \frac{1}{2} \left[ (w_0 + w_1 \cdot 2 - 1)^2 + (w_0 + w_1 \cdot 3 - 2)^2 + (w_0 + w_1 \cdot 4 - 5)^2 \right]$$
$$\frac{\partial E}{\partial w_0} = (w_0 + w_1 \cdot 2 - 1) \cdot 1 + (w_0 + w_1 \cdot 3 - 2) \cdot 1 + (w_0 + w_1 \cdot 4 - 5) \cdot 1$$
$$\frac{\partial E}{\partial w_1} = (w_0 + w_1 \cdot 2 - 1) \cdot 2 + (w_0 + w_1 \cdot 3 - 2) \cdot 3 + (w_0 + w_1 \cdot 4 - 5) \cdot 4$$

Gradient of the error function

In general, for each weight $w_\ell$:
$$\frac{\partial E}{\partial w_\ell}(w) = \frac{1}{2} \sum_{k=1}^{p} \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2$$
$$= \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)$$
$$= \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \left( \sum_{i=0}^{n} \frac{\partial}{\partial w_\ell} w_i x_{ki} - \frac{\partial}{\partial w_\ell} d_k \right)$$
$$= \sum_{k=1}^{p} (w \cdot x_k - d_k) \, x_{k\ell}$$
Thus
$$\nabla E(w) = \left( \frac{\partial E}{\partial w_0}(w), \ldots, \frac{\partial E}{\partial w_n}(w) \right) = \sum_{k=1}^{p} (w \cdot x_k - d_k) \, x_k$$

Linear regression - learning

Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite to the gradient.
The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
  weights in $w^{(0)}$ are randomly initialized to values close to 0
  in the step t + 1, weights $w^{(t+1)}$ are computed as follows:
  $$w^{(t+1)} = w^{(t)} - \varepsilon \cdot \nabla E(w^{(t)}) = w^{(t)} - \varepsilon \cdot \sum_{k=1}^{p} (w^{(t)} \cdot x_k - d_k) \cdot x_k$$
  Here $0 < \varepsilon \le 1$ is a learning rate.

Proposition
For sufficiently small $\varepsilon > 0$ the sequence $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$ converges (componentwise) to the global minimum of E (i.e. to the vector w satisfying $\nabla E(w) = 0$).
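
The batch algorithm can be tried out on the concrete training set T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} used in the gradient example. The following is a minimal sketch in Python, not a reference implementation: the function names, the learning rate 0.05, the number of steps and the all-zero initial weights (instead of random values close to 0, to keep the run reproducible) are assumptions made for this illustration.

```python
def gradient(w, data):
    """Gradient of E(w) = 1/2 sum_k (w.x_k - d_k)^2, i.e. sum_k (w.x_k - d_k) x_k."""
    grad = [0.0] * len(w)
    for x, d in data:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - d
        for i, xi in enumerate(x):
            grad[i] += residual * xi
    return grad


def batch_gradient_descent(data, epsilon, steps):
    """Repeat w <- w - epsilon * grad E(w), using the whole training set in each step."""
    w = [0.0] * len(data[0][0])  # assumption: zeros instead of small random values
    for _ in range(steps):
        g = gradient(w, data)
        w = [wi - epsilon * gi for wi, gi in zip(w, g)]
    return w


# Training set from the slides; the first component of every input is x_k0 = 1.
T = [((1.0, 2.0), 1.0), ((1.0, 3.0), 2.0), ((1.0, 4.0), 5.0)]
w = batch_gradient_descent(T, epsilon=0.05, steps=5000)
print(w)  # approaches the least-squares line y = -10/3 + 2x
```

The learning rate matters here: for this particular data a rate such as 0.1 is already too large and the iteration diverges, which illustrates why the proposition asks for a sufficiently small $\varepsilon$.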

Linear regression - animation

[Animation frames omitted]

ADALINE - learning

Online algorithm (Delta-rule, Widrow-Hoff rule):
  weights in $w^{(0)}$ initialized randomly close to 0
  in the step t + 1, weights $w^{(t+1)}$ are computed as follows:
  $$w^{(t+1)} = w^{(t)} - \varepsilon(t) \cdot (w^{(t)} \cdot x_k - d_k) \cdot x_k$$
  Here $k = (t \bmod p) + 1$ and $0 < \varepsilon(t) \le 1$ is a learning rate in the step t + 1.
Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example.

Theorem (Widrow & Hoff)
If $\varepsilon(t) = \frac{1}{t}$, then $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$ converges to the global minimum of E.

What about classification?

Binary classification: Desired outputs 0 and 1.
Ideally, capture the probability distribution of classes.
[Figure] ... does not capture probability well (it is not a probability at all)
[Figure] ... logistic sigmoid $\frac{1}{1 + e^{-(w \cdot x)}}$ is much better!

Logistic regression

[Figure: a single neuron with inputs $x_0 = 1, x_1, \ldots, x_n$, weights $w_0, w_1, \ldots, w_n$ and output $y$]
$w = (w_0, w_1, \ldots, w_n)$ and $x = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$.

Activity:
  inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = w \cdot x$
  activation function: $\sigma(\xi) = \frac{1}{1 + e^{-\xi}}$
  network function: $y[w](x) = \sigma(\xi) = \frac{1}{1 + e^{-(w \cdot x)}}$
Intuition: The output y is now the probability of the class 1 given the input x.

But what is the meaning of the sigmoid?

The model gives a probability y of the class 1 given an input x.
But why do we model such a probability using $1/(1 + e^{-w \cdot x})$?
What about the odds of the class 1?
$$\mathrm{odds}(y) = \frac{y}{1 - y}$$
Resembles an exponential function ...
What about the log odds (aka logit) of the class 1?
$$\mathrm{logit}(y) = \log(y/(1 - y))$$
Looks almost linear ...

Put $\log(y/(1 - y)) = w \cdot x$.
Then $\log((1 - y)/y) = -w \cdot x$
and $(1 - y)/y = e^{-w \cdot x}$
and $y = \frac{1}{1 + e^{-w \cdot x}}$.
That is, if we model log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function.

Logistic regression

Learning:
Given a training dataset
$$T = \{(x_1, d_1), (x_2, d_2), \ldots, (x_p, d_p)\}$$
Here $x_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \{0, 1\}$ is the expected output.
What error function?
(Binary) cross-entropy:
$$E(w) = \sum_{k=1}^{p} -\left( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \right)$$
where $y_k$ is the network output for the input $x_k$.
What?!?

Log likelihood is your friend!

Let's have a "coin" (sides 0 and 1).
The probability of 1 is p and is unknown!
You have tossed the coin 5 times and got a training dataset:
$$T = \{1, 1, 0, 0, 1\} = \{d_1, \ldots, d_5\}$$
Consider this to be a very special case where the input dimension is 0.
What is the best model y of p based on the data?
Answer: The one that generates the data with maximum probability!
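
For the coin data T = {1, 1, 0, 0, 1} the "maximum probability" answer can be found by brute force. The following is a minimal sketch in Python, assuming that the tosses are independent and that the model is a single constant y; the grid of candidate values and the function names are hypothetical choices made for this illustration.

```python
import math

T = [1, 1, 0, 0, 1]  # coin tosses from the slides


def likelihood(y, data):
    """Probability of observing exactly `data` if every toss gives 1 with probability y."""
    return math.prod(y if d == 1 else 1 - y for d in data)


def cross_entropy(y, data):
    """Binary cross-entropy of the constant model y on `data`."""
    return sum(-(d * math.log(y) + (1 - d) * math.log(1 - y)) for d in data)


# Hypothetical grid of candidate models 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best_by_likelihood = max(candidates, key=lambda y: likelihood(y, T))
best_by_cross_entropy = min(candidates, key=lambda y: cross_entropy(y, T))
print(best_by_likelihood, best_by_cross_entropy)
```

Both searches pick the same candidate, the observed fraction of ones (3 out of 5), because the cross-entropy of the constant model is the negative logarithm of the likelihood.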
Keep in mind our dataset: T = {1, 1, 0, 0, 1} = {d1, . . . , d5} Assume that the data was generated by independent trials, then the probability of getting exactly T is L = y · y · (1 − y) · (1 − y) · y How to maximize this w.r.t. y? Maximize LL = log(L) = log(y)+log(y)+log(1−y)+log(1−y)+log(y) 87 Log likelihood is your friend! Keep in mind our dataset: T = {1, 1, 0, 0, 1} = {d1, . . . , d5} Assume that the data was generated by independent trials, then the probability of getting exactly T is L = y · y · (1 − y) · (1 − y) · y How to maximize this w.r.t. y? Maximize LL = log(L) = log(y)+log(y)+log(1−y)+log(1−y)+log(y) But then −LL = −1·log(y)−1·log(y)−(1−0)·log(1−y)−(1−0)·log(1−y)−1·log(y) and thus −LL is the cross-entropy. 87 Let the coin depend on the input Consider our model: y = 1 1 + e−(w·x) 88 Let the coin depend on the input Consider our model: y = 1 1 + e−(w·x) The training dataset is now standard: T = x1, d1 , x2, d2 , . . . , xp, dp Here xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the expected output. 88 Let the coin depend on the input Consider our model: y = 1 1 + e−(w·x) The training dataset is now standard: T = x1, d1 , x2, d2 , . . . , xp, dp Here xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the expected output. The likelihood: L = p k=1 ydk k · (1 − yk )(1−dk ) and LL = log(L) = p k=1 (dk log(yk ) + (1 − dk ) log(1 − yk )) and thus −LL = the cross-entropy. Minimizing the cross-netropy maximizes the log-likelihood (and vice versa). 88 Normal Distribution Distribution of continuous random variables. Density (one dimensional, that is over R): p(x) = 1 σ √ 2π exp − (x − µ)2 2σ2 =: N[µ, σ2 ](x) µ is the expected value (the mean), σ2 is the variance. 89 Maximum Likelihood vs Least Squares (Dim 1) Fix a training set D = (x1, d1) , (x2, d2) , . . . , xp, dp 90 Maximum Likelihood vs Least Squares (Dim 1) Fix a training set D = (x1, d1) , (x2, d2) , . . . 
, (xp, dp)
Assume that each $d_k$ has been generated randomly by
$$d_k = (w_0 + w_1 \cdot x_k) + \epsilon_k$$
$w_0, w_1$ are unknown numbers
$\epsilon_k$ are normally distributed with mean 0 and an unknown variance $\sigma^2$

Maximum Likelihood vs Least Squares (Dim 1)
Keep in mind: $d_k = (w_0 + w_1 \cdot x_k) + \epsilon_k$
Assume that $\epsilon_1, \ldots, \epsilon_p$ were generated independently.
Denote by $p(d_1, \ldots, d_p \mid w_0, w_1, \sigma^2)$ the probability density according to which the values $d_1, \ldots, d_p$ were generated assuming fixed $w_0, w_1, \sigma^2, x_1, \ldots, x_p$.
The independence and normality imply
$$p(d_1, \ldots, d_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^{p} N[w_0 + w_1 x_k, \sigma^2](d_k) = \prod_{k=1}^{p} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(d_k - w_0 - w_1 x_k)^2}{2\sigma^2}\right)$$

Maximum Likelihood vs Least Squares
Our goal is to find $(w_0, w_1)$ that maximizes the likelihood that the training set $D$ with fixed values $d_1, \ldots, d_p$ has been generated:
$$L(w_0, w_1, \sigma^2) := p(d_1, \ldots, d_p \mid w_0, w_1, \sigma^2)$$
Theorem
$(w_0, w_1)$ maximizes $L(w_0, w_1, \sigma^2)$ for arbitrary $\sigma^2$ iff $(w_0, w_1)$ minimizes $E(w_0, w_1)$, i.e.
the least squares error function.
Note that the maximizing/minimizing $(w_0, w_1)$ does not depend on $\sigma^2$.
The maximizing $\sigma^2$ satisfies $\sigma^2 = \frac{1}{p} \sum_{k=1}^{p} (d_k - w_0 - w_1 \cdot x_k)^2$.

MLP training – theory

Architecture – Multilayer Perceptron (MLP)
(figure: a network with Input, Hidden and Output layers; inputs x1, x2; outputs y1, y2)
Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers
layers numbered from 0; the input layer has number 0
E.g. three-layer network has two hidden layers and one output layer
Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer
Architecture of a MLP is typically described by numbers of neurons in individual layers (e.g. 2-4-3-2)

MLP – architecture
Notation:
Denote
X a set of input neurons
Y a set of output neurons
Z a set of all neurons (X, Y ⊆ Z)
individual neurons denoted by indices i, j etc.
$\xi_j$ is the inner potential of the neuron j after the computation stops
$y_j$ is the output of the neuron j after the computation stops (we define $y_0 = 1$ as the value of the formal unit input)
$w_{ji}$ is the weight of the connection from i to j (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron j)
$j_{\leftarrow}$ is a set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
$j_{\rightarrow}$ is a set of all i such that j is adjacent to i (i.e. there is an arc from j to i)

MLP – activity
Activity:
inner potential of neuron j:
$$\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$$
activation function $\sigma_j$ for neuron j (arbitrary differentiable) [ e.g.
logistic sigmoid $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$ ]
State of non-input neuron $j \in Z \setminus X$ after the computation stops:
$$y_j = \sigma_j(\xi_j)$$
($y_j$ depends on the configuration $w$ and the input $x$, so we sometimes write $y_j(w, x)$)
The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$.
Layer-wise computation: First, all input neurons are assigned values of the input. In the $\ell$-th step, all neurons of the $\ell$-th layer are evaluated.

MLP – learning
Learning:
Given a training set $T$ of the form
$$\{(x_k, d_k) \mid k = 1, \ldots, p\}$$
Here, every $x_k \in \mathbb{R}^{|X|}$ is an input vector and every $d_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron j for a given network input $x_k$ (the vector $d_k$ can be written as $(d_{kj})_{j \in Y}$).
Error function:
$$E(w) = \sum_{k=1}^{p} E_k(w) \quad \text{where} \quad E_k(w) = \frac{1}{2} \sum_{j \in Y} \bigl(y_j(w, x_k) - d_{kj}\bigr)^2$$

MLP – learning algorithm
Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2 . .
.), weights $w^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(w^{(t)})$$
is the weight update of $w_{ji}$ in step t + 1 and $0 < \varepsilon(t) \le 1$ is the learning rate in step t + 1.
Note that $\frac{\partial E}{\partial w_{ji}}(w^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $w^{(t+1)} = w^{(t)} - \varepsilon(t) \cdot \nabla E(w^{(t)})$.

MLP – error function gradient
For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \ldots, p$ holds
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$ we get
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here all $y_j$ are in fact $y_j(w, x_k)$).
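
The gradient formulas on this slide follow from the chain rule. A short derivation sketch, using only the notation introduced above ($\sigma'_j$ denotes the derivative of $\sigma_j$, and $E_k$ is the squared error of the k-th example):

```latex
% E_k depends on w_{ji} only through \xi_j = \sum_{i' \in j_\leftarrow} w_{ji'} y_{i'}
% and y_j = \sigma_j(\xi_j):
\frac{\partial E_k}{\partial w_{ji}}
  = \frac{\partial E_k}{\partial y_j} \cdot \frac{\partial y_j}{\partial \xi_j}
    \cdot \frac{\partial \xi_j}{\partial w_{ji}}
  = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i

% Output neuron j \in Y: differentiate
% E_k = \frac{1}{2} \sum_{j' \in Y} (y_{j'} - d_{kj'})^2 directly:
\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}

% Hidden neuron j: y_j influences E_k only through the potentials \xi_r
% of the neurons r \in j_\rightarrow:
\frac{\partial E_k}{\partial y_j}
  = \sum_{r \in j_\rightarrow} \frac{\partial E_k}{\partial y_r}
    \cdot \frac{\partial y_r}{\partial \xi_r} \cdot \frac{\partial \xi_r}{\partial y_j}
  = \sum_{r \in j_\rightarrow} \frac{\partial E_k}{\partial y_r}
    \cdot \sigma'_r(\xi_r) \cdot w_{rj}
```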
MLP – error function gradient
If $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$ for all $j \in Z$, then
$$\sigma'_j(\xi_j) = \lambda_j y_j (1 - y_j)$$
and thus for all $j \in Z \setminus X$:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then
$$\sigma'_j(\xi_j) = \frac{b}{a} (a - y_j)(a + y_j)$$

MLP – computing the gradient
Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:
Initialize $E_{ji} := 0$
(By the end of the computation: $E_{ji} = \frac{\partial E}{\partial w_{ji}}$)
For every $k = 1, \ldots, p$ do:
1. forward pass: compute $y_j = y_j(w, x_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see the next slide!)
3.
compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$
4. $E_{ji} := E_{ji} + \frac{\partial E_k}{\partial w_{ji}}$
The resulting $E_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.

MLP – backpropagation
Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
if $j \in Z \setminus (Y \cup X)$, then assuming that j is in the $\ell$-th layer and assuming that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the $(\ell + 1)$-st layer, compute
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj}$$
(This works because all neurons $r \in j_{\rightarrow}$ belong to the $(\ell + 1)$-st layer.)

Complexity of the batch algorithm
Computation of $\frac{\partial E}{\partial w_{ji}}(w^{(t-1)})$ stops in time linear in the size of the network for each training example, i.e. in time proportional to the size of the network times the number p of training examples.
(assuming unit cost of operations including computation of $\sigma'_r(\xi_r)$ for given $\xi_r$)
Proof sketch: The algorithm does the following p times:
1. forward pass, i.e.
computes $y_j(w, x_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $E_{ji}$ (a constant time operation in the unit cost framework)
Each of the steps 1. - 3. takes time linear in the size of the network.
Note that the speed of convergence of the gradient descent cannot be estimated ...
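
The forward pass, backpropagation and accumulation steps described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the course's reference implementation: the 2-3-1 layer sizes, the XOR data, the steepness $\lambda_j = 1$ and all function names are hypothetical, and the bias is stored directly as the weight $w_{j0}$ of the formal unit input. The final helper compares the result with a finite-difference estimate, which is a common sanity check for hand-written backpropagation.

```python
import math
import random


def sigmoid(xi):
    """Logistic sigmoid with steepness lambda = 1 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-xi))


def forward(weights, x):
    """Forward pass. Returns the outputs of all layers; ys[0] is the input."""
    ys = [list(x)]
    for layer in weights:
        prev = ys[-1]
        # row[0] is w_j0, the weight of the formal unit input y_0 = 1
        ys.append([sigmoid(row[0] + sum(w * y for w, y in zip(row[1:], prev)))
                   for row in layer])
    return ys


def example_gradient(weights, x, d):
    """Backpropagation: dE_k/dw_ji for one example, same shape as weights."""
    ys = forward(weights, x)
    # output layer: dE_k/dy_j = y_j - d_kj
    dE_dy = [y - t for y, t in zip(ys[-1], d)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        out, prev = ys[l + 1], ys[l]
        # delta_j = dE_k/dy_j * sigma'_j(xi_j), where sigma' = y (1 - y)
        delta = [g * y * (1.0 - y) for g, y in zip(dE_dy, out)]
        # dE_k/dw_ji = delta_j * y_i (and delta_j * 1 for the unit input)
        grads[l] = [[dj] + [dj * yi for yi in prev] for dj in delta]
        # layer below: dE_k/dy_i = sum over successors r of delta_r * w_ri
        dE_dy = [sum(delta[j] * weights[l][j][i + 1] for j in range(len(delta)))
                 for i in range(len(prev))]
    return grads


def batch_gradient(weights, data):
    """Accumulates E_ji := E_ji + dE_k/dw_ji over all training examples."""
    total = [[[0.0] * len(row) for row in layer] for layer in weights]
    for x, d in data:
        g = example_gradient(weights, x, d)
        for l, layer in enumerate(g):
            for j, row in enumerate(layer):
                for i, value in enumerate(row):
                    total[l][j][i] += value
    return total


def error(weights, data):
    """Squared error E(w) = sum_k 1/2 sum_j (y_j - d_kj)^2."""
    return sum(0.5 * sum((y - t) ** 2
                         for y, t in zip(forward(weights, x)[-1], d))
               for x, d in data)


def numeric_derivative(weights, data, l, j, i, h=1e-6):
    """Central finite difference of E with respect to a single weight."""
    original = weights[l][j][i]
    weights[l][j][i] = original + h
    e_plus = error(weights, data)
    weights[l][j][i] = original - h
    e_minus = error(weights, data)
    weights[l][j][i] = original
    return (e_plus - e_minus) / (2.0 * h)


rng = random.Random(0)
sizes = [2, 3, 1]  # hypothetical 2-3-1 architecture
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[l] + 1)]
            for _ in range(sizes[l + 1])] for l in range(len(sizes) - 1)]
xor_data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
gradient = batch_gradient(weights, xor_data)
```

One call to `batch_gradient` performs p forward and backward passes, each touching every weight a constant number of times, which matches the complexity argument above.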
Illustration of the gradient descent – XOR
(figure) Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork

MLP – learning algorithm
Online algorithm:
The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), weights $w^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})$$
is the weight update of $w_{ji}$ in the step t + 1 and $0 < \varepsilon(t) \le 1$ is the learning rate in the step t + 1.
There are other variants determined by selection of the training examples used for the error computation (more on this later).

SGD
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), weights $w^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
Compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where
$$\Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)})$$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t + 1
$\nabla E_k(w^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.

MLP training – practical issues

Architecture – Multilayer Perceptron (MLP)
(figure: a network with Input, Hidden and Output layers; inputs x1, x2; outputs y1, y2)
Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers
layers numbered from 0; the input layer has number 0
E.g. three-layer network has two hidden layers and one output layer
Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer
Architecture of a MLP is typically described by numbers of neurons in individual layers (e.g. 2-4-3-2)

MLP – architecture
Notation:
Denote
X a set of input neurons
Y a set of output neurons
Z a set of all neurons (X, Y ⊆ Z)
individual neurons denoted by indices i, j etc.
$\xi_j$ is the inner potential of the neuron j after the computation stops
$y_j$ is the output of the neuron j after the computation stops (we define $y_0 = 1$ as the value of the formal unit input)
$w_{ji}$ is the weight of the connection from i to j (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron j)
$j_{\leftarrow}$ is a set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
$j_{\rightarrow}$ is a set of all i such that j is adjacent to i (i.e. there is an arc from j to i)

MLP – learning
Learning:
Given a training set $T$ of the form
$$\{(x_k, d_k) \mid k = 1, \ldots, p\}$$
Here, every $x_k \in \mathbb{R}^{|X|}$ is an input vector and every $d_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron j for a given network input $x_k$ (the vector $d_k$ can be written as $(d_{kj})_{j \in Y}$).
Error function: (for example)
$$E(w) = \sum_{k=1}^{p} E_k(w) \quad \text{where} \quad E_k(w) = \frac{1}{2} \sum_{j \in Y} \bigl(y_j(w, x_k) - d_{kj}\bigr)^2$$

SGD
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), weights $w^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
Compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where
$$\Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)})$$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t + 1
$\nabla E_k(w^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.

MLP – mse gradient
For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \frac{1}{p} \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \ldots, p$ holds
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$ we get (for squared error)
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here all $y_j$ are in fact $y_j(w, x_k)$).
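
The SGD step and the "shuffle, then take minibatches sequentially" rule from the SGD slide above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy linear model and its data are hypothetical, indices are 0-based (the slides use 1, ..., p), and the learning rate is kept constant.

```python
import random


def minibatches(p, batch_size, rng):
    """One epoch: shuffle the indices 0..p-1, then cut them into consecutive minibatches."""
    indices = list(range(p))
    rng.shuffle(indices)
    for start in range(0, p, batch_size):
        yield indices[start:start + batch_size]


def sgd_step(w, example_gradient, batch, eps):
    """w(t+1) = w(t) - eps * sum over k in the minibatch of grad E_k(w(t))."""
    total = [0.0] * len(w)
    for k in batch:
        total = [a + g for a, g in zip(total, example_gradient(w, k))]
    return [wi - eps * gi for wi, gi in zip(w, total)]


# Hypothetical toy task: fit y = w0 + w1 * x to noiseless data d = 1 + 2x.
xs = [k / 7.0 for k in range(8)]
ds = [1.0 + 2.0 * x for x in xs]


def linear_gradient(w, k):
    """Gradient of E_k(w) = 1/2 (w0 + w1 x_k - d_k)^2 with respect to (w0, w1)."""
    residual = w[0] + w[1] * xs[k] - ds[k]
    return [residual, residual * xs[k]]


rng = random.Random(0)
w = [0.0, 0.0]
for epoch in range(2000):
    for batch in minibatches(len(xs), 4, rng):
        w = sgd_step(w, linear_gradient, batch, eps=0.1)
```

Because the indices are shuffled once per epoch, every training example is used exactly once per epoch, which plain random sampling of each minibatch would not guarantee.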
(Some) error functions
squared error:
$$E(w) = \sum_{k=1}^{p} E_k(w) \quad \text{where} \quad E_k(w) = \frac{1}{2} \sum_{j \in Y} \bigl(y_j(w, x_k) - d_{kj}\bigr)^2$$
mean squared error (mse):
$$E(w) = \frac{1}{p} \sum_{k=1}^{p} E_k(w)$$
(categorical) cross entropy:
$$E(w) = -\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} d_{kj} \ln(y_j)$$

Practical issues of gradient descent
Training efficiency:
What size of a minibatch?
How to choose the learning rate $\varepsilon(t)$ and control SGD?
How to pre-process the inputs?
How to initialize weights?
How to choose desired output values of the network?
Quality of the resulting model:
When to stop training?
Regularization techniques.
How large should the network be?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.

Issues in gradient descent
Lots of local minima where the descent gets stuck:
The model identifiability problem: Swapping incoming weights of neurons i and j leaves the same network topology – weight space symmetry
Recent studies show that for sufficiently large networks all local minima have low values of the error function.
Saddle points
One can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima.

Issues in gradient descent – too slow descent
flat regions
E.g. if the inner potentials are too large (in abs. value), then the derivative of the activation function is extremely small.
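
To make the "flat regions" remark concrete, here is a small sketch that evaluates the derivative of the logistic sigmoid, $\sigma'(\xi) = \lambda y (1 - y)$, at a few inner potentials. The function names and the default $\lambda = 1$ are assumptions made for illustration only.

```python
import math


def sigmoid(xi, lam=1.0):
    """Logistic sigmoid with steepness lam."""
    return 1.0 / (1.0 + math.exp(-lam * xi))


def sigmoid_derivative(xi, lam=1.0):
    """sigma'(xi) = lam * y * (1 - y), where y = sigma(xi)."""
    y = sigmoid(xi, lam)
    return lam * y * (1.0 - y)


# The derivative peaks at xi = 0 and decays quickly as |xi| grows.
for xi in (0.0, 2.0, 5.0, 10.0):
    print(xi, sigmoid_derivative(xi))
```

Since every weight update is multiplied by such a factor, a neuron whose inner potential is far from 0 contributes almost nothing to the gradient, and the descent slows down accordingly.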
Issues in gradient descent – too fast descent
steep cliffs: the gradient is extremely large, descent skips important weight vectors

Issues in gradient descent – local vs global structure
(figure) What if we initialize on the left?

Issues in computing the gradient
vanishing and exploding gradients
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
inexact gradient computation:
Minibatch gradient is only an estimate of the true gradient.
Note that the standard error of the estimate is (roughly) $\sigma / \sqrt{m}$ where $m$ is the size of the minibatch and $\sigma$ is the standard deviation of the gradient estimate for a single training example.
(E.g. minibatch size 10 000 means 100 times more computation than the size 100 but gives only a 10 times smaller standard error.)

Minibatch size
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multicore architectures are usually underutilized by extremely small batches.
If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size.
For many hardware setups this is the limiting factor in batch size.
It is common (especially when using GPUs) for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process.
It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". Keskar et al, ICLR’17)

Momentum
Issue in the gradient descent: $\nabla E(w^{(t)})$ constantly changes direction (but the error steadily decreases).
Solution: In every step add the change made in the previous step (weighted by a factor $\alpha$):
$$\Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)}) + \alpha \cdot \Delta w^{(t-1)}$$
where $0 < \alpha < 1$.

Momentum – illustration
(figure)

SGD with momentum
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), weights $w^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples T ⊆ {1, . . .
, p}
Compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where
$$\Delta w^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(w^{(t)}) + \alpha \Delta w^{(t-1)}$$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t + 1
$0 < \alpha < 1$ measures the "influence" of the momentum
$\nabla E_k(w^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.

Learning rate
Generic rules for adaptation of $\varepsilon(t)$
Start with a larger learning rate (e.g. $\varepsilon = 0.1$). Later decrease as the descent is supposed to settle in a minimum of E.
Some tools allow setting a list of learning rates, one rate for each epoch of the descent.
In case you can observe how the error evolves:
If the error decreases, slightly increase the rate.
If the error increases, decrease the rate.
Note that the error may increase for a short period without any harm to convergence of the learning process.

AdaGrad
So far we have considered a uniform learning rate.
It is better to have
larger rates for weights with smaller updates,
smaller rates for weights with larger updates.
AdaGrad uses an individually adapted learning rate for each weight.

SGD with AdaGrad
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), compute $w^{(t+1)}$:
Choose (randomly) a minibatch T ⊆ {1, . . .
, p}
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$ where
$$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)} + \delta}} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})$$
and
$$r_{ji}^{(t)} = r_{ji}^{(t-1)} + \left(\sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})\right)^2$$
$\eta$ is a constant expressing the influence of the learning rate, typically 0.01.
$\delta > 0$ is a smoothing term (typically 1e-8) avoiding division by 0.

RMSProp
The main disadvantage of AdaGrad is the accumulation of the gradient throughout the whole learning process.
In case the learning needs to get over several "hills" before settling in a deep "valley", the weight updates get far too small before getting to it.
RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

SGD with RMSProp
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), compute $w^{(t+1)}$:
Choose (randomly) a minibatch $T \subseteq \{1, \ldots, p\}$
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$ where
$$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)} + \delta}} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})$$
and
$$r_{ji}^{(t)} = \rho\, r_{ji}^{(t-1)} + (1 - \rho) \left(\sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(w^{(t)})\right)^2$$
$\eta$ is a constant expressing the influence of the learning rate (Hinton suggests $\rho = 0.9$ and $\eta = 0.001$).
$\delta > 0$ is a smoothing term (typically 1e-8) avoiding division by 0.

Other optimization methods
There are more methods such as AdaDelta, Adam (roughly RMSProp combined with momentum), etc.
A natural question: Which algorithm should one choose?
Unfortunately, there is currently no consensus on this point.
According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, but no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam.
The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm.

Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability (to do gradient descent)
2. non-linearity (linear multi-layer networks are equivalent to single-layer)
3. monotonicity (local extrema of activation functions induce local extrema of the error function)
4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input preprocessing and the initial choice of weights. I will illustrate the reasoning on sigmoidal functions and say a few words about other activation functions later.

Activation functions – tanh
$\sigma(\xi) = 1.7159 \cdot \tanh(\frac{2}{3} \cdot \xi)$, we have $\lim_{\xi \to \infty} \sigma(\xi) = 1.7159$ and $\lim_{\xi \to -\infty} \sigma(\xi) = -1.7159$
$\sigma$ is almost linear on $[-1, 1]$
(figures: graph of $\sigma$ and graph of its first derivative)

Input preprocessing
Some inputs may be much larger than others.
E.g.: Height vs weight of a person, maximum speed of a car (in km/h) vs its price (in CZK), etc.
Large inputs have greater influence on the training than the small ones. In addition, too large inputs may slow down learning (saturation of activation functions).
Typical standardization:
average = 0 (subtract the mean)
variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated from data (the training set).
(figure: illustration of standard deviation)

Initial weights (for tanh)
Typically, the weights are chosen randomly from an interval $[-w, w]$ where $w$ depends on the number of inputs of a given neuron.
Consider the activation function $\sigma(\xi) = 1.7159 \cdot \tanh(\frac{2}{3} \cdot \xi)$ for all neurons.
$\sigma$ is almost linear on $[-1, 1]$,
$\sigma$ saturates out of the interval $[-4, 4]$ (i.e. it is close to its limit values and its derivative is close to 0).
Thus
for too small w we may get an (almost) linear model.
for too large w (i.e.
much larger than 1) the activations may get saturated and the learning will be very slow.
Hence, we want to choose w so that the inner potentials of neurons will be roughly in the interval [−1, 1].

Initial weights (for tanh)
Standardization gives mean = 0 and variance = 1 of the input data.
Consider a neuron j from the first layer with d inputs. Assume that its weights are chosen uniformly from [−w, w].
The rule: choose w so that the standard deviation of ξ_j (denoted by o_j) is close to the border of the interval on which σ_j is linear. In our case: o_j ≈ 1.
Our assumptions imply: $o_j = \sqrt{\tfrac{d}{3}} \cdot w$. Thus we put $w = \frac{\sqrt{3}}{\sqrt{d}}$.
The same works for higher layers, where d corresponds to the number of neurons in the layer one level lower.

Glorot & Bengio initialization
The previous heuristic for weight initialization ignores the variance of the gradient (i.e.
it is concerned only with the "size" of activations in the forward pass).
Glorot & Bengio (2010) presented a normalized initialization by choosing w uniformly from the interval
$\left[ -\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}} \right]$
Here n is the number of inputs to the layer, m is the number of outputs of the layer (i.e. the number of neurons in the layer).
This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance.
The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no non-linearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its non-linear counterparts.

Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with piecewise linear activation functions.
Most prominent is ReLU: σ(ξ) = max{0, ξ}
THE default activation function recommended for use with most feedforward neural networks.
As close to a linear function as possible; very simple; does not saturate for large potentials.

Output neurons
The choice of activation functions for output units depends on the concrete application.
For regression (function approximation) the output is typically linear (or sigmoidal).
For classification, the current activation functions of choice are
logistic sigmoid or tanh for binary classification,
softmax for multi-class classification: given an output neuron j ∈ Y,
$y_j = \sigma_j(\xi_j) = \frac{e^{\xi_j}}{\sum_{i \in Y} e^{\xi_i}}$
For some reasons the error function used with softmax (assuming that the target values d_kj are from {0, 1}) is typically cross-entropy:
$-\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} d_{kj} \ln(y_j)$
... which somewhat corresponds to the maximum likelihood principle.

Sigmoidal outputs with cross-entropy – in detail
Consider
binary classification, two classes {0, 1},
one output neuron j, its activation is the logistic sigmoid $\sigma_j(\xi_j) = \frac{1}{1 + e^{-\xi_j}}$.
The output of the network is y = σ_j(ξ_j).
For a training set $T = \{ (x_k, d_k) \mid k = 1, \ldots, p \}$ (here $x_k \in \mathbb{R}^{|X|}$ and $d_k \in \mathbb{R}$), the cross-entropy looks like this:
$E_{cross} = -\frac{1}{p} \sum_{k=1}^{p} \left[ d_k \ln(y_k) + (1 - d_k) \ln(1 - y_k) \right]$
where y_k is the output of the network for the k-th training input x_k, and d_k is the k-th desired output.

Generalization
Intuition: Generalization = ability to cope with new unseen instances.
Data are mostly noisy, so it is not a good idea to fit exactly.
In case of function approximation, the network should not return exact results as in the training set.
More formally: It is typically assumed that the training set has been generated as follows:
$d_{kj} = g_j(x_k) + \Theta_{kj}$
where g_j is the "underlying" function corresponding to the output neuron j ∈ Y and Θ_kj is random noise.
The network should fit g_j, not the noise.
Methods improving generalization are called regularization methods.

Regularization
Regularization is a big issue in neural networks, as they typically use a huge number of parameters and thus are very susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
... and I ask you prof. Neumann: What can you fit with 40GB of parameters??

Early stopping
Early stopping means that we stop learning before it reaches a minimum of the error E.
When to stop?
In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit, not to minimize (strange) error functions designed to be easily differentiable.
Also, as noted before, minimizing E completely is not good for generalization.
For a start: we may employ the standard approach of training on one set and stopping on another one.

Early stopping
Divide your dataset into several subsets:
training set (e.g. 60%) – train the network here
validation set (e.g.
20%) – use to stop the training
(possibly) test set (e.g. 20%) – use to compare trained models
What to use as a stopping rule?
You may observe E (or any other function of interest) on the validation set; if it does not improve for the last k steps, stop.
Alternatively, you may observe the gradient; if it is small for some time, stop. (Recent studies have shown that this traditional rule is not too good: it may happen that the gradient is larger close to minimum values; on the other hand, E does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as cross-validation etc.

Size of the network
Similar problem as in the case of the training duration:
Too small a network is not able to capture intrinsic properties of the training set.
Large networks overfit faster – bad generalization.
Solution: Optimal number of neurons :-)
there are some (useless) theoretical bounds
there are algorithms dynamically adding/removing neurons (not much use nowadays)
In practice:
start using a rule of thumb: the number of neurons ≈ ten times less than the number of training instances.
experiment, experiment, experiment.

Feature extraction
Consider a two layer network. Hidden neurons are supposed to represent "patterns" in the inputs.
Example: Network 64-2-3 for letter classification:
(figure: network 64-2-3 for letter classification)

Ensemble methods
Techniques for reducing generalization error by combining several models.
The reason that ensemble methods work is that different models will usually not make all the same errors on the test set.
Idea: Train several different models separately, then have all of the models vote on the output for test examples.
Bagging:
Generate k training sets T_1, ..., T_k by sampling from T uniformly with replacement. If each T_i is drawn with |T| samples, then on average T_i contains about (1 − 1/e)|T| ≈ 0.63|T| distinct examples.
For each i, train a model M_i on T_i.
Combine outputs of the models: for regression by averaging, for classification by (majority) voting.

Dropout
The algorithm: In every step of the gradient descent
choose randomly a set N of neurons, each neuron is included in N independently with probability 1/2 (in practice, different probabilities are used as well),
do forward and backward propagations only using the selected neurons (i.e. leave weights of the other neurons unchanged).
Dropout resembles bagging: A large ensemble of neural networks is trained "at once" on parts of the data.
Dropout is not exactly the same as bagging: The models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective training set. This would be infeasible for large networks/training sets.

Weight decay and L2 regularization
Generalization can be improved by removing "unimportant" weights.
Penalising large weights gives stronger indication about their importance.
In every step we decrease weights (multiplicatively) as follows:
$w_{ji}^{(t+1)} = (1 - \zeta)\,(w_{ji}^{(t)} + \Delta w_{ji}^{(t)})$
Intuition: Unimportant weights will be pushed to 0, important weights will survive the decay.
Weight decay is equivalent to the gradient descent with a constant learning rate ε and the following error function:
$E'(w) = E(w) + \frac{2\zeta}{\varepsilon}\,(w \cdot w)$
Here $\frac{2\zeta}{\varepsilon}\,(w \cdot w)$ is the L2 regularization that penalizes large weights.

More optimization, regularization ...
There are many more practical tips, optimization methods, regularization methods, etc.
For a very nice survey see http://www.deeplearningbook.org/
... and also all other infinitely many urls concerned with deep learning.

Some applications

ALVINN (history)

ALVINN
Architecture:
MLP, 960 − 4 − 30 (also 960 − 5 − 30)
inputs correspond to pixels
Activity:
activation functions: logistic sigmoid
Steering wheel position determined by "center of mass" of neuron values.

ALVINN
Learning: Trained during (live) drive.
Front window view captured by a camera, 25 images per second.
Training samples of the form (x_k, d_k) where
x_k = image of the road
d_k = corresponding position of the steering wheel
The position of the steering wheel is "blurred" by a Gaussian distribution:
$d_{ki} = e^{-D_i^2/10}$
where D_i is the distance of the i-th output from the one which corresponds to the correct position of the wheel.
(The authors claim that this was better than the binary output.)

ALVINN – Selection of training samples
Naive approach: take images directly from the camera and adapt accordingly.
Problems:
If the driver is gentle enough, the car never learns how to get out of dangerous situations. A solution may be to
turn off learning for a moment, then suddenly switch it on, and let the net catch on,
let the driver drive as if being insane (dangerous, possibly expensive).
The real view out of the front window is repetitive and boring, the net would overfit on a few examples.

ALVINN – Selection of training examples
The problem with a "good" driver is solved as follows:
15 distorted copies of each image:
(figure: distorted copies of a road image)
desired output generated for each copy
"Boring" images solved as follows:
a buffer of 200 images (including 15 copies of the original), in every step the system trains on the buffer
after several updates a new image is captured, 15 copies are made and they will substitute 15 images in the buffer (5 chosen randomly, 10 with the smallest error).

ALVINN - learning
pure backpropagation
constant learning rate
momentum, slowly increasing.
We used a learning rate of 0.015, a momentum term of 0.9, and we ramped up the learning rate and momentum using a rate term of 0.05. This means that the learning rate and momentum increase linearly over 20 epochs until they reach their maximum value (0.015 and 0.9, respectively). We also used a weight decay term of 0.0001.
Results:
Trained for 5 minutes, speed 4 miles per hour.
ALVINN was able to drive well on a new road it has never seen (in different weather conditions).
The maximum speed was limited by the hydraulic controller of the steering wheel, not the learning algorithm.

ALVINN - weight development
(figure: weights of the hidden neurons h1, ..., h5 after rounds 0, 10, 20 and 50)
Here h1, ..., h5 are hidden neurons.

MNIST – handwritten digits recognition
Database of labelled images of handwritten digits: 60 000 training examples, 10 000 testing.
Dimensions: 28 x 28, digits are centered to the "center of gravity" of pixel values and normalized to fixed size.
More at http://yann.lecun.com/exdb/mnist/
The database is used as a standard benchmark in lots of publications.
Allows comparison of various methods.

MNIST
One of the best "old" results is the following:
6-layer NN 784-2500-2000-1500-1000-500-10 (on GPU) (Ciresan et al. 2010)
Abstract: Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
A famous application of the first convolutional network LeNet-1 in 1998.

MNIST – LeNet1
(figure: architecture of LeNet-1)
Interpretation of output:
the output neuron with the highest value identifies the digit.
the same, but if the two largest neuron values are too close together, the input is rejected (i.e. no answer).
Learning:
Inputs: training on 7291 samples, tested on 2007 samples
Results:
error on test set without rejection: 5%
error on test set with rejection: 1% (12% rejected)
compare with dense MLP with 40 hidden neurons: error 1% (19.4% rejected)

Modern convolutional networks
The rest of the lecture is based on the online book Neural Networks and Deep Learning by Michael Nielsen.
http://neuralnetworksanddeeplearning.com/index.html
Convolutional networks are currently the best networks for image classification.
Their common ancestor is LeNet-5 (and other LeNets) from the nineties.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998

AlexNet
In 2012 this network made a breakthrough in the ILSVRC competition, taking the classification error from around 28% to 16%:
(figure: AlexNet architecture)
A convolutional network, trained on two GPUs.
Convolutional networks - local receptive fields
Every neuron is connected with a field of k × k (in this case 5 × 5) neurons in the lower layer (this field is the receptive field).
The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function.

Convolutional networks - stride length
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:
(figure: receptive field shifted by one pixel)
The "size" of the slide is called the stride length.
The group of all such neurons is a feature map. All these neurons share weights and biases!

Feature maps
Each feature map represents a property of the input that is supposed to be spatially invariant.
Typically, we consider several feature maps in a single layer.

Trained feature maps
(figure: 20 trained feature maps, receptive fields 5 × 5)

Pooling
Neurons in the pooling layer compute functions of their receptive fields:
Max-pooling: maximum of inputs
L2-pooling: square root of the sum of squares
Average-pooling: mean
· · ·

Simple convolutional network
28 × 28 input image, 3 feature maps, each feature map has its own max-pooling (field 5 × 5, stride = 1), 10 output neurons.
Each neuron in the output layer gets input from each neuron in the pooling layer.
Trained using backprop, which can be easily adapted to convolutional networks.

Convolutional network
(figure: convolutional network)

Simple convolutional network vs MNIST
two convolutional-pooling layers, the first with 20, the second with 40 feature maps, two dense (MLP) layers (1000-1000), outputs (10)
Activation functions of the feature maps and dense layers: ReLU
max-pooling
output layer: soft-max
Error function: negative log-likelihood (= cross-entropy)
Training: SGD, mini-batch size 10
learning rate 0.03
L2 regularization with "weight" λ = 0.1 + dropout with prob. 1/2
training for 40 epochs (i.e. every training example is considered 40 times)
Expanded dataset: displacement by one pixel in an arbitrary direction.
Committee voting of 5 networks.
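The receptive field, stride, weight sharing and pooling described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the ReLU activation inside the feature map, the all-ones test image and kernel, and the 2 × 2 pooling field are assumptions chosen for the example, not details taken from the slides.

```python
def feature_map_size(input_size, field_size, stride=1):
    """Number of positions a receptive field can take along one axis
    of the input (no padding)."""
    return (input_size - field_size) // stride + 1


def feature_map(image, kernel, bias=0.0, stride=1):
    """Compute one feature map. Every output neuron applies the SAME
    kernel and bias (shared weights) to its own receptive field.
    ReLU is assumed as the activation function."""
    k = len(kernel)
    rows = feature_map_size(len(image), k, stride)
    cols = feature_map_size(len(image[0]), k, stride)
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            xi = bias  # inner potential of the neuron at position (r, c)
            for a in range(k):
                for b in range(k):
                    xi += kernel[a][b] * image[r * stride + a][c * stride + b]
            row.append(max(0.0, xi))
        out.append(row)
    return out


def max_pool(fmap, size):
    """Max-pooling over disjoint size x size fields."""
    return [
        [
            max(fmap[r + a][c + b] for a in range(size) for b in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical input: a 28 x 28 image of ones and a 5 x 5 kernel of ones.
image = [[1.0] * 28 for _ in range(28)]
kernel = [[1.0] * 5 for _ in range(5)]

fmap = feature_map(image, kernel)   # 24 x 24 neurons, since (28 - 5) / 1 + 1 = 24
pooled = max_pool(fmap, 2)          # 12 x 12 neurons with an assumed 2 x 2 pooling field
print(len(fmap), len(fmap[0]), len(pooled), len(pooled[0]))
```

With stride 1 a 5 × 5 field fits into 28 positions in 24 ways per axis, which is why the feature map is smaller than the input; the whole map is described by only 25 weights and one bias.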
MNIST
Out of 10 000 images in the test set, only these 33 have been incorrectly classified:
(figure: the 33 misclassified digits)

More complex convolutional networks
Convolutional networks have been used for classification of images from the ImageNet database (16 million color images, 20 thousand classes)

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
Competition in classification over a subset of images from ImageNet.
Started in 2010, assisted in the breakthrough in image recognition.
Training set 1.2 million images, 1000 classes. Validation set: 50 000, test set: 150 000.
Many images contain more than one object ⇒ the model is allowed to choose five classes, the correct label must be among the five (top-5 criterion).

AlexNet
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).
Trained on two GPUs (NVIDIA GeForce GTX 580)
Results:
accuracy 84.7% in top-5 (second best algorithm at the time 73.8%)
63.3% "perfect" (top-1) classification

ILSVRC 2014
The same set as in 2012, top-5 criterion.
GoogLeNet: deep convolutional network, 22 layers
Results:
Accuracy 93.33% top-5

ILSVRC 2015
Deep convolutional network
Various numbers of layers, the winner has 152 layers
Skip connections implementing residual learning
Error 3.57% in top-5.

Superhuman convolutional nets?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100.
It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better.

Convolutional networks – theory

Convolutional network
(figure: convolutional network)

Convolutional layers
Every neuron is connected with a (typically small) receptive field of neurons in the lower layer.
The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function.
Neurons are grouped into feature maps sharing weights.
Each feature map represents a property of the input that is supposed to be spatially invariant.
Typically, we consider several feature maps in a single layer.

Pooling layers
Neurons in the pooling layer compute simple functions of their receptive fields (the fields are typically disjoint):
Max-pooling: maximum of inputs
L2-pooling: square root of the sum of squares
Average-pooling: mean
· · ·

Convolutional networks – architecture
Neurons organized in layers, L0, L1, . . .
, Ln, connections (typically) only from Lm to Lm+1.
Several types of layers:
input layer L0
dense layer Lm: Each neuron of Lm is connected with each neuron of Lm−1.
convolutional & pooling layer Lm: Contains two sub-layers:
convolutional layer: Neurons organized into disjoint feature maps, all neurons of a given feature map share weights (but have different inputs)
pooling layer: Each (convolutional) feature map F has a corresponding pooling map P. Neurons of P have inputs only from F (typically few of them), compute a simple aggregate function (such as max), and have disjoint inputs.

Convolutional networks – architecture
Denote
X a set of input neurons
Y a set of output neurons
Z a set of all neurons (X, Y ⊆ Z)
individual neurons are denoted by indices i, j etc.
ξ_j is the inner potential of the neuron j after the computation stops
y_j is the output of the neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input)
w_ji is the weight of the connection from i to j (in particular, w_j0 is the weight of the connection from the formal unit input, i.e. w_j0 = −b_j where b_j is the bias of the neuron j)
j← is a set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
j→ is a set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
[ji] is a set of all connections (i.e. pairs of neurons) sharing the weight w_ji.
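To make the notation j←, j→ and [ji] concrete, the following sketch builds these sets for a tiny one-dimensional feature map. The neuron numbering, the function names and the choice of 4 inputs with a receptive field of width 2 are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def build_1d_feature_map(n_inputs, field):
    """Build one 1D feature map with stride 1.

    Input neurons are numbered 0 .. n_inputs-1, feature-map neurons follow.
    Returns (arcs, share):
      arcs  - list of connections (j, i), i.e. an arc from i to j
      share - dict: position in the receptive field -> the set [ji] of all
              connections sharing the weight at that position
    """
    n_out = n_inputs - field + 1
    arcs = []
    share = defaultdict(set)
    for pos in range(n_out):
        j = n_inputs + pos
        for offset in range(field):
            i = pos + offset
            arcs.append((j, i))
            share[offset].add((j, i))
    return arcs, dict(share)


def incoming(arcs, j):
    """The set written as j with a left arrow: all i with an arc from i to j."""
    return {i for (jj, i) in arcs if jj == j}


def outgoing(arcs, j):
    """The set written as j with a right arrow: all i with an arc from j to i."""
    return {i for (i, jj) in arcs if jj == j}


arcs, share = build_1d_feature_map(n_inputs=4, field=2)
print(incoming(arcs, 4))   # inputs of the first feature-map neuron
print(outgoing(arcs, 1))   # neurons fed by input neuron 1
print(share[0])            # all connections sharing the first weight
```

Here the feature map has three neurons (4, 5, 6) and six connections, but only two distinct weights, one per position in the receptive field.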
Convolutional networks – activity
Neurons of dense and convolutional layers:
inner potential of neuron j: $\xi_j = \sum_{i \in j_\leftarrow} w_{ji} y_i$
activation function σ_j for neuron j (arbitrary differentiable): $y_j = \sigma_j(\xi_j)$
Neurons of pooling layers: Apply the "pooling" function:
max-pooling: $y_j = \max_{i \in j_\leftarrow} y_i$
avg-pooling: $y_j = \frac{\sum_{i \in j_\leftarrow} y_i}{|j_\leftarrow|}$
A convolutional network is evaluated layer-wise (as an MLP); for each j ∈ Y, y_j(w, x) is the value of the output neuron j after evaluating the network with weights w and input x.

Convolutional networks – learning
Learning: Given a training set T of the form $\{ (x_k, d_k) \mid k = 1, \ldots, p \}$.
Here, every $x_k \in \mathbb{R}^{|X|}$ is an input vector and every $d_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every j ∈ Y, denote by d_kj the desired output of the neuron j for a given network input x_k (the vector d_k can be written as $(d_{kj})_{j \in Y}$).
Error function – mean squared error (for example):
$E(w) = \frac{1}{p} \sum_{k=1}^{p} E_k(w)$ where $E_k(w) = \frac{1}{2} \sum_{j \in Y} \left( y_j(w, x_k) - d_{kj} \right)^2$

Convolutional networks – SGD
The algorithm computes a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \ldots$
weights in $w^{(0)}$ are randomly initialized to values close to 0
in the step t + 1 (here t = 0, 1, 2, ...), weights $w^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples T ⊆ {1, ..., p}
Compute $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$ where $\Delta w^{(t)} = -\varepsilon(t) \cdot \frac{1}{|T|} \sum_{k \in T} \nabla E_k(w^{(t)})$
Here T is a minibatch (of a fixed size),
0 < ε(t) ≤ 1 is a learning rate in step t + 1,
$\nabla E_k(w^{(t)})$ is the gradient of the error of the example k.
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. An epoch consists of one round through all data.

Backprop
Recall that $\nabla E_k(w^{(t)})$ is a vector of all partial derivatives of the form $\frac{\partial E_k}{\partial w_{ji}}$.
How to compute $\frac{\partial E_k}{\partial w_{ji}}$?
First, switch from derivatives w.r.t. w_ji to derivatives w.r.t. y_j:
Recall that for every w_ji where j is in a dense layer, i.e. does not share weights:
$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$
Now for every w_ji where j is in a convolutional layer:
$\frac{\partial E_k}{\partial w_{ji}} = \sum_{r\ell \in [ji]} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot y_\ell$
Neurons of pooling layers do not have weights.

Backprop
Now compute derivatives w.r.t. y_j:
for every j ∈ Y:
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
This holds for the squared error, for other error functions the derivative w.r.t. outputs will be different.
for every j ∈ Z \ Y such that j→ is either a dense layer, or a convolutional layer:
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_\rightarrow} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj}$
for every j ∈ Z \ Y such that j→ is max-pooling: Then j→ = {i} for a single "max" neuron and we have
$\frac{\partial E_k}{\partial y_j} = \begin{cases} \frac{\partial E_k}{\partial y_i} & \text{if } j = \arg\max_{r \in i_\leftarrow} y_r \\ 0 & \text{otherwise} \end{cases}$
I.e. the gradient can be propagated from the output layer downwards as in an MLP.

Convolutional networks – summary
Conv. nets.
are nowadays the most used networks in image processing (and also in other areas where input has some local, "spatially" invariant properties)
Typically trained using backpropagation.
Due to the weight sharing they allow (very) deep architectures.
Typically extended with more adjustments and tricks in their topologies.

Recurrent Neural Networks - LSTM

RNN
Input: x = (x_1, ..., x_M)
Hidden: h = (h_1, ..., h_H)
Output: y = (y_1, ..., y_N)

RNN example
Activation function: σ(ξ) = 1 if ξ ≥ 0, and σ(ξ) = 0 if ξ < 0
(figure: the network unrolled in time, with the following values)
y: y_1 = 1, y_2 = 0, y_3 = 1
h: h_0 = (0, 0), h_1 = (1, 1), h_2 = (1, 0), h_3 = (0, 1), · · ·
x: x_1 = (0, 0), x_2 = (1, 0), x_3 = (1, 1)

RNN – formally
M inputs: x = (x_1, ..., x_M)
H hidden neurons: h = (h_1, ..., h_H)
N output neurons: y = (y_1, ..., y_N)
Weights:
$U_{kk'}$ from input $x_{k'}$ to hidden $h_k$
$W_{kk'}$ from hidden $h_{k'}$ to hidden $h_k$
$V_{kk'}$ from hidden $h_{k'}$ to output $y_k$

RNN – formally
Input sequence: x = x_1, ..., x_T where $x_t = (x_{t1}, \ldots, x_{tM})$
Hidden sequence: h = h_0, h_1, ..., h_T where $h_t = (h_{t1}, \ldots, h_{tH})$
We have h_0 = (0, ..., 0) and
$h_{tk} = \sigma\left( \sum_{k'=1}^{M} U_{kk'} x_{tk'} + \sum_{k'=1}^{H} W_{kk'} h_{(t-1)k'} \right)$
Output sequence: y = y_1, ..., y_T where $y_t = (y_{t1}, \ldots, y_{tN})$ and
$y_{tk} = \sigma\left( \sum_{k'=1}^{H} V_{kk'} h_{tk'} \right)$

RNN – in matrix form
Input sequence: x = x1, . . . , xT
Hidden sequence: h = h0, h1, . . . , hT where h0 = (0, . . .
, 0) and ht = σ(Uxt + Wht−1) 200 RNN – in matrix form Input sequence: x = x1, . . . , xT Hidden sequence: h = h0, h1, . . . , hT where h0 = (0, . . . , 0) and ht = σ(Uxt + Wht−1) Output sequence: y = y1, . . . , yT where yt = σ(Vht ) 200 RNN – Comments ht is the memory of the network, captures what happened in all previous steps (with decaying quality). RNN shares weights U, V, W along the sequence. Note the similarity to convolutional networks where the weights were shared spatially over images, here they are shared temporally over sequences. RNN can deal with sequences of variable length. Compare with MLP which accepts only fixed-dimension vectors on input. 201 RNN – training Training set T = (x1, d1), . . . , (xp, yp) here each x = x 1, . . . , x T is an input sequence, each d = d 1, . . . , d T is an expected output sequence. Here each x t = (x t1, . . . , x tM) is an input vector and each d t = (d t1, . . . , d tN) is an expected output vector. 202 Error function In what follows I will consider a training set with a single element (x, d). I.e. drop the index and have x = x1, . . . , xT where xt = (xt1, . . . , xtM) d = d1, . . . , dT where dt = (dt1, . . . , dtN) The squared error of (x, d) is defined by E(x,d) = T t=1 N k=1 1 2 (ytk − dtk )2 Recall that we have a sequence of network outputs y = y1, . . . , yT and thus ytk is the k-th component of yt 203 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: 204 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: Initialize all weights randomly close to 0. 204 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: Initialize all weights randomly close to 0. In the step + 1 (here = 0, 1, 2, . . .) 
compute "new" weights $U^{(\ell+1)}, V^{(\ell+1)}, W^{(\ell+1)}$ from the "old" weights $U^{(\ell)}, V^{(\ell)}, W^{(\ell)}$ as follows:
$$U^{(\ell+1)}_{kk'} = U^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta U_{kk'}}$$
$$V^{(\ell+1)}_{kk'} = V^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta V_{kk'}}$$
$$W^{(\ell+1)}_{kk'} = W^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta W_{kk'}}$$
The above is THE learning algorithm that modifies weights!

Backpropagation

Computes the derivatives of E, no weights are modified!
$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta U_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{tk}} \cdot \sigma' \cdot x_{tk'} \qquad k' = 1, \ldots, M$$
$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta V_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta y_{tk}} \cdot \sigma' \cdot h_{tk'} \qquad k' = 1, \ldots, H$$
$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta W_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{tk}} \cdot \sigma' \cdot h_{(t-1)k'} \qquad k' = 1, \ldots, H$$
Backpropagation:
$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta y_{tk}} = y_{tk} - d_{tk} \quad \text{(assuming squared error)}$$
$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{tk}} = \sum_{k'=1}^{N} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$

Long-term dependencies

$$\frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{tk}} = \sum_{k'=1}^{N} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\delta E_{(\mathbf{x},\mathbf{d})}}{\delta h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$
Unless $\sum_{k'=1}^{H} \sigma' \cdot W_{k'k} \approx 1$, the gradient either vanishes, or explodes.
For a large T (long-term dependency), the gradient "deeper" in the past tends to be too small (large).
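The forward pass in matrix form and the backpropagation formulas above can be sketched in plain Python. This is only an illustrative sketch under assumptions not stated on the slides: σ is taken to be the logistic sigmoid (so σ' = y(1 − y)), and the helper names `forward`, `squared_error` and `gradients` are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def forward(U, W, V, xs):
    """h_t = sigma(U x_t + W h_{t-1}), y_t = sigma(V h_t), with h_0 = 0."""
    hs = [[0.0] * len(W)]
    ys = []
    for x in xs:
        xi = [p + q for p, q in zip(matvec(U, x), matvec(W, hs[-1]))]
        hs.append([sigmoid(z) for z in xi])
        ys.append([sigmoid(z) for z in matvec(V, hs[-1])])
    return hs, ys

def squared_error(ys, ds):
    return sum(0.5 * (y - d) ** 2 for yt, dt in zip(ys, ds) for y, d in zip(yt, dt))

def gradients(U, W, V, xs, ds):
    """Derivatives of the squared error w.r.t. U, W, V (no weights are modified)."""
    hs, ys = forward(U, W, V, xs)
    H, N, M = len(W), len(V), len(U[0])
    gU = [[0.0] * M for _ in range(H)]
    gW = [[0.0] * H for _ in range(H)]
    gV = [[0.0] * H for _ in range(N)]
    dh_next = [0.0] * H  # (dE/dh_{t+1}) * sigma' for step t+1; zero beyond T
    for t in range(len(xs), 0, -1):
        h, h_prev, y, d, x = hs[t], hs[t - 1], ys[t - 1], ds[t - 1], xs[t - 1]
        # (dE/dy_t) * sigma' at the output neurons
        dy = [(y[k] - d[k]) * y[k] * (1.0 - y[k]) for k in range(N)]
        for k in range(N):
            for j in range(H):
                gV[k][j] += dy[k] * h[j]
        # dE/dh_t: contribution of the outputs at time t and of the hidden state at t+1
        dE_dh = [sum(dy[k] * V[k][j] for k in range(N)) +
                 sum(dh_next[k] * W[k][j] for k in range(H)) for j in range(H)]
        dh = [dE_dh[j] * h[j] * (1.0 - h[j]) for j in range(H)]
        for j in range(H):
            for i in range(M):
                gU[j][i] += dh[j] * x[i]
            for i in range(H):
                gW[j][i] += dh[j] * h_prev[i]
        dh_next = dh
    return gU, gW, gV
```

One gradient descent step then subtracts ε times each returned entry from the corresponding weight. The repeated multiplication by σ' · W inside the backward loop is where the vanishing or exploding behaviour for large T comes from.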
A solution: LSTM

LSTM

$h_t = o_t \circ \sigma_h(C_t)$   output
$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$   memory
$\tilde{C}_t = \sigma_h(W_C \cdot h_{t-1} + U_C \cdot x_t)$   new memory contents
$o_t = \sigma_g(W_o \cdot h_{t-1} + U_o \cdot x_t)$   output gate
$f_t = \sigma_g(W_f \cdot h_{t-1} + U_f \cdot x_t)$   forget gate
$i_t = \sigma_g(W_i \cdot h_{t-1} + U_i \cdot x_t)$   input gate

$\circ$ is the component-wise product of vectors
$\cdot$ is the matrix-vector product
$\sigma_h$ is the hyperbolic tangent (applied component-wise)
$\sigma_g$ is the logistic sigmoid (applied component-wise)

RNN vs LSTM

LSTM – summary

LSTM (almost) solves the vanishing gradient problem w.r.t. the "internal" state of the network.
Learns to control its own memory (via forget gate).
Revolution in machine translation and text processing.

Convolutions & LSTM in action – cancer research

Colorectal cancer outcome prediction

The problem: Predict 5-year survival probability from an image of a small region of tumour tissue (1 mm diameter).
Input: Digitized haematoxylin-eosin-stained tumour tissue microarray samples.
Output: Estimated survival probability.
Data:
  Training set: 420 patients of Helsinki University Centre Hospital, diagnosed with colorectal cancer, who underwent primary surgery.
  Test set: 182 patients.
  Follow-up time and outcome known for each patient.
Human expert comparison:
  Histological grade assessed at the time of diagnosis.
  Visual Risk Score: Three pathologists classified samples into high/low-risk categories (by majority vote).
Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018.

Data & workflow

Input images: 3500 px × 3500 px
Cut into tiles: 224 px × 224 px ⇒ 256 tiles
Each tile passed to a convolutional network (CNN)
Output of CNN: 4096 dimensional vector.
A "string" of 256 vectors (each of dimension 4096) is passed into an LSTM.
LSTM outputs the probability of 5-year survival.
The authors also tried to substitute the LSTM on top of CNN with
  logistic regression
  naive Bayes
  support vector machines

CNN architecture – VGG-16

(Pre)trained on ImageNet (cats, dogs, chairs, etc.)

LSTM architecture

LSTM has three layers (264, 128, 64 cells)

LSTM – training

L1 regularization (0.005) at each hidden layer of LSTM, i.e. 0.005 times the sum of absolute values of weights added to the error
L2 regularization (0.005) at each hidden layer of LSTM, i.e. 0.005 times the sum of squared values of weights added to the error
Dropout 5% at the input and the last hidden layers of LSTM
Datasets: Training: 220 samples, Validation: 60 samples, Test: 140 samples.

Colorectal cancer outcome prediction

Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018.

Feed-forward networks summary

Architectures:
  Multi-layer perceptron (MLP): dense connections between layers
  Convolutional networks (CNN): local receptors, feature maps, pooling
  Recurrent networks (RNN, LSTM): self-loops but still feed-forward through time
Training: gradient descent algorithm + heuristics

Hopfield Network

Hopfield network

Auto-associative network: Given an input, the network outputs a training example (encoded in its weights) "similar" to the given input.

Hopfield network

Architecture:
  complete topology, i.e. output of each neuron is input to all neurons
  all neurons are both input and output
  denote by $\xi_1, \ldots, \xi_n$ inner potentials and by $y_1, \ldots, y_n$ outputs (states) of individual neurons
  denote by $w_{ji}$ the weight of connection from a neuron $i \in \{1, \ldots, n\}$ to a neuron $j \in \{1, \ldots, n\}$
  We assume $w_{ji} = w_{ij}$, i.e. symmetric connections.
  assume $w_{jj} = 0$ for every $j = 1, \ldots, n$
  For now: no neuron has a bias

Hopfield network

Learning: Training set $\mathcal{T} = \{x_k \mid x_k = (x_{k1}, \ldots, x_{kn}) \in \{-1, 1\}^n, k = 1, \ldots, p\}$
The goal is to "store" the training examples of $\mathcal{T}$ so that the network is able to associate similar examples.

Hebb's learning rule: If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly interassociated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become "auto-associated".

Mathematically speaking:
$$w_{ji} = \sum_{k=1}^{p} x_{kj} x_{ki} \qquad 1 \leq j \neq i \leq n$$
Intuition: "Neurons that fire together, wire together".

Note that $w_{ji} = w_{ij}$, i.e. the weight matrix is symmetric.
Learning can be seen as a poll about equality of inputs:
  If $x_{kj} = x_{ki}$, then the training example votes for "i equals j" by adding one to $w_{ji}$.
  If $x_{kj} \neq x_{ki}$, then the training example votes for "i does not equal j" by subtracting one from $w_{ji}$.

Hopfield network

Activity: Initially, neurons are set to the network input $x = (x_1, \ldots, x_n)$, thus $y_j^{(0)} = x_j$ for every $j = 1, \ldots, n$.
Cyclically update states of neurons, i.e.
in step $t + 1$ compute the value of a neuron $j$ such that $j = (t \bmod n) + 1$, as follows:
Compute the inner potential:
$$\xi_j^{(t)} = \sum_{i=1}^{n} w_{ji} y_i^{(t)}$$
then
$$y_j^{(t+1)} = \begin{cases} 1 & \xi_j^{(t)} > 0 \\ y_j^{(t)} & \xi_j^{(t)} = 0 \\ -1 & \xi_j^{(t)} < 0 \end{cases}$$

Hopfield network – activity

The computation stops in a step $t^*$ if the network is for the first time in a stable state, i.e.
$$y_j^{(t^*+n)} = y_j^{(t^*)} \qquad (j = 1, \ldots, n)$$
Theorem: Assuming symmetric weights, computation of a Hopfield network always stops for every input.
This implies that a given Hopfield network computes a function from $\{-1, 1\}^n$ to $\{-1, 1\}^n$ (determined by its weights).
Denote by $y(W, x) = (y_1^{(t^*)}, \ldots, y_n^{(t^*)})$ the value of the network for a given input $x$ and a weight matrix $W$. Denote by $y_j(W, x) = y_j^{(t^*)}$ the component of the value of the network corresponding to the neuron $j$.
If $W$ is clear from the context, we write only $y(x)$ and $y_j(x)$.

Ising model – an analogy

Simple models of magnetic materials resemble Hopfield network.
atomic magnets organized into a square lattice
each magnet may have only one of two possible orientations (in the Hopfield network +1 and −1)
orientation of each magnet is influenced by an external magnetic field (input of the network) as well as orientation of the other magnets
weights in the Hopfield net model determine interaction among magnets

Energy function

Energy function E assigns to every state $y \in \{-1, 1\}^n$ a (potential) energy:
$$E(y) = -\frac{1}{2} \sum_{j=1}^{n} \sum_{i=1}^{n} w_{ji} y_j y_i$$
states with low energy are stable (few neurons "want to" change their states), states with high energy are not stable
i.e. large (positive) $w_{ji} y_j y_i$ is stable and small (negative) $w_{ji} y_j y_i$ is not stable
The energy does not increase during computation: $E(y^{(t)}) \geq E(y^{(t+1)})$; stable states $y^{(t^*)}$ correspond to local minima of E.
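Hebb's rule, the cyclic update and the energy function can be put together in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function names `hebb_weights`, `recall` and `energy` are made up here, neurons are indexed from 0, and the loop relies on the theorem above (symmetric weights) to terminate.

```python
def hebb_weights(patterns):
    """w_ji = sum_k x_kj * x_ki for j != i, and w_jj = 0."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for x in patterns:
        for j in range(n):
            for i in range(n):
                if i != j:
                    W[j][i] += x[j] * x[i]
    return W

def energy(W, y):
    """E(y) = -1/2 * sum_j sum_i w_ji * y_j * y_i."""
    n = len(y)
    return -0.5 * sum(W[j][i] * y[j] * y[i] for j in range(n) for i in range(n))

def recall(W, x):
    """Update neurons cyclically until n consecutive updates change nothing."""
    y = list(x)
    n = len(y)
    unchanged = 0  # number of consecutive updates that kept the state
    j = 0
    while unchanged < n:
        xi = sum(W[j][i] * y[i] for i in range(n))  # inner potential of neuron j
        new = 1 if xi > 0 else (-1 if xi < 0 else y[j])
        unchanged = unchanged + 1 if new == y[j] else 0
        y[j] = new
        j = (j + 1) % n
    return y
```

For example, after storing the single pattern `[1, -1, 1, -1, 1, 1]`, calling `recall` on a copy with the first component flipped returns the stored pattern, and the energy of the result is lower than the energy of the noisy input.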
Energy landscape

Hopfield network – convergence

Observe that
  the energy does not increase during computation: $E(y^{(t)}) \geq E(y^{(t+1)})$
  if the state is updated in a step $t + 1$, then $E(y^{(t)}) > E(y^{(t+1)})$
  there are only finitely many states, and thus, eventually, a local minimum of E is reached.
This proves that computation of a Hopfield network always stops.

Hopfield network – example

figures 12 × 10 (120 neurons, −1 is white and 1 is black)
learned 8 figures
input generated with 25% noise
image shows the activity of the Hopfield network

Restricted Boltzmann Machines

Restricted Boltzmann machine (RBM)

Architecture: Neural network with cycles and symmetric connections, neurons divided into two disjoint sets:
  V - visible
  H - hidden
Connections: V × H (complete bipartite graph)
N is a set of all neurons.
Denote by $\xi_j$ the inner potential and by $y_j$ the output (i.e. state) of neuron $j$. State of the machine: $y \in \{0, 1\}^{|N|}$.
Denote by $w_{ji} \in \mathbb{R}$ the weight of the connection from $i$ to $j$ (and thus also from $j$ to $i$).
Consider bias: $w_{j0}$ is the weight between $j$ and a neuron 0 whose value $y_0$ is always 1.

RBM – activity

Activity: States of neurons initially set to values of {0, 1}, i.e. $y_j^{(0)} \in \{0, 1\}$ for $j \in N$.
In the step $t + 1$ do the following:
t even: randomly choose new values of all hidden neurons, for every $j \in H$
$$P\left[ y_j^{(t+1)} = 1 \right] = \frac{1}{1 + \exp\left( -w_{j0} - \sum_{i \in V} w_{ji} y_i^{(t)} \right)}$$
t odd: randomly choose new values of all visible neurons, for every $j \in V$
$$P\left[ y_j^{(t+1)} = 1 \right] = \frac{1}{1 + \exp\left( -w_{j0} - \sum_{i \in H} w_{ji} y_i^{(t)} \right)}$$

Equilibrium

Theorem: For every $\gamma^* \in \{0, 1\}^{|N|}$ we have that
$$\lim_{t \to \infty} P\left[ y^{(t)} = \gamma^* \right] = \frac{1}{Z} e^{-E(\gamma^*)}$$
where $Z = \sum_{\gamma \in \{0,1\}^{|N|}} e^{-E(\gamma)}$ and
$$E(\gamma) = -\sum_{i \in V,\, j \in H} w_{ji} y_j^{\gamma} y_i^{\gamma} - \sum_{i \in V} w_{i0} y_i^{\gamma} - \sum_{j \in H} w_{j0} y_j^{\gamma}$$
Here $y_i^{\gamma}$ is the value of the neuron $i$ in the state $\gamma$.

RBM – Probability distribution

RBM defines the following probability distribution on $\{0, 1\}^{|N|}$ (recall that N is the set of all neurons):
$$p_N(\gamma^*) := \lim_{t \to \infty} P\left[ y^{(t)} = \gamma^* \right] \quad \text{for every } \gamma^* \in \{0, 1\}^{|N|}$$
We obtain a distribution on states of visible neurons by marginalization:
$$p_V(\alpha) = \sum_{\beta \in \{0,1\}^{|H|}} p_N(\alpha\beta) \quad \text{for every } \alpha \in \{0, 1\}^{|V|}$$
Here $\alpha\beta \in \{0, 1\}^{|N|}$ is a vector of values of all states obtained by concatenating values $\alpha$ of visible neurons and values $\beta$ of hidden neurons.

RBM – learning

Learning: Let $p_d$ be a probability distribution on states of visible neurons, i.e. on $\{0, 1\}^{|V|}$.
Our goal is to find a configuration of the network W such that $p_V \approx p_d$.
A suitable measure of difference between probability distributions $p_V$ and $p_d$ is relative entropy weighted by probabilities of states (Kullback-Leibler divergence):
$$E(W) = \sum_{\alpha \in \{0,1\}^{|V|}} p_d(\alpha) \ln \frac{p_d(\alpha)}{p_V(\alpha)}$$
E is minimized using the gradient descent algorithm.

RBM – learning

Minimize $E(W)$ using gradient descent, i.e. compute a sequence of weight matrices: $W^{(0)}, W^{(1)}, \ldots$
initialise $W^{(0)}$ randomly, close to 0
in step $t + 1$ compute $W^{(t+1)}$ as follows:
$$W_{ji}^{(t+1)} = W_{ji}^{(t)} + \Delta W_{ji}^{(t)} \quad \text{where} \quad \Delta W_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(W^{(t)})$$
is the update of the weight $w_{ji}$ in the step $t + 1$ and $0 < \varepsilon(t) \leq 1$ is the learning rate in the step $t + 1$.
It remains to compute $\frac{\partial E}{\partial w_{ji}}(W)$ (skipped).

Deep MLP

(figure: layers Input, Hidden, Output; inputs x1, x2; outputs y1, y2)

Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers
layers numbered from 0; the input layer has number 0
  E.g. three-layer network has two hidden layers and one output layer
Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer
Architecture of an MLP is typically described by numbers of neurons in individual layers (e.g. 2-4-3-2)

Why deep networks

... if one hidden layer is able to represent an arbitrary (reasonable) function?
One hidden layer may be very inefficient, i.e. a huge number of neurons may be needed. One can show that
  the number of hidden neurons may be exponential w.r.t. the dimension of the input,
  networks with multiple layers may be exponentially more succinct than networks with a single hidden layer.
... ok, so let's try to teach deep networks ... using backpropagation?
Problems:
  Gradient may vanish/explode when backpropagated through many layers.
  Deep networks (with many neurons) overfit very easily.

Deep MLP – pretraining

Assume k layers.
Denote
  $W_i$ the weight matrix between layers $i - 1$ and $i$
  $F_i$ the function computed by the "lower" part of the MLP consisting of layers $0, 1, \ldots, i$
  ($F_1$ is a function which consists of the input and the first hidden layer, which is now considered as the output layer).
Crucial observation: For every $i$, the layers $i - 1$ and $i$ together with the matrix $W_i$ can be considered as an RBM. Denote such an RBM as $B_i$.

Deep MLP – pretraining

For now, consider only input vectors $x_1, \ldots, x_p$ where $x_k \in \{0, 1\}^n$ for all $k = 1, \ldots, p$.
unsupervised pretraining: Gradually, for every $i = 1, \ldots, k$, train RBM $B_i$ on randomly selected inputs from the training set
$$F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$$
using the training algorithm for RBM (here $F_0(x_i) = x_i$).
(Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i - 1$.)
We obtain a deep belief network D representing a distribution given by $x_1, \ldots, x_p$. (Recall that in such a distribution the probability of a given $x$ is equal to the relative frequency of $x$ in $x_1, \ldots, x_p$.)

Deep belief network

The network D can be used to sample from the distribution as follows:
Simulate the topmost RBM for some steps (ideally to thermal equilibrium); this gives values of neurons in the two topmost layers.
Propagate the values downwards by always simulating one step of the corresponding RBM.
That is, you have already computed values of neurons in layers $k$ and $k - 1$.
To compute values of neurons in the layer $k - 2$, simulate one step of RBM $B_{k-1}$, that is, sample values of neurons in the layer $k - 2$ using RBM dynamics of $B_{k-1}$ with values of the layer $k - 1$ fixed. Similarly, compute values of $k - 3$ by simulating $B_{k-2}$ ... etc. ... finally obtain values of input neurons.
Probability with which a concrete input $x$ is sampled by the above procedure is the probability of $x$ in the distribution represented by D.

Deep MLP – training with pretraining

Now consider supervised learning with a training set:
$$\mathcal{T} = \{ (x_k, d_k) \mid k = 1, \ldots, p \}$$
Still assume that $x_k \in \{0, 1\}^n$.
unsupervised pretraining: Gradually, for every $i = 1, \ldots, k$, train RBM $B_i$ on randomly selected inputs from the training set
$$F_{i-1}(x_1), \ldots, F_{i-1}(x_p)$$
using the training algorithm for RBM (here $F_0(x_i) = x_i$).
(Thus $B_i$ learns from training samples transformed by the already pretrained layers $0, \ldots, i - 1$.)
Obtain D.
Add one (or more) layers to the top of D and consider the result to be an MLP (i.e. forget the RBM dynamics and start considering the network as an MLP with sigmoidal activations).
supervised fine-tuning: Train in supervised mode (on the training set $\mathcal{T}$) using e.g. gradient descent + backprop.
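"Simulating one step of the corresponding RBM", as used in the sampling and pretraining procedures above, amounts to sampling one layer given the other according to the RBM activity rule. A minimal sketch in plain Python follows. The names `p_on`, `sample_hidden` and `sample_visible` and the layout `W[j][i]` (hidden neuron j, visible neuron i) are assumptions of this example, and biases are passed as separate lists rather than as weights to a neuron 0.

```python
import math
import random

def p_on(bias, weights, states):
    """P[y_j = 1] = 1 / (1 + exp(-w_j0 - sum_i w_ji * y_i))."""
    potential = bias + sum(w * y for w, y in zip(weights, states))
    return 1.0 / (1.0 + math.exp(-potential))

def sample_hidden(W, b_hid, v, rng):
    """Randomly choose new values of all hidden neurons given visible states v."""
    return [1 if rng.random() < p_on(b_hid[j], W[j], v) else 0
            for j in range(len(W))]

def sample_visible(W, b_vis, h, rng):
    """Randomly choose new values of all visible neurons given hidden states h."""
    return [1 if rng.random() < p_on(b_vis[i], [row[i] for row in W], h) else 0
            for i in range(len(b_vis))]
```

Alternating `sample_hidden` and `sample_visible` simulates the RBM. For the downward pass in a deep belief network, only the sampling of the lower layer from the fixed upper layer is needed at each level.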
Application – dimensionality reduction

Dimensionality reduction: A mapping R from $\mathbb{R}^n$ to $\mathbb{R}^m$ where $m < n$, such that every example $x$ can be "reconstructed" from $R(x)$.
Standard method: PCA (there are many linear as well as non-linear variants)

Reconstruction – PCA

1024 pixels compressed to 100 dimensions (i.e. 100 numbers).

Deep MLP – dimensionality reduction

Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started all the deep learning craze ...

Images – pretraining

Data: 165 600 black-white images, 25 × 25, mean intensity 0, variance 1.
Images obtained from Olivetti Faces database of images 64 × 64 using standard transformations.
103 500 training set, 20 700 validation, 41 400 test
Network: 2000-100-500-30, training using layered RBM.
Notes:
  Training of the lowest layer (2000 neurons): Values of pixels distorted using Gaussian noise, low learning rate: 0.001, 200 iterations
  Training all hidden layers: Values of neurons are binary.
  Training of output layer: Values computed directly using the sigmoid activation functions + noise. That is, values of output neurons are from the interval [0, 1].

Images – fine-tuning

Stochastic activation substituted with deterministic. That is, the value of hidden neurons is not chosen randomly but directly computed by application of sigmoid on the inner potential (this gives the mean activation).
Backpropagation.
Error function: cross-entropy
$$-\sum_i p_i \ln \hat{p}_i - \sum_i (1 - p_i) \ln(1 - \hat{p}_i)$$
here $p_i$ is the intensity of the i-th pixel of the input and $\hat{p}_i$ of the reconstruction.

Results

1. Original
2. Reconstruction using deep networks (reduction to 30-dim)
3. Reconstruction using PCA (reduction to 30-dim)

Kohonen's Map

Vector quantization

Assume we are given a probability density function $p(x)$ on input vectors $x \in \mathbb{R}^n$. I.e. assume that the inputs are randomly generated according to $p(x)$.
Our goal is to approximate $p(x)$ using finitely many centres $w_i \in \mathbb{R}^n$ where $i = 1, \ldots, h$.
Roughly speaking: We want more centres in areas of higher density and fewer in areas of low density.
Formally: To every input $x$ we assign its closest centre $w_{c(x)}$:
$$c(x) = \arg\min_{i=1,\ldots,h} \|x - w_i\|$$
and then minimize the error
$$E = \int \|x - w_{c(x)}\|^2 \, p(x) \, dx$$
Caution! $c(x)$ depends on $x$.

Vector quantization

In practice, $p(x)$ is obtained by sampling uniformly from a given training (multi)set:
$$\mathcal{T} = \{x_j \in \mathbb{R}^n \mid j = 1, \ldots, \ell\}$$
The error then corresponds to
$$E = \frac{1}{\ell} \sum_{j=1}^{\ell} \|x_j - w_{c(x_j)}\|^2$$
(keep in mind that $c(x_j) = \arg\min_{i=1,\ldots,h} \|x_j - w_i\|$.)
If $\mathcal{T}$ has been randomly selected according to $p(x)$ and is large enough, then
$$\frac{1}{\ell} \sum_{j=1}^{\ell} \|x_j - w_{c(x_j)}\|^2 \approx \int \|x - w_{c(x)}\|^2 \, p(x) \, dx$$

Example – image compression

Every pixel has 256 shades of grey,
each pair of neighbouring pixels is a two-dimensional vector from $\{0, \ldots, 255\} \times \{0, \ldots, 255\}$,
our compression finds a small set of centres that will encode shades of grey of pairs of pixels,
the image is then encoded by simple substitution of pairs of pixels with their centres.

Example – image compression

(figure panels: pair distribution, naive quantization, smart quantization)

k-means clustering algorithm

Assume a finite training set: $\mathcal{T} = \{x_j \in \mathbb{R}^n \mid j = 1, \ldots, \ell\}$
The algorithm moves centres closer to the centres of mass of closest points.
In the step $t$ it computes $w_1^{(t)}, \ldots, w_h^{(t)}$ as follows:
for every $k = 1, \ldots, h$ compute a set $T_k$ of all vectors of $\mathcal{T}$ to which $w_k^{(t-1)}$ is the closest centre:
$$T_k = \left\{ x_j \in \mathcal{T} \mid k = \arg\min_{i=1,\ldots,h} \|x_j - w_i^{(t-1)}\| \right\}$$
compute $w_k^{(t)}$ to be the centroid of $T_k$:
$$w_k^{(t)} = \frac{1}{|T_k|} \sum_{x \in T_k} x$$
We may stop the computation when, e.g., the error E is sufficiently small.
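The k-means step described above (assign every vector to its closest centre, then move each centre to the centroid of its vectors) can be sketched in plain Python. The sketch makes a few choices the slides leave open, all of them assumptions of this example: ties are resolved in favour of the lowest index, a centre with an empty set $T_k$ stays where it is, and the loop stops when the centres no longer change rather than when the error is small. The function names are made up.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_step(points, centres):
    """One step: build the sets T_k, then replace each centre by the centroid of T_k."""
    clusters = [[] for _ in centres]
    for x in points:
        k = min(range(len(centres)), key=lambda i: sq_dist(x, centres[i]))
        clusters[k].append(x)
    new_centres = []
    for k, members in enumerate(clusters):
        if members:
            new_centres.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_centres.append(list(centres[k]))  # empty T_k: keep the centre (assumption)
    return new_centres

def quantization_error(points, centres):
    """E = (1/l) * sum_j ||x_j - w_c(x_j)||^2."""
    return sum(min(sq_dist(x, w) for w in centres) for x in points) / len(points)

def kmeans(points, centres, max_steps=100):
    for _ in range(max_steps):
        new_centres = kmeans_step(points, centres)
        if new_centres == centres:  # centres are stable, stop
            break
        centres = new_centres
    return centres
```

With the points (0, 0), (0, 1), (10, 10), (10, 11) and initial centres (0, 0) and (10, 10), the centres settle at (0, 0.5) and (10, 10.5).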
Kohonen's learning

The k-means algorithm is not online.
The following Kohonen's algorithm is online (i.e. the inputs may be generated one by one and the centres are adapted online):
In step $t$, consider the input $x_t$ and compute $w_k^{(t)}$ as follows:
If $w_k^{(t-1)}$ is the closest centre to $x_t$, i.e. $k = \arg\min_i \|x_t - w_i^{(t-1)}\|$, then
$$w_k^{(t)} = w_k^{(t-1)} + \theta \cdot (x_t - w_k^{(t-1)})$$
else
$$w_k^{(t)} = w_k^{(t-1)}$$
$0 < \theta \leq 1$ determines how much to move the centre towards the input.
Let us formulate this algorithm in the language of neural networks.

Kohonen's learning – neural network

Architecture: Single layer
(figure: inputs $x_1, \ldots, x_n$ connected to outputs $y_1, \ldots, y_h$; neuron $k$ has weights $w_{k1}, \ldots, w_{kn}$)
Activity: For an input $x \in \mathbb{R}^n$ and $k = 1, \ldots, h$:
$$y_k = \begin{cases} 1 & k = \arg\min_{i=1,\ldots,h} \|x - w_i\| \\ 0 & \text{otherwise} \end{cases}$$

Kohonen's learning

In step $t$, consider the input $x_t$ and compute $w_k^{(t)}$ as follows:
If $w_k^{(t-1)}$ is the closest neuron to $x_t$, i.e.
$k = \arg\min_i \| x_t - w_i^{(t-1)} \|$, then $w_k^{(t)} = w_k^{(t-1)} + \theta \cdot (x_t - w_k^{(t-1)})$, else $w_k^{(t)} = w_k^{(t-1)}$.
$0 < \theta \le 1$ determines how much to move the neuron towards the input.

Kohonen’s learning – efficiency
Works well if most input vectors are evenly distributed in a convex area.
In case of two (or more) separated clusters, the density of centres may not correspond to $p(x)$ at all.
Ex.: Two separated areas with the same density. Assume that the centres are initially in one of the areas. The second area then "drags" only one of the centres (which from then on always wins the competition for inputs from that area).
Result: One of the areas will be covered by a single centre even though it contains half of the mass of the input examples.
Solution: We tie centres together so that they have to move together.

Kohonen’s map
Architecture: Single layer.
(Figure: inputs $x_1, \ldots, x_i, \ldots, x_n$; output neurons $y_1, \ldots, y_k, \ldots, y_h$; weights $w_{k1}, \ldots, w_{ki}, \ldots, w_{kn}$ of neuron $y_k$.)
Topological structure: neurons are connected by edges so that they are nodes in an undirected graph. In most cases, this structure is either a one-dimensional sequence or a two-dimensional grid.
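The plain winner-take-all update of Kohonen’s learning, and the two-area failure described under efficiency, can be reproduced with a small sketch. Everything concrete here is an assumption made for illustration: the one-dimensional areas $[0, 1)$ and $[10, 11)$, four centres, $\theta = 0.1$, and the names `kohonen_step` and `two_area_demo`.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kohonen_step(centres, x, theta):
    """Move only the winning centre towards x by the fraction theta.

    Modifies `centres` in place and returns the index of the winner.
    """
    k = min(range(len(centres)), key=lambda i: sq_dist(x, centres[i]))
    centres[k] = [w + theta * (xi - w) for w, xi in zip(centres[k], x)]
    return k


def two_area_demo(steps=2000, theta=0.1, seed=0):
    """Inputs come with equal probability from [0, 1) and from [10, 11)."""
    rng = random.Random(seed)
    # All four centres start inside the first area.
    centres = [[rng.random()] for _ in range(4)]
    for _ in range(steps):
        low = 0.0 if rng.random() < 0.5 else 10.0
        kohonen_step(centres, [low + rng.random()], theta)
    return centres


if __name__ == "__main__":
    # One centre ends up near the second area, three stay in the first.
    print(sorted(c[0] for c in two_area_demo()))
```

In this setup the largest centre is always the closest one to inputs from the second area, so it is the only centre that is ever dragged there; the remaining three keep sharing the first area.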
Kohonen’s map – illustration
(Figure.)

Kohonen’s map – bio motivation
Source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Kohonen’s map
Activity: Given an input vector $x \in \mathbb{R}^n$ and $k = 1, \ldots, h$:
$$y_k = \begin{cases} 1 & k = \arg\min_{i=1,\ldots,h} \| x - w_i \| \\ 0 & \text{otherwise} \end{cases}$$
Learning: We use the topological structure. Denote by $d(c, k)$ the length of the shortest path from neuron $c$ to neuron $k$ in the topological structure.
For every neuron $c$ and a given $s \in \mathbb{N}_0$, define the topological neighbourhood of the neuron $c$ of size $s$:
$$N_s(c) = \{ k \mid d(c, k) \le s \}$$
In step $t$, given a training example $x_t$, adapt $w_k$ as follows:
$$w_k^{(t)} = \begin{cases} w_k^{(t-1)} + \theta \cdot (x_t - w_k^{(t-1)}) & k \in N_s(c(x_t)) \\ w_k^{(t-1)} & \text{otherwise} \end{cases}$$
where $c(x_t) = \arg\min_{i=1,\ldots,h} \| x_t - w_i^{(t-1)} \|$, and $\theta \in \mathbb{R}$ and $s \in \mathbb{N}_0$ are parameters that may change during training.

Kohonen’s map – learning
More general version:
$$w_k^{(t)} = w_k^{(t-1)} + \Theta(c(x_t), k) \cdot (x_t - w_k^{(t-1)})$$
where $c(x_t) = \arg\min_{i=1,\ldots,h} \| x_t - w_i^{(t-1)} \|$. The previous case then corresponds to
$$\Theta(c(x_t), k) = \begin{cases} \theta & k \in N_s(c(x_t)) \\ 0 & \text{otherwise} \end{cases}$$
A smoother version:
$$\Theta(c(x_t), k) = \theta_0 \cdot \exp\left( \frac{-d(c(x_t), k)^2}{\sigma^2} \right)$$
where $\theta_0 \in \mathbb{R}$ is a learning rate and $\sigma \in \mathbb{R}$ is the width (both parameters may change during training).

Example 1
Inputs uniformly distributed in a rectangle.
Image source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Example 2
Inputs uniformly distributed in a triangle.
Image source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Example 3
Inputs uniformly distributed in a cuboid.
Image source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Example 4
Inputs uniformly distributed in a cactus.
Image source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Example – defect
Topological defect – twisted network.
Image source: Neural Networks - A Systematic Introduction, Raul Rojas, Springer, 1996

Kohonen’s map – theory
Convergence to an "ordered" state has been proved only for one-dimensional maps and special cases of the distribution $p(x)$ (uniform), fixed neighbourhoods of size 1, and a fixed learning rate. There are simple counterexamples disproving convergence in case these assumptions are not satisfied.
In more than one dimension there are no guarantees at all; convergence depends on several factors: the initial distribution of neurons (centres), the size of the neighbourhood, and the learning rate.
What dimension to choose? Typically a one- or two-dimensional map is used (as a coarse version of dimensionality reduction).

LVQ – classification using Kohonen’s map
Assume randomly generated training examples of the form $(x_t, d_t)$ where $x_t \in \mathbb{R}^n$ is a feature vector and $d_t \in \{C_1, \ldots, C_q\}$ corresponds to one of the $q$ classes.
Our goal is to classify objects based on our knowledge of their features, i.e.
to every $x_t$ assign a class so that the probability of error is minimized.
Ex.: Conveyor belt with fruits, apples and oranges. Formally, $(x_t, d_t)$ where
$x_t \in \mathbb{R}^2$; here the first component is the weight and the second the diameter,
$d_t$ is either A or O depending on whether the given object is an apple or an orange.
We allow apples and oranges with the same features.
The goal is to sort out the fruits based on their weight and diameter.

Classification using Kohonen’s map
We use Kohonen’s map as follows:
1. Train the map on feature vectors $x_t$ where $t = 1, \ldots, \ell$ (ignore the classes for now).
2. Label neurons with classes. The class $v_c$ of a given neuron $c$ is determined as follows: For every neuron $c$ and every class $C_i$, count the number $\#(c, C_i)$ of training examples $x_t$ with class $C_i$ for which the neuron $c$ returns 1 (i.e. is the closest to them). To $c$, assign the class $v_c$ satisfying $v_c = \arg\max_{C_i} \#(c, C_i)$.
3. Fine-tune the network using LVQ (next slide).
The trained network is used as follows: Given a feature vector $x$, evaluate the network with $x$ as the input. A single neuron $c$ has the value 1; return $v_c$ as the class of $x$.

LVQ
Iterate over training examples.
For $(x_t, d_t)$ find the closest neuron $c$:
$$c = \arg\min_{i=1,\ldots,h} \| x_t - w_i^{(t-1)} \|$$
Adjust the weights of $c$ as follows:
$$w_c^{(t)} = \begin{cases} w_c^{(t-1)} + \alpha (x_t - w_c^{(t-1)}) & d_t = v_c \\ w_c^{(t-1)} - \alpha (x_t - w_c^{(t-1)}) & d_t \ne v_c \end{cases}$$
The parameter $\alpha$ should be small right from the beginning (approx. 0.01 to 0.02) and go to 0 steadily.
By Kohonen: The border between classes should be a good approximation of the Bayes decision boundary. What is it?

Bayes classifier
For simplicity, consider two classes $C_0$ and $C_1$ (e.g. A and O).
Let $P(C_i \mid x)$ be the probability that the object belongs to $C_i$ assuming that it has features $x$ (e.g. $P(A \mid (a, b))$ is the probability that a fruit with weight $a$ and diameter $b$ is an apple).
Bayes classifier assigns to $x$ the class $C_i$ which satisfies $P(C_i \mid x) \ge P(C_{1-i} \mid x)$.
Denote by $R_0$ the set of all $x$ satisfying $P(C_0 \mid x) \ge P(C_1 \mid x)$ and $R_1 = \mathbb{R}^n \setminus R_0$.
Bayes classifier minimizes the error probability:
$$P(x \in R_0 \wedge C_1) + P(x \in R_1 \wedge C_0)$$
Bayes decision boundary is the boundary between the sets $R_0$ and $R_1$.
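The labelling of neurons by majority vote and the LVQ fine-tuning step given earlier can be sketched as follows. This is a minimal sketch under stated assumptions: the function names (`winner`, `label_neurons`, `lvq_step`, `classify`) are illustrative, a neuron that wins no training example gets the label `None`, and the schedule that lets $\alpha$ go to 0 is left to the caller.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def winner(weights, x):
    """Index c of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))


def label_neurons(weights, examples):
    """v_c = the class that occurs most often among examples won by neuron c."""
    counts = [Counter() for _ in weights]
    for x, d in examples:
        counts[winner(weights, x)][d] += 1
    # Assumption: a neuron that never wins stays unlabelled (None).
    return [c.most_common(1)[0][0] if c else None for c in counts]


def lvq_step(weights, labels, x, d, alpha):
    """Attract the winner to x if its label matches d, repel it otherwise."""
    c = winner(weights, x)
    sign = 1.0 if labels[c] == d else -1.0
    weights[c] = [w + sign * alpha * (xi - w) for w, xi in zip(weights[c], x)]
    return c


def classify(weights, labels, x):
    """Return v_c of the single neuron c that responds with 1 to x."""
    return labels[winner(weights, x)]


if __name__ == "__main__":
    # Toy (weight, diameter) features with classes "A" and "O".
    examples = [([0.1, 0.0], "A"), ([0.0, 0.2], "A"),
                ([0.1, 0.1], "O"), ([0.9, 1.0], "O")]
    weights = [[0.0, 0.0], [1.0, 1.0]]
    labels = label_neurons(weights, examples)
    for x, d in examples:
        lvq_step(weights, labels, x, d, alpha=0.02)
    print(labels, classify(weights, labels, [0.9, 0.8]))
```

Only the winning neuron is touched in each step, so the map’s topological structure plays no role during fine-tuning.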
Bayes decision boundary vs LVQ
Image source: The Self-Organizing Map, Teuvo Kohonen, IEEE, 1990

Oceanographic data
Source: Patterns of ocean current variability on the West Florida Shelf using the self-organizing map. Y. Liu and R. H. Weisberg, Journal of Geophysical Research, 2005
Investigates currents in the ocean around Florida.

Oceanographic data
11 measuring stations, 3 depths (surface, bottom, in between).
Data: 2D velocity vectors of the current measured every hour, for 25585 hours.
Thus we have 25585 data samples, each with 66 dimensions (11 stations × 3 depths × 2 velocity components).
Kohonen’s map:
grid 3 × 4
neighbourhoods given by Gaussian functions $\Theta(c, k) = \theta_0 \cdot \exp\left( \frac{-d(c, k)^2}{\sigma^2} \right)$
shrinking width (linearly decreasing learning rate)

Oceanographic data
(Figure; crosses are winning neurons.)
Influenced by local fluctuations.
Observable trend: winter: neurons 1-6 (south-east); summer: neurons 10-12 (north-west).

Grimm’s fairy tales
Source: Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map. T. Kohonen, T. Honkela and V. Pulkki, ICANN, 1995
Our goal is to visualize syntactic and semantic categories of words in fairy tales (depending on context).
Input: Grimm’s fairy tales (understandably encoded as a stream of 270-dimensional vectors):
triples of words (predecessor, key, successor),
every component in the triple encoded using a randomly generated 90-dimensional real vector.
Network: Kohonen’s map, 42 × 36 neurons, weights of the form $w = (w_p, w_k, w_n)$ where $w_p, w_k, w_n \in \mathbb{R}^{90}$.
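The input encoding described above can be sketched as follows. The slides only say that each word gets a randomly generated 90-dimensional real vector, so the Gaussian distribution, the fixed seed, and the names `make_codes` and `encode_triples` are assumptions made for this illustration; the cited paper may differ in detail.

```python
import random

CODE_DIM = 90  # dimension of each word code, as stated on the slide


def make_codes(vocabulary, dim=CODE_DIM, seed=0):
    """Assign every word a fixed random real vector of length `dim`.

    Assumption: components are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    return {word: [rng.gauss(0.0, 1.0) for _ in range(dim)] for word in vocabulary}


def encode_triples(words, codes):
    """Yield one vector per (predecessor, key, successor) window.

    Each vector is the concatenation of the three word codes, so its
    length is 3 * dim (270 for dim = 90).
    """
    for pred, key, succ in zip(words, words[1:], words[2:]):
        yield codes[pred] + codes[key] + codes[succ]


if __name__ == "__main__":
    text = "the king saw the queen".split()
    codes = make_codes(sorted(set(text)))
    vectors = list(encode_triples(text, codes))
    print(len(vectors), len(vectors[0]))  # 3 270
```

The three 90-dimensional blocks of each input line up with the parts $w_p$, $w_k$, $w_n$ of a neuron’s weight vector.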
Grimm’s fairy tales
Learning: Trained on triples of successive words in fairy tales. The training set consisted of the 150 most common words, with "average" context.
Training: Approx. 1,000,000 iterations.
In the end, the 150 most common words labelled neurons: A word $u$ labels a neuron with weights $w = (w_p, w_k, w_n)$ when $w_k$ is closest to the code of $u$.

Grimm’s fairy tales
(Figure.)

Course Summary

Great summary – models
We have considered several models of neural networks:
ADALINE (aka linear regression)
Multilayer Perceptron
Hopfield Networks
Convolutional Networks
Recurrent Networks (LSTM)
Restricted Boltzmann Machines and Deep Belief Networks
Kohonen’s Maps

Great summary – algorithms
Gradient descent! The only exceptions were Kohonen’s maps (Kohonen learning) and Hopfield networks (Hebb’s learning).
The gradient is computed using:
Backpropagation: MLP, Convolutional, Recurrent (LSTM)
Simulations: RBM

Deeper thoughts
Most neural network models are universal approximators (i.e. capable of approximating any reasonable function), but it is difficult to find the appropriate configuration → such a configuration can be learned efficiently (without guarantees, of course).
Depth is stronger than size: deep networks are more succinct in their representation but are harder to train. Do not forget the vanishing/exploding gradient problem!
Weight tying = single most effective trick in the history of neural networks!