What about classification?

Binary classification: desired outputs are 0 and 1. Ideally, we would like to capture the probability distribution of the classes. A linear model does not capture probability well (its output is not a probability at all); the logistic sigmoid $\frac{1}{1+e^{-\vec{w}\cdot\vec{x}}}$ is much better!

Logistic regression

(Figure: a single neuron with formal unit input $x_0 = 1$, inputs $x_1, \dots, x_n$, weights $w_0, w_1, \dots, w_n$, and output $y$.)

$\vec{w} = (w_0, w_1, \dots, w_n)$ and $\vec{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$.

Activity:
- inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
- activation function: $\sigma(\xi) = \frac{1}{1+e^{-\xi}}$
- network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \frac{1}{1+e^{-\vec{w}\cdot\vec{x}}}$

Intuition: the output $y$ is now interpreted as the probability of the class 1 given the input $\vec{x}$.

But what is the meaning of the sigmoid?

The model gives a probability $y$ of the class 1 given an input $\vec{x}$. But why do we model such a probability using $1/(1 + e^{-\vec{w}\cdot\vec{x}})$??

Let $\hat{y}$ be the "true" probability of the class 1 to be modeled.
- What about the odds of the class 1? $\mathrm{odds}(\hat{y}) = \hat{y}/(1 - \hat{y})$. Resembles an exponential function ...
- What about the log odds (aka logit) of the class 1? $\mathrm{logit}(\hat{y}) = \log(\hat{y}/(1 - \hat{y}))$. Looks almost linear ...

Assume that $\hat{y}$ is the probability of the class 1, and put

$$\log(\hat{y}/(1 - \hat{y})) = \vec{w} \cdot \vec{x}$$

Then $\log((1 - \hat{y})/\hat{y}) = -\vec{w} \cdot \vec{x}$, hence $(1 - \hat{y})/\hat{y} = e^{-\vec{w}\cdot\vec{x}}$, and thus

$$\hat{y} = \frac{1}{1 + e^{-\vec{w}\cdot\vec{x}}}$$

That is, if we model the log odds using a linear function, the probability is obtained by applying the logistic sigmoid to the result of the linear function.

Logistic regression

Learning: Given a training dataset
$$T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \{0, 1\}$ is the expected output.

What error function? The (binary) cross-entropy:

$$E(\vec{w}) = \sum_{k=1}^{p} -\big( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \big)$$

What?!?
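To make the model concrete, here is a minimal NumPy sketch of the logistic-regression network function and the cross-entropy error above; the function names (`sigmoid`, `predict`, `cross_entropy`) are mine, and the inputs are assumed to already carry the formal unit input $x_0 = 1$:

```python
import numpy as np

def sigmoid(xi):
    """Logistic sigmoid sigma(xi) = 1 / (1 + e^(-xi))."""
    return 1.0 / (1.0 + np.exp(-xi))

def predict(w, x):
    """Network function y[w](x) = sigma(w . x); x[0] is the formal unit input 1."""
    return sigmoid(np.dot(w, x))

def cross_entropy(w, X, d):
    """Binary cross-entropy E(w) over inputs X (one row per x_k, x_k0 = 1) and labels d."""
    y = sigmoid(X @ w)
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))
```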
Log likelihood is your friend!

Let's have a "coin" (sides 0 and 1). The probability of 1 is $\hat{y}$ and is unknown!

You have tossed the coin 5 times and got a training dataset:
$$T = \{1, 1, 0, 0, 1\} = \{d_1, \dots, d_5\}$$
Consider this to be a very special case where the input dimension is 0.

What is the best model $y$ of $\hat{y}$ based on the data? Answer: the one that generates the data with maximum probability!

Assume that the data was generated by independent trials. Then the probability of getting exactly $T$ from our model is

$$L = y \cdot y \cdot (1 - y) \cdot (1 - y) \cdot y$$

How to maximize this w.r.t. $y$? Maximize the log likelihood

$$LL = \log(L) = \log(y) + \log(y) + \log(1 - y) + \log(1 - y) + \log(y)$$

But then

$$-LL = -1 \cdot \log(y) - 1 \cdot \log(y) - (1 - 0) \cdot \log(1 - y) - (1 - 0) \cdot \log(1 - y) - 1 \cdot \log(y)$$

i.e. $-LL$ is the cross-entropy.

Let the coin depend on the input

Consider our model:
$$y = \frac{1}{1 + e^{-\vec{w}\cdot\vec{x}}}$$

The training dataset is now standard:
$$T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \{0, 1\}$ is the expected output.

The likelihood:
$$L = \prod_{k=1}^{p} y_k^{d_k} \cdot (1 - y_k)^{(1 - d_k)}$$
and
$$LL = \log(L) = \sum_{k=1}^{p} \big( d_k \log(y_k) + (1 - d_k) \log(1 - y_k) \big)$$
and thus $-LL$ = the cross-entropy. Minimizing the cross-entropy maximizes the log likelihood (and vice versa).

Normal distribution

Distribution of continuous random variables. Density (one-dimensional, that is over $\mathbb{R}$):

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) =: N[\mu, \sigma^2](x)$$

$\mu$ is the expected value (the mean), $\sigma^2$ is the variance.
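As a quick numeric check of the coin example (a sketch, not part of the original derivation): scanning candidate models $y$ shows that the cross-entropy $-LL$ is minimized at the empirical frequency $3/5$ of side 1.

```python
import numpy as np

d = np.array([1, 1, 0, 0, 1])        # the five coin tosses
ys = np.linspace(0.01, 0.99, 981)    # candidate models y of the unknown y-hat

# -LL(y) = cross-entropy of the constant model y on the dataset d
neg_ll = np.array([-np.sum(d * np.log(y) + (1 - d) * np.log(1 - y)) for y in ys])

print(ys[np.argmin(neg_ll)])         # ~0.6, i.e. the empirical frequency 3/5
```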
Maximum Likelihood vs Least Squares (Dim 1)

Fix a training set $D = \{(x_1, d_1), (x_2, d_2), \dots, (x_p, d_p)\}$.

Assume that each $d_k$ has been generated randomly by

$$d_k = (w_0 + w_1 \cdot x_k) + \epsilon_k$$

where
- $w_0, w_1$ are unknown numbers,
- $\epsilon_k$ are normally distributed with mean 0 and an unknown variance $\sigma^2$.

Assume that $\epsilon_1, \dots, \epsilon_p$ were generated independently.

Denote by $p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2)$ the probability density according to which the values $d_1, \dots, d_p$ were generated, assuming fixed $w_0, w_1, \sigma^2, x_1, \dots, x_p$.

The independence and normality imply

$$p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^{p} N[w_0 + w_1 x_k, \sigma^2](d_k) = \prod_{k=1}^{p} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(d_k - w_0 - w_1 x_k)^2}{2\sigma^2} \right)$$

Our goal is to find $(w_0, w_1)$ that maximizes the likelihood that the training set $D$ with fixed values $d_1, \dots, d_p$ has been generated:

$$L(w_0, w_1, \sigma^2) := p(d_1, \dots, d_p \mid w_0, w_1, \sigma^2)$$

Theorem: $(w_0, w_1)$ maximizes $L(w_0, w_1, \sigma^2)$ for arbitrary $\sigma^2$ iff $(w_0, w_1)$ minimizes the squared error $E(w_0, w_1) = \sum_{k=1}^{p} (d_k - w_0 - w_1 x_k)^2$.

Note that the maximizing/minimizing $(w_0, w_1)$ does not depend on $\sigma^2$. The maximizing $\sigma^2$ satisfies

$$\sigma^2 = \frac{1}{p} \sum_{k=1}^{p} (d_k - w_0 - w_1 \cdot x_k)^2$$

MLP training – theory

Architecture – Multilayer Perceptron (MLP)

(Figure: an MLP with input neurons $x_1, x_2$, one hidden layer, and output neurons $y_1, y_2$.)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).
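The theorem can be illustrated numerically. The following sketch (all names and data are my own illustration) generates data from the assumed model, fits $(w_0, w_1)$ by least squares via `np.polyfit`, and recovers the maximizing $\sigma^2$ as the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following the assumed model d_k = w0 + w1*x_k + eps_k
w0_true, w1_true, sigma_true = 1.0, 2.0, 0.5
x = rng.uniform(-1, 1, size=100)
d = w0_true + w1_true * x + rng.normal(0.0, sigma_true, size=100)

# Least squares fit = maximum likelihood estimate of (w0, w1), by the theorem
w1_hat, w0_hat = np.polyfit(x, d, deg=1)

# The maximizing sigma^2 is the mean squared residual
sigma2_hat = np.mean((d - w0_hat - w1_hat * x) ** 2)
print(w0_hat, w1_hat, sigma2_hat)    # close to 1.0, 2.0, 0.25
```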
MLP – architecture

Notation:
- Denote by $X$ the set of input neurons, by $Y$ the set of output neurons, and by $Z$ the set of all neurons ($X, Y \subseteq Z$).
- Individual neurons are denoted by indices $i, j$ etc.
- $\xi_j$ is the inner potential of the neuron $j$ after the computation stops.
- $y_j$ is the output of the neuron $j$ after the computation stops (define $y_0 = 1$ as the value of the formal unit input).
- $w_{ji}$ is the weight of the connection from $i$ to $j$ (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron $j$).
- $j^{\leftarrow}$ is the set of all $i$ such that $j$ is adjacent from $i$ (i.e. there is an arc to $j$ from $i$).
- $j^{\rightarrow}$ is the set of all $i$ such that $j$ is adjacent to $i$ (i.e. there is an arc from $j$ to $i$).

MLP – activity

Activity:
- inner potential of neuron $j$: $\xi_j = \sum_{i \in j^{\leftarrow}} w_{ji} y_i$
- activation function $\sigma_j$ for neuron $j$ (an arbitrary differentiable function) [e.g. the logistic sigmoid $\sigma_j(\xi) = \frac{1}{1+e^{-\lambda_j \xi}}$]
- state of a non-input neuron $j \in Z \setminus X$ after the computation stops: $y_j = \sigma_j(\xi_j)$ ($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes write $y_j(\vec{w}, \vec{x})$)

The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$. Layer-wise computation: first, all input neurons are assigned the values of the input; in the $\ell$-th step, all neurons of the $\ell$-th layer are evaluated.

MLP – learning

Learning: Given a training set $T$ of the form
$$T = \{(\vec{x}_k, \vec{d}_k) \mid k = 1, \dots, p\}$$
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron $j$ for a given network input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
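Below is a minimal sketch of the layer-wise computation in matrix form, assuming a fully connected MLP with logistic sigmoid activations; `weights[l][j, i]` plays the role of $w_{ji}$ between consecutive layers, with column 0 holding the bias weights $w_{j0}$. The names and the 2-4-3-2 example are mine:

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * xi))

def forward(weights, x):
    """Layer-wise forward pass of a fully connected MLP.

    weights: list of matrices; weights[l][j, i] is w_ji between layers l and l+1,
             with column 0 holding the bias weights w_j0 (formal unit input).
    x:       input vector (values of the input neurons).
    Returns the list of layer outputs; the last entry is the network output.
    """
    ys = [x]
    for W in weights:
        y_ext = np.concatenate(([1.0], ys[-1]))   # prepend the formal unit input y_0 = 1
        ys.append(sigmoid(W @ y_ext))             # xi_j = sum_i w_ji y_i, then y_j = sigma(xi_j)
    return ys

# Example: a 2-4-3-2 architecture with random weights close to 0
rng = np.random.default_rng(1)
sizes = [2, 4, 3, 2]
weights = [0.1 * rng.standard_normal((m, n + 1)) for n, m in zip(sizes, sizes[1:])]
print(forward(weights, np.array([0.5, -0.3]))[-1])
```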
Error function:
$$E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w}) \quad \text{where} \quad E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \big( y_j(\vec{w}, \vec{x}_k) - d_{kj} \big)^2$$

MLP – learning algorithm

Batch algorithm (gradient descent): The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$$
is the weight update of $w_{ji}$ in the step $t+1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.

Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.

MLP – error function gradient

For every $w_{ji}$ we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every $k = 1, \dots, p$ it holds that
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$$
and for every $j \in Z \setminus X$ we get
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)

If $\sigma_j(\xi) = \frac{1}{1+e^{-\lambda_j \xi}}$ for all $j \in Z$, then
$$\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$$
and thus for all $j \in Z \setminus X$:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$

If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then
$$\sigma_j'(\xi_j) = \frac{b}{a}(a - y_j)(a + y_j)$$
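A quick sanity check of the identity $\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$ against a central finite difference (a sketch; the numbers are arbitrary):

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * xi))

lam, xi, h = 2.0, 0.7, 1e-6
y = sigmoid(xi, lam)
analytic = lam * y * (1 - y)                                  # lambda * y * (1 - y)
numeric = (sigmoid(xi + h, lam) - sigmoid(xi - h, lam)) / (2 * h)
print(analytic, numeric)                                      # the two values agree closely
```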
MLP – computing the gradient

Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:

Initialize $\mathcal{E}_{ji} := 0$. (By the end of the computation: $\mathcal{E}_{ji} = \frac{\partial E}{\partial w_{ji}}$.)

For every $k = 1, \dots, p$ do:
1. forward pass: compute $y_j = y_j(\vec{w}, \vec{x}_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see below)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
4. $\mathcal{E}_{ji} := \mathcal{E}_{ji} + \frac{\partial E_k}{\partial w_{ji}}$

The resulting $\mathcal{E}_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.

MLP – backpropagation

Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
- if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
- if $j \in Z \setminus (Y \cup X)$, then, assuming that $j$ is in the $\ell$-th layer and that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the $(\ell+1)$-st layer, compute
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$$
(This works because all neurons $r \in j^{\rightarrow}$ belong to the $(\ell+1)$-st layer.)
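Putting the forward pass, backpropagation, and the accumulation steps together, here is a compact matrix-form sketch of the batch gradient computation for logistic sigmoid activations with $\lambda_j = 1$ and the squared error above; all names are mine, and the per-neuron sums are vectorized per layer:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def forward(weights, x):
    """Forward pass; returns the outputs of all layers (input layer first)."""
    ys = [x]
    for W in weights:
        y_ext = np.concatenate(([1.0], ys[-1]))
        ys.append(sigmoid(W @ y_ext))
    return ys

def gradient(weights, xs, ds):
    """Batch gradient dE/dw_ji of E(w) = sum_k 1/2 * ||y(w, x_k) - d_k||^2."""
    grads = [np.zeros_like(W) for W in weights]          # the accumulators E_ji
    for x, d in zip(xs, ds):
        ys = forward(weights, x)                         # 1. forward pass
        dE_dy = ys[-1] - d                               # 2. backward pass: output layer
        for l in reversed(range(len(weights))):
            delta = dE_dy * ys[l + 1] * (1 - ys[l + 1])  # dE_k/dy_j * sigma'(xi_j) = dE_k/dy_j * y_j(1-y_j)
            y_ext = np.concatenate(([1.0], ys[l]))
            grads[l] += np.outer(delta, y_ext)           # 3.-4. accumulate dE_k/dw_ji = delta_j * y_i
            dE_dy = weights[l][:, 1:].T @ delta          # backpropagate dE_k/dy_j to layer l
    return grads

# Tiny usage example with a 2-3-1 network
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [0.1 * rng.standard_normal((m, n + 1)) for n, m in zip(sizes, sizes[1:])]
xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
ds = [np.array([1.0]), np.array([1.0])]
print(gradient(weights, xs, ds)[0])
```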
Complexity of the batch algorithm

The computation of $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set (assuming unit cost of operations, including the computation of $\sigma_r'(\xi_r)$ for a given $\xi_r$).

Proof sketch: The algorithm does the following $p$ times:
1. the forward pass, i.e. computes $y_j(\vec{w}, \vec{x}_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $\mathcal{E}_{ji}$ (a constant-time operation in the unit cost framework)

The steps 1.-3. take linear time.

Note that the speed of convergence of the gradient descent cannot be estimated ...

Illustration of the gradient descent – XOR

Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork

MLP – learning algorithm

Online algorithm: The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$$
is the weight update of $w_{ji}$ in the step $t+1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.

There are other variants determined by the selection of the training examples used for the error computation (more on this later).

SGD

- The weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0.
- In the step $t+1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:
  - Choose (randomly) a set of training examples $T \subseteq \{1, \dots, p\}$.
  - Compute $\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta\vec{w}^{(t)}$ where $\Delta\vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)})$.
  - $0 < \varepsilon(t) \le 1$ is the learning rate in the step $t+1$.
  - $\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example $k$.

Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
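A sketch of the shuffle-then-slice minibatch scheme described in the note above, with a generic per-example gradient callback; `sgd`, `grad_Ek`, and the toy problem are my own illustration:

```python
import numpy as np

def sgd(w, grad_Ek, p, epochs=10, batch_size=8, eps=0.1, seed=0):
    """Minibatch SGD: shuffle all example indices, then take minibatches sequentially.

    w:        initial weight vector (NumPy array)
    grad_Ek:  function (w, k) -> gradient of the error of example k
    p:        number of training examples
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        perm = rng.permutation(p)                  # random shuffle of all data
        for start in range(0, p, batch_size):
            T = perm[start:start + batch_size]     # next minibatch
            step = sum(grad_Ek(w, k) for k in T)   # sum of per-example gradients over T
            w = w - eps * step                     # w(t+1) = w(t) - eps(t) * sum_k grad E_k
    return w

# Toy usage: least squares on d_k = 2*x_k, where E_k = 1/2 (w*x_k - d_k)^2
x = np.linspace(-1, 1, 50)
d = 2 * x
w = sgd(np.array([0.0]), lambda w, k: (w * x[k] - d[k]) * x[k], p=50)
print(w)   # approaches [2.]
```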