Introduction to Neural Networks
Brian Thompson (slides by Philipp Koehn)
27 September 2018

Linear Models
• We previously used a weighted linear combination of feature values h_j and weights λ_j
  score(λ, d_i) = Σ_j λ_j h_j(d_i)
• Such models can be illustrated as a "network"

Limits of Linearity
• We can give each feature a weight
• But we cannot express more complex value relationships, e.g.,
  – any value in the range [0; 5] is equally good
  – values over 8 are bad
  – higher than 10 is not worse

XOR
• Linear models cannot model XOR
  [Figure: the four XOR input points, labelled bad / good / good / bad]

Multiple Layers
• Add an intermediate ("hidden") layer of processing (each arrow is a weight)
• Have we gained anything so far?

Non-Linearity
• Instead of computing a linear combination
  score(λ, d_i) = Σ_j λ_j h_j(d_i)
• Add a non-linear function
  score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )
• Popular choices: tanh(x), sigmoid(x) = 1 / (1 + e^{−x}), relu(x) = max(0, x)
  (sigmoid is also called the "logistic function")

Deep Learning
• More layers = deep learning

What Depth Holds
• Each layer is a processing step
• Having multiple processing steps allows complex functions
• Metaphor: neural networks and computing circuits
  – computer = sequence of Boolean gates
  – neural computer = sequence of layers
• Deep neural networks can implement complex functions, e.g., sorting of input values

example

Simple Neural Network
[Figure: network with inputs A, B and bias unit C, hidden units D, E and bias unit F, output G;
 weights w_AD = 3.7, w_BD = 3.7, w_CD = −1.5, w_AE = 2.9, w_BE = 2.9, w_CE = −4.5,
 w_DG = 4.5, w_EG = −5.2, w_FG = −2.0]
• One innovation: bias units (no inputs, always value 1)

Sample Input
• Try out two input values: x0 = 1.0, x1 = 0.0
• Hidden unit computation
  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^{−2.2}) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17

Compute Output
• Output unit computation
  sigmoid(.90 × 4.5 + .17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^{−1.17}) = 0.76
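The worked forward pass above is easy to check mechanically. The following is a minimal numpy sketch of the same computation; it is not part of the original slides, and treating the bias unit as an extra input fixed to 1 is just one convenient encoding.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hidden layer: one row per hidden unit (D, E); columns are inputs x0, x1 and the bias unit
W_hidden = np.array([[3.7, 3.7, -1.5],
                     [2.9, 2.9, -4.5]])
# Output layer: weights from hidden units D, E and the bias unit to output G
w_output = np.array([4.5, -5.2, -2.0])

x = np.array([1.0, 0.0, 1.0])               # inputs x0 = 1.0, x1 = 0.0, plus bias value 1
h = sigmoid(W_hidden @ x)                    # hidden values, approx. [0.90, 0.17]
y = sigmoid(w_output @ np.append(h, 1.0))    # output value, approx. 0.76
print(h, y)
```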
Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
  0          0          0.12        0.02        0.18 → 0
  0          1          0.88        0.27        0.74 → 1
  1          0          0.73        0.12        0.74 → 1
  1          1          0.99        0.73        0.33 → 0

• The network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – the final layer computes h0 − h1
• Power of deep neural networks: chaining of processing steps
  just as: more Boolean circuits → more complex computations possible

why "neural" networks?

Neuron in the Brain
• The human brain is made up of about 100 billion neurons
  [Figure: neuron with soma, nucleus, dendrites, axon, and axon terminal]
• Neurons receive electric signals at the dendrites and send them to the axon

Neural Communication
• The axon of a neuron is connected to the dendrites of many other neurons
  [Figure: synapse with axon terminal, synaptic vesicles, voltage-gated Ca++ channel, neurotransmitter, neurotransmitter transporter, synaptic cleft, receptors, postsynaptic density, dendrite]

The Brain vs. Artificial Neural Networks
• Similarities
  – neurons, connections between neurons
  – learning = change of connections, not change of neurons
  – massively parallel processing
• But artificial neural networks are much simpler
  – computation within a neuron is vastly simplified
  – discrete time steps
  – typically some form of supervised learning with a massive number of stimuli

back-propagation training

Error
• Computed output: y = .76
• Correct output: t = 1.0
⇒ How do we adjust the weights?
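Before deriving the answer analytically, one can already see what "adjusting a weight" means by perturbing a single weight and watching the error. This is a small numerical sketch, not in the original slides; the weight name w_GD follows the node labels used in the worked example later on.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error(w_GD):
    """Error of the example network as a function of the single weight w_GD."""
    h_D = sigmoid(1.0 * 3.7 + 0.0 * 3.7 + 1 * -1.5)
    h_E = sigmoid(1.0 * 2.9 + 0.0 * 2.9 + 1 * -4.5)
    y = sigmoid(h_D * w_GD + h_E * -5.2 + 1 * -2.0)
    return 0.5 * (1.0 - y) ** 2               # L2 error against target t = 1.0

# Finite-difference estimate of dE/dw_GD at the current value w_GD = 4.5
eps = 1e-5
gradient = (error(4.5 + eps) - error(4.5 - eps)) / (2 * eps)
print(gradient)   # about -0.04: the error decreases if we increase w_GD
```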
Key Concepts
• Gradient descent
  – error is a function of the weights
  – we want to reduce the error
  – gradient descent: move towards the error minimum
  – compute gradient → get direction to the error minimum
  – adjust weights towards direction of lower error
• Back-propagation
  – first adjust last set of weights
  – propagate error back to each previous layer
  – adjust their weights

Gradient Descent
[Figure: error(λ) as a function of a single weight λ, with the gradient at the current λ pointing towards the optimal λ]
[Figure: error surface over two weights w1 and w2, with the gradient for w1, the gradient for w2, the combined gradient, the current point, and the optimum]

Derivative of Sigmoid
• Sigmoid
  sigmoid(x) = 1 / (1 + e^{−x})
• Reminder: quotient rule
  (f(x) / g(x))′ = (g(x) f′(x) − f(x) g′(x)) / g(x)²
• Derivative
  d sigmoid(x)/dx = d/dx [ 1 / (1 + e^{−x}) ]
                  = (0 × (1 + e^{−x}) − (−e^{−x})) / (1 + e^{−x})²
                  = (1 / (1 + e^{−x})) × (e^{−x} / (1 + e^{−x}))
                  = (1 / (1 + e^{−x})) × (1 − 1 / (1 + e^{−x}))
                  = sigmoid(x)(1 − sigmoid(x))

Final Layer Update
• Linear combination of weights: s = Σ_k w_k h_k
• Activation function: y = sigmoid(s)
• Error (L2 norm): E = 1/2 (t − y)²
• Derivative of error with regard to one weight w_k (chain rule)
  dE/dw_k = dE/dy · dy/ds · ds/dw_k
• Error E is defined with respect to y
  dE/dy = d/dy [ 1/2 (t − y)² ] = −(t − y)
• y with respect to s is sigmoid(s)
  dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
• s is the weighted linear combination of the hidden node values h_k
  ds/dw_k = d/dw_k Σ_k w_k h_k = h_k

Putting it All Together
• Derivative of error with regard to one weight w_k
  dE/dw_k = dE/dy · dy/ds · ds/dw_k = −(t − y) · y(1 − y) · h_k
  – error: (t − y)
  – derivative of sigmoid: y′ = y(1 − y)
• Weight adjustment will be scaled by a fixed learning rate µ
  ∆w_k = µ (t − y) y′ h_k

Multiple Output Nodes
• Our example only had one output node
• Typically neural networks have multiple output nodes
• Error is computed over all j output nodes
  E = Σ_j 1/2 (t_j − y_j)²
• Weights k → j are adjusted according to the node they point to
  ∆w_{j←k} = µ (t_j − y_j) y′_j h_k
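The per-weight update formula vectorizes naturally over output nodes. The helper below is a minimal sketch of my own (not from the slides), assuming sigmoid output units and the L2 error defined above.

```python
import numpy as np

def final_layer_update(t, y, h, mu):
    """One gradient-descent step for the output-layer weights.

    t, y : target and computed output vectors (one entry per output node)
    h    : hidden-layer values feeding the output layer (including the bias unit)
    mu   : learning rate
    Returns the update matrix with entry [j, k] = mu * (t_j - y_j) * y_j * (1 - y_j) * h_k.
    """
    delta = (t - y) * y * (1 - y)        # error term delta_j for each output node
    return mu * np.outer(delta, h)       # Delta w_{j<-k}

# Running example: one output node, target 1.0, computed 0.76, hidden values .90, .17, bias 1
print(final_layer_update(np.array([1.0]), np.array([0.76]),
                         np.array([0.90, 0.17, 1.0]), mu=10))
# approx. [[0.39, 0.07, 0.44]]
```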
Hidden Layer Update
• In a hidden layer, we do not have a target output value
• But we can compute how much each node contributed to the downstream error
• Definition of the error term of each output node
  δ_j = (t_j − y_j) y′_j
• Back-propagate the error term
  (why this way? there is math to back it up...)
  δ_i = ( Σ_j w_{j←i} δ_j ) y′_i
• Universal update formula
  ∆w_{j←k} = µ δ_j h_k

Our Example
[Figure: the example network with inputs A, B, bias C, hidden units D, E, bias F, and output G]
• Computed output: y = .76
• Correct output: t = 1.0
• Final layer weight updates (learning rate µ = 10)
  – δ_G = (t − y) y′ = (1 − .76) × 0.181 = .0434
  – ∆w_GD = µ δ_G h_D = 10 × .0434 × .90 = .391
  – ∆w_GE = µ δ_G h_E = 10 × .0434 × .17 = .074
  – ∆w_GF = µ δ_G h_F = 10 × .0434 × 1 = .434
• Updated output-layer weights: w_GD = 4.891, w_GE = −5.126, w_GF = −1.566

Hidden Layer Updates
• Hidden node D
  – δ_D = ( Σ_j w_{j←D} δ_j ) y′_D = w_GD δ_G y′_D = 4.5 × .0434 × .0898 = .0175
  – ∆w_DA = µ δ_D h_A = 10 × .0175 × 1.0 = .175
  – ∆w_DB = µ δ_D h_B = 10 × .0175 × 0.0 = 0
  – ∆w_DC = µ δ_D h_C = 10 × .0175 × 1 = .175
• Hidden node E
  – δ_E = ( Σ_j w_{j←E} δ_j ) y′_E = w_GE δ_G y′_E = −5.2 × .0434 × 0.2055 = −.0464
  – ∆w_EA = µ δ_E h_A = 10 × −.0464 × 1.0 = −.464
  – etc.

some additional aspects

Initialization of Weights
• Weights are initialized randomly, e.g., uniformly from the interval [−0.01, 0.01]
• Glorot and Bengio (2010) suggest
  – for shallow neural networks: [ −1/√n, 1/√n ]
    where n is the size of the previous layer
  – for deep neural networks: [ −√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1}) ]
    where n_j is the size of the previous layer and n_{j+1} the size of the next layer

Neural Networks for Classification
• Predict class: one output node per class
• Training data output: "one-hot vector", e.g., y = (0, 0, 1)ᵀ
• Prediction
  – predicted class is the output node y_i with the highest value
  – obtain a posterior probability distribution by softmax
    softmax(y_i) = e^{y_i} / Σ_j e^{y_j}

Problems with Gradient Descent Training
[Figure: error(λ) curve illustrating a too high learning rate]
[Figure: error(λ) curve illustrating bad initialization]
[Figure: error(λ) curve with a local optimum trapping the search away from the global optimum]
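Putting the whole back-propagation recipe together, the following is a minimal sketch (not from the slides) that trains a two-hidden-unit network on XOR with plain gradient descent. As the slides above warn, training can land in a local optimum, so a different random seed or re-initialization is sometimes needed; the learning rate and number of epochs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# XOR training data; the third input column is the always-on bias unit
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

# Random initialization of both weight layers
W1 = rng.uniform(-1.0, 1.0, (2, 3))   # inputs (+ bias) -> 2 hidden units
w2 = rng.uniform(-1.0, 1.0, 3)        # hidden units (+ bias) -> 1 output

mu = 1.0                              # learning rate
for epoch in range(10000):
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x)
        hb = np.append(h, 1.0)                       # append bias unit
        y = sigmoid(w2 @ hb)
        # error terms (back-propagation)
        delta_out = (t - y) * y * (1 - y)
        delta_hid = w2[:2] * delta_out * h * (1 - h)
        # weight updates
        w2 += mu * delta_out * hb
        W1 += mu * np.outer(delta_hid, x)

outputs = [sigmoid(w2 @ np.append(sigmoid(W1 @ x), 1.0)) for x in X]
print(np.round(outputs, 2))   # ideally close to [0, 1, 1, 0]
```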
Speedup: Momentum Term
• Updates may move a weight slowly in one direction
• To speed this up, we can keep a memory of prior updates ∆w_{j←k}(n − 1)
• ... and add these to any new updates (with decay factor ρ)
  ∆w_{j←k}(n) = µ δ_j h_k + ρ ∆w_{j←k}(n − 1)

Adagrad
• Typically the learning rate µ is reduced over time
  – at the beginning, things have to change a lot
  – later, just fine-tuning
• Adagrad adapts the learning rate per parameter
• Adagrad update, based on the gradient of the error E with respect to the weight w at time t,
  g_t = dE/dw:
  ∆w_t = ( µ / √( Σ_{τ=1}^{t} g_τ² ) ) g_t

Dropout
• A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)
• Solution: regularization, e.g., keeping weights from having extreme values
• Dropout: randomly remove some hidden units during training
  – mask: set of hidden units dropped
  – randomly generate, say, 10–20 masks
  – alternate between the masks during training
• Why does that work? → bagging, ensembles, ...

Mini Batches
• Each training example yields a set of weight updates ∆w_i
• Batch up several training examples
  – sum up their updates
  – apply the sum to the model
• Mostly done for speed reasons

computational aspects

Vector and Matrix Multiplications
• Forward computation: s = W h
• Activation function: y = sigmoid(s)
• Error term: δ = (t − y) · sigmoid′(s)
• Propagation of error term: δ_i = Wᵀ δ_{i+1} · sigmoid′(s)
• Weight updates: ∆W = µ δ hᵀ

GPU
• Neural network layers may have, say, 200 nodes
• Computations such as W h then require 200 × 200 = 40,000 multiplications
• Graphics Processing Units (GPUs) are designed for such computations
  – image rendering requires such vector and matrix operations
  – massively multi-core, but lean processing units
  – example: NVIDIA Tesla K20c GPU provides 2496 thread processors
• Extensions to C to support programming of GPUs, such as CUDA

Toolkits
• Theano
• Tensorflow (Google)
• PyTorch (Facebook)
• MXNet (Amazon)
• DyNet
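To connect the toolkits above with the rest of the lecture, here is a minimal PyTorch sketch (not part of the slides) of the same kind of XOR network. The framework handles the forward computation, back-propagation, and the momentum term; the learning rate, momentum value, and number of steps are illustrative choices, and training can still get stuck in a local optimum.

```python
import torch

# XOR data
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = torch.tensor([[0.], [1.], [1.], [0.]])

# Same architecture as the running example: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2),    # bias units are included by default
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),
    torch.nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9)
loss_fn = torch.nn.MSELoss()

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), T)   # forward pass and error
    loss.backward()               # back-propagation of the error
    optimizer.step()              # gradient descent update (with momentum term)

print(model(X).detach().round())  # ideally [[0], [1], [1], [0]]
```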