Introduction to Neural Networks
Brian Thompson (slides by Philipp Koehn)
27 September 2018

Linear Models
• We previously used a weighted linear combination of feature values h_j and weights λ_j
  score(λ, d_i) = Σ_j λ_j h_j(d_i)
• Such models can be illustrated as a "network"

Limits of Linearity
• We can give each feature a weight
• But we cannot express more complex value relationships, e.g.,
  – any value in the range [0; 5] is equally good
  – values over 8 are bad
  – higher than 10 is not worse

XOR
• Linear models cannot model XOR
  [Figure: the four XOR input points, labelled bad / good / good / bad]

Multiple Layers
• Add an intermediate ("hidden") layer of processing (each arrow is a weight)
• Have we gained anything so far?

Non-Linearity
• Instead of computing a linear combination
  score(λ, d_i) = Σ_j λ_j h_j(d_i)
• Add a non-linear function
  score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )
• Popular choices: tanh(x), sigmoid(x) = 1 / (1 + e^{−x}), relu(x) = max(0, x)
  (sigmoid is also called the "logistic function")

Deep Learning
• More layers = deep learning

What Depth Holds
• Each layer is a processing step
• Having multiple processing steps allows complex functions
• Metaphor: neural networks and computing circuits
  – computer = sequence of Boolean gates
  – neural computer = sequence of layers
• Deep neural networks can implement complex functions, e.g., sorting of input values

example

Simple Neural Network
[Figure: network with inputs A, B and bias unit C, hidden units D, E and bias unit F, output G;
 weights w_AD = 3.7, w_BD = 3.7, w_CD = −1.5, w_AE = 2.9, w_BE = 2.9, w_CE = −4.5,
 w_DG = 4.5, w_EG = −5.2, w_FG = −2.0]
• One innovation: bias units (no inputs, always value 1)

Sample Input
• Try out two input values: x0 = 1.0, x1 = 0.0
• Hidden unit computation
  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^{−2.2}) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17

Compute Output
• Output unit computation
  sigmoid(.90 × 4.5 + .17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^{−1.17}) = 0.76
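The worked forward pass above is easy to check mechanically. The following is a minimal numpy sketch of the same computation; it is not part of the original slides, and treating the bias unit as an extra input fixed to 1 is just one convenient encoding.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hidden layer: one row per hidden unit (D, E); columns are inputs x0, x1 and the bias unit
W_hidden = np.array([[3.7, 3.7, -1.5],
                     [2.9, 2.9, -4.5]])
# Output layer: weights from hidden units D, E and the bias unit to output G
w_output = np.array([4.5, -5.2, -2.0])

x = np.array([1.0, 0.0, 1.0])               # inputs x0 = 1.0, x1 = 0.0, plus bias value 1
h = sigmoid(W_hidden @ x)                    # hidden values, approx. [0.90, 0.17]
y = sigmoid(w_output @ np.append(h, 1.0))    # output value, approx. 0.76
print(h, y)
```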
Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
  0          0          0.12        0.02        0.18 → 0
  0          1          0.88        0.27        0.74 → 1
  1          0          0.73        0.12        0.74 → 1
  1          1          0.99        0.73        0.33 → 0

• The network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – the final layer computes h0 − h1
• Power of deep neural networks: chaining of processing steps
  just as: more Boolean circuits → more complex computations possible

why "neural" networks?

Neuron in the Brain
• The human brain is made up of about 100 billion neurons
  [Figure: neuron with soma, nucleus, dendrites, axon, and axon terminal]
• Neurons receive electric signals at the dendrites and send them to the axon

Neural Communication
• The axon of a neuron is connected to the dendrites of many other neurons
  [Figure: synapse with axon terminal, synaptic vesicles, voltage-gated Ca++ channel, neurotransmitter, neurotransmitter transporter, synaptic cleft, receptors, postsynaptic density, dendrite]

The Brain vs. Artificial Neural Networks
• Similarities
  – neurons, connections between neurons
  – learning = change of connections, not change of neurons
  – massively parallel processing
• But artificial neural networks are much simpler
  – computation within a neuron is vastly simplified
  – discrete time steps
  – typically some form of supervised learning with a massive number of stimuli

back-propagation training

Error
• Computed output: y = .76
• Correct output: t = 1.0
⇒ How do we adjust the weights?
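Before deriving the answer analytically, one can already see what "adjusting a weight" means by perturbing a single weight and watching the error. This is a small numerical sketch, not in the original slides; the weight name w_GD follows the node labels used in the worked example later on.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error(w_GD):
    """Error of the example network as a function of the single weight w_GD."""
    h_D = sigmoid(1.0 * 3.7 + 0.0 * 3.7 + 1 * -1.5)
    h_E = sigmoid(1.0 * 2.9 + 0.0 * 2.9 + 1 * -4.5)
    y = sigmoid(h_D * w_GD + h_E * -5.2 + 1 * -2.0)
    return 0.5 * (1.0 - y) ** 2               # L2 error against target t = 1.0

# Finite-difference estimate of dE/dw_GD at the current value w_GD = 4.5
eps = 1e-5
gradient = (error(4.5 + eps) - error(4.5 - eps)) / (2 * eps)
print(gradient)   # about -0.04: the error decreases if we increase w_GD
```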
Key Concepts
• Gradient descent
  – error is a function of the weights
  – we want to reduce the error
  – gradient descent: move towards the error minimum
  – compute gradient → get direction to the error minimum
  – adjust weights towards direction of lower error
• Back-propagation
  – first adjust last set of weights
  – propagate error back to each previous layer
  – adjust their weights

Gradient Descent
[Figure: error(λ) as a function of a single weight λ, with the gradient at the current λ pointing towards the optimal λ]
[Figure: error surface over two weights w1 and w2, with the gradient for w1, the gradient for w2, the combined gradient, the current point, and the optimum]

Derivative of Sigmoid
• Sigmoid
  sigmoid(x) = 1 / (1 + e^{−x})
• Reminder: quotient rule
  (f(x) / g(x))′ = (g(x) f′(x) − f(x) g′(x)) / g(x)²
• Derivative
  d sigmoid(x)/dx = d/dx [ 1 / (1 + e^{−x}) ]
                  = (0 × (1 + e^{−x}) − (−e^{−x})) / (1 + e^{−x})²
                  = (1 / (1 + e^{−x})) × (e^{−x} / (1 + e^{−x}))
                  = (1 / (1 + e^{−x})) × (1 − 1 / (1 + e^{−x}))
                  = sigmoid(x)(1 − sigmoid(x))

Final Layer Update
• Linear combination of weights: s = Σ_k w_k h_k
• Activation function: y = sigmoid(s)
• Error (L2 norm): E = 1/2 (t − y)²
• Derivative of error with regard to one weight w_k (chain rule)
  dE/dw_k = dE/dy · dy/ds · ds/dw_k
• Error E is defined with respect to y
  dE/dy = d/dy [ 1/2 (t − y)² ] = −(t − y)
• y with respect to s is sigmoid(s)
  dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
• s is the weighted linear combination of the hidden node values h_k
  ds/dw_k = d/dw_k Σ_k w_k h_k = h_k

Putting it All Together
• Derivative of error with regard to one weight w_k
  dE/dw_k = dE/dy · dy/ds · ds/dw_k = −(t − y) · y(1 − y) · h_k
  – error: (t − y)
  – derivative of sigmoid: y′ = y(1 − y)
• Weight adjustment will be scaled by a fixed learning rate µ
  ∆w_k = µ (t − y) y′ h_k

Multiple Output Nodes
• Our example only had one output node
• Typically neural networks have multiple output nodes
• Error is computed over all j output nodes
  E = Σ_j 1/2 (t_j − y_j)²
• Weights k → j are adjusted according to the node they point to
  ∆w_{j←k} = µ (t_j − y_j) y′_j h_k
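The per-weight update formula vectorizes naturally over output nodes. The helper below is a minimal sketch of my own (not from the slides), assuming sigmoid output units and the L2 error defined above.

```python
import numpy as np

def final_layer_update(t, y, h, mu):
    """One gradient-descent step for the output-layer weights.

    t, y : target and computed output vectors (one entry per output node)
    h    : hidden-layer values feeding the output layer (including the bias unit)
    mu   : learning rate
    Returns the update matrix with entry [j, k] = mu * (t_j - y_j) * y_j * (1 - y_j) * h_k.
    """
    delta = (t - y) * y * (1 - y)        # error term delta_j for each output node
    return mu * np.outer(delta, h)       # Delta w_{j<-k}

# Running example: one output node, target 1.0, computed 0.76, hidden values .90, .17, bias 1
print(final_layer_update(np.array([1.0]), np.array([0.76]),
                         np.array([0.90, 0.17, 1.0]), mu=10))
# approx. [[0.39, 0.07, 0.44]]
```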
Hidden Layer Update
• In a hidden layer, we do not have a target output value
• But we can compute how much each node contributed to the downstream error
• Definition of the error term of each output node
  δ_j = (t_j − y_j) y′_j
• Back-propagate the error term
  (why this way? there is math to back it up...)
  δ_i = ( Σ_j w_{j←i} δ_j ) y′_i
• Universal update formula
  ∆w_{j←k} = µ δ_j h_k

Our Example
[Figure: the example network with inputs A, B, bias C, hidden units D, E, bias F, and output G]
• Computed output: y = .76
• Correct output: t = 1.0
• Final layer weight updates (learning rate µ = 10)
  – δ_G = (t − y) y′ = (1 − .76) × 0.181 = .0434
  – ∆w_GD = µ δ_G h_D = 10 × .0434 × .90 = .391
  – ∆w_GE = µ δ_G h_E = 10 × .0434 × .17 = .074
  – ∆w_GF = µ δ_G h_F = 10 × .0434 × 1 = .434
• Updated output-layer weights: w_GD = 4.891, w_GE = −5.126, w_GF = −1.566

Hidden Layer Updates
• Hidden node D
  – δ_D = ( Σ_j w_{j←D} δ_j ) y′_D = w_GD δ_G y′_D = 4.5 × .0434 × .0898 = .0175
  – ∆w_DA = µ δ_D h_A = 10 × .0175 × 1.0 = .175
  – ∆w_DB = µ δ_D h_B = 10 × .0175 × 0.0 = 0
  – ∆w_DC = µ δ_D h_C = 10 × .0175 × 1 = .175
• Hidden node E
  – δ_E = ( Σ_j w_{j←E} δ_j ) y′_E = w_GE δ_G y′_E = −5.2 × .0434 × 0.2055 = −.0464
  – ∆w_EA = µ δ_E h_A = 10 × −.0464 × 1.0 = −.464
  – etc.

some additional aspects

Initialization of Weights
• Weights are initialized randomly, e.g., uniformly from the interval [−0.01, 0.01]
• Glorot and Bengio (2010) suggest
  – for shallow neural networks: [ −1/√n, 1/√n ]
    where n is the size of the previous layer
  – for deep neural networks: [ −√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1}) ]
    where n_j is the size of the previous layer and n_{j+1} the size of the next layer

Neural Networks for Classification
• Predict class: one output node per class
• Training data output: "one-hot vector", e.g., y = (0, 0, 1)ᵀ
• Prediction
  – predicted class is the output node y_i with the highest value
  – obtain a posterior probability distribution by softmax
    softmax(y_i) = e^{y_i} / Σ_j e^{y_j}

Problems with Gradient Descent Training
[Figure: error(λ) curve illustrating a too high learning rate]
[Figure: error(λ) curve illustrating bad initialization]
[Figure: error(λ) curve with a local optimum trapping the search away from the global optimum]
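Putting the whole back-propagation recipe together, the following is a minimal sketch (not from the slides) that trains a two-hidden-unit network on XOR with plain gradient descent. As the slides above warn, training can land in a local optimum, so a different random seed or re-initialization is sometimes needed; the learning rate and number of epochs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# XOR training data; the third input column is the always-on bias unit
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

# Random initialization of both weight layers
W1 = rng.uniform(-1.0, 1.0, (2, 3))   # inputs (+ bias) -> 2 hidden units
w2 = rng.uniform(-1.0, 1.0, 3)        # hidden units (+ bias) -> 1 output

mu = 1.0                              # learning rate
for epoch in range(10000):
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x)
        hb = np.append(h, 1.0)                       # append bias unit
        y = sigmoid(w2 @ hb)
        # error terms (back-propagation)
        delta_out = (t - y) * y * (1 - y)
        delta_hid = w2[:2] * delta_out * h * (1 - h)
        # weight updates
        w2 += mu * delta_out * hb
        W1 += mu * np.outer(delta_hid, x)

outputs = [sigmoid(w2 @ np.append(sigmoid(W1 @ x), 1.0)) for x in X]
print(np.round(outputs, 2))   # ideally close to [0, 1, 1, 0]
```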
Speedup: Momentum Term
• Updates may move a weight slowly in one direction
• To speed this up, we can keep a memory of prior updates ∆w_{j←k}(n − 1)
• ... and add these to any new updates (with decay factor ρ)
  ∆w_{j←k}(n) = µ δ_j h_k + ρ ∆w_{j←k}(n − 1)

Adagrad
• Typically the learning rate µ is reduced over time
  – at the beginning, things have to change a lot
  – later, just fine-tuning
• Adagrad adapts the learning rate per parameter
• Adagrad update, based on the gradient of the error E with respect to the weight w at time t,
  g_t = dE/dw:
  ∆w_t = ( µ / √( Σ_{τ=1}^{t} g_τ² ) ) g_t

Dropout
• A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)
• Solution: regularization, e.g., keeping weights from having extreme values
• Dropout: randomly remove some hidden units during training
  – mask: set of hidden units dropped
  – randomly generate, say, 10–20 masks
  – alternate between the masks during training
• Why does that work? → bagging, ensembles, ...

Mini Batches
• Each training example yields a set of weight updates ∆w_i
• Batch up several training examples
  – sum up their updates
  – apply the sum to the model
• Mostly done for speed reasons

computational aspects

Vector and Matrix Multiplications
• Forward computation: s = W h
• Activation function: y = sigmoid(s)
• Error term: δ = (t − y) · sigmoid′(s)
• Propagation of error term: δ_i = Wᵀ δ_{i+1} · sigmoid′(s)
• Weight updates: ∆W = µ δ hᵀ

GPU
• Neural network layers may have, say, 200 nodes
• Computations such as W h then require 200 × 200 = 40,000 multiplications
• Graphics Processing Units (GPUs) are designed for such computations
  – image rendering requires such vector and matrix operations
  – massively multi-core, but lean processing units
  – example: NVIDIA Tesla K20c GPU provides 2496 thread processors
• Extensions to C to support programming of GPUs, such as CUDA

Toolkits
• Theano
• Tensorflow (Google)
• PyTorch (Facebook)
• MXNet (Amazon)
• DyNet
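To connect the toolkits above with the rest of the lecture, here is a minimal PyTorch sketch (not part of the slides) of the same kind of XOR network. The framework handles the forward computation, back-propagation, and the momentum term; the learning rate, momentum value, and number of steps are illustrative choices, and training can still get stuck in a local optimum.

```python
import torch

# XOR data
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = torch.tensor([[0.], [1.], [1.], [0.]])

# Same architecture as the running example: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2),    # bias units are included by default
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),
    torch.nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9)
loss_fn = torch.nn.MSELoss()

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), T)   # forward pass and error
    loss.backward()               # back-propagation of the error
    optimizer.step()              # gradient descent update (with momentum term)

print(model(X).detach().round())  # ideally [[0], [1], [1], [0]]
```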