Computation Graphs

Philipp Koehn

29 September 2020


Neural Network Cartoon

• A common way to illustrate a neural network

  [Figure: feed-forward network with input layer x, hidden layer h, and output layer y]


Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)


Computation Graph

  [Figure: the same network as a computation graph: x → prod(W1) → sum(b1) → sigmoid → prod(W2) → sum(b2) → sigmoid]


Simple Neural Network

  [Figure: example network with bias nodes and concrete weight values: hidden layer weights (3.7, 3.7) and (2.9, 2.9) with biases (−1.5, −4.6); output layer weights (4.5, −5.2) with bias −2.0]


Computation Graph

  [Figure: the computation graph annotated with these weight values: W1 = ((3.7, 3.7), (2.9, 2.9)), b1 = (−1.5, −4.6), W2 = (4.5, −5.2), b2 = −2.0]


Processing Input

  [Figure sequence: forward pass for the input x = (1.0, 0.0): prod W1 gives (3.7, 2.9); sum b1 gives (2.2, −1.6); sigmoid gives (.900, .168); prod W2 gives 3.18; sum b2 gives 1.18; the final sigmoid gives the output .765]


Error Function

• For training, we need a measure of how well we are doing
  ⇒ a cost function, also known as objective function, loss, gain, cost, ...

• For instance, the L2 norm

  error = 1/2 (t − y)²
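
The forward pass above can be reproduced in a few lines of plain Python, as a quick check (this sketch is not part of the original slides; the target value t = 1.0 is taken from the backward-pass slides further below):

  import math

  def sigmoid(v):
      return [1.0 / (1.0 + math.exp(-x)) for x in v]

  x  = [1.0, 0.0]
  W1 = [[3.7, 3.7], [2.9, 2.9]]
  b1 = [-1.5, -4.6]
  W2 = [4.5, -5.2]
  b2 = -2.0
  t  = 1.0   # target value used in the backward-pass example below

  # hidden layer: prod, sum, sigmoid
  s = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W1, b1)]
  h = sigmoid(s)                                  # -> [0.900, 0.168]

  # final layer: prod, sum, sigmoid
  z = sum(w * hi for w, hi in zip(W2, h)) + b2    # -> 1.18
  y = sigmoid([z])[0]                             # -> 0.765

  error = 0.5 * (t - y) ** 2                      # -> 0.0277
  print(h, y, error)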
Gradient Descent

• We view the error as a function of the trainable parameters: error(λ)

• We want to optimize error(λ) by moving λ towards its optimum

  [Figure: three error(λ) curves with the current λ and the optimal λ marked, for gradient values 1, 2, and 0.2]

• Why not just set λ to its optimum?
  – we are updating based on one training example and do not want to overfit to it
  – we are also changing all the other parameters, so the curve will look different


Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions
  – functions f and g
  – the composition f ∘ g maps x to f(g(x))

• Chain rule

  (f ∘ g)′ = (f′ ∘ g) · g′    or    F′(x) = f′(g(x)) g′(x)

• Leibniz's notation

  dz/dx = dz/dy · dy/dx

  if z = f(y) and y = g(x), then

  dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)


Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t − y)²

• Derivative of the error with respect to one weight wk:

  dE/dwk = dE/dy · dy/ds · ds/dwk


Error Computation in Computation Graph

  [Figure: the computation graph extended with an L2 error node that also takes the target value t as input]


Error Propagation in Computation Graph

• Compute the derivative at node A:

  dE/dA = dE/dB · dB/dA

• Assume that we have already computed dE/dB (backward pass through the graph)

• So we only need the formula for dB/dA

• For instance, if B is a square node
  – forward computation: B = A²
  – backward computation: dB/dA = dA²/dA = 2A


Derivatives for Each Node

• L2 node:      do/di = d/di 1/2 (t − i)² = t − i

• sigmoid node: do/di = d/di σ(i) = σ(i)(1 − σ(i))

• sum node:     do/di1 = d/di1 (i1 + i2) = 1,   do/di2 = 1

• prod node:    do/di1 = d/di1 (i1 · i2) = i2,  do/di2 = i1


Backward Pass: Derivative Computation

  [Figure sequence: the backward pass through the graph for the input x = (1.0, 0.0) and target t = 1.0, with the derivative formulas attached to each node; starting from the error .0277, the computed values are .235 at the L2 node, .180 × .235 = .0424 at the output sigmoid, .0424 for b2, (.0382, .00712) for W2, (.191, −.220) for h, (.0171, −.0308) for b1, ((.0171, 0), (−.0308, 0)) for W1, and (−.0260, −.0260) for the input x]
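
The values in this figure can be reproduced by hand. The following plain-Python sketch (not part of the original slides) applies the per-node derivative formulas above, following the slides' convention of propagating t − y back from the error node:

  # forward pass values from the earlier slides
  x, h = [1.0, 0.0], [0.900, 0.168]
  W1 = [[3.7, 3.7], [2.9, 2.9]]
  W2 = [4.5, -5.2]
  y, t = 0.765, 1.0

  d_y  = t - y                                   # L2 node (slides' convention): .235
  d_z  = y * (1 - y) * d_y                       # output sigmoid node: .180 * .235 = .0424
  d_b2 = d_z                                     # sum node passes the value through: .0424
  d_W2 = [d_z * hi for hi in h]                  # prod node, w.r.t. W2: [.0382, .00712]
  d_h  = [d_z * wi for wi in W2]                 # prod node, w.r.t. h: [.191, -.220]
  d_s  = [hi * (1 - hi) * di for hi, di in zip(h, d_h)]   # hidden sigmoid: [.0171, -.0308]
  d_b1 = d_s                                     # sum node: [.0171, -.0308]
  d_W1 = [[di * xi for xi in x] for di in d_s]   # prod node, w.r.t. W1: [[.0171, 0], [-.0308, 0]]
  d_x  = [sum(W1[j][i] * d_s[j] for j in range(2)) for i in range(2)]   # [-.0260, -.0260]

  print(d_W2, d_b2, d_W1, d_b1)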
Gradients for Parameter Update

  [Figure: the computation graph with the computed gradients attached to the trainable parameters: ((.0171, 0), (−.0308, 0)) for W1, (.0171, −.0308) for b1, (.0382, .00712) for W2, and .0424 for b2]


Parameter Update

  [Figure: each parameter is updated by subtracting its gradient scaled by the learning rate µ, e.g. W1 ← ((3.7, 3.7), (2.9, 2.9)) − µ ((.0171, 0), (−.0308, 0)), b1 ← (−1.5, −4.6) − µ (.0171, −.0308), W2 ← (4.5, −5.2) − µ (.0382, .00712), b2 ← −2.0 − µ (.0424)]


toolkits


Explosion of Deep Learning Toolkits

• University of Montreal: Theano (early, now defunct)
• Google: TensorFlow
• Facebook: Torch, PyTorch
• Microsoft: CNTK
• Amazon: MXNet
• CMU: DyNet
• AMU/Edinburgh/Microsoft: Marian
• ... and many more


Toolkits

• Machine learning architectures built around computation graphs are very powerful
  – define a computation graph
  – provide data and a training strategy (e.g., batching)
  – the toolkit does the rest
  – seamless support of GPUs


Example: PyTorch

• Installation

  pip install torch

• Usage

  import torch


Some Data Types

• The PyTorch data type for parameter vectors, matrices, etc. is called torch.tensor

  W = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
  b = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
  W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
  b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

• The definition of these variables includes
  – specification of their basic data type (float)
  – an indication to compute gradients (requires_grad=True)

• Input and output

  x = torch.tensor([1,0], dtype=torch.float)
  t = torch.tensor([1], dtype=torch.float)


Computation Graph

• Computation graph

  s = W.mv(x) + b
  h = torch.nn.Sigmoid()(s)
  z = torch.dot(W2, h) + b2
  y = torch.nn.Sigmoid()(z)
  error = 1/2 * (t - y) ** 2

• Note
  – the PyTorch sigmoid function is torch.nn.Sigmoid()
  – multiplication between the matrix W and the vector x is mv
  – multiplication between the two vectors W2 and h is torch.dot


Backward Computation

• Here it is:

  error.backward()

• No need to derive gradients by hand: everything is done automatically

• We can look up the computed gradients

  >>> W2.grad
  tensor([-0.0360, -0.0059])

• Note
  – when you run this code multiple times, gradients accumulate
  – reset them with, e.g., W2.grad.data.zero_()
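
Putting these pieces together, a minimal runnable script looks as follows (a sketch assembled from the fragments above, not a verbatim slide):

  import torch

  # trainable parameters
  W = torch.tensor([[3., 4.], [2., 3.]], requires_grad=True)
  b = torch.tensor([-2., -4.], requires_grad=True)
  W2 = torch.tensor([5., -5.], requires_grad=True)
  b2 = torch.tensor([-2.], requires_grad=True)

  # input and target output
  x = torch.tensor([1., 0.])
  t = torch.tensor([1.])

  # forward computation
  s = W.mv(x) + b
  h = torch.nn.Sigmoid()(s)
  z = torch.dot(W2, h) + b2
  y = torch.nn.Sigmoid()(z)
  error = 1/2 * (t - y) ** 2

  # backward computation and gradient inspection
  error.backward()
  print(W2.grad)        # tensor([-0.0360, -0.0059])

  # reset the gradients before the next backward pass, otherwise they accumulate
  for p in (W, b, W2, b2):
      p.grad.data.zero_()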
Training Data

• Our training set consists of the four examples of the binary XOR operation

   x1  x2  x1 ⊕ x2
   0   0   0
   0   1   1
   1   0   1
   1   1   0

• Placed into an array

  training_data = [
    [ torch.tensor([0.,0.]), torch.tensor([0.]) ],
    [ torch.tensor([1.,0.]), torch.tensor([1.]) ],
    [ torch.tensor([0.,1.]), torch.tensor([1.]) ],
    [ torch.tensor([1.,1.]), torch.tensor([0.]) ]
  ]


Training Loop: Forward

  mu = 0.1
  for epoch in range(1000):
      total_error = 0
      for item in training_data:
          x = item[0]
          t = item[1]
          # forward computation
          s = W.mv(x) + b
          h = torch.nn.Sigmoid()(s)
          z = torch.dot(W2, h) + b2
          y = torch.nn.Sigmoid()(z)
          error = 1/2 * (t - y) ** 2
          total_error = total_error + error


Training Loop: Backward and Updates

          # backward computation
          error.backward()

          # weight updates
          W.data = W - mu * W.grad.data
          b.data = b - mu * b.grad.data
          W2.data = W2 - mu * W2.grad.data
          b2.data = b2 - mu * b2.grad.data

          W.grad.data.zero_()
          b.grad.data.zero_()
          W2.grad.data.zero_()
          b2.grad.data.zero_()

      print("error: ", total_error/4)


Batch Training

• So far we computed gradients for each training example and updated the model immediately

• More common: process examples in batches, update after the batch has been processed

• Instead of

  error.backward()

• run back-propagation on the accumulated error

  total_error.backward()


Training Data Batch

  x = torch.tensor([ [0.,0.], [1.,0.], [0.,1.], [1.,1.] ])
  t = torch.tensor([ 0., 1., 1., 0. ])

• Change to the computation graph (the input is now a matrix, the output a vector)

  s = x.mm(W) + b
  h = torch.nn.Sigmoid()(s)
  z = h.mv(W2) + b2
  y = torch.nn.Sigmoid()(z)

• Convert the error vector into a single number

  error = 1/2 * (t - y) ** 2
  mean_error = error.mean()
  mean_error.backward()


Parameter Updates (Optimizer)

• Our code has explicit parameter update computations

  # weight updates
  W.data = W - mu * W.grad.data
  b.data = b - mu * b.grad.data
  W2.data = W2 - mu * W2.grad.data
  b2.data = b2 - mu * b2.grad.data

• But fancier optimizers are typically used (Adam, etc.)

• This requires a more complex implementation
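
As one illustration (a sketch, not from the original slides), PyTorch's built-in Adam optimizer can replace the manual update block, acting on the same explicitly defined parameters; the learning rate here is an arbitrary choice:

  # set up Adam over the explicitly defined parameters
  optimizer = torch.optim.Adam([W, b, W2, b2], lr=0.01)

  for epoch in range(1000):
      optimizer.zero_grad()          # reset accumulated gradients
      s = x.mm(W) + b                # batch forward pass as above
      h = torch.nn.Sigmoid()(s)
      z = h.mv(W2) + b2
      y = torch.nn.Sigmoid()(z)
      mean_error = (1/2 * (t - y) ** 2).mean()
      mean_error.backward()          # back-propagation
      optimizer.step()               # Adam parameter update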
torch.nn.Module

• A neural network model is defined as a class derived from torch.nn.Module

  class ExampleNet(torch.nn.Module):
      def __init__(self):
          super(ExampleNet, self).__init__()
          self.layer1 = torch.nn.Linear(2,2)
          self.layer2 = torch.nn.Linear(2,1)
          self.layer1.weight = torch.nn.Parameter(torch.tensor([[3.,2.],[4.,3.]]))
          self.layer1.bias = torch.nn.Parameter(torch.tensor([-2.,-4.]))
          self.layer2.weight = torch.nn.Parameter(torch.tensor([[5.,-5.]]))
          self.layer2.bias = torch.nn.Parameter(torch.tensor([-2.]))

      def forward(self, x):
          s = self.layer1(x)
          h = torch.nn.Sigmoid()(s)
          z = self.layer2(h)
          y = torch.nn.Sigmoid()(z)
          return y


Optimizer Definition

• Instantiation of the neural network object

  net = ExampleNet()

• Optimizer definition

  optimizer = torch.optim.SGD(net.parameters(), lr=0.1)


Training Loop

  for iteration in range(1000):
      optimizer.zero_grad()
      out = net.forward( x ).squeeze(1)   # flatten the (4,1) output to match the shape of t
      error = 1/2 * (t - out) ** 2
      mean_error = error.mean()
      print("error: ", mean_error.data)
      mean_error.backward()
      optimizer.step()


Code available on the web page for the textbook
http://www.statmt.org/nmt-book/
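
As a final sanity check on the training loop above (a sketch, not part of the slides), the trained network can be applied to the four XOR inputs and compared against the targets:

  # inspect the trained network's predictions for the four XOR inputs
  with torch.no_grad():
      predictions = net(x).squeeze(1)
  for inputs, target, prediction in zip(x, t, predictions):
      print(inputs.tolist(), float(target), round(float(prediction), 3))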