Computation Graphs

Philipp Koehn

29 September 2020


Neural Network Cartoon

• A common way to illustrate a neural network

  [Figure: feed-forward network with input layer x, hidden layer h, and output layer y]


Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)


Computation Graph

  [Figure: the same network as a computation graph: x → prod(W1) → sum(b1) → sigmoid → prod(W2) → sum(b2) → sigmoid]


Simple Neural Network

  [Figure: example network with bias nodes and concrete weight values: hidden layer weights (3.7, 3.7) and (2.9, 2.9) with biases (−1.5, −4.6); output layer weights (4.5, −5.2) with bias −2.0]


Computation Graph

  [Figure: the computation graph annotated with these weight values: W1 = ((3.7, 3.7), (2.9, 2.9)), b1 = (−1.5, −4.6), W2 = (4.5, −5.2), b2 = −2.0]


Processing Input

  [Figure sequence: forward pass for the input x = (1.0, 0.0): prod W1 gives (3.7, 2.9); sum b1 gives (2.2, −1.6); sigmoid gives (.900, .168); prod W2 gives 3.18; sum b2 gives 1.18; the final sigmoid gives the output .765]


Error Function

• For training, we need a measure of how well we are doing
  ⇒ a cost function, also known as objective function, loss, gain, cost, ...

• For instance, the L2 norm

  error = 1/2 (t − y)²
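
The forward pass above can be reproduced in a few lines of plain Python, as a quick check (this sketch is not part of the original slides; the target value t = 1.0 is taken from the backward-pass slides further below):

  import math

  def sigmoid(v):
      return [1.0 / (1.0 + math.exp(-x)) for x in v]

  x  = [1.0, 0.0]
  W1 = [[3.7, 3.7], [2.9, 2.9]]
  b1 = [-1.5, -4.6]
  W2 = [4.5, -5.2]
  b2 = -2.0
  t  = 1.0   # target value used in the backward-pass example below

  # hidden layer: prod, sum, sigmoid
  s = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W1, b1)]
  h = sigmoid(s)                                  # -> [0.900, 0.168]

  # final layer: prod, sum, sigmoid
  z = sum(w * hi for w, hi in zip(W2, h)) + b2    # -> 1.18
  y = sigmoid([z])[0]                             # -> 0.765

  error = 0.5 * (t - y) ** 2                      # -> 0.0277
  print(h, y, error)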
Gradient Descent

• We view the error as a function of the trainable parameters: error(λ)

• We want to optimize error(λ) by moving λ towards its optimum

  [Figure: three error(λ) curves with the current λ and the optimal λ marked, for gradient values 1, 2, and 0.2]

• Why not just set λ to its optimum?
  – we are updating based on one training example and do not want to overfit to it
  – we are also changing all the other parameters, so the curve will look different


Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions
  – functions f and g
  – the composition f ∘ g maps x to f(g(x))

• Chain rule

  (f ∘ g)′ = (f′ ∘ g) · g′    or    F′(x) = f′(g(x)) g′(x)

• Leibniz's notation

  dz/dx = dz/dy · dy/dx

  if z = f(y) and y = g(x), then

  dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)


Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t − y)²

• Derivative of the error with respect to one weight wk:

  dE/dwk = dE/dy · dy/ds · ds/dwk


Error Computation in Computation Graph

  [Figure: the computation graph extended with an L2 error node that also takes the target value t as input]


Error Propagation in Computation Graph

• Compute the derivative at node A:

  dE/dA = dE/dB · dB/dA

• Assume that we have already computed dE/dB (backward pass through the graph)

• So we only need the formula for dB/dA

• For instance, if B is a square node
  – forward computation: B = A²
  – backward computation: dB/dA = dA²/dA = 2A


Derivatives for Each Node

• L2 node:      do/di = d/di 1/2 (t − i)² = t − i

• sigmoid node: do/di = d/di σ(i) = σ(i)(1 − σ(i))

• sum node:     do/di1 = d/di1 (i1 + i2) = 1,   do/di2 = 1

• prod node:    do/di1 = d/di1 (i1 · i2) = i2,  do/di2 = i1


Backward Pass: Derivative Computation

  [Figure sequence: the backward pass through the graph for the input x = (1.0, 0.0) and target t = 1.0, with the derivative formulas attached to each node; starting from the error .0277, the computed values are .235 at the L2 node, .180 × .235 = .0424 at the output sigmoid, .0424 for b2, (.0382, .00712) for W2, (.191, −.220) for h, (.0171, −.0308) for b1, ((.0171, 0), (−.0308, 0)) for W1, and (−.0260, −.0260) for the input x]
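
The values in this figure can be reproduced by hand. The following plain-Python sketch (not part of the original slides) applies the per-node derivative formulas above, following the slides' convention of propagating t − y back from the error node:

  # forward pass values from the earlier slides
  x, h = [1.0, 0.0], [0.900, 0.168]
  W1 = [[3.7, 3.7], [2.9, 2.9]]
  W2 = [4.5, -5.2]
  y, t = 0.765, 1.0

  d_y  = t - y                                   # L2 node (slides' convention): .235
  d_z  = y * (1 - y) * d_y                       # output sigmoid node: .180 * .235 = .0424
  d_b2 = d_z                                     # sum node passes the value through: .0424
  d_W2 = [d_z * hi for hi in h]                  # prod node, w.r.t. W2: [.0382, .00712]
  d_h  = [d_z * wi for wi in W2]                 # prod node, w.r.t. h: [.191, -.220]
  d_s  = [hi * (1 - hi) * di for hi, di in zip(h, d_h)]   # hidden sigmoid: [.0171, -.0308]
  d_b1 = d_s                                     # sum node: [.0171, -.0308]
  d_W1 = [[di * xi for xi in x] for di in d_s]   # prod node, w.r.t. W1: [[.0171, 0], [-.0308, 0]]
  d_x  = [sum(W1[j][i] * d_s[j] for j in range(2)) for i in range(2)]   # [-.0260, -.0260]

  print(d_W2, d_b2, d_W1, d_b1)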
Gradients for Parameter Update

  [Figure: the computation graph with the computed gradients attached to the trainable parameters: ((.0171, 0), (−.0308, 0)) for W1, (.0171, −.0308) for b1, (.0382, .00712) for W2, and .0424 for b2]


Parameter Update

  [Figure: each parameter is updated by subtracting its gradient scaled by the learning rate µ, e.g. W1 ← ((3.7, 3.7), (2.9, 2.9)) − µ ((.0171, 0), (−.0308, 0)), b1 ← (−1.5, −4.6) − µ (.0171, −.0308), W2 ← (4.5, −5.2) − µ (.0382, .00712), b2 ← −2.0 − µ (.0424)]


toolkits


Explosion of Deep Learning Toolkits

• University of Montreal: Theano (early, now defunct)
• Google: TensorFlow
• Facebook: Torch, PyTorch
• Microsoft: CNTK
• Amazon: MXNet
• CMU: DyNet
• AMU/Edinburgh/Microsoft: Marian
• ... and many more


Toolkits

• Machine learning architectures built around computation graphs are very powerful
  – define a computation graph
  – provide data and a training strategy (e.g., batching)
  – the toolkit does the rest
  – seamless support of GPUs


Example: PyTorch

• Installation

  pip install torch

• Usage

  import torch


Some Data Types

• The PyTorch data type for parameter vectors, matrices, etc. is called torch.tensor

  W = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
  b = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
  W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
  b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

• The definition of these variables includes
  – specification of their basic data type (float)
  – an indication to compute gradients (requires_grad=True)

• Input and output

  x = torch.tensor([1,0], dtype=torch.float)
  t = torch.tensor([1], dtype=torch.float)


Computation Graph

• Computation graph

  s = W.mv(x) + b
  h = torch.nn.Sigmoid()(s)
  z = torch.dot(W2, h) + b2
  y = torch.nn.Sigmoid()(z)
  error = 1/2 * (t - y) ** 2

• Note
  – the PyTorch sigmoid function is torch.nn.Sigmoid()
  – multiplication between the matrix W and the vector x is mv
  – multiplication between the two vectors W2 and h is torch.dot


Backward Computation

• Here it is:

  error.backward()

• No need to derive gradients by hand: everything is done automatically

• We can look up the computed gradients

  >>> W2.grad
  tensor([-0.0360, -0.0059])

• Note
  – when you run this code multiple times, gradients accumulate
  – reset them with, e.g., W2.grad.data.zero_()
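
Putting these pieces together, a minimal runnable script looks as follows (a sketch assembled from the fragments above, not a verbatim slide):

  import torch

  # trainable parameters
  W = torch.tensor([[3., 4.], [2., 3.]], requires_grad=True)
  b = torch.tensor([-2., -4.], requires_grad=True)
  W2 = torch.tensor([5., -5.], requires_grad=True)
  b2 = torch.tensor([-2.], requires_grad=True)

  # input and target output
  x = torch.tensor([1., 0.])
  t = torch.tensor([1.])

  # forward computation
  s = W.mv(x) + b
  h = torch.nn.Sigmoid()(s)
  z = torch.dot(W2, h) + b2
  y = torch.nn.Sigmoid()(z)
  error = 1/2 * (t - y) ** 2

  # backward computation and gradient inspection
  error.backward()
  print(W2.grad)        # tensor([-0.0360, -0.0059])

  # reset the gradients before the next backward pass, otherwise they accumulate
  for p in (W, b, W2, b2):
      p.grad.data.zero_()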
Training Data

• Our training set consists of the four examples of the binary XOR operation

   x1  x2  x1 ⊕ x2
   0   0   0
   0   1   1
   1   0   1
   1   1   0

• Placed into an array

  training_data = [
    [ torch.tensor([0.,0.]), torch.tensor([0.]) ],
    [ torch.tensor([1.,0.]), torch.tensor([1.]) ],
    [ torch.tensor([0.,1.]), torch.tensor([1.]) ],
    [ torch.tensor([1.,1.]), torch.tensor([0.]) ]
  ]


Training Loop: Forward

  mu = 0.1
  for epoch in range(1000):
      total_error = 0
      for item in training_data:
          x = item[0]
          t = item[1]
          # forward computation
          s = W.mv(x) + b
          h = torch.nn.Sigmoid()(s)
          z = torch.dot(W2, h) + b2
          y = torch.nn.Sigmoid()(z)
          error = 1/2 * (t - y) ** 2
          total_error = total_error + error


Training Loop: Backward and Updates

          # backward computation
          error.backward()

          # weight updates
          W.data = W - mu * W.grad.data
          b.data = b - mu * b.grad.data
          W2.data = W2 - mu * W2.grad.data
          b2.data = b2 - mu * b2.grad.data

          W.grad.data.zero_()
          b.grad.data.zero_()
          W2.grad.data.zero_()
          b2.grad.data.zero_()

      print("error: ", total_error/4)


Batch Training

• So far we computed gradients for each training example and updated the model immediately

• More common: process examples in batches, update after the batch has been processed

• Instead of

  error.backward()

• run back-propagation on the accumulated error

  total_error.backward()


Training Data Batch

  x = torch.tensor([ [0.,0.], [1.,0.], [0.,1.], [1.,1.] ])
  t = torch.tensor([ 0., 1., 1., 0. ])

• Change to the computation graph (the input is now a matrix, the output a vector)

  s = x.mm(W) + b
  h = torch.nn.Sigmoid()(s)
  z = h.mv(W2) + b2
  y = torch.nn.Sigmoid()(z)

• Convert the error vector into a single number

  error = 1/2 * (t - y) ** 2
  mean_error = error.mean()
  mean_error.backward()


Parameter Updates (Optimizer)

• Our code has explicit parameter update computations

  # weight updates
  W.data = W - mu * W.grad.data
  b.data = b - mu * b.grad.data
  W2.data = W2 - mu * W2.grad.data
  b2.data = b2 - mu * b2.grad.data

• But fancier optimizers are typically used (Adam, etc.)

• This requires a more complex implementation
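
As one illustration (a sketch, not from the original slides), PyTorch's built-in Adam optimizer can replace the manual update block, acting on the same explicitly defined parameters; the learning rate here is an arbitrary choice:

  # set up Adam over the explicitly defined parameters
  optimizer = torch.optim.Adam([W, b, W2, b2], lr=0.01)

  for epoch in range(1000):
      optimizer.zero_grad()          # reset accumulated gradients
      s = x.mm(W) + b                # batch forward pass as above
      h = torch.nn.Sigmoid()(s)
      z = h.mv(W2) + b2
      y = torch.nn.Sigmoid()(z)
      mean_error = (1/2 * (t - y) ** 2).mean()
      mean_error.backward()          # back-propagation
      optimizer.step()               # Adam parameter update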
torch.nn.Module

• A neural network model is defined as a class derived from torch.nn.Module

  class ExampleNet(torch.nn.Module):
      def __init__(self):
          super(ExampleNet, self).__init__()
          self.layer1 = torch.nn.Linear(2,2)
          self.layer2 = torch.nn.Linear(2,1)
          self.layer1.weight = torch.nn.Parameter(torch.tensor([[3.,2.],[4.,3.]]))
          self.layer1.bias = torch.nn.Parameter(torch.tensor([-2.,-4.]))
          self.layer2.weight = torch.nn.Parameter(torch.tensor([[5.,-5.]]))
          self.layer2.bias = torch.nn.Parameter(torch.tensor([-2.]))

      def forward(self, x):
          s = self.layer1(x)
          h = torch.nn.Sigmoid()(s)
          z = self.layer2(h)
          y = torch.nn.Sigmoid()(z)
          return y


Optimizer Definition

• Instantiation of the neural network object

  net = ExampleNet()

• Optimizer definition

  optimizer = torch.optim.SGD(net.parameters(), lr=0.1)


Training Loop

  for iteration in range(1000):
      optimizer.zero_grad()
      out = net.forward( x ).squeeze(1)   # flatten the (4,1) output to match the shape of t
      error = 1/2 * (t - out) ** 2
      mean_error = error.mean()
      print("error: ", mean_error.data)
      mean_error.backward()
      optimizer.step()


Code available on the web page for the textbook
http://www.statmt.org/nmt-book/
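
As a final sanity check on the training loop above (a sketch, not part of the slides), the trained network can be applied to the four XOR inputs and compared against the targets:

  # inspect the trained network's predictions for the four XOR inputs
  with torch.no_grad():
      predictions = net(x).squeeze(1)
  for inputs, target, prediction in zip(x, t, predictions):
      print(inputs.tolist(), float(target), round(float(prediction), 3))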