Neural Network Language Models
Philipp Koehn
30 September 2021


N-Gram Backoff Language Model

• Previously, we approximated
  p(W) = p(w_1, w_2, ..., w_n)
• ... by applying the chain rule
  p(W) = \prod_i p(w_i \mid w_1, ..., w_{i-1})
• ... and limiting the history (Markov order)
  p(w_i \mid w_1, ..., w_{i-1}) \approx p(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
• Each p(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
  → we back off to p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}), p(w_i \mid w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
  – exact details of backing off get complicated ("interpolated Kneser-Ney")


Refinements

• A whole family of back-off schemes
• Skip n-gram models that may back off to p(w_i \mid w_{i-2})
• Class-based models p(C(w_i) \mid C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))
⇒ We are wrestling here with
  – using as much relevant evidence as possible
  – pooling evidence between words


First Sketch

[Figure: feed-forward language model. History words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed a hidden layer h (FF), followed by a softmax over the output word w_i.]


Representing Words

• Words are represented with a one-hot vector, e.g.,
  – dog = (0,0,0,0,1,0,0,0,0,...)
  – cat = (0,0,0,0,0,0,0,1,0,...)
  – eat = (0,1,0,0,0,0,0,0,0,...)
• That's a large vector!
• Remedies
  – limit to, say, 20,000 most frequent words, rest are OTHER
  – place words in √n classes, so each word is represented by
    ∗ 1 class label
    ∗ 1 word-in-class label
  – splitting rare words into subwords
  – character-based models


word embeddings


Add a Hidden Layer

[Figure: as in the first sketch, but each history word w_{i-4}, ..., w_{i-1} is first mapped by a shared embedding layer E before the hidden layer h (FF), the softmax, and the output word w_i.]

• Map each word first into a lower-dimensional real-valued space
• Shared weight matrix E


Details (Bengio et al., 2003)

• Add direct connections from embedding layer to output layer
• Activation functions
  – input → embedding: none
  – embedding → hidden: tanh
  – hidden → output: softmax
• Training
  – loop through the entire corpus
  – update weights from the error between the predicted probabilities and the one-hot vector of the output word
  (see the code sketch after the word-embedding slides below)


Word Embeddings

[Figure: a word's one-hot vector is mapped by the matrix C to its word embedding.]

• By-product: embedding of word into continuous space
• Similar contexts → similar embedding
• Recall: distributional semantics


Word Embeddings

[Figures: two-dimensional visualizations of the learned word embedding space.]
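To make the "First Sketch", "Add a Hidden Layer", and Bengio et al. slides concrete, here is a minimal NumPy sketch of the feed-forward language model: a shared embedding matrix E, a tanh hidden layer, and a softmax output. The sizes, random initialization, and the omission of Bengio's direct embedding-to-output connections and of the training loop are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the slides)
vocab_size, embed_dim, hidden_dim, history = 20000, 64, 128, 4

# Parameters: shared embedding matrix E, hidden layer, output layer
E   = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
W_h = rng.normal(scale=0.01, size=(history * embed_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_o = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))
b_o = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(history_ids):
    """p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) as a vector over the vocabulary."""
    # Multiplying a one-hot vector with E is just a row lookup: embed each history word
    x = np.concatenate([E[w] for w in history_ids])
    h = np.tanh(x @ W_h + b_h)   # hidden layer (tanh, as in the Bengio et al. slide)
    return softmax(h @ W_o + b_o)

p = predict([11, 53, 7, 42])     # four made-up history word ids
print(p.shape, p.sum())          # (20000,), sums to 1
```

Training would loop over the corpus and back-propagate the error between this predicted distribution and the one-hot vector of the observed next word.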
Are Word Embeddings Magic?

• Morphosyntactic regularities (Mikolov et al., 2013)
  – adjectives: base form vs. comparative, e.g., good, better
  – nouns: singular vs. plural, e.g., year, years
  – verbs: present tense vs. past tense, e.g., see, saw
• Semantic regularities
  – clothing is to shirt as dish is to bowl
  – evaluated on human judgment data of semantic similarities


recurrent neural networks


Recurrent Neural Networks

[Figure: recurrent language model at the first time step: word w_1 is embedded and, together with an initial history layer, passed through a tanh hidden layer and a softmax to predict the next word.]

• Start: predict second word from first
• Mystery layer with nodes all with value 1


Recurrent Neural Networks

[Figure: the network unrolled over two time steps (w_1, w_2); the hidden layer at time 1 is copied forward as the history for time 2.]


Recurrent Neural Networks

[Figure: the network unrolled over three time steps (w_1, w_2, w_3), with the hidden layer copied forward at each step.]


Training

[Figure: first training example: w_1 is embedded (E w_t), processed by the recurrent hidden layer h_t, and the softmax prediction y_t is compared to the observed next word in a cost node.]

• Process first training example
• Update weights with back-propagation


Training

[Figure: the same network processing the second training example (w_2).]

• Process second training example
• Update weights with back-propagation
• And so on...
• But: no feedback to previous history


Back-Propagation Through Time

[Figure: the network unrolled over three time steps (w_1, w_2, w_3), each with its own cost node.]

• After processing a few training examples, update through the unfolded recurrent neural network


Back-Propagation Through Time

• Carry out back-propagation through time (BPTT) after each training example
  – 5 time steps seems to be sufficient
  – network learns to store information for more than 5 time steps
• Or: update in mini-batches
  – process 10-20 training examples
  – update backwards through all examples
  – removes need for multiple steps for each training example
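A minimal NumPy sketch of the recurrent language model forward pass described above: the hidden state is copied forward from step to step, each step predicts the next word, and each prediction contributes a cost. Sizes, initialization, and the absence of bias terms are illustrative assumptions; the backward pass (truncated BPTT) is only indicated in a comment rather than implemented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions, not from the slides)
vocab_size, embed_dim, hidden_dim = 1000, 32, 64

E   = rng.normal(scale=0.01, size=(vocab_size, embed_dim))   # word embeddings
W_x = rng.normal(scale=0.01, size=(embed_dim, hidden_dim))   # input -> hidden
W_h = rng.normal(scale=0.01, size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrence)
W_o = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unroll(words, h0=None):
    """Forward pass over a word sequence: predict word t+1 from words 1..t.
    Returns the total cross-entropy cost and the hidden states of the unrolled network."""
    h = np.zeros(hidden_dim) if h0 is None else h0
    states, cost = [], 0.0
    for t in range(len(words) - 1):
        h = np.tanh(E[words[t]] @ W_x + h @ W_h)   # history copied forward via h
        y = softmax(h @ W_o)                       # distribution over the next word
        cost += -np.log(y[words[t + 1]])           # cost against the observed next word
        states.append(h)
    return cost, states

# Truncated BPTT would push gradients back through only the last few
# (e.g., 5) unrolled states, or through a whole mini-batch of 10-20 examples.
cost, states = unroll([4, 17, 250, 3, 99, 2])      # made-up word ids
print(cost, len(states))
```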
long short term memory


Vanishing Gradients

• Error is propagated to previous steps
• Updates consider
  – prediction at that time step
  – impact on future time steps
• Vanishing gradient: propagated error disappears


Recent vs. Early History

• Hidden layer plays double duty
  – memory of the network
  – continuous space representation used to predict output words
• Sometimes only recent context important
  After much economic progress over the years, the country → has
• Sometimes much earlier context important
  The country which has made much economic progress over the years still → has


Long Short Term Memory (LSTM)

• Design quite elaborate, although not very complicated to use
• Basic building block: LSTM cell
  – similar to a node in a hidden layer
  – but: has an explicit memory state
• Output and memory state changes depend on gates
  – input gate: how much new input changes memory state
  – forget gate: how much of prior memory state is retained
  – output gate: how strongly memory state is passed on to next layer
• Gates can be not just open (1) or closed (0), but slightly ajar (e.g., 0.2)


LSTM Cell

[Figure: an LSTM cell. The input X from the preceding layer and the cell's hidden value and memory m from the LSTM layer at time t-1 are combined through the input gate (i), forget gate, and output gate (o) to produce the new memory m and the output h passed to the next layer (Y) and to the LSTM layer at time t.]


LSTM Cell (Math)

• Memory and output values at time step t
  memory_t = gate_{input} × input_t + gate_{forget} × memory_{t-1}
  output_t = gate_{output} × memory_t
• Hidden node value h_t passed on to the next layer applies activation function f
  h_t = f(output_t)
• Input computed as in a recurrent neural network node
  – given node values for the prior layer x^t = (x^t_1, ..., x^t_X)
  – given values for the hidden layer from the previous time step h^{t-1} = (h^{t-1}_1, ..., h^{t-1}_H)
  – input value is a combination of matrix multiplication with weights w^x and w^h and activation function g
    input_t = g( \sum_{i=1}^{X} w^x_i x^t_i + \sum_{i=1}^{H} w^h_i h^{t-1}_i )
  (see the code sketch after the GRU slides below)


Values for Gates

• Gates are very important
• How do we compute their value?
  → with a neural network layer!
• For each gate a ∈ {input, forget, output}
  – weight matrix W^{xa} to consider node values in the previous layer x^t
  – weight matrix W^{ha} to consider the hidden layer h^{t-1} at the previous time step
  – weight matrix W^{ma} to consider the memory at the previous time step memory^{t-1}
  – activation function h
    gate_a = h( \sum_{i=1}^{X} w^{xa}_i x^t_i + \sum_{i=1}^{H} w^{ha}_i h^{t-1}_i + \sum_{i=1}^{H} w^{ma}_i memory^{t-1}_i )


Training

• LSTMs are trained the same way as recurrent neural networks
• Back-propagation through time
• This looks all very complex, but:
  – all the operations are still based on
    ∗ matrix multiplications
    ∗ differentiable activation functions
  → we can compute gradients for the objective function with respect to all parameters
  → we can compute update functions


What is the Point? (from Tran, Bisazza, Monz, 2016)

• Each node has memory memory_i independent from its current output h_i
• Memory may be carried through unchanged (gate^i_{input} = 0, gate^i_{memory} = 1)
⇒ can remember important features over a long time span (capture long-distance dependencies)


Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"

[Figures: activations of individual LSTM cells over text, from Karpathy et al. (2015).]


Gated Recurrent Unit (GRU)

[Figure: a GRU cell. The input X from the preceding layer and the state h from the GRU layer at time t-1 are combined through the update gate and reset gate to produce the new state h passed to the next layer (Y) and to the GRU layer at time t.]


Gated Recurrent Unit (Math)

• Two gates
  update_t = g(W_{update} input_t + U_{update} state_{t-1} + bias_{update})
  reset_t = g(W_{reset} input_t + U_{reset} state_{t-1} + bias_{reset})
• Combination of input and previous state (similar to a traditional recurrent neural network)
  combination_t = f(W input_t + U(reset_t ◦ state_{t-1}))
• Interpolation with previous state
  state_t = (1 − update_t) ◦ state_{t-1} + update_t ◦ combination_t + bias
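The LSTM cell equations above can be written out directly. The sketch below is a minimal NumPy rendering of those slides, not a reference implementation: the sizes, the random initialization, and the choice of the logistic sigmoid for the gate activation h and tanh for g and f are assumptions (the slides leave these activation functions unspecified). Note that, following the "Values for Gates" slide, each gate also looks at the previous memory state via a weight matrix W^{ma}.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes (assumptions, not from the slides)
input_dim, hidden_dim = 16, 32

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Weights for the cell input (activation g) ...
W_x, W_h = init((input_dim, hidden_dim)), init((hidden_dim, hidden_dim))
# ... and, for each gate a, the matrices W^{xa}, W^{ha}, W^{ma}
gates = {a: (init((input_dim, hidden_dim)),
             init((hidden_dim, hidden_dim)),
             init((hidden_dim, hidden_dim)))
         for a in ("input", "forget", "output")}

def lstm_step(x_t, h_prev, mem_prev):
    """One LSTM step following the slide equations:
    memory_t = gate_input * input_t + gate_forget * memory_{t-1}
    output_t = gate_output * memory_t,   h_t = f(output_t)."""
    def gate(a):
        Wxa, Wha, Wma = gates[a]
        return sigmoid(x_t @ Wxa + h_prev @ Wha + mem_prev @ Wma)  # activation h = sigmoid
    inp   = np.tanh(x_t @ W_x + h_prev @ W_h)                      # input_t, activation g = tanh
    mem_t = gate("input") * inp + gate("forget") * mem_prev
    h_t   = np.tanh(gate("output") * mem_t)                        # activation f = tanh
    return h_t, mem_t

h, mem = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # five dummy time steps
    h, mem = lstm_step(x, h, mem)
print(h.shape, mem.shape)
```

The GRU step is analogous, with the update and reset gates and the state interpolation of the previous slide in place of the explicit memory.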
deeper models


Deep Learning?

[Figure: the shallow recurrent language model unrolled over three time steps: input word x_t, embedding E, a single recurrent hidden layer h_t, and softmax output y_t.]

• Not much deep learning so far
• Between input and predicted output: only one hidden layer
• How about more hidden layers?


Deep Models

[Figure: two deep variants, "deep stacked" and "deep transitional", unrolled over three time steps; each has three recurrent hidden layers h_{t,1}, h_{t,2}, h_{t,3} between the embedded input x_i and the softmax output y_t.]

(a code sketch of the stacked variant follows the closing slide)


questions?
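To make the "Deep Models" slide concrete, here is a minimal NumPy sketch of the deep stacked variant: at each time step, the output of recurrent layer k is the input of layer k+1, and only the top layer feeds the softmax. Sizes and initialization are illustrative assumptions; roughly, the deep transitional variant would instead chain several transformations of the hidden state within a single time step.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sizes (assumptions, not from the slides)
embed_dim, hidden_dim, num_layers, vocab_size = 32, 64, 3, 1000

def init(shape):
    return rng.normal(scale=0.1, size=shape)

E = init((vocab_size, embed_dim))
layers = [{"W_x": init((embed_dim if k == 0 else hidden_dim, hidden_dim)),
           "W_h": init((hidden_dim, hidden_dim))} for k in range(num_layers)]
W_o = init((hidden_dim, vocab_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, hiddens):
    """One time step of a deep stacked RNN: layer k's output is layer k+1's input."""
    x = E[word_id]
    new_hiddens = []
    for k, layer in enumerate(layers):
        h = np.tanh(x @ layer["W_x"] + hiddens[k] @ layer["W_h"])
        new_hiddens.append(h)
        x = h                               # feed the state up the stack
    return softmax(x @ W_o), new_hiddens    # predict from the top layer only

hiddens = [np.zeros(hidden_dim) for _ in range(num_layers)]
for w in [4, 17, 250, 3]:                   # made-up word ids
    p, hiddens = step(w, hiddens)
print(p.shape)                              # (1000,)
```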