Neural Machine Translation
Philipp Koehn
3 October 2023

Language Models

• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context

Feed Forward Neural Language Model

[Figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are each embedded (embedding matrix E_w), combined in a feed-forward hidden layer h, and a softmax over the output layer predicts the word w_i.]

Recurrent Neural Language Model

[Figure: each input word x_j is embedded (E x_j) and fed into a recurrent state h_j; a softmax gives the output word prediction t_i for the next word y_i.]

• Predict the first word of a sentence
• Predict the second word of a sentence, re-using the hidden state from the first word prediction
• Predict the third word of a sentence ... and so on, until the whole sentence ("the house is big .") has been generated

Recurrent Neural Translation Model

• We predicted the words of a sentence
• Why not also predict their translations?

Encoder-Decoder Model

[Figure: a single recurrent network reads the input sentence "the house is big ." and then continues to generate the translation "das Haus ist groß ." word by word.]

• Obviously madness
• Proposed by Google (Sutskever et al. 2014)

What is Missing?

• Alignment of input words to output words
⇒ Solution: attention mechanism

Neural Translation Model with Attention

Input Encoding

[Figure: the recurrent language model run over the input words "the house is big .", predicting each following word.]

• Inspiration: recurrent neural network language model on the input side (a minimal sketch follows below)
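To make this recurrent language model concrete, here is a minimal sketch. The choice of PyTorch, the GRU as the recurrent unit, and all layer sizes are illustrative assumptions; the slides do not prescribe a particular toolkit or cell.

import torch
import torch.nn as nn

class RecurrentLanguageModel(nn.Module):
    """Embed -> RNN -> Softmax, predicting the next word at each position."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # "Embed" boxes
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # recurrent state h_j
        self.proj = nn.Linear(hidden_dim, vocab_size)               # scores fed into the softmax

    def forward(self, word_ids, hidden=None):
        # word_ids: (batch, positions) indices of the words seen so far
        states, hidden = self.rnn(self.embed(word_ids), hidden)     # hidden state is re-used step by step
        logits = self.proj(states)                                  # softmax over logits = P(next word)
        return logits, hidden

# Used word by word, as in the slides: feed "the", obtain a distribution over the
# second word; feed "house" next, passing back the returned hidden state, and so on.

Run over the input sentence rather than over generated output, the same network yields one hidden state per input position, which is exactly the encoding discussed next.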
Hidden Language Model States

• This gives us the hidden states: one recurrent state per input word
• These encode the left context for each word
• Same process in reverse: right context for each word

Input Encoder

[Figure: a left-to-right encoder RNN and a right-to-left encoder RNN run over the input word embeddings E x_j, producing the states →h_j and ←h_j for each input word.]

• Input encoder: concatenate bidirectional RNN states
• Each word representation includes full left and right sentence context

Encoder: Math

• Input is a sequence of words x_j, mapped into embedding space Ē x_j
• Bidirectional recurrent neural networks
    ←h_j = f(←h_{j+1}, Ē x_j)
    →h_j = f(→h_{j−1}, Ē x_j)
• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...

Decoder

[Figure: decoder states s_i, fed by the embeddings E y_i of the previous output words and by the input context c_i, predict the output words through a softmax.]

• We want to have a recurrent neural network predicting output words
• We feed decisions on output words back into the decoder state
• The decoder state is also informed by the input context

More Detail

• The decoder is also a recurrent neural network, over a sequence of hidden states s_i:
    s_i = f(s_{i−1}, E y_{i−1}, c_i)
• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• The output word y_i is selected by computing a vector t_i (same size as the vocabulary)
    t_i = W(U s_{i−1} + V E y_{i−1} + C c_i)
  and then finding the highest value in the vector t_i (a sketch of one such decoder step follows below)
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i

Attention

[Figure: attention weights α_ij connect the decoder state s_i to the bidirectional encoder states h_j; their weighted sum gives the input context c_i.]

• Given what we have generated so far (decoder hidden state)
• ... which words in the input should we pay attention to (encoder states)?

• Given:
  – the previous hidden state of the decoder s_{i−1}
  – the representation of the input words h_j = (←h_j, →h_j)
• Predict an alignment probability a(s_{i−1}, h_j) for each input word j
  (modeled with a feed-forward neural network layer)
• Normalize the attention weights (softmax)
    α_ij = exp(a(s_{i−1}, h_j)) / Σ_k exp(a(s_{i−1}, h_k))
• Relevant input context: weigh the input words according to the attention
    c_i = Σ_j α_ij h_j
• Use this context to predict the next hidden state and output word (see the sketches below)
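A sketch of this attention computation. The specific scoring form a(s_{i−1}, h_j) = v⊤ tanh(W_a s_{i−1} + U_a h_j) is one common choice for the feed-forward layer (additive attention); that choice, the toolkit (PyTorch), and the dimensions are assumptions, not prescribed by the slides.

import torch
import torch.nn as nn

class Attention(nn.Module):
    """alpha_ij = softmax_j a(s_{i-1}, h_j);  c_i = sum_j alpha_ij h_j."""
    def __init__(self, state_dim=256, encoder_dim=512, attn_dim=256):
        super().__init__()
        # a(): a small feed-forward layer scoring the decoder state against each input word
        self.Wa = nn.Linear(state_dim, attn_dim, bias=False)
        self.Ua = nn.Linear(encoder_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, prev_state, encoder_states):
        # prev_state: (batch, state_dim) = s_{i-1}
        # encoder_states: (batch, input_len, encoder_dim) = concatenated bidirectional h_j
        scores = self.v(torch.tanh(self.Wa(prev_state).unsqueeze(1) + self.Ua(encoder_states)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)        # normalize over input positions j
        context = torch.bmm(alpha.unsqueeze(1), encoder_states)  # weighted sum c_i
        return context.squeeze(1), alpha                         # (batch, encoder_dim), (batch, input_len)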
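And a sketch of one decoder step implementing s_i = f(s_{i−1}, E y_{i−1}, c_i) and t_i = W(U s_{i−1} + V E y_{i−1} + C c_i). Choosing a GRU cell for f() and the layer sizes are assumptions; the context vector c_i would come from the attention computation described above.

import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One step: new decoder state s_i and prediction vector t_i (vocabulary-sized)."""
    def __init__(self, vocab_size, embed_dim=128, state_dim=256, context_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # E
        self.f = nn.GRUCell(embed_dim + context_dim, state_dim)   # f(): here a GRU
        self.U = nn.Linear(state_dim, state_dim, bias=False)
        self.V = nn.Linear(embed_dim, state_dim, bias=False)
        self.C = nn.Linear(context_dim, state_dim, bias=False)
        self.W = nn.Linear(state_dim, vocab_size)

    def forward(self, prev_state, prev_word, context):
        # prev_state = s_{i-1}, prev_word = y_{i-1}, context = c_i
        prev_embed = self.embed(prev_word)                                     # E y_{i-1}
        state = self.f(torch.cat([prev_embed, context], dim=-1), prev_state)   # s_i
        t = self.W(self.U(prev_state) + self.V(prev_embed) + self.C(context))  # t_i
        return state, t   # softmax(t) is the distribution over output words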
Training

Comparing Prediction to Correct Word

[Figure: for each output word (das, Haus, ist, ...) the predicted distribution t_i is compared to the correct word y_i, giving the error −log t_i[y_i].]

• The current model gives some probability t_i[y_i] to the correct word y_i
• We turn this into an error by computing the cross-entropy: −log t_i[y_i]
  (a short sketch of this loss computation appears at the end of this section)

Unrolled Computation Graph

[Figure: the full encoder-decoder model with attention, unrolled over the sentence pair "the house is big ." → "das Haus ist groß .": input word embeddings E x_j, bidirectional encoder states h_j, attention weights α_ij, weighted-sum input contexts c_i, decoder states s_i, output word predictions t_i, output word embeddings E y_i, and one error node −log t_i[y_i] per output word.]

Deeper Models

• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models
  [Figure: a shallow model with one recurrent layer per step, next to deep stacked and deep transitional models with hidden layers h_{t,1}, h_{t,2}, h_{t,3} between the input word embedding and the output softmax.]
• Adding residual connections (short-cuts through deep layers) helps, as sketched below
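A sketch of the stacked deep variant with residual short-cuts from the Deeper Models slide. The number of layers, the use of GRUs, and keeping the embedding and hidden sizes equal (so the residual additions line up) are assumptions made for illustration.

import torch
import torch.nn as nn

class DeepRecurrentEncoder(nn.Module):
    """Stacked recurrent layers with residual short-cuts between them."""
    def __init__(self, vocab_size, dim=256, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, word_ids):
        x = self.embed(word_ids)      # (batch, positions, dim)
        for rnn in self.layers:
            out, _ = rnn(x)
            x = x + out               # residual connection: short-cut through the deep layer
        return x                      # deep representation of each input position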
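Finally, the cross-entropy error −log t_i[y_i] from the training slides above, computed over a batch of prediction vectors. The batch layout is illustrative; cross_entropy applies the softmax normalization internally and takes −log of the probability assigned to the correct word.

import torch
import torch.nn.functional as F

def sentence_loss(t, y):
    # t: prediction vectors t_i, shape (batch, output_len, vocab_size)
    # y: correct output words y_i, shape (batch, output_len)
    # averages -log t_i[y_i] over all output words in the batch
    return F.cross_entropy(t.reshape(-1, t.size(-1)), y.reshape(-1))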