Neural Network Language Models
Philipp Koehn
30 September 2021


N-Gram Backoff Language Model

• Previously, we approximated
  p(W) = p(w_1, w_2, ..., w_n)
• ... by applying the chain rule
  p(W) = \prod_i p(w_i \mid w_1, ..., w_{i-1})
• ... and limiting the history (Markov order)
  p(w_i \mid w_1, ..., w_{i-1}) \approx p(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
• Each p(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
  → we back off to p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}), p(w_i \mid w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
  – exact details of backing off get complicated ("interpolated Kneser-Ney")


Refinements

• A whole family of back-off schemes
• Skip n-gram models that may back off to p(w_i \mid w_{i-2})
• Class-based models p(C(w_i) \mid C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))
⇒ We are wrestling here with
  – using as much relevant evidence as possible
  – pooling evidence between words


First Sketch

[Figure: feed-forward language model. History words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed a hidden layer h (FF), followed by a softmax over the output word w_i.]


Representing Words

• Words are represented with a one-hot vector, e.g.,
  – dog = (0,0,0,0,1,0,0,0,0,...)
  – cat = (0,0,0,0,0,0,0,1,0,...)
  – eat = (0,1,0,0,0,0,0,0,0,...)
• That's a large vector!
• Remedies
  – limit to, say, 20,000 most frequent words, rest are OTHER
  – place words in √n classes, so each word is represented by
    ∗ 1 class label
    ∗ 1 word-in-class label
  – splitting rare words into subwords
  – character-based models


word embeddings


Add a Hidden Layer

[Figure: as in the first sketch, but each history word w_{i-4}, ..., w_{i-1} is first mapped by a shared embedding layer E before the hidden layer h (FF), the softmax, and the output word w_i.]

• Map each word first into a lower-dimensional real-valued space
• Shared weight matrix E


Details (Bengio et al., 2003)

• Add direct connections from embedding layer to output layer
• Activation functions
  – input → embedding: none
  – embedding → hidden: tanh
  – hidden → output: softmax
• Training
  – loop through the entire corpus
  – update weights from the error between the predicted probabilities and the one-hot vector of the output word
  (see the code sketch after the word-embedding slides below)


Word Embeddings

[Figure: a word's one-hot vector is mapped by the matrix C to its word embedding.]

• By-product: embedding of word into continuous space
• Similar contexts → similar embedding
• Recall: distributional semantics


Word Embeddings

[Figures: two-dimensional visualizations of the learned word embedding space.]
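To make the "First Sketch", "Add a Hidden Layer", and Bengio et al. slides concrete, here is a minimal NumPy sketch of the feed-forward language model: a shared embedding matrix E, a tanh hidden layer, and a softmax output. The sizes, random initialization, and the omission of Bengio's direct embedding-to-output connections and of the training loop are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the slides)
vocab_size, embed_dim, hidden_dim, history = 20000, 64, 128, 4

# Parameters: shared embedding matrix E, hidden layer, output layer
E   = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
W_h = rng.normal(scale=0.01, size=(history * embed_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_o = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))
b_o = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(history_ids):
    """p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) as a vector over the vocabulary."""
    # Multiplying a one-hot vector with E is just a row lookup: embed each history word
    x = np.concatenate([E[w] for w in history_ids])
    h = np.tanh(x @ W_h + b_h)   # hidden layer (tanh, as in the Bengio et al. slide)
    return softmax(h @ W_o + b_o)

p = predict([11, 53, 7, 42])     # four made-up history word ids
print(p.shape, p.sum())          # (20000,), sums to 1
```

Training would loop over the corpus and back-propagate the error between this predicted distribution and the one-hot vector of the observed next word.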
Are Word Embeddings Magic?

• Morphosyntactic regularities (Mikolov et al., 2013)
  – adjectives: base form vs. comparative, e.g., good, better
  – nouns: singular vs. plural, e.g., year, years
  – verbs: present tense vs. past tense, e.g., see, saw
• Semantic regularities
  – clothing is to shirt as dish is to bowl
  – evaluated on human judgment data of semantic similarities


recurrent neural networks


Recurrent Neural Networks

[Figure: recurrent language model at the first time step: word w_1 is embedded and, together with an initial history layer, passed through a tanh hidden layer and a softmax to predict the next word.]

• Start: predict second word from first
• Mystery layer with nodes all with value 1


Recurrent Neural Networks

[Figure: the network unrolled over two time steps (w_1, w_2); the hidden layer at time 1 is copied forward as the history for time 2.]


Recurrent Neural Networks

[Figure: the network unrolled over three time steps (w_1, w_2, w_3), with the hidden layer copied forward at each step.]


Training

[Figure: first training example: w_1 is embedded (E w_t), processed by the recurrent hidden layer h_t, and the softmax prediction y_t is compared to the observed next word in a cost node.]

• Process first training example
• Update weights with back-propagation


Training

[Figure: the same network processing the second training example (w_2).]

• Process second training example
• Update weights with back-propagation
• And so on...
• But: no feedback to previous history


Back-Propagation Through Time

[Figure: the network unrolled over three time steps (w_1, w_2, w_3), each with its own cost node.]

• After processing a few training examples, update through the unfolded recurrent neural network


Back-Propagation Through Time

• Carry out back-propagation through time (BPTT) after each training example
  – 5 time steps seems to be sufficient
  – network learns to store information for more than 5 time steps
• Or: update in mini-batches
  – process 10-20 training examples
  – update backwards through all examples
  – removes need for multiple steps for each training example
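A minimal NumPy sketch of the recurrent language model forward pass described above: the hidden state is copied forward from step to step, each step predicts the next word, and each prediction contributes a cost. Sizes, initialization, and the absence of bias terms are illustrative assumptions; the backward pass (truncated BPTT) is only indicated in a comment rather than implemented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions, not from the slides)
vocab_size, embed_dim, hidden_dim = 1000, 32, 64

E   = rng.normal(scale=0.01, size=(vocab_size, embed_dim))   # word embeddings
W_x = rng.normal(scale=0.01, size=(embed_dim, hidden_dim))   # input -> hidden
W_h = rng.normal(scale=0.01, size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrence)
W_o = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unroll(words, h0=None):
    """Forward pass over a word sequence: predict word t+1 from words 1..t.
    Returns the total cross-entropy cost and the hidden states of the unrolled network."""
    h = np.zeros(hidden_dim) if h0 is None else h0
    states, cost = [], 0.0
    for t in range(len(words) - 1):
        h = np.tanh(E[words[t]] @ W_x + h @ W_h)   # history copied forward via h
        y = softmax(h @ W_o)                       # distribution over the next word
        cost += -np.log(y[words[t + 1]])           # cost against the observed next word
        states.append(h)
    return cost, states

# Truncated BPTT would push gradients back through only the last few
# (e.g., 5) unrolled states, or through a whole mini-batch of 10-20 examples.
cost, states = unroll([4, 17, 250, 3, 99, 2])      # made-up word ids
print(cost, len(states))
```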
long short term memory


Vanishing Gradients

• Error is propagated to previous steps
• Updates consider
  – prediction at that time step
  – impact on future time steps
• Vanishing gradient: propagated error disappears


Recent vs. Early History

• Hidden layer plays double duty
  – memory of the network
  – continuous space representation used to predict output words
• Sometimes only recent context important
  After much economic progress over the years, the country → has
• Sometimes much earlier context important
  The country which has made much economic progress over the years still → has


Long Short Term Memory (LSTM)

• Design quite elaborate, although not very complicated to use
• Basic building block: LSTM cell
  – similar to a node in a hidden layer
  – but: has an explicit memory state
• Output and memory state changes depend on gates
  – input gate: how much new input changes memory state
  – forget gate: how much of prior memory state is retained
  – output gate: how strongly memory state is passed on to next layer
• Gates can be not just open (1) or closed (0), but slightly ajar (e.g., 0.2)


LSTM Cell

[Figure: an LSTM cell. The input X from the preceding layer and the cell's hidden value and memory m from the LSTM layer at time t-1 are combined through the input gate (i), forget gate, and output gate (o) to produce the new memory m and the output h passed to the next layer (Y) and to the LSTM layer at time t.]


LSTM Cell (Math)

• Memory and output values at time step t
  memory_t = gate_{input} × input_t + gate_{forget} × memory_{t-1}
  output_t = gate_{output} × memory_t
• Hidden node value h_t passed on to the next layer applies activation function f
  h_t = f(output_t)
• Input computed as in a recurrent neural network node
  – given node values for the prior layer x^t = (x^t_1, ..., x^t_X)
  – given values for the hidden layer from the previous time step h^{t-1} = (h^{t-1}_1, ..., h^{t-1}_H)
  – input value is a combination of matrix multiplication with weights w^x and w^h and activation function g
    input_t = g( \sum_{i=1}^{X} w^x_i x^t_i + \sum_{i=1}^{H} w^h_i h^{t-1}_i )
  (see the code sketch after the GRU slides below)


Values for Gates

• Gates are very important
• How do we compute their value?
  → with a neural network layer!
• For each gate a ∈ {input, forget, output}
  – weight matrix W^{xa} to consider node values in the previous layer x^t
  – weight matrix W^{ha} to consider the hidden layer h^{t-1} at the previous time step
  – weight matrix W^{ma} to consider the memory at the previous time step memory^{t-1}
  – activation function h
    gate_a = h( \sum_{i=1}^{X} w^{xa}_i x^t_i + \sum_{i=1}^{H} w^{ha}_i h^{t-1}_i + \sum_{i=1}^{H} w^{ma}_i memory^{t-1}_i )


Training

• LSTMs are trained the same way as recurrent neural networks
• Back-propagation through time
• This looks all very complex, but:
  – all the operations are still based on
    ∗ matrix multiplications
    ∗ differentiable activation functions
  → we can compute gradients for the objective function with respect to all parameters
  → we can compute update functions


What is the Point? (from Tran, Bisazza, Monz, 2016)

• Each node has memory memory_i independent from its current output h_i
• Memory may be carried through unchanged (gate^i_{input} = 0, gate^i_{memory} = 1)
⇒ can remember important features over a long time span (capture long-distance dependencies)


Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"

[Figures: activations of individual LSTM cells over text, from Karpathy et al. (2015).]


Gated Recurrent Unit (GRU)

[Figure: a GRU cell. The input X from the preceding layer and the state h from the GRU layer at time t-1 are combined through the update gate and reset gate to produce the new state h passed to the next layer (Y) and to the GRU layer at time t.]


Gated Recurrent Unit (Math)

• Two gates
  update_t = g(W_{update} input_t + U_{update} state_{t-1} + bias_{update})
  reset_t = g(W_{reset} input_t + U_{reset} state_{t-1} + bias_{reset})
• Combination of input and previous state (similar to a traditional recurrent neural network)
  combination_t = f(W input_t + U(reset_t ◦ state_{t-1}))
• Interpolation with previous state
  state_t = (1 − update_t) ◦ state_{t-1} + update_t ◦ combination_t + bias
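The LSTM cell equations above can be written out directly. The sketch below is a minimal NumPy rendering of those slides, not a reference implementation: the sizes, the random initialization, and the choice of the logistic sigmoid for the gate activation h and tanh for g and f are assumptions (the slides leave these activation functions unspecified). Note that, following the "Values for Gates" slide, each gate also looks at the previous memory state via a weight matrix W^{ma}.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes (assumptions, not from the slides)
input_dim, hidden_dim = 16, 32

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Weights for the cell input (activation g) ...
W_x, W_h = init((input_dim, hidden_dim)), init((hidden_dim, hidden_dim))
# ... and, for each gate a, the matrices W^{xa}, W^{ha}, W^{ma}
gates = {a: (init((input_dim, hidden_dim)),
             init((hidden_dim, hidden_dim)),
             init((hidden_dim, hidden_dim)))
         for a in ("input", "forget", "output")}

def lstm_step(x_t, h_prev, mem_prev):
    """One LSTM step following the slide equations:
    memory_t = gate_input * input_t + gate_forget * memory_{t-1}
    output_t = gate_output * memory_t,   h_t = f(output_t)."""
    def gate(a):
        Wxa, Wha, Wma = gates[a]
        return sigmoid(x_t @ Wxa + h_prev @ Wha + mem_prev @ Wma)  # activation h = sigmoid
    inp   = np.tanh(x_t @ W_x + h_prev @ W_h)                      # input_t, activation g = tanh
    mem_t = gate("input") * inp + gate("forget") * mem_prev
    h_t   = np.tanh(gate("output") * mem_t)                        # activation f = tanh
    return h_t, mem_t

h, mem = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # five dummy time steps
    h, mem = lstm_step(x, h, mem)
print(h.shape, mem.shape)
```

The GRU step is analogous, with the update and reset gates and the state interpolation of the previous slide in place of the explicit memory.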
deeper models


Deep Learning?

[Figure: the shallow recurrent language model unrolled over three time steps: input word x_t, embedding E, a single recurrent hidden layer h_t, and softmax output y_t.]

• Not much deep learning so far
• Between input and predicted output: only one hidden layer
• How about more hidden layers?


Deep Models

[Figure: two deep variants, "deep stacked" and "deep transitional", unrolled over three time steps; each has three recurrent hidden layers h_{t,1}, h_{t,2}, h_{t,3} between the embedded input x_i and the softmax output y_t.]

(a code sketch of the stacked variant follows the closing slide)


questions?
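To make the "Deep Models" slide concrete, here is a minimal NumPy sketch of the deep stacked variant: at each time step, the output of recurrent layer k is the input of layer k+1, and only the top layer feeds the softmax. Sizes and initialization are illustrative assumptions; roughly, the deep transitional variant would instead chain several transformations of the hidden state within a single time step.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sizes (assumptions, not from the slides)
embed_dim, hidden_dim, num_layers, vocab_size = 32, 64, 3, 1000

def init(shape):
    return rng.normal(scale=0.1, size=shape)

E = init((vocab_size, embed_dim))
layers = [{"W_x": init((embed_dim if k == 0 else hidden_dim, hidden_dim)),
           "W_h": init((hidden_dim, hidden_dim))} for k in range(num_layers)]
W_o = init((hidden_dim, vocab_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, hiddens):
    """One time step of a deep stacked RNN: layer k's output is layer k+1's input."""
    x = E[word_id]
    new_hiddens = []
    for k, layer in enumerate(layers):
        h = np.tanh(x @ layer["W_x"] + hiddens[k] @ layer["W_h"])
        new_hiddens.append(h)
        x = h                               # feed the state up the stack
    return softmax(x @ W_o), new_hiddens    # predict from the top layer only

hiddens = [np.zeros(hidden_dim) for _ in range(num_layers)]
for w in [4, 17, 250, 3]:                   # made-up word ids
    p, hiddens = step(w, hiddens)
print(p.shape)                              # (1000,)
```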