Recurrent Neural Networks (RNNs)
A family of neural architectures. Core idea: apply the same weights W repeatedly. A chain of hidden states reads an input sequence (of any length) one step at a time, optionally producing an output at every step.

A Simple RNN Language Model
[Figure: an RNN unrolled over the input "the students opened their", producing an output distribution over the vocabulary (books, laptops, ..., a, ..., zoo).]
• Words / one-hot vectors: x^(t) in R^|V|
• Word embeddings: e^(t) = E x^(t)
• Hidden states: h^(t) = σ(W_h h^(t-1) + W_e e^(t) + b_1), where h^(0) is the initial hidden state
• Output distribution: ŷ^(t) = softmax(U h^(t) + b_2) in R^|V|
• Note: this input sequence could be much longer now! (A code sketch of this model appears later in this section.)

RNN Language Models
RNN Advantages:
• Can process input of any length
• Computation for step t can (in theory) use information from many steps back
• Model size doesn't increase for longer input context
• The same weights are applied on every timestep, so there is symmetry in how inputs are processed
RNN Disadvantages:
• Recurrent computation is slow (the steps must be computed one after another)
• In practice, it is difficult to access information from many steps back
More on these later in the course.

Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), ..., x^(T).
• Feed it into the RNN-LM and compute the output distribution ŷ^(t) for every step t, i.e. predict the probability distribution of every word given the words so far.
• The loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (a one-hot vector for x^(t+1)):
    J^(t)(θ) = CE(y^(t), ŷ^(t)) = -Σ_{w in V} y^(t)_w log ŷ^(t)_w = -log ŷ^(t)_{x_{t+1}}
• Average this over all timesteps to get the overall loss for the entire training set:
    J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
[Figure: the RNN-LM run over the corpus "the students opened their exams ...". Each predicted probability distribution is scored against the actual next word: J^(1) = negative log prob of "students", J^(2) = negative log prob of "opened", J^(3) = negative log prob of "their", J^(4) = negative log prob of "exams", and so on; the per-step losses are summed and averaged to give the total loss. Feeding the model the true corpus words as inputs, rather than its own predictions, is called "teacher forcing".]
• However: computing the loss and gradients across the entire corpus at once is too expensive!
• In practice, treat x^(1), ..., x^(T) as a sentence (or a document).
• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data, then update.
• So: compute the loss for a sentence (actually, a batch of sentences), compute gradients, and update the weights. Repeat. (A sketch of one such training step follows below.)

Backpropagation for RNNs
• Question: how do we calculate the gradient of the loss J^(t) with respect to the repeated weight matrix W_h?
• Answer: backpropagate over timesteps i = t, ..., 0, summing gradients as you go:
    ∂J^(t)/∂W_h = Σ_{i=1}^{t} ∂J^(t)/∂W_h |_(i)
  where |_(i) denotes the gradient through the copy of W_h used at timestep i.
• This algorithm is called "backpropagation through time" [Werbos, P.G., 1988, Neural Networks 1, and others].

Generating text with an RNN Language Model
• Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling: the sampled output becomes the next step's input.
[Figure: starting from "my", the model samples "favorite", then "season", then "is", then "spring", feeding each sampled word back in as the next input: "my favorite season is spring".]
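Below is a minimal PyTorch sketch of the simple RNN language model above. The class name, variable names, and hyperparameters (embed_dim, hidden_dim, and so on) are illustrative assumptions rather than anything fixed by the lecture; nn.RNN implements the recurrence h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b_1), and the linear output layer plays the role of U h^(t) + b_2.

```python
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Simple RNN-LM: word indices -> embeddings -> recurrent hidden states -> distribution over the vocab."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # e^(t) = E x^(t)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b_1)
        self.out = nn.Linear(hidden_dim, vocab_size)                # logits for y-hat^(t) = softmax(U h^(t) + b_2)

    def forward(self, word_ids, h0=None):
        # word_ids: (batch, seq_len) integer word indices; the same weights are used at every timestep.
        emb = self.embed(word_ids)          # (batch, seq_len, embed_dim)
        hidden, h_last = self.rnn(emb, h0)  # hidden: (batch, seq_len, hidden_dim)
        logits = self.out(hidden)           # (batch, seq_len, vocab_size)
        return logits, h_last
```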
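Continuing the sketch, here is one SGD training step on a batch of sentences. This is a hedged illustration, not official course code: it assumes the batch is a tensor of equal-length word-index sequences, feeds the true corpus words as inputs (teacher forcing), averages the per-step cross-entropy J^(t) over all timesteps, and relies on loss.backward() to perform backpropagation through time (autograd sums the gradients over timesteps for the shared weights).

```python
import torch
import torch.nn.functional as F

model = RNNLanguageModel(vocab_size=10_000)              # vocabulary size is an assumed placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate is an assumption too

def training_step(batch):
    """One SGD update on a batch of sentences; batch is a (num_sentences, seq_len) tensor of word indices."""
    inputs  = batch[:, :-1]   # feed the true corpus words x^(1)..x^(T-1) as inputs ("teacher forcing")
    targets = batch[:, 1:]    # predict the next word x^(2)..x^(T) at every step

    logits, _ = model(inputs)                             # (num_sentences, T-1, vocab_size)
    # Per-step cross-entropy J^(t), averaged over all timesteps (and all sentences in the batch):
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()    # backpropagation through time: autograd sums gradients over timesteps
    optimizer.step()
    return loss.item()

# e.g. training_step(torch.tensor([[1, 4, 7, 2, 9]]))  # made-up word indices for "the students opened their exams"
```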
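And a sketch of generation by repeated sampling, reusing the model above: sample a word from the output distribution ŷ^(t), feed it back in as the next step's input, and repeat. The function name generate, the prefix handling, and the word2id helper in the usage comment are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def generate(model, prefix_ids, max_len=20):
    """Generate word indices by repeatedly sampling from the model's output distribution."""
    model.eval()
    ids = list(prefix_ids)
    with torch.no_grad():
        # Run the prefix through the model to get the current hidden state.
        logits, h = model(torch.tensor([ids]))
        for _ in range(max_len):
            probs = F.softmax(logits[0, -1], dim=-1)                   # y-hat^(t): distribution over the vocab
            next_id = torch.multinomial(probs, num_samples=1).item()   # sample the next word
            ids.append(next_id)
            # The sampled word becomes the next step's input.
            logits, h = model(torch.tensor([[next_id]]), h)
    return ids

# e.g. generate(model, prefix_ids=[word2id[w] for w in "my favorite season is".split()])
# (word2id is a hypothetical word-to-index mapping)
```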
Generating text with an RNN Language Model: let's have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches: Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
• RNN-LM trained on Harry Potter: Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
• RNN-LM trained on recipes: Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
• RNN-LM trained on paint color names: Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network
• The paint-color model is an example of a character-level RNN-LM: it predicts what character comes next.
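For a character-level RNN-LM like the paint-color model, the only change is what counts as a token: the "vocabulary" is the set of characters, and the model predicts the next character. A small sketch of that setup, where paint_colors.txt is a stand-in path for whatever training text is used:

```python
import torch

# Character-level vocabulary: the tokens are characters rather than words.
text = open("paint_colors.txt").read()        # stand-in path for the training corpus
chars = sorted(set(text))
char2id = {c: i for i, c in enumerate(chars)}
id2char = {i: c for c, i in char2id.items()}

# The same RNNLanguageModel, training step, and sampling loop apply unchanged,
# just with vocab_size=len(chars) and sequences of character indices:
model = RNNLanguageModel(vocab_size=len(chars))
char_ids = torch.tensor([[char2id[c] for c in text[:100]]])   # first 100 characters as one training sequence
```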