Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks (oh, and finish neural dependency parsing :))

Lecture Plan
1. Neural dependency parsing (20 mins)
2. A bit more about neural networks (15 mins)
3. Language modeling + RNNs (45 mins)
• A new NLP task: Language Modeling
• A new family of neural networks: Recurrent Neural Networks (RNNs)
These are two of the most important concepts for the rest of the class!

Reminders: You should have handed in Assignment 2 by today. In Assignment 3, out today, you build a neural dependency parser using PyTorch.

1. What do we gain from a neural dependency parser?

Indicator Features Revisited
• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation
More than 95% of parsing time is consumed by feature computation.
Neural approach: learn a dense and compact feature representation (a dense vector like [0.1, 0.9, −0.2, 0.3, −0.1, −0.5, …], dim ≈ 1000).

A neural dependency parser [Chen and Manning 2014]
• Results on English parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = head
• Labeled attachment score (LAS) = head and label

Parser        UAS   LAS   sent./s
MaltParser    89.8  87.2  469
MSTParser     91.4  88.1  10
TurboParser   92.3  89.6  8
C & M 2014    92.0  89.7  654

First win: Distributed Representations
• We represent each word as a d-dimensional dense vector (i.e., word embedding)
• Similar words are expected to have close vectors, e.g., "come" and "go", or "was", "were", and "is"
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors
• The smaller discrete sets also exhibit many semantic similarities:
  NNS (plural noun) should be close to NN (singular noun).
  nummod (numerical modifier) should be close to amod (adjective modifier).

Extracting tokens & vector representations from a configuration
• We extract a set of tokens based on the stack / buffer positions:

        s1    s2   b1       lc(s1)  rc(s1)  lc(s2)  rc(s2)
word    good  has  control  ∅       ∅       He      ∅
POS     JJ    VBZ  NN       ∅       ∅       PRP     ∅
dep.    ∅     ∅    ∅        ∅       ∅       nsubj   ∅

A concatenation of the vector representations (word + POS + dep.) of all these tokens is the neural representation of a configuration.

Second win: Deep learning classifiers are non-linear classifiers
• A softmax classifier assigns classes y ∈ C based on inputs x ∈ ℝ^d via the probability:
  p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)
• We train the weight matrix W ∈ ℝ^{C×d} to minimize the negative log loss (a.k.a. "cross entropy loss"): Σ_i −log p(y_i | x_i)
• Traditional ML classifiers (including Naïve Bayes, SVMs, logistic regression, and the softmax classifier) are not very powerful classifiers: they only give linear decision boundaries.
This can be quite limiting → unhelpful when a problem is complex. Wouldn't it be cool to get these correct?

Neural networks are more powerful
• Neural networks can learn much more complex functions with nonlinear decision boundaries!
• Non-linear in the original space, linear for the softmax at the top of the neural network
Visualizations with ConvNetJS by Andrej Karpathy:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
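To make the contrast concrete, here is a minimal PyTorch sketch of a linear softmax classifier and its negative log loss, followed by the same inputs run through one ReLU hidden layer before the softmax. All sizes and the chosen "true class" are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn.functional as F

# Toy sizes (illustrative): d = 4 input features, C = 3 classes
x = torch.randn(4)             # input x ∈ R^d
W = torch.randn(3, 4)          # weight matrix W ∈ R^{C×d}
p = F.softmax(W @ x, dim=0)    # p(y|x) = softmax(Wx): linear decision boundaries
loss = -torch.log(p[1])        # negative log probability of the (assumed) true class 1

# Adding one hidden layer with a nonlinearity gives non-linear decision boundaries,
# while the softmax at the top stays linear (in the learned hidden space):
W1 = torch.randn(8, 4)
U = torch.randn(3, 8)
h = torch.relu(W1 @ x)              # h = ReLU(W1 x)   (biases omitted for brevity)
p_nn = F.softmax(U @ h, dim=0)      # still a softmax, now over the re-represented input h
```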
Simple feed-forward neural network multi-class classifier
Input layer x: x is the result of an embedding lookup (x_(i,…,i+d) = Le), i.e., lookup + concat
Hidden layer h: h = ReLU(Wx + b1), where ReLU = Rectified Linear Unit, rect(z) = max(z, 0)
Output layer y: y = softmax(Uh + b2) gives softmax probabilities
The log loss (cross-entropy error) will be backpropagated to the embeddings.
The hidden layer re-represents the input, moving it around in an intermediate vector space so that it can be easily classified with a (linear) softmax.

Neural Dependency Parser Model Architecture
Input layer x: lookup + concat
Hidden layer h: h = ReLU(Wx + b1)
Output layer y: y = softmax(Uh + b2), softmax probabilities over { Shift, Left-Arc_r, Right-Arc_r }

Dependency parsing for sentence structure
Chen and Manning (2014) showed that neural networks can accurately determine the structure of sentences, supporting meaning interpretation.
It was the first simple, successful neural dependency parser.
The dense representations (and non-linear classifier) let it outperform other greedy parsers in both accuracy and speed.

Further developments in transition-based neural dependency parsing
This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence
Leading to SyntaxNet and the Parsey McParseFace model (2016): "The World's Most Accurate Parser"
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method                 UAS    LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0   89.7
Weiss et al. 2015      93.99  92.05
Andor et al. 2016      94.61  92.79

Graph-based dependency parsers
• Compute a score for every possible dependency (choice of head) for each word, e.g., picking the head for "big" in "ROOT The big cat sat"
• Doing this well requires more than just knowing the two words
• We need good "contextual" representations of each word token, which we will develop in the coming lectures
• Repeat the same process for each other word; find the best parse (MST algorithm)

A neural graph-based dependency parser [Dozat and Manning 2017; Dozat, Qi, and Manning 2017]
• This paper revived interest in graph-based dependency parsing in a neural world
• Designed a biaffine scoring model for neural dependency parsing
• Also crucially uses a neural sequence model, something we discuss next week
• Really great results!
• But slower than the simple neural transition-based parsers
• There are n² possible dependencies in a sentence of length n

Method                 UAS    LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0   89.7
Weiss et al. 2015      93.99  92.05
Andor et al. 2016      94.61  92.79
Dozat & Manning 2017   95.74  94.08
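To close out this section and connect to Assignment 3, here is a minimal sketch of the transition-based classifier described above: lookup + concat of word, POS, and dependency-label embeddings, a ReLU hidden layer, and a softmax over transitions. The feature counts and sizes echo Chen & Manning (2014) but are illustrative, and the three-way output collapses the labeled transitions for brevity.

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Minimal sketch of the parser's configuration classifier (sizes are illustrative)."""
    def __init__(self, n_word_feats=18, n_pos_feats=18, n_dep_feats=12,
                 vocab_size=50000, n_pos=45, n_dep=40, d=50, hidden=200, n_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(n_pos, d)
        self.dep_emb = nn.Embedding(n_dep, d)
        in_dim = (n_word_feats + n_pos_feats + n_dep_feats) * d
        self.hidden = nn.Linear(in_dim, hidden)           # h = ReLU(Wx + b1)
        self.output = nn.Linear(hidden, n_transitions)    # y = softmax(Uh + b2); in practice
                                                          # one class per labeled transition

    def forward(self, word_ids, pos_ids, dep_ids):
        # lookup + concat: dense vectors for the tokens extracted from the stack/buffer configuration
        x = torch.cat([self.word_emb(word_ids).flatten(1),
                       self.pos_emb(pos_ids).flatten(1),
                       self.dep_emb(dep_ids).flatten(1)], dim=1)
        h = torch.relu(self.hidden(x))
        return self.output(h)   # logits; the softmax / cross-entropy is applied in the loss
```

Training then minimizes the cross-entropy loss above over the correct transitions for each configuration.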
2. A bit more about neural networks

Regularization!
We have models with many parameters!
• A full loss function includes regularization over all parameters θ, e.g., L2 regularization:
  J(θ) = (1/N) Σ_{i=1..N} −log p(y_i | x_i) + λ Σ_k θ_k²
• Classic view: Regularization works to prevent overfitting when we have a lot of features (or later a very powerful/deep model, etc.)
• Now: Regularization produces models that generalize well when we have a "big" model
• We do not care that our models overfit on the training data, even though they are hugely overfit
(As model "power" grows, training error keeps falling toward 0 while test error eventually rises again: the overfitting regime.)

Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
Preventing feature co-adaptation = good regularization method!
• Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (since now twice as many inputs are active)
• (Except usually only drop first-layer inputs a little (~15%) or not at all)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• In a single layer: a kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging (i.e., like an ensemble model)
• Nowadays usually thought of as a strong, feature-dependent regularizer [Wager, Wang, & Liang 2013]

"Vectorization"
• E.g., looping over word vectors versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
  1000 loops, best of 3: 639 µs per loop (looping over word vectors)
  10000 loops, best of 3: 53.8 µs per loop ← now using a single C × N matrix
• Matrices are awesome!!! Always try to use vectors and matrices rather than for loops!
• The speed gain goes from 1 to 2 orders of magnitude with GPUs!

Non-linearities, old and new
logistic ("sigmoid"), tanh, hard tanh, ReLU (Rectified Linear Unit) rect(z) = max(z, 0), Leaky ReLU / Parametric ReLU, Swish [Ramachandran, Zoph & Le 2017]
tanh is just a rescaled and shifted sigmoid (2× as steep, range [−1, 1]): tanh(z) = 2·logistic(2z) − 1
Both logistic and tanh are still used in various places (e.g., to get a probability), but are no longer the defaults for making deep networks.
For building a deep network, the first thing you should try is ReLU: it trains quickly and performs well due to good gradient backflow.

Parameter Initialization
• You normally must initialize weights to small random values (i.e., not zero matrices!)
  • To avoid symmetries that prevent learning/specialization
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g., the mean target, or the inverse sigmoid of the mean target)
• Initialize all other weights ~ Uniform(−r, r), with r chosen so numbers get neither too big nor too small
  [later, the need for this is removed with the use of layer normalization]
• Xavier initialization has variance inversely proportional to fan-in n_in (previous layer size) and fan-out n_out (next layer size):
  Var(W_i) = 2 / (n_in + n_out)
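In PyTorch, this initialization recipe might look like the following minimal sketch (layer sizes are illustrative assumptions).

```python
import torch.nn as nn

# A minimal sketch: initialize one hidden layer as described above.
layer = nn.Linear(in_features=200, out_features=100)   # n_in = 200, n_out = 100

nn.init.zeros_(layer.bias)               # hidden-layer biases start at 0
nn.init.xavier_uniform_(layer.weight)    # Uniform(-r, r) with r = sqrt(6 / (n_in + n_out)),
                                         # i.e., Var(W_i) = 2 / (n_in + n_out)
```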
Optimizers
• Usually, plain SGD will work just fine!
• However, getting good results will often require hand-tuning the learning rate (see Learning Rates below)
• For more complex nets and situations, or just to avoid worry, you often do better with one of a family of more sophisticated "adaptive" optimizers that scale the parameter adjustment by an accumulated gradient.
• These models give differential per-parameter learning rates:
  • Adagrad
  • RMSprop
  • Adam ← a fairly good, safe place to begin in many cases
  • SparseAdam
  • …

Learning Rates
• You can just use a constant learning rate. Start around lr = 0.001?
  • It must be the right order of magnitude, so try powers of 10
  • Too big: the model may diverge or not converge
  • Too small: your model may not have trained by the assignment deadline
• Better results can generally be obtained by allowing learning rates to decrease as you train
  • By hand: halve the learning rate every k epochs
    • An epoch = a pass through the data (shuffled or sampled, not in the same order each time)
  • By a formula: lr = lr_0 · e^(−kt), for epoch t
  • There are fancier methods like cyclic learning rates (q.v.)
• Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks, so you may want to start with a higher learning rate

3. Language Modeling + RNNs

Language Modeling
• Language Modeling is the task of predicting what word comes next:
  the students opened their ______   (exams? minds? laptops? books?)
• More formally: given a sequence of words x^(1), x^(2), …, x^(t), compute the probability distribution of the next word x^(t+1):
  P(x^(t+1) | x^(t), …, x^(1))
  where x^(t+1) can be any word in the vocabulary V
• A system that does this is called a Language Model

Language Modeling
• You can also think of a Language Model as a system that assigns probability to a piece of text
• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:
  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × ⋯ × P(x^(T) | x^(T−1), …, x^(1))
                     = ∏_{t=1..T} P(x^(t) | x^(t−1), …, x^(1))
  Each conditional term is exactly what our LM provides.

You use Language Models every day! (e.g., predictive typing, search query completion)

n-gram Language Models
the students opened their ______
• Question: How to learn a Language Model?
• Answer (pre-deep learning): learn an n-gram Language Model!
• Definition: An n-gram is a chunk of n consecutive words.
  • unigrams: "the", "students", "opened", "their"
  • bigrams: "the students", "students opened", "opened their"
  • trigrams: "the students opened", "students opened their"
  • 4-grams: "the students opened their"
• Idea: Collect statistics about how frequent different n-grams are and use these to predict the next word.

n-gram Language Models
• First we make a Markov assumption: x^(t+1) depends only on the preceding n−1 words:
  P(x^(t+1) | x^(t), …, x^(1)) ≈ P(x^(t+1) | x^(t), …, x^(t−n+2))   (assumption / statistical approximation)
                               = P(x^(t+1), x^(t), …, x^(t−n+2)) / P(x^(t), …, x^(t−n+2))   (definition of conditional prob: prob of an n-gram over prob of an (n−1)-gram)
• Question: How do we get these n-gram and (n−1)-gram probabilities?
• Answer: By counting them in some large corpus of text!
  ≈ count(x^(t+1), x^(t), …, x^(t−n+2)) / count(x^(t), …, x^(t−n+2))

n-gram Language Models: Example
Suppose we are learning a 4-gram Language Model:
  as the proctor started the clock, the students opened their _____
We discard everything but the last 3 words and condition on "students opened their".
For example, suppose that in the corpus:
• "students opened their" occurred 1000 times
• "students opened their books" occurred 400 times
  → P(books | students opened their) = 0.4
• "students opened their exams" occurred 100 times
  → P(exams | students opened their) = 0.1
Should we have discarded the "proctor" context?

Sparsity Problems with n-gram Language Models
• Sparsity Problem 1: What if "students opened their w" never occurred in the data? Then w has probability 0!
  (Partial) Solution: Add a small δ to the count for every w ∈ V. This is called smoothing.
• Sparsity Problem 2: What if "students opened their" never occurred in the data? Then we can't calculate the probability for any w!
  (Partial) Solution: Just condition on "opened their" instead. This is called backoff.
Note: Increasing n makes sparsity problems worse. Typically, we can't have n bigger than 5.
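As a concrete illustration of the counting and smoothing ideas above, here is a minimal sketch of a 4-gram model estimated from a toy corpus. The corpus and δ are illustrative, and backoff is left out for brevity.

```python
from collections import Counter

# A minimal sketch of 4-gram count estimation with add-δ smoothing.
corpus = ("as the proctor started the clock the students opened their exams "
          "the students opened their books").split()
n = 4
ngram_counts, context_counts = Counter(), Counter()
for i in range(len(corpus) - n + 1):
    context, word = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    ngram_counts[(context, word)] += 1
    context_counts[context] += 1

vocab = set(corpus)
delta = 0.5

def prob(word, context):
    """P(w | context) ≈ (count(context, w) + δ) / (count(context) + δ·|V|)."""
    context = tuple(context)
    return ((ngram_counts[(context, word)] + delta) /
            (context_counts[context] + delta * len(vocab)))

print(prob("books", ["students", "opened", "their"]))   # relatively high: it was observed
print(prob("zebra", ["students", "opened", "their"]))   # small but non-zero, thanks to smoothing
```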
Storage Problems with n-gram Language Models
Storage: Need to store the count for all n-grams you saw in the corpus.
Increasing n or increasing the corpus increases model size!

n-gram Language Models in practice
• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters: business and financial news) in a few seconds on your laptop*
  today the _______  (condition on this; get a probability distribution over the next word):
  company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039, …
  Sparsity problem: not much granularity in the probability distribution. Otherwise, seems reasonable!
* Try for yourself: https://nlpforhackers.io/language-models/

Generating text with an n-gram Language Model
You can also use a Language Model to generate text: condition on the text so far, get the probability distribution over the next word, and sample from it.
  today the _______ → company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039, … → sample "price"
  today the price _______ → of 0.308, for 0.050, it 0.046, to 0.046, is 0.031, … → sample "of"
  today the price of _______ → the 0.072, 18 0.043, oil 0.043, its 0.036, gold 0.018, … → sample "gold"
Continuing this process:
  today the price of gold per ton , while production of shoe lasts and shoe industry , the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks , sept 30 end primary 76 cts a share .
Surprisingly grammatical! …but incoherent. We need to consider more than three words at a time if we want to model language well.
But increasing n worsens the sparsity problem, and increases model size…

How to build a neural Language Model?
• Recall the Language Modeling task:
  • Input: sequence of words x^(1), x^(2), …, x^(t)
  • Output: probability distribution of the next word, P(x^(t+1) | x^(t), …, x^(1))
• How about a window-based neural model?
• We saw this applied to Named Entity Recognition in Lecture 3 (e.g., tagging "Paris" as LOCATION in "museums in Paris are amazing")

A fixed-window neural Language Model
  as the proctor started the clock the students opened their ______
Discard the context outside a fixed window. The words in the window ("the students opened their") are one-hot vectors, looked up as word embeddings and concatenated; a hidden layer and an output softmax then give a distribution over the next word (books? laptops? … a? zoo?).
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model

A fixed-window neural Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don't need to store all observed n-grams
Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• The window can never be large enough!
• x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.
We need a neural architecture that can process any length input.

Recurrent Neural Networks (RNN)
A family of neural architectures.
Core idea: Apply the same weights W repeatedly across an input sequence (of any length), producing a sequence of hidden states and (optional) outputs.

A Simple RNN Language Model
  the students opened their ______   (books? laptops? … a? zoo?)
• words / one-hot vectors: x^(t)
• word embeddings: e^(t) = E x^(t)
• hidden states: h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b_1), where h^(0) is the initial hidden state
• output distribution: ŷ^(t) = softmax(U h^(t) + b_2)
Note: this input sequence could be much longer now!
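To make this concrete, here is a minimal PyTorch sketch of such a simple RNN-LM. All sizes are illustrative, and the last few lines preview the training objective (teacher forcing with cross-entropy) discussed next.

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    """A minimal sketch of the simple RNN-LM above (all sizes are illustrative)."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # e^(t) = E x^(t)
        # h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b_1): the same weights at every timestep
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)         # logits for ŷ^(t) = softmax(U h^(t) + b_2)

    def forward(self, token_ids, h0=None):
        e = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        h, h_last = self.rnn(e, h0)      # h: (batch, seq_len, hidden_dim)
        return self.out(h), h_last       # a distribution over the vocabulary at every position

# Preview of training: predict token t+1 from tokens ≤ t (teacher forcing) with cross-entropy.
model = SimpleRNNLM()
tokens = torch.randint(0, 10000, (32, 20))    # a toy batch of 32 sequences of length 20
logits, _ = model(tokens[:, :-1])             # inputs x^(1..T-1)
loss = nn.functional.cross_entropy(           # average of per-step -log ŷ^(t) for the true next word
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```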
RNN Language Models
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn't increase for longer input context
• The same weights are applied on every timestep, so there is symmetry in how inputs are processed
RNN Disadvantages:
• Recurrent computation is slow
• In practice, it is difficult to access information from many steps back
More on these later in the course.

Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
  • i.e., predict the probability distribution of every word, given the words so far
• The loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)):
  J^(t)(θ) = CE(y^(t), ŷ^(t)) = −Σ_{w∈V} y_w^(t) log ŷ_w^(t) = −log ŷ^(t)_{x^(t+1)}
• Average this to get the overall loss for the entire training set:
  J(θ) = (1/T) Σ_{t=1..T} J^(t)(θ)
For example, on the corpus "the students opened their exams …", the loss at successive steps is the negative log probability of "students", then of "opened", then of "their", then of "exams", and so on; averaging these gives J(θ). Feeding in the true previous words (rather than the model's own predictions) is called "teacher forcing".

Training an RNN Language Model
• However: Computing the loss and gradients across the entire corpus at once is too expensive!
• In practice, consider x^(1), …, x^(T) to be a sentence (or a document)
• Recall: Stochastic Gradient Descent allows us to compute loss and gradients for a small chunk of data, and update.
• Compute the loss for a sentence (actually, a batch of sentences), compute gradients, and update weights. Repeat.

Backpropagation for RNNs
Question: What's the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h?
Answer: "The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears":
  ∂J^(t)/∂W_h = Σ_{i=1..t} ∂J^(t)/∂W_h |_(i)
Why? The multivariable chain rule: for f(x, y) with x = x(t) and y = y(t),
  d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
Proof sketch: in our example, W_h appears once at every timestep; each appearance contributes its own term (and the derivative of each appearance w.r.t. W_h is 1), so the per-timestep gradients sum.
Question: How do we calculate this?
Answer: Backpropagate over timesteps i = t, …, 0, summing gradients as you go.
This algorithm is called "backpropagation through time" [Werbos, P.G., 1988, Neural Networks 1, and others]

Generating text with an RNN Language Model
Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. The sampled output becomes the next step's input:
  my favorite season is → sample "spring" → …
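Before looking at some fun examples, here is what that repeated-sampling loop can look like in code: a minimal sketch that reuses the SimpleRNNLM sketched earlier. The start token id and number of steps are illustrative assumptions.

```python
import torch

def generate(model, start_id, num_steps=20):
    """Generate token ids by repeated sampling from an RNN-LM (minimal sketch)."""
    model.eval()
    tokens, h = [start_id], None
    with torch.no_grad():
        for _ in range(num_steps):
            inp = torch.tensor([[tokens[-1]]])                 # last sampled word is the next input
            logits, h = model(inp, h)                          # carry the hidden state forward
            probs = torch.softmax(logits[0, -1], dim=-1)       # distribution over the vocabulary
            tokens.append(torch.multinomial(probs, 1).item())  # sample, rather than taking the argmax
    return tokens
```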
Generating text with an RNN Language Model
Let's have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
  Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
• RNN-LM trained on Harry Potter:
  Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
• RNN-LM trained on recipes:
  Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
• RNN-LM trained on paint color names:
  Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network
  This last one is an example of a character-level RNN-LM (it predicts what character comes next).

Evaluating Language Models
• The standard evaluation metric for Language Models is perplexity:
  perplexity = ∏_{t=1..T} ( 1 / P_LM(x^(t+1) | x^(t), …, x^(1)) )^(1/T)
  i.e., the inverse probability of the corpus according to the Language Model, normalized by the number of words
• This is equal to the exponential of the cross-entropy loss J(θ): perplexity = exp(J(θ))
• Lower perplexity is better!

RNNs have greatly improved perplexity
Moving from n-gram models to increasingly complex RNNs, perplexity improves (lower is better).
Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/

Why should we care about Language Modeling?
• Language Modeling is a benchmark task that helps us measure our progress on understanding language
• Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
  • Predictive typing
  • Speech recognition
  • Handwriting recognition
  • Spelling/grammar correction
  • Authorship identification
  • Machine translation
  • Summarization
  • Dialogue
  • etc.

Recap
• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We've shown that RNNs are a great way to build an LM.
• But RNNs are useful for much more!

RNNs can be used for tagging
e.g., part-of-speech tagging, named entity recognition:
  the/DT startled/JJ cat/NN knocked/VBN over/IN the/DT vase/NN

RNNs can be used for sentence classification
e.g., sentiment classification: "overall I enjoyed the movie a lot" → positive
How to compute the sentence encoding?
• Basic way: use the final hidden state
• Usually better: take an element-wise max or mean of all hidden states

RNNs can be used as an encoder module
e.g., question answering, machine translation, many other tasks!
  Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
  Question: what nationality was Beethoven?
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system, with lots more neural architecture on top that produces the answer: German.
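As a small concrete example of these encoder-style uses, here is a minimal sketch of an RNN sentence encoder for classification, pooling the hidden states with an element-wise max as suggested above. The sizes and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    """Minimal sketch: run an RNN over a sentence and classify a pooled encoding."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=200, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        h, h_last = self.rnn(self.embed(token_ids))   # h: (batch, seq_len, hidden_dim)
        # Basic way: sentence_enc = h_last[-1]; usually better: element-wise max (or mean)
        sentence_enc = h.max(dim=1).values
        return self.classifier(sentence_enc)          # e.g., logits for positive vs. negative
```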
RNN-LMs can be used to generate text
e.g., speech recognition, machine translation, summarization:
  Input (audio) → conditioning → RNN-LM → "what's the weather …"
This is an example of a conditional language model: the RNN-LM generates text conditioned on an input (here, audio). We'll see Machine Translation in much more detail later.

Terminology and a look forward
The RNN described in this lecture = simple/vanilla/Elman RNN
Next lecture: You will learn about other RNN flavors like GRU and LSTM, and multi-layer RNNs
By the end of the course: You will understand phrases like "stacked bidirectional LSTM with residual connections and self-attention"