Natural Language Processing with Deep Learning
CS224N/Ling284
John Hewitt
Lecture 9: Self-Attention and Transformers

Lecture Plan
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

Reminders:
• Assignment 4 due on Thursday!
• Mid-quarter feedback survey due Tuesday, Feb 16 at 11:59PM PST!
• Final project proposal due Tuesday, Feb 16 at 4:30PM PST!
• Please try to hand in the project proposal on time; we want to get you feedback quickly!

As of last week: recurrent models for (most) NLP!
• Circa 2016, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM (for example, the source sentence in a translation).
• Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate it.
• Use attention to allow flexible access to memory.

Today: Same goals, different building blocks
• Last week, we learned about sequence-to-sequence problems and encoder-decoder models.
• Today, we're not trying to motivate entirely new ways of looking at problems (like Machine Translation).
• Instead, we're trying to find the best building blocks to plug into our models and enable broad progress.
[Figure: a timeline. 2014-2017ish: recurrence; then lots of trial and error; 2021: ??????]

Issues with recurrent models: Linear interaction distance
• RNNs are unrolled "left-to-right".
• This encodes linear locality: a useful heuristic!
• Nearby words often affect each other's meanings ("tasty pizza").
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact.
[Figure: in "The chef who … was", the words "chef" and "was" are O(sequence length) steps apart in the unrolled RNN]

Issues with recurrent models: Linear interaction distance
• O(sequence length) steps for distant word pairs to interact means:
  • Hard to learn long-distance dependencies (because gradient problems!)
  • Linear order of words is "baked in"; we already know linear order isn't the right way to think about sentences…
[Figure: information about "chef" has gone through O(sequence length) many layers before it can interact with "was"]

Issues with recurrent models: Lack of parallelizability
• Forward and backward passes have O(sequence length) unparallelizable operations.
• GPUs can perform a bunch of independent computations at once!
• But future RNN hidden states can't be computed in full before past RNN hidden states have been computed.
• Inhibits training on very large datasets!
[Figure: hidden states h_1 … h_T; the numbers above each state indicate the minimum number of steps before that state can be computed, growing from 0 up to T]

If not recurrence, then what? How about word windows?
• Word window models aggregate local contexts.
• (Also known as 1D convolution; we'll go over this in depth later!)
• The number of unparallelizable operations does not increase with sequence length!
[Figure: a stack of word-window layers over the embeddings; the minimum number of steps before a state can be computed is a small constant (0, 1, 2, …) regardless of sequence length]

If not recurrence, then what? How about word windows?
• Word window models aggregate local contexts.
• What about long-distance dependencies?
• Stacking word window layers allows interaction between farther words.
• Maximum interaction distance = sequence length / window size.
• (But if your sequences are too long, you'll just ignore long-distance context.)
[Figure: with windows of size 5, only the states "visible" to h_k are highlighted; words too far from h_k are never considered]

If not recurrence, then what? How about attention?
• Attention treats each word's representation as a query to access and incorporate information from a set of values.
• We saw attention from the decoder to the encoder; today we'll think about attention within a single sentence.
• The number of unparallelizable operations does not increase with sequence length.
• Maximum interaction distance: O(1), since all words interact at every layer!
[Figure: stacked attention layers over the embeddings h_1 … h_T; all words attend to all words in the previous layer (most arrows are omitted in the drawing)]

Self-Attention
• Recall: attention operates on queries, keys, and values.
  • We have some queries q_1, q_2, …, q_T. Each query is q_i ∈ ℝ^d.
  • We have some keys k_1, k_2, …, k_T. Each key is k_i ∈ ℝ^d.
  • We have some values v_1, v_2, …, v_T. Each value is v_i ∈ ℝ^d.
  • (The number of queries can differ from the number of keys and values in practice.)
• In self-attention, the queries, keys, and values are drawn from the same source.
  • For example, if the output of the previous layer is x_1, …, x_T (one vector per word), we could let v_i = k_i = q_i = x_i (that is, use the same vectors for all of them!).
• The (dot product) self-attention operation is as follows:
  e_{ij} = q_i^⊤ k_j                          (compute key-query affinities)
  α_{ij} = exp(e_{ij}) / Σ_{j'} exp(e_{ij'})   (compute attention weights from affinities via softmax)
  output_i = Σ_j α_{ij} v_j                    (compute outputs as weighted sums of values)

Self-attention as an NLP building block
[Figure: two stacked self-attention blocks over the words "The chef who … food"; each word w_t supplies a key k_t, query q_t, and value v_t at every layer]
• In the diagram at the right, we have stacked self-attention blocks, like we might stack LSTM layers.
• Can self-attention be a drop-in replacement for recurrence? No. It has a few issues, which we'll go through.
• First, self-attention is an operation on sets. It has no inherent notion of order: self-attention doesn't know the order of its inputs.

Barriers and solutions for Self-Attention as a building block
Barriers:
• Doesn't have an inherent notion of order!
Solutions:
• ?

Fixing the first self-attention problem: sequence order
• Since self-attention doesn't build in order information, we need to encode the order of the sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector: p_i ∈ ℝ^d, for i ∈ {1, 2, …, T}, are position vectors.
• Don't worry about what the p_i are made of yet!
• It's easy to incorporate this information into our self-attention block: just add the p_i to our inputs!
• Let ṽ_i, k̃_i, q̃_i be our old values, keys, and queries. Then
  v_i = ṽ_i + p_i
  q_i = q̃_i + p_i
  k_i = k̃_i + p_i
• In deep self-attention networks, we do this at the first layer. (You could concatenate them as well, but people mostly just add…)

Position representation vectors through sinusoids
• Sinusoidal position representations: concatenate sinusoidal functions of varying periods:
  p_i = [ sin(i / 10000^{2·1/d}); cos(i / 10000^{2·1/d}); …; sin(i / 10000^{2·(d/2)/d}); cos(i / 10000^{2·(d/2)/d}) ]
[Figure: the matrix of position vectors, dimension vs. index in the sequence. Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/]
• Pros:
  • Periodicity indicates that maybe "absolute position" isn't as important.
  • Maybe it can extrapolate to longer sequences as periods restart!
• Cons:
  • Not learnable; also, the extrapolation doesn't really work!

Position representation vectors learned from scratch
• Learned absolute position representations: let all the p_i be learnable parameters! Learn a matrix p ∈ ℝ^{d×T}, and let each p_i be a column of that matrix.
• Pros:
  • Flexibility: each position gets to be learned to fit the data.
• Cons:
  • Definitely can't extrapolate to indices outside 1, …, T.
• Most systems use this!
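To make the pieces so far concrete, here is a minimal NumPy sketch of dot-product self-attention with sinusoidal position vectors added at the input. This is an illustration, not the course's reference code: the function names (`sinusoidal_positions`, `self_attention`) and the toy shapes are our own assumptions.

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Build a T x d matrix of sinusoidal position vectors p_i (positions 0-indexed here)."""
    positions = np.arange(T)[:, None]                 # sequence index i
    dims = np.arange(0, d, 2)[None, :]                # paired sin/cos dimensions (2k)
    angles = positions / np.power(10000.0, dims / d)  # i / 10000^(2k/d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Basic self-attention with q_i = k_i = v_i = x_i:
    e_ij = q_i^T k_j; alpha_ij = softmax over j; output_i = sum_j alpha_ij v_j."""
    Q, K, V = X, X, X
    E = Q @ K.T              # T x T matrix of affinities e_ij
    A = softmax(E, axis=-1)  # attention weights alpha_ij
    return A @ V             # weighted averages of the values

T, d = 6, 16
word_vectors = np.random.randn(T, d)           # stand-in word embeddings
X = word_vectors + sinusoidal_positions(T, d)  # add position vectors at the input
print(self_attention(X).shape)                 # (T, d)
```

Swapping `sinusoidal_positions` for a learned T x d embedding table gives the "learned from scratch" variant described above.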
• Sometimes people try more flexible representations of position:
  • Relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]

Barriers and solutions for Self-Attention as a building block
Barriers:
• Doesn't have an inherent notion of order!
• No nonlinearities for deep learning! It's all just weighted averages.
Solutions:
• Add position representations to the inputs.

Adding nonlinearities in self-attention
• Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors.
• Easy fix: add a feed-forward network to post-process each output vector:
  m_i = MLP(output_i) = W_2 ReLU(W_1 output_i + b_1) + b_2
[Figure: a feed-forward network applied to each position after every self-attention layer. Intuition: the FF network processes the result of attention.]

Barriers and solutions for Self-Attention as a building block
Barriers:
• Doesn't have an inherent notion of order!
• No nonlinearities for deep learning magic! It's all just weighted averages.
• Need to ensure we don't "look at the future" when predicting a sequence (like in machine translation, or language modeling).
Solutions:
• Add position representations to the inputs.
• Easy fix: apply the same feed-forward network to each self-attention output.

Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can't peek at the future.
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:
  e_{ij} = q_i^⊤ k_j   if j < i
  e_{ij} = −∞          if j ≥ i
[Figure: the matrix of e_{ij} values for "[START] The chef who …"; for encoding each word we can only look at the (not greyed out) earlier words, and entries for future words are −∞]

Barriers and solutions for Self-Attention as a building block
Barriers:
• Doesn't have an inherent notion of order!
• No nonlinearities for deep learning magic! It's all just weighted averages.
• Need to ensure we don't "look at the future" when predicting a sequence (like in machine translation, or language modeling).
Solutions:
• Add position representations to the inputs.
• Easy fix: apply the same feed-forward network to each self-attention output.
• Mask out the future by artificially setting those attention weights to 0 (scores to −∞)!

Necessities for a self-attention building block:
• Self-attention: the basis of the method.
• Position representations: specify the sequence order, since self-attention is an unordered function of its inputs.
• Nonlinearities: at the output of the self-attention block; frequently implemented as a simple feed-forward network.
• Masking: in order to parallelize operations while not looking at the future; keeps information about the future from "leaking" to the past.
• That's it! But this is not yet the Transformer model we've been hearing about.
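As a sanity check on the fixes above, here is a hedged NumPy sketch of one decoder-style layer: scores for future positions are set to −∞ before the softmax, and the same small feed-forward network is applied at every position. The names (`masked_self_attention`, `position_wise_ffn`) and weight shapes are illustrative assumptions, and unlike the slide's strict j < i rule, this sketch lets each word attend to itself (j ≤ i), a common convention.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    """Dot-product self-attention where attention to future words is masked out."""
    T, d = X.shape
    E = X @ X.T                                          # all-pairs affinities e_ij
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where j > i
    E = np.where(future, -np.inf, E)                     # set future scores to -inf
    A = softmax(E, axis=-1)                              # future weights become exactly 0
    return A @ X

def position_wise_ffn(H, W1, b1, W2, b2):
    """m_i = W2 ReLU(W1 h_i + b1) + b2, applied identically at every position."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2

T, d, d_ff = 5, 8, 32
X = np.random.randn(T, d)
W1, b1 = 0.1 * np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = 0.1 * np.random.randn(d_ff, d), np.zeros(d)
out = position_wise_ffn(masked_self_attention(X), W1, b1, W2, b2)
print(out.shape)  # (T, d)
```

Because the mask is applied to the whole score matrix at once, all positions are still processed in parallel; only the softmax weights on future words are zeroed out.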
Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

The Transformer Encoder-Decoder [Vaswani et al., 2017]
First, let's look at the Transformer Encoder and Decoder Blocks at a high level.
[Figure: the input sequence goes through word embeddings plus position representations and then a stack of Transformer Encoder blocks; the output sequence goes through word embeddings plus position representations and then a stack of Transformer Decoder blocks, which attend to the encoder states and produce the predictions]

The Transformer Encoder-Decoder [Vaswani et al., 2017]
Next, let's look at the Transformer Encoder and Decoder Blocks.
What's left in a Transformer Encoder Block that we haven't covered?
1. Key-query-value attention: how do we get the k, q, v vectors from a single word embedding?
2. Multi-headed attention: attend to multiple places in a single layer!
3. Tricks to help with training:
   1. Residual connections
   2. Layer normalization
   3. Scaling the dot product
   These tricks don't improve what the model is able to do; they help improve the training process. Both types of modeling improvements (ones that expand what the model can do, and ones that help training) are very important!

The Transformer Encoder: Key-Query-Value Attention
• We saw that self-attention is when keys, queries, and values come from the same source. The Transformer does this in a particular way:
• Let x_1, …, x_T be input vectors to the Transformer encoder; x_i ∈ ℝ^d.
• Then keys, queries, and values are:
  • k_i = K x_i, where K ∈ ℝ^{d×d} is the key matrix.
  • q_i = Q x_i, where Q ∈ ℝ^{d×d} is the query matrix.
  • v_i = V x_i, where V ∈ ℝ^{d×d} is the value matrix.
• These matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.

The Transformer Encoder: Key-Query-Value Attention
• Let's look at how key-query-value attention is computed, in matrices.
• Let X = [x_1; …; x_T] ∈ ℝ^{T×d} be the concatenation of input vectors.
• First, note that XK ∈ ℝ^{T×d}, XQ ∈ ℝ^{T×d}, XV ∈ ℝ^{T×d}.
• The output is defined as output = softmax(XQ (XK)^⊤) XV.
• First, take the query-key dot products in one matrix multiplication: XQ (XK)^⊤ = X Q K^⊤ X^⊤ ∈ ℝ^{T×T}, all pairs of attention scores!
• Next, apply the softmax and compute the weighted average with another matrix multiplication: output = softmax(X Q K^⊤ X^⊤) XV ∈ ℝ^{T×d}.

The Transformer Encoder: Multi-headed attention
• What if we want to look in multiple places in the sentence at once?
• For word i, self-attention "looks" where x_i^⊤ Q^⊤ K x_j is high, but maybe we want to focus on different j for different reasons?
• We'll define multiple attention "heads" through multiple Q, K, V matrices.
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  output_ℓ = softmax(X Q_ℓ K_ℓ^⊤ X^⊤) X V_ℓ, where each row of output_ℓ is in ℝ^{d/h}.
• Then the outputs of all the heads are combined:
  output = Y [output_1; …; output_h], where Y ∈ ℝ^{d×d}.
• Each head gets to "look" at different things, and to construct value vectors differently.
[Figure: in single-head attention we compute XQ with one d×d query matrix; in multi-head attention (just two heads shown) we compute XQ_1 and XQ_2 with smaller matrices. Same amount of computation as single-head self-attention!]
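Here is a small NumPy sketch of key-query-value attention with h heads, following the shapes above (Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, head outputs concatenated and mixed by an output matrix). It is an illustration under our own naming and initialization, not the reference implementation, and it omits the 1/√(d/h) scaling introduced below.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Q_heads, K_heads, V_heads, Y):
    """X: T x d inputs. Each Q_l, K_l, V_l: d x (d/h). Y: d x d output mixing matrix
    (applied on the right here, since rows are tokens)."""
    head_outputs = []
    for Q, K, V in zip(Q_heads, K_heads, V_heads):
        scores = (X @ Q) @ (X @ K).T             # T x T affinities for this head
        weights = softmax(scores, axis=-1)
        head_outputs.append(weights @ (X @ V))   # T x (d/h) per-head output
    return np.concatenate(head_outputs, axis=-1) @ Y   # T x d combined output

T, d, h = 6, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Q_heads = [rng.normal(size=(d, d // h)) for _ in range(h)]
K_heads = [rng.normal(size=(d, d // h)) for _ in range(h)]
V_heads = [rng.normal(size=(d, d // h)) for _ in range(h)]
Y = rng.normal(size=(d, d))
print(multi_head_attention(X, Q_heads, K_heads, V_heads, Y).shape)  # (6, 16)
```

In practice all heads are computed in one batched matrix multiplication by reshaping; the per-head loop here is only for clarity.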
The Transformer Encoder: Residual connections [He et al., 2016]
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)), so we only have to learn "the residual" from the previous layer.
• Residual connections are thought to make the loss landscape considerably smoother (thus easier training!).
[Figure: a layer with and without a residual connection; loss landscape visualization on a ResNet with and without residuals, Li et al., 2018]

The Transformer Encoder: Layer normalization [Ba et al., 2016]
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to unit mean and standard deviation within each layer.
• LayerNorm's success may be due to its normalizing gradients [Xu et al., 2019].
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = sqrt( (1/d) Σ_{j=1}^{d} (x_j − μ)^2 ); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = ((x − μ) / (σ + ε)) ∗ γ + β
  (normalize by the scalar mean and standard deviation, then modulate by the learned elementwise gain and bias)

The Transformer Encoder: Scaled Dot Product [Vaswani et al., 2017]
• "Scaled dot product" attention is a final variation to aid in Transformer training.
• When the dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen,
  output_ℓ = softmax(X Q_ℓ K_ℓ^⊤ X^⊤) X V_ℓ,
• we divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  output_ℓ = softmax( (X Q_ℓ K_ℓ^⊤ X^⊤) / √(d/h) ) X V_ℓ
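A minimal NumPy sketch of the two training tricks in vector form: layer normalization applied to each (word) vector, and a residual X + Layer(X) wrapper. γ, β, and ε follow the formulas above; the function names, shapes, and the toy stand-in layer are assumptions for illustration.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each row (one vector per word) by its mean and standard deviation,
    then modulate with learned elementwise gain and bias."""
    mu = X.mean(axis=-1, keepdims=True)      # (1/d) sum_j x_j
    sigma = X.std(axis=-1, keepdims=True)    # sqrt((1/d) sum_j (x_j - mu)^2)
    return (X - mu) / (sigma + eps) * gamma + beta

def residual(layer, X):
    """X^(i) = X^(i-1) + Layer(X^(i-1)): only the 'residual' has to be learned."""
    return X + layer(X)

T, d = 4, 8
X = np.random.randn(T, d)
gamma, beta = np.ones(d), np.zeros(d)
toy_layer = lambda H: np.tanh(H)             # stand-in for attention or the FFN
out = layer_norm(residual(toy_layer, X), gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))   # each row has roughly 0 mean, unit std
```

For scaled dot-product attention, the only change to the earlier sketches is dividing the score matrix by sqrt(d/h) before the softmax.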
The Transformer Encoder-Decoder [Vaswani et al., 2017]
Looking back at the whole model, and zooming in on an Encoder block: each block applies
• Multi-Head Attention,
• Residual + LayerNorm,
• Feed-Forward,
• Residual + LayerNorm.
[Figure: the full encoder-decoder stack; word embeddings plus position representations feed the encoder blocks on the input sequence, and the decoder blocks on the output sequence attend to the encoder states to make predictions]

The Transformer Encoder-Decoder [Vaswani et al., 2017]
Zooming in on a Decoder block: each block applies
• Masked Multi-Head Self-Attention,
• Residual + LayerNorm,
• Multi-Head Cross-Attention,
• Residual + LayerNorm,
• Feed-Forward,
• Residual + LayerNorm.
The only new part is attention from the decoder to the encoder, like we saw last week!

The Transformer Decoder: Cross-attention (details)
• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we also have attention that looks more like what we saw last week.
• Let h_1, …, h_T be output vectors from the Transformer encoder; h_i ∈ ℝ^d.
• Let z_1, …, z_T be input vectors to the Transformer decoder; z_i ∈ ℝ^d.
• Then the keys and values are drawn from the encoder (like a memory): k_i = K h_i, v_i = V h_i.
• And the queries are drawn from the decoder: q_i = Q z_i.

The Transformer Decoder: Cross-attention (details)
• Let's look at how cross-attention is computed, in matrices.
• Let H = [h_1; …; h_T] ∈ ℝ^{T×d} be the concatenation of encoder vectors.
• Let Z = [z_1; …; z_T] ∈ ℝ^{T×d} be the concatenation of decoder vectors.
• The output is defined as output = softmax(ZQ (HK)^⊤) HV.
• First, take the query-key dot products in one matrix multiplication: ZQ (HK)^⊤ = Z Q K^⊤ H^⊤ ∈ ℝ^{T×T}, all pairs of attention scores!
• Next, apply the softmax and compute the weighted average with another matrix multiplication: output = softmax(Z Q K^⊤ H^⊤) HV ∈ ℝ^{T×d}.
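Here is a hedged NumPy sketch of single-head cross-attention in matrix form, matching output = softmax(ZQ (HK)^⊤) HV: keys and values come from the encoder states H, and queries come from the decoder states Z. The function name, shapes, and initialization are our own assumptions, and scaling and masking are omitted for brevity.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(H, Z, K, Q, V):
    """H: T_enc x d encoder outputs; Z: T_dec x d decoder states.
    Keys and values are drawn from the encoder, queries from the decoder."""
    scores = (Z @ Q) @ (H @ K).T        # T_dec x T_enc attention scores
    weights = softmax(scores, axis=-1)  # each decoder position attends over the source
    return weights @ (H @ V)            # T_dec x d outputs

T_enc, T_dec, d = 7, 5, 16
rng = np.random.default_rng(0)
H = rng.normal(size=(T_enc, d))
Z = rng.normal(size=(T_dec, d))
K, Q, V = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H, Z, K, Q, V).shape)  # (5, 16)
```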
Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

Great Results with Transformers [Vaswani et al., 2017]
• First, Machine Translation results from the original Transformers paper!
• Not just better Machine Translation BLEU scores; also more efficient to train!
[Table: BLEU scores and training cost; test sets are WMT 2014 English-German and English-French]

Great Results with Transformers [Liu et al., 2018]
• Next, document generation, on the WikiSum dataset!
[Table: the old standard models vs. newer Transformer-based models; Transformers all the way down.]

Great Results with Transformers
• Before too long, most Transformers results also included pretraining, a method we'll go over on Thursday.
• Transformers' parallelizability allows for efficient pretraining, and has made them the de facto standard.
• On one popular aggregate benchmark, for example, all top models are Transformer (and pretraining)-based.
• More results Thursday when we discuss pretraining.

Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

What would we like to fix about the Transformer?
• Quadratic compute in self-attention (today):
  • Computing all pairs of interactions means our computation grows quadratically with the sequence length!
  • For recurrent models, it only grew linearly!
• Position representations:
  • Are simple absolute indices the best we can do to represent position?
  • Relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]

Quadratic computation as a function of sequence length
• One of the benefits of self-attention over recurrence was that it's highly parallelizable.
• However, its total number of operations grows as O(T^2 d), where T is the sequence length and d is the dimensionality.
• Computing X Q K^⊤ X^⊤ ∈ ℝ^{T×T} means computing all pairs of interactions: O(T^2 d).
• Think of d as around 1,000.
• So, for a single (shortish) sentence, T ≤ 30, and T^2 ≤ 900.
• In practice, we set a bound like T = 512.
• But what if we'd like T ≥ 10,000? For example, to work on long documents?

Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: can we build models like Transformers without paying the O(T^2) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020].
  • Key idea: map the sequence length dimension to a lower-dimensional space for the values and keys.
  [Figure: inference time (s) vs. sequence length / batch size]
• Another example: BigBird [Zaheer et al., 2021].
  • Key idea: replace all-pairs interactions with a family of other interactions, like local windows, looking at everything, and random interactions.

Parting remarks
• Pretraining on Thursday!
• Good luck on Assignment 4!
• Remember to work on your project proposal!