Natural Language Modelling
PA154 Jazykové modelování (12)
Pavel Rychlý
pary@fi.muni.cz
May 18, 2021

Deep Learning
■ deep neural networks
■ many layers
■ trained on big data
■ using advanced hardware: GPU, TPU
■ supervised, semi-supervised or unsupervised

Neural Networks
■ Neuron: many inputs, weights, transfer function (threshold), one output:
  y_k = φ(∑_{j=1}^{m} w_{kj} x_j)
■ Input/Hidden/Output layer
■ One-hot representation of words/classes: [0 0 0 1 0 0 0 0]
(figure: feedforward network with input, hidden and output layers)

Training Neural Networks
■ supervised training
■ example: input + expected result
■ difference between output and expected result
■ adjust weights according to a learning rule
■ backpropagation (feedforward neural networks)
■ gradient of the loss function, stochastic gradient descent (SGD)

Recurrent Neural Network (RNN)
■ dealing with long inputs
■ feedforward NN + internal state (memory)
■ finite impulse RNN: can be unrolled into a strictly feedforward NN
■ infinite impulse RNN: directed cyclic graph
■ additional storage managed by the NN: gated state/memory
■ backpropagation through time

Long short-term memory (LSTM)
■ LSTM unit: cell, input gate, output gate and forget gate
■ cell = memory
■ gates regulate the flow of information into and out of the cell

GRU, BRNN
■ Gated recurrent unit (GRU)
  ► fewer parameters than LSTM
  ► memory = output
■ Bi-directional RNN (BRNN)
  ► two hidden layers of opposite directions connected to the same output
■ hierarchical, multilayer

Encoder-Decoder
■ variable input/output size, not a 1-1 mapping
■ two components
  ► Encoder: variable-length sequence → fixed-size state
  ► Decoder: fixed-size state → variable-length sequence

Sequence to Sequence
■ Learning
  ► Encoder: input sequence → state
  ► Decoder: state + output sequence → output sequence
■ Prediction
  ► Encoder: input sequence → state
  ► Decoder: state + sentence delimiter → output sequence
(figures: encoder reads "They are watching", decoder produces "Ils regardent")

Transformers
■ using context to compute token/sentence/document embeddings
■ BERT = Bidirectional Encoder Representations from Transformers
■ GPT = Generative Pre-trained Transformer
■ many variants: tokenization, attention, encoder/decoder connections
(figure: comparison of pre-training architectures: BERT, OpenAI GPT, ELMo)

BERT
■ Google
■ pre-training on raw text
■ masking tokens, is-next-sentence
■ big pre-trained models available
■ domain (task) adaptation

Input: The man went to the [MASK]1 . He bought a [MASK]2 of milk .
Labels: [MASK]1 = store; [MASK]2 = gallon

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence
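
A minimal sketch of the neuron from the "Neural Networks" slide, y_k = φ(∑_j w_kj x_j), applied to a one-hot input. The sigmoid transfer function and the concrete weights are illustrative assumptions, not part of the slides.

```python
import math

def phi(z):
    """Sigmoid transfer function (one common choice for the threshold-like phi)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs):
    """y_k = phi(sum_j w_kj * x_j)"""
    return phi(sum(w * x for w, x in zip(weights, inputs)))

# one-hot representation of word/class number 3 out of 8: [0,0,0,1,0,0,0,0]
one_hot = [0, 0, 0, 1, 0, 0, 0, 0]
weights = [0.1, -0.4, 0.3, 0.8, 0.0, 0.2, -0.1, 0.5]
print(neuron_output(weights, one_hot))   # phi(0.8) ≈ 0.69
```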
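
A minimal sketch of the supervised training loop from the "Training Neural Networks" slide: compute the difference between the output and the expected result and adjust the weights along the negative gradient of the loss (stochastic gradient descent). The single linear neuron, learning rate and data are illustrative assumptions.

```python
import random

def sgd_train(examples, n_features, lr=0.1, epochs=100):
    w = [0.0] * n_features
    for _ in range(epochs):
        random.shuffle(examples)                          # "stochastic": one example at a time
        for x, target in examples:
            y = sum(wi * xi for wi, xi in zip(w, x))      # forward pass
            error = y - target                            # output minus expected result
            # gradient of the loss 0.5*error^2 w.r.t. w_j is error * x_j
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w

# learn y = 2*x1 - 1*x2 from a few labelled examples
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
print(sgd_train(data, n_features=2))   # ≈ [2.0, -1.0]
```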
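
A minimal sketch of the internal-state idea from the "Recurrent Neural Network (RNN)" slide: the same feedforward step is applied at every position, and a state (memory) is carried along the sequence; applying the step position by position is what "unrolling" into a feedforward network means. The scalar weights and tanh are illustrative assumptions.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a simple (Elman-style) RNN with a scalar state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial internal state
    states = []
    for x in sequence:           # unrolling over the input sequence
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

print(run_rnn([1.0, 0.0, -1.0, 0.0]))   # later states still depend on earlier inputs
```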
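
A minimal scalar sketch of one step of the unit from the "Long short-term memory (LSTM)" slide: the input, forget and output gates regulate what flows into, stays in and leaves the cell (the memory). All weights below are illustrative placeholders.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev)           # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev)           # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev)           # output gate
    c_tilde = math.tanh(w["wc"] * x + w["uc"] * h_prev)   # candidate cell content
    c = f * c_prev + i * c_tilde                          # cell = memory
    h = o * math.tanh(c)                                  # gated output
    return h, c

w = dict(wi=0.5, ui=0.1, wf=0.6, uf=0.2, wo=0.4, uo=0.3, wc=0.7, uc=0.1)
h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, w)
print(h, c)
```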
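
A toy sketch of the data flow on the "Encoder-Decoder" and "Sequence to Sequence" slides: the encoder turns a variable-length input into a fixed-size state, and at prediction time the decoder starts from that state plus a sentence delimiter and emits tokens until it produces the delimiter itself. The encoder and the lookup-table "decoder" below are hand-written stand-ins for learned networks, purely to show the interface, not a real model.

```python
BOS, EOS = "<s>", "</s>"

def encode(tokens):
    """Variable-length sequence -> fixed-size state (here: 4 numbers)."""
    state = [0.0, 0.0, 0.0, 0.0]
    for pos, tok in enumerate(tokens):
        state[pos % 4] += hash(tok) % 100 / 100.0   # toy mixing of the input
    return state

def decode_step(state, prev_token):
    """Toy decoder step: a lookup table standing in for a learned prediction."""
    table = {BOS: "Ils", "Ils": "regardent", "regardent": EOS}
    return table.get(prev_token, EOS)

def translate(tokens, max_len=10):
    state = encode(tokens)            # fixed-size summary of the whole input
    out, prev = [], BOS               # prediction starts from the delimiter
    for _ in range(max_len):
        prev = decode_step(state, prev)
        if prev == EOS:
            break
        out.append(prev)
    return out

print(translate(["They", "are", "watching"]))   # ['Ils', 'regardent']
```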
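
A small sketch of querying a big pre-trained model for the masked-token task shown on the "BERT" slide. It assumes the Hugging Face transformers package is installed and downloads bert-base-uncased on first use; neither the package nor the model choice is part of the slides.

```python
from transformers import pipeline

# fill-mask pipeline over a pre-trained BERT model (assumed available)
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The man went to the [MASK]."):
    # each prediction carries the filled-in token and its score
    print(prediction["token_str"], round(prediction["score"], 3))
```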