Statistical Natural Language Processing
P. Rychlý
NLP Centre, FI MU, Brno
September 21, 2021

Outline
• Word lists
• Collocations
• Language Modeling
• N-grams
• Evaluation of Language Models

Statistical Natural Language Processing
• statistics provides a summary (of a text)
• highlights important or interesting facts
• can be used to model data
• foundation of estimating probabilities
• fundamental statistics: size (+ domain, range)

           Book 1     Book 2
  lines     3,715      1,601
  words    37,703     16,859
  bytes   223,415     91,031

Word list
• list of all words from a text
• list of most frequent words
• words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like

Frequency
• number of occurrences (raw frequency)
• relative frequency (hits per million)
• document frequency (number of documents with a hit)
• reduced frequency (ARF, ALDF), 1 ≤ reduced ≤ raw
• normalization for comparison
• hapax legomena (= 1 hit)

Zipf's Law
• rank-frequency plot
• rank × frequency ≈ constant

[Figures: a rank-frequency plot on a linear scale (rank 0-100) and a log-log plot (rank 10 to 10,000, frequency on a log scale) with the most frequent words "the", "of", "and", "to", "in", "that", "his" labelled.]
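The word-list, frequency and Zipf's-law slides above can be reproduced with a few lines of code. A minimal sketch, assuming plain-text input; the file name book1.txt is a placeholder, and the simple \w+ tokenization is only an approximation of the tokenizer used for the slides.

```python
# Build a frequency word list and check Zipf's law (rank * frequency ~ constant).
# "book1.txt" is a placeholder for any plain-text file.
import re
from collections import Counter

def word_list(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read())
    return Counter(tokens)

freq = word_list("book1.txt")

# Most frequent words (cf. the Word list slide).
for rank, (word, count) in enumerate(freq.most_common(20), start=1):
    # Under Zipf's law the product rank * count stays roughly constant.
    print(f"{rank:3d}  {word:15s} {count:7d}   rank*freq = {rank * count}")

# Hapax legomena: words occurring exactly once.
print("hapax legomena:", sum(1 for c in freq.values() if c == 1))
```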
Keywords
• select only important words from a word list
• compare to a reference text (norm)
• simple math score:

  score = (freq_focus + N) / (freq_reference + N)

  Genesis:        son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
  Little Prince:  prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
• meaning of words is defined by the context
• collocations are salient words in the context
• usually not the most frequent
• filtering by part of speech, grammatical relation
• compare to reference = context for other words
• many statistics (usually single use only) based on frequencies
• MI-score, t-score, χ², ...
• logDice - scalable:

  logDice = 14 + log2( 2·f_AB / (f_A + f_B) )

  (a worked sketch of the keyword score and logDice follows the slides below)

Collocations of "prince"

  modifiers of "prince":
    little   the little prince
    fair     fair, little prince
    Oh       Oh, little prince
    dear     dear little prince
    great    great prince
  verbs with "prince" as object:
    say      said the little prince
    ask      asked the little prince
    demand   demanded the little prince
    see      when he saw the little prince coming
  verbs with "prince" as subject:
    say      the little prince said to himself
    come     saw the little prince coming
    go       And the little prince went away
    inquire  inquired the little prince
    repeat   repeated the little prince
    add      the little prince added
    ask      the little prince asked
    flush    The little prince flushed

[Figure: the same collocates of "prince" shown as a word-sketch visualisation, grouped into modifiers, verbs with "prince" as object and verbs with "prince" as subject.]

Thesaurus
• comparing collocation distributions
• counting same context

  son (as noun, 301×)         Abraham (as noun, 134×)
   1  brother    161           1  Isaac       82
   2  wife       125           2  Jacob      184
   3  father     278           3  Joseph     157
   4  daughter   103           4  Noah        41
   5  child       80           5  Abram       61
   6  man        187           6  Laban       54
   7  servant     91           7  Esau        78
   8  Esau        78           8  God        234
   9  Jacob      184           9  Abimelech   24
  10  name        85          10  father     278

Multi-word units
• meaning of some words is completely different in the context of a specific co-occurring word
• black hole: is not black and is not a hole
• strong collocations
• uses the same statistics with a different threshold
• better to compare context distributions instead of only numbers
• terminology - compare to a reference corpus

Language models - what are they good for?
• assigning scores to sequences of words
• predicting words
• generating text
• statistical machine translation
• automatic speech recognition
• optical character recognition

OCR + MT
[Photo: a Russian sign reading "ВЫХОД В ГОРОД" with the English machine translation "ACCESS TO CITY".]

Language models - probability of a sentence
• an LM is a probability distribution over all possible word sequences
• what is the probability of utterance of s?

Probability of a sentence:
  p_LM(Catalonia President urges protests)
  p_LM(President Catalonia urges protests)
  p_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.
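The sketch referenced above works through the keyword score and logDice on invented frequencies. It assumes the keyword frequencies are already normalized (e.g. hits per million) and uses N = 1 as the smoothing constant; both the counts and that choice of N are illustrative, not taken from the slides' data.

```python
# Sketch of the keyword "simple math" score and logDice from the slides.
# All counts below are invented for illustration.
from math import log2

def keyword_score(freq_focus, freq_ref, N=1.0):
    # score = (freq_focus + N) / (freq_reference + N)
    # frequencies assumed to be normalized (e.g. hits per million)
    return (freq_focus + N) / (freq_ref + N)

def log_dice(f_ab, f_a, f_b):
    # logDice = 14 + log2(2 * f_AB / (f_A + f_B))
    # depends only on these three frequencies, hence "scalable"
    # across corpus sizes; maximum value is 14
    return 14 + log2(2 * f_ab / (f_a + f_b))

# "prince" is frequent in the focus text, rare in the reference text.
print(keyword_score(freq_focus=1500.0, freq_ref=2.0))   # high -> keyword

# Collocation "little prince": f_AB co-occurrences, f_A and f_B word counts.
print(log_dice(f_ab=90, f_a=120, f_b=100))   # ~13.7, strong collocation
print(log_dice(f_ab=2, f_a=120, f_b=5000))   # ~3.7, weak collocation
```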
N-gram models
• an approximation of long sequences using short n-grams
• a straightforward implementation
• an intuitive approach
• good local fluency

Randomly generated text
Czech: "Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.
Hungarian: A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-gram models, naive approach

  W = w1 w2 ... wn

  p(W) = ∏_i p(w_i | w1 ... w_{i-1})

Markov's assumption:

  p(W) ≈ ∏_i p(w_i | w_{i-2} w_{i-1})

  p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)

  p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

Computing LM probabilities, estimation
A trigram model uses 2 preceding words for probability learning.
Using maximum-likelihood estimation:

  p(w3 | w1, w2) = count(w1, w2, w3) / Σ_w count(w1, w2, w)

quadrigram: (lord, of, the, ?)

  w        count    p(w)
  rings   30,156   0.425
  flies    2,977   0.042
  well     1,536   0.021
  manor      907   0.012
  dance      767   0.010

(a counting sketch appears after the interpolation slide below)

Large LM - n-gram counts
How many unique n-grams are in a corpus?

  order       unique        singletons
  unigram         86,700        33,447 (38.6%)
  bigram       1,948,935     1,132,844 (58.1%)
  trigram      8,092,798     6,022,286 (74.4%)
  4-gram      15,303,847    13,081,621 (85.5%)
  5-gram      19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

Language model smoothing
The problem: an n-gram is missing in the data but occurs in a sentence → p(sentence) = 0.
We need to assign a non-zero p to unseen data. This must hold:

  ∀w : p(w) > 0

The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend the real counts of n-grams to the expected counts in any (unseen) data.
Add-one, Add-α, Good-Turing smoothing.

Deleted estimation
We can find unseen n-grams in another corpus. N-grams contained in one of the corpora and not in the other help us to estimate the general amount of unseen n-grams.
E.g. bigrams not occurring in the training corpus but present in the other corpus a million times (given that the number of all possible bigrams equals 7.5 billion) will occur approx.

  10^6 / (7.5 × 10^9) ≈ 0.00013×

Interpolation and back-off
Previous methods treated all unseen n-grams the same. Consider the trigrams

  beautiful young girl
  beautiful young granny

Although we have neither of these in our training data, the former trigram should be more probable.
We will use the probabilities of lower-order models, for which we have the necessary data:

  young girl
  young granny
  beautiful young

Interpolation

  p_I(w3 | w1 w2) = λ1·p(w3) + λ2·p(w3 | w2) + λ3·p(w3 | w1 w2)

If we have enough data we can trust higher-order models more and assign a higher significance to the corresponding n-grams.
p_I is a probability distribution, thus this must hold:

  ∀λn : 0 ≤ λn ≤ 1,   Σn λn = 1
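The counting sketch referenced on the estimation slide: maximum-likelihood trigram probabilities from raw counts, on an invented toy corpus. It also shows how an unseen trigram immediately yields a zero probability (the sparse-data problem).

```python
# Maximum-likelihood trigram estimation:
#   p(w3 | w1, w2) = count(w1, w2, w3) / sum_w count(w1, w2, w)
# Toy sketch on an invented corpus; real models need far more data.
from collections import Counter

corpus = "this is a sentence and this is a test".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    # Using the bigram count as the denominator (equal to
    # sum_w count(w1, w2, w) everywhere except at the text boundary).
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("this", "is", "a"))      # 1.0 -- the only continuation seen
print(p_mle("is", "a", "sentence"))  # 0.5
print(p_mle("is", "a", "banana"))    # 0.0 -- unseen trigram, p(sentence) collapses
```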
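A sketch of the linear interpolation formula above on the "beautiful young girl / granny" example. The toy corpus and the λ weights are invented; in practice the weights would be tuned on held-out data.

```python
# Linear interpolation of unigram, bigram and trigram MLE estimates:
#   p_I(w3 | w1 w2) = l1*p(w3) + l2*p(w3 | w2) + l3*p(w3 | w1 w2)
# Lambdas are hand-picked here; they must be non-negative and sum to 1.
from collections import Counter

corpus = ("the young girl saw a beautiful flower "
          "and the old granny saw a beautiful young man").split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    p1 = unigrams[w3] / N
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

# Neither trigram occurs in the corpus, but the bigram "young girl" does,
# so "girl" (~0.16) gets a much higher interpolated probability than
# "granny" (~0.01), as the slide argues it should.
print(p_interp("beautiful", "young", "girl"))
print(p_interp("beautiful", "young", "granny"))
```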
Quality and comparison of LMs
We need to compare the quality of various LMs (various orders, various data, smoothing techniques etc.):
1) extrinsic evaluation (WER, MT, ASR, OCR) and 2) intrinsic evaluation (perplexity).
A good LM should assign a higher probability to a good(-looking) text than to an incorrect text.
For a fixed test text we can compare various LMs.

Cross-entropy

  H(p_LM) = -(1/n) log2 p_LM(w1, w2, ..., wn)
          = -(1/n) Σ_{i=1..n} log2 p_LM(w_i | w1, ..., w_{i-1})

Cross-entropy is the average value of the negative logarithms of the word probabilities in the testing text. It corresponds to a measure of uncertainty of a probability distribution: the lower, the better.
A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approx. 1.3 bits per letter.

Perplexity

  PP = 2^H(p_LM)

Perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena.
The lower the entropy, the better → the lower the perplexity, the better.
(A worked sketch of both measures follows the final table.)

Comparing smoothing methods (Europarl)

  method         perplexity
  add-one             382.2
  add-α               113.2
  deleted est.        113.4
  Good-Turing         112.9
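The worked sketch of cross-entropy and perplexity as defined above. The per-word probabilities are invented; a real evaluation would take them from a trained language model applied to a test text.

```python
# Cross-entropy  H  = -(1/n) * sum_i log2 p_LM(w_i | history)
# Perplexity     PP = 2 ** H
# The per-word probabilities below are invented for illustration.
from math import log2

test_word_probs = [0.2, 0.05, 0.1, 0.3, 0.01, 0.15]

n = len(test_word_probs)
cross_entropy = -sum(log2(p) for p in test_word_probs) / n
perplexity = 2 ** cross_entropy

print(f"cross-entropy: {cross_entropy:.3f} bits per word")
print(f"perplexity:    {perplexity:.1f}")
# Lower cross-entropy, and hence lower perplexity, means a better model
# on this test text.
```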