Basics in Language and Probability
Philipp Koehn
3 September 2020

Quotes

It must be recognized that the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.
Noam Chomsky, 1969

Whenever I fire a linguist our system performance improves.
Frederick Jelinek, 1988

Conflicts?

rationalist vs. empiricist
scientist vs. engineer
insight vs. data analysis
explaining language vs. building applications

language

A Naive View of Language

• Language needs to name
– nouns: objects in the world (dog)
– verbs: actions (jump)
– adjectives and adverbs: properties of objects and actions (brown, quickly)
• Relationships between these have to be specified
– word order
– morphology
– function words

A Bag of Words

dog fox lazy jump brown quick

Relationships

dog fox lazy jump brown quick

Marking of Relationships: Word Order

dog fox lazy jump brown quick
quick brown fox jump lazy dog

Marking of Relationships: Function Words

dog fox lazy jump brown quick
quick brown fox jump over lazy dog

Marking of Relationships: Morphology

dog fox lazy jump brown quick
quick brown fox jumps over lazy dog

Some Nuance

dog fox lazy jump brown quick
the quick brown fox jumps over the lazy dog

Marking of Relationships: Agreement

• From Catullus, First Book, first verse (Latin):
Cui dono lepidum novum libellum arida modo pumice expolitum?
Whom I-present lovely new little-book dry manner pumice polished?
(To whom do I present this lovely new little book now polished with a dry pumice?)
• Gender (and case) agreement links adjectives to nouns

Marking of Relationships to Verb: Case

• German:
Die Frau gibt dem Mann den Apfel
the woman (subject) gives the man (indirect object) the apple (object)
Der Frau gibt der Mann den Apfel
the woman (indirect object) gives the man (subject) the apple (object)
• Case inflection indicates the role of noun phrases

Case Morphology vs. Prepositions

• Two different word orderings for English:
– The woman gives the man the apple
– The woman gives the apple to the man
• Japanese: woman SUBJ man OBJ apple OBJ2 gives
• Is there a real difference between prepositions and noun phrase case inflection?
Words

This is a simple sentence

Morphology

This is a simple sentence
is = be, 3sg, present

Parts of Speech

This/DT is/VBZ a/DT simple/JJ sentence/NN

Syntax

[S [NP This] [VP is [NP a simple sentence]]]

Semantics

SENTENCE1: string of words satisfying the grammatical rules of a language
SIMPLE1: having few parts

Discourse

A following sentence, "But it is an instructive one.", stands in a CONTRAST relation to the first sentence.

Why is Language Hard?

• Ambiguities on many levels
• Rules, but many exceptions
• No clear understanding of how humans process language
• Can we learn everything about language by automatic data analysis?

data

Data: Words

• Definition: strings of letters separated by spaces
• But how about:
– punctuation: commas, periods, etc. are typically separated off (tokenization)
– hyphens: high-risk
– clitics: Joe's
– compounds: website, Computerlinguistikvorlesung
• And what if there are no spaces:
伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走.

Word Counts

Most frequent words in the English Europarl corpus:

any word                          nouns
Frequency in text   Token         Frequency in text   Content word
1,929,379           the           129,851             European
1,297,736           ,             110,072             Mr
  956,902           .              98,073             commission
  901,174           of             71,111             president
  841,661           to             67,518             parliament
  684,869           and            64,620             union
  582,592           in             58,506             report
  452,491           that           57,490             council
  424,895           is             54,079             states
  424,552           a              49,965             member

Word Counts

But also: there is a large tail of words that occur only once.
33,447 words occur once, for instance
• cornflakes
• mathematicians
• Tazhikhistan

Zipf's law

f × r = k

f = frequency of a word
r = rank of a word (if sorted by frequency)
k = a constant

Zipf's law as a graph

Plotting frequency against rank on log-log axes gives roughly a straight line. Why a line in log-scale?

f × r = k  ⇒  f = k / r  ⇒  log f = log k − log r
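Both the word counts and the Zipf check are easy to reproduce. A minimal sketch, not part of the slides, assuming a tokenized plain-text corpus in a file hypothetically named europarl.en, one sentence per line:

```python
from collections import Counter

# Hypothetical corpus file: tokenized English text, one sentence per line
with open("europarl.en", encoding="utf-8") as f:
    counts = Counter(token for line in f for token in line.split())

print(sum(counts.values()), "tokens,", len(counts), "distinct words")
print(sum(1 for c in counts.values() if c == 1), "words occur only once")

# Zipf's law: frequency * rank should stay roughly constant
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    if rank in (1, 10, 100, 1000, 10000):
        print(rank, word, freq, freq * rank)
```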
statistics

Probabilities

• Given word counts we can estimate a probability distribution:
P(w) = count(w) / Σ_w' count(w')
• This type of estimation is called maximum likelihood estimation. Why? We will get to that later.
• Estimating probabilities based on frequencies is called the frequentist approach to probability.
• This probability distribution answers the question: if we randomly pick a word out of a text, how likely is it to be the word w?

A Bit More Formal

• We introduce a random variable W.
• We define a probability distribution p that tells us how likely it is that the variable W takes the value w:
prob(W = w) = p(w)

Joint Probabilities

• Sometimes, we want to deal with two random variables at the same time.
• Example: words w1 and w2 that occur in sequence (a bigram). We model this with the distribution p(w1, w2).
• If the occurrence of words in bigrams is independent, we can reduce this to p(w1, w2) = p(w1) p(w2). Intuitively, this is not the case for word bigrams.
• We can estimate joint probabilities over two variables the same way we estimated the probability distribution over a single variable:
p(w1, w2) = count(w1, w2) / Σ_w1',w2' count(w1', w2')

Conditional Probabilities

• Another useful concept is conditional probability: p(w2|w1)
It answers the question: if the random variable W1 takes the value w1, what is the probability distribution over the second random variable W2?
• Mathematically, we can define conditional probability as
p(w2|w1) = p(w1, w2) / p(w1)
• If W1 and W2 are independent: p(w2|w1) = p(w2)

Chain Rule

• A bit of math gives us the chain rule:
p(w2|w1) = p(w1, w2) / p(w1)
p(w1) p(w2|w1) = p(w1, w2)
• What if we want to break down large joint probabilities like p(w1, w2, w3)? We can repeatedly apply the chain rule:
p(w1, w2, w3) = p(w1) p(w2|w1) p(w3|w1, w2)

Bayes Rule

• Finally, another important rule: Bayes rule
p(x|y) = p(y|x) p(x) / p(y)
• It can easily be derived from the chain rule:
p(x, y) = p(x, y)
p(x|y) p(y) = p(y|x) p(x)
p(x|y) = p(y|x) p(x) / p(y)
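To make these rules concrete, here is a small sketch, not from the slides, using an invented table of bigram counts; it estimates the joint distribution by relative frequency and checks the chain rule and Bayes rule numerically:

```python
from collections import Counter

# Invented bigram counts, purely for illustration
bigram_counts = Counter({
    ("the", "house"): 8, ("the", "car"): 4,
    ("a", "house"): 2, ("a", "car"): 6,
})
total = sum(bigram_counts.values())

def p_joint(w1, w2):
    """Maximum likelihood estimate of p(w1, w2)."""
    return bigram_counts[(w1, w2)] / total

def p1(w1):
    """Marginal p(w1): sum the joint distribution over w2."""
    return sum(c for (v1, _), c in bigram_counts.items() if v1 == w1) / total

def p2(w2):
    """Marginal p(w2): sum the joint distribution over w1."""
    return sum(c for (_, v2), c in bigram_counts.items() if v2 == w2) / total

def p_cond(w2, w1):
    """Conditional probability p(w2|w1) = p(w1, w2) / p(w1)."""
    return p_joint(w1, w2) / p1(w1)

# Chain rule: p(w1) p(w2|w1) = p(w1, w2)
assert abs(p1("the") * p_cond("house", "the") - p_joint("the", "house")) < 1e-12

# Bayes rule: p(w1|w2) = p(w2|w1) p(w1) / p(w2)
p_w1_given_w2 = p_joint("the", "house") / p2("house")
assert abs(p_w1_given_w2 - p_cond("house", "the") * p1("the") / p2("house")) < 1e-12

print(p_cond("house", "the"))  # 8/12: "the" is followed by "house" two thirds of the time
```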
Expectation

• We introduced the concept of a random variable X:
prob(X = x) = p(x)
• Example: roll of a die. There is a 1/6 chance that it will be 1, 2, 3, 4, 5, or 6.
• We define the expectation E(X) of a random variable as:
E(X) = Σ_x p(x) · x
• Roll of a die:
E(X) = 1/6 × 1 + 1/6 × 2 + 1/6 × 3 + 1/6 × 4 + 1/6 × 5 + 1/6 × 6 = 3.5

Variance

• Variance is defined as
Var(X) = E((X − E(X))²) = E(X²) − E(X)²
Var(X) = Σ_x p(x) (x − E(X))²
• Intuitively, this is a measure of how far events diverge from the mean (expectation)
• Related to this is the standard deviation, denoted σ:
Var(X) = σ²    E(X) = µ

Variance

• Roll of a die:
Var(X) = 1/6 (1 − 3.5)² + 1/6 (2 − 3.5)² + 1/6 (3 − 3.5)² + 1/6 (4 − 3.5)² + 1/6 (5 − 3.5)² + 1/6 (6 − 3.5)²
= 1/6 ((−2.5)² + (−1.5)² + (−0.5)² + 0.5² + 1.5² + 2.5²)
= 1/6 (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)
≈ 2.917

Standard Distributions

• Uniform: all events equally likely
– ∀x, y: p(x) = p(y)
– example: roll of one die
• Binomial: a series of trials with only two outcomes
– probability p for each trial, occurrence r out of n times:
b(r; n, p) = (n choose r) p^r (1 − p)^(n−r)
– example: a number of coin tosses

Standard Distributions

• Normal: common distribution for continuous values
– density at a value x, given expectation µ and standard deviation σ:
n(x; µ, σ) = 1 / (σ √(2π)) · e^(−(x − µ)² / (2σ²))
– also called Bell curve, or Gaussian
– examples: heights of people, IQ of people, tree heights, ...
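A short sketch, not part of the slides, reproducing the expectation and variance of the die and the binomial formula above; math.comb provides the binomial coefficient (n choose r):

```python
from math import comb

# Fair die: uniform distribution over 1..6
die = {x: 1 / 6 for x in range(1, 7)}

mu = sum(p * x for x, p in die.items())                # E(X) = 3.5
var = sum(p * (x - mu) ** 2 for x, p in die.items())   # Var(X) = 2.9166...
print(mu, var)

def binomial(r, n, p):
    """b(r; n, p) = (n choose r) p^r (1-p)^(n-r): r successes in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

print(binomial(3, 10, 0.5))  # probability of exactly 3 heads in 10 fair coin tosses, ~0.117
```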
Estimation Revisited

• We previously introduced an estimation of probabilities based on frequencies:
P(w) = count(w) / Σ_w' count(w')
• Alternative view: Bayesian: what is the most likely model given the data?
p(M|D)
• Model and data are viewed as random variables
– model M as random variable
– data D as random variable

Bayesian Estimation

• Reformulation of p(M|D) using Bayes rule:
p(M|D) = p(D|M) p(M) / p(D)
argmax_M p(M|D) = argmax_M p(D|M) p(M)
• p(M|D) answers the question: what is the most likely model given the data?
• p(M) is a prior that prefers certain models (e.g. simple models)
• The frequentist estimation of word probabilities p(w) is the same as Bayesian estimation with a uniform prior (no bias towards a specific model), hence it is also called maximum likelihood estimation

Entropy

• An important concept is entropy:
H(X) = Σ_x −p(x) log2 p(x)
• A measure for the degree of disorder

Entropy Example

p(a) = 1
One event:
H(X) = −1 log2 1 = 0

Entropy Example

p(a) = 0.5, p(b) = 0.5
2 equally likely events:
H(X) = −0.5 log2 0.5 − 0.5 log2 0.5 = −log2 0.5 = 1

Entropy Example

p(a) = 0.25, p(b) = 0.25, p(c) = 0.25, p(d) = 0.25
4 equally likely events:
H(X) = −0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25 = −log2 0.25 = 2

Entropy Example

p(a) = 0.7, p(b) = 0.1, p(c) = 0.1, p(d) = 0.1
4 events, one more likely than the others:
H(X) = −0.7 log2 0.7 − 0.1 log2 0.1 − 0.1 log2 0.1 − 0.1 log2 0.1
= −0.7 log2 0.7 − 0.3 log2 0.1
= −0.7 × −0.5146 − 0.3 × −3.3219
= 0.36020 + 0.99658
= 1.35678

Entropy Example

p(a) = 0.97, p(b) = 0.01, p(c) = 0.01, p(d) = 0.01
4 events, one much more likely than the others:
H(X) = −0.97 log2 0.97 − 0.01 log2 0.01 − 0.01 log2 0.01 − 0.01 log2 0.01
= −0.97 log2 0.97 − 0.03 log2 0.01
= −0.97 × −0.04394 − 0.03 × −6.6439
= 0.04262 + 0.19932
= 0.24194

Examples

Entropy values of the example distributions pictured on the slide:
H(X) = 0, H(X) = 1, H(X) = 2, H(X) = 3, H(X) = 1.35678, H(X) = 0.24194

Intuition Behind Entropy

• A good model has low entropy → it is more certain about outcomes
• For instance a translation table

e      f      p(e|f)
the    der    0.8
that   der    0.2

is better than

e      f      p(e|f)
the    der    0.02
that   der    0.01
...    ...    ...

• A lot of statistical estimation is about reducing entropy

Information Theory and Entropy

• Assume that we want to encode a sequence of events X
• Each event is encoded by a sequence of bits
• For example
– Coin flip: heads = 0, tails = 1
– 4 equally likely events: a = 00, b = 01, c = 10, d = 11
– 3 events, one more likely than others: a = 0, b = 10, c = 11
– Morse code: e has a shorter code than q
• Average number of bits needed to encode X ≥ entropy of X
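A tiny sketch, not part of the slides, that computes H(X) for the example distributions above:

```python
from math import log2

def entropy(dist):
    """H(X) = sum over x of -p(x) log2 p(x); events with p(x) = 0 contribute nothing."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)

print(entropy({"a": 1.0}))                                      # 0.0
print(entropy({"a": 0.5, "b": 0.5}))                            # 1.0
print(entropy({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))    # 2.0
print(entropy({"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}))        # 1.35678...
print(entropy({"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}))    # 0.24194...
```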
The Entropy of English

• We already talked about the probability of a word p(w)
• But words come in sequence. Given a number of words in a text, can we guess the next word p(wn|w1, ..., wn−1)?
• Assuming a model with a limited window size:

Model              Entropy
0th order          4.76
1st order          4.03
2nd order          2.8
human, unlimited   1.3

Next Lecture: Language Models

• Next time, we will expand on the idea of a model of English in the form p(wn|w1, ..., wn−1)
• Despite its simplicity, a tremendously useful tool for NLP
• Nice machine learning challenge
– sparse data
– smoothing
– back-off and interpolation
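As a small preview of that lecture, here is a sketch of the simplest such model, a bigram model p(wn|wn−1) estimated by relative frequencies; the toy corpus and helper name are invented for illustration:

```python
from collections import Counter

# Invented toy corpus, already tokenized
corpus = "the quick brown fox jumps over the lazy dog . the dog sleeps .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])

def p_next(word, prev):
    """Maximum likelihood estimate of p(w_n = word | w_{n-1} = prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]

print(p_next("dog", "the"))   # 1/3: "the" is followed once each by "quick", "lazy", "dog"
```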