Basics in Language and Probability
Philipp Koehn
3 September 2020

Quotes

It must be recognized that the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.
Noam Chomsky, 1969

Whenever I fire a linguist our system performance improves.
Frederick Jelinek, 1988

Conflicts?

rationalist vs. empiricist
scientist vs. engineer
insight vs. data analysis
explaining language vs. building applications

language

A Naive View of Language

• Language needs to name
– nouns: objects in the world (dog)
– verbs: actions (jump)
– adjectives and adverbs: properties of objects and actions (brown, quickly)
• Relationships between these have to be specified
– word order
– morphology
– function words

A Bag of Words

dog fox lazy jump brown quick

Relationships

dog fox lazy jump brown quick

Marking of Relationships: Word Order

dog fox lazy jump brown quick
quick brown fox jump lazy dog

Marking of Relationships: Function Words

dog fox lazy jump brown quick
quick brown fox jump over lazy dog

Marking of Relationships: Morphology

dog fox lazy jump brown quick
quick brown fox jumps over lazy dog

Some Nuance

dog fox lazy jump brown quick
the quick brown fox jumps over the lazy dog

Marking of Relationships: Agreement

• From Catullus, First Book, first verse (Latin):
Cui dono lepidum novum libellum arida modo pumice expolitum?
Whom I-present lovely new little-book dry manner pumice polished?
(To whom do I present this lovely new little book now polished with a dry pumice?)
• Gender (and case) agreement links adjectives to nouns

Marking of Relationships to Verb: Case

• German:
Die Frau gibt dem Mann den Apfel
the woman (subject) gives the man (indirect object) the apple (object)
Der Frau gibt der Mann den Apfel
the woman (indirect object) gives the man (subject) the apple (object)
• Case inflection indicates the role of noun phrases

Case Morphology vs. Prepositions

• Two different word orderings for English:
– The woman gives the man the apple
– The woman gives the apple to the man
• Japanese: woman SUBJ man OBJ apple OBJ2 gives
• Is there a real difference between prepositions and noun phrase case inflection?
Words

This is a simple sentence

Morphology

This is a simple sentence
is = be, 3sg, present

Parts of Speech

This/DT is/VBZ a/DT simple/JJ sentence/NN

Syntax

[S [NP This] [VP is [NP a simple sentence]]]

Semantics

SENTENCE1: string of words satisfying the grammatical rules of a language
SIMPLE1: having few parts

Discourse

A following sentence, "But it is an instructive one.", stands in a CONTRAST relation to the first sentence.

Why is Language Hard?

• Ambiguities on many levels
• Rules, but many exceptions
• No clear understanding of how humans process language
• Can we learn everything about language by automatic data analysis?

data

Data: Words

• Definition: strings of letters separated by spaces
• But how about:
– punctuation: commas, periods, etc. are typically separated off (tokenization)
– hyphens: high-risk
– clitics: Joe's
– compounds: website, Computerlinguistikvorlesung
• And what if there are no spaces:
伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走.

Word Counts

Most frequent words in the English Europarl corpus:

any word                          nouns
Frequency in text   Token         Frequency in text   Content word
1,929,379           the           129,851             European
1,297,736           ,             110,072             Mr
  956,902           .              98,073             commission
  901,174           of             71,111             president
  841,661           to             67,518             parliament
  684,869           and            64,620             union
  582,592           in             58,506             report
  452,491           that           57,490             council
  424,895           is             54,079             states
  424,552           a              49,965             member

Word Counts

But also: there is a large tail of words that occur only once.
33,447 words occur once, for instance
• cornflakes
• mathematicians
• Tazhikhistan

Zipf's law

f × r = k

f = frequency of a word
r = rank of a word (if sorted by frequency)
k = a constant

Zipf's law as a graph

Plotting frequency against rank on log-log axes gives roughly a straight line. Why a line in log-scale?

f × r = k  ⇒  f = k / r  ⇒  log f = log k − log r
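Both the word counts and the Zipf check are easy to reproduce. A minimal sketch, not part of the slides, assuming a tokenized plain-text corpus in a file hypothetically named europarl.en, one sentence per line:

```python
from collections import Counter

# Hypothetical corpus file: tokenized English text, one sentence per line
with open("europarl.en", encoding="utf-8") as f:
    counts = Counter(token for line in f for token in line.split())

print(sum(counts.values()), "tokens,", len(counts), "distinct words")
print(sum(1 for c in counts.values() if c == 1), "words occur only once")

# Zipf's law: frequency * rank should stay roughly constant
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    if rank in (1, 10, 100, 1000, 10000):
        print(rank, word, freq, freq * rank)
```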
statistics

Probabilities

• Given word counts we can estimate a probability distribution:
P(w) = count(w) / Σ_w' count(w')
• This type of estimation is called maximum likelihood estimation. Why? We will get to that later.
• Estimating probabilities based on frequencies is called the frequentist approach to probability.
• This probability distribution answers the question: if we randomly pick a word out of a text, how likely is it to be the word w?

A Bit More Formal

• We introduce a random variable W.
• We define a probability distribution p that tells us how likely it is that the variable W takes the value w:
prob(W = w) = p(w)

Joint Probabilities

• Sometimes, we want to deal with two random variables at the same time.
• Example: words w1 and w2 that occur in sequence (a bigram). We model this with the distribution p(w1, w2).
• If the occurrence of words in bigrams is independent, we can reduce this to p(w1, w2) = p(w1) p(w2). Intuitively, this is not the case for word bigrams.
• We can estimate joint probabilities over two variables the same way we estimated the probability distribution over a single variable:
p(w1, w2) = count(w1, w2) / Σ_w1',w2' count(w1', w2')

Conditional Probabilities

• Another useful concept is conditional probability: p(w2|w1)
It answers the question: if the random variable W1 takes the value w1, what is the probability distribution over the second random variable W2?
• Mathematically, we can define conditional probability as
p(w2|w1) = p(w1, w2) / p(w1)
• If W1 and W2 are independent: p(w2|w1) = p(w2)

Chain Rule

• A bit of math gives us the chain rule:
p(w2|w1) = p(w1, w2) / p(w1)
p(w1) p(w2|w1) = p(w1, w2)
• What if we want to break down large joint probabilities like p(w1, w2, w3)? We can repeatedly apply the chain rule:
p(w1, w2, w3) = p(w1) p(w2|w1) p(w3|w1, w2)

Bayes Rule

• Finally, another important rule: Bayes rule
p(x|y) = p(y|x) p(x) / p(y)
• It can easily be derived from the chain rule:
p(x, y) = p(x, y)
p(x|y) p(y) = p(y|x) p(x)
p(x|y) = p(y|x) p(x) / p(y)
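To make these rules concrete, here is a small sketch, not from the slides, using an invented table of bigram counts; it estimates the joint distribution by relative frequency and checks the chain rule and Bayes rule numerically:

```python
from collections import Counter

# Invented bigram counts, purely for illustration
bigram_counts = Counter({
    ("the", "house"): 8, ("the", "car"): 4,
    ("a", "house"): 2, ("a", "car"): 6,
})
total = sum(bigram_counts.values())

def p_joint(w1, w2):
    """Maximum likelihood estimate of p(w1, w2)."""
    return bigram_counts[(w1, w2)] / total

def p1(w1):
    """Marginal p(w1): sum the joint distribution over w2."""
    return sum(c for (v1, _), c in bigram_counts.items() if v1 == w1) / total

def p2(w2):
    """Marginal p(w2): sum the joint distribution over w1."""
    return sum(c for (_, v2), c in bigram_counts.items() if v2 == w2) / total

def p_cond(w2, w1):
    """Conditional probability p(w2|w1) = p(w1, w2) / p(w1)."""
    return p_joint(w1, w2) / p1(w1)

# Chain rule: p(w1) p(w2|w1) = p(w1, w2)
assert abs(p1("the") * p_cond("house", "the") - p_joint("the", "house")) < 1e-12

# Bayes rule: p(w1|w2) = p(w2|w1) p(w1) / p(w2)
p_w1_given_w2 = p_joint("the", "house") / p2("house")
assert abs(p_w1_given_w2 - p_cond("house", "the") * p1("the") / p2("house")) < 1e-12

print(p_cond("house", "the"))  # 8/12: "the" is followed by "house" two thirds of the time
```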
Expectation

• We introduced the concept of a random variable X:
prob(X = x) = p(x)
• Example: roll of a die. There is a 1/6 chance that it will be 1, 2, 3, 4, 5, or 6.
• We define the expectation E(X) of a random variable as:
E(X) = Σ_x p(x) · x
• Roll of a die:
E(X) = 1/6 × 1 + 1/6 × 2 + 1/6 × 3 + 1/6 × 4 + 1/6 × 5 + 1/6 × 6 = 3.5

Variance

• Variance is defined as
Var(X) = E((X − E(X))²) = E(X²) − E(X)²
Var(X) = Σ_x p(x) (x − E(X))²
• Intuitively, this is a measure of how far events diverge from the mean (expectation)
• Related to this is the standard deviation, denoted σ:
Var(X) = σ²    E(X) = µ

Variance

• Roll of a die:
Var(X) = 1/6 (1 − 3.5)² + 1/6 (2 − 3.5)² + 1/6 (3 − 3.5)² + 1/6 (4 − 3.5)² + 1/6 (5 − 3.5)² + 1/6 (6 − 3.5)²
= 1/6 ((−2.5)² + (−1.5)² + (−0.5)² + 0.5² + 1.5² + 2.5²)
= 1/6 (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)
≈ 2.917

Standard Distributions

• Uniform: all events equally likely
– ∀x, y: p(x) = p(y)
– example: roll of one die
• Binomial: a series of trials with only two outcomes
– probability p for each trial, occurrence r out of n times:
b(r; n, p) = (n choose r) p^r (1 − p)^(n−r)
– example: a number of coin tosses

Standard Distributions

• Normal: common distribution for continuous values
– density at a value x, given expectation µ and standard deviation σ:
n(x; µ, σ) = 1 / (σ √(2π)) · e^(−(x − µ)² / (2σ²))
– also called Bell curve, or Gaussian
– examples: heights of people, IQ of people, tree heights, ...
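A short sketch, not part of the slides, reproducing the expectation and variance of the die and the binomial formula above; math.comb provides the binomial coefficient (n choose r):

```python
from math import comb

# Fair die: uniform distribution over 1..6
die = {x: 1 / 6 for x in range(1, 7)}

mu = sum(p * x for x, p in die.items())                # E(X) = 3.5
var = sum(p * (x - mu) ** 2 for x, p in die.items())   # Var(X) = 2.9166...
print(mu, var)

def binomial(r, n, p):
    """b(r; n, p) = (n choose r) p^r (1-p)^(n-r): r successes in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

print(binomial(3, 10, 0.5))  # probability of exactly 3 heads in 10 fair coin tosses, ~0.117
```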
Estimation Revisited

• We previously introduced an estimation of probabilities based on frequencies:
P(w) = count(w) / Σ_w' count(w')
• Alternative view: Bayesian: what is the most likely model given the data?
p(M|D)
• Model and data are viewed as random variables
– model M as random variable
– data D as random variable

Bayesian Estimation

• Reformulation of p(M|D) using Bayes rule:
p(M|D) = p(D|M) p(M) / p(D)
argmax_M p(M|D) = argmax_M p(D|M) p(M)
• p(M|D) answers the question: what is the most likely model given the data?
• p(M) is a prior that prefers certain models (e.g. simple models)
• The frequentist estimation of word probabilities p(w) is the same as Bayesian estimation with a uniform prior (no bias towards a specific model), hence it is also called maximum likelihood estimation

Entropy

• An important concept is entropy:
H(X) = Σ_x −p(x) log2 p(x)
• A measure for the degree of disorder

Entropy Example

p(a) = 1
One event:
H(X) = −1 log2 1 = 0

Entropy Example

p(a) = 0.5, p(b) = 0.5
2 equally likely events:
H(X) = −0.5 log2 0.5 − 0.5 log2 0.5 = −log2 0.5 = 1

Entropy Example

p(a) = 0.25, p(b) = 0.25, p(c) = 0.25, p(d) = 0.25
4 equally likely events:
H(X) = −0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25 = −log2 0.25 = 2

Entropy Example

p(a) = 0.7, p(b) = 0.1, p(c) = 0.1, p(d) = 0.1
4 events, one more likely than the others:
H(X) = −0.7 log2 0.7 − 0.1 log2 0.1 − 0.1 log2 0.1 − 0.1 log2 0.1
= −0.7 log2 0.7 − 0.3 log2 0.1
= −0.7 × −0.5146 − 0.3 × −3.3219
= 0.36020 + 0.99658
= 1.35678

Entropy Example

p(a) = 0.97, p(b) = 0.01, p(c) = 0.01, p(d) = 0.01
4 events, one much more likely than the others:
H(X) = −0.97 log2 0.97 − 0.01 log2 0.01 − 0.01 log2 0.01 − 0.01 log2 0.01
= −0.97 log2 0.97 − 0.03 log2 0.01
= −0.97 × −0.04394 − 0.03 × −6.6439
= 0.04262 + 0.19932
= 0.24194

Examples

Entropy values of the example distributions pictured on the slide:
H(X) = 0, H(X) = 1, H(X) = 2, H(X) = 3, H(X) = 1.35678, H(X) = 0.24194

Intuition Behind Entropy

• A good model has low entropy → it is more certain about outcomes
• For instance a translation table

e      f      p(e|f)
the    der    0.8
that   der    0.2

is better than

e      f      p(e|f)
the    der    0.02
that   der    0.01
...    ...    ...

• A lot of statistical estimation is about reducing entropy

Information Theory and Entropy

• Assume that we want to encode a sequence of events X
• Each event is encoded by a sequence of bits
• For example
– Coin flip: heads = 0, tails = 1
– 4 equally likely events: a = 00, b = 01, c = 10, d = 11
– 3 events, one more likely than others: a = 0, b = 10, c = 11
– Morse code: e has a shorter code than q
• Average number of bits needed to encode X ≥ entropy of X
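A tiny sketch, not part of the slides, that computes H(X) for the example distributions above:

```python
from math import log2

def entropy(dist):
    """H(X) = sum over x of -p(x) log2 p(x); events with p(x) = 0 contribute nothing."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)

print(entropy({"a": 1.0}))                                      # 0.0
print(entropy({"a": 0.5, "b": 0.5}))                            # 1.0
print(entropy({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))    # 2.0
print(entropy({"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}))        # 1.35678...
print(entropy({"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}))    # 0.24194...
```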
The Entropy of English

• We already talked about the probability of a word p(w)
• But words come in sequence. Given a number of words in a text, can we guess the next word p(wn|w1, ..., wn−1)?
• Assuming a model with a limited window size:

Model              Entropy
0th order          4.76
1st order          4.03
2nd order          2.8
human, unlimited   1.3

Next Lecture: Language Models

• Next time, we will expand on the idea of a model of English in the form p(wn|w1, ..., wn−1)
• Despite its simplicity, a tremendously useful tool for NLP
• Nice machine learning challenge
– sparse data
– smoothing
– back-off and interpolation
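As a small preview of that lecture, here is a sketch of the simplest such model, a bigram model p(wn|wn−1) estimated by relative frequencies; the toy corpus and helper name are invented for illustration:

```python
from collections import Counter

# Invented toy corpus, already tokenized
corpus = "the quick brown fox jumps over the lazy dog . the dog sleeps .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])

def p_next(word, prev):
    """Maximum likelihood estimate of p(w_n = word | w_{n-1} = prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]

print(p_next("dog", "the"))   # 1/3: "the" is followed once each by "quick", "lazy", "dog"
```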