Statistical Natural Language Processing
P. Rychlý
NLP Centre, FI MU, Brno
September 21, 2021

Outline
• Word lists
• Collocations
• Language Modeling
• N-grams
• Evaluation of Language Models

Statistical Natural Language Processing
• statistics provides a summary (of a text)
• highlights important or interesting facts
• can be used to model data
• foundation of estimating probabilities
• fundamental statistics: size (+ domain, range)

           Book 1     Book 2
  lines     3,715      1,601
  words    37,703     16,859
  bytes   223,415     91,031

Word list
• list of all words from a text
• list of most frequent words
• words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like

Frequency
• number of occurrences (raw frequency)
• relative frequency (hits per million)
• document frequency (number of documents with a hit)
• reduced frequency (ARF, ALDF), 1 ≤ reduced ≤ raw
• normalization for comparison
• hapax legomena (= 1 hit)

Zipf's Law
• rank-frequency plot
• rank × frequency ≈ constant

[Figures: a rank-frequency plot on a linear scale (rank 0-100) and a log-log plot (rank 10 to 10,000, frequency on a log scale) with the most frequent words "the", "of", "and", "to", "in", "that", "his" labelled.]
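The word-list, frequency and Zipf's-law slides above can be reproduced with a few lines of code. A minimal sketch, assuming plain-text input; the file name book1.txt is a placeholder, and the simple \w+ tokenization is only an approximation of the tokenizer used for the slides.

```python
# Build a frequency word list and check Zipf's law (rank * frequency ~ constant).
# "book1.txt" is a placeholder for any plain-text file.
import re
from collections import Counter

def word_list(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read())
    return Counter(tokens)

freq = word_list("book1.txt")

# Most frequent words (cf. the Word list slide).
for rank, (word, count) in enumerate(freq.most_common(20), start=1):
    # Under Zipf's law the product rank * count stays roughly constant.
    print(f"{rank:3d}  {word:15s} {count:7d}   rank*freq = {rank * count}")

# Hapax legomena: words occurring exactly once.
print("hapax legomena:", sum(1 for c in freq.values() if c == 1))
```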
Keywords
• select only important words from a word list
• compare to a reference text (norm)
• simple math score:

  score = (freq_focus + N) / (freq_reference + N)

  Genesis:        son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
  Little Prince:  prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
• meaning of words is defined by the context
• collocations are salient words in the context
• usually not the most frequent
• filtering by part of speech, grammatical relation
• compare to reference = context for other words
• many statistics (usually single use only) based on frequencies
• MI-score, t-score, χ², ...
• logDice - scalable:

  logDice = 14 + log2( 2·f_AB / (f_A + f_B) )

  (a worked sketch of the keyword score and logDice follows the slides below)

Collocations of "prince"

  modifiers of "prince":
    little   the little prince
    fair     fair, little prince
    Oh       Oh, little prince
    dear     dear little prince
    great    great prince
  verbs with "prince" as object:
    say      said the little prince
    ask      asked the little prince
    demand   demanded the little prince
    see      when he saw the little prince coming
  verbs with "prince" as subject:
    say      the little prince said to himself
    come     saw the little prince coming
    go       And the little prince went away
    inquire  inquired the little prince
    repeat   repeated the little prince
    add      the little prince added
    ask      the little prince asked
    flush    The little prince flushed

[Figure: the same collocates of "prince" shown as a word-sketch visualisation, grouped into modifiers, verbs with "prince" as object and verbs with "prince" as subject.]

Thesaurus
• comparing collocation distributions
• counting same context

  son (as noun, 301×)         Abraham (as noun, 134×)
   1  brother    161           1  Isaac       82
   2  wife       125           2  Jacob      184
   3  father     278           3  Joseph     157
   4  daughter   103           4  Noah        41
   5  child       80           5  Abram       61
   6  man        187           6  Laban       54
   7  servant     91           7  Esau        78
   8  Esau        78           8  God        234
   9  Jacob      184           9  Abimelech   24
  10  name        85          10  father     278

Multi-word units
• meaning of some words is completely different in the context of a specific co-occurring word
• black hole: is not black and is not a hole
• strong collocations
• uses the same statistics with a different threshold
• better to compare context distributions instead of only numbers
• terminology - compare to a reference corpus

Language models - what are they good for?
• assigning scores to sequences of words
• predicting words
• generating text
• statistical machine translation
• automatic speech recognition
• optical character recognition

OCR + MT
[Photo: a Russian sign reading "ВЫХОД В ГОРОД" with the English machine translation "ACCESS TO CITY".]

Language models - probability of a sentence
• an LM is a probability distribution over all possible word sequences
• what is the probability of utterance of s?

Probability of a sentence:
  p_LM(Catalonia President urges protests)
  p_LM(President Catalonia urges protests)
  p_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.
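The sketch referenced above works through the keyword score and logDice on invented frequencies. It assumes the keyword frequencies are already normalized (e.g. hits per million) and uses N = 1 as the smoothing constant; both the counts and that choice of N are illustrative, not taken from the slides' data.

```python
# Sketch of the keyword "simple math" score and logDice from the slides.
# All counts below are invented for illustration.
from math import log2

def keyword_score(freq_focus, freq_ref, N=1.0):
    # score = (freq_focus + N) / (freq_reference + N)
    # frequencies assumed to be normalized (e.g. hits per million)
    return (freq_focus + N) / (freq_ref + N)

def log_dice(f_ab, f_a, f_b):
    # logDice = 14 + log2(2 * f_AB / (f_A + f_B))
    # depends only on these three frequencies, hence "scalable"
    # across corpus sizes; maximum value is 14
    return 14 + log2(2 * f_ab / (f_a + f_b))

# "prince" is frequent in the focus text, rare in the reference text.
print(keyword_score(freq_focus=1500.0, freq_ref=2.0))   # high -> keyword

# Collocation "little prince": f_AB co-occurrences, f_A and f_B word counts.
print(log_dice(f_ab=90, f_a=120, f_b=100))   # ~13.7, strong collocation
print(log_dice(f_ab=2, f_a=120, f_b=5000))   # ~3.7, weak collocation
```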
N-gram models
• an approximation of long sequences using short n-grams
• a straightforward implementation
• an intuitive approach
• good local fluency

Randomly generated text
Czech: "Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.
Hungarian: A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-gram models, naive approach

  W = w1 w2 ... wn

  p(W) = ∏_i p(w_i | w1 ... w_{i-1})

Markov's assumption:

  p(W) ≈ ∏_i p(w_i | w_{i-2} w_{i-1})

  p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)

  p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

Computing LM probabilities, estimation
A trigram model uses 2 preceding words for probability learning.
Using maximum-likelihood estimation:

  p(w3 | w1, w2) = count(w1, w2, w3) / Σ_w count(w1, w2, w)

quadrigram: (lord, of, the, ?)

  w        count    p(w)
  rings   30,156   0.425
  flies    2,977   0.042
  well     1,536   0.021
  manor      907   0.012
  dance      767   0.010

(a counting sketch appears after the interpolation slide below)

Large LM - n-gram counts
How many unique n-grams are in a corpus?

  order       unique        singletons
  unigram         86,700        33,447 (38.6%)
  bigram       1,948,935     1,132,844 (58.1%)
  trigram      8,092,798     6,022,286 (74.4%)
  4-gram      15,303,847    13,081,621 (85.5%)
  5-gram      19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

Language model smoothing
The problem: an n-gram is missing in the data but occurs in a sentence → p(sentence) = 0.
We need to assign a non-zero p to unseen data. This must hold:

  ∀w : p(w) > 0

The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend the real counts of n-grams to the expected counts in any (unseen) data.
Add-one, Add-α, Good-Turing smoothing.

Deleted estimation
We can find unseen n-grams in another corpus. N-grams contained in one of the corpora and not in the other help us to estimate the general amount of unseen n-grams.
E.g. bigrams not occurring in the training corpus but present in the other corpus a million times (given that the number of all possible bigrams equals 7.5 billion) will occur approx.

  10^6 / (7.5 × 10^9) ≈ 0.00013×

Interpolation and back-off
Previous methods treated all unseen n-grams the same. Consider the trigrams

  beautiful young girl
  beautiful young granny

Although we have neither of these in our training data, the former trigram should be more probable.
We will use the probabilities of lower-order models, for which we have the necessary data:

  young girl
  young granny
  beautiful young

Interpolation

  p_I(w3 | w1 w2) = λ1·p(w3) + λ2·p(w3 | w2) + λ3·p(w3 | w1 w2)

If we have enough data we can trust higher-order models more and assign a higher significance to the corresponding n-grams.
p_I is a probability distribution, thus this must hold:

  ∀λn : 0 ≤ λn ≤ 1,   Σn λn = 1
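The counting sketch referenced on the estimation slide: maximum-likelihood trigram probabilities from raw counts, on an invented toy corpus. It also shows how an unseen trigram immediately yields a zero probability (the sparse-data problem).

```python
# Maximum-likelihood trigram estimation:
#   p(w3 | w1, w2) = count(w1, w2, w3) / sum_w count(w1, w2, w)
# Toy sketch on an invented corpus; real models need far more data.
from collections import Counter

corpus = "this is a sentence and this is a test".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    # Using the bigram count as the denominator (equal to
    # sum_w count(w1, w2, w) everywhere except at the text boundary).
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("this", "is", "a"))      # 1.0 -- the only continuation seen
print(p_mle("is", "a", "sentence"))  # 0.5
print(p_mle("is", "a", "banana"))    # 0.0 -- unseen trigram, p(sentence) collapses
```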
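A sketch of the linear interpolation formula above on the "beautiful young girl / granny" example. The toy corpus and the λ weights are invented; in practice the weights would be tuned on held-out data.

```python
# Linear interpolation of unigram, bigram and trigram MLE estimates:
#   p_I(w3 | w1 w2) = l1*p(w3) + l2*p(w3 | w2) + l3*p(w3 | w1 w2)
# Lambdas are hand-picked here; they must be non-negative and sum to 1.
from collections import Counter

corpus = ("the young girl saw a beautiful flower "
          "and the old granny saw a beautiful young man").split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    p1 = unigrams[w3] / N
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

# Neither trigram occurs in the corpus, but the bigram "young girl" does,
# so "girl" (~0.16) gets a much higher interpolated probability than
# "granny" (~0.01), as the slide argues it should.
print(p_interp("beautiful", "young", "girl"))
print(p_interp("beautiful", "young", "granny"))
```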
Quality and comparison of LMs
We need to compare the quality of various LMs (various orders, various data, smoothing techniques etc.):
1) extrinsic evaluation (WER, MT, ASR, OCR) and 2) intrinsic evaluation (perplexity).
A good LM should assign a higher probability to a good(-looking) text than to an incorrect text.
For a fixed test text we can compare various LMs.

Cross-entropy

  H(p_LM) = -(1/n) log2 p_LM(w1, w2, ..., wn)
          = -(1/n) Σ_{i=1..n} log2 p_LM(w_i | w1, ..., w_{i-1})

Cross-entropy is the average value of the negative logarithms of the word probabilities in the testing text. It corresponds to a measure of uncertainty of a probability distribution: the lower, the better.
A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approx. 1.3 bits per letter.

Perplexity

  PP = 2^H(p_LM)

Perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena.
The lower the entropy, the better → the lower the perplexity, the better.
(A worked sketch of both measures follows the final table.)

Comparing smoothing methods (Europarl)

  method         perplexity
  add-one             382.2
  add-α               113.2
  deleted est.        113.4
  Good-Turing         112.9
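The worked sketch of cross-entropy and perplexity as defined above. The per-word probabilities are invented; a real evaluation would take them from a trained language model applied to a test text.

```python
# Cross-entropy  H  = -(1/n) * sum_i log2 p_LM(w_i | history)
# Perplexity     PP = 2 ** H
# The per-word probabilities below are invented for illustration.
from math import log2

test_word_probs = [0.2, 0.05, 0.1, 0.3, 0.01, 0.15]

n = len(test_word_probs)
cross_entropy = -sum(log2(p) for p in test_word_probs) / n
perplexity = 2 ** cross_entropy

print(f"cross-entropy: {cross_entropy:.3f} bits per word")
print(f"perplexity:    {perplexity:.1f}")
# Lower cross-entropy, and hence lower perplexity, means a better model
# on this test text.
```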