MUNI
FI
Meanings
Context similarity in huge corpora
PA154 Language Modeling (8.2)
The word (and some of its parts) are the basic carriers of meaning
■ word without context - no meaning, many meaning potentials
■ the same word in different contexts - different meanings
■ word in similar contexts - same meaning
■ what is context?
Pavel Rychlý
pary@fi.muni.cz April 16, 2024
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
What is context?
Context is the words around the keyword.
■ What surroundings?:
■ the following word
■ previous word
■ window: +1 to +5
■ window: -5 to -1
■ Not all words around are important.
■ How do we determine importance?
■ the most common collocation - but that's "the"
■ (statisticaLLy) most significant - what formuLa?
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
Word Sketch
One-page summary of word behaviour
research	is noun 25,537ä t			
usually In plurals [99.i'M		. percentile 21	9)	
		n x	^*                  =•: U X	*=*                 =•: U
	modifier		modifies	subject_of
scientific			grant	aim
recent			project	focus
cancer			laboratory	investigate
empirical		...	institute	show
market		...	finding	examine
further		...	contract	indicate
Cray			programme	suggest
medical			council	reveal
historical		...	fellow	explore
applied			centre	concentrate
	Context si	Hilarity Tň hu	jI^A^St!!!1- April 16, 2024 *	involve
4/17
Word Sketch How to create it
Grammatical Relations Definition
■ Large balanced corpus
■ Find grammatical realtions (subjects, objects, heads, modifiers etc)
■ List of collocations for each grammatical session
■ Statistics for sorting each list
We can create a thesaurus from Word Sketch.
plain text file
a set of queries for each GR
queries contain labels for keyword and collocate
processing options
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
5/17
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
6/17
Grammar relation definitions
Association score
# 'modifier' and 'modify' gramrels definition *DUAL
^modifier/modify 2:"AD." 1:"N. ."
# 'and/or' gramrel definition =and/or
*SYMMETRIC
1:[] [word="and"|word="or"] 2:[] & l.tag = 2.tag
number of occurrences (wordi, gramrel, word2 AScore(w1,R, w2) = 14 + log2 Dice
14+log2||^™
1^1,/?,*
= ,W2|
# 'adverb' gramrel definition =adverb
1:[] 2:"AV."
2:"AV." 1:[]
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
Similarity coefficient
Data Sizes
comparison of word sketches w1 and w2 only important (significant) contexts what is the common
counts (wordx, (gramrel, word,)) and (word2, (gramrel, word,))
Z)(tup;,tupy)e{tup1»1ntups^}     + ASj - (AS, - ASj)2/S0
Sim(wi, w2) = ■
E
tupí € {tupWl Utupw2 }
AS;
Corpus sizes, their vocabularies and word counts in contexts
Corpus	Size	Words	Lemat	Different ctx	All ctx
BNC	111m	776k	722k	23m	63m
SYN2000	114m	1.65m	776k	19m	58m
OEC	1.12g	3.67m	3.12m	84m	569m
Itwac	1.92g	6.32m	4.76m	67m	587m
Vocabulary sizes and the number of different contexts grow sublinearly with the size of the corpus.
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
Matrix size
Practical data sizes
Similarity of all pairs of lemmas
Matrix of size N2, where N is 700k - 5m
Number of elements in orders of tera (1012)
Matrix is fortunately very sparse
Most values are 0 or "almost" 0
Even most of the whole rows/columns are empty
Computation only for words with minimum frequency Better to limit the number of contexts than the number of occurrences
Take only statistically significant contexts
Corpus	MIN	Lemmat	KWIC	CTX
BNC	1	152k	5.7m	608 k
BNC	20	68 k	5.6m	588k
OEC	2	269k	27.5 m	994k
OEC	20	128k	27.3m	981k
OEC	200	48 k	26.7m	965k
Itwac	20	137k	24.8 m	1.1m
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
11/17
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
12/17
Practical data sizes
■ Matrix of size N2, where N is 50k - 200k
■ Number of elements in orders of giga (1010)
■ The value of each element is created by applying the similarity function to vectors of length K = 500k - lm.
■ The straightforward algorithm for computing the whole matrix has a time complexity 0(N2K).
m The complexity is polynomial, but the algorithm is practically unusable for given ranges of values.
■ Estimated calculation times are in months or years.
■ Heuristics reduce the sizes of N and K at the expense of accuracy the resulting values.
■ The calculation time is then in the order of days with an error of 1-4%.
Efficient algorithm
Even the smaller matrix is very sparse
No need to calculate similarity for words that have nothing
together,
they have no common context.
The main loop of the algorithm is not through words, but through contexts.
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
Efficient algorithm
Optimization
■ Input: list of all possible words in contexts, (w, r, w'), with frequencies of occurrences in the corpus
■ Output: word similarity matrix sim(wi, w2)
for (r, w') in CONTEXTS:
WLIST = set of all w where (w, r, w') exists for wi in WLIST: for w2 in WLIST:
sim(w1,W2)+ = /(frequencies)
If | WUST\ > 10000, skip the context.
We do not keep the matrix sim(wi, w2) in memory during the ca Iculation.
Repeated runs of the main loop for the limited range w^. Instead of sim(wi, w2)+ = x we generate (wlt w2,x) to the output.
We then sort the output list and add the individual xs. Use of TPMMS (Two Phase Multi-way Merge Sort) with continuous by summation. Instead of several hundred GB, we sort a few GB.
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
^avel Rychlý ■ Context similarity in huge corpora ■ April 16,2024
Results
Algorithm is orders of magnitude faster than straightforward algorithm. (18 days x 2 hours)
Corpus	MIN	Lemmat	KWIC	CTX	Time
BNC	1	152k	5.7m	608k	13m 9s
BNC	20	68k	5.6m	588k	9m 30s
OEC	2	269k	27.5 m	994k	lh40m
OEC	20	128k	27.3m	981k	lh 27m
OEC	200	48 k	26.7m	965k	lh 10m
Itwac	20	137k	24.8m	1.1m	lh 16m
Without changes in precision Possibilities of easy parallelization.
^avel Rychly ■ Context similarity in huge corpora ■ April 16,2024
17/17