Corpus Acquisition from the Internet
Philipp Koehn
partially based on slides from Christian Buck
8 November 2022
Philipp Koehn Machine Translation: Corpus Acquisition from the Internet 8 November 2022
Big Data
For many language pairs, lots of text available.
• Text you read in your lifetime: 300 million words
• Translated text available: billions of words
• English text available: trillions of words

Mining the Web
• Largest source for text: the World Wide Web
– CommonCrawl: publicly available crawl of the web
– hosted by Amazon Web Services, but can be downloaded
– regularly updated (semi-annual)
– 2-4 billion web pages per crawl
• Currently filling up hard drives in our lab

Monolingual Data
• Starting point: 35TB of text
• Processing pipeline [Buck et al., 2014]
– language detection
– deduplication
– normalization of Unicode characters
– sentence splitting
• Obtained corpora
Language   Lines (B)   Tokens (B)   Bytes       BLEU (WMT)
English    59.13       975.63       5.14 TB
German     3.87        51.93        317.46 GB   +0.5
Spanish    3.50        62.21        337.16 GB
French     3.04        49.31        273.96 GB   +0.6
Russian    1.79        21.41        220.62 GB   +1.2
Czech      0.47        5.79         34.67 GB    +0.6

Parallel Data
• Basic processing pipeline [Smith et al., 2013]
– find parallel web pages (based on URL only)
– align document by HTML structure
– sentence splitting and tokenization
– sentence alignment
– filtering (remove boilerplate)
• Obtained corpora
                 French   German   Spanish   Russian   Japanese   Chinese
Segments         10.2M    7.50M    5.67M     3.58M     1.70M      1.42M
Foreign Tokens   128M     79.9M    71.5M     34.7M     9.91M      8.14M
English Tokens   118M     87.5M    67.6M     36.7M     19.1M      14.8M

                 Bengali  Farsi    Telugu    Somali    Kannada    Pashto
Segments         59.9K    44.2K    50.6K     52.6K     34.5K      28.0K
Foreign Tokens   573K     477K     336K      318K      305K       208K
English Tokens   537K     459K     358K      325K      297K       218K
• Much more work needed!

Data Cleaning and Subsampling
• Not all data useful – some may be harmful
• Removing data based on
– domain relevance
– alignment quality
– redundancy
– bad language (orthography, non-words)
– machine translated or poorly translated
• Removing bad data always reduces training time
• Removing bad data sometimes helps quality
• Clean data approach (only using high quality data) helps in limited domains
corpus crawling

Finding Monolingual Text
• Simple Idea
1. Download many websites
2. Extract text from HTML
3. Guess language of text
4. Add to corpus
5. Profit
• Turns out all these steps are quite involved
extracting text

A Web Page

HTML Source

Method 1: Strip Tags

Method 2: HTML Parser
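The two extraction methods can be contrasted in a minimal sketch. It assumes Python's standard library only; the names strip_tags, TextExtractor and parse_text are hypothetical, and the slides do not say which tools were actually used.

```python
import re
from html.parser import HTMLParser


def strip_tags(html: str) -> str:
    """Method 1: delete everything that looks like a tag.

    Script and style content survives, and entities stay undecoded.
    """
    text = re.sub(r"<[^>]*>", " ", html)
    return " ".join(text.split())


class TextExtractor(HTMLParser):
    """Method 2: parse the HTML and keep only visible text nodes."""

    SKIP = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &eacute;
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def parse_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

On a page containing a style block, a paragraph and a script, strip_tags keeps the CSS and JavaScript as if they were text, while parse_text returns only the paragraph. Real pages are messier (broken markup, boilerplate navigation), so this is a starting point rather than a full extractor.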
language detection

What Language?

Clues: Letter N-Grams
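A toy sketch of how letter n-grams can drive language identification. It assumes character 1- to 3-grams, a naive Bayes style score and add-one smoothing; the class NgramLanguageId and the tiny training strings are hypothetical, and real tools train on far more data.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield all character n-grams of length 1..n_max, with padding spaces."""
    text = f" {text.lower()} "
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


class NgramLanguageId:
    def __init__(self, n_max=3):
        self.n_max = n_max
        self.counts = {}    # language -> Counter of n-grams
        self.vocab = set()  # all n-grams seen in training

    def train(self, lang, text):
        counter = self.counts.setdefault(lang, Counter())
        counter.update(char_ngrams(text, self.n_max))
        self.vocab.update(counter)

    def score(self, lang, text):
        """Sum of smoothed log-probabilities of the text's n-grams."""
        counter = self.counts[lang]
        total = sum(counter.values())
        size = len(self.vocab)
        return sum(
            math.log((counter[g] + 1) / (total + size))
            for g in char_ngrams(text, self.n_max)
        )

    def classify(self, text):
        return max(self.counts, key=lambda lang: self.score(lang, text))
```

Usage: train one model per language, then call classify on new text. Scores are sums of log-probabilities, so shorter inputs give scores closer to zero and carry less evidence.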

Example: langid.py
• Muitas intervenções alertaram
– prediction: Portuguese
– high confidence (-90.8)
• Muitas intervenções
– prediction: Portuguese
– fairly high confidence (-68.2)
• Muitas
– prediction: English
– low confidence (9.1)

Language Identification Tools
• langid.py (Lui & Baldwin, ACL 2012)
– 1-4 grams, Naive Bayes, feature selection
• TextCat (based on Cavnar & Trenkle, 1994)
– similar to langid.py
– no Feature Selection
• Compact/Chromium Language Detector 2 (Google)
– takes hints from TLD and metadata
– super fast
– detects spans of text

Detected Languages in CommonCrawl
(Buck and Heafield, LREC 2014)

Most Common English Phrases

Benefit of Huge Language Models
bilingual corpus crawling

Mining Bilingual Text
• Bilingual text = same text in different languages
• Usually: one side translation of the other
• Full page or interface/content only
• Potentially translation on same page
e.g., Twitter, Facebook posts

Pipeline
1. Identify web sites worth crawling
2. Crawl web site
3. Language detection — as before
4. Extract text from HTML — as before
5. Align documents
6. Align sentences
7. Clean corpus
identify web sites

Targeted Crawling
• A few web sites with a lot of parallel text, e.g.,
– European Union, e.g., proceedings of the European Parliament
– Canadian Hansards
– United Nations
– Project Syndicate
– TED Talks
– Movie / TV show subtitles
– Global Voices
• Hand-written tools
– crawling
– text extraction
– document alignment
• Few days effort per site

Broad Crawling
• Identify many web sites to crawl
– has the phrase "This page in English" or variants
– has link to language flag
– known to have content in multiple languages (from CommonCrawl)
• Follow links
– up to n links deep into site
– up to n links in total
– only follow links to web pages, not images, etc.
• Avoid crawling sites too deeply that do not have parallel text?
(requires quick feedback from downstream processing)
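The link-following limits above can be sketched as a breadth-first crawl with a depth bound and a page budget. This is an illustration only: crawl, get_links and SKIP_EXTENSIONS are hypothetical names, and get_links stands in for whatever downloads a page and extracts its links.

```python
from collections import deque

# Hypothetical list of file extensions that are not web pages.
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")


def crawl(start, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl: at most max_depth links deep, max_pages in total."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in get_links(url):
            if link in seen or link.lower().endswith(SKIP_EXTENSIONS):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

For testing, get_links can be a lookup in a dictionary that maps each URL to its outgoing links. Stopping early on sites without parallel text would need an extra callback from downstream processing, which is not shown here.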
document alignment

Document Alignment
• Early Work: STRAND (Resnik 1998, 1999)
(Structural Translation Recognition, Acquiring Natural Data)
• Pipeline
1. candidate generation
2. candidate ranking
3. filtering
4. optional: sentence alignment
5. evaluation

Link Structure
• Parent page: a page that links to different language versions

Parent Page Example

Sibling Page
• A page that links to its translation in another language

URL Matching
• Often URLs differ only slightly, often indicating language
xyz.com/en/ xyz.com/fr/
xyz.com/bla.htm xyz.com/bla.htm?lang=FR
xyz.com/the cat xyz.fr/le chat
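One way to exploit such URL differences is to strip language markers and group URLs that become identical. The sketch below is an assumption, not the method behind the slides: the language codes, the query parameter names and the helper names normalize and candidate_groups are all illustrative, and it does not handle translated paths such as the third example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small sets of markers.
QUERY_LANG = re.compile(r"[?&](?:lang|lng|language|hl|locale)=[A-Za-z]+", re.I)
PATH_LANG = re.compile(r"/(?:en|fr|de|es)(?=/|$)", re.I)


def normalize(url):
    """Remove language markers so that translations map to the same key."""
    url = QUERY_LANG.sub("", url)  # xyz.com/bla.htm?lang=FR -> xyz.com/bla.htm
    url = PATH_LANG.sub("", url)   # xyz.com/en/ -> xyz.com/
    return url


def candidate_groups(urls):
    """Group URLs by normalized form; groups of two or more are candidates."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

The normalized key is only used for grouping, so it does not have to be a valid URL. Candidate groups still need checking by language detection and content comparison.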

Finding URL Patterns
• URLs with pattern =en
Count Pattern
545875 lang=en
140420 lng=en
126434 LANG=en
110639 hl=en
99065 language=en
81471 tlng=en
56968 l=en
47504 locale=en
33656 langue=en
33503 lang=eng
19421 uil=English
15170 ln=en
14242 Language=EN
13948 lang=EN
12108 language=english
11997 lang=engcro
11646 store=en

Finding URL Patterns
• URLs with pattern lang.*=.*
Count Pattern
13948 lang=EN
13456 language=ca
13098 switchlang=1
12960 language=zh
12890 lang=Spanish
12471 lang=th
12266 langBox=US
12108 language=english
12003 lang=cz
11997 lang=engcro
11635 lang=sl
11578 lang=d
11474 lang=lv
11376 lang=NL
11349 lang=croeng
11244 lang=English

Document Length
• Extract texts and compare lengths (Smith 2001)
• Document or sentence level

Document Object Model
• Translated web pages often retain similar structure
• This includes links to the same images, etc.

Linearized Structure

Levenshtein Alignment
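A small sketch of comparing two pages by structure: each page is linearized into a sequence of tag tokens, and the two sequences are compared by Levenshtein (edit) distance. The token format is an assumption; text chunks are reduced to a TEXT placeholder here, whereas a fuller version might also record chunk lengths.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Turn an HTML page into a flat sequence of structure tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.tokens.append("TEXT")  # placeholder for a chunk of text


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]
```

A low distance relative to the sequence lengths suggests that two pages share the same layout and are plausible translations of each other.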

Content Similarity
• Simple things
– same numbers or names in documents
– often quite effective
• Use of lexicon
– treat documents as bag of words
– consider how many words in EN document have translations in FR document
• A bit more complex
– semantic representations of document content
– bag of word vectors
– neural network embeddings
• Major challenge: do this fast for n × m document pairs

Google’s Content Matching
• Basic idea: translate everything into English, match large n-grams
• For each non-English document:
1. Translate everything to English using MT
2. Find distinctive n-grams
(a) rare, but not too rare (5-grams)
(b) used for matching only
• Build inverted index: n-gram → documents
[cat sat on] → {[doc1, ES], [doc3, DE], ...}
[on the mat] → {[doc1, ES], [doc2, FR], ...}

Matching using Inverted Index
[cat sat on] → {[doc1, ES], [doc3, DE], ...}
[on the mat] → {[doc1, ES], [doc2, ES], ...}
[on the table] → {[doc3, DE]}
• For each n-gram
– generate all pairs where:
∗ document list short (≤ 50)
∗ source language different
• Result: [doc1, doc3], ...
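The candidate generation step can be sketched as follows. The function name candidate_pairs and the in-memory dictionaries are assumptions for illustration; a web-scale system would need a distributed index.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(doc_ngrams, doc_lang, max_docs=50):
    """Pair documents that share an n-gram and differ in source language.

    doc_ngrams: document id -> set of matching n-grams (after translation)
    doc_lang:   document id -> original language of the document
    """
    index = defaultdict(set)  # inverted index: n-gram -> documents
    for doc, ngrams in doc_ngrams.items():
        for ngram in ngrams:
            index[ngram].add(doc)

    pairs = set()
    for docs in index.values():
        if len(docs) > max_docs:
            continue  # n-gram too common to be distinctive
        for a, b in combinations(sorted(docs), 2):
            if doc_lang[a] != doc_lang[b]:
                pairs.add((a, b))
    return pairs
```

With the three n-grams from the slide, doc1 and doc2 share [on the mat] but are both Spanish, so only (doc1, doc3) is produced.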

Scoring using Forward Index
• Forward index maps documents to n-grams
• For each document pair [d1, d2]
– collect scoring n-grams for both documents
– build IDF-weighted vector
– distance: cosine similarity

Scoring Document Pairs
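• Given $\mathrm{ngrams}(d_1) = n_1, n_2, \ldots, n_r$
  $\mathrm{ngrams}(d_2) = n_1, n_2, \ldots, n_r$
• Inverse document frequency
  \[ \mathrm{idf}(n) = \log \frac{|D|}{\mathrm{df}(n)} \]
  where: $|D|$ = number of documents
  $\mathrm{df}(n)$ = number of documents with $n$
• Scoring of IDF-weighted vectors $v$
  \[ v_{1,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_1) \\ 0 & \text{otherwise} \end{cases} \]
  \[ v_{2,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_2) \\ 0 & \text{otherwise} \end{cases} \]
  \[ \mathrm{score}(d_1, d_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|} \]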
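The scoring formula can be written out directly. This sketch assumes the forward index is a plain dictionary from document id to a set of n-grams; the function names idf and score mirror the slide, everything else is illustrative.

```python
import math


def idf(ngram, doc_ngrams):
    """Inverse document frequency: log(|D| / df(n))."""
    df = sum(1 for ngrams in doc_ngrams.values() if ngram in ngrams)
    return math.log(len(doc_ngrams) / df)


def score(d1, d2, doc_ngrams):
    """Cosine similarity of the IDF-weighted n-gram vectors of d1 and d2."""
    n1, n2 = doc_ngrams[d1], doc_ngrams[d2]
    weight = {n: idf(n, doc_ngrams) for n in n1 | n2}
    # Only shared n-grams contribute to the dot product.
    dot = sum(weight[n] ** 2 for n in n1 & n2)
    norm1 = math.sqrt(sum(weight[n] ** 2 for n in n1))
    norm2 = math.sqrt(sum(weight[n] ** 2 for n in n2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # all n-grams occur in every document
    return dot / (norm1 * norm2)
```

Documents with identical distinctive n-grams score 1, documents without any shared n-gram score 0. An n-gram that occurs in every document has an IDF of zero and does not influence the score.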
sentence alignment

Sentence Alignment
• Much early work in 1990s, e.g., Gale and Church (1991)
– find sequence of 1-1, 1-2, 0-1, etc., sentence alignment groups
– good element in sequence = similar number of words
– dynamic programming search for best sequence
• Featurized alignments
– with dictionary (Hunalign)
– with induced dictionary (Gargantua)
– consider tags such as <P>
• Sensitive to noise — often large parts of page not translated
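A simplified sketch of length-based alignment with dynamic programming. It is not the Gale and Church model itself: the cost here is just the absolute difference in word counts plus a hypothetical fixed penalty per group type, chosen for illustration.

```python
# Allowed groups (source sentences, target sentences) and assumed penalties
# that favour 1-1 groups over merges, and merges over unaligned sentences.
PENALTY = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def words(sentences):
    return sum(len(s.split()) for s in sentences)


def align(src, tgt):
    """Return the cheapest sequence of alignment groups covering src and tgt."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]   # best[i][j]: cost of src[:i], tgt[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # predecessor cell on the best path
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                diff = abs(words(src[i:ni]) - words(tgt[j:nj]))
                cost = best[i][j] + penalty + diff
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Trace back from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return groups[::-1]
```

The search fills an (n+1) × (m+1) table, so it is quadratic in the number of sentences. Because only lengths are compared, untranslated stretches of a page can pull the alignment off course, which is the noise sensitivity noted above.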

Sentence Pair Similarity
• Core Problem: both sentences must have same meaning
• Translate foreign sentence into English,
then measure similarity with metrics like BLEU
• Words in one sentence have translation in the other
• Cross-lingual sentence embeddings

Sentence Embeddings
• LASER: Neural machine translation model with bottleneck feature

Vecalign
• Uses LASER sentence embeddings
• Linear time coarse-to-fine algorithm
sentence pair filtering

Filtering Bad Data
• Mismatched sentence pairs from errors in pipeline
• Non-literal translation
e.g., news stories are notoriously non-literal
• Bad translations
• Machine translation
– much of the parallel text on the Internet generated by Google Translate
– detection hard — looks like very clean parallel data
– maybe too clean (little reordering, very literal)
– watermarking machine translation (Venugopal et al., 2011)
• How clean should it be?
– trade-off between precision and recall unclear

Methods
• Dual cross-entropy
– view sentence pair as input/output
– score with neural machine translation model in both directions
– scores should be low and similar
• LASER embeddings
• Feature-based approaches
– matching numbers, named entities
– language model probabilities
– lexical translation probabilities
• Classifier
– positive example: sentence pair from clean corpus
– negative example: corrupted example (misalignment, words changed, ...)
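For the dual cross-entropy idea, "low and similar" can be turned into a single score. The sketch below uses one common formulation, which is an assumption here since the slide gives no formula: the absolute difference of the two cross-entropies plus their average, mapped into (0, 1]. The inputs are assumed to be per-word cross-entropies from two translation models, one per direction, and the function name is hypothetical.

```python
import math


def dual_cross_entropy_score(h_forward, h_backward):
    """Score a sentence pair from its cross-entropy in both directions.

    h_forward:  per-word cross-entropy of the target given the source
    h_backward: per-word cross-entropy of the source given the target
    """
    disagreement = abs(h_forward - h_backward)  # scores should be similar
    average = 0.5 * (h_forward + h_backward)    # scores should be low
    return math.exp(-(disagreement + average))
```

A pair that both models find likely and agree on scores close to 1. A pair that one model finds much less likely than the other is penalized even if the average is moderate.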