Corpus Acquisition from the Internet
Philipp Koehn
partially based on slides from Christian Buck
8 November 2022

Philipp Koehn Machine Translation: Corpus Acquisition from the Internet 8 November 2022

Big Data
• For many language pairs, lots of text available.
    Text you read in your lifetime    300 million words
    Translated text available         billions of words
    English text available            trillions of words

Mining the Web
• Largest source for text: the World Wide Web
  – publicly available crawl of the web
  – hosted by Amazon Web Services, but can be downloaded
  – regularly updated (semi-annual)
  – 2-4 billion web pages per crawl
• Currently filling up hard drives in our lab

Monolingual Data
• Starting point: 35TB of text
• Processing pipeline [Buck et al., 2014]
  – language detection
  – deduplication
  – normalization of Unicode characters
  – sentence splitting
• Obtained corpora

    Language   Lines (B)   Tokens (B)   Bytes        BLEU (WMT)
    English    59.13       975.63       5.14 TB
    German      3.87        51.93       317.46 GB    +0.5
    Spanish     3.50        62.21       337.16 GB
    French      3.04        49.31       273.96 GB    +0.6
    Russian     1.79        21.41       220.62 GB    +1.2
    Czech       0.47         5.79        34.67 GB    +0.6

Parallel Data
• Basic processing pipeline [Smith et al., 2013]
  – find parallel web pages (based on URL only)
  – align document by HTML structure
  – sentence splitting and tokenization
  – sentence alignment
  – filtering (remove boilerplate)
• Obtained corpora

                      French   German   Spanish   Russian   Japanese   Chinese
    Segments          10.2M    7.50M    5.67M     3.58M     1.70M      1.42M
    Foreign Tokens    128M     79.9M    71.5M     34.7M     9.91M      8.14M
    English Tokens    118M     87.5M    67.6M     36.7M     19.1M      14.8M

                      Bengali  Farsi    Telugu    Somali    Kannada    Pashto
    Segments          59.9K    44.2K    50.6K     52.6K     34.5K      28.0K
    Foreign Tokens    573K     477K     336K      318K      305K       208K
    English Tokens    537K     459K     358K      325K      297K       218K

• Much more work needed!

Data Cleaning and Subsampling
• Not all data useful, some may be harmful
• Removing data based on
  – domain relevance
  – alignment quality
  – redundancy
  – bad language (orthography, non-words)
  – machine translated or poorly translated
• Removing bad data always reduces training time
• Removing bad data sometimes helps quality
• Clean data approach (only using high quality data) helps in limited domains

corpus crawling

Finding Monolingual Text
• Simple Idea
  1. Download many websites
  2. Extract text from HTML
  3. Guess language of text
  4. Add to corpus
  5. Profit
• Turns out all these steps are quite involved

extracting text

A Web Page

HTML Source

Method 1: Strip Tags

Method 2: HTML Parser

language detection

What Language?

Clues: Letter N-Grams

Example: langid.py
• Muitas intervenções alertaram
  – prediction: Portuguese
  – high confidence (-90.8)
• Muitas intervenções
  – prediction: Portuguese
  – fairly high confidence (-68.2)
• Muitas
  – prediction: English
  – low confidence (9.1)

Language Identification Tools
• langid.py (Lui & Baldwin, ACL 2012)
  – 1-4 grams, Naive Bayes, feature selection
• TextCat (based on Cavnar & Trenkle, 1994)
  – similar to langid.py
  – no feature selection
• Compact/Chromium Language Detector 2 (Google)
  – takes hints from TLD, metadata
  – super fast
  – detects spans of text

Detected Languages in CommonCrawl
(Buck and Heafield, LREC 2014)

Most Common English Phrases

Benefit of Huge Language Models

bilingual corpus crawling

Mining Bilingual Text
• Bilingual text = same text in different languages
• Usually: one side is a translation of the other
• Full page or interface/content only
• Potentially translation on same page, e.g., Twitter, Facebook posts

Pipeline
1. Identify web sites worth crawling
2. Crawl web site
3. Language detection (as before)
4. Extract text from HTML (as before)
5. Align documents
6. Align sentences
7. Clean corpus

identify web sites

Targeted Crawling
• A few web sites with a lot of parallel text, e.g.,
  – European Union, e.g., proceedings of the European Parliament
  – Canadian Hansards
  – United Nations
  – Project Syndicate
  – TED Talks
  – Movie / TV show subtitles
  – Global Voices
• Hand-written tools
  – crawling
  – text extraction
  – document alignment
• A few days of effort per site

Broad Crawling
• Identify many web sites to crawl
  – has the phrase "This page in English" or variants
  – has link to language flag
  – known to have content in multiple languages (from CommonCrawl)
• Follow links
  – up to n links deep into site
  – up to n links in total
  – only follow links to web pages, not images, etc.
• Avoid crawling too deeply into sites that do not have parallel text?
  (requires quick feedback from downstream processing)

document alignment

Document Alignment
• Early work: STRAND (Resnik 1998, 1999)
  (Structural Translation Recognition, Acquiring Natural Data)
• Pipeline
  1. candidate generation
  2. candidate ranking
  3. filtering
  4. optional: sentence alignment
  5. evaluation

Link Structure
• Parent page: a page that links to different language versions

Parent Page Example

Sibling Page
• A page that links to its translation in another language

URL Matching
• Often URLs differ only slightly, often indicating language

    xyz.com/en/         xyz.com/fr/
    xyz.com/bla.htm     xyz.com/bla.htm?lang=FR
    xyz.com/the_cat     xyz.fr/le_chat

Finding URL Patterns
• URLs with pattern =en

    Count     Pattern
    545875    lang=en
    140420    lng=en
    126434    LANG=en
    110639    hl=en
    99065     language=en
    81471     tlng=en
    56968     l=en
    47504     locale=en
    33656     langue=en
    33503     lang=eng
    19421     uil=English
    15170     ln=en
    14242     Language=EN
    13948     lang=EN
    12108     language=english
    11997     lang=engcro
    11646     store=en

Finding URL Patterns
• URLs with pattern lang.*=.*

    Count     Pattern
    13948     lang=EN
    13456     language=ca
    13098     switchlang=1
    12960     language=zh
    12890     lang=Spanish
    12471     lang=th
    12266     langBox=US
    12108     language=english
    12003     lang=cz
    11997     lang=engcro
    11635     lang=sl
    11578     lang=d
    11474     lang=lv
    11376     lang=NL
    11349     lang=croeng
    11244     lang=English

Document Length
• Extract texts and compare lengths (Smith 2001)
• Document or sentence level

Document Object Model
• Translated web pages often retain similar structure
• This includes links to the same images, etc.
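
The structural-similarity observation above can be turned into a simple score: reduce each page to its sequence of HTML tags and compare the two sequences by edit distance. The following is a minimal sketch under stated assumptions, not the method of any particular system: the names `TagLinearizer`, `linearize`, `levenshtein`, and `structure_similarity` are hypothetical, only tag names are kept (attributes and text are ignored), and the normalization by the longer sequence is one choice among several.

```python
from html.parser import HTMLParser


class TagLinearizer(HTMLParser):
    """Collects the sequence of opening and closing tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def linearize(html):
    """Return the page's tag sequence as a flat list of strings."""
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def structure_similarity(html_a, html_b):
    """1.0 for identical tag sequences, lower as the structures diverge."""
    a, b = linearize(html_a), linearize(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    en = "<html><body><h1>Hello</h1><p>The cat</p></body></html>"
    fr = "<html><body><h1>Bonjour</h1><p>Le chat</p></body></html>"
    print(structure_similarity(en, fr))
```

Because the text content is ignored, a page and its translation with unchanged markup score as identical; in practice such a score would be one signal among several (URL patterns, length, content) rather than a decision on its own.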

Linearized Structure

Levenshtein Alignment

Content Similarity
• Simple things
  – same numbers or names in documents
  – often quite effective
• Use of lexicon
  – treat documents as bag of words
  – consider how many words in EN document have translations in FR document
• A bit more complex
  – semantic representations of document content
  – bag of word vectors
  – neural network embeddings
• Major challenge: do this fast for n × m document pairs

Google's Content Matching
• Basic idea: translate everything into English, match large n-grams
• For each non-English document:
  1. Translate everything to English using MT
  2. Find distinctive n-grams
     (a) rare, but not too rare (5-grams)
     (b) used for matching only
• Build inverted index: n-gram → documents

    [cat sat on] → {[doc1, ES], [doc3, DE], ...}
    [on the mat] → {[doc1, ES], [doc2, FR], ...}

Matching using Inverted Index

    [cat sat on]   -> {[doc1, ES], [doc3, DE], ...}
    [on the mat]   -> {[doc1, ES], [doc2, ES], ...}
    [on the table] -> {[doc3, DE]}

• For each n-gram
  – generate all pairs where:
    ∗ document list short (≤ 50)
    ∗ source language different
• Result: [doc1, doc3], ...
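
The candidate generation step above can be sketched in a few lines. This is an illustrative sketch only: the function names `build_inverted_index` and `candidate_pairs` and the input layout (a dict keyed by `(doc_id, lang)` tuples) are assumptions made for the example, and the documents are taken to be already translated into English and reduced to their matching n-grams. The data mirrors the three-document example on the slide.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(doc_ngrams):
    """Map each n-gram to the set of (doc_id, lang) entries containing it."""
    index = defaultdict(set)
    for doc, ngrams in doc_ngrams.items():
        for ngram in ngrams:
            index[ngram].add(doc)
    return index


def candidate_pairs(index, max_docs=50):
    """Pair up documents that share an n-gram.

    N-grams with long document lists are skipped (too common to be
    distinctive), and pairs with the same source language are dropped.
    """
    pairs = set()
    for docs in index.values():
        if len(docs) > max_docs:
            continue
        for (id_a, lang_a), (id_b, lang_b) in combinations(sorted(docs), 2):
            if lang_a != lang_b:
                pairs.add((id_a, id_b))
    return pairs


if __name__ == "__main__":
    # Toy data following the slide: doc1 and doc2 are Spanish, doc3 is German.
    doc_ngrams = {
        ("doc1", "ES"): {"cat sat on", "on the mat"},
        ("doc2", "ES"): {"on the mat"},
        ("doc3", "DE"): {"cat sat on", "on the table"},
    }
    index = build_inverted_index(doc_ngrams)
    print(candidate_pairs(index))
```

Skipping long document lists is what keeps the number of generated pairs manageable: each n-gram contributes at most 50 × 49 / 2 pairs, instead of a comparison of all n × m document pairs.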

Scoring using Forward Index
• Forward index maps documents to n-grams
• For each document pair [d1, d2]
  – collect scoring n-grams for both documents
  – build IDF-weighted vector
  – score: cosine similarity

Scoring Document Pairs
• Given
  $$\mathrm{ngrams}(d_1) = n_1, n_2, \ldots, n_r$$
  $$\mathrm{ngrams}(d_2) = n_1, n_2, \ldots, n_r$$
• Inverse document frequency
  $$\mathrm{idf}(n) = \log \frac{|D|}{\mathrm{df}(n)}$$
  where: |D| = number of documents, df(n) = number of documents with n
• Scoring of IDF-weighted vectors v
  $$v_{1,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_1) \\ 0 & \text{otherwise} \end{cases}$$
  $$v_{2,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_2) \\ 0 & \text{otherwise} \end{cases}$$
  $$\mathrm{score}(d_1, d_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$

sentence alignment

Sentence Alignment
• Much early work in 1990s, e.g., Gale and Church (1991)
  – find sequence of 1-1, 1-2, 0-1, etc., sentence alignment groups
  – good element in sequence = similar number of words
  – dynamic programming search for best sequence
• Featurized alignments
  – with dictionary (Hunalign)
  – with induced dictionary (Gargantua)
  – consider HTML tags
• Sensitive to noise: often large parts of page not translated

Sentence Pair Similarity
• Core problem: both sentences must have same meaning
• Translate foreign sentence into English, measure similarity with metrics like BLEU
• Words in one sentence have translation in the other
• Cross-lingual sentence embeddings

Sentence Embeddings
• LASER: Neural machine translation model with bottleneck feature

Vecalign
• Uses LASER sentence embeddings
• Linear time coarse-to-fine algorithm

sentence pair filtering

Filtering Bad Data
• Mismatched sentence pairs from errors in pipeline
• Non-literal translation, e.g., news stories are notoriously non-literal
• Bad translations
• Machine translation
  – much of the parallel text on the Internet generated by Google Translate
  – detection hard: looks like very clean parallel data
  – maybe too clean (little reordering, very literal)
  – watermarking machine translation (Venugopal et al., 2011)
• How clean should it be?
  – trade-off between precision and recall unclear

Methods
• Dual cross-entropy
  – view sentence pair as input/output
  – score with neural machine translation model in both directions
  – scores should be low and similar
• LASER embeddings
• Feature-based approaches
  – matching numbers, named entities
  – language model probabilities
  – lexical translation probabilities
• Classifier
  – positive example: sentence pair from clean corpus
  – negative example: corrupted example (misalignment, words changed, ...)
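
The dual cross-entropy criterion ("scores should be low and similar") can be illustrated with a small scoring function. This is a sketch under assumptions that are not stated on the slide: the two inputs are per-word cross-entropies already computed by a forward (source to target) and a backward (target to source) translation model, the two requirements are combined by adding their absolute difference to their average, and the result is mapped into (0, 1] with an exponential. The function names and the threshold value are hypothetical.

```python
import math


def dual_cross_entropy_score(h_fwd, h_bwd):
    """Combine two per-word cross-entropies into a score in (0, 1].

    h_fwd: cross-entropy of the target given the source (forward model)
    h_bwd: cross-entropy of the source given the target (backward model)
    The score is high only if both values are low AND close to each other.
    """
    disagreement = abs(h_fwd - h_bwd)   # penalizes dissimilar scores
    average = 0.5 * (h_fwd + h_bwd)     # penalizes high scores
    return math.exp(-(disagreement + average))


def filter_pairs(scored_pairs, threshold=0.1):
    """Keep sentence pairs whose score reaches the (assumed) threshold.

    scored_pairs: iterable of (pair, h_fwd, h_bwd) tuples.
    """
    return [pair for pair, h_fwd, h_bwd in scored_pairs
            if dual_cross_entropy_score(h_fwd, h_bwd) >= threshold]


if __name__ == "__main__":
    examples = [
        (("das Haus", "the house"), 1.0, 1.0),       # low and similar
        (("das Haus", "click here now"), 1.0, 3.0),  # models disagree
    ]
    print(filter_pairs(examples))
```

Where to set the threshold is exactly the precision/recall trade-off raised under "Filtering Bad Data": a higher threshold yields a cleaner but smaller corpus.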