Machine Translation
Philipp Koehn
31 August 2021

What is This?
• A class on machine translation
• Taught at Johns Hopkins University, Fall 2021
• Class web site: http://www.mt-class.org/jhu/
• Tuesdays and Thursdays, 1:30-2:45, Hodson 213
• Instructor: Philipp Koehn
• TAs: Kelly Marchisio
• Grading
  – five programming assignments (12% each)
  – final project (30%)
  – in-class presentation: language in ten minutes (10%)

Why Take This Class?
• Close look at an artificial intelligence problem
• Practical introduction to natural language processing
• Introduction to deep learning for structured prediction

Textbooks

some history

An Old Idea
Warren Weaver on translation as code breaking (1947):
When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

Early Efforts and Disappointment
• Excited research in 1950s and 1960s
  1954 Georgetown experiment
  Machine could translate using 250 words and 6 grammar rules
• 1966 ALPAC report:
  – only $20 million spent on translation in the US per year
  – no point in machine translation

Rule-Based Systems
• Rule-based systems
  – build dictionaries
  – write transformation rules
  – refine, refine, refine
• Météo system for weather forecasts (1976)
• Systran (1968), Logos and Metal (1980s)

"have" :=
  if subject(animate) and object(owned-by-subject)
    then translate to "kade... aahe"
  if subject(animate) and object(kinship-with-subject)
    then translate to "laa... aahe"
  if subject(inanimate)
    then translate to "madhye... aahe"

Statistical Machine Translation
• 1980s: IBM
• 1990s: increased research
• Mid 2000s: Phrase-Based MT (Moses, Google)
• Around 2010: commercial viability

Neural Machine Translation
• Late 2000s: neural models for computer vision
• Since mid 2010s: neural models for machine translation
• 2016: Neural machine translation the new state of the art

Hype
[Figure: hype versus reality over time, 1950 to 2020; labeled points: Georgetown experiment, Expert systems / 5th generation AI, Statistical MT, Neural MT]

how good is machine translation?

Machine Translation: Chinese

Machine Translation: French

A Clear Plan
[Figure: Source and Target connected by Lexical Transfer, Syntactic Transfer, Semantic Transfer, and Interlingua; Analysis on the source side, Generation on the target side]

Learning from Data
[Figure: Training Data (parallel corpora, monolingual corpora, dictionaries) and Linguistic Tools feed Training of a Statistical Machine Translation System; Using the system turns Source Text into Translation]

why is that a good plan?

Word Translation Problems
• Words are ambiguous
  He deposited money in a bank account with a high interest rate.
  Sitting on the bank of the Mississippi, a passing ship piqued his interest.
• How do we find the right meaning, and thus translation?
• Context should be helpful

Syntactic Translation Problems
• Languages have different sentence structure
  das     behaupten   sie     wenigstens
  this    claim       they    at least
  the                 she
• Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
• Ambiguities can be resolved through syntactic analysis
  – the meaning "the" of "das" not possible (not a noun phrase)
  – the meaning "she" of "sie" not possible (subject-verb agreement)

Semantic Translation Problems
• Pronominal anaphora
  I saw the movie and it is good.
• How to translate "it" into German (or French)?
  – "it" refers to "movie"
  – "movie" translates to "Film"
  – "Film" has masculine gender
  – ergo: "it" must be translated into masculine pronoun "er"
• We are not handling this very well [Le Nagard and Koehn, 2010]

Semantic Translation Problems
• Coreference
  Whenever I visit my uncle and his daughters, I can’t decide who is my favorite cousin.
• How to translate "cousin" into German? Male or female?
• Complex inference required

Semantic Translation Problems
• Discourse
  Since you brought it up, I do not agree with you.
  Since you brought it up, we have been working on it.
• How to translate "since"? Temporal or conditional?
• Analysis of discourse structure: a hard problem

Learning from Data
• What is the best translation?
  Sicherheit → security    14,516
  Sicherheit → safety      10,015
  Sicherheit → certainty      334
• Counts in European Parliament corpus
• Phrasal rules
  Sicherheitspolitik → security policy        1,580
  Sicherheitspolitik → safety policy             13
  Sicherheitspolitik → certainty policy           0
  Lebensmittelsicherheit → food security         51
  Lebensmittelsicherheit → food safety        1,084
  Lebensmittelsicherheit → food certainty         0
  Rechtssicherheit → legal security             156
  Rechtssicherheit → legal safety                 5
  Rechtssicherheit → legal certainty            723

Learning from Data
• What is most fluent?
  a problem for translation     13,000
  a problem of translation      61,600
  a problem in translation      81,700
  a translation problem        235,000
• Hits on Google

Learning from Data
• What is most fluent?
  police disrupted the demonstration       2,140
  police broke up the demonstration       66,600
  police dispersed the demonstration      25,800
  police ended the demonstration             762
  police dissolved the demonstration       2,030
  police stopped the demonstration       722,000
  police suppressed the demonstration      1,400
  police shut down the demonstration       2,040

where are we now?

Word Alignment
[Figure: word alignment matrix between German "michael geht davon aus , dass er im haus bleibt" and English "michael assumes that he will stay in the house"]

Phrase-Based Model
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
• Workhorse of today’s statistical machine translation

Syntax-Based Translation
[Figure: German parse tree for "Sie will eine Tasse Kaffee trinken" (tags PPER VAFIN ART NN NN VVINF) converted in numbered steps ➊ to ➏ into an English parse tree for "she wants to drink a cup of coffee"]

Semantic Translation
• Abstract meaning representation [Knight et al., ongoing]
  (w / want-01
    :agent (b / boy)
    :theme (l / love
      :agent (g / girl)
      :patient b))
• Generalizes over equivalent syntactic constructs (e.g., active and passive)
• Defines semantic relationships
  – semantic roles
  – co-reference
  – discourse relations

Neural Model
[Figure: attention-based encoder-decoder network for the sentence pair "the house is big ." / "das Haus ist groß ."; labeled layers: Input Word (x_j), Input Word Embedding (E x_j), Left-to-Right Encoder and Right-to-Left Encoder (h_j), Attention (α_ij), Input Context (c_i), Decoder State (s_i), Output Word Prediction (t_i), Error (-log t_i[y_i]), Output Word (y_i), Output Word Embeddings (E y_i)]

what is it good for?

what is it good enough for?

Why Machine Translation?
Assimilation: reader initiates translation, wants to know content
• user is tolerant of inferior quality
• focus of majority of research (GALE program, etc.)
Communication: participants don’t speak same language, rely on translation
• users can ask questions, when something is unclear
• chat room translations, hand-held devices
• often combined with speech recognition, IWSLT campaign
Dissemination: publisher wants to make content available in other languages
• high demands for quality
• currently almost exclusively done by human translators

Problem: No Single Right Answer
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.
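
One way to make this concrete is to count how many word edits separate two of the translations above. The sketch below is a plain word-level Levenshtein distance divided by the length of the sentence treated as the reference. It is a simplification for illustration only: edit-rate metrics used in evaluation also allow moving blocks of words, and the names `word_edit_distance` and `edit_rate` are assumptions, not part of the course materials.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp_words, ref_words = hyp.split(), ref.split()
    # prev[j] = edits needed to turn the hypothesis words seen so far
    # into the first j reference words
    prev = list(range(len(ref_words) + 1))
    for i, hw in enumerate(hyp_words, start=1):
        curr = [i]
        for j, rw in enumerate(ref_words, start=1):
            substitution = prev[j - 1] + (0 if hw == rw else 1)
            deletion = prev[j] + 1      # drop a hypothesis word
            insertion = curr[j - 1] + 1  # add a reference word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def edit_rate(hyp, ref):
    """Edits per reference word (simplified: no block shifts)."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref_words)


# Three of the translations from the slide; tokens are split on
# whitespace only, so the final period stays attached to the last word.
translations = [
    "Israel is in charge of the security at this airport.",
    "Israel presides over the security of the airport.",
    "Israel took charge of the airport security.",
]

for hyp in translations:
    for ref in translations:
        if hyp is not ref:
            print(f"{edit_rate(hyp, ref):.2f}  {hyp!r} vs {ref!r}")
```

Every pair of distinct translations has a nonzero distance even though all of them are acceptable, which is why a score against any single reference can penalize a correct translation.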

Quality
HTER    assessment
0%
        publishable
10%
        editable
20%
30%
        gistable
40%
        triagable
50%
(scale developed in preparation of DARPA GALE programme)

Applications
HTER    assessment     application examples
0%                     Seamless bridging of language divide
        publishable    Automatic publication of official announcements
10%
        editable       Increased productivity of human translators
20%                    Access to official publications
                       Multi-lingual communication (chat, social networks)
30%
        gistable       Information gathering
                       Trend spotting
40%
        triagable      Identifying relevant documents
50%

Current State of the Art
HTER    assessment     language pairs and domains
0%                     French-English restricted domain
        publishable    French-English news stories
10%                    German-English news stories
        editable       Chinese-English news stories
20%
30%
        gistable       Swahili-English news stories
40%
        triagable      Uyghur-English news stories
50%
(informal rough estimates by presenter)

Thank You
questions?