Words and Morphology
Philipp Koehn
Machine Translation, 20 October 2020

A Naive View of Language

• Language needs to name
– nouns: objects in the world (dog)
– verbs: actions (jump)
– adjectives and adverbs: properties of objects and actions (brown, quickly)
• Relationships between these have to be specified
– word order
– morphology
– function words

Marking of Relationships: Agreement

• From Catullus, First Book, first verse (Latin):
Cui dono lepidum novum libellum arida modo pumice expolitum ?
Whom I-present lovely new little-book dry manner pumice polished ?
(To whom do I present this lovely new little book, now polished with a dry pumice?)
• Gender (and case) agreement links adjectives to nouns

Marking of Relationships to Verb: Case

• German:
Die Frau gibt dem Mann den Apfel
The woman gives the man the apple
subject, indirect object, object
• Case inflection indicates the role of noun phrases

Writing Words Together

• The definition of word boundaries is purely an artifact of the writing system
• Differences between languages
– agglutinative compounding: Informatikseminar vs. computer science seminar
– function word vs. affix
• Border cases
– Joe's: one token or two?
– morphology of affixes often depends on phonetics / spelling conventions:
dog+s → dogs vs. pony → ponies
... but note the English function word a: a donkey vs.
an aardvark

Changing Part-of-Speech

• Derivational morphology allows changing the part of speech of words
• Example:
– base: nation, noun → national, adjective → nationally, adverb → nationalist, noun → nationalism, noun → nationalize, verb
• Sometimes the distinctions between parts of speech are quite fluid (enabled by morphology)
– I want to integrate morphology
– I want the integration of morphology

Meaning-Altering Affixes

• English: undo, redo, hypergraph
• German: zer- implies the action causes destruction
Er zerredet das Thema → He talks the topic to death
• Spanish: -ito means the object is small
burro → burrito

Adding Subtle Meaning

• Morphology allows adding subtle meaning
– verb tenses: when the action occurs, whether it is still ongoing, etc.
– count (singular, plural): how many instances of an object are involved
– definiteness (the cat vs. a cat): relation to previously mentioned objects
– grammatical gender: helps with co-reference and other disambiguation
• Sometimes redundant: the same information is repeated many times

how does morphology impact machine translation?
Unknown Words

• Ratio of unknown words in the WMT 2013 test set:

Source language       Ratio unknown
Russian               2.0%
Czech                 1.5%
German                1.2%
French                0.5%
English (to French)   0.5%

• Caveats:
– corpus sizes differ
– not clear which unknown words have known morphological variants

Differently Encoded Information

• Languages with different sentence structure:

das    behaupten   sie    wenigstens
this   claim       they   at least
the                she

• Convert from an inflected language into a configurational language (and vice versa)
• Ambiguities can be resolved through syntactic analysis
– the meaning the of das is not possible (not a noun phrase)
– the meaning she of sie is not possible (subject-verb agreement)

Non-Local Information

• Pronominal anaphora:
I saw the movie and it is good.
• How to translate it into German (or French)?
– it refers to movie
– movie translates to Film
– Film has masculine gender
– ergo: it must be translated as the masculine pronoun er
• We are not handling pronouns very well

Complex Semantic Inference

• Example:
Whenever I visit my uncle and his daughters, I can't decide who is my favorite cousin.
• How to translate cousin into German? Male or female?

morphological pre-processing schemes

German

• German sentence with morphological analysis:

Er   wohnt          in   einem     großen     Haus
Er   wohnen -en+t   in   ein +em   groß +en   Haus +
He   lives          in   a         big        house

• Four inflected words in German, but English...
– also inflected: both the English verb live and the German verb wohnen are inflected for tense, person, count
– not inflected: the corresponding English words (a and big) are not inflected
→ easier to translate if the inflection is stripped
– less inflected: the English word house is inflected for count, the German word Haus for count and case
→ reduce the morphology to a singular/plural indicator
• Reduce German morphology to match English:
Er wohnen+3P-SGL in ein groß Haus+SGL

Turkish

• Example
– Turkish: Sonuçlarına1 dayanılarak2 bir3 ortaklığı4 oluşturulacaktır5.
– English: a3 partnership4 will be drawn-up5 on the basis2 of conclusions1.
• Turkish morphology → English function words (will, be, on, the, of)
• Morphological analysis:
Sonuç +lar +sh +na daya +hnhl +yarak bir ortaklık +sh oluş +dhr +hl +yacak +dhr
• Alignment with morphemes:

sonuç +lar +sh +na daya +hnhl +yarak bir ortaklık +sh oluş +dhr +hl +yacak +dhr
conclusion +s of the basis on a partnership draw up +ed will be

⇒ Split Turkish into morphemes, drop some

Arabic

• Basic structure of Arabic morphology:
[CONJ+ [PART+ [al+ BASE +PRON]]]
• Examples of clitics (prefixes or suffixes)
– definite determiner al+ (English the)
– pronominal morpheme +hm (English their/them)
– particle l+ (English to/for)
– conjunctive pro-clitic w+ (English and)
• Same basic strategies as for German and Turkish
– morphemes akin to English words → separated out as tokens
– properties (e.g., tense) also expressed in English → kept attached to the word
– morphemes without an English equivalent → dropped

Arabic Preprocessing Schemes

ST Simple tokenization (punctuation, numbers, removal of diacritics)
wsynhY Alr}ys jwlth bzyArp AlY trkyA .
D1 Decliticization: split off conjunction clitics
w+ synhy Alr}ys jwlth bzyArp
Morphological analysis with part-of-speech tags:
nhY/VBP +S3MS Al+ r}ys/NN jwlp/NN +P3MS b+ zyArp/NN

[Figure: recurrent neural language model over "the house is big .": each input word embedding feeds an RNN state, and a softmax over each state predicts the next output word]

• Several layers: use a weighted sum of the representations at different layers
– syntactic information is better represented in early layers
– semantic information is better represented in deeper layers

BERT

• Contextualized word embeddings with a Transformer model
• Masked training:
The quick brown fox jumps over the lazy dog.
⇑
The quick MASK fox MASK over the lazy dog.
• Next sentence prediction:
Each unhappy family is unhappy in its own way.
⇑
All happy families are alike.

GPT-3 (2020)

• Essentially BERT, but bigger
• Model: Transformer
– 175 billion parameters
– 96 layers
– 12288-dimensional representations
– 96 attention heads
• Training
– trained on a data set of about 500 billion words, for less than 1 epoch
– 3640 petaflop/s-days on NVIDIA V100s (each can do 0.1 petaflops)
• There currently seems to be no plateau: bigger is better

multi-lingual word embeddings

Multi-Lingual Word Embeddings

• Word embeddings are often viewed as semantic representations of words
• Tempting to view embedding spaces as language-independent:
cat (English), gato (Spanish), and Katze (German) are mapped to the same vector
• A common semantic space for words in all languages?
Language-Specific Word Embeddings

[Figure: two separate embedding spaces: Spanish (caballo, vaca, cerdo, perro, gato) and English (horse, cow, pig, dog, cat)]

• Train English word embeddings C_E and Spanish word embeddings C_S

Mapping Word Embedding Spaces

[Figure: the same two spaces, with each Spanish word mapped onto its English translation]

• Learn a mapping matrix W_{S→E} that minimizes the Euclidean distance between each word and its translation:

cost = Σ_i || W_{S→E} c_i^S − c_i^E ||

• Needed: a seed lexicon of word translations (may be based on cognates)
• Hubness problem: some words are the nearest neighbor of many words

Using only Monolingual Data

[Figure: dog, cat, lion in the English space; Hund, Katze, Löwe in the German space]

• Learn the transformation matrix W_{S→E} without a seed lexicon?
• Intuition: the relationship between dog, cat, and lion holds independently of language
• How can we rotate the triangle to match up?
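Before turning to the unsupervised case, the seed-lexicon mapping objective above has a simple least-squares solution; a minimal NumPy sketch, with made-up toy vectors standing in for trained Spanish and English embeddings (the slide writes W_{S→E} c_i^S with column vectors; the code uses the equivalent row-vector convention):

```python
import numpy as np

# Toy stand-ins for trained embeddings; rows are seed-lexicon pairs,
# so row i of C_S (Spanish) translates to row i of C_E (English).
C_S = np.array([[1.0, 0.0],    # caballo
                [0.0, 1.0],    # vaca
                [1.0, 1.0]])   # cerdo
C_E = np.array([[0.0, 1.0],    # horse
                [-1.0, 0.0],   # cow
                [-1.0, 1.0]])  # pig

# Solve min_W sum_i ||c_i^S W - c_i^E||^2 by least squares,
# i.e. the summed-distance objective from the slide.
W, *_ = np.linalg.lstsq(C_S, C_E, rcond=None)

# Map any Spanish vector into the English space.
mapped = C_S @ W
```

In practice C_S and C_E come from monolingual word2vec-style training, and the mapping is often constrained to be orthogonal (a Procrustes solution) so that distances in the source space are preserved.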
Using only Monolingual Data

[Figure: the German triangle (Hund, Katze, Löwe) rotated to line up with the English triangle (dog, cat, lion)]

• One idea: learn a transformation matrix W_{German→English} so that the words match up

Adversarial Training

• Another idea: adversarial training
– if points in the German and English spaces do not match up, an adversary can classify them as either German or English
• Training objective of the adversary: learn a classifier P

cost_D(P|W) = −(1/n) Σ_{i=1}^{n} log P(German | W g_i) − (1/m) Σ_{j=1}^{m} log P(English | e_j)

• Training objective of the unsupervised learner:

cost_L(W|P) = −(1/n) Σ_{i=1}^{n} log P(English | W g_i) − (1/m) Σ_{j=1}^{m} log P(German | e_j)

large vocabularies

Large Vocabularies

• Zipf's law tells us that words in a language are very unevenly distributed
– long tail of rare words (e.g., new words: retweeting, website, woke, lit)
– large inventory of names, e.g., eBay, Yahoo, Microsoft
• Neural methods are not well equipped to deal with such large vocabularies
(their natural representations are continuous-space vectors → word embeddings)
• A large vocabulary means
– large embedding matrices for input and output words
– prediction and softmax over a large number of words
• Computationally expensive, both in memory and speed

Special Treatment for Rare Words

• Limit the vocabulary to 20,000 to 80,000 words
• First idea:
– map other words to an unknown-word token (UNK)
– the model learns to map input UNK to output UNK
– replace it with a translation from a backup dictionary
• Not used anymore, except for numbers and units
– numbers: English 540,000, Chinese 54 TEN-THOUSAND, Indian 5.4 lakh
– units: map 25cm to 10 inches

Some Causes
for Large Vocabularies

• Morphology: tweet, tweets, tweeted, tweeting, retweet, ... → morphological analysis?
• Compounding: homework, website, ... → compound splitting?
• Names: Netanyahu, Jones, Macron, Hoboken, ... → transliteration?
⇒ Breaking up words into subwords may be a good idea

Byte Pair Encoding

• Start by breaking up words into characters:
t h e   f a t   c a t   i s   i n   t h e   t h i n   b a g
• Merge frequent pairs:
t h → th:    th e   f a t   c a t   i s   i n   th e   th i n   b a g
a t → at:    th e   f at   c at   i s   i n   th e   th i n   b a g
i n → in:    th e   f at   c at   i s   in   th e   th in   b a g
th e → the:  the   f at   c at   i s   in   the   th in   b a g
• Each merge operation increases the vocabulary size
– starting from the size of the character set (maybe 100 for Latin script)
– stopping after, say, 50,000 operations

Byte Pair Encoding

Obama receives Net@@ any@@ ahu
the relationship between Obama and Net@@ any@@ ahu is not exactly friendly .
the two wanted to talk about the implementation of the international agreement and about Teheran 's destabil@@ ising activities in the Middle East .
the meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution .
relations between Obama and Net@@ any@@ ahu have been stra@@ ined for years .
Washington critic@@ ises the continuous building of settlements in Israel and acc@@ uses Net@@ any@@ ahu of a lack of initiative in the peace process .
the relationship between the two has further deteriorated because of the deal that Obama negotiated on Iran 's atomic programme .
in March , at the invitation of the Republic@@ ans , Net@@ any@@ ahu made a controversial speech to the US Congress , which was partly seen as an aff@@ ront to Obama .
the speech had not been agreed with Obama , who had rejected a meeting with reference to the election that was at that time im@@ pending in Israel .
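The merge loop above can be sketched in a few lines of Python; this is a simplified version of the algorithm (real implementations such as subword-nmt work on a word-frequency table and add end-of-word markers, which this sketch omits):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a whitespace-tokenized corpus."""
    # Represent each word as a tuple of symbols (initially characters).
    words = [tuple(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere.
        merged = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged.append(tuple(out))
        words = merged
    return merges, words

merges, segmented = learn_bpe("the fat cat is in the thin bag", 4)
```

On the slide's example, the first learned merge is t h (the most frequent pair, occurring three times); tie-breaking among equally frequent pairs is arbitrary, which is why the later merge order may differ from the slide.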
Subwords

• Byte pair encoding induces subwords
• But: only accidentally along linguistic concepts of morphology
– morphological: critic@@ ises, im@@ pending
– not morphological: aff@@ ront, Net@@ any@@ ahu
• Still: similar to unsupervised morphology (frequent suffixes, etc.)

Sentence Piece

Obama receives Net any ahu
the relationship between Obama and Net any ahu is not exactly friendly .
the two wanted to talk about the implementation of the international agreement and about Teheran 's destabil ising activities in the Middle East .
the meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution .
relations between Obama and Net any ahu have been stra ined for years .
Washington critic ises the continuous building of settlements in Israel and acc uses Net any ahu of a lack of initiative in the peace process .
the relationship between the two has further deteriorated because of the deal that Obama negotiated on Iran 's atomic programme .
in March , at the invitation of the Republic ans , Net any ahu made a controversial speech to the US Congress , which was partly seen as an aff ront to Obama .
the speech had not been agreed with Obama , who had rejected a meeting with reference to the election that was at that time im pending in Israel .
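In the byte pair encoding output shown earlier, "@@ " marks a token that continues into the next one, so detokenization after translation only has to undo these joins; a one-line sketch of that postprocessing step:

```python
import re

def undo_bpe(text):
    """Rejoin subwords marked with the '@@ ' continuation convention
    (assumes one sentence per string, so '@@' is always followed by a space)."""
    return re.sub(r"@@ ", "", text)

restored = undo_bpe("Obama receives Net@@ any@@ ahu")
```

The sentence-piece output above needs the reverse bookkeeping: since no continuation marker is kept, the segmenter itself must record where the original word boundaries were (SentencePiece marks them with a special meta symbol, ▁).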
character-based models

Character-Based Models

• Explicit word models that yield word embeddings
• Standard method for frequent words:
– distribution of beautiful in the data → embedding for beautiful
• Character-based models:
– create a sequence embedding for the character string b e a u t i f u l
– training objective: match the word embedding for beautiful
• Induce embeddings for unseen morphological variants:
– character string b e a u t i f u l l y → embedding for beautifully
• Hope: this learns morphological principles

Character Sequence Models

• Same model as for words
• Tokens = single characters, incl. a special space symbol
• But: generally poor performance
• With some refinements, use in the output has been shown to be competitive

Character-Based Word Models

• Word embeddings as before
• Compute the word embeddings from the character sequence
• Typically interpolated with traditional word embeddings

Recurrent Neural Networks

[Figure: a left-to-right and a right-to-left RNN run over the character (or character-trigram) embeddings of "w o r d s"; their final states are concatenated into a word embedding]

Convolutional Neural Networks

[Figure: convolutions over the character (or character-trigram) embeddings of "w o r d s", followed by max pooling and a feed-forward layer that produces the word embedding]

• Convolutions of different sizes: 2 characters, 3 characters, ..., 7 characters
• May be based on letter
n-grams (trigrams shown)
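The character-trigram inputs in the two figures above can be illustrated with a small sketch: a fastText-style bag-of-trigrams word embedding (a simplification of the RNN and CNN models shown; the table size, dimensionality, and hashing scheme are made up for illustration):

```python
import numpy as np

def char_trigrams(word):
    """Extract character trigrams, with < and > as word-boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_embedding(word, dim=8, buckets=1000):
    """Sum the embeddings of a word's trigrams. Trigrams are hashed into
    a fixed table; the seeded random table is a toy stand-in for
    learned parameters."""
    rng = np.random.default_rng(0)
    table = rng.standard_normal((buckets, dim))
    idxs = [hash(t) % buckets for t in char_trigrams(word)]
    return table[idxs].sum(axis=0)
```

Because beautiful and beautifully share most of their trigrams, their embeddings share most of their summands, which is the mechanism by which unseen morphological variants receive sensible embeddings.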