Adaptation
Philipp Koehn
Machine Translation, 24 October 2023

Adaptation
• Better quality when the system is adapted to a task
• Domain adaptation: adaptation to a specific domain, e.g., information technology
• Some training data is more relevant than other data
• May also adapt to a specific user (personalization)
• May optimize for a specific document or sentence

Domains

Domain
• Definition: a collection of text with similar topic, style, level of formality, etc.
• Practically: a corpus that comes from a specific source

Example
• Available parallel corpora on the OPUS web site (Italian–English)

Differences in Corpora
• Medical: Abilify is a medicine containing the active substance aripiprazole. It is available as 5 mg, 10 mg, 15 mg and 30 mg tablets, as 10 mg, 15 mg and 30 mg orodispersible tablets (tablets that dissolve in the mouth), as an oral solution (1 mg/ml) and as a solution for injection (7.5 mg/ml).
• Software localization: Default GNOME Theme / OK / People
• Literature: There was a slight noise behind her and she turned just in time to seize a small boy by the slack of his roundabout and arrest his flight.
• Law: Corrigendum to the Interim Agreement with a view to an Economic Partnership Agreement between the European Community and its Member States, of the one part, and the Central Africa Party, of the other part.
• Religion: This is The Book free of doubt and involution, a guidance for those who preserve themselves from evil and follow the straight path.
• News: The Facebook page of a leading Iranian cartoonist, Mana Nayestani, was hacked on Tuesday, 11 September 2012, by pro-regime hackers who call themselves "Soldiers of Islam".
• Movie subtitles: We're taking you to Washington, D.C.
Do you know where the prisoner was transported to? Uh, Washington. Okay.
• Twitter: Thank u @Starbucks & @Spotify for celebrating artists who #GiveGood with a donation to @BTWFoundation, and to great organizations by @Metallica and @ChanceTheRapper! Limited edition cards available now at Starbucks!

Dimensions
• Topic: the subject matter of the text, such as politics or sports.
• Modality: how was the text originally created? Is it written text or transcribed speech, and if speech, is it a formal presentation or an informal dialogue full of incomplete and ungrammatical sentences?
• Register: the level of politeness. In some languages this is very explicit, such as the use of the informal Du or the formal Sie for the personal pronoun you in German.
• Intent: is the text a statement of fact, an attempt to persuade, or communication between multiple parties?
• Style: is it a terse informal text, or is it full of emotional and flowery language?

Dimensions
• In reality, there is no clear information about these dimensions
• For example, Wikipedia
– spans a whole range of topics
– is fairly consistent in modality and style
• Practical goal: enforce a certain level of politeness
• Probably:
– European Parliament proceedings are more polite
– movie subtitles are less polite

Impact of Domain
• Different word meanings
– bat in baseball
– bat in a wildlife report
• Different style
– What's up, dude?
– Good morning, sir.
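Vocabulary differences like these are what make automatic domain detection feasible in the first place. Below is a minimal sketch of a unigram-language-model domain classifier; the two tiny corpora and all names are hypothetical stand-ins for real domain data:

```python
import math
from collections import Counter

# Toy domain corpora (hypothetical, one short text per domain)
corpora = {
    "medical": "the tablets contain 5 mg of the active substance",
    "subtitles": "we are taking you to washington okay",
}

# Unigram counts per domain
counts = {d: Counter(text.split()) for d, text in corpora.items()}

def classify(sentence):
    """Pick the domain whose add-one-smoothed unigram model assigns
    the sentence the highest log-probability."""
    best, best_score = None, -math.inf
    for domain, c in counts.items():
        total, vocab = sum(c.values()), len(c)
        score = sum(math.log((c[w] + 1) / (total + vocab))
                    for w in sentence.split())
        if score > best_score:
            best, best_score = domain, score
    return best

result = classify("the active substance in the tablets")  # "medical"
```

With realistic data the same idea scales up to n-gram or neural classifiers, but the principle is unchanged: domains differ most visibly in their word distributions.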
Diverse Problem
• Data may differ narrowly or drastically
• Amounts of relevant and less relevant data differ
• Data may be split by domain or mixed
• Data may differ in quality
• Each corpus may be relatively homogeneous or heterogeneous
• May need to adapt on the fly
⇒ Different methods may apply; experimentation needed

Multiple Domain Scenario
• Multiple collections of data, clearly identified, e.g., sports, information technology, finance, law, ...
• Train a specialized model for each domain
• Route test sentences to the appropriate model (using a classifier, if the domain is not known)
• Probabilistic assignment

In/Out-Domain Scenario
• Optimize the system for just one domain
• Available data:
– small amounts of in-domain data
– large amounts of out-of-domain data
• Need to balance both data sources

Why Use Out-of-Domain Data?
• In-domain data is much more valuable
• But: gaps
– a word to be translated may not occur
– a word to be translated may not occur with the correct translation
• Motivation
– out-of-domain data may fill these gaps
– but be careful not to drown out the in-domain data

S4 Taxonomy of Adaptation Effects [Carpuat, Daume, Fraser, Quirk, 2012]
• Seen: never seen this word before (news to medical: diabetes mellitus)
• Sense: never seen this word used in this way (news to technical: monitor)
• Score: the wrong output is scored higher (news to medical: manifest)
• Search: decoding/search erred

Adaptation Effects
• German source: Verfahren und Anlage zur Durchführung einer exothermen Gasphasenreaktion an einem heterogenen partikelförmigen Katalysator
• Human reference translation: Method and system for carrying out an exothermic gas phase reaction on a heterogeneous particulate catalyst
• General model translation: Procedures and equipment for the implementation of an exothermen gas response response to a heterogeneous particle catalytic converter
• In-domain (chemistry patents) model translation: Method and system for carrying out an exothermic gas phase reaction on a heterogeneous particulate catalyst
• Stylistic differences, e.g., method, system vs. procedures, equipment
• Word sense, e.g., catalyst vs. catalytic converter
• Better language coverage, e.g., exothermic gas phase reaction vs.
exothermen gas response response

Mixture Models

Combine Data
• Concatenate in-domain and out-of-domain data into one combined-domain model
• Too biased towards the out-of-domain data
• May flag translation options with indicator feature functions

Interpolate Data
• Oversample the in-domain data when building the combined-domain model from out-of-domain and in-domain data

Interpolate Models
• Interpolate an in-domain model with an out-of-domain model

Domain-Aware Training
• Train a model on all domains
• Indicate the domain for each input sentence
• Domain token
– append a domain token to each input sentence
– label the training data
– label the test data
• Neural machine translation models
– the domain token will have a word embedding
– the attention model will rely on the domain token as needed

Unknown Domain at Test Time
• Domain of the input sentence is unknown
• Classifier: predict the domain of the input sentence
– predict the domain token
– augment the input sentence with it
• Probability distribution over domains
– sentences may not fall neatly into one of our pre-defined domains
– e.g., a rule violation in sports → SPORTS, LAW
– encode the soft domain assignment in a vector
– may also be used to label the training data

Fine-Grained Domains: Personalization
• Thousands of domains
– machine translation system personalized for individual translators
– machine translation system optimized for authors/speakers
• The domain token/classification idea does not scale well
• Not much data for each domain

Fine-Grained Domains: Personalization
• Only influence the word prediction layer
• Recall: the output word distribution t_i is a softmax given
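The domain-token idea from domain-aware training amounts to a one-line preprocessing step. A minimal sketch; the tag format `<SPORTS>` is an illustrative choice, not a fixed convention:

```python
def add_domain_token(source_sentence, domain):
    """Prepend a pseudo-token marking the domain (e.g. <IT>, <LAW>).
    The NMT model learns an embedding for this token like any word,
    and attention can condition on it as needed."""
    return f"<{domain.upper()}> {source_sentence}"

# Label training data (and, at test time, classifier-predicted domains)
tagged = add_domain_token("rule violation by the defender", "sports")
# "<SPORTS> rule violation by the defender"
```

The same function is applied to test input, with the token supplied either from known metadata or from a domain classifier.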
– the previous hidden state s_{i−1}
– the previous output word embedding E y_{i−1}
– the input context c_i

t_i = softmax( W (U s_{i−1} + V E y_{i−1} + C c_i) + b )

• More generally, prediction given some conditioning vector z_i:

t_i = softmax( W z_i + b )

• Add an additional bias term β_p specific to a person p:

t_i = softmax( W z_i + b + β_p )

Topic Models
• Cluster the corpus by topic, e.g., with Latent Dirichlet Allocation (LDA)
• Train separate sub-models for each topic
• For an input sentence, detect the topic (or topic distribution)

Sentence Selection
• Select out-of-domain sentence pairs that are similar to the in-domain data

Sentence Selection
• Various methods
• Goal 1: increase coverage (fill gaps)
• Goal 2: get data with in-domain content, style, etc.

Moore-Lewis
• Build language models
– on the out-of-domain data
– on the in-domain data
• Score each sentence with both
• Sub-select sentence pairs with p_IN(f) − p_OUT(f) > τ

Modified Moore-Lewis
• Two sets of language models
– source language (in-domain and out-of-domain)
– target language (in-domain and out-of-domain)
• Add the scores

Coverage-Based Methods
• Problem with subsampling sentences based on similarity: not much new is added
• Original goal: increase coverage with out-of-domain data → coverage-based selection

Basic Approach
• Score each candidate sentence pair s_i based on a word-based score:

(1 / |s_i|) Σ_{w ∈ s_i} score(w, s_{1,..,i−1})

• Simple word score: check if word w occurred in the
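The Moore-Lewis criterion can be sketched with add-one-smoothed unigram language models standing in for real in-domain and out-of-domain LMs; the corpora, candidate sentences, and threshold τ below are all toy assumptions:

```python
import math
from collections import Counter

def unigram_lm(text):
    """Return an add-one-smoothed unigram log-probability function."""
    c = Counter(text.split())
    total, vocab = sum(c.values()), len(c)
    return lambda w: math.log((c[w] + 1) / (total + vocab))

lm_in = unigram_lm("patients take the tablets daily")       # in-domain LM
lm_out = unigram_lm("the match ended in a draw yesterday")  # out-of-domain LM

def moore_lewis(sentence):
    """Length-normalized log p_IN(f) - log p_OUT(f)."""
    words = sentence.split()
    return sum(lm_in(w) - lm_out(w) for w in words) / len(words)

candidates = [
    "the tablets dissolve in the mouth",
    "the referee stopped the match yesterday",
]
tau = 0.2
selected = [s for s in candidates if moore_lewis(s) > tau]
# keeps only the medical-looking sentence
```

Modified Moore-Lewis does the same on both source and target sides and adds the two scores.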
previously added sentences s_1, ..., s_{i−1}:

score(w, s_{1,..,i−1}) = 0 if w ∈ s_1, ..., s_{i−1}; 1 otherwise

• Add the sentence with the highest score

Scoring N-Grams
• Compute coverage of n-grams, not just words:

(1 / (|s_i| × N)) Σ_{n=0}^{N−1} Σ_{w_{j,..,j+n} ∈ s_i} score(w_{j,..,j+n}, s_{1,..,i−1})

Feature Decay
• Not hard 0/1 scoring
• Decaying function based on frequency:

score(w, s_{1,..,i−1}) = frequency(w, s_{1,..,i−1}) e^{−λ frequency(w, s_{1,..,i−1})}

• May also consider the frequency of n-grams in the raw corpus (to avoid overfitting to rare n-grams)

Instance Weighting
• So far: either include a sentence pair or not
• Now: weight each sentence pair based on its relevance
• Use the same scoring metrics as previously used for filtering
• Scale the learning rate by the relevance score

Fine-Tuning

Fine-Tuning
• First train the system on out-of-domain data (or: all available data)
• Stop at convergence
• Then continue training on the in-domain data

Catastrophic Forgetting
• Fine-tuning may overfit to the in-domain data (catastrophic forgetting)
• Two goals
– do well on in-domain data
– maintain quality on out-of-domain data
• Meeting both makes the model more robust on in-domain data as well

Updating Only Some Model Parameters
• Too many parameters, too little in-domain data
• Update only some parameters
– weights for decoder state progression
– output word prediction softmax
– output word embeddings

Low-Rank Adaptation (LoRA)
• Generic method to use fewer parameters during adaptation
• Augment each parameter matrix with two smaller matrices:
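Coverage-based selection can be sketched as a greedy loop over candidate sentences. The decay function used here, exp(−λ · count), is a simplified illustrative variant of the feature-decay idea: unseen words score 1, repeated words decay, and it reduces to the hard 0/1 score as λ → ∞:

```python
import math
from collections import Counter

def sentence_score(sentence, selected_counts, lam=1.0):
    """Average per-word coverage score. Each word scores exp(-lam * n),
    where n is how often it already occurs in the selected pool --
    a decaying variant of the hard 0/1 score."""
    words = sentence.split()
    return sum(math.exp(-lam * selected_counts[w]) for w in words) / len(words)

def greedy_select(candidates, k, lam=1.0):
    """Repeatedly add the candidate that covers the most new words."""
    selected, seen = [], Counter()
    pool = list(candidates)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: sentence_score(s, seen, lam))
        pool.remove(best)
        selected.append(best)
        seen.update(best.split())
    return selected

picked = greedy_select(["a b a b", "c d e f", "a b c d"], k=2)
# the second pick avoids the words covered by the first
```

For instance weighting, the same score would instead scale the learning rate of each sentence pair rather than filter it.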
– original matrix: m × m
– adaptation matrices: m × r and r × m, with r ≪ m
– the original matrix is unchanged

Document-Level Adaptation
• Computer-aided translation: a translator post-edits machine translation output (input → draft → corrected translation → adapt)
• This provides additional training data (translated sentences)
• Incrementally update the model

Sentence-Level Adaptation
• Adapt the model to each sentence to be translated
• Find the most similar sentence in the parallel corpus (fuzzy match)
• Retrieve it and its translation
• Adapt the model with this sentence pair

Curriculum Training
• Recall: a relevance score for each sentence pair
• Training epochs
– start with all data (100%)
– train only on somewhat relevant data (50%)
– train only on relevant data (25%)
– train only on very relevant data (10%)

Beyond Parallel Corpora
Philipp Koehn
Machine Translation, 26 October 2023

Data and Machine Learning

Supervised and Unsupervised
• We framed machine translation as a supervised machine learning task
– training examples with labels
– here: input sentences with their translations
– structured prediction: the output has to be constructed in several steps
• Unsupervised learning
– training examples without labels
– here: just sentences in the input language
– we will also look at using just sentences in the output language
• Semi-supervised learning
– some labeled training data
– some unlabeled training data (usually more)
• Self-training
– make predictions on unlabeled training data
– use the predicted labels as supervised training data

Transfer Learning
• Learning from data similar to our task
• Other language pairs
– first, train a model on a different language pair
– then, train on the targeted language pair
– or: train jointly on both
• Multi-task training
– train on a related task first
– e.g., part-of-speech tagging
• Share some or all of the model components

Using Monolingual Data

Using Monolingual Data
• Language model
– trained on large amounts of target-language data
– better fluency of the output
• Key to the success of statistical machine translation
• Neural machine translation
– integrate a neural language model into the model
– create artificial data with back-translation

Back-Translation
• Monolingual data is parallel data that is missing its other half
• Let's synthesize that half (reverse system → final system)

Back-Translation
• Steps
1. train a system in the reverse translation direction
2. use this system to translate target-side monolingual data → synthetic parallel corpus
3. combine the generated synthetic parallel data with the real parallel data to build the final system
• Use roughly equal amounts of synthetic and real data
• Useful method for domain adaptation

Iterative Back-Translation
• The quality of the back-translation system matters
• Build a better back-translation system ...
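The three back-translation steps can be sketched as a small pipeline; `toy_reverse_translate` is a hypothetical stand-in for a trained target-to-source system:

```python
def back_translation_corpus(target_monolingual, reverse_translate):
    """Step 2: translate target-side monolingual text with the
    reverse (target -> source) system to obtain synthetic source sides."""
    return [(reverse_translate(t), t) for t in target_monolingual]

# Hypothetical stand-in for a trained target->source system (step 1)
def toy_reverse_translate(sentence):
    return "<synthetic> " + sentence

real_parallel = [("ein Haus", "a house")]
mono_target = ["a garden", "a tree"]

synthetic = back_translation_corpus(mono_target, toy_reverse_translate)

# Step 3: mix synthetic and real data (roughly equal amounts)
training_data = real_parallel + synthetic
```

Iterating the process simply retrains the reverse system on the improved forward system's output and regenerates the synthetic corpus.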
with back-translation (back system 1 → back system 2 → final system)

Iterative Back-Translation
• Example German–English:

                        Back    Final
no back-translation       –     29.6
*10k iterations         10.6    29.6 (+0.0)
*100k iterations        21.0    31.1 (+1.5)
convergence             23.7    32.5 (+2.9)
re-back-translation     27.9    33.6 (+4.0)
(* = limited training of the back-translation system)

Variants
• Copy target
– if there is no good neural machine translation system to start with
– just copy the target-language text to the source side
• Forward translation
– synthesize training data in the same direction as training
– self-training (inferior, but sometimes successful)

Round Trip Training
• We could iterate through the steps of
– training a system
– creating a synthetic corpus
• Dual learning: train models in both directions together
– translation models F → E and E → F
– take a sentence f
– translate it into a sentence e′
– translate that back into a sentence f′
– training objective: f should match f′
• This setup could be fooled by just copying (e′ = f)
⇒ score e′ with a language model for language E and add the language model score as a cost to the training objective

Round Trip Training
(Diagram: translation models MT F→E and MT E→F, with language models LM E and LM F scoring the intermediate outputs)

Monolingual Pre-Training
• Initial training of the neural machine translation model on monolingual data
• Replace some input word sequences with a mask token (30% of the words)
• Train a model MASKED → TEXT on both source and target text
• Reorder sentences (each training example has 3 sentences)
• Example (corrupted input above, original text below):
Advanced NLP techniques master class " how " 3rd : 18 Results 40 of 729
⇓
3rd grade : 18 Advanced NLP techniques master class " how to with clients " Results 1 – 40 of 729
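The masking step of monolingual pre-training (replace roughly 30% of input words with a mask token, then train MASKED → TEXT) can be sketched as follows; the token name `<MASK>` and word-level (rather than sequence-level) masking are illustrative simplifications:

```python
import random

def mask_words(sentence, mask_rate=0.3, mask_token="<MASK>", seed=0):
    """Create one MASKED -> TEXT training pair by replacing roughly
    mask_rate of the words with a mask token. The original sentence
    is kept as the reconstruction target."""
    rng = random.Random(seed)
    words = sentence.split()
    masked = [mask_token if rng.random() < mask_rate else w for w in words]
    return " ".join(masked), sentence

masked_input, target = mask_words(
    "the quick brown fox jumps over the lazy dog")
```

The model is then trained to reconstruct `target` from `masked_input`, on both source-language and target-language text.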