Machine Learning Tricks
Philipp Koehn
13 October 2020

Machine Learning

• Myth of machine learning
  – given: real world examples
  – automatically build model
  – make predictions
• Promise of deep learning
  – do not worry about specific properties of the problem
  – deep learning automatically discovers the features
• Reality: bag of tricks

Today’s Agenda

• No new translation model
• Discussion of failures in machine learning
• Various tricks to address them

Fair Warning

• At some point, you will think: why are you telling us all this madness?
• Because pretty much all of it is commonly used

failures in machine learning

Failures in Machine Learning

(figure: error(λ) curve over a parameter λ)

• A too high learning rate may lead to too drastic parameter updates
  → overshooting the optimum

Failures in Machine Learning

(figure: error(λ) curve over a parameter λ)

• Bad initialization may require many updates to escape a plateau

Failures in Machine Learning

(figure: error(λ) curve with a local optimum and a global optimum)

• Local optima trap training

Learning Rate

• Gradient computation gives the direction of change
• Scaled by a learning rate
• Weight updates
• Simplest form: fixed value
• Annealing
  – start with a larger value (big changes at the beginning)
  – reduce over time (minor adjustments to refine the model)

Initialization of Weights

• Initialize weights to random values
• But: the range of possible values matters

(figure: error(λ) curve over a parameter λ)

Sigmoid Activation Function

(figure: sigmoid curve and its derivative)

• Derivative of the sigmoid: near zero for large positive and negative values

Rectified Linear Unit

(figure: ReLU and its derivative)

• Derivative of ReLU: flat over a large interval, gradient is 0
• "Dead cells": elements in the output that are always 0, no matter the input

Local Optima

• Cartoon depiction

(figure: error(λ) curve with a local optimum and a global optimum)

• Reality
  – highly dimensional space
  – complex interaction between individual parameter changes
  – "bumpy"

Vanishing and Exploding Gradients

(figure: a chain of RNN cells unrolled over time)

• Repeated multiplication with the same values
• If gradients are too low → 0
• If gradients are too big → ∞

Overfitting and Underfitting

(figure: under-fitting, good fit, over-fitting)

• The complexity of the problem has to match the capacity of the model
• Capacity = number of trainable parameters

ensuring randomness

Ensuring Randomness

• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
  – avoid undue structure in the training data
  – avoid undue structure in the initial weight setting
• ML approach: maximum entropy training
  – fit properties of the training data
  – otherwise, the model should be as random as possible (i.e., have maximum entropy)

Shuffling the Training Data

• Typical training data in machine translation
  – different types of corpora
    ∗ European Parliament Proceedings
    ∗ collection of movie subtitles
  – temporal structure in each corpus
  – similar sentences next to each other (e.g., same story / debate)
• Online updating: the last examples matter more
• Convergence criterion: no improvement recently
  → a stretch of hard examples following easy examples leads to prematurely stopped training
⇒ randomly shuffle the training data (maybe each epoch)
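A minimal sketch of per-epoch shuffling, assuming the training data is held as a list of (source, target) sentence pairs; the names train_epochs and train_step are illustrative placeholders, not part of any particular toolkit:

```python
import random

def train_epochs(parallel_data, num_epochs, train_step):
    """Run several epochs, reshuffling the sentence pairs before each one so that
    corpus structure (same debate, same movie) does not cluster together."""
    data = list(parallel_data)              # work on a copy, keep the original order intact
    for epoch in range(num_epochs):
        random.shuffle(data)                # new random order every epoch
        for source, target in data:
            train_step(source, target)      # one online (or mini-batch) update

# toy usage: "training" just prints the pairs in their shuffled order
pairs = [("ein Haus", "a house"), ("ein Auto", "a car"), ("ein Buch", "a book")]
train_epochs(pairs, num_epochs=2, train_step=lambda src, tgt: print(src, "→", tgt))
```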
Weight Initialization

• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in the transition area of the activation function

For Example: Sigmoid

• Input values in range [−1; 1]
  ⇒ output values in range [0.269; 0.731]
• Magic formula (n = size of the previous layer)
    [ −1/√n , 1/√n ]
• Magic formula for hidden layers
    [ −√6/√(nj + nj+1) , √6/√(nj + nj+1) ]
  – nj is the size of the previous layer
  – nj+1 is the size of the next layer

Problem: Overconfident Models

• Predictions of neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word
  (word prediction probabilities of over 99%)
• Problem for decoding and training
  – decoding: sensible alternatives get low scores, bad for beam search
  – training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
  – in classification tasks, we predict a label
  – jargon term for any output → here, we smooth the word predictions

Label Smoothing during Decoding

• Common strategy to combat peaked distributions: smooth them
• Recall
  – the prediction layer produces a number si for each word
  – converted into probabilities using the softmax
    p(yi) = exp(si) / Σj exp(sj)
• The softmax calculation can be smoothed with a so-called temperature T
    p(yi) = exp(si/T) / Σj exp(sj/T)
• Higher temperature → smoother distribution
  (i.e., less probability is given to the most likely choice)

Label Smoothing during Training

• Root of the problem: training
• Training objective: assign all probability mass to the single correct word
• Label smoothing
  – the truth gives some probability mass to other words (say, 10% of it)
  – either uniformly distributed over all words
  – or relative to unigram word probabilities
    (relative counts of each word in the target side of the training data)
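Both smoothing ideas fit into a few lines of numpy; the scores, vocabulary size, and 10% smoothing mass below are illustrative values, not prescribed by the slides:

```python
import numpy as np

def softmax_with_temperature(scores, T=1.0):
    """p(y_i) = exp(s_i / T) / sum_j exp(s_j / T); a higher T flattens the distribution."""
    z = (scores - scores.max()) / T            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def smoothed_targets(correct_word, vocab_size, mass=0.1):
    """Spread `mass` of the probability uniformly over all words, rest goes to the correct word."""
    target = np.full(vocab_size, mass / vocab_size)
    target[correct_word] += 1.0 - mass
    return target

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax_with_temperature(scores, T=1.0))            # peaked
print(softmax_with_temperature(scores, T=2.0))            # smoother
print(smoothed_targets(correct_word=2, vocab_size=4))     # [0.025 0.025 0.925 0.025]
```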
adjusting the learning rate

Adjusting the Learning Rate

• Gradient descent training: the weight update follows the gradient downhill
• Actual gradients have fairly large values, so they are scaled with a learning rate
  (a low number, e.g., µ = 0.001)
• Change the learning rate over time
  – starting with larger updates
  – refining weights with smaller updates
  – adjust for other reasons
• Learning rate schedule

Momentum Term

• Consider the case where a weight value is far from its optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term mt
  – accumulate weight updates at each time step t
  – some decay rate for the sum (e.g., 0.9)
  – combine the momentum term mt−1 with the weight update value ∆wt
    mt = 0.9 mt−1 + ∆wt
    wt = wt−1 − µ mt

Adapting Learning Rate per Parameter

• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from the optimum → change a lot
• Later nuanced refinements are needed → change little
• Now: a different learning rate for each parameter

Adagrad

• Different parameters are at different stages of training
  → different learning rate for each parameter
• Adagrad
  – record gradients for each parameter
  – accumulate their square values over time
  – use this sum to reduce the learning rate
• Update formula
  – gradient gt = dEt/dw of error E with respect to weight w
  – divide the learning rate µ by the accumulated sum
    ∆wt = µ / √(Σ τ=1..t gτ²) × gt
• Big changes in the parameter value (corresponding to big gradients gt)
  → reduction of the learning rate for that weight parameter

Adam: Elements

• Combines the idea of a momentum term with reducing the parameter update by the accumulated change
• Momentum term idea (e.g., β1 = 0.9)
    mt = β1 mt−1 + (1 − β1) gt
• Accumulated squared gradients (decay with β2 = 0.999)
    vt = β2 vt−1 + (1 − β2) gt²

Adam: Technical Correction

• Initially, the values for mt and vt are close to their initial value of 0
• Adjustment
    m̂t = mt / (1 − β1^t),   v̂t = vt / (1 − β2^t)
• With t → ∞ this correction goes away
    lim t→∞ 1 / (1 − β^t) = 1

Adam

• Given
  – learning rate µ
  – momentum m̂t
  – accumulated change v̂t
• Weight update per Adam (e.g., ε = 10⁻⁸)
    ∆wt = µ / (√v̂t + ε) × m̂t
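The Adam formulas above translate directly into a short numpy sketch; the quadratic toy objective and the larger learning rate in the loop are illustrative choices so the example converges quickly:

```python
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of weight(s) w for gradient g at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g            # momentum term
    v = beta2 * v + (1 - beta2) * g * g        # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)               # correction for the initial bias towards 0
    v_hat = v / (1 - beta2 ** t)
    w = w - mu * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy objective E(w) = (w - 3)^2, so the gradient is dE/dw = 2 (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, mu=0.1)  # larger µ than usual for a quick toy run
print(w)                                        # close to the optimum 3.0
```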
Batched Gradient Updates

• Accumulate the weight updates for all training examples, then update
  (converges slowly)
• Process each training example, then update (stochastic gradient descent)
  (quicker convergence, but the last training examples have a disproportionately higher impact)
• Process data in batches
  – compute the gradients for the individual word prediction errors
  – use the sum over each batch to update the parameters
  → better parallelization on GPUs
• Process data on multiple compute cores
  – batch processing may take different amounts of time
  – asynchronous training: apply updates when they arrive
  – the mismatch between original weights and updates may not matter much

avoiding local optima

Avoiding Local Optima

• One of the hardest problems for designing neural network architectures and optimization methods
• Ensure that the model converges at least to a set of parameter values that gives results
  close to the optimum on unseen test data
• There is no real solution to this problem
• It requires experimentation and analysis that is more craft than science
• Still, this section presents a number of methods that generally help to avoid getting stuck
  in local optima

Overfitting and Underfitting

• Neural machine translation models
  – 100s of millions of parameters
  – 100s of millions of training examples (individual word predictions)
• No hard rules for the relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting

Regularization

• Motivation: prefer as few parameters as possible
• Strategy: set unneeded parameters to a value of 0
• Method
  – adjust the training objective
  – add a cost for any non-zero parameter
  – typically done with the L2 norm
• Practical impact
  – the derivative of the L2 norm is the value of the parameter
  – if there is no signal from training: reduce the value of the parameter
  – also called weight decay
• Not common in deep learning, but other methods can be understood as regularization

Curriculum Learning

• Human learning
  – learn simple concepts first
  – learn more complex material later
• Early epochs: only easy training examples
  – only short sentences
  – create artificial data by extracting smaller segments
    (similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate

Dropout

• Training may get stuck in local optima
  – some properties of the task have been learned
  – discovery of other properties would take it too far out of its comfort zone
• Machine translation example
  – the model has learned the language model aspects
  – but cannot figure out the role of the input sentence
• Dropout: for each batch, eliminate some nodes

Dropout

• Dropout
  – for each batch, a different random set of nodes is removed
  – their values are set to 0 and their weights are not updated
  – 10%, 20% or even 50% of all the nodes
• Why does this work?
  – robustness: redundant nodes play similar roles
  – ensemble learning: different subnetworks are different models
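A minimal numpy sketch of dropout on one layer's output; the rescaling by 1/(1 − rate) ("inverted dropout") is one common implementation choice that keeps the expected node values unchanged, not something prescribed by the slides:

```python
import numpy as np

def dropout(layer_values, rate=0.2, training=True):
    """Randomly zero out a fraction `rate` of the nodes for this batch."""
    if not training or rate == 0.0:
        return layer_values                      # no dropout at inference time
    keep = np.random.rand(*layer_values.shape) >= rate
    return layer_values * keep / (1.0 - rate)    # rescale so the expected values stay the same

h = np.random.randn(4, 8)        # e.g., a batch of 4 hidden vectors of size 8
print(dropout(h, rate=0.5))      # roughly half the entries are zeroed out
```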
Gradient Clipping

• Exploding gradients: gradients become too large during the backward pass
⇒ Limit the total value of the gradients for a layer to a threshold τ
• Use the L2 norm of the gradient vector g
    L2(g) = √(Σj gj²)
• Adjust each gradient value gi for each element i of the vector
    gi = gi × τ / max(τ, L2(g))

Layer Normalization

• During inference, average node values may become too large or too small
• This also has an impact on training (gradients are multiplied with node values)
⇒ Normalize node values
• During training, learn a bias layer

Layer Normalization: Math

• Feed-forward layer hl with weights W, computed sum sl, and activation function
    sl = W hl−1
    hl = sigmoid(sl)
• Compute the mean µl and standard deviation σl of the sum vector sl (H = layer size)
    µl = 1/H Σ i=1..H sli
    σl = √( 1/H Σ i=1..H (sli − µl)² )

Layer Normalization: Math

• Normalize sl
    ŝl = 1/σl (sl − µl)
• With learnable gain vector g and bias vector b
    ŝl = g/σl (sl − µl) + b

Shortcuts and Highways

• Deep learning: many layers of processing
⇒ Error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from the first to the last
• Jargon alert
  – shortcuts
  – residual connections
  – skip connections

Shortcuts

• Feed-forward layer
    y = f(x)
• Pass through the input x
    y = f(x) + x
• Note: the gradient is
    y′ = f′(x) + 1
• Constant 1 → the gradient is passed through unchanged

Highways

• Regulate how much information from f(x) and x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
    y = t(x) f(x) + (1 − t(x)) x
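A small numpy sketch of the two variants: a plain shortcut y = f(x) + x and a highway layer with a sigmoid gate t(x); the ReLU feed-forward layer and the weight shapes here are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shortcut_layer(x, W):
    """Residual / skip connection: y = f(x) + x."""
    return np.maximum(0.0, W @ x) + x            # f is a ReLU feed-forward layer here

def highway_layer(x, W, W_gate):
    """Highway: y = t(x) * f(x) + (1 - t(x)) * x with a sigmoid gate t(x)."""
    f = np.maximum(0.0, W @ x)
    t = sigmoid(W_gate @ x)
    return t * f + (1.0 - t) * x

x = np.random.randn(8)
W, W_gate = np.random.randn(8, 8), np.random.randn(8, 8)
print(shortcut_layer(x, W))
print(highway_layer(x, W, W_gate))
```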
Shortcuts and Highways

(figure: basic feed-forward layer vs. skip connection (add) vs. highway network (gated add))

LSTM and Vanishing Gradients

• Recall: long short-term memory (LSTM) cells
• Pass-through of the memory state
    memoryt = gateinput × inputt + gateforget × memoryt−1
• If the forget gate has values close to 1
  → the gradient is passed through nearly unchanged

generative adversarial training

Sequence-Level Training

• Traditional training
  – predict one word at a time
  – compare against the correct word
  – proceed training with the correct word
• Sequence-level training
  – predict an entire sequence
  – measure the translation with a sentence-level metric (e.g., BLEU)
• May use n-best translations, beam search, etc.

Generative Adversarial Networks (GAN)

• Game between two players
  – the generator proposes a translation
  – the discriminator distinguishes between the generator's translation and a human translation
  – the generator tries to fool the discriminator
• Training example: input sentence x and output sentence y
• Generator
  – traditional neural machine translation model
  – generates full sentence translations t for each input sentence
• Discriminator
  – is trained to classify (x, y) as a correct example
  – is trained to classify (x, t) as a generated example

Generative Adversarial Networks (GAN)

1. First train the generator to some maturity
2. Train the discriminator on generator predictions and human reference translations
3. Train jointly (see the toy sketch below)
  – the generator with the additional objective of fooling the discriminator
  – the discriminator to do well at detecting the generator's output as such
• In practice, this is hard to calibrate correctly

Relationship to Reinforcement Learning

• No immediate feedback
  – chess playing: the quality of a move is only revealed at the end of the game
  – walk through a maze to avoid monsters and find gold
• Policy: decision process for which steps to take
  (here: the generator, a traditional neural machine translation model)
• Reward: end result (here: ability to fool the discriminator)
• Popular technique: Monte Carlo search (here: Monte Carlo decoding)
• Training is called policy search
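To make the alternating objectives of step 3 concrete, here is a deliberately tiny PyTorch sketch of a generator/discriminator game on toy vectors rather than on translations; it only shows the structure of the joint training loop, not an actual NMT GAN:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
generator = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, dim))
discriminator = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    noise = torch.randn(32, dim)
    real = torch.randn(32, dim) + 3.0            # stand-in for human translations (x, y)
    fake = generator(noise)                      # stand-in for generated translations (x, t)

    # discriminator step: classify real examples as 1, generated examples as 0
    d_loss = bce(discriminator(real), torch.ones(32, 1)) \
           + bce(discriminator(fake.detach()), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # generator step: additional objective of fooling the discriminator
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```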