Machine Learning Tricks
Philipp Koehn
13 October 2020

Machine Learning

• Myth of machine learning
  – given: real world examples
  – automatically build model
  – make predictions
• Promise of deep learning
  – do not worry about specific properties of the problem
  – deep learning automatically discovers the features
• Reality: bag of tricks

Today’s Agenda

• No new translation model
• Discussion of failures in machine learning
• Various tricks to address them

Fair Warning

• At some point, you will think: why are you telling us all this madness?
• Because pretty much all of it is commonly used

failures in machine learning

Failures in Machine Learning

(figure: error(λ) curve over a parameter λ)

• A too high learning rate may lead to too drastic parameter updates
  → overshooting the optimum

Failures in Machine Learning

(figure: error(λ) curve over a parameter λ)

• Bad initialization may require many updates to escape a plateau

Failures in Machine Learning

(figure: error(λ) curve with a local optimum and a global optimum)

• Local optima trap training

Learning Rate

• Gradient computation gives the direction of change
• Scaled by a learning rate
• Weight updates
• Simplest form: fixed value
• Annealing
  – start with a larger value (big changes at the beginning)
  – reduce over time (minor adjustments to refine the model)

Initialization of Weights

• Initialize weights to random values
• But: the range of possible values matters

(figure: error(λ) curve over a parameter λ)

Sigmoid Activation Function

(figure: sigmoid curve and its derivative)

• Derivative of the sigmoid: near zero for large positive and negative values

Rectified Linear Unit

(figure: ReLU and its derivative)

• Derivative of ReLU: flat over a large interval, gradient is 0
• "Dead cells": elements in the output that are always 0, no matter the input

Local Optima

• Cartoon depiction

(figure: error(λ) curve with a local optimum and a global optimum)

• Reality
  – highly dimensional space
  – complex interaction between individual parameter changes
  – "bumpy"

Vanishing and Exploding Gradients

(figure: a chain of RNN cells unrolled over time)

• Repeated multiplication with the same values
• If gradients are too low → 0
• If gradients are too big → ∞

Overfitting and Underfitting

(figure: under-fitting, good fit, over-fitting)

• The complexity of the problem has to match the capacity of the model
• Capacity = number of trainable parameters

ensuring randomness

Ensuring Randomness

• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
  – avoid undue structure in the training data
  – avoid undue structure in the initial weight setting
• ML approach: maximum entropy training
  – fit properties of the training data
  – otherwise, the model should be as random as possible (i.e., have maximum entropy)

Shuffling the Training Data

• Typical training data in machine translation
  – different types of corpora
    ∗ European Parliament Proceedings
    ∗ collection of movie subtitles
  – temporal structure in each corpus
  – similar sentences next to each other (e.g., same story / debate)
• Online updating: the last examples matter more
• Convergence criterion: no improvement recently
  → a stretch of hard examples following easy examples leads to prematurely stopped training
⇒ randomly shuffle the training data (maybe each epoch)
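A minimal sketch of per-epoch shuffling, assuming the training data is held as a list of (source, target) sentence pairs; the names train_epochs and train_step are illustrative placeholders, not part of any particular toolkit:

```python
import random

def train_epochs(parallel_data, num_epochs, train_step):
    """Run several epochs, reshuffling the sentence pairs before each one so that
    corpus structure (same debate, same movie) does not cluster together."""
    data = list(parallel_data)              # work on a copy, keep the original order intact
    for epoch in range(num_epochs):
        random.shuffle(data)                # new random order every epoch
        for source, target in data:
            train_step(source, target)      # one online (or mini-batch) update

# toy usage: "training" just prints the pairs in their shuffled order
pairs = [("ein Haus", "a house"), ("ein Auto", "a car"), ("ein Buch", "a book")]
train_epochs(pairs, num_epochs=2, train_step=lambda src, tgt: print(src, "→", tgt))
```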
Weight Initialization

• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in the transition area of the activation function

For Example: Sigmoid

• Input values in range [−1; 1]
  ⇒ output values in range [0.269; 0.731]
• Magic formula (n = size of the previous layer)
    [ −1/√n , 1/√n ]
• Magic formula for hidden layers
    [ −√6/√(nj + nj+1) , √6/√(nj + nj+1) ]
  – nj is the size of the previous layer
  – nj+1 is the size of the next layer

Problem: Overconfident Models

• Predictions of neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word
  (word prediction probabilities of over 99%)
• Problem for decoding and training
  – decoding: sensible alternatives get low scores, bad for beam search
  – training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
  – in classification tasks, we predict a label
  – jargon term for any output → here, we smooth the word predictions

Label Smoothing during Decoding

• Common strategy to combat peaked distributions: smooth them
• Recall
  – the prediction layer produces a number si for each word
  – converted into probabilities using the softmax
    p(yi) = exp(si) / Σj exp(sj)
• The softmax calculation can be smoothed with a so-called temperature T
    p(yi) = exp(si/T) / Σj exp(sj/T)
• Higher temperature → smoother distribution
  (i.e., less probability is given to the most likely choice)

Label Smoothing during Training

• Root of the problem: training
• Training objective: assign all probability mass to the single correct word
• Label smoothing
  – the truth gives some probability mass to other words (say, 10% of it)
  – either uniformly distributed over all words
  – or relative to unigram word probabilities
    (relative counts of each word in the target side of the training data)
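Both smoothing ideas fit into a few lines of numpy; the scores, vocabulary size, and 10% smoothing mass below are illustrative values, not prescribed by the slides:

```python
import numpy as np

def softmax_with_temperature(scores, T=1.0):
    """p(y_i) = exp(s_i / T) / sum_j exp(s_j / T); a higher T flattens the distribution."""
    z = (scores - scores.max()) / T            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def smoothed_targets(correct_word, vocab_size, mass=0.1):
    """Spread `mass` of the probability uniformly over all words, rest goes to the correct word."""
    target = np.full(vocab_size, mass / vocab_size)
    target[correct_word] += 1.0 - mass
    return target

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax_with_temperature(scores, T=1.0))            # peaked
print(softmax_with_temperature(scores, T=2.0))            # smoother
print(smoothed_targets(correct_word=2, vocab_size=4))     # [0.025 0.025 0.925 0.025]
```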
adjusting the learning rate

Adjusting the Learning Rate

• Gradient descent training: the weight update follows the gradient downhill
• Actual gradients have fairly large values, so they are scaled with a learning rate
  (a low number, e.g., µ = 0.001)
• Change the learning rate over time
  – starting with larger updates
  – refining weights with smaller updates
  – adjust for other reasons
• Learning rate schedule

Momentum Term

• Consider the case where a weight value is far from its optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term mt
  – accumulate weight updates at each time step t
  – some decay rate for the sum (e.g., 0.9)
  – combine the momentum term mt−1 with the weight update value ∆wt
    mt = 0.9 mt−1 + ∆wt
    wt = wt−1 − µ mt

Adapting Learning Rate per Parameter

• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from the optimum → change a lot
• Later nuanced refinements are needed → change little
• Now: a different learning rate for each parameter

Adagrad

• Different parameters are at different stages of training
  → different learning rate for each parameter
• Adagrad
  – record gradients for each parameter
  – accumulate their square values over time
  – use this sum to reduce the learning rate
• Update formula
  – gradient gt = dEt/dw of error E with respect to weight w
  – divide the learning rate µ by the accumulated sum
    ∆wt = µ / √(Σ τ=1..t gτ²) × gt
• Big changes in the parameter value (corresponding to big gradients gt)
  → reduction of the learning rate for that weight parameter

Adam: Elements

• Combines the idea of a momentum term with reducing the parameter update by the accumulated change
• Momentum term idea (e.g., β1 = 0.9)
    mt = β1 mt−1 + (1 − β1) gt
• Accumulated squared gradients (decay with β2 = 0.999)
    vt = β2 vt−1 + (1 − β2) gt²

Adam: Technical Correction

• Initially, the values for mt and vt are close to their initial value of 0
• Adjustment
    m̂t = mt / (1 − β1^t),   v̂t = vt / (1 − β2^t)
• With t → ∞ this correction goes away
    lim t→∞ 1 / (1 − β^t) = 1

Adam

• Given
  – learning rate µ
  – momentum m̂t
  – accumulated change v̂t
• Weight update per Adam (e.g., ε = 10⁻⁸)
    ∆wt = µ / (√v̂t + ε) × m̂t
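The Adam formulas above translate directly into a short numpy sketch; the quadratic toy objective and the larger learning rate in the loop are illustrative choices so the example converges quickly:

```python
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of weight(s) w for gradient g at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g            # momentum term
    v = beta2 * v + (1 - beta2) * g * g        # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)               # correction for the initial bias towards 0
    v_hat = v / (1 - beta2 ** t)
    w = w - mu * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy objective E(w) = (w - 3)^2, so the gradient is dE/dw = 2 (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, mu=0.1)  # larger µ than usual for a quick toy run
print(w)                                        # close to the optimum 3.0
```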
Batched Gradient Updates

• Accumulate the weight updates for all training examples, then update
  (converges slowly)
• Process each training example, then update (stochastic gradient descent)
  (quicker convergence, but the last training examples have a disproportionately higher impact)
• Process data in batches
  – compute the gradients for the individual word prediction errors
  – use the sum over each batch to update the parameters
  → better parallelization on GPUs
• Process data on multiple compute cores
  – batch processing may take different amounts of time
  – asynchronous training: apply updates when they arrive
  – the mismatch between original weights and updates may not matter much

avoiding local optima

Avoiding Local Optima

• One of the hardest problems for designing neural network architectures and optimization methods
• Ensure that the model converges at least to a set of parameter values that gives results
  close to the optimum on unseen test data
• There is no real solution to this problem
• It requires experimentation and analysis that is more craft than science
• Still, this section presents a number of methods that generally help to avoid getting stuck
  in local optima

Overfitting and Underfitting

• Neural machine translation models
  – 100s of millions of parameters
  – 100s of millions of training examples (individual word predictions)
• No hard rules for the relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting

Regularization

• Motivation: prefer as few parameters as possible
• Strategy: set unneeded parameters to a value of 0
• Method
  – adjust the training objective
  – add a cost for any non-zero parameter
  – typically done with the L2 norm
• Practical impact
  – the derivative of the L2 norm is the value of the parameter
  – if there is no signal from training: reduce the value of the parameter
  – also called weight decay
• Not common in deep learning, but other methods can be understood as regularization

Curriculum Learning

• Human learning
  – learn simple concepts first
  – learn more complex material later
• Early epochs: only easy training examples
  – only short sentences
  – create artificial data by extracting smaller segments
    (similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate

Dropout

• Training may get stuck in local optima
  – some properties of the task have been learned
  – discovery of other properties would take it too far out of its comfort zone
• Machine translation example
  – the model has learned the language model aspects
  – but cannot figure out the role of the input sentence
• Dropout: for each batch, eliminate some nodes

Dropout

• Dropout
  – for each batch, a different random set of nodes is removed
  – their values are set to 0 and their weights are not updated
  – 10%, 20% or even 50% of all the nodes
• Why does this work?
  – robustness: redundant nodes play similar roles
  – ensemble learning: different subnetworks are different models
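A minimal numpy sketch of dropout on one layer's output; the rescaling by 1/(1 − rate) ("inverted dropout") is one common implementation choice that keeps the expected node values unchanged, not something prescribed by the slides:

```python
import numpy as np

def dropout(layer_values, rate=0.2, training=True):
    """Randomly zero out a fraction `rate` of the nodes for this batch."""
    if not training or rate == 0.0:
        return layer_values                      # no dropout at inference time
    keep = np.random.rand(*layer_values.shape) >= rate
    return layer_values * keep / (1.0 - rate)    # rescale so the expected values stay the same

h = np.random.randn(4, 8)        # e.g., a batch of 4 hidden vectors of size 8
print(dropout(h, rate=0.5))      # roughly half the entries are zeroed out
```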
Gradient Clipping

• Exploding gradients: gradients become too large during the backward pass
⇒ Limit the total value of the gradients for a layer to a threshold τ
• Use the L2 norm of the gradient vector g
    L2(g) = √(Σj gj²)
• Adjust each gradient value gi for each element i of the vector
    gi = gi × τ / max(τ, L2(g))

Layer Normalization

• During inference, average node values may become too large or too small
• This also has an impact on training (gradients are multiplied with node values)
⇒ Normalize node values
• During training, learn a bias layer

Layer Normalization: Math

• Feed-forward layer hl with weights W, computed sum sl, and activation function
    sl = W hl−1
    hl = sigmoid(sl)
• Compute the mean µl and standard deviation σl of the sum vector sl (H = layer size)
    µl = 1/H Σ i=1..H sli
    σl = √( 1/H Σ i=1..H (sli − µl)² )

Layer Normalization: Math

• Normalize sl
    ŝl = 1/σl (sl − µl)
• With learnable gain vector g and bias vector b
    ŝl = g/σl (sl − µl) + b

Shortcuts and Highways

• Deep learning: many layers of processing
⇒ Error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from the first to the last
• Jargon alert
  – shortcuts
  – residual connections
  – skip connections

Shortcuts

• Feed-forward layer
    y = f(x)
• Pass through the input x
    y = f(x) + x
• Note: the gradient is
    y′ = f′(x) + 1
• Constant 1 → the gradient is passed through unchanged

Highways

• Regulate how much information from f(x) and x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
    y = t(x) f(x) + (1 − t(x)) x
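A small numpy sketch of the two variants: a plain shortcut y = f(x) + x and a highway layer with a sigmoid gate t(x); the ReLU feed-forward layer and the weight shapes here are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shortcut_layer(x, W):
    """Residual / skip connection: y = f(x) + x."""
    return np.maximum(0.0, W @ x) + x            # f is a ReLU feed-forward layer here

def highway_layer(x, W, W_gate):
    """Highway: y = t(x) * f(x) + (1 - t(x)) * x with a sigmoid gate t(x)."""
    f = np.maximum(0.0, W @ x)
    t = sigmoid(W_gate @ x)
    return t * f + (1.0 - t) * x

x = np.random.randn(8)
W, W_gate = np.random.randn(8, 8), np.random.randn(8, 8)
print(shortcut_layer(x, W))
print(highway_layer(x, W, W_gate))
```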
Shortcuts and Highways

(figure: basic feed-forward layer vs. skip connection (add) vs. highway network (gated add))

LSTM and Vanishing Gradients

• Recall: long short-term memory (LSTM) cells
• Pass-through of the memory state
    memoryt = gateinput × inputt + gateforget × memoryt−1
• If the forget gate has values close to 1
  → the gradient is passed through nearly unchanged

generative adversarial training

Sequence-Level Training

• Traditional training
  – predict one word at a time
  – compare against the correct word
  – proceed training with the correct word
• Sequence-level training
  – predict an entire sequence
  – measure the translation with a sentence-level metric (e.g., BLEU)
• May use n-best translations, beam search, etc.

Generative Adversarial Networks (GAN)

• Game between two players
  – the generator proposes a translation
  – the discriminator distinguishes between the generator's translation and a human translation
  – the generator tries to fool the discriminator
• Training example: input sentence x and output sentence y
• Generator
  – traditional neural machine translation model
  – generates full sentence translations t for each input sentence
• Discriminator
  – is trained to classify (x, y) as a correct example
  – is trained to classify (x, t) as a generated example

Generative Adversarial Networks (GAN)

1. First train the generator to some maturity
2. Train the discriminator on generator predictions and human reference translations
3. Train jointly (see the toy sketch below)
  – the generator with the additional objective of fooling the discriminator
  – the discriminator to do well at detecting the generator's output as such
• In practice, this is hard to calibrate correctly

Relationship to Reinforcement Learning

• No immediate feedback
  – chess playing: the quality of a move is only revealed at the end of the game
  – walk through a maze to avoid monsters and find gold
• Policy: decision process for which steps to take
  (here: the generator, a traditional neural machine translation model)
• Reward: end result (here: ability to fool the discriminator)
• Popular technique: Monte Carlo search (here: Monte Carlo decoding)
• Training is called policy search
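To make the alternating objectives of step 3 concrete, here is a deliberately tiny PyTorch sketch of a generator/discriminator game on toy vectors rather than on translations; it only shows the structure of the joint training loop, not an actual NMT GAN:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
generator = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, dim))
discriminator = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    noise = torch.randn(32, dim)
    real = torch.randn(32, dim) + 3.0            # stand-in for human translations (x, y)
    fake = generator(noise)                      # stand-in for generated translations (x, t)

    # discriminator step: classify real examples as 1, generated examples as 0
    d_loss = bce(discriminator(real), torch.ones(32, 1)) \
           + bce(discriminator(fake.detach()), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # generator step: additional objective of fooling the discriminator
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```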