Low-Resource Machine Translation
PV061
Pavel Rychlý
NLP Centre, FI MU
20 Sep 2023

Pavel Rychlý · Low-Resource Machine Translation · 20 Sep 2023

Low-Resource Languages

Parallel Data in Practice

Finding data

Data Augmentation
  word or phrase replacement
  back-translation (iterative back-translation)
  reducing noise
  monolingual data selection
  synthetic parallel data filtering
  distinguishing between original and back-translated data

Unsupervised MT
  alignment of monolingual embeddings
  generating word translations
  multilingual embeddings
  fine-tuning on monolingual data

Multilingual MT
  single encoder-decoder for all the languages
    single target
    multiple target langs with annotation
      target lang ID in source
      target lang ID as first token in target
  per-language encoder-decoder
    single encoder with per-language decoder
    per-language encoder with single decoder

Zero-shot NMT
  pivoting
  pivot language (English)
  many-to-many multilingual MT

Evaluation
  Pretrained models (COMET) not supported
  ChrF++
  datasets:
    FLoRes Evaluation Benchmark
    FLORES-101 (3000 English sentences)
    FLORES-200
    Open Language Data Initiative (OLDI)

WMT Shared Tasks
  Shared Tasks are the main part of the Conference on Machine Translation (WMT).
  2023: http://www2.statmt.org/wmt23/index.html
    general translation task (former News task)
    terminology translation task
    literary translation task
    word-level autocompletion task
    sign language translation task
    biomedical translation task
    Indic translation task
    African translation task
    metrics evaluation task

Summary 1
  statistical MT
    IBM Model 1
    language modelling
    phrase-based models
    decoding
  neural MT
    language modelling, RNN
    RNN MT
    attention
    decoding

Summary 2
  transformers
  subwords
  evaluation
  data acquisition

Next semester
  PA107 Corpus Tools Project
  WMT 2024 Shared Tasks
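
Sketch: target language ID annotation (Multilingual MT slide)

The Multilingual MT slide lists two ways to tell a single encoder-decoder which language to produce: target lang ID in source, or target lang ID as first token in target. The following is a minimal sketch of both variants as a data-preparation step. The tag format "<2xx>" and the function names make_lang_tag, tag_source, tag_target and build_example are assumptions made for illustration; the slides do not prescribe any particular token or API.

```python
from typing import List, Tuple


def make_lang_tag(lang: str) -> str:
    # Hypothetical tag format; real systems define their own special tokens.
    return f"<2{lang}>"


def tag_source(src_tokens: List[str], tgt_lang: str) -> List[str]:
    # Variant 1: the target language ID is prepended to the source sentence.
    return [make_lang_tag(tgt_lang)] + list(src_tokens)


def tag_target(tgt_tokens: List[str], tgt_lang: str) -> List[str]:
    # Variant 2: the target language ID is the first token of the target.
    return [make_lang_tag(tgt_lang)] + list(tgt_tokens)


def build_example(
    src_tokens: List[str],
    tgt_tokens: List[str],
    tgt_lang: str,
    id_in_source: bool = True,
) -> Tuple[List[str], List[str]]:
    # Returns one (source, target) training pair annotated with one variant.
    if id_in_source:
        return tag_source(src_tokens, tgt_lang), list(tgt_tokens)
    return list(src_tokens), tag_target(tgt_tokens, tgt_lang)


if __name__ == "__main__":
    cs = ["dobrý", "den"]
    en = ["good", "morning"]
    print(build_example(cs, en, "en", id_in_source=True))
    print(build_example(cs, en, "en", id_in_source=False))
```

With the ID in the source, the encoder sees the requested language. With the ID as the first target token, the same tag would typically be forced as the first decoder token at inference time. Which variant works better is not stated in the slides.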