HFT: High Frequency Tokens for Low-Resource NMT

SIGNORONI, Edoardo and Pavel RYCHLÝ. HFT: High Frequency Tokens for Low-Resource NMT. Online. In Atul Kr. Ojha, Chao-Hong Liu, Ekaterina Vylomova, Jade Abbott, Jonathan Washington, Nathaniel Oco, Tommi A Pirinen, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao. Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022). Gyeongju, Republic of Korea: Association for Computational Linguistics, 2022, pp. 56-63. ISSN 2951-2093.


Basic information
Original title	HFT: High Frequency Tokens for Low-Resource NMT
Authors	SIGNORONI, Edoardo (380 Italy, home institution) and Pavel RYCHLÝ (203 Czech Republic, home institution).
Edition	Gyeongju, Republic of Korea, Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022), pp. 56-63, 8 pp. 2022.
Publisher	Association for Computational Linguistics

Additional information
Original language	English
Type of outcome	Proceedings paper
Field of study	10200 1.2 Computer and information sciences
Country of publisher	United States
Confidentiality	not subject to state or trade secrecy
Publication form	electronic version available online
RIV code	RIV/00216224:14330/22:00127008
Organizational unit	Faculty of Informatics
ISSN	2951-2093
Keywords in English	Machine Translation; Tokenization
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 15 May 2024 09:10.

Abstract

Tokenization has been shown to impact the quality of downstream tasks, such as Neural Machine Translation (NMT), which is susceptible to out-of-vocabulary words and low-frequency training data. Current state-of-the-art algorithms have been helpful in addressing the issues of out-of-vocabulary words, bigger vocabulary sizes and token frequency by implementing subword segmentation. We argue, however, that there is still room for improvement, in particular regarding low-frequency tokens in the training data. In this paper, we present “High Frequency Tokenizer”, or HFT, a new language-independent subword segmentation algorithm that addresses this issue. We also propose a new metric to measure the frequency coverage of a tokenizer’s vocabulary, based on a frequency rank weighted average of the frequency values of its items. We experiment with a diverse set of language corpora, vocabulary sizes, and writing systems and report improvements in both frequency statistics and the average length of the output. We also observe a positive impact on downstream NMT.
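
The abstract describes the proposed metric only as a "frequency rank weighted average of the frequency values" of the vocabulary items and does not give a formula. The following is a minimal sketch of one possible reading, not the paper's definition. It assumes that tokens are ranked by descending frequency and that each frequency is weighted by its rank, so that the low-frequency tail counts more; the function name `rank_weighted_avg_frequency` and this weighting scheme are illustrative assumptions.

```python
from collections import Counter


def rank_weighted_avg_frequency(tokens):
    """Rank-weighted average of token frequencies (hypothetical reading).

    ASSUMPTION: weight = frequency rank (1 = most frequent), so rare
    tokens pull the score down more strongly than frequent ones lift it.
    This is an illustration, not the metric defined in the paper.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0  # empty input: no vocabulary items to average over
    # Frequencies sorted from most to least frequent.
    freqs = sorted(counts.values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    weighted_sum = sum(rank * freq for rank, freq in zip(ranks, freqs))
    return weighted_sum / sum(ranks)


# Toy tokenized corpus: "a" occurs 3 times, "b" twice, "c" once.
score = rank_weighted_avg_frequency(["a", "a", "a", "b", "b", "c"])
```

Under this reading, a segmentation whose vocabulary has fewer rare items yields a higher score, which is consistent with the abstract's stated aim of improving the coverage of low-frequency tokens.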

Related projects
EF19_073/0016943, R&D project	Name: Internal Grant Agency of Masaryk University
LM2018101, R&D project	Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
LM2018101, R&D project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities
MUNI/IGA/1334/2021, MU internal code	Name: A New Machine Translation-based approach to Parallel Corpora Alignment
MUNI/IGA/1334/2021, MU internal code	Investor: Masaryk University, A New Machine Translation-based approach to Parallel Corpora Alignment
