D 2024

Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

DENISOVÁ, Michaela and Pavel RYCHLÝ

Basic information

Original name

Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

Edition

Cham, International Conference on Text, Speech, and Dialogue, p. 30-42, 12 pp. 2024

Publisher

Springer Nature Switzerland

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

Organization unit

Faculty of Informatics

ISBN

978-3-031-70563-2

Keywords in English

bilingual lexicon induction; cross-lingual word embeddings; neural machine translation systems

Tags

Reviewed
Changed: 17/10/2024 15:37, Mgr. Michaela Denisová

Abstract

In the original

Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness in the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and demonstrate the results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as a better option when fewer resources are available.

Links

MUNI/A/1590/2023, MU internal code
Name: Využití technik umělé inteligence pro zpracování dat, komplexní analýzy a vizualizaci rozsáhlých dat
Investor: Masaryk University, Using artificial intelligence techniques for data processing, complex analysis and visualization of large-scale data