Bilingual Lexicon Induction From Comparable and Parallel Data:
A Comparative Analysis

D 2024

DENISOVÁ, Michaela and Pavel RYCHLÝ

Basic information

Original title

Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

Authors

DENISOVÁ, Michaela and Pavel RYCHLÝ

Edition

Cham, International Conference on Text, Speech, and Dialogue, pp. 30-42, 13 pp., 2024

Publisher

Springer Nature Switzerland

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available "online"

Links

Preprint version

Impact factor

Impact factor: 0.402 in 2005

Marked to be transferred to RIV

Yes

RIV code

RIV/00216224:14330/24:00136956

Organization unit

Faculty of Informatics

ISBN

978-3-031-70562-5

ISSN

Keywords in English

bilingual lexicon induction; cross-lingual word embeddings; neural machine translation systems

Tags

firank_B

Flags

International impact, Reviewed

Changed: 4 April 2025, 12:11, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness in the BLI task compared to the models using comparable data remains underexplored. In this paper, we provide a comparative study of the NMTS and CWE models evaluated on the BLI task and demonstrate the results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as a better option when fewer resources are available.

Related projects

MUNI/A/1590/2023, internal MU code

Name: Using artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data

Investor: Masaryk University, Using artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data

Cite

DENISOVÁ, Michaela and Pavel RYCHLÝ. Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis. Online. In Nöth, E., Horák, A., Sojka, P. International Conference on Text, Speech, and Dialogue. Cham: Springer Nature Switzerland, 2024, pp. 30-42. ISBN 978-3-031-70562-5. Available from: https://doi.org/10.1007/978-3-031-70563-2_3.

@inproceedings{2426237,
   author = {Denisová, Michaela and Rychlý, Pavel},
   address = {Cham},
   booktitle = {International Conference on Text, Speech, and Dialogue},
   doi = {10.1007/978-3-031-70563-2_3},
   editor = {Nöth, E. and Horák, A. and Sojka, P.},
   keywords = {bilingual lexicon induction; cross-lingual word embeddings; neural machine translation systems},
   howpublished = {electronic version available "online"},
   language = {eng},
   location = {Cham},
   isbn = {978-3-031-70562-5},
   pages = {30--42},
   publisher = {Springer Nature Switzerland},
   title = {Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis},
   url = {https://tsdconference.org/tsd2024/download/preprints/1195.pdf},
   year = {2024}
}

TY  - CONF
ID  - 2426237
AU  - Denisová, Michaela
AU  - Rychlý, Pavel
PY  - 2024
TI  - Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis
PB  - Springer Nature Switzerland
CY  - Cham
SN  - 9783031705625
KW  - bilingual lexicon induction
KW  - cross-lingual word embeddings
KW  - neural machine translation systems
UR  - https://tsdconference.org/tsd2024/download/preprints/1195.pdf
N2  - Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness in the BLI task compared to the models using comparable data remains underexplored. In this paper, we provide a comparative study of the NMTS and CWE models evaluated on the BLI task and demonstrate the results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as a better option when fewer resources are available.
ER  -

DENISOVÁ, Michaela and Pavel RYCHLÝ. Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis. Online. In Nöth, E., Horák, A., Sojka, P. \textit{International Conference on Text, Speech, and Dialogue}. Cham: Springer Nature Switzerland, 2024, pp.~30--42. ISBN~978-3-031-70562-5. Available from: https://doi.org/10.1007/978-3-031-70563-2\_{}3.
