Discovering Continuous Multi-word Expressions in Czech

NEVĚŘILOVÁ, Zuzana. Discovering Continuous Multi-word Expressions in Czech. Computación y Sistemas. Mexico: Centro de Investigación en Computación, 2018, vol. 22, no. 3, pp. 845-852. ISSN 1405-5546. Available from: https://dx.doi.org/10.13053/CyS-22-3-3022.


Basic information
Original title	Discovering Continuous Multi-word Expressions in Czech
Authors	NEVĚŘILOVÁ, Zuzana (203 Czech Republic, guarantor, belonging to the institution).
Edition	Computación y Sistemas, Mexico, Centro de Investigación en Computación, 2018, 1405-5546.

Other information
Original language	English
Type of outcome	Article in a journal
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Mexico
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/18:00109727
Organization unit	Faculty of Informatics
DOI	http://dx.doi.org/10.13053/CyS-22-3-3022
UT WoS	000471005100013
Keywords in English	Multiword expression; Multi-word expression; MWE; MWE discovery; inter-lingual homographs
Tags	International impact, Reviewed
Changed by	Changed by: RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 27. 4. 2020 19:31.

Abstract

Multi-word expressions frequently cause incorrect annotations in corpora, since they often contain foreign words or syntactic anomalies. In the case of foreign material, the annotation quality depends on whether the language of the sequence is detected correctly. With inter-lingual homographs, this detection becomes difficult. In previous work, we created a dataset of Czech continuous multi-word expressions (MWEs). The candidates were discovered automatically from a Czech web corpus, taking their orthographic variability into account. The candidates were classified and annotated manually. Afterwards, the dataset was extended automatically by generating all word forms of those MWEs that were annotated as nouns. In this work, we used the dataset as positive examples and filtered negative examples out of the MWE candidates. We trained a classifier with a mean accuracy of 92.7%. We show that the combined approach slightly outperforms approaches based only on association measures, mainly on MWEs containing inter-lingual homographs and out-of-vocabulary words. The discovery methods can be applied to other languages that exhibit orthographic variability in web corpora.
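
The abstract does not say which association measures were used as the baseline. As a rough illustration of the idea only, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one commonly used association measure. The function name `bigram_pmi` and the toy English corpus are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # relative frequency of the pair
        p_w1 = unigrams[w1] / n_uni    # relative frequency of the first word
        p_w2 = unigrams[w2] / n_uni    # relative frequency of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus invented for this example.
tokens = "new york is big and new york is old and the city is big".split()
scores = bigram_pmi(tokens)

# Print candidate pairs from the strongest to the weakest association.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

PMI tends to favour rare pairs, which is one reason a measure like this is usually combined with frequency thresholds or, as in the combined approach the abstract describes, with a trained classifier.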

Links
EF16_013/0001781, research and development project	Name: LINDAT/CLARIN - Research infrastructure for language technologies - extension of the repository and computing capacity
