Detailed Information on Publication Record

NEVĚŘILOVÁ, Zuzana. Discovering Continuous Multi-word Expressions in Czech. Computación y Sistemas. Mexico: Centro de Investigación en Computación, 2018, vol. 22, No 3, p. 845-852. ISSN 1405-5546. Available from: https://dx.doi.org/10.13053/CyS-22-3-3022.

Basic information
Original name	Discovering Continuous Multi-word Expressions in Czech
Authors	NEVĚŘILOVÁ, Zuzana (203 Czech Republic, guarantor, belonging to the institution).
Edition	Computación y Sistemas, Mexico, Centro de Investigación en Computación, 2018, 1405-5546.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Mexico
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/18:00109727
Organization unit	Faculty of Informatics
Doi	http://dx.doi.org/10.13053/CyS-22-3-3022
UT WoS	000471005100013
Keywords in English	Multiword expression; Multi-word expression; MWE; MWE discovery; inter-lingual homographs
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 27/4/2020 19:31.

Abstract
Multi-word expressions frequently cause incorrect annotations in corpora, since they often contain foreign words or syntactic anomalies. In the case of foreign material, the annotation quality depends on whether the correct language of the sequence is detected. In the case of inter-lingual homographs, this problem becomes difficult. In previous work, we created a dataset of Czech continuous multi-word expressions (MWEs). The candidates were discovered automatically from a Czech web corpus, taking their orthographic variability into account. The candidates were classified and annotated manually. Afterwards, the dataset was extended automatically by generating all word forms of those MWEs that were annotated as nouns. In this work, we used the dataset as positive examples and filtered out negative examples from the MWE candidates. We trained a classifier with a mean accuracy of 92.7%. We show that the combined approach slightly outperforms approaches that rely only on association measures, mainly on MWEs containing inter-lingual homographs and out-of-vocabulary words. The discovery methods can be applied to other languages that exhibit orthographic variability in web corpora.

Links
EF16_013/0001781, research and development project	Name: LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity