Annotation of Multi-Word Expressions in Czech Texts

NEVĚŘILOVÁ, Zuzana. Annotation of Multi-Word Expressions in Czech Texts. In Horák, Aleš; Rychlý, Pavel; Rambousek, Adam. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015, pp. 103-112. ISBN 978-80-263-0974-1.


Basic information
Original title	Annotation of Multi-Word Expressions in Czech Texts
Authors	NEVĚŘILOVÁ, Zuzana (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 103-112, 10 pp. 2015.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality	not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14210/15:00085165
Organization unit	Faculty of Arts
ISBN	978-80-263-0974-1
ISSN	2336-4289
Keywords in English	multi-word expressions; corpus; orthographical variants
Tags	rivok
Flags	International impact, Reviewed
Changed by	RNDr. Zuzana Nevěřilová, Ph.D., učo 3839. Changed: 27. 5. 2021 09:13.

Abstract

Multi-word expressions (MWEs) are difficult to define and also difficult to annotate. Some of them cause serious errors in the traditional annotation pipeline of tokenization, morphological analysis, and morphological disambiguation. Many cases of incorrect annotation in Czech corpora are known. To narrow the research topic, we focus only on fixed MWEs: those with fixed word order and no elidable components. In this paper, we propose a corpus-based method that reveals fixed MWE candidates. From the web-based corpus of Czech, we extracted 25,091 expressions; of these, 2,140 were identified as MWEs, 332 as probable MWEs, and 174 as expressions that can be either an MWE or a single word. Our method is based on the corpus observation that people are unsure, when writing an MWE, whether it is one word, a word with dashes, or several words. The result is a list of MWE candidates and also an application that classifies the input as MWE, probable MWE, or non-MWE.
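
The observation behind the method, that one expression appears in the corpus as a single word, with dashes, and as several words, can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the frequency table, the function names (variant_key, variant_kind, find_candidates), and the rule "written as several words plus at least one other spelling" are all assumptions made for illustration, and the sketch covers only candidate discovery, not the final classification.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy frequencies; a real run would take counts from a corpus.
COUNTS = {
    "a priori": 120,
    "a-priori": 7,
    "apriori": 15,
    "ad hoc": 300,
    "ad-hoc": 40,
    "protože": 5000,
}


def variant_key(expr):
    """Collapse spaces and hyphens so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", expr.lower())


def variant_kind(expr):
    """Label a surface form as 'spaced', 'hyphenated', or 'single'."""
    if " " in expr:
        return "spaced"
    if "-" in expr:
        return "hyphenated"
    return "single"


def find_candidates(counts):
    """Group surface forms by key and keep groups with competing spellings.

    Assumed rule (not from the paper): a group is a candidate when it is
    attested as several words and in at least one other spelling.
    """
    groups = defaultdict(Counter)
    for expr, freq in counts.items():
        groups[variant_key(expr)][variant_kind(expr)] += freq
    return {
        key: dict(kinds)
        for key, kinds in groups.items()
        if "spaced" in kinds and len(kinds) >= 2
    }


if __name__ == "__main__":
    for key, kinds in sorted(find_candidates(COUNTS).items()):
        print(key, kinds)
```

With the toy data, "a priori" and "ad hoc" are grouped with their variants and reported, while "protože", which has only a single-word spelling, is not.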

Links
MUNI/A/1165/2014, internal MU code	Name: Čeština v jednotě synchronie a diachronie - 2015 (Czech in the unity of synchrony and diachrony - 2015)
MUNI/A/1165/2014, internal MU code	Investor: Masaryk University, Čeština v jednotě synchronie a diachronie - 2015, until 2020_Category A - Specific research - Student research projects
7F14047, R&D project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, R&D project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Harvesting big text data for under-resourced languages
