Czech MWE Database

PALA, Karel, Lukáš SVOBODA and Pavel ŠMERK. Czech MWE Database. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC '08). Marrakech, Morocco: European Language Resources Association (ELRA), 2008, pp. 1-5. ISBN 2-9517408-4-0.


Basic information
Original title	Czech MWE Database
Title in Czech	Česká databáze víceslovných výrazů
Authors	PALA, Karel (203 Czech Republic, guarantor), Lukáš SVOBODA (203 Czech Republic) and Pavel ŠMERK (203 Czech Republic).
Edition	Marrakech, Morocco, Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC '08), pp. 1-5, 2008.
Publisher	European Language Resources Association (ELRA)

Other information
Original language	English
Type of outcome	Paper in proceedings
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality	not subject to a state or trade secret
RIV code	RIV/00216224:14330/08:00024204
Organizational unit	Faculty of Informatics
ISBN	2-9517408-4-0
UT WoS	000324028903004
Keywords in English	multiword expressions;word sketch engine
Tags	multiword expressions, word sketch engine
Flags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 26. 5. 2010 09:10.

Abstract

This paper describes a recently developed large Czech MWE database that currently contains 160,000 MWEs (treated as lexical units). We describe the structure of the database and classify the basic types of MWEs by the domains they belong to. We compare the database with data from the Czech National Corpus (approx. 100 million tokens) and present the results of this comparison. To obtain a more complete list of MWEs, we propose and use a technique based on the Word Sketch Engine, which lets us work with statistical parameters such as the frequency of MWEs and their components, as well as the salience of whole MWEs. We also discuss how the database can be used to improve tagging and lemmatization. The final goal is to recognize MWEs in corpus text and lemmatize them as complete lexical units.

Abstract (translated from Czech)
The paper describes the structure and content of the Czech database of multiword expressions, which currently contains more than 160,000 items, and compares it with data from the Czech National Corpus. It also proposes how to extend the database using the Word Sketch Engine.

Links
LC536, R&D project	Name: Centrum komputační lingvistiky (Centre for Computational Linguistics)
LC536, R&D project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Centrum komputační lingvistiky
1ET200610406, R&D project	Name: Jazyková poradna na internetu (Language Advisory Service on the Internet)
1ET200610406, R&D project	Investor: Academy of Sciences of the Czech Republic, Jazyková poradna na internetu
2C06009, R&D project	Name: Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce (Tools for building a comprehensive knowledge base for natural-language communication with the semantic web; acronym: COT-SEWing)
2C06009, R&D project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce
