V 2000

On Disambiguation in Czech Corpora

POPELÍNSKÝ, Lubomír, Tomáš PAVELEK and Tomáš PTÁČNÍK

Basic information

Original name

On Disambiguation in Czech Corpora

Authors

POPELÍNSKÝ, Lubomír, Tomáš PAVELEK and Tomáš PTÁČNÍK

Edition

Brno (CZE), 12 pp., 2000

Publisher

FI MU

Other information

Language

English

Type of outcome

Research report

Field of Study

20206 Computer hardware and architecture

Country of publisher

Czech Republic

Confidentiality degree

not subject to state or trade secrecy

RIV identification code

RIV/00216224:14330/00:00002818

Organization unit

Faculty of Informatics

Keywords in English

Lemma disambiguation; Corpus; Natural language processing; Machine learning
Changed: 25/2/2001 17:39, doc. RNDr. Lubomír Popelínský, Ph.D.

Abstract

In the original

Lemma disambiguation means finding the basic form of a word, typically the nominative singular for nouns or the infinitive for verbs. We developed a multistrategy method for lemma disambiguation of unannotated text. The method combines inductive logic programming with instance-based learning. We present results for the most important subtasks of lemma disambiguation for the Czech language. Although no expert knowledge of Czech grammar was used, the accuracy reaches 90%, with a fraction of words remaining ambiguous. We also present first results for tag disambiguation.

Links

VS97028, research and development project
Name: Laboratoř zpracování přirozeného jazyka (s aplikacemi pro podporu výuky zrakově postižených); in English: Natural Language Processing Laboratory (with applications supporting education of people with limited sight)
Investor: Ministry of Education, Youth and Sports of the CR