On Disambiguation in Czech Corpora

V 2000

POPELÍNSKÝ, Lubomír, Tomáš PAVELEK and Tomáš PTÁČNÍK

Basic information

Original name

On Disambiguation in Czech Corpora

Authors

POPELÍNSKÝ, Lubomír, Tomáš PAVELEK and Tomáš PTÁČNÍK

Edition

Brno (CZE), 12 pp., 2000

Publisher

FI MU

Other information

Language

English

Type of outcome

Research report

Field of Study

20206 Computer hardware and architecture

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

RIV identification code

RIV/00216224:14330/00:00002818

Organization unit

Faculty of Informatics

Keywords in English

Lemma disambiguation; Corpus; Natural language processing; Machine learning

Abstract

In the original

Lemma disambiguation means finding the basic word form, typically the nominative singular for nouns or the infinitive for verbs. We developed a multistrategy method for lemma disambiguation of unannotated text. The method combines inductive logic programming and instance-based learning. We present results for the most important subtasks of lemma disambiguation for the Czech language. Although no expert knowledge of Czech grammar was used, the accuracy reaches 90%, with a fraction of words remaining ambiguous. We also present the first results of tag disambiguation.

Links

VS97028, research and development project

Name: Natural Language Processing Laboratory (with applications supporting education of people with limited sight); original Czech name: Laboratoř zpracování přirozeného jazyka (s aplikacemi pro podporu výuky zrakově postižených)

Investor: Ministry of Education, Youth and Sports of the CR
