Czech Grammar Agreement Dataset for Evaluation of Language Models

D 2016

BAISA, Vít

Basic information

Original name

Czech Grammar Agreement Dataset for Evaluation of Language Models

Authors

BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, p. 63-67, 5 pp. 2016

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/16:00091975

Organization unit

Faculty of Informatics

ISBN

978-80-263-1095-2

ISSN

UT WoS

000466886400007

Keywords (in Czech)

jazykový model; gramatická shoda; slovesná přípona; čeština; podmět; přísudek; vyhodnocení; perplexita

Keywords in English

language model; grammar agreement; verb suffix; Czech language; subject; predicate; dataset; evaluation; perplexity

Abstract

In the original

AGREE is a dataset and task for evaluating language models on grammatical agreement in Czech. The dataset consists of sentences in which the suffixes of past-tense verbs are marked. The task is to choose the correct verb suffix, which depends on the gender, number and animacy of the subject. This is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far apart, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
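
The suffix-selection task described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `*` marker for the suffix position, the candidate list `SUFFIXES` (the standard Czech past-tense endings), the example sentences, and `toy_score`, a stand-in for a real language model's sentence score. The idea is to fill each candidate suffix into the marked position, let the model score the resulting sentences, pick the highest-scoring one, and report accuracy against the gold suffix.

```python
from typing import Callable, Iterable, Tuple

# Czech past-tense endings (assumed candidate set, not the dataset's own list).
SUFFIXES = ("l", "la", "lo", "li", "ly")
MARK = "*"  # hypothetical marker for the suffix position

def choose_suffix(template: str, score: Callable[[str], float]) -> str:
    """Return the suffix whose filled-in sentence gets the highest score."""
    return max(SUFFIXES, key=lambda s: score(template.replace(MARK, s)))

def accuracy(items: Iterable[Tuple[str, str]], score: Callable[[str], float]) -> float:
    """Fraction of (template, gold suffix) pairs where the model picks the gold suffix."""
    items = list(items)
    correct = sum(choose_suffix(t, score) == gold for t, gold in items)
    return correct / len(items)

# Toy stand-in for a language model: counts bigrams it has "seen".
KNOWN = {("žena", "dělala"), ("muži", "dělali")}

def toy_score(sentence: str) -> float:
    words = sentence.lower().split()
    return float(sum((a, b) in KNOWN for a, b in zip(words, words[1:])))

# Invented example items: (sentence with marked suffix position, gold suffix).
ITEMS = [
    ("Žena děla* oběd", "la"),   # feminine singular subject
    ("Muži děla* oběd", "li"),   # masculine animate plural subject
    ("Stroje děla* hluk", "ly"), # masculine inanimate plural; unseen by the toy model
]

if __name__ == "__main__":
    print(accuracy(ITEMS, toy_score))
```

In practice `toy_score` would be replaced by a sentence log-probability from the language model under evaluation. The third item shows how an unseen subject forces the toy model to fall back on an arbitrary choice.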

Links

MUNI/A/0863/2015, internal MU code

Name: Čeština v jednotě synchronie a diachronie - 2016

Investor: Masaryk University, Category A

7F14047, research and development project

Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)

Investor: Ministry of Education, Youth and Sports of the CR
