Czech Grammar Agreement Dataset for Evaluation of Language Models

BAISA, Vít. Czech Grammar Agreement Dataset for Evaluation of Language Models. In RASLAN 2016 Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, p. 63-67. ISBN 978-80-263-1095-2.


Basic information
Original name	Czech Grammar Agreement Dataset for Evaluation of Language Models
Authors	BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, p. 63-67, 5 pp. 2016.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/16:00091975
Organization unit	Faculty of Informatics
ISBN	978-80-263-1095-2
ISSN	2336-4289
UT WoS	000466886400007
Keywords (in Czech)	jazykový model; gramatická shoda; slovesná přípona; čeština; podmět; přísudek; vyhodnocení; perplexita
Keywords in English	language model; grammar agreement; verb suffix; Czech language; subject; predicate; dataset; evaluation; perplexity
Tags	International impact, Reviewed
Changed by	Mgr. et Mgr. Vít Baisa, Ph.D., učo 139654. Changed: 27/5/2021 09:10.

Abstract

AGREE is a dataset and task for evaluating language models based on grammar agreement in Czech. The dataset consists of sentences in which the suffixes of past-tense verbs are marked. The task is to choose the correct verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far from each other, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
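The task described in the abstract can be illustrated with a minimal sketch: fill each candidate suffix into the marked position, score the resulting sentence with a language model, and pick the best-scoring one. Everything specific here is an assumption made for illustration and is not taken from the paper: the `***` slot marker, the item format, the function names, and the toy scorer, which stands in for a real language model's sentence score. The five candidate endings are the standard Czech past-tense (l-participle) forms.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Czech past-tense (l-participle) endings; the correct one depends on the
# gender, number and animacy of the subject.
SUFFIXES: Tuple[str, ...] = ("l", "la", "lo", "li", "ly")

# Hypothetical marker for the masked suffix (NOT the actual AGREE file format).
SLOT = "***"


def choose_suffix(template: str,
                  score: Callable[[str], float],
                  candidates: Sequence[str] = SUFFIXES) -> str:
    """Return the candidate whose filled-in sentence gets the highest score."""
    return max(candidates, key=lambda s: score(template.replace(SLOT, s)))


def accuracy(items: Iterable[Tuple[str, str]],
             score: Callable[[str], float]) -> float:
    """Fraction of (template, gold_suffix) items where the top choice is gold."""
    items = list(items)
    correct = sum(choose_suffix(t, score) == gold for t, gold in items)
    return correct / len(items)


def toy_score(sentence: str) -> float:
    """Stand-in for a language model: 1.0 for a few memorised sentences, else 0.0."""
    known = {"žena spala", "muži spali", "dítě spalo"}
    return float(sentence.lower().rstrip(".") in known)


# Hypothetical items: feminine singular, masculine animate plural, neuter singular.
ITEMS = [
    ("Žena spa***.", "la"),
    ("Muži spa***.", "li"),
    ("Dítě spa***.", "lo"),
]

if __name__ == "__main__":
    print(accuracy(ITEMS, toy_score))
```

In practice `toy_score` would be replaced by a function returning a sentence log-probability from the model under evaluation. Accuracy over the items then gives a single comparable number per model.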

Links
MUNI/A/0863/2015, MU internal code	Name: Čeština v jednotě synchronie a diachronie - 2016
MUNI/A/0863/2015, MU internal code	Investor: Masaryk University, Category A
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
