Neural Tagger for Czech Language: Capturing Linguistic
Phenomena in Web Corpora

NEVĚŘILOVÁ, Zuzana and Marie STARÁ. Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019, p. 23-32. ISBN 978-80-263-1517-9.

Basic information
Original name	Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora
Authors	NEVĚŘILOVÁ, Zuzana (203 Czech Republic, belonging to the institution) and Marie STARÁ (203 Czech Republic, belonging to the institution).
Edition	Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 23-32, 10 pp. 2019.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/19:00111625
Organization unit	Faculty of Informatics
ISBN	978-80-263-1517-9
ISSN	2336-4289
UT WoS	000604899800003
Keywords in English	Czech Tagger; Multi-word Expressions; Pretrained Word Embeddings
Tags	International impact, Reviewed
Changed by	Mgr. Michal Petr, učo 65024. Changed: 16/5/2022 15:20.

Abstract

We propose a new tagger for the Czech language, particularly for the tagset used to annotate corpora of the TenTen family. The tagger is based on neural networks with pretrained word embeddings. We selected the newest Czech Web corpus of the TenTen family as training data, but we removed sentences with phenomena that were often annotated incorrectly. We let the tagger learn the annotation of these phenomena on its own. We also experimented with the recognition of multi-word expressions since this information can support correct tagging. We evaluated the tagger on 6,950 sentences (84,023 tokens) from the cstenten17 corpus and achieved 75.25% accuracy when compared by tags. When compared by attributes, we achieved 91.62% accuracy; the accuracy of POS tag prediction is 96.5%.
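
The three accuracy figures in the abstract measure agreement at different granularities. The following is a minimal sketch of how such figures could be computed, not the authors' evaluation script. It assumes that a tag is a string of attribute-value pairs, each a lowercase attribute letter followed by a single value character (for example "k1gMnSc1"), and that the attribute "k" carries the part of speech. The function names and the exact definition of attribute accuracy (matching pairs over the union of attributes in the gold and predicted tags) are assumptions made for illustration.

```python
import re

# Assumed tag format: attribute letter + one value character, repeated,
# e.g. "k1gMnSc1" -> k=1, g=M, n=S, c=1. This is an illustrative assumption.
PAIR = re.compile(r"([a-z])([^a-z])")


def parse_tag(tag):
    """Split a tag string into a dict of attribute -> value."""
    return dict(PAIR.findall(tag))


def evaluate(gold, pred):
    """Return accuracy by whole tag, by POS attribute, and by attribute."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    tag_hits = pos_hits = attr_hits = attr_total = 0
    for g, p in zip(gold, pred):
        # Whole-tag comparison: every attribute must agree.
        tag_hits += g == p
        ga, pa = parse_tag(g), parse_tag(p)
        # POS comparison: only the assumed "k" attribute is checked.
        pos_hits += ga.get("k") == pa.get("k")
        # Attribute comparison: count agreeing pairs over the union of
        # attributes present in either tag (hypothetical definition).
        keys = set(ga) | set(pa)
        attr_total += len(keys)
        attr_hits += sum(ga.get(a) == pa.get(a) for a in keys)
    n = len(gold)
    return {
        "tag": tag_hits / n,
        "pos": pos_hits / n,
        "attribute": attr_hits / attr_total,
    }


if __name__ == "__main__":
    # Made-up toy data, one tag per token; not taken from the paper.
    gold = ["k1gMnSc1", "k5eAaImIp3nS", "k7c4"]
    pred = ["k1gMnSc4", "k5eAaImIp3nS", "k7c4"]
    print(evaluate(gold, pred))
```

In the toy data a single wrong case value makes the whole first tag count as an error, while most of its attributes and its part of speech still count as correct. This is why accuracy by tags is lower than accuracy by attributes, which in turn is lower than POS accuracy.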

Links
EF16_013/0001781, research and development project	Name: LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity
LM2015071, research and development project	Name: Language Research Infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)
LM2015071, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
