Are We There Yet? A Thorough Evaluation of POS Tagging on Czech

OHLÍDALOVÁ, Vlasta; Miloš JAKUBÍČEK and Pavel RYCHLÝ

Basic information

Original title

Are We There Yet? A Thorough Evaluation of POS Tagging on Czech

Authors

OHLÍDALOVÁ, Vlasta; Miloš JAKUBÍČEK and Pavel RYCHLÝ

Edition

Proceedings, Part II. Erlangen, Germany, Text, Speech, and Dialogue, 28th International Conference, TSD 2025, pp. 263-274, 12 pp. 2026

Publisher

Springer, Cham

Other information

Language

English

Type of outcome

Paper in proceedings

Field of study

10200 1.2 Computer and information sciences

Country of publisher

Germany

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Links

Conference proceedings

Impact factor

Impact factor: 0.402 in 2005

Marked to be transferred to RIV

Yes

Organizational unit

Faculty of Informatics

ISBN

978-3-032-02550-0

ISSN

Keywords in English

morphological analysis; evaluation; POS tagging

Tags

firank_B

Flags

International impact, Reviewed

Changed: 1 April 2026 11:07, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

With recent advances in natural language processing, part-of-speech (POS) tagging is one of the areas that has seen significant improvements. Contemporary state-of-the-art tools report accuracies approaching 100% even for morphologically rich languages such as Czech, which used to pose a challenge. In this study, we investigate whether such accuracy is reproducible on real-world data, as previous research has demonstrated substantial discrepancies between evaluations conducted on gold-standard corpora and those based on text typically occurring on the web. To address this issue, we selected a set of widely used and well-established POS taggers and applied them to a random sample of documents from the csTenTen23 web corpus. Tokens for which the taggers produced differing outputs were then manually annotated. Our results indicate that the ability of modern POS taggers to handle real-world data – including a broad range of genres and topics – has improved significantly in comparison to the earlier statistically based POS taggers. Furthermore, we observe a shift in the most problematic tagging category: whereas case assignment was previously a major source of errors, the best current models struggle more with POS category distinctions. We argue that this shift may reflect ambiguities inherent in the POS category itself, where even human annotators may not fully agree.
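
The selection step described in the abstract (keeping only the tokens on which the taggers disagree, so that they can be annotated by hand) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the function name `disagreement_tokens`, the tagger names, the token-aligned input format, and the coarse tags are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def disagreement_tokens(
    tokens: List[str],
    tagger_outputs: Dict[str, List[str]],
) -> List[Tuple[int, str, Dict[str, str]]]:
    """Return (index, token, tags-by-tagger) for tokens the taggers tag differently.

    Assumes every tagger produced exactly one tag per token (hypothetical format).
    """
    for name, tags in tagger_outputs.items():
        if len(tags) != len(tokens):
            raise ValueError(
                f"{name} produced {len(tags)} tags for {len(tokens)} tokens"
            )

    selected = []
    for i, token in enumerate(tokens):
        tags = {name: out[i] for name, out in tagger_outputs.items()}
        # More than one distinct tag means the taggers disagree on this token.
        if len(set(tags.values())) > 1:
            selected.append((i, token, tags))
    return selected


# Made-up example data; the tagger names and tags are placeholders.
tokens = ["Ženu", "holí", "stroj"]
outputs = {
    "tagger_a": ["NOUN", "VERB", "NOUN"],
    "tagger_b": ["VERB", "VERB", "NOUN"],
}
to_annotate = disagreement_tokens(tokens, outputs)
```

Under these assumptions, only the first token would be passed on for manual annotation, since it is the only one that receives two different tags.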

Related projects

LM2023062, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities

Cite

OHLÍDALOVÁ, Vlasta; Miloš JAKUBÍČEK and Pavel RYCHLÝ. Are We There Yet? A Thorough Evaluation of POS Tagging on Czech. In Kamil Ekštein, Miloslav Konopík, Ondřej Pražák, František Pártl (Eds.). Text, Speech, and Dialogue, 28th International Conference, TSD 2025. Proceedings, Part II. Erlangen, Germany: Springer, Cham, 2026, pp. 263-274. ISBN 978-3-032-02550-0. Available from: https://doi.org/10.1007/978-3-032-02551-7_23.

@inproceedings{2514099,
   author = {Ohlídalová, Vlasta and Jakubíček, Miloš and Rychlý, Pavel},
   address = {Erlangen, Germany},
   booktitle = {Text, Speech, and Dialogue, 28th International Conference, TSD 2025},
   doi = {10.1007/978-3-032-02551-7_23},
   edition = {Proceedings, Part II.},
   editor = {Kamil Ekštein and Miloslav Konopík and Ondřej Pražák and František Pártl},
   keywords = {morphological analysis; evaluation; POS tagging},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Erlangen, Germany},
   isbn = {978-3-032-02550-0},
   pages = {263-274},
   publisher = {Springer, Cham},
   title = {Are We There Yet? A Thorough Evaluation of POS Tagging on Czech},
   url = {https://link.springer.com/book/10.1007/978-3-032-02551-7},
   year = {2026}
}

TY  - CONF
ID  - 2514099
AU  - Ohlídalová, Vlasta
AU  - Jakubíček, Miloš
AU  - Rychlý, Pavel
PY  - 2026
TI  - Are We There Yet? A Thorough Evaluation of POS Tagging on Czech
PB  - Springer, Cham
CY  - Erlangen, Germany
SN  - 9783032025500
KW  - morphological analysis
KW  - evaluation
KW  - POS tagging
UR  - https://link.springer.com/book/10.1007/978-3-032-02551-7
N2  - With recent advances in natural language processing, part-of-speech (POS) tagging is one of the areas that has seen significant improvements. Contemporary state-of-the-art tools report accuracies approaching 100% even for morphologically rich languages such as Czech, which used to pose a challenge. In this study, we investigate whether such accuracy is reproducible on real-world data, as previous research has demonstrated substantial discrepancies between evaluations conducted on gold-standard corpora and those based on text typically occurring on the web. To address this issue, we selected a set of widely used and well-established POS taggers and applied them to a random sample of documents from the csTenTen23 web corpus. Tokens for which the taggers produced differing outputs were then manually annotated. Our results indicate that the ability of modern POS taggers to handle real-world data – including a broad range of genres and topics – has improved significantly in comparison to the earlier statistically based POS taggers. Furthermore, we observe a shift in the most problematic tagging category: whereas case assignment was previously a major source of errors, the best current models struggle more with POS category distinctions. We argue that this shift may reflect ambiguities inherent in the POS category itself, where even human annotators may not fully agree.
ER  -

OHLÍDALOVÁ, Vlasta; Miloš JAKUBÍČEK and Pavel RYCHLÝ. Are We There Yet? A Thorough Evaluation of POS Tagging on Czech. In Kamil Ekštein, Miloslav Konopík, Ondřej Pražák, František Pártl (Eds.). \textit{Text, Speech, and Dialogue, 28th International Conference, TSD 2025}. Proceedings, Part II. Erlangen, Germany: Springer, Cham, 2026, pp.~263-274. ISBN~978-3-032-02550-0. Available from: https://doi.org/10.1007/978-3-032-02551-7\_{}23.
