Improving Machine Understanding of Czech Medical Text Using
Self-Supervised and Rule-Based Data Augmentation

D 2025

ANETTA, Krištof and Aleš HORÁK

Basic information

Original name

Improving Machine Understanding of Czech Medical Text Using Self-Supervised and Rule-Based Data Augmentation

Authors

ANETTA, Krištof (703 Slovakia, guarantor, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution)

Edition

Cham, Modeling Decisions for Artificial Intelligence, 22nd International Conference, MDAI 2025, p. 315-327, 386 pp. 2025

Publisher

Springer

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Switzerland

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

URL

Impact factor

Impact factor: 0.402 in 2005

Organization unit

Faculty of Informatics

ISBN

978-3-032-00890-9

ISSN

DOI

http://dx.doi.org/10.1007/978-3-032-00891-6_25

Keywords in English

EHR; health records; medical text; clinical text; data augmentation; annotation; self-supervised; bootstrapping; Czech

Abstract

In the original

Medical doctor decision-making benefits from the development of effective support software. But for software to accurately interpret meaning and assist in clinical contexts, high-quality annotated health record data must be available for training and evaluation. This paper addresses this issue in the Czech language context, detailing a stage in a unique electronic health record (EHR) bootstrapping project. Using over 42 million words of Czech oncology records, we curated the creation of the CSEHR dataset: over 62,000 words of text with manually annotated medical concepts, out of which over 12,000 have been developed through multiple stages of review to serve as ground truth. We are leveraging this seed data to bootstrap larger annotated corpora, enabling scalable development of Czech healthcare NLP applications. This paper focuses on combining two data augmentation approaches. Approach 1, semi-supervised, consists in automated dataset augmentation using self-annotation to increase annotation density. Approach 2, based on distant supervision, consists in manual development of rules for improving annotations in training data. Results show that combining these two approaches on training data and fine-tuning an XLM-RoBERTa model for entity recognition increases the token classification F1 score by more than 5 points. This demonstrates the promise of this technique in further bootstrapping steps.
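The two augmentation approaches named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label names (DRUG, DOSE, UNIT), the confidence threshold, the single regex rule, and the helper names self_annotate, apply_rules, and token_f1 are all hypothetical, and the example compares label sequences directly instead of fine-tuning a model.

```python
import re

def self_annotate(labels, predicted, confidence, threshold=0.9):
    """Approach 1 (sketch): fill unannotated "O" tokens with confident model predictions."""
    return [
        p if g == "O" and p != "O" and c >= threshold else g
        for g, p, c in zip(labels, predicted, confidence)
    ]

def apply_rules(tokens, labels, rules):
    """Approach 2 (sketch): overwrite a label when a token fully matches a hand-written pattern."""
    out = list(labels)
    for i, tok in enumerate(tokens):
        for pattern, label in rules:
            if re.fullmatch(pattern, tok, flags=re.IGNORECASE):
                out[i] = label
                break
    return out

def token_f1(reference, predicted):
    """Micro-averaged token-level F1, ignoring the "O" (no entity) class."""
    tp = sum(r == p != "O" for r, p in zip(reference, predicted))
    pred_pos = sum(p != "O" for p in predicted)
    ref_pos = sum(r != "O" for r in reference)
    if tp == 0 or pred_pos == 0 or ref_pos == 0:
        return 0.0
    precision, recall = tp / pred_pos, tp / ref_pos
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy sentence; none of this data comes from the CSEHR dataset.
tokens = ["Pacient", "užívá", "Paralen", "500", "mg", "denně"]
sparse = ["O", "O", "DRUG", "O", "O", "O"]            # sparsely annotated seed labels
model_pred = ["O", "O", "DRUG", "DOSE", "O", "O"]     # assumed model output
model_conf = [0.99, 0.98, 0.97, 0.95, 0.60, 0.99]     # assumed per-token confidence
rules = [(r"mg|ml|g", "UNIT")]                        # assumed hand-written rule
reference = ["O", "O", "DRUG", "DOSE", "UNIT", "O"]   # assumed fully annotated reference

dense = self_annotate(sparse, model_pred, model_conf)
augmented = apply_rules(tokens, dense, rules)
print(token_f1(reference, sparse), token_f1(reference, augmented))
```

In this toy setting the self-annotation step adds the DOSE label and the rule adds the UNIT label, so the augmented labels cover more of the reference than the sparse seed labels do. In the paper the augmented data is instead used to fine-tune an XLM-RoBERTa model, and F1 is measured on that model's predictions.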

Links

LM2023062, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy)

Investor: Ministry of Education, Youth and Sports of the CR

90254, large research infrastructures

Name: e-INFRA CZ II

Cite

ANETTA, Krištof and Aleš HORÁK. Improving Machine Understanding of Czech Medical Text Using Self-Supervised and Rule-Based Data Augmentation. In Vicenç Torra, Yasuo Narukawa, Josep Domingo-Ferrer. Modeling Decisions for Artificial Intelligence, 22nd International Conference, MDAI 2025. Cham: Springer, 2025, p. 315-327, 386 pp. ISBN 978-3-032-00890-9. Available from: https://dx.doi.org/10.1007/978-3-032-00891-6_25.

@inproceedings{2498157,
   author = {Anetta, Krištof and Horák, Aleš},
   address = {Cham},
   booktitle = {Modeling Decisions for Artificial Intelligence, 22nd International Conference, MDAI 2025},
   doi = {10.1007/978-3-032-00891-6_25},
   editor = {Vicenç Torra and Yasuo Narukawa and Josep Domingo-Ferrer},
   keywords = {EHR; health records; medical text; clinical text; data augmentation; annotation; self-supervised; bootstrapping; Czech},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Cham},
   isbn = {978-3-032-00890-9},
   note = {CORE B},
   pages = {315--327},
   publisher = {Springer},
   title = {Improving Machine Understanding of Czech Medical Text Using Self-Supervised and Rule-Based Data Augmentation},
   url = {https://link.springer.com/chapter/10.1007/978-3-032-00891-6_25},
   year = {2025}
}

TY  - CONF
ID  - 2498157
AU  - Anetta, Krištof
AU  - Horák, Aleš
PY  - 2025
TI  - Improving Machine Understanding of Czech Medical Text Using Self-Supervised and Rule-Based Data Augmentation
PB  - Springer
CY  - Cham
SN  - 9783032008909
N1  - CORE B
KW  - EHR
KW  - health records
KW  - medical text
KW  - clinical text
KW  - data augmentation
KW  - annotation
KW  - self-supervised
KW  - bootstrapping
KW  - Czech
UR  - https://link.springer.com/chapter/10.1007/978-3-032-00891-6_25
N2  - Medical doctor decision-making benefits from the development of effective support software. But for software to accurately interpret meaning and assist in clinical contexts, high-quality annotated health record data must be available for training and evaluation. This paper addresses this issue in the Czech language context, detailing a stage in a unique electronic health record (EHR) bootstrapping project. Using over 42 million words of Czech oncology records, we curated the creation of the CSEHR dataset: over 62,000 words of text with manually annotated medical concepts, out of which over 12,000 have been developed through multiple stages of review to serve as ground truth. We are leveraging this seed data to bootstrap larger annotated corpora, enabling scalable development of Czech healthcare NLP applications. This paper focuses on combining two data augmentation approaches. Approach 1, semi-supervised, consists in automated dataset augmentation using self-annotation to increase annotation density. Approach 2, based on distant supervision, consists in manual development of rules for improving annotations in training data. Results show that combining these two approaches on training data and fine-tuning an XLM-RoBERTa model for entity recognition increases the token classification F1 score by more than 5 points. This demonstrates the promise of this technique in further bootstrapping steps.
ER  -

ANETTA, Krištof and Aleš HORÁK. Improving Machine Understanding of Czech Medical Text Using Self-Supervised and Rule-Based Data Augmentation. In Vicen\c c Torra, Yasuo Narukawa, Josep Domingo-Ferrer. \textit{Modeling Decisions for Artificial Intelligence, 22nd International Conference, MDAI 2025}. Cham: Springer, 2025, p.~315-327, 386 pp. ISBN~978-3-032-00890-9. Available from: https://dx.doi.org/10.1007/978-3-032-00891-6\_{}25.
