New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models

D 2024

ANETTA, Krištof and Aleš HORÁK

Basic information

Original title

New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models

Authors

ANETTA, Krištof and Aleš HORÁK

Edition

Cham, Text, Speech, and Dialogue, pp. 110-120, 11 pp., 2024

Publisher

Springer Nature Switzerland

Other information

Language

English

Result type

Proceedings paper

Field

10200 1.2 Computer and information sciences

Publisher country

Switzerland

Confidentiality

not subject to state or trade secret

Publication form

storage medium (CD, DVD, flash drive)

Impact factor

Impact factor: 0.402 in 2005

Marked for transfer to RIV

Yes

RIV code

RIV/00216224:14330/24:00136991

Organizational unit

Faculty of Informatics

ISBN

978-3-031-70562-5

ISSN

Keywords in English

medical text analysis; electronic health records; medical concept terms; medical concept dataset; named entity recognition

Tags

firank_B

Flags

International significance, Peer-reviewed

Changed: 4 April 2025 12:12, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Following the widespread successes of leveraging recent large language models (LLMs) in various NLP tasks, this paper focuses on medical text content understanding. Adapting a foundational LLM to the medical domain requires a special kind of datasets where core medical concepts are accurately annotated. This paper addresses the need of better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset inception, management, considerations, processing, and finally presents baseline concept recognition model results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 in exact and 0.777 in partial medical concept recognition ranging from 0.335 to 0.857 per different concept classes. This paper then describes future plans of bootstrapping larger annotated corpora from the CSEHR dataset and of making the dataset publicly available. This endeavor is unique in the realm of Slavic languages and already at this stage it represents a major step in the field of Slavic medical concept recognition.

Links

LM2023062, R&D project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities

MUNI/A/1590/2023, MU internal code

Name: Using artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data

Investor: Masaryk University, Using artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data

Cite

ANETTA, Krištof and Aleš HORÁK. New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models. In Nöth, Elmar. Text, Speech, and Dialogue. Cham: Springer Nature Switzerland, 2024, pp. 110-120. ISBN 978-3-031-70562-5. Available from: https://doi.org/10.1007/978-3-031-70563-2_9.

@inproceedings{2427858,
   author = {Anetta, Krištof and Horák, Aleš},
   address = {Cham},
   booktitle = {Text, Speech, and Dialogue},
   doi = {10.1007/978-3-031-70563-2_9},
   editor = {Nöth, Elmar},
   keywords = {medical text analysis; electronic health records; medical concept terms; medical concept dataset; named entity recognition},
   howpublished = {storage medium},
   language = {eng},
   location = {Cham},
   isbn = {978-3-031-70562-5},
   pages = {110-120},
   publisher = {Springer Nature Switzerland},
   title = {New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models},
   year = {2024}
}

TY  - CONF
ID  - 2427858
AU  - Anetta, Krištof
AU  - Horák, Aleš
PY  - 2024
TI  - New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
PB  - Springer Nature Switzerland
CY  - Cham
SN  - 9783031705625
KW  - medical text analysis
KW  - electronic health records
KW  - medical concept terms
KW  - medical concept dataset
KW  - named entity recognition
N2  - Following the widespread successes of leveraging recent large language models (LLMs) in various NLP tasks, this paper focuses on medical text content understanding. Adapting a foundational LLM to the medical domain requires a special kind of datasets where core medical concepts are accurately annotated. This paper addresses the need of better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset inception, management, considerations, processing, and finally presents baseline concept recognition model results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 in exact and 0.777 in partial medical concept recognition ranging from 0.335 to 0.857 per different concept classes. This paper then describes future plans of bootstrapping larger annotated corpora from the CSEHR dataset and of making the dataset publicly available. This endeavor is unique in the realm of Slavic languages and already at this stage it represents a major step in the field of Slavic medical concept recognition.
ER  -

ANETTA, Krištof and Aleš HORÁK. New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models. In Nöth, Elmar. \textit{Text, Speech, and Dialogue}. Cham: Springer Nature Switzerland, 2024, pp.~110-120. ISBN~978-3-031-70562-5. Available from: https://doi.org/10.1007/978-3-031-70563-2\_{}9.
