2024
New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
ANETTA, Krištof and Aleš HORÁK
Basic information
Original title
New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
Authors
ANETTA, Krištof and Aleš HORÁK
Edition
Cham, Text, Speech, and Dialogue, pp. 110-120, 11 pp. 2024
Publisher
Springer Nature Switzerland
Other information
Language
English
Result type
Proceedings paper
Field
10200 1.2 Computer and information sciences
Publisher country
Switzerland
Confidentiality
not subject to state or trade secret
Publication form
storage medium (CD, DVD, flash disk)
Impact factor
Impact factor: 0.402 in 2005
Marked for transfer to RIV
Yes
RIV code
RIV/00216224:14330/24:00136991
Organizational unit
Faculty of Informatics
ISBN
978-3-031-70562-5
ISSN
UT WoS
EID Scopus
Keywords in English
medical text analysis; electronic health records; medical concept terms; medical concept dataset; named entity recognition
Tags
Flags
International significance, Peer-reviewed
Changed: 4. 4. 2025 12:12, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original
Following the widespread success of recent large language models (LLMs) in various NLP tasks, this paper focuses on medical text content understanding. Adapting a foundational LLM to the medical domain requires a special kind of dataset in which core medical concepts are accurately annotated. This paper addresses the need for better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset's inception, management, considerations, and processing, and finally presents baseline concept recognition model results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 in exact and 0.777 in partial medical concept recognition, ranging from 0.335 to 0.857 across the different concept classes. The paper then describes future plans for bootstrapping larger annotated corpora from the CSEHR dataset and for making the dataset publicly available. This endeavor is unique in the realm of Slavic languages, and already at this stage it represents a major step in the field of Slavic medical concept recognition.
Links
| LM2023062, R&D project |
| MUNI/A/1590/2023, MU internal code |