2025
Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis
ANETTA, Krištof and Aleš HORÁK
Basic information
Original title
Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis
Authors
ANETTA, Krištof and Aleš HORÁK
Edition
Brno, Czech Republic, Recent Advances in Slavonic Natural Language Processing, RASLAN 2025, pp. 117-126, 10 pp., 2025
Publisher
Tribun EU
Other information
Language
English
Result type
Proceedings paper
Field
10200 1.2 Computer and information sciences
Publisher's country
Czech Republic
Confidentiality
not subject to state or trade secret
Publication form
electronic version "online"
Marked for transfer to RIV
Yes
Organizational unit
Faculty of Informatics
ISBN
978-80-263-1858-3
ISSN
Keywords in English
Electronic health records; EHR; corpus; dataset; redundancy; near-duplicate; deduplication; Czech.
Changed: 13. 1. 2026 12:39, Bc. Barbora Stenglová
Abstract
In the original
Electronic health records (EHRs) contain extensive repetition arising from templated structures, copy-paste practices, and recurrent clinical phrasing. While such redundancy facilitates documentation consistency, it also affects the efficiency of data processing and downstream natural language processing applications. This study investigates the internal textual redundancy of a Czech dataset of narrative parts of oncology health records using a fast near-duplicate detection method and a subsequent clustering analysis. We quantify the degree and distribution of repeated content across documents, visualize the resulting clusters to identify patterns, and experiment with creating cluster-aware pruned datasets for more efficient language model training. For comparison, we report baseline redundancy measures on a Czech literary corpus, illustrating the contrast between natural and clinical text. In addition to providing insight into how redundancy shapes the linguistic and informational landscape of Czech EHRs, we discuss our findings in the context of state-of-the-art clinical LLMs for English, making a case not only for continued development of redundancy-mitigating approaches, but also for the use of synthetic health record data.
Links
LM2023062, R&D project