Measuring Redundancy in Czech Electronic Health Records:
Near-Duplicate Detection and Cluster Analysis

D 2025

ANETTA, Krištof and Aleš HORÁK

Basic information

Original title

Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis

Authors

ANETTA, Krištof and Aleš HORÁK

Edition

Brno, Czech Republic, Recent Advances in Slavonic Natural Language Processing, RASLAN 2025, pp. 117-126, 10 pp., 2025

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available online

References

Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2025.

Marked to be transferred to RIV

Yes

Organization unit

Faculty of Informatics

ISBN

978-80-263-1858-3

ISSN

Keywords in English

Electronic health records; EHR; corpus; dataset; redundancy; near-duplicate; deduplication; Czech.

Changed: 13 January 2026 12:39, Bc. Barbora Stenglová

Abstract

In the original language

Electronic health records (EHRs) contain extensive repetition arising from templated structures, copy-paste practices, and recurrent clinical phrasing. While such redundancy facilitates documentation consistency, it also affects the efficiency of data processing and downstream natural language processing applications. This study investigates the internal textual redundancy of a Czech dataset of narrative parts of oncology health records using a fast near-duplicate detection method and a subsequent clustering analysis. We quantify the degree and distribution of repeated content across documents, visualize the resulting clusters to identify patterns, and experiment with creating cluster-aware pruned datasets for more efficient language model training. For comparison, we report baseline redundancy measures on a Czech literary corpus, illustrating the contrast between natural and clinical text. In addition to providing insight into how redundancy shapes the linguistic and informational landscape of Czech EHRs, we discuss our findings in the context of state-of-the-art clinical LLMs for English, making a case not only for continued development of redundancy-mitigating approaches, but also for the use of synthetic health record data.

Links

LM2023062, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities

Cite

ANETTA, Krištof and Aleš HORÁK. Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis. Online. In A. Horák, P. Rychlý, A. Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing, RASLAN 2025. Brno, Czech Republic: Tribun EU, 2025, pp. 117-126. ISBN 978-80-263-1858-3.

@inproceedings{2546678,
   author = {Anetta, Krištof and Horák, Aleš},
   address = {Brno, Czech Republic},
   booktitle = {Recent Advances in Slavonic Natural Language Processing, RASLAN 2025},
   editor = {Horák, A. and Rychlý, P. and Rambousek, A.},
   keywords = {Electronic health records; EHR; corpus; dataset; redundancy; near-duplicate; deduplication; Czech.},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Brno, Czech Republic},
   isbn = {978-80-263-1858-3},
   pages = {117--126},
   publisher = {Tribun EU},
   title = {Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis},
   url = {https://nlp.fi.muni.cz/raslan/2025/},
   year = {2025}
}

TY  - CONF
ID  - 2546678
AU  - Anetta, Krištof
AU  - Horák, Aleš
PY  - 2025
TI  - Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis
PB  - Tribun EU
CY  - Brno, Czech Republic
SN  - 9788026318583
KW  - Electronic health records
KW  - EHR
KW  - corpus
KW  - dataset
KW  - redundancy
KW  - near-duplicate
KW  - deduplication
KW  - Czech
UR  - https://nlp.fi.muni.cz/raslan/2025/
N2  - Electronic health records (EHRs) contain extensive repetition arising from templated structures, copy-paste practices, and recurrent clinical phrasing. While such redundancy facilitates documentation consistency, it also affects the efficiency of data processing and downstream natural language processing applications. This study investigates the internal textual redundancy of a Czech dataset of narrative parts of oncology health records using a fast near-duplicate detection method and a subsequent clustering analysis. We quantify the degree and distribution of repeated content across documents, visualize the resulting clusters to identify patterns, and experiment with creating cluster-aware pruned datasets for more efficient language model training. For comparison, we report baseline redundancy measures on a Czech literary corpus, illustrating the contrast between natural and clinical text. In addition to providing insight into how redundancy shapes the linguistic and informational landscape of Czech EHRs, we discuss our findings in the context of state-of-the-art clinical LLMs for English, making a case not only for continued development of redundancy-mitigating approaches, but also for the use of synthetic health record data.
ER  -

ANETTA, Krištof and Aleš HORÁK. Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis. Online. In A. Horák, P. Rychlý, A. Rambousek (eds.). \textit{Recent Advances in Slavonic Natural Language Processing, RASLAN 2025}. Brno, Czech Republic: Tribun EU, 2025, pp.~117--126. ISBN~978-80-263-1858-3.
