D 2021

Transferability of General Polish NER to Electronic Health Records

ANETTA, Krištof and Mahmut ARSLAN

Basic information

Original name

Transferability of General Polish NER to Electronic Health Records

Authors

ANETTA, Krištof (203 Czech Republic, guarantor, belonging to the institution) and Mahmut ARSLAN (792 Turkey)

Edition

Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), p. 151-159, 9 pp. 2021

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/21:00123253

Organization unit

Faculty of Informatics

ISBN

978-80-263-1670-1

ISSN

Keywords in English

EHR; Electronic health records; Healthcare texts; NER; Named entity recognition; NLP; Natural language processing; Slavic languages; Polish; PolDeepNer2; spaCy; Spark NLP
Changed: 15/5/2024 10:23, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

This paper investigates the transferability of general Polish named entity recognition tools to the analysis of Polish health records. The tools, namely PolDeepNer2, spaCy’s pl_core_news_lg pipeline and Spark NLP’s entity_recognizer_md pipeline for Polish, were run on the pl_ehr_cardio corpus and their results were analyzed, paying special attention to their performance when processing these highly specific texts and to the applicability of the results in the healthcare domain. Even though the precision of PolDeepNer2 proved to be superior to both spaCy and Spark NLP, the paper concludes that without additional training, general named entity recognition tools for Polish have very limited use in the medical analysis of electronic health records. However, they could be helpful in partial tasks ranging from de-identification to entity disambiguation and discovery of mistyped entities or candidate entities that are not present in medical dictionaries.

Links

LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR
MUNI/IGA/1505/2020, internal MU code
Name: Electronic Health Record Analysis using Deep Learning (Acronym: Health Record Analysis with Deep Learning)
Investor: Masaryk University