D 2019

Structured Information Extraction from Pharmaceutical Records

BAMBUROVÁ, Michaela and Zuzana NEVĚŘILOVÁ

Basic information

Original name

Structured Information Extraction from Pharmaceutical Records

Authors

BAMBUROVÁ, Michaela (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 55-62, 8 pp. 2019

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Article in proceedings

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

RIV identification code

RIV/00216224:14330/19:00111627

Organization unit

Faculty of Informatics

ISBN

978-80-263-1530-8

ISSN

UT WoS

000604899800007

Keywords in English

structured information extraction; table understanding; entity recognition
Changed: 16/5/2022 15:23, Mgr. Michal Petr

Abstract

In the original

The paper presents an iterative approach to understanding semi-structured or unstructured tabular data with pharmaceutical records. The task is to split records with entities such as drug name, dosage strength, dosage form, and package size into the appropriate columns. The data is provided by many suppliers, and so it is very diverse in terms of structure. Some of the records are easy to parse using regular expressions; others are difficult and need advanced methods. We used regular expressions for the easy-to-parse data and conditional random fields for the more complex records. We iteratively extend the training data set using the above methods together with manual corrections. Currently, the F1 score for correct classification into 5 classes is 95%.
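
As a rough illustration of the regular-expression step described in the abstract, the following minimal sketch splits an easy-to-parse record into columns and sets aside anything it cannot match, so that a sequence model such as a conditional random field or a manual correction pass can handle it. The pattern, the unit list, the sample records, and the names `RECORD_RE`, `parse_record`, and `split_records` are all assumptions made for illustration; they are not taken from the paper.

```python
import re

# Hypothetical pattern for "easy" records of the shape:
#   <drug name> <strength + unit> <dosage form> <package size>
# The unit list and the field order are assumptions, not the authors' rules.
RECORD_RE = re.compile(
    r"""^\s*
    (?P<name>.+?)\s+
    (?P<strength>\d+(?:[.,]\d+)?\s*(?:mg|g|ml|mcg|iu))\s+
    (?P<form>[a-z][a-z\s.]*?)\s+
    (?P<package>\d+(?:\s*x\s*\d+)?)\s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_record(record):
    """Return a dict of columns, or None if the record is not easy to parse."""
    match = RECORD_RE.match(record)
    return match.groupdict() if match else None


def split_records(records):
    """Separate records the regex can parse from those left for other methods."""
    parsed, leftover = [], []
    for record in records:
        columns = parse_record(record)
        if columns is None:
            leftover.append(record)  # candidates for a CRF or manual correction
        else:
            parsed.append(columns)
    return parsed, leftover


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the sketch.
    samples = ["Ibuprofen 400 mg tablets 30", "tbl. Ibuprofen 30x400mg"]
    parsed, leftover = split_records(samples)
    print(parsed)
    print(leftover)
```

In an iterative setup like the one the abstract outlines, the parsed rows could serve as additional labelled training data, while the leftover rows are the ones that need the more advanced method.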

Links

EF16_013/0001781, research and development project
Name: LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computational Capacity
LM2015071, research and development project
Name: Language Research Infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR