Structured Information Extraction from Pharmaceutical Records

BAMBUROVÁ, Michaela and Zuzana NEVĚŘILOVÁ. Structured Information Extraction from Pharmaceutical Records. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019, p. 55-62. ISBN 978-80-263-1530-8.


Basic information
Original name	Structured Information Extraction from Pharmaceutical Records
Authors	BAMBUROVÁ, Michaela (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution).
Edition	Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 55-62, 8 pp. 2019.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10200 1.2 Computer and information sciences
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/19:00111627
Organization unit	Faculty of Informatics
ISBN	978-80-263-1530-8
ISSN	2336-4289
UT WoS	000604899800007
Keywords in English	structured information extraction; table understanding; entity recognition
Changed by	Mgr. Michal Petr, učo 65024. Changed: 16/5/2022 15:23.

Abstract

The paper presents an iterative approach to understanding semi-structured or unstructured tabular data with pharmaceutical records. The task is to split records with entities such as drug name, dosage strength, dosage form, and package size into the appropriate columns. The data is provided by many suppliers, and so it is very diverse in terms of structure. Some of the records are easy to parse using regular expressions; others are difficult and need advanced methods. We used regular expressions for the easy-to-parse data and conditional random fields for the more complex records. We iteratively extend the training data set using the above methods together with manual corrections. Currently, the F1 score for correct classification into 5 classes is 95%.
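
As an illustration of the regular-expression step mentioned in the abstract, the following is a minimal sketch of how an easy-to-parse record might be split into columns. It is not the authors' implementation: the pattern, the field order (name, strength, form, package size), the unit and form abbreviations, and the function name parse_record are all assumptions made for this example. A record that does not match returns None, which is where a more advanced method such as conditional random fields would take over.

```python
import re

# Hypothetical pattern, assumed for illustration only: it expects the order
# drug name, dosage strength, dosage form, package size.
RECORD_RE = re.compile(
    r"""
    ^(?P<name>.+?)\s+                                       # drug name (lazy match)
    (?P<strength>\d+(?:[.,]\d+)?\s?(?:mg|g|ml|mcg|iu))\s+   # dosage strength with unit
    (?P<form>tbl|cps|tablets?|capsules?|sol|inj)\.?\s+      # dosage form (assumed abbreviations)
    (?P<package>\d+\s?(?:x\s?\d+)?)\s*(?:ks|pcs)?$          # package size, e.g. 30 or 10x10
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_record(text):
    """Split one record into columns, or return None if the pattern fails."""
    match = RECORD_RE.match(text.strip())
    if match is None:
        return None  # candidate for a more advanced method, e.g. a CRF tagger
    return {field: value.strip() for field, value in match.groupdict().items()}


if __name__ == "__main__":
    for line in ["Ibuprofen 400 mg tbl 30", "tbl Ibuprofen 30 ks 400"]:
        print(line, "->", parse_record(line))
```

Keeping the pattern strict is a deliberate choice in this sketch: a record that deviates from the assumed order is rejected rather than split incorrectly, so only confident matches reach the output columns.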

Links
EF16_013/0001781, research and development project	Name: LINDAT/CLARIN - Výzkumná infrastruktura pro jazykové technologie - rozšíření repozitáře a výpočetní kapacity
LM2015071, research and development project	Name: Jazyková výzkumná infrastruktura v České republice (Acronym: LINDAT-Clarin)
LM2015071, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
