Structured Information Extraction from Pharmaceutical Records

D 2019

BAMBUROVÁ, Michaela and Zuzana NEVĚŘILOVÁ

Basic information

Original title

Structured Information Extraction from Pharmaceutical Records

Authors

BAMBUROVÁ, Michaela (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 55-62, 8 pp. 2019

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

Links

URL

RIV identification code

RIV/00216224:14330/19:00111627

Organization unit

Faculty of Informatics

ISBN

978-80-263-1530-8

ISSN

UT WoS

000604899800007

Keywords in English

structured information extraction; table understanding; entity recognition

Changed: 16 May 2022 15:23, Mgr. Michal Petr

Abstract

In the original language

The paper presents an iterative approach to understanding semi-structured or unstructured tabular data with pharmaceutical records. The task is to split records with entities such as drug name, dosage strength, dosage form, and package size into the appropriate columns. The data is provided by many suppliers, and so it is very diverse in terms of structure. Some of the records are easy to parse using regular expressions; others are difficult and need advanced methods. We used regular expressions for the easy-to-parse data and conditional random fields for the more complex records. We iteratively extend the training data set using the above methods together with manual corrections. Currently, the F1 score for correct classification into 5 classes is 95%.
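
The regular-expression stage mentioned in the abstract could look roughly like the sketch below. It is an illustration only: the record layout, the unit and dosage-form lists, the pattern `RECORD_RE`, and the function `parse_record` are all assumptions made for this example and are not taken from the paper. Records that do not match return `None`, which stands in for handing them to a statistical model such as the conditional random fields named in the abstract.

```python
import re

# Hypothetical pattern for "easy" records laid out as
# "<drug name> <strength> <dosage form> <package size>".
# The field order, units, and form abbreviations are illustrative
# assumptions, not the patterns used by the authors.
RECORD_RE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<strength>\d+(?:[.,]\d+)?\s?(?:mg|g|ml|mcg|%))\s+"
    r"(?P<form>tbl|cps|sol|ung|inj)\s+"
    r"(?P<package>\d+\s?(?:x\s?\d+)?)$",
    re.IGNORECASE,
)


def parse_record(record: str):
    """Split a record into columns, or return None if it does not match."""
    match = RECORD_RE.match(record.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Invented example records, not data from the paper.
    print(parse_record("Ibuprofen 400 mg tbl 30"))
    print(parse_record("tbl ibuprofen 30x 400mg"))  # falls through to None
```

A fixed pattern like this only covers suppliers whose records follow one layout, which is consistent with the abstract's point that more diverse records need a learned model.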

Related projects

EF16_013/0001781, research and development project

Name: LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity

LM2015071, research and development project

Name: Language Research Infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Project LINDAT-Clarin - Establishment and operation of the Czech node of the pan-European infrastructure for research

Cite

BAMBUROVÁ, Michaela and Zuzana NEVĚŘILOVÁ. Structured Information Extraction from Pharmaceutical Records. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019, p. 55-62. ISBN 978-80-263-1530-8.

@inproceedings{1590018,
   author = {Bamburová, Michaela and Nevěřilová, Zuzana},
   address = {Brno},
   booktitle = {Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019},
   editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
   keywords = {structured information extraction; table understanding; entity recognition},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1530-8},
   pages = {55-62},
   publisher = {Tribun EU},
   title = {Structured Information Extraction from Pharmaceutical Records},
   url = {https://nlp.fi.muni.cz/raslan/2019/paper09-bamburova.pdf},
   year = {2019}
}

TY  - CONF
ID  - 1590018
AU  - Bamburová, Michaela
AU  - Nevěřilová, Zuzana
PY  - 2019
TI  - Structured Information Extraction from Pharmaceutical Records
PB  - Tribun EU
CY  - Brno
SN  - 9788026315308
KW  - structured information extraction
KW  - table understanding
KW  - entity recognition
UR  - https://nlp.fi.muni.cz/raslan/2019/paper09-bamburova.pdf
N2  - The paper presents an iterative approach to understanding semi-structured or unstructured tabular data with pharmaceutical records. The task is to split records with entities such as drug name, dosage strength, dosage form, and package size into the appropriate columns. The data is provided by many suppliers, and so it is very diverse in terms of structure. Some of the records are easy to parse using regular expressions; others are difficult and need advanced methods. We used regular expressions for the easy-to-parse data and conditional random fields for the more complex records. We iteratively extend the training data set using the above methods together with manual corrections. Currently, the F1 score for correct classification into 5 classes is 95%.
ER  -

BAMBUROVÁ, Michaela and Zuzana NEVĚŘILOVÁ. Structured Information Extraction from Pharmaceutical Records. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019}. Brno: Tribun EU, 2019, p.~55-62. ISBN~978-80-263-1530-8.