HA, Hien Thi, Aleš HORÁK, Marek MEDVEĎ and Zuzana NEVĚŘILOVÁ. Recognition of OCR Invoice Metadata Block Types. In P. Sojka, A. Horák, I. Kopeček, K. Pala. Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, p. 304-312. ISBN 978-3-030-00793-5. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_33.
Basic information
Original name Recognition of OCR Invoice Metadata Block Types
Authors HA, Hien Thi (704 Viet Nam, belonging to the institution), Aleš HORÁK (203 Czech Republic, guarantor, belonging to the institution), Marek MEDVEĎ (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution).
Edition Switzerland, Text, Speech, and Dialogue, 21st International Conference, TSD 2018, p. 304-312, 9 pp. 2018.
Publisher Springer International Publishing
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Switzerland
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
Impact factor 0.402 in 2005
RIV identification code RIV/00216224:14330/18:00103049
Organization unit Faculty of Informatics
ISBN 978-3-030-00793-5
ISSN 0302-9743
Doi http://dx.doi.org/10.1007/978-3-030-00794-2_33
UT WoS 000611532300033
Keywords in English OCR;scanned documents;document metadata;invoice metadata extraction
Tags firank_B
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30/4/2019 07:42.
Abstract
Automatic cataloging of thousands of paper-based structured documents is a crucial fund-saving task for future document management systems. Current optical character recognition (OCR) systems process tabular data with a sufficient level of character-level accuracy; however, recognizing the overall structure of the document metadata is still an open practical task. In this paper, we introduce the OCRMiner system, designed to extract the indexing metadata of structured documents obtained from an image scanning process and OCR. We present the details of the system's modular architecture and evaluate the detection of text block types that appear within invoice documents. The system is based on text analysis in combination with layout features, and is developed and tested in cooperation with a renowned copy machine producer. The system uses an open source OCR and reaches an overall accuracy of 80.1%.
Links
MUNI/A/0854/2017, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace VII (Large-scale computing systems: models, applications and verification VII)
Investor: Masaryk University, Category A
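MUNI/33/55939/2017, internal MU code
Name: Ověření úspěšnosti technik zpracování přirozeného jazyka pro extrakci informací ze skenovaných dokumentů (Verifying the success rate of natural language processing techniques for information extraction from scanned documents)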
Investor: Masaryk University