Detailed Information on Publication Record
2018
Recognition of OCR Invoice Metadata Block Types
HA, Hien Thi, Aleš HORÁK, Marek MEDVEĎ and Zuzana NEVĚŘILOVÁ
Basic information
Original name
Recognition of OCR Invoice Metadata Block Types
Authors
HA, Hien Thi (704 Viet Nam, belonging to the institution), Aleš HORÁK (203 Czech Republic, guarantor, belonging to the institution), Marek MEDVEĎ (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution)
Edition
Switzerland, Text, Speech, and Dialogue, 21st International Conference, TSD 2018, p. 304-312, 9 pp. 2018
Publisher
Springer International Publishing
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Switzerland
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
Impact factor
Impact factor: 0.402 in 2005
RIV identification code
RIV/00216224:14330/18:00103049
Organization unit
Faculty of Informatics
ISBN
978-3-030-00793-5
ISSN
UT WoS
000611532300033
Keywords in English
OCR;scanned documents;document metadata;invoice metadata extraction
Tags
International impact, Reviewed
Changed: 30/4/2019 07:42, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original
Automatic cataloging of thousands of paper-based structured documents is a crucial cost-saving task for future document management systems. Current optical character recognition (OCR) systems process tabular data with sufficient character-level accuracy; however, recognizing the overall structure of the document metadata is still an open practical task. In this paper, we introduce the OCRMiner system, designed to extract the indexing metadata of structured documents obtained from an image scanning process and OCR. We present the details of the system's modular architecture and evaluate the detection of text block types that appear within invoice documents. The system is based on text analysis in combination with layout features, and is developed and tested in cooperation with a renowned copy machine producer. The system uses an open source OCR and reaches an overall accuracy of 80.1%.
Links
MUNI/A/0854/2017, internal MU code
MUNI/33/55939/2017, internal MU code