Recognition of OCR Invoice Metadata Block Types

HA, Hien Thi, Aleš HORÁK, Marek MEDVEĎ and Zuzana NEVĚŘILOVÁ. Recognition of OCR Invoice Metadata Block Types. In P. Sojka, A. Horák, I. Kopeček, K. Pala. Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, p. 304-312. ISBN 978-3-030-00793-5. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_33.

Basic information
Original name	Recognition of OCR Invoice Metadata Block Types
Authors	HA, Hien Thi (704 Viet Nam, belonging to the institution), Aleš HORÁK (203 Czech Republic, guarantor, belonging to the institution), Marek MEDVEĎ (703 Slovakia, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution).
Edition	Switzerland, Text, Speech, and Dialogue, 21st International Conference, TSD 2018, p. 304-312, 9 pp. 2018.
Publisher	Springer International Publishing

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Switzerland
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
Impact factor	0.402 in 2005
RIV identification code	RIV/00216224:14330/18:00103049
Organization unit	Faculty of Informatics
ISBN	978-3-030-00793-5
ISSN	0302-9743
Doi	http://dx.doi.org/10.1007/978-3-030-00794-2_33
UT WoS	000611532300033
Keywords in English	OCR;scanned documents;document metadata;invoice metadata extraction
Tags	firank_B
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30/4/2019 07:42.

Abstract

Automatic cataloging of thousands of paper-based structured documents is a crucial cost-saving task for future document management systems. Current optical character recognition (OCR) systems process tabular data with sufficient character-level accuracy; however, recognizing the overall structure of the document metadata is still an open practical task. In this paper, we introduce the OCRMiner system, designed to extract the indexing metadata of structured documents obtained from an image scanning process and OCR. We present the details of the system's modular architecture and evaluate the detection of text block types that appear within invoice documents. The system is based on text analysis in combination with layout features, and is developed and tested in cooperation with a renowned copy machine producer. The system uses an open source OCR and reaches an overall accuracy of 80.1%.

Links
MUNI/A/0854/2017, MU internal code	Name: Large-scale computing systems: models, applications and verification VII.
MUNI/A/0854/2017, MU internal code	Investor: Masaryk University, Category A
MUNI/33/55939/2017, MU internal code	Name: Verification of the success of natural language processing techniques for information extraction from scanned documents
MUNI/33/55939/2017, MU internal code	Investor: Masaryk University
