HA, Hien Thi and Aleš HORÁK. Who is Selling to Whom – Feature Evaluation for Multi-block Classification in Invoice Information Extraction. Online. In Karpov A., Potapova R. SPECOM 2021: 23rd International Conference on Speech and Computer. St. Petersburg, Russia: Springer, 2021, p. 250-261. ISBN 978-3-030-87801-6. Available from: https://dx.doi.org/10.1007/978-3-030-87802-3_23.
Basic information
Original name Who is Selling to Whom – Feature Evaluation for Multi-block Classification in Invoice Information Extraction
Authors HA, Hien Thi (704 Viet Nam, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution).
Edition St. Petersburg, Russia, SPECOM 2021: 23rd International Conference on Speech and Computer, p. 250-261, 12 pp. 2021.
Publisher Springer
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
Impact factor 0.402 in 2005
RIV identification code RIV/00216224:14330/21:00123275
Organization unit Faculty of Informatics
ISBN 978-3-030-87801-6
ISSN 0302-9743
Doi http://dx.doi.org/10.1007/978-3-030-87802-3_23
Keywords in English OCR; Invoice; Block type classification; Seller; Buyer; Delivery address
Tags firank_B
Tags International impact, Reviewed
Changed by doc. RNDr. Aleš Horák, Ph.D., učo 1648. Changed: 10/10/2022 10:26.
Abstract
The invoice information extraction task aims to unify the automated processing of invoices in structured form and as scanned images. Recognizing pieces of information where a specific value is identified by a keyword (such as the invoice date) is a relatively well-managed task. On the other hand, identifying multi-block information on the invoice, such as distinguishing the seller, buyer, and delivery address, is much more challenging because invoice layouts vary widely. In this work, we present a new technique of feature extraction and classification to recognize the seller, buyer, and delivery address text blocks in scanned invoices, based on a combination of complex layout and annotated text features. The method considers not only the block positional features but also the relation between blocks and block contents at a higher level. The technique is implemented as a module of the OCRMiner system. We offer its detailed evaluation and error analysis on a dataset of more than five hundred Czech invoices, reaching an overall macro-average F1-score of 94%.
Links
LM2018101, research and development project. Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR
MUNI/A/1195/2021, internal MU code. Name: Applied research in search, analysis, and visualization of large-scale data, natural language processing, and applied artificial intelligence
Investor: Masaryk University