HA, Hien Thi, Aleš HORÁK and BUi MINH TUAN. Contract Metadata Identification in Czech Scanned Documents. Online. In Ana Paula Rocha; Luc Steels and Jaap van den Herik. Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART. Portugal: The SciTePress Digital Library, 2021, p. 795-802. ISBN 978-989-758-484-8. Available from: https://dx.doi.org/10.5220/0010243807950802.
Basic information
Original name Contract Metadata Identification in Czech Scanned Documents
Authors HA, Hien Thi (704 Viet Nam, belonging to the institution), Aleš HORÁK (203 Czech Republic, guarantor, belonging to the institution) and BUi MINH TUAN (704 Viet Nam).
Edition Portugal, Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, p. 795-802, 8 pp. 2021.
Publisher The SciTePress Digital Library
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/21:00121131
Organization unit Faculty of Informatics
ISBN 978-989-758-484-8
Doi http://dx.doi.org/10.5220/0010243807950802
UT WoS 000661455800087
Keywords in English Information Extraction; Scanned Documents; Document Metadata; Contract Metadata Extraction; Czech
Tags firank_B
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 23/5/2022 14:21.
Abstract
Although born-digital documents are now prevalent, business documents are still often exchanged as scanned images, a general human-readable format with one-to-one correspondence to the paper originals. Bulk processing of such scanned documents then requires human intervention to extract and enter the main document metadata. In this paper, we present the design and evaluation of a contract processing module in the OCRMiner system. The information extraction process combines layout properties with text analysis as input to a rule-based extraction with confidence score propagation. The first results, evaluated on public Czech contract documents, reach an item extraction accuracy of almost 88%.
Links
LM2018101, research and development project
Name: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR