NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek. Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, p. 29-39. ISBN 978-80-263-1670-1.
Basic information
Original name When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution), Kristýna SEIDLOVÁ (203 Czech Republic, belonging to the institution), Tereza VRABCOVÁ (203 Czech Republic, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution).
Edition Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), p. 29-39, 11 pp. 2021.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Full text PDF Workshop homepage
RIV identification code RIV/00216224:14330/21:00119901
Organization unit Faculty of Informatics
ISBN 978-80-263-1670-1
ISSN 2336-4289
Keywords in English Optical character recognition; Layout analysis; Language identification; Image super-resolution; Medieval texts
Changed by doc. RNDr. Aleš Horák, Ph.D., učo 1648. Changed: 1/12/2022 15:38.
Abstract
The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Although scanned images of letterpress reprints from the 19th and 20th centuries are available, accurate optical character recognition (OCR) algorithms are required to extract searchable text from the scanned images. In our previous article [15], we showed that the Tesseract 4 OCR algorithm was the second fastest and the most accurate among five different OCR algorithms. In this article, we investigate the impact of six preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.
Links
LM2018101, research and development project
Name: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR