NOVOTNÝ, Vít and Aleš HORÁK. When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022. Brno: Tribun EU, 2022, p. 157-161. ISBN 978-80-263-1752-4.
Basic information
Original name When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution).
Edition Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022. p. 157-161, 5 pp. 2022.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Full text Workshop homepage
RIV identification code RIV/00216224:14330/22:00127481
Organization unit Faculty of Informatics
ISBN 978-80-263-1752-4
ISSN 2336-4289
Keywords in English optical character recognition; OCR; medieval texts; AHISTO project
Changed by: RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 15/5/2024 09:24.
Abstract
Converting scanned images to text, known as optical character recognition (OCR), is widely considered a solved problem for contemporary printed texts. However, the optical character recognition of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text OCR pipeline for medieval texts, named AHISTO OCR, and we released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one of its components, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and we show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We again release the updated AHISTO OCR system and its test results under open licenses.
Links
LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR