D 2022

When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts

NOVOTNÝ, Vít and Aleš HORÁK

Basic information

Original name

When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts

Authors

NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022. p. 157-161, 5 pp. 2022

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/22:00127481

Organization unit

Faculty of Informatics

ISBN

978-80-263-1752-4

ISSN

Keywords in English

optical character recognition; OCR; medieval texts; AHISTO project
Changed: 15/5/2024 09:24, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Conversion of scanned images to the text form, denoted as optical character recognition or OCR, for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text pipeline (via optical character recognition) for medieval texts, named AHISTO OCR, and we released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one component, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and we show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We release the updated AHISTO OCR system and its test results again under open licenses.

Links

LM2018101, research and development project
Name: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy (in English: Digital Research Infrastructure for Language Technologies, Arts and Humanities; Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR