2022
When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts
NOVOTNÝ, Vít and Aleš HORÁK
Basic information
Original title
When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts
Authors
NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution)
Edition
Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022, pp. 157-161, 5 pp., 2022
Publisher
Tribun EU
Other information
Language
English
Type of outcome
Proceedings paper
Field of study
10200 1.2 Computer and information sciences
Country of publisher
Czech Republic
Confidentiality
is not subject to a state or trade secret
Publication form
printed version "print"
RIV identification code
RIV/00216224:14330/22:00127481
Organization unit
Faculty of Informatics
ISBN
978-80-263-1752-4
ISSN
Keywords in English
optical character recognition; OCR; medieval texts; AHISTO project
Changed: 15 May 2024 09:24, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original language
Conversion of scanned images to the text form, denoted as optical character recognition or OCR, for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text pipeline (via optical character recognition) for medieval texts, named AHISTO OCR, and we released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one component, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and we show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We release the updated AHISTO OCR system and its test results again under open licenses.
Links
LM2018101, research and development project