When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts

D 2022

NOVOTNÝ, Vít and Aleš HORÁK

Basic information

Original title

When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts

Authors

NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022, pp. 157-161, 5 pp. 2022

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/22:00127481

Organization unit

Faculty of Informatics

ISBN

978-80-263-1752-4

Keywords in English

optical character recognition; OCR; medieval texts; AHISTO project

Changed: 15 May 2024 09:24, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original language

Conversion of scanned images to the text form, denoted as optical character recognition or OCR, for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text pipeline (via optical character recognition) for medieval texts, named AHISTO OCR, and we released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one component, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and we show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We release the updated AHISTO OCR system and its test results again under open licenses.

Related projects

LM2018101, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities

Cite

NOVOTNÝ, Vít and Aleš HORÁK. When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022. Brno: Tribun EU, 2022, pp. 157-161. ISBN 978-80-263-1752-4.

@inproceedings{2240146,
   author = {Novotný, Vít and Horák, Aleš},
   address = {Brno},
   booktitle = {Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022.},
   editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
   keywords = {optical character recognition; OCR; medieval texts; AHISTO project},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1752-4},
   pages = {157--161},
   publisher = {Tribun EU},
   title = {When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts},
   url = {https://nlp.fi.muni.cz/raslan/2022/paper12.pdf},
   year = {2022}
}

TY  - CPAPER
ID  - 2240146
AU  - Novotný, Vít
AU  - Horák, Aleš
PY  - 2022
TI  - When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts
PB  - Tribun EU
CY  - Brno
SN  - 9788026317524
KW  - optical character recognition
KW  - OCR
KW  - medieval texts
KW  - AHISTO project
UR  - https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
N2  - Conversion of scanned images to the text form, denoted as optical character recognition or OCR, for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text pipeline (via optical character recognition) for medieval texts, named AHISTO OCR, and we released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one component, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and we show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We release the updated AHISTO OCR system and its test results again under open licenses.
ER  -

NOVOTNÝ, Vít and Aleš HORÁK. When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}. Brno: Tribun EU, 2022, pp.~157--161. ISBN~978-80-263-1752-4.