Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.
In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.
We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to the dataset.
Related projects
MUNI/A/1076/2019, MU internal code
Name: Involvement of Faculty of Informatics students in the international scientific community 20 (Acronym: SKOMU)
Investor: Masaryk University, Involvement of Faculty of Informatics students in the international scientific community 20, until 2020, Category A - Specific research - Student research projects
MUNI/A/1411/2019, MU internal code
Name: Applied research: software architectures of critical infrastructures, computer systems security, natural language processing and language engineering, big data visualization, and augmented reality
Investor: Masaryk University, Applied research: software architectures of critical infrastructures, computer systems security, natural language processing and language engineering, big data visualization, and augmented reality, until 2020, Category A - Specific research - Student research projects
NOVOTNÝ, Vít. When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun EU, 2020, pp. 3-12. ISBN 978-80-263-1600-8.
@inproceedings{1699697,
  author = {Novotný, Vít},
  address = {Brno},
  booktitle = {Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020},
  editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
  keywords = {Optical character recognition; OCR; Historical texts},
  howpublished = {printed version "print"},
  language = {eng},
  location = {Brno},
  isbn = {978-80-263-1600-8},
  pages = {3--12},
  publisher = {Tribun EU},
  title = {When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts},
  url = {http://raslan2020.nlp-consulting.net/},
  year = {2020}
}
TY  - JOUR
ID  - 1699697
AU  - Novotný, Vít
PY  - 2020
TI  - When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
PB  - Tribun EU
CY  - Brno
SN  - 9788026316008
KW  - Optical character recognition
KW  - OCR
KW  - Historical texts
UR  - http://raslan2020.nlp-consulting.net/
N2  -
Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.
In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.
We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to the dataset.
ER -
NOVOTNÝ, Vít. When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020}. Brno: Tribun EU, 2020, pp.~3--12. ISBN~978-80-263-1600-8.