NOVOTNÝ, Vít. When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun EU, 2020, p. 3-12. ISBN 978-80-263-1600-8.
Basic information
Original name When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition Brno, Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020, p. 3-12, 10 pp. 2020.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Workshop homepage PDF
RIV identification code RIV/00216224:14330/20:00117104
Organization unit Faculty of Informatics
ISBN 978-80-263-1600-8
ISSN 2336-4289
UT WoS 000655471300001
Keywords in English Optical character recognition; OCR; Historical texts
Tags OCR, Optical Character Recognition
Tags International impact
Changed by Mgr. Michal Petr, učo 65024. Changed: 16/5/2022 15:06.
Abstract

Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.

In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.

We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.

Links
MUNI/A/1076/2019, internal MU code. Name: Involvement of Faculty of Informatics students in the international scientific community 20 (Acronym: SKOMU)
Investor: Masaryk University, Category A
MUNI/A/1411/2019, internal MU code. Name: Applied research: software architectures of critical infrastructures, computer systems security, natural language processing and language engineering, big data visualization and augmented reality.
Investor: Masaryk University, Category A