Detailed Information on Publication Record
When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
NOVOTNÝ, Vít
Basic information
Abstract
In the original
Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.
In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.
We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.
Links
MUNI/A/1076/2019, internal MU code
MUNI/A/1411/2019, internal MU code