When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

D 2020

NOVOTNÝ, Vít

Basic information

Original name

When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

Authors

NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020, p. 3-12, 10 pp. 2020

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

Workshop homepage; PDF

RIV identification code

RIV/00216224:14330/20:00117104

Organization unit

Faculty of Informatics

ISBN

978-80-263-1600-8

ISSN

UT WoS

000655471300001

Keywords in English

Optical character recognition; OCR; Historical texts

Abstract

In the original

Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.

In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.

We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.

Links

MUNI/A/1076/2019, MU internal code

Name: Involvement of Students of the Faculty of Informatics in the International Scientific Community 20 (Acronym: SKOMU)

Investor: Masaryk University, Category A

MUNI/A/1411/2019, MU internal code

Name: Applied research: software architectures of critical infrastructures, computer systems security, natural language processing and language engineering, big data visualization, and augmented reality.

Investor: Masaryk University, Category A
