When Tesseract Brings Friends: Layout Analysis, Language
Identification, and Super-Resolution in the Optical Character
Recognition of Medieval Texts

D 2021

NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK

Basic information

Original name

When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts

Authors

NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution), Kristýna SEIDLOVÁ (203 Czech Republic, belonging to the institution), Tereza VRABCOVÁ (203 Czech Republic, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution)

Edition

Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), p. 29-39, 11 pp. 2021

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

Full text PDF; workshop homepage

RIV identification code

RIV/00216224:14330/21:00119901

Organization unit

Faculty of Informatics

ISBN

978-80-263-1670-1

ISSN

Keywords in English

Optical character recognition; Layout analysis; Language identification; Image super-resolution; Medieval texts

Changed: 15/5/2024 09:25, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Although scanned images of letterpress reprints from the 19th and 20th century are available, accurate optical character recognition (OCR) algorithms are required to extract searchable text from the scanned images. In our previous article [15], we have shown that the Tesseract 4 OCR algorithm was the second fastest and the most accurate among five different OCR algorithms. In this article, we investigate the impact of six preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.

Links

LM2018101, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)

Investor: Ministry of Education, Youth and Sports of the CR

Cite

NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek. Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, p. 29-39. ISBN 978-80-263-1670-1.

@inproceedings{1809738,
   author = {Novotný, Vít and Seidlová, Kristýna and Vrabcová, Tereza and Horák, Aleš},
   address = {Brno},
   booktitle = {Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)},
   editor = {Horák and Rychlý and Rambousek},
   keywords = {Optical character recognition; Layout analysis; Language identification; Image super-resolution; Medieval texts},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1670-1},
   pages = {29-39},
   publisher = {Tribun EU},
   title = {When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts},
   url = {https://nlp.fi.muni.cz/raslan/raslan21.pdf#page=37},
   year = {2021}
}

TY  - CONF
ID  - 1809738
AU  - Novotný, Vít
AU  - Seidlová, Kristýna
AU  - Vrabcová, Tereza
AU  - Horák, Aleš
PY  - 2021
TI  - When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts
PB  - Tribun EU
CY  - Brno
SN  - 9788026316701
KW  - Optical character recognition
KW  - Layout analysis
KW  - Language identification
KW  - Image super-resolution
KW  - Medieval texts
UR  - https://nlp.fi.muni.cz/raslan/raslan21.pdf#page=37
N2  - The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Although scanned images of letterpress reprints from the 19th and 20th century are available, accurate optical character recognition (OCR) algorithms are required to extract searchable text from the scanned images. In our previous article [15], we have shown that the Tesseract 4 OCR algorithm was the second fastest and the most accurate among five different OCR algorithms. In this article, we investigate the impact of six preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.
ER  -

NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek. \textit{Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)}. Brno: Tribun EU, 2021, p.~29-39. ISBN~978-80-263-1670-1.
