When Tesseract Brings Friends: Layout Analysis, Language
Identification, and Super-Resolution in the Optical Character
Recognition of Medieval Texts

NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek. Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU. pp. 29-39. ISBN 978-80-263-1670-1. 2021.


Basic information
Original title	When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts
Authors	NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution), Kristýna SEIDLOVÁ (203 Czech Republic, belonging to the institution), Tereza VRABCOVÁ (203 Czech Republic, belonging to the institution) and Aleš HORÁK (203 Czech Republic, belonging to the institution).
Edition	Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), pp. 29-39, 11 pp. 2021.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	10200 1.2 Computer and information sciences
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	Full text PDF; workshop homepage
RIV identification code	RIV/00216224:14330/21:00119901
Organization unit	Faculty of Informatics
ISBN	978-80-263-1670-1
ISSN	2336-4289
Keywords in English	Optical character recognition; Layout analysis; Language identification; Image super-resolution; Medieval texts
Changed by	doc. RNDr. Aleš Horák, Ph.D., učo 1648. Changed: 1. 12. 2022 15:38.

Abstract

The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Although scanned images of letterpress reprints from the 19th and 20th centuries are available, accurate optical character recognition (OCR) algorithms are required to extract searchable text from them. In our previous article [15], we showed that Tesseract 4 was the second fastest and the most accurate of five OCR algorithms. In this article, we investigate the impact of three preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.

Links
LM2018101, research and development project	Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
LM2018101, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities
