Semi-automatic building of large-scale digital dictionaries

D 2021

BLAHUŠ, Marek; Michal CUKR; Ondřej HERMAN; Miloš JAKUBÍČEK; Vojtěch KOVÁŘ et al.

Basic information

Original name

Semi-automatic building of large-scale digital dictionaries

Authors

BLAHUŠ, Marek (203 Czech Republic); Michal CUKR (203 Czech Republic); Ondřej HERMAN (203 Czech Republic, belonging to the institution); Miloš JAKUBÍČEK (203 Czech Republic, belonging to the institution); Vojtěch KOVÁŘ (203 Czech Republic, guarantor, belonging to the institution) and Marek MEDVEĎ (703 Slovakia, belonging to the institution)

Edition

Brno, Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference, p. 396-407, 12 pp. 2021

Publisher

Lexical Computing CZ s.r.o.

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

URL

RIV identification code

RIV/00216224:14330/21:00136004

Organization unit

Faculty of Informatics

ISSN

2533-5626

EID Scopus

2-s2.0-85137098115

Keywords in English

post-editing lexicography; dictionary drafting; Sketch Engine

Abstract

In the original language

This paper presents a novel way of creating dictionaries using a particular post-editing workflow, carried out in the context of building a set of three bilingual dictionaries: Tagalog, Urdu and Lao dictionaries with translations into English and Korean. The dictionaries were created completely from scratch, without reusing any existing content and in a completely automatic manner, amounting to 50,000 headwords, of which 15,000 were subject to subsequent manual post-editing. In the paper we discuss the post-editing methodology that we used and its impact on the overall lexicographic workflow. We describe the web corpora that were built specifically for these three dictionaries, their annotations (such as PoS tagging and lemmatisation), and the tools used for corpus annotation, for automating individual entry parts, and for post-editing them. Most of the automatic drafting and post-editing relied on a backbone consisting of the Sketch Engine corpus management system and the Lexonomy dictionary editor. We also detail the overall amount of work involved in each post-editing step, the technical and managerial difficulties faced in the project, and the major technological issues that still need improvement in the post-editing scenario.

Links

LM2018101, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)

Investor: Ministry of Education, Youth and Sports of the CR

Cite

BLAHUŠ, Marek; Michal CUKR; Ondřej HERMAN; Miloš JAKUBÍČEK; Vojtěch KOVÁŘ and Marek MEDVEĎ. Semi-automatic building of large-scale digital dictionaries. Online. In Kosem, Iztok; Cukr, Michal; Jakubíček, Miloš; Kallas, Jelena; Krek, Simon; Tiberius, Carole. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference. Brno: Lexical Computing CZ s.r.o., 2021, p. 396-407. ISSN 2533-5626.

@inproceedings{2401742,
   author = {Blahuš, Marek and Cukr, Michal and Herman, Ondřej and Jakubíček, Miloš and Kovář, Vojtěch and Medveď, Marek},
   address = {Brno},
   booktitle = {Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference},
   editor = {Kosem, Iztok and Cukr, Michal and Jakubíček, Miloš and Kallas, Jelena and Krek, Simon and Tiberius, Carole},
   keywords = {post-editing lexicography; dictionary drafting; Sketch Engine},
   howpublished = {electronic version "online"},
   language = {eng},
   location = {Brno},
   pages = {396-407},
   publisher = {Lexical Computing CZ s.r.o.},
   title = {Semi-automatic building of large-scale digital dictionaries},
   url = {https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_23_pp396-407.pdf},
   year = {2021}
}

TY  - CONF
ID  - 2401742
AU  - Blahuš, Marek
AU  - Cukr, Michal
AU  - Herman, Ondřej
AU  - Jakubíček, Miloš
AU  - Kovář, Vojtěch
AU  - Medveď, Marek
PY  - 2021
TI  - Semi-automatic building of large-scale digital dictionaries
PB  - Lexical Computing CZ s.r.o.
CY  - Brno
KW  - post-editing lexicography
KW  - dictionary drafting
KW  - Sketch Engine
UR  - https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_23_pp396-407.pdf
N2  - This paper presents a novel way of creating dictionaries using a particular post-editing workflow, carried out in the context of building a set of three bilingual dictionaries: Tagalog, Urdu and Lao dictionaries with translations into English and Korean. The dictionaries were created completely from scratch, without reusing any existing content and in a completely automatic manner, amounting to 50,000 headwords, of which 15,000 were subject to subsequent manual post-editing. In the paper we discuss the post-editing methodology that we used and its impact on the overall lexicographic workflow. We describe the web corpora that were built specifically for these three dictionaries, their annotations (such as PoS tagging and lemmatisation), and the tools used for corpus annotation, for automating individual entry parts, and for post-editing them. Most of the automatic drafting and post-editing relied on a backbone consisting of the Sketch Engine corpus management system and the Lexonomy dictionary editor. We also detail the overall amount of work involved in each post-editing step, the technical and managerial difficulties faced in the project, and the major technological issues that still need improvement in the post-editing scenario.
ER  -

BLAHUŠ, Marek; Michal CUKR; Ondřej HERMAN; Miloš JAKUBÍČEK; Vojtěch KOVÁŘ and Marek MEDVEĎ. Semi-automatic building of large-scale digital dictionaries. Online. In Kosem, Iztok; Cukr, Michal; Jakubíček, Miloš; Kallas, Jelena; Krek, Simon; Tiberius, Carole. \textit{Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference}. Brno: Lexical Computing CZ s.r.o., 2021, p.~396-407. ISSN~2533-5626.
