D 2020

Current Challenges in Web Corpus Building

JAKUBÍČEK, Miloš, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL

Basic information

Original name

Current Challenges in Web Corpus Building

Authors

JAKUBÍČEK, Miloš (203 Czech Republic, guarantor, belonging to the institution), Vojtěch KOVÁŘ (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution)

Edition

Marseille, France, Proceedings of the 12th Web as Corpus Workshop, p. 1-4, 4 pp. 2020

Publisher

European Language Resources Association

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

France

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

RIV identification code

RIV/00216224:14330/20:00114153

Organization unit

Faculty of Informatics

ISBN

979-10-95546-68-9

Keywords in English

Web corpora; corpus building

Tags

International impact, Reviewed
Changed: 28/5/2020 13:06, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original

In this paper we discuss some of the current challenges in web corpus building that we have faced in recent years when expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and prompt discussion of possible solutions, rather than to offer readers ready-made ones. For every issue we try to assess its severity and briefly discuss possible mitigation options.

Links

GA18-23891S, research and development project
Name: Hyperintensional reasoning over natural language texts
Investor: Czech Science Foundation
LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR