D 2018

csTenTen17, a Recent Czech Web Corpus

SUCHOMEL, Vít

Basic information

Original name

csTenTen17, a Recent Czech Web Corpus

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2018, p. 111-123, 13 pp. 2018

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

RIV identification code

RIV/00216224:14330/18:00105270

Organization unit

Faculty of Informatics

ISBN

978-80-263-1517-9

ISSN

UT WoS

000612420300014

Keywords in English

Czech corpus; web corpus; text processing

Tags

International impact
Changed: 16/5/2022 15:44, Mgr. Michal Petr

Abstract

In the original

This article introduces csTenTen17, a very large Czech text corpus for language research, compiled from texts downloaded in 2015, 2016 and 2017. The corpus consists of 10.5 billion words, double the size of its predecessor from 2012. A brief comparison with other recent Czech corpora follows.

Links

LM2015071, research and development project
Name: Language Research Infrastructure in the Czech Republic (original name: Jazyková výzkumná infrastruktura v České republice; Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR