Detailed Information on Publication Record
2018
csTenTen17, a Recent Czech Web Corpus
SUCHOMEL, Vít
Basic information
Original name
csTenTen17, a Recent Czech Web Corpus
Authors
SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution)
Edition
Brno, Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018, p. 111-123, 13 pp. 2018
Publisher
Tribun EU
Other information
Language
English
Type of outcome
Article in proceedings
Field of Study
10200 1.2 Computer and information sciences
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
References:
RIV identification code
RIV/00216224:14330/18:00105270
Organization unit
Faculty of Informatics
ISBN
978-80-263-1517-9
ISSN
UT WoS
000612420300014
Keywords in English
Czech corpus; web corpus; text processing
Tags
International impact
Changed: 16/5/2022 15:44, Mgr. Michal Petr
Abstract
In the original
This article introduces csTenTen17, a very large Czech text corpus for language research, compiled from texts downloaded in 2015, 2016 and 2017. The corpus consists of 10.5 billion words, double the size of its predecessor from 2012. A brief comparison with other recent Czech corpora follows.
Links
LM2015071, research and development project