csTenTen17, a Recent Czech Web Corpus

SUCHOMEL, Vít. csTenTen17, a Recent Czech Web Corpus. In Aleš Horák, Pavel Rychlý and Adam Rambousek. Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2018. Brno: Tribun EU, 2018, p. 111-123. ISBN 978-80-263-1517-9.

Basic information
Original name	csTenTen17, a Recent Czech Web Corpus
Authors	SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2018, p. 111-123, 13 pp. 2018.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10200 1.2 Computer and information sciences
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/18:00105270
Organization unit	Faculty of Informatics
ISBN	978-80-263-1517-9
ISSN	2336-4289
UT WoS	000612420300014
Keywords in English	Czech corpus; web corpus; text processing
Tags	International impact
Changed by	Mgr. Michal Petr, učo 65024. Changed: 16/5/2022 15:44.

Abstract
This article introduces csTenTen17, a very large Czech text corpus for language research compiled from texts downloaded in 2015, 2016 and 2017. The corpus consists of 10.5 billion words, double the size of its predecessor from 2012. A brief comparison with other recent Czech corpora follows.

Links
LM2015071, research and development project	Name: Jazyková výzkumná infrastruktura v České republice (Acronym: LINDAT-Clarin)
LM2015071, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
