The TenTen Corpus Family

JAKUBÍČEK, Miloš, Adam KILGARRIFF, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. The TenTen Corpus Family. Online. In 7th International Corpus Linguistics Conference CL 2013. Lancaster, 2013, p. 125-127.

Basic information
Original name	The TenTen Corpus Family
Authors	JAKUBÍČEK, Miloš, Adam KILGARRIFF, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL.
Edition	Lancaster, 7th International Corpus Linguistics Conference CL 2013, p. 125-127, 3 pp. 2013.

Other information
Type of outcome	Proceedings paper
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
WWW	Conference website; Conference book of abstracts
Organization unit	Faculty of Informatics
Tags	best
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., university ID (učo) 3880. Changed: 5/3/2024 11:47.

Abstract

Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever better corpora along these lines for all of the world’s major languages (plus some others). Baroni and Kilgarriff (2006), Sharoff (2006), Baroni et al. (2009), and Kilgarriff et al. (2010) present the case for web corpora and programmes in which a number of them have been developed. TenTens are a development from them -- a new family of corpora of the order of 10 billion words. We describe how we are building them, what we have built so far, and how we shall continue maintaining them and keeping them up to date in the years ahead. While, as yet, they have very little metadata, we are working out how to gather and add metadata attribute by attribute. The corpora are all available for research at http://www.sketchengine.co.uk.
