Detailed Information on Publication Record
2012
Building a 70 billion word corpus of English from ClueWeb
POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK
Basic information
Original name
Building a 70 billion word corpus of English from ClueWeb
Authors
POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution)
Edition
Istanbul, Turkey, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), p. 502-506, 5 pp. 2012
Publisher
European Language Resources Association (ELRA)
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
References:
RIV identification code
RIV/00216224:14330/12:00057572
Organization unit
Faculty of Informatics
ISBN
978-2-9517408-7-7
UT WoS
000323927700080
Keywords in English
corpus; clueweb; English; encoding; word sketch
Tags
International impact, Reviewed
Changed: 9/4/2013 11:19, RNDr. Miloš Jakubíček, Ph.D.
Abstract
In the original
This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying using the Corpus Query Language, CQL). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also at the level of near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. Furthermore, we show how effective parallelization of the corpus indexing procedure was employed within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
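As an illustration of the near-duplicate removal mentioned in the abstract, the following is a minimal sketch of one common approach: a paragraph is dropped when most of its word n-grams have already been seen in previously kept text. It is not the onion implementation. The function names, the n-gram size and the threshold are assumptions chosen for the example only.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> List[NGram]:
    """Return the word n-grams of a text (lowercased, whitespace-tokenized).

    Texts shorter than n tokens yield a single shorter tuple so that
    they can still be compared against earlier text.
    """
    tokens = text.lower().split()
    if not tokens:
        return []
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(paragraphs: Iterable[str], n: int = 5,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph only if at most `threshold` of its n-grams were seen.

    `n` and `threshold` are illustrative values, not settings taken
    from the paper or from any particular tool.
    """
    seen: Set[NGram] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        if not grams:
            continue  # skip empty paragraphs
        seen_ratio = sum(1 for g in grams if g in seen) / len(grams)
        if seen_ratio > threshold:
            continue  # full or near duplicate of earlier text
        seen.update(grams)
        kept.append(paragraph)
    return kept
```

An exact duplicate has every n-gram already recorded and is always dropped, while a paragraph that differs in only a few words still exceeds the threshold. This is how a single pass can cover both full and near duplicates.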
Links
GAP401/10/0792, research and development project
LM2010013, research and development project
248307, internal MU code