D 2012

Building a 70 billion word corpus of English from ClueWeb

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK

Basic information

Original name

Building a 70 billion word corpus of English from ClueWeb

Authors

POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution)

Edition

Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), p. 502-506, 5 pp. 2012

Publisher

European Language Resources Association (ELRA)

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

RIV identification code

RIV/00216224:14330/12:00057572

Organization unit

Faculty of Informatics

ISBN

978-2-9517408-7-7

UT WoS

000323927700080

Keywords in English

corpus; clueweb; English; encoding; word sketch

Tags

International impact, Reviewed
Changed: 9/4/2013 11:19, RNDr. Miloš Jakubíček, Ph.D.

Abstract

In the original

This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying using CQL, the Corpus Query Language). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. Furthermore, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.

Links

GAP401/10/0792, research and development project
Name: Temporal Aspects of Knowledge and Information (Temporální aspekty znalostí a informací)
Investor: Czech Science Foundation
LM2010013, research and development project
Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR
248307, MU internal code
Name: Pattern Recognition-based Statistically Enhanced MT (Acronym: PRESEMT)
Investor: European Union, Pattern Recognition-based Statistically Enhanced MT, Cooperation
