Building a 70 billion word corpus of English from ClueWeb

D 2012

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK

Basic information

Original name

Building a 70 billion word corpus of English from ClueWeb

Authors

POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution)

Edition

Istanbul, Turkey, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), p. 502-506, 5 pp. 2012

Publisher

European Language Resources Association (ELRA)

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

URL

RIV identification code

RIV/00216224:14330/12:00057572

Organization unit

Faculty of Informatics

ISBN

978-2-9517408-7-7

UT WoS

000323927700080

Keywords in English

corpus; clueweb; English; encoding; word sketch

Abstract

In the original

This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying with the Corpus Query Language, CQL). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full, document-level duplicates but also on near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. Finally, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
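
The two levels of de-duplication mentioned in the abstract can be illustrated with a minimal sketch. This is not the onion implementation: the function names, the 5-word n-gram size, and the 0.5 overlap threshold are assumptions chosen only for illustration. Exact duplicates are caught by hashing the whole document, and near-duplicates by measuring how many of a document's word n-grams have already been seen in documents kept earlier.

```python
import hashlib


def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (n is an assumed value)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts are treated as a single n-gram; empty texts yield nothing.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(documents, n=5, threshold=0.5):
    """Keep documents that are neither exact nor near duplicates of earlier ones.

    Hypothetical helper: `threshold` is the assumed fraction of already-seen
    n-grams above which a document counts as a near-duplicate.
    """
    seen_hashes = set()   # fingerprints of whole documents (exact duplicates)
    seen_ngrams = set()   # n-grams of all documents kept so far
    kept = []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # document-level (full) duplicate
        seen_hashes.add(digest)

        grams = word_ngrams(doc, n)
        if not grams:
            continue  # nothing to index in an empty document
        overlap = len(grams & seen_ngrams) / len(grams)
        if overlap > threshold:
            continue  # near-duplicate of previously kept text

        seen_ngrams |= grams
        kept.append(doc)
    return kept
```

Calling `deduplicate` on a list of strings returns the retained documents in their original order. At the scale described in the abstract, an in-memory set of n-grams would not fit in RAM, so this sketch only conveys the idea, not a workable design for 70 billion words.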

Links

GAP401/10/0792, research and development project

Name: Temporal Aspects of Knowledge and Information (Temporální aspekty znalostí a informací)

Investor: Czech Science Foundation

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the CR

248307, MU internal code

Name: Pattern Recognition-based Statistically Enhanced MT (Acronym: PRESEMT)

Investor: European Union, Pattern Recognition-based Statistically Enhanced MT, Cooperation

Files attached

lrec2012.pdf File version
