Building a 70 billion word corpus of English from ClueWeb

D 2012

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK

Basic information

Original title

Building a 70 billion word corpus of English from ClueWeb

Authors

POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution)

Edition

Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 502-506, 5 pp. 2012

Publisher

European Language Resources Association (ELRA)

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Links

URL

RIV code

RIV/00216224:14330/12:00057572

Organizational unit

Faculty of Informatics

ISBN

978-2-9517408-7-7

UT WoS

000323927700080

Keywords (in English)

corpus; clueweb; English; encoding; word sketch

Flags

International significance, Peer-reviewed

Changed: 9 April 2013 11:19, RNDr. Miloš Jakubíček, Ph.D.

Abstract

In the original

This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using the CQL – Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion) that was performed not only on full (document-level) duplicates but also on the level of near-duplicate texts. Moreover we show the impact of each of the performed pre-processing steps on the final corpus size. Furthermore we show how effective parallelization of the corpus indexation procedure was employed within the Manatee corpus management system and during computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.

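Related projects

GAP401/10/0792, research and development project

Name: Temporal aspects of knowledge and information (Temporální aspekty znalostí a informací)

Investor: Czech Science Foundation (Grantová agentura ČR), Temporal aspects of knowledge and information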

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for the analysis, processing and distribution of linguistic data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project: building and operating the Czech node of the pan-European research infrastructure

248307, internal MU code

Name: Pattern Recognition-based Statistically Enhanced MT (Acronym: PRESEMT)

Investor: European Union, Pattern Recognition-based Statistically Enhanced MT, Cooperation

Attached files

lrec2012.pdf

Citation

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK. Building a 70 billion word corpus of English from ClueWeb. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, pp. 502-506. ISBN 978-2-9517408-7-7.

@inproceedings{991165,
   author = {Pomikálek, Jan and Rychlý, Pavel and Jakubíček, Miloš},
   address = {Istanbul, Turkey},
   booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
   editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
   keywords = {corpus; clueweb; English; encoding; word sketch},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Istanbul, Turkey},
   isbn = {978-2-9517408-7-7},
   pages = {502-506},
   publisher = {European Language Resources Association (ELRA)},
   title = {Building a 70 billion word corpus of English from ClueWeb},
   url = {http://nlp.fi.muni.cz/publications/lrec2012_xpomikal_pary_xjakub/lrec2012.pdf},
   year = {2012}
}

TY  - CONF
ID  - 991165
AU  - Pomikálek, Jan
AU  - Rychlý, Pavel
AU  - Jakubíček, Miloš
PY  - 2012
TI  - Building a 70 billion word corpus of English from ClueWeb
PB  - European Language Resources Association (ELRA)
CY  - Istanbul, Turkey
SN  - 9782951740877
KW  - corpus
KW  - clueweb
KW  - English
KW  - encoding
KW  - word sketch
UR  - http://nlp.fi.muni.cz/publications/lrec2012_xpomikal_pary_xjakub/lrec2012.pdf
L2  - http://nlp.fi.muni.cz/publications/lrec2012_xpomikal_pary_xjakub/lrec2012.pdf
N2  - This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using the CQL – Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion) that was performed not only on full (document-level) duplicates but also on the level of near-duplicate texts. Moreover we show the impact of each of the performed pre-processing steps on the final corpus size. Furthermore we show how effective parallelization of the corpus indexation procedure was employed within the Manatee corpus management system and during computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
ER  -

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK. Building a 70 billion word corpus of English from ClueWeb. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis. \textit{Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}. Istanbul, Turkey: European Language Resources Association (ELRA), 2012, pp.~502-506. ISBN~978-2-9517408-7-7.