Scaling to Billion-plus Word Corpora

POMIKÁLEK, Jan, Pavel RYCHLÝ and Adam KILGARRIFF. Scaling to Billion-plus Word Corpora. Advances in Computational Linguistics. Mexico: Instituto Politécnico Nacional, 2009, vol. 41, Winter 2009, p. 3-13, 14 pp. ISSN 1870-4069.

Basic information
Original name	Scaling to Billion-plus Word Corpora
Name in Czech	Miliardové korpusy
Authors	POMIKÁLEK, Jan (203 Czech Republic, guarantor), Pavel RYCHLÝ (203 Czech Republic) and Adam KILGARRIFF (826 United Kingdom of Great Britain and Northern Ireland).
Edition	Advances in Computational Linguistics, Mexico, Instituto Politécnico Nacional, 2009, 1870-4069.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Mexico
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/09:00035368
Organization unit	Faculty of Informatics
Keywords in English	word corpora; web as corpus; duplicate detection
Tags	duplicate detection, web as corpus, word corpora
Tags	International impact, Reviewed
Changed by	doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 30/3/2010 11:46.

Abstract

Most phenomena in natural languages are distributed in accordance with Zipf's law, so many words, phrases and other items occur rarely and we need very large corpora to provide evidence about them. Previous work shows that it is possible to create very large (multi-billion word) corpora from the web. The usability of such corpora is often limited by duplicate content and a lack of efficient query tools. This paper describes BiWeC, a Big Web Corpus of English texts currently comprising 5.5 billion words fully processed, with a target size of 20 billion words. We present a method for detecting near-duplicate text documents in multi-billion-word text collections and describe how one corpus query tool, the Sketch Engine, has been re-engineered to efficiently encode, process and query such corpora on low-cost hardware.

Abstract (translated from Czech)

Most phenomena in natural languages are distributed in accordance with Zipf's law, so many words and phrases occur rarely. To study these words and phrases, we need very large text corpora. Previous work has shown that it is possible to create very large corpora (on the order of billions of words) from the web. However, such corpora often contain duplicate documents, which reduces their usefulness. Another problem is the lack of efficient tools for querying corpora of this size. This paper describes BiWeC, a Big Web Corpus of English texts, fully processed and currently containing 5.5 billion words. The target size of the corpus is 20 billion words. We present a method for detecting near-duplicate text documents in text collections containing several billion words. We also describe how we re-engineered the Sketch Engine corpus manager to enable efficient processing of billion-word corpora on commonly available hardware.

Links
LC536, research and development project	Name: Centrum komputační lingvistiky
LC536, research and development project	Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky
2C06009, research and development project	Name: Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce (Acronym: COT-SEWing)
2C06009, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
