Efficient Web Crawling for Large Text Corpora

SUCHOMEL, Vít and Jan POMIKÁLEK. Efficient Web Crawling for Large Text Corpora. Online. In Adam Kilgarriff, Serge Sharoff. Proceedings of the seventh Web as Corpus Workshop (WAC7). Lyon, 2012, p. 39-43.


Basic information
Original name	Efficient Web Crawling for Large Text Corpora
Name in Czech	Efektivní automatické stahování z webu pro velké textové korpusy
Authors	SUCHOMEL, Vít and Jan POMIKÁLEK.
Edition	Lyon, Proceedings of the seventh Web as Corpus Workshop (WAC7), p. 39-43, 5 pp. 2012.

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
WWW	Proceedings of the seventh Web as Corpus Workshop (WAC7)
Organization unit	Faculty of Informatics
Keywords (in Czech)	crawler; automatické stahování z webu; korpus; webový korpus; textový korpus
Keywords in English	crawler; web crawling; corpus; web corpus; text corpus
Tags	best
Tags	International impact
Changed by	RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 9/4/2013 11:49.

Abstract

Many researchers use texts from the web, an easy source of linguistic data in a great variety of languages. Building text corpora that are both large and of good quality is the challenge we face nowadays. We describe how to deal with inefficient data downloading and how to focus crawling on text-rich web domains. We present efficiency figures from crawling texts in American Spanish, Czech, Japanese, Russian, Tajik Persian and Turkish, along with the sizes of the resulting corpora. The idea has been successfully applied to building corpora on the scale of billions of words in six languages. Texts in the Russian corpus, consisting of 20.2 billion tokens, were downloaded in just 13 days.
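
The abstract does not spell out how crawling is focused on text-rich domains. One plausible reading, given here only as a minimal sketch, is to track for each domain how much cleaned text is obtained per downloaded byte and to stop visiting domains where that ratio stays low. The names DomainStats, keep_crawling and tokens_per_day, as well as the thresholds min_pages and min_yield, are hypothetical and are not taken from this record. The last helper simply works out the average daily throughput implied by the Russian corpus figures above.

```python
from dataclasses import dataclass


@dataclass
class DomainStats:
    """Running totals for one web domain (hypothetical structure)."""
    downloaded_bytes: int = 0   # raw bytes fetched from the domain
    clean_text_bytes: int = 0   # bytes of text kept after cleaning
    pages: int = 0              # number of pages fetched so far

    def record(self, downloaded: int, clean: int) -> None:
        """Add the outcome of one fetched page to the totals."""
        self.downloaded_bytes += downloaded
        self.clean_text_bytes += clean
        self.pages += 1

    @property
    def yield_rate(self) -> float:
        """Share of downloaded data that survived cleaning (0.0 if nothing fetched)."""
        if self.downloaded_bytes == 0:
            return 0.0
        return self.clean_text_bytes / self.downloaded_bytes


def keep_crawling(stats: DomainStats, min_pages: int = 10,
                  min_yield: float = 0.01) -> bool:
    """Assumed policy: judge a domain only after a few pages have been seen."""
    if stats.pages < min_pages:
        return True
    return stats.yield_rate >= min_yield


def tokens_per_day(tokens: float, days: float) -> float:
    """Average throughput implied by a corpus size and a crawl duration."""
    return tokens / days


if __name__ == "__main__":
    rich, poor = DomainStats(), DomainStats()
    for _ in range(10):
        rich.record(downloaded=100_000, clean=20_000)  # 20 % of bytes kept
        poor.record(downloaded=100_000, clean=100)     # 0.1 % of bytes kept
    print("rich domain kept:", keep_crawling(rich))
    print("poor domain kept:", keep_crawling(poor))
    # Figures from the abstract: 20.2 billion tokens in 13 days.
    print(f"{tokens_per_day(20.2e9, 13):.3e} tokens per day")
```

With these illustrative thresholds, a domain is given ten pages before it can be dropped, so that a single boilerplate-heavy page does not disqualify an otherwise text-rich site. The throughput helper shows that the reported Russian crawl averaged roughly 1.55 billion tokens per day.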
