Efficient Web Crawling for Large Text Corpora

SUCHOMEL, Vít and Jan POMIKÁLEK. Efficient Web Crawling for Large Text Corpora. Online. In Adam Kilgarriff, Serge Sharoff. Proceedings of the seventh Web as Corpus Workshop (WAC7). Lyon, 2012. pp. 39-43. [cited 2024-04-24]


Basic information
Original title	Efficient Web Crawling for Large Text Corpora
Title in Czech	Efektivní automatické stahování z webu pro velké textové korpusy
Authors	SUCHOMEL, Vít and Jan POMIKÁLEK
Edition	Lyon, Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43, 5 pp. 2012.

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality	is not subject to a state or trade secret
Publication form	electronic version available online
WWW	Proceedings of the seventh Web as Corpus Workshop (WAC7)
Organization unit	Faculty of Informatics
Keywords (in Czech)	crawler; automatické stahování z webu; korpus; webový korpus; textový korpus
Keywords in English	crawler; web crawling; corpus; web corpus; text corpus
Tags	best
Flags	International impact
Changed by	RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 9 April 2013 11:49.

Abstract

Many researchers use texts from the web, an easy source of linguistic data in a great variety of languages. Building text corpora that are both large and of good quality is the challenge we face today. We describe how to deal with inefficient data downloading and how to focus crawling on text-rich web domains. We present efficiency figures from crawling texts in American Spanish, Czech, Japanese, Russian, Tajik Persian, and Turkish, along with the sizes of the resulting corpora. The approach has been successfully applied to build corpora on the scale of billions of words in six languages. Texts in the Russian corpus, consisting of 20.2 billion tokens, were downloaded in just 13 days.


