Website Properties in Relation to the Quality of Text Extracted
for Web Corpora

SUCHOMEL, Vít and Jan KRAUS. Website Properties in Relation to the Quality of Text Extracted for Web Corpora. In Horák, Rychlý, Rambousek. Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, p. 167-175. ISBN 978-80-263-1670-1.

Basic information
Original name	Website Properties in Relation to the Quality of Text Extracted for Web Corpora
Authors	SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution) and Jan KRAUS (203 Czech Republic).
Edition	Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), p. 167-175, 9 pp. 2021.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10200 1.2 Computer and information sciences
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	Full text PDF; Workshop homepage
RIV identification code	RIV/00216224:14330/21:00123254
Organization unit	Faculty of Informatics
ISBN	978-80-263-1670-1
ISSN	2336-4289
Keywords in English	Web crawling; Web spam; Text corpus; Text processing
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 15/5/2024 02:16.

Abstract

In this paper we present our research on the relation between two properties of websites and the quality of the text extracted from them, in the context of crawling the web and building large web corpora. A manual classification of the text quality of 18 thousand websites in 21 European languages was used to verify our assumption that certain web domain properties can be used to identify potential sources of poor-quality content. The first property is the distance of a web domain from the seed domains in a web crawl. The second property is the length of the website name. Although these properties were recommended to help identify good-quality websites in our previous work, in this paper we show that there is only a small difference in quality between text-rich web domains with various seed distances or name lengths. This conclusion holds for post-crawling text processing when the web crawl is started with a large number of seed domains.
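
The two properties named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that seed distance means the smallest number of link hops from any seed domain in a domain-level link graph, and that name length means the character count of the hostname; the abstract does not define either measure precisely. The function names `seed_distances` and `name_length` and the example domains are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def seed_distances(links, seeds):
    """Breadth-first search over a domain-level link graph.

    `links` maps a domain to the domains it links to (assumed input shape).
    Returns the smallest hop count from any seed for every reachable domain.
    """
    dist = {seed: 0 for seed in seeds}
    queue = deque(dist)
    while queue:
        domain = queue.popleft()
        for target in links.get(domain, ()):
            if target not in dist:  # first visit is the shortest path in BFS
                dist[target] = dist[domain] + 1
                queue.append(target)
    return dist


def name_length(url):
    """Character count of the hostname (assumed definition of name length).

    Returns 0 when the URL has no scheme and therefore no parsed hostname.
    """
    host = urlsplit(url).hostname or ""
    return len(host)


# Hypothetical toy link graph with a single seed domain.
links = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "seed.example"],
}
distances = seed_distances(links, ["seed.example"])
```

With per-domain values like these, manually assigned quality labels could be grouped by seed distance or by name length to compare groups, which is the kind of comparison the abstract reports as showing only a small difference.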

Links
LM2018101, research and development project	Name: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy (Acronym: LINDAT/CLARIAH-CZ)
LM2018101, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
