Removing Spam from Web Corpora Through Supervised Learning and
Semi-manual Classification of Web Sites

SUCHOMEL, Vít. Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites. In Aleš Horák. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun 2020, 2020, p. 113-123. ISBN 978-80-263-1600-8.

Basic information
Original name	Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites
Authors	SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020, p. 113-123, 11 pp. 2020.
Publisher	Tribun 2020

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	Workshop homepage, PDF in the proceedings
RIV identification code	RIV/00216224:14330/20:00117841
Organization unit	Faculty of Informatics
ISBN	978-80-263-1600-8
ISSN	2336-4289
Keywords in English	web corpora; web spam; supervised learning
Tags	machine learning, spam, web corpora
Tags	International impact
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 10/5/2021 06:19.

Abstract

Internet spam is a major issue hindering the usefulness of web corpora. Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be cleaned. In this paper, two experiments in non-text removal based on supervised learning are presented. First, an improvement in corpus-based language analyses of selected words, achieved by a supervised classifier, is shown on an English web corpus. Then, a semi-manual approach to obtaining samples of non-text web pages in Estonian is introduced. This strategy makes the supervised learning process more efficient. The resulting spam classifiers are tuned for high recall at the cost of precision in order to remove as much non-text as possible. The evaluation shows that the classifiers reached a recall of 71 % and 97 % on the English and Estonian web corpora, respectively. A technique for avoiding spammed web sites by measuring the distance of web pages from trustworthy sites is also studied.
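
The trade-off mentioned in the abstract, tuning a classifier for high recall at the cost of precision, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `threshold_for_recall` and `precision_recall`, the toy scores, and the convention that label 1 means non-text are all assumptions made for illustration. The idea is to pick the highest decision threshold that still catches the desired share of known non-text pages and then see how much precision is lost.

```python
import math
from typing import Sequence, Tuple


def threshold_for_recall(scores: Sequence[float],
                         labels: Sequence[int],
                         target_recall: float) -> float:
    """Return the highest score threshold whose recall on the positive
    (non-text, label 1) class is at least target_recall.

    Hypothetical helper: a page is flagged as non-text if score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        raise ValueError("no positive (non-text) examples in the sample")
    # Number of positive examples that must be flagged to reach the target.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def precision_recall(scores: Sequence[float],
                     labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Compute precision and recall of the rule score >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (assumed): classifier scores and manual labels, 1 = non-text.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]

strict = threshold_for_recall(scores, labels, target_recall=0.5)
lenient = threshold_for_recall(scores, labels, target_recall=0.97)
```

On this toy sample, the lower threshold catches every labelled non-text page but also flags one good page, which is the kind of precision loss the abstract accepts in exchange for recall.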

Links
LM2018101, research and development project	Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
LM2018101, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
