D 2020

Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites

SUCHOMEL, Vít

Basic information

Original name

Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020, p. 113-123, 11 pp. 2020

Publisher

Tribun 2020

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/20:00117841

Organization unit

Faculty of Informatics

ISBN

978-80-263-1600-8

ISSN

UT WoS

000655471300012

Keywords in English

web corpora; web spam; supervised learning

Tags

International impact
Changed: 13/5/2024 17:45, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Internet spam is a major issue hindering the usefulness of web corpora. Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be cleaned. In this paper, two experiments in non-text removal based on supervised learning are presented. First, an improvement in corpus-based language analyses of selected words, achieved by a supervised classifier, is shown on an English web corpus. Then, a semi-manual approach to obtaining samples of non-text web pages in Estonian is introduced. This strategy makes the supervised learning process more efficient. The resulting spam classifiers are tuned for high recall at the cost of precision in order to remove as much non-text as possible. The evaluation shows that the classifiers reached a recall of 71 % and 97 % on the English and Estonian web corpora, respectively. A technique for avoiding spammed web sites by measuring the distance of web pages from trustworthy sites is also studied.
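
The abstract's trade-off between recall and precision can be illustrated with a minimal sketch. It is not the paper's implementation: the scores, labels, and the function names `precision_recall` and `threshold_for_recall` are hypothetical, and the sketch only assumes a classifier that outputs a spam score per page, with pages at or above a threshold being removed. Lowering the threshold removes more spam (higher recall) while also discarding more good text (lower precision).

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when pages scoring >= threshold are flagged as spam."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def threshold_for_recall(scores: Sequence[float], labels: Sequence[int],
                         target_recall: float) -> float:
    """Return the highest threshold whose recall reaches target_recall.

    Recall never decreases as the threshold is lowered, so the first
    candidate (scanning from high to low) that meets the target is the
    least aggressive cut-off that still satisfies it.
    """
    for t in sorted(set(scores), reverse=True):
        _, recall = precision_recall(scores, labels, t)
        if recall >= target_recall:
            return t
    return min(scores)


# Hypothetical toy data: spam scores from some classifier, 1 = spam, 0 = good text.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

strict = threshold_for_recall(scores, labels, 0.6)   # moderate recall target
greedy = threshold_for_recall(scores, labels, 1.0)   # remove all known spam
print(strict, precision_recall(scores, labels, strict))
print(greedy, precision_recall(scores, labels, greedy))
```

On this toy data, raising the recall target from 0.6 to 1.0 moves the cut-off from 0.80 down to 0.40 and costs one good page, which is the kind of exchange the abstract describes.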

Links

LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR