Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites

D 2020

SUCHOMEL, Vít

Basic information

Original title

Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020, pp. 113-123, 11 pp. 2020

Publisher

Tribun 2020

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

Links

PDF in the proceedings; workshop home page

RIV identification code

RIV/00216224:14330/20:00117841

Organization unit

Faculty of Informatics

ISBN

978-80-263-1600-8

UT WoS

000655471300012

EID Scopus

2-s2.0-85103628303

Keywords in English

web corpora; web spam; supervised learning

Tags

machine learning, spam, web corpora

Flags

International impact

Changed: 13 May 2024 17:45, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original language

Internet spam is a major issue hindering the usefulness of web corpora. Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be cleaned. This paper presents two experiments in non-text removal based on supervised learning. First, a supervised classifier is shown to improve corpus-based language analyses of selected words in an English web corpus. Then, a semi-manual approach to obtaining samples of non-text web pages in Estonian is introduced. This strategy makes the supervised learning process more efficient. The resulting spam classifiers are tuned for high recall at the cost of precision to remove as much non-text as possible. The evaluation shows that the classifiers reached a recall of 71% and 97% for the English and Estonian web corpora, respectively. A technique for avoiding spammed web sites by measuring the distance of web pages from trustworthy sites is also studied.
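
The trade-off mentioned in the abstract, tuning a spam classifier for high recall at the cost of precision, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy scores, and the labels below are all hypothetical, and the record does not say which classifier or features were used. The sketch only assumes a classifier that outputs a spam score per page, and it lowers the decision threshold until a target recall is met.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[float, float]:
    """Flag pages with score >= threshold as spam (label 1) and evaluate."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention (an assumption): an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def highest_threshold_for_recall(scores: List[float], labels: List[int],
                                 target_recall: float) -> float:
    """Return the highest threshold whose recall reaches target_recall."""
    # Walk candidate thresholds from strict to lenient; recall can only grow.
    for t in sorted(set(scores), reverse=True):
        _, recall = precision_recall(scores, labels, t)
        if recall >= target_recall:
            return t
    raise ValueError("no threshold reaches the requested recall")


# Hypothetical toy data: classifier scores and gold labels (1 = spam).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

t = highest_threshold_for_recall(scores, labels, target_recall=1.0)
p, r = precision_recall(scores, labels, t)
print(f"threshold={t}, precision={p:.2f}, recall={r:.2f}")
```

On this toy data, catching every spam page requires a threshold low enough that one clean page is also removed, which is the kind of precision loss the abstract accepts in exchange for removing as much non-text as possible.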

Related projects

LM2018101, research and development project

Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities

Cite

SUCHOMEL, Vít. Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites. In Aleš Horák. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun 2020, 2020, pp. 113-123. ISBN 978-80-263-1600-8.

@inproceedings{1729500,
   author = {Suchomel, Vít},
   address = {Brno},
   booktitle = {Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020},
   editor = {Aleš Horák},
   keywords = {web corpora; web spam; supervised learning},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1600-8},
   pages = {113--123},
   publisher = {Tribun 2020},
   title = {Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites},
   url = {https://nlp.fi.muni.cz/raslan/raslan20.pdf#page=121},
   year = {2020}
}

TY  - CONF
ID  - 1729500
AU  - Suchomel, Vít
PY  - 2020
TI  - Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites
PB  - Tribun 2020
CY  - Brno
SN  - 9788026316008
KW  - web corpora
KW  - web spam
KW  - supervised learning
UR  - https://nlp.fi.muni.cz/raslan/raslan20.pdf#page=121
N2  - Internet spam is a major issue hindering the usefulness of web corpora. Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be cleaned. This paper presents two experiments in non-text removal based on supervised learning. First, a supervised classifier is shown to improve corpus-based language analyses of selected words in an English web corpus. Then, a semi-manual approach to obtaining samples of non-text web pages in Estonian is introduced. This strategy makes the supervised learning process more efficient. The resulting spam classifiers are tuned for high recall at the cost of precision to remove as much non-text as possible. The evaluation shows that the classifiers reached a recall of 71% and 97% for the English and Estonian web corpora, respectively. A technique for avoiding spammed web sites by measuring the distance of web pages from trustworthy sites is also studied.
ER  -

SUCHOMEL, Vít. Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites. In Aleš Horák. \textit{Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020}. Brno: Tribun 2020, 2020, pp.~113--123. ISBN~978-80-263-1600-8.
