D 2017

Removing spam from web corpora through supervised learning using FastText

SUCHOMEL, Vít

Basic information

Original name

Removing spam from web corpora through supervised learning using FastText

Authors

SUCHOMEL, Vít

Edition

Birmingham, 2017

Other information

Language

English

Type of outcome

Proceedings paper

Country of publisher

Germany

Confidentiality degree

is not subject to a state or trade secret

Organization unit

Faculty of Informatics

Keywords in English

Text corpora;Web spam;Supervised learning;FastText

Tags

International impact, Reviewed
Changed: 27/11/2018 13:34, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original language

Unlike traditional text corpora collected from trustworthy sources, the content of web based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognise 71 % of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.

Links

LM2015071, research and development project
Name: Language Research Infrastructure in the Czech Republic (original name: Jazyková výzkumná infrastruktura v České republice; Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR