2017
Removing spam from web corpora through supervised learning using FastText
SUCHOMEL, Vít
Basic information
Original name
Removing spam from web corpora through supervised learning using FastText
Authors
SUCHOMEL, Vít
Edition
Birmingham, 2017
Other information
Language
English
Type of outcome
Proceedings paper
Country of publisher
Germany
Confidentiality degree
is not subject to a state or trade secret
Organization unit
Faculty of Informatics
Keywords in English
Text corpora;Web spam;Supervised learning;FastText
Tags
International impact, Reviewed
Changed: 27/11/2018 13:34, RNDr. Vít Suchomel, Ph.D.
Abstract
In the original language
Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognise 71 % of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
Links
LM2015071, research and development project