SUCHOMEL, Vít. Removing spam from web corpora through supervised learning using FastText. Birmingham, 2017.
Basic information
Original name Removing spam from web corpora through supervised learning using FastText
Authors SUCHOMEL, Vít.
Edition Birmingham, 2017.
Other information
Original language English
Type of outcome Proceedings paper
Country of publisher Germany
Confidentiality degree is not subject to a state or trade secret
WWW Conference proceedings
Organization unit Faculty of Informatics
Keywords in English Text corpora;Web spam;Supervised learning;FastText
Tags International impact, Reviewed
Changed by RNDr. Vít Suchomel, Ph.D., university ID (učo) 139723. Changed: 27/11/2018 13:34.
Abstract
Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognise 71 % of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.
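The abstract reports the classifier's quality in terms of precision and recall; the share of spam documents recognised is most naturally read as recall on the spam class, though the record itself does not say so. As a minimal sketch of how these two measures are computed from gold and predicted labels, consider the following. The function name `precision_recall`, the label strings "spam" and "ok", and the toy data are all illustrative assumptions, not taken from the paper.

```python
def precision_recall(gold, predicted, positive="spam"):
    """Return (precision, recall) for the `positive` label.

    `gold` and `predicted` are equal-length sequences of label strings.
    The label names are hypothetical; the paper's label set is not given here.
    """
    pairs = list(zip(gold, predicted))
    # True positives: spam documents correctly flagged as spam.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: clean documents wrongly flagged as spam.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: spam documents the classifier missed.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when nothing is flagged or nothing is spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy, made-up labels for illustration only (not data from the paper).
gold = ["spam", "spam", "spam", "spam", "ok", "ok", "ok", "ok"]
predicted = ["spam", "spam", "spam", "ok", "spam", "ok", "ok", "ok"]

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For corpus cleaning, low precision means good text is discarded along with the spam, while low recall means spam stays in the corpus, so both figures matter when a classifier is moved to a different kind of data.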
Links
LM2015071, research and development project
Name: Jazyková výzkumná infrastruktura v České republice (Language Research Infrastructure in the Czech Republic; acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR