Detecting Spam in Web Corpora

D 2012

BAISA, Vít and Vít SUCHOMEL

Basic information

Original name

Detecting Spam in Web Corpora

Authors

BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution)

Edition

Brno, 6th Workshop on Recent Advances in Slavonic Natural Language Processing, p. 69-76, 8 pp. 2012

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

URL

RIV identification code

RIV/00216224:14330/12:00062284

Organization unit

Faculty of Informatics

ISBN

978-80-263-0313-8

Keywords in English

spam detection; web corpora; n-gram

Changed: 25/5/2021 19:21, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original

To increase the search result rank of a website, many fake websites full of generated or semi-generated texts have been created in recent years. Since we do not want this garbage in our text corpora, this is becoming a problem. This paper describes generated texts observed in recently crawled web corpora and proposes a new way to detect such unwanted content. The main idea of the presented approach is to compare frequencies of word n-grams from the potentially forged texts with word n-grams from a trusted corpus. As a source of spam text, fake webpages concerning loans, taken from an English web corpus, were used as an example of data aimed at fooling search engines. The results show that this approach is able to properly detect certain kinds of forged texts, with accuracy reaching almost 70 %.
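
The idea of comparing word n-grams from a suspect text against a trusted corpus can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the choice of bigrams, the helper names `word_ngrams` and `unseen_ngram_ratio`, and the scoring rule (the share of n-grams never seen in the trusted corpus) are all assumptions made for illustration only.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_ngram_ratio(suspect_text, trusted_counts, n=2):
    """Share of the suspect text's n-grams that never occur in the trusted corpus.

    This scoring rule is a hypothetical stand-in, not the measure from the paper.
    """
    grams = word_ngrams(suspect_text, n)
    if not grams:
        return 0.0
    # Counter returns 0 for missing keys, so unseen n-grams need no special case.
    unseen = sum(1 for gram in grams if trusted_counts[gram] == 0)
    return unseen / len(grams)


# Toy "trusted corpus"; a real one would be far larger.
trusted_text = "the bank offers a loan to the customer"
trusted_counts = Counter(word_ngrams(trusted_text, 2))

natural_score = unseen_ngram_ratio("the bank offers a loan", trusted_counts)
spam_score = unseen_ngram_ratio("loan loan cheap loan", trusted_counts)
print(natural_score, spam_score)
```

A document would then be flagged when its score exceeds some threshold; the threshold itself would have to be tuned on labelled data and is not specified here.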

Links

LM2010013, research and development project

Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the CR

248307, internal MU code

Name: Pattern Recognition-based Statistically Enhanced MT (Acronym: PRESEMT)

Investor: European Union, Pattern Recognition-based Statistically Enhanced MT, Cooperation
