Detailed Information on Publication Record
2015
Determining Window Size from Plagiarism Corpus for Stylometric Features
SUCHOMEL, Šimon and Michal BRANDEJS
Basic information
Original name
Determining Window Size from Plagiarism Corpus for Stylometric Features
Authors
SUCHOMEL, Šimon (203 Czech Republic, belonging to the institution) and Michal BRANDEJS (203 Czech Republic, guarantor, belonging to the institution)
Edition
Toulouse, France, Experimental IR Meets Multilinguality, Multimodality, and Interaction, p. 293-299, 7 pp. 2015
Publisher
Springer International Publishing
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
France
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
Impact factor
Impact factor: 0.402 in 2005
RIV identification code
RIV/00216224:14330/15:00084706
Organization unit
Faculty of Informatics
ISBN
978-3-319-24026-8
ISSN
UT WoS
000364677800034
Keywords in English
plagiarism; average word frequency class; stylometry; text classification; intrinsic plagiarism
Tags
International impact, Reviewed
Changed: 16/11/2015 11:33, RNDr. Šimon Suchomel, Ph.D.
Abstract
In the original
The sliding window concept is a common method for computing a profile of a document with unknown structure. This paper outlines an experiment with a stylometric word-based feature in order to determine an optimal size of the sliding window. It was conducted for a vocabulary richness method called ‘average word frequency class’ using the PAN 2015 source retrieval training corpus for plagiarism detection. The paper shows the pros and cons of stop word removal for sliding window document profiling and discusses the utilization of the selected feature for intrinsic plagiarism detection. The experiment resulted in the recommendation of setting the sliding window to around 100 words in length for computing the text profile using the average word frequency class stylometric feature.
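As a rough illustration of the idea described in the abstract, the following minimal sketch computes a sliding-window profile of the average word frequency class. It is not the authors' implementation. It assumes the commonly used definition in which the class of a word is floor(log2(f_max / f_word)), with f_max the frequency of the most frequent word. The function names, the step size, the treatment of unseen words, and the use of the document's own word counts in place of a reference corpus are all assumptions made for this example. Only the default window of 100 words follows the recommendation in the abstract.

```python
import math
from collections import Counter


def word_frequency_class(word, freq, max_freq):
    """Class 0 is the most frequent word; rarer words get higher classes.

    Assumption: words missing from `freq` are treated as if seen once.
    """
    return math.floor(math.log2(max_freq / freq.get(word, 1)))


def awfc_profile(tokens, freq, window=100, step=50):
    """Average word frequency class for each sliding window over `tokens`.

    `window` defaults to 100 words, as recommended in the abstract.
    `step` is a hypothetical choice; the abstract does not state one.
    """
    if not tokens:
        return []
    max_freq = max(freq.values())
    classes = [word_frequency_class(t, freq, max_freq) for t in tokens]
    if len(classes) < window:
        # Text shorter than one window: a single value for the whole text.
        return [sum(classes) / len(classes)]
    profile = []
    for start in range(0, len(classes) - window + 1, step):
        chunk = classes[start:start + window]
        profile.append(sum(chunk) / window)
    return profile


# Toy usage: frequencies are taken from the text itself (an assumption;
# a large reference corpus would normally supply them).
tokens = ["the", "the", "the", "the", "cat", "cat", "dog", "the"]
freq = Counter(tokens)
profile = awfc_profile(tokens, freq, window=4, step=4)
```

In this sketch, a jump in the profile between neighbouring windows would mark a change in vocabulary richness, which is the kind of signal the abstract discusses for intrinsic plagiarism detection.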
Links
LG13010, research and development project