Building Corpora for Stylometric Research

ŠVEC, Ján and Jan RYGL. Building Corpora for Stylometric Research. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. Text, Speech, and Dialogue - 19th International Conference. Germany: Springer International Publishing, 2016, p. 20-27. ISBN 978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5_3.

Basic information
Original name	Building Corpora for Stylometric Research
Authors	ŠVEC, Ján (703 Slovakia, belonging to the institution) and Jan RYGL (203 Czech Republic, guarantor, belonging to the institution).
Edition	Germany, Text, Speech, and Dialogue - 19th International Conference, p. 20-27, 8 pp. 2016.
Publisher	Springer International Publishing

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Germany
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
Impact factor	0.402 in 2005
RIV identification code	RIV/00216224:14330/16:00090841
Organization unit	Faculty of Informatics
ISBN	978-3-319-45509-9
ISSN	0302-9743
Doi	http://dx.doi.org/10.1007/978-3-319-45510-5_3
UT WoS	000389707400003
Keywords (in Czech)	korpus; stylometrie; autorství; crawler
Keywords in English	corpus; stylometry; authorship; crawler
Tags	firank_B
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 12/5/2017 05:06.

Abstract

Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We modify data-cleaning techniques for the purposes of stylometry and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
