ŠVEC, Ján and Jan RYGL. Building Corpora for Stylometric Research. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. Text, Speech, and Dialogue - 19th International Conference. Germany: Springer International Publishing, 2016, p. 20-27. ISBN 978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5_3.
Basic information
Original name Building Corpora for Stylometric Research
Authors ŠVEC, Ján (703 Slovakia, belonging to the institution) and Jan RYGL (203 Czech Republic, guarantor, belonging to the institution).
Edition Germany, Text, Speech, and Dialogue - 19th International Conference, p. 20-27, 8 pp. 2016.
Publisher Springer International Publishing
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Germany
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
Impact factor 0.402 in 2005
RIV identification code RIV/00216224:14330/16:00090841
Organization unit Faculty of Informatics
ISBN 978-3-319-45509-9
ISSN 0302-9743
Doi http://dx.doi.org/10.1007/978-3-319-45510-5_3
UT WoS 000389707400003
Keywords (in Czech) korpus; stylometrie; autorství; crawler
Keywords in English corpus; stylometry; authorship; crawler
Tags firank_B
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 12/5/2017 05:06.
Abstract
Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the purposes of stylometry and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.