Slavonic Corpus for Stylometry Research

ŠVEC, Ján and Jan RYGL. Slavonic Corpus for Stylometry Research. In Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. 1st ed. Brno (Czech Republic): Tribun EU, 2015, p. 11-21. ISBN 978-80-263-0974-1.

Basic information
Original name	Slavonic Corpus for Stylometry Research
Authors	ŠVEC, Ján (703 Slovakia, guarantor, belonging to the institution) and Jan RYGL (203 Czech Republic, belonging to the institution).
Edition	1st ed. Brno (Czech Republic), Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015, p. 11-21, 11 pp. 2015.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	conference page article
RIV identification code	RIV/00216224:14330/15:00085135
Organization unit	Faculty of Informatics
ISBN	978-80-263-0974-1
ISSN	2336-4289
Keywords in English	stylometry; slavonic corpus; web structure detection; corpora building
Changed by	RNDr. Jan Rygl, učo 208072. Changed: 7/6/2021 17:57.

Abstract

Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
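
As a rough illustration of what a heuristic layer for meta-information might look like, the following minimal sketch pulls a candidate author, publication time and heading out of a single HTML page. It is not the ACB implementation: the class name `MetaHeuristics`, the function `extract_meta`, and the three rules (an `author` meta tag, an `article:published_time` meta tag, the first `h1` element) are assumptions chosen for the example, and document-border detection is left out.

```python
from html.parser import HTMLParser


class MetaHeuristics(HTMLParser):
    """Hypothetical heuristic layer: collects author, time and heading candidates."""

    def __init__(self):
        super().__init__()
        self.meta = {"author": None, "published": None, "heading": None}
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumed rule: trust explicit <meta> tags, first occurrence wins.
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key == "author" and content and self.meta["author"] is None:
                self.meta["author"] = content
            elif (key == "article:published_time" and content
                  and self.meta["published"] is None):
                self.meta["published"] = content
        elif tag == "h1" and self.meta["heading"] is None:
            # Assumed rule: the first <h1> is the document heading.
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            # Collapse runs of whitespace inside the heading text.
            self.meta["heading"] = " ".join("".join(self._h1_parts).split())


def extract_meta(html):
    """Return a dict of meta-information candidates found in one HTML page."""
    parser = MetaHeuristics()
    parser.feed(html)
    parser.close()
    return parser.meta
```

Calling `extract_meta(page_source)` on a crawled page returns a dictionary with the keys `author`, `published` and `heading`, each set to `None` when the corresponding rule finds nothing. A real system would presumably combine many such rules and score competing candidates.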

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
