Building Corpora for Stylometric Research

ŠVEC, Ján and Jan RYGL

Basic information

Original title

Building Corpora for Stylometric Research

Authors

ŠVEC, Ján (703 Slovakia, belonging to the institution) and Jan RYGL (203 Czech Republic, guarantor, belonging to the institution)

Edition

Germany, Text, Speech, and Dialogue - 19th International Conference, pp. 20-27, 8 pp. 2016

Publisher

Springer International Publishing

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Germany

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Impact factor

Impact factor: 0.402 in 2005

RIV code

RIV/00216224:14330/16:00090841

Organization unit

Faculty of Informatics

ISBN

978-3-319-45509-9

ISSN

DOI

http://dx.doi.org/10.1007/978-3-319-45510-5_3

UT WoS

000389707400003

Keywords in Czech

korpus; stylometrie; autorství; crawler

Keywords in English

corpus; stylometry; authorship; crawler

Tags

firank_B

Flags

International impact, Reviewed

Changed: 12 May 2017 05:06, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original language

Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm to build corpora containing the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.

Cite

ŠVEC, Ján and Jan RYGL. Building Corpora for Stylometric Research. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. Text, Speech, and Dialogue - 19th International Conference. Germany: Springer International Publishing, 2016, pp. 20-27. ISBN 978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5_3.

@inproceedings{1354197,
   author = {Švec, Ján and Rygl, Jan},
   address = {Germany},
   booktitle = {Text, Speech, and Dialogue - 19th International Conference},
   doi = {10.1007/978-3-319-45510-5_3},
   editor = {Petr Sojka and Aleš Horák and Ivan Kopeček and Karel Pala},
   keywords = {corpus; stylometry; authorship; crawler},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Germany},
   isbn = {978-3-319-45509-9},
   pages = {20--27},
   publisher = {Springer International Publishing},
   title = {Building Corpora for Stylometric Research},
   year = {2016}
}

TY  - CONF
ID  - 1354197
AU  - Švec, Ján
AU  - Rygl, Jan
PY  - 2016
TI  - Building Corpora for Stylometric Research
PB  - Springer International Publishing
CY  - Germany
SN  - 9783319455099
KW  - corpus
KW  - stylometry
KW  - authorship
KW  - crawler
N2  - Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm to build corpora containing the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
ER  -

ŠVEC, Ján and Jan RYGL. Building Corpora for Stylometric Research. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. \textit{Text, Speech, and Dialogue - 19th International Conference}. Germany: Springer International Publishing, 2016, pp.~20-27. ISBN~978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5\_{}3.
