Other formats:
BibTeX
LaTeX
RIS
@inproceedings{1354197,
  author = {Švec, Ján and Rygl, Jan},
  address = {Germany},
  booktitle = {Text, Speech, and Dialogue - 19th International Conference},
  doi = {10.1007/978-3-319-45510-5_3},
  editor = {Petr Sojka and Aleš Horák and Ivan Kopeček and Karel Pala},
  keywords = {corpus; stylometry; authorship; crawler},
  howpublished = {print version},
  language = {eng},
  location = {Germany},
  isbn = {978-3-319-45509-9},
  pages = {20--27},
  publisher = {Springer International Publishing},
  title = {Building Corpora for Stylometric Research},
  year = {2016}
}
TY  - JOUR
ID  - 1354197
AU  - Švec, Ján
AU  - Rygl, Jan
PY  - 2016
TI  - Building Corpora for Stylometric Research
PB  - Springer International Publishing
CY  - Germany
SN  - 9783319455099
KW  - corpus
KW  - stylometry
KW  - authorship
KW  - crawler
N2  - Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are daily used in applications for the most widely used languages. On the other hand, under-represented languages lack data sources usable for stylometry research. In this paper, we propose novel algorithm to build corpora containing meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool Authorship Corpora Builder (ACB). We modify data-cleaning techniques for purposes of stylometry field and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. Collected data have been published and we are planning to build collections for other languages and gradually extend existing ones.
ER  -
ŠVEC, Ján and Jan RYGL. Building Corpora for Stylometric Research. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. \textit{Text, Speech, and Dialogue - 19th International Conference}. Germany: Springer International Publishing, 2016, pp.~20--27. ISBN~978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5\_{}3.