SUCHOMEL, Vít. Discriminating Between Similar Languages Using Large Web Corpora. In Horák, Aleš and Rychlý, Pavel and Rambousek, Adam. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019, p. 129-135. ISBN 978-80-263-1530-8.
Basic information
Original name Discriminating Between Similar Languages Using Large Web Corpora
Authors SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 129-135, 7 pp. 2019.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
RIV identification code RIV/00216224:14330/19:00111666
Organization unit Faculty of Informatics
ISBN 978-80-263-1530-8
ISSN 2336-4289
UT WoS 000604899800015
Keywords in English language identification; discriminating similar languages; building web corpora
Changed by Mgr. Michal Petr, učo 65024. Changed: 16/5/2022 15:28.
Abstract
This paper presents a method for discriminating similar languages based on wordlists from large web corpora. The main benefits of the approach are language independence, a measure of confidence of the classification and an easy-to-maintain implementation. The method is evaluated on the VarDial 2014 workshop data set. The resulting accuracy is comparable to that of other methods that performed successfully at the workshop. A tool implementing the method in Python can be obtained from the web site http://corpus.tools/.
Links
LM2015071, research and development project
Name: Jazyková výzkumná infrastruktura v České republice (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR