HERMAN, Ondřej, Vít SUCHOMEL, Vít BAISA and Pavel RYCHLÝ. DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model. Online. In Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). Osaka: Association for Natural Language Processing (ANLP), Osaka, Japan, 2016, p. 114-118. ISBN 978-4-87974-716-7.
Basic information
Original name DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model
Authors HERMAN, Ondřej (203 Czech Republic, guarantor, belonging to the institution), Vít SUCHOMEL (203 Czech Republic, belonging to the institution), Vít BAISA (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition Osaka, Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), p. 114-118, 5 pp. 2016.
Publisher Association for Natural Language Processing (ANLP), Osaka, Japan
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/16:00092557
Organization unit Faculty of Informatics
ISBN 978-4-87974-716-7
Keywords in English language discrimination;expectation maximization;language model
Tags best; International impact, Reviewed
Changed by RNDr. Vít Suchomel, Ph.D., učo 139723, on 1/11/2017 12:13.
Abstract
In this paper we investigate two approaches to discriminating similar languages: the expectation-maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. These methods reached accuracies of 86.6% and 88.3%, respectively, on set A of the DSL Shared Task 2016 competition.
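The first approach in the abstract can be illustrated with a generic EM loop for soft language assignment. This is a minimal sketch, not the authors' implementation: it assumes unlabeled sentences, initial per-language word distributions, and a fixed smoothing floor for unseen words; all function and variable names are illustrative.

```python
import math
from collections import defaultdict

def em_language_discrimination(sentences, langs, p_word, priors, iterations=10):
    """Illustrative EM for estimating P(word|language).

    sentences: list of token lists (unlabeled text)
    p_word:    dict lang -> dict word -> P(word|lang), initial estimates
    priors:    dict lang -> P(lang)
    """
    for _ in range(iterations):
        # E-step: softly assign each sentence to languages.
        soft_counts = {l: defaultdict(float) for l in langs}
        lang_mass = {l: 0.0 for l in langs}
        for sent in sentences:
            log_post = {}
            for l in langs:
                lp = math.log(priors[l])
                for w in sent:
                    # Small probability floor for unseen words (assumed smoothing).
                    lp += math.log(p_word[l].get(w, 1e-9))
                log_post[l] = lp
            # Normalize posteriors in log space to avoid underflow.
            m = max(log_post.values())
            post = {l: math.exp(log_post[l] - m) for l in langs}
            z = sum(post.values())
            for l in langs:
                gamma = post[l] / z
                lang_mass[l] += gamma
                for w in sent:
                    soft_counts[l][w] += gamma
        # M-step: re-estimate P(word|lang) and priors from soft counts.
        total = sum(lang_mass.values())
        for l in langs:
            priors[l] = lang_mass[l] / total
            denom = sum(soft_counts[l].values()) or 1.0
            p_word[l] = {w: c / denom for w, c in soft_counts[l].items()}
    return p_word, priors
```

With a slightly asymmetric initialization, each iteration sharpens the word distributions toward the languages that best explain the data, which is the intuition behind using EM to separate closely related languages.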
Links
MUNI/A/0945/2015, internal MU code. Name: Large-scale computing systems: models, applications and verification V.
Investor: Masaryk University, Category A
7F14047, research and development project. Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR