D 2016

DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model

HERMAN, Ondřej, Vít SUCHOMEL, Vít BAISA and Pavel RYCHLÝ

Basic information

Original name

DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model

Authors

HERMAN, Ondřej (203 Czech Republic, guarantor, belonging to the institution), Vít SUCHOMEL (203 Czech Republic, belonging to the institution), Vít BAISA (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution)

Edition

Osaka, Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), p. 114-118, 5 pp. 2016

Publisher

Association for Natural Language Processing (ANLP), Osaka, Japan

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

RIV identification code

RIV/00216224:14330/16:00092557

Organization unit

Faculty of Informatics

ISBN

978-4-87974-716-7

Keywords in English

language discrimination;expectation maximization;language model

Tags

International impact, Reviewed
Changed: 1/11/2017 12:13, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original

In this paper we investigate two approaches to the discrimination of similar languages: the expectation-maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. These methods reached accuracies of 86.6 % and 88.3 %, respectively, on set A of the DSL Shared Task 2016 competition.

Links

MUNI/A/0945/2015, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V. (Large-scale computing systems: models, applications and verification V.)
Investor: Masaryk University, Category A
7F14047, research and development project
Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR