D 2009

Language Identification on the Web: Extending the Dictionary Method

ŘEHŮŘEK, Radim and Milan KOLKUS

Basic information

Original name

Language Identification on the Web: Extending the Dictionary Method

Authors

ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia)

Edition

1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. p. 357-368, 12 pp. 2009

Publisher

Springer-Verlag

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Mexico

Confidentiality degree

not subject to state or trade secrecy

Publication form

printed version "print"

Impact factor

0.402 in 2005

RIV identification code

RIV/00216224:14330/09:00067120

Organization unit

Faculty of Informatics

ISBN

978-3-642-00381-3

ISSN

UT WoS

000265681200029

Keywords in English

machine learning; language segmentation; language identification

Tags

International impact, Reviewed
Changed: 30/4/2014 05:56, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov Processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts comprised of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents.

Links

LC536, research and development project
Name: Centrum komputační lingvistiky
Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky