D 2009

Language Identification on the Web: Extending the Dictionary Method

ŘEHŮŘEK, Radim and Milan KOLKUS

Basic information

Original title

Language Identification on the Web: Extending the Dictionary Method

Title in Czech

Language Identification on the Web: Extending the Dictionary Method

Authors

ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia)

Edition

First. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. pp. 357-368, 12 pp. 2009

Publisher

Springer-Verlag

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Mexico

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Impact factor

Impact factor: 0.402 in 2005

RIV identification code

RIV/00216224:14330/09:00067120

Organization unit

Faculty of Informatics

ISBN

978-3-642-00381-3

ISSN

UT WoS

000265681200029

Keywords in English

machine learning; language segmentation; language identification

Tags

International impact, Reviewed
Changed: 30 April 2014 05:56, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov Processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts comprised of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents.
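
For illustration only, the plain dictionary approach that the title refers to could be sketched roughly as below. This is not the method proposed in the paper, which builds its language models from word relevance and is not specified in this record. The word lists in `DICTIONARIES`, the function name `identify_language`, and the `min_hits` threshold are all hypothetical and chosen only to show how short texts and texts of unknown language might be handled.

```python
import re
from collections import Counter

# Hypothetical toy word lists (assumption). A real system would use
# large per-language dictionaries rather than six words each.
DICTIONARIES = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "cs": {"je", "na", "se", "že", "a", "v"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


def identify_language(text, min_hits=2):
    """Return the language whose word list matches the most words in text.

    Returns "unknown" when the best language has fewer than min_hits
    matching words, which covers unknown languages and very short inputs.
    """
    words = re.findall(r"\w+", text.lower())
    hits = Counter()
    for word in words:
        for lang, vocab in DICTIONARIES.items():
            if word in vocab:
                hits[lang] += 1
    if not hits:
        return "unknown"
    lang, count = hits.most_common(1)[0]
    return lang if count >= min_hits else "unknown"
```

With these toy lists, `identify_language("The cat is in the garden")` would be classified as English, while a single-word input falls below `min_hits` and is reported as `"unknown"`. The threshold is a crude stand-in for the rejection of unknown-language text that the abstract lists as a challenge.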

Links

LC536, research and development project
Name: Centrum komputační lingvistiky (Centre for Computational Linguistics)
Investor: Ministry of Education, Youth and Sports of the Czech Republic, Centrum komputační lingvistiky