2009
Language Identification on the Web: Extending the Dictionary Method
ŘEHŮŘEK, Radim and Milan KOLKUS
Basic information
Original title
Language Identification on the Web: Extending the Dictionary Method
Title in Czech
Language Identification on the Web: Extending the Dictionary Method
Authors
ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia)
Edition
1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. pp. 357-368, 12 pp. 2009
Publisher
Springer-Verlag
Other information
Language
English
Type of outcome
Proceedings paper
Field of study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Mexico
Confidentiality
not subject to a state or trade secret
Publication form
printed version "print"
Links
Impact factor
Impact factor: 0.402 in 2005
RIV code
RIV/00216224:14330/09:00067120
Organization unit
Faculty of Informatics
ISBN
978-3-642-00381-3
ISSN
UT WoS
000265681200029
Keywords in English
machine learning; language segmentation; language identification
Tags
International impact, Reviewed
Changed: 30 April 2014, 05:56, RNDr. Pavel Šmerk, Ph.D.
In the original language
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language, and handling texts composed of multiple languages. We propose and evaluate a new method that constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.
Related projects
LC536, R&D project