ŘEHŮŘEK, Radim and Milan KOLKUS. Language Identification on the Web: Extending the Dictionary Method. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. 1st ed. Mexico City, Mexico: Springer-Verlag, 2009. pp. 357-368, 12 pp. ISBN 978-3-642-00381-3. doi:10.1007/978-3-642-00382-0_29.
Basic information
Original title Language Identification on the Web: Extending the Dictionary Method
Title in Czech Language Identification on the Web: Extending the Dictionary Method
Authors ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, home institution) and Milan KOLKUS (703 Slovakia).
Edition 1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. pp. 357-368, 12 pp. 2009.
Publisher Springer-Verlag
Other information
Original language English
Type of result Paper in proceedings
Field 10201 Computer sciences, information science, bioinformatics
Country of publisher Mexico
Confidentiality Not subject to state or trade secrecy
Publication form Printed version ("print")
Impact factor 0.402 in 2005
RIV code RIV/00216224:14330/09:00067120
Organizational unit Faculty of Informatics
ISBN 978-3-642-00381-3
ISSN 0302-9743
DOI http://dx.doi.org/10.1007/978-3-642-00382-0_29
UT WoS 000265681200029
Keywords in English machine learning; language segmentation; language identification
Tags language identification, language segmentation, machine learning
Flags International significance, Peer-reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30. 4. 2014 05:56.
Abstract
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correct handling of texts in an unknown language, and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.
Related projects
LC536, research and development project. Name: Centrum komputační lingvistiky (Center for Computational Linguistics)
Investor: Ministry of Education, Youth and Sports of the Czech Republic, Basic Research Centers