Detailed Information on Publication Record
2009
Language Identification on the Web: Extending the Dictionary Method
ŘEHŮŘEK, Radim and Milan KOLKUS
Basic information
Original name
Language Identification on the Web: Extending the Dictionary Method
Name in Czech
Language Identification on the Web: Extending the Dictionary Method
Authors
ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia)
Edition
First edition. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. p. 357-368, 12 pp. 2009
Publisher
Springer-Verlag
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Mexico
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
Impact factor
0.402 in 2005
RIV identification code
RIV/00216224:14330/09:00067120
Organization unit
Faculty of Informatics
ISBN
978-3-642-00381-3
ISSN
UT WoS
000265681200029
Keywords in English
machine learning; language segmentation; language identification
Tags
International impact, Reviewed
Changed: 30/4/2014 05:56, RNDr. Pavel Šmerk, Ph.D.
In the original
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include identifying the language of very short texts and correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.
Links
LC536, research and development project