Informační systém MU
ŘEHŮŘEK, Radim and Milan KOLKUS. Language Identification on the Web: Extending the Dictionary Method. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. 1st ed. Mexico City, Mexico: Springer-Verlag, 2009, p. 357-368. ISBN 978-3-642-00381-3. Available from: https://dx.doi.org/10.1007/978-3-642-00382-0_29.
Basic information
Original name Language Identification on the Web: Extending the Dictionary Method
Name in Czech Language Identification on the Web: Extending the Dictionary Method
Authors ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia).
Edition 1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. p. 357-368, 12 pp. 2009.
Publisher Springer-Verlag
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Mexico
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
Impact factor 0.402 in 2005
RIV identification code RIV/00216224:14330/09:00067120
Organization unit Faculty of Informatics
ISBN 978-3-642-00381-3
ISSN 0302-9743
Doi http://dx.doi.org/10.1007/978-3-642-00382-0_29
UT WoS 000265681200029
Keywords in English machine learning; language segmentation; language identification
Tags language identification, language segmentation, machine learning
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30/4/2014 05:56.
Abstract
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov processes or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include identifying the language of very short texts and correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multiple-language documents.
Links
LC536, research and development project
Name: Centrum komputační lingvistiky
Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky