Detailed Information on Publication Record

ŘEHŮŘEK, Radim and Milan KOLKUS. Language Identification on the Web: Extending the Dictionary Method. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. 1st ed. Mexico City, Mexico: Springer-Verlag, 2009, p. 357-368. ISBN 978-3-642-00381-3. Available from: https://dx.doi.org/10.1007/978-3-642-00382-0_29.

Basic information
Original name	Language Identification on the Web: Extending the Dictionary Method
Name in Czech	Language Identification on the Web: Extending the Dictionary Method
Authors	ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia).
Edition	1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. p. 357-368, 12 pp. 2009.
Publisher	Springer-Verlag

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Mexico
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
Impact factor	0.402 in 2005
RIV identification code	RIV/00216224:14330/09:00067120
Organization unit	Faculty of Informatics
ISBN	978-3-642-00381-3
ISSN	0302-9743
Doi	http://dx.doi.org/10.1007/978-3-642-00382-0_29
UT WoS	000265681200029
Keywords in English	machine learning; language segmentation; language identification
Tags	language identification, language segmentation, machine learning
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30/4/2014 05:56.

Abstract
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov processes or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts and the correct handling of texts of unknown language and of texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.

Links
LC536, research and development project	Name: Centrum komputační lingvistiky
LC536, research and development project	Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky