Detailed publication record

ŘEHŮŘEK, Radim and Milan KOLKUS. Language Identification on the Web: Extending the Dictionary Method. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. 1st ed. Mexico City, Mexico: Springer-Verlag, 2009, pp. 357-368. ISBN 978-3-642-00381-3. Available from: https://dx.doi.org/10.1007/978-3-642-00382-0_29.

Basic information
Original title	Language Identification on the Web: Extending the Dictionary Method
Title in Czech	Language Identification on the Web: Extending the Dictionary Method
Authors	ŘEHŮŘEK, Radim (203 Czech Republic, guarantor, belonging to the institution) and Milan KOLKUS (703 Slovakia).
Edition	1st ed. Mexico City, Mexico, Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. pp. 357-368, 12 pp. 2009.
Publisher	Springer-Verlag

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Mexico
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	conference website paper URL
Impact factor	0.402 in 2005
RIV identification code	RIV/00216224:14330/09:00067120
Organization unit	Faculty of Informatics
ISBN	978-3-642-00381-3
ISSN	0302-9743
Doi	http://dx.doi.org/10.1007/978-3-642-00382-0_29
UT WoS	000265681200029
Keywords in English	machine learning; language segmentation; language identification
Tags	language identification, language segmentation, machine learning
Flags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 30. 4. 2014 05:56.

Abstract
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correct handling of texts of unknown language, and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in the case of multiple-language documents.

Links
LC536, research and development project	Name: Centrum komputační lingvistiky (Centre for Computational Linguistics)
LC536, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Centrum komputační lingvistiky