Large Corpora for Turkic Languages and Unsupervised
Morphological Analysis

BAISA, Vít and Vít SUCHOMEL. Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Online. In Seniz Demir, Ilknur Durgar El-Kahlout, Mehmet Ugur Dogan. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, p. 28-32. ISBN 978-2-9517408-7-7.


Basic information
Original name	Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
Authors	BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution).
Edition	Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), p. 28-32, 5 pp. 2012.
Publisher	European Language Resources Association (ELRA)

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
RIV identification code	RIV/00216224:14330/12:00059944
Organization unit	Faculty of Informatics
ISBN	978-2-9517408-7-7
Keywords in English	corpus; turkic languages; unsupervised morphological analysis
Tags	International impact, Reviewed
Changed by	RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 9/4/2013 11:30.

Abstract

In this article we describe six new web corpora for the Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek languages. The data for these corpora was crawled automatically from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. The corpora are only tokenized, since morphological analyzers and disambiguators are not available for these languages (except for Turkish). A subsequent experiment with unsupervised morphological segmentation was carried out on the Turkish corpus and achieved encouraging results. For evaluation, we used the data provided for the MorphoChallenge competition.

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
