BAISA, Vít and Vít SUCHOMEL. Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Online. In Seniz Demir, Ilknur Durgar El-Kahlout, Mehmet Ugur Dogan. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, p. 28-32. ISBN 978-2-9517408-7-7.
Basic information
Original name Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
Authors BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution).
Edition Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), p. 28-32, 5 pp. 2012.
Publisher European Language Resources Association (ELRA)
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 60200 6.2 Languages and Literature
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/12:00059944
Organization unit Faculty of Informatics
ISBN 978-2-9517408-7-7
Keywords in English corpus; turkic languages; unsupervised morphological analysis
Tags International impact, Reviewed
Changed by RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 9/4/2013 11:30.
Abstract
In this article we describe six new web corpora for the Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek languages. The data for these corpora was crawled automatically from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. The corpora are only tokenized, since morphological analyzers and disambiguators are not available for these languages (except for Turkish). A subsequent experiment with unsupervised morphological segmentation was carried out on the Turkish corpus, with encouraging results. For evaluation we used the data provided for the MorphoChallenge competition.
Links
LM2010013, research and development project
Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR