Large Corpora for Turkic Languages and Unsupervised
Morphological Analysis

BAISA, Vít and Vít SUCHOMEL. Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Online. In Seniz Demir, Ilknur Durgar El-Kahlout, Mehmet Ugur Dogan. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, pp. 28-32. ISBN 978-2-9517408-7-7.

Basic information
Original title	Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
Authors	BAISA, Vít (203 Czech Republic, guarantor, internal) and Vít SUCHOMEL (203 Czech Republic, internal).
Edition	Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 28-32, 5 pp. 2012.
Publisher	European Language Resources Association (ELRA)

Other information
Original language	English
Type of outcome	Paper in proceedings
Field of study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality	not subject to state or trade secrecy
Publication form	electronic version available online
RIV code	RIV/00216224:14330/12:00059944
Organizational unit	Faculty of Informatics
ISBN	978-2-9517408-7-7
Keywords in English	corpus; turkic languages; unsupervised morphological analysis
Tags	International impact, Reviewed
Changed by	Changed by: RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 9. 4. 2013 11:30.

Abstract

In this article we describe six new web corpora for Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek. The data for these corpora was crawled automatically from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. The corpora are only tokenized, since morphological analyzers and disambiguators are not available for these languages (except for Turkish). A subsequent experiment with unsupervised morphological segmentation, carried out on the Turkish corpus, gave encouraging results. For evaluation we used the data provided for the MorphoChallenge competition.

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
