Large Corpora for Turkic Languages and Unsupervised Morphological Analysis

D 2012

BAISA, Vít and Vít SUCHOMEL

Basic information

Original title

Large Corpora for Turkic Languages and Unsupervised Morphological Analysis

Authors

BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution)

Edition

Istanbul, Turkey, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), p. 28-32, 5 pp. 2012

Publisher

European Language Resources Association (ELRA)

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

60200 6.2 Languages and Literature

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

Links

URL

RIV identification code

RIV/00216224:14330/12:00059944

Organization unit

Faculty of Informatics

ISBN

978-2-9517408-7-7

Keywords in English

corpus; turkic languages; unsupervised morphological analysis

Tags

International impact, Reviewed

Changed: 9 April 2013 11:30, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original language

In this article we describe six new web corpora for Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek languages. The data for these corpora was automatically crawled from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. Corpora are tokenized only since morphological analyzers and disambiguators for these languages are not available (except for Turkish). Subsequent experiment with unsupervised morphological segmentation was carried out on the Turkish corpus. In this experiment we achieved encouraging results. We used data provided for MorphoChallenge competition for the purpose of evaluation.

Related projects

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Project LINDAT-Clarin - Building and operation of the Czech node of the pan-European research infrastructure

Cite

BAISA, Vít and Vít SUCHOMEL. Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Online. In Seniz Demir, Ilknur Durgar El-Kahlout, Mehmet Ugur Dogan. Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, p. 28-32. ISBN 978-2-9517408-7-7.

@inproceedings{982494,
   author = {Baisa, Vít and Suchomel, Vít},
   address = {Istanbul, Turkey},
   booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
   editor = {Seniz Demir and Ilknur Durgar El-Kahlout and Mehmet Ugur Dogan},
   keywords = {corpus; turkic languages; unsupervised morphological analysis},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Istanbul, Turkey},
   isbn = {978-2-9517408-7-7},
   pages = {28-32},
   publisher = {European Language Resources Association (ELRA)},
   title = {Large Corpora for Turkic Languages and Unsupervised Morphological Analysis},
   url = {http://www.lrec-conf.org/proceedings/lrec2012/workshops/02.Turkic%20Languages%20Proceedings.pdf},
   year = {2012}
}

TY  - CONF
ID  - 982494
AU  - Baisa, Vít
AU  - Suchomel, Vít
PY  - 2012
TI  - Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
PB  - European Language Resources Association (ELRA)
CY  - Istanbul, Turkey
SN  - 9782951740877
KW  - corpus
KW  - turkic languages
KW  - unsupervised morphological analysis
UR  - http://www.lrec-conf.org/proceedings/lrec2012/workshops/02.Turkic%20Languages%20Proceedings.pdf
N2  - In this article we describe six new web corpora for Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek languages. The data for these corpora was automatically crawled from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. Corpora are tokenized only since morphological analyzers and disambiguators for these languages are not available (except for Turkish). Subsequent experiment with unsupervised morphological segmentation was carried out on the Turkish corpus. In this experiment we achieved encouraging results. We used data provided for MorphoChallenge competition for the purpose of evaluation.
ER  -

BAISA, Vít and Vít SUCHOMEL. Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Online. In Seniz Demir, Ilknur Durgar El-Kahlout, Mehmet Ugur Dogan. \textit{Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)}. Istanbul, Turkey: European Language Resources Association (ELRA), 2012, p.~28-32. ISBN~978-2-9517408-7-7.
