Other formats:
BibTeX
LaTeX
RIS
@inproceedings{1096012,
  author = {Dovudov, Gulshan and Suchomel, Vít and Šmerk, Pavel},
  address = {Brno},
  booktitle = {Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012},
  editor = {Aleš Horák and Pavel Rychlý},
  keywords = {web corpora; Tajik},
  howpublished = {printed version "print"},
  language = {eng},
  location = {Brno},
  isbn = {978-80-263-0313-8},
  pages = {91--94},
  publisher = {Tribun EU},
  title = {Towards 100M Morphologically Annotated Corpus of Tajik},
  url = {https://nlp.fi.muni.cz/raslan/2012/paper15.pdf},
  year = {2012}
}
TY  - JOUR
ID  - 1096012
AU  - Dovudov, Gulshan
AU  - Suchomel, Vít
AU  - Šmerk, Pavel
PY  - 2012
TI  - Towards 100M Morphologically Annotated Corpus of Tajik
PB  - Tribun EU
CY  - Brno
SN  - 9788026303138
KW  - web corpora
KW  - Tajik
UR  - https://nlp.fi.muni.cz/raslan/2012/paper15.pdf
N2  - The paper presents a work in progress: building morphologically annotated corpus of Tajik language of the size more than 100 million tokens. The corpus is and will be by far the largest available computer corpus of Tajik: even its current size is almost 85 million tokens. Because the available text sources are rather scarce, to achieve the goal also the texts of a lower quality have to be included. This short paper briefly reviews the current state of the corpus and analyzer, discusses problems with either “normalization” or at least categorization of low quality texts and finally also the perspectives for the nearest future.
ER  -
DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák and Pavel Rychlý (eds.). \textit{Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012}. Brno: Tribun EU, 2012, pp.~91--94. ISBN~978-80-263-0313-8.