DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák, Pavel Rychlý. Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Brno: Tribun EU, 2012, p. 91-94. ISBN 978-80-263-0313-8.
Basic information
Original name Towards 100M Morphologically Annotated Corpus of Tajik
Authors DOVUDOV, Gulshan (762 Tajikistan, belonging to the institution), Vít SUCHOMEL (203 Czech Republic, belonging to the institution) and Pavel ŠMERK (203 Czech Republic, guarantor, belonging to the institution).
Edition Brno, Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012, p. 91-94, 4 pp. 2012.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 60200 6.2 Languages and Literature
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
RIV identification code RIV/00216224:14330/12:00064722
Organization unit Faculty of Informatics
ISBN 978-80-263-0313-8
Keywords in English web corpora; Tajik
Tags International impact, Reviewed
Changed by RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 25/5/2021 19:21.
Abstract
The paper presents work in progress: building a morphologically annotated corpus of the Tajik language with more than 100 million tokens. The corpus is and will be by far the largest available computer corpus of Tajik: even at its current size it contains almost 85 million tokens. Because available text sources are rather scarce, lower-quality texts also have to be included to reach this goal. This short paper briefly reviews the current state of the corpus and the analyzer, discusses the problems of either “normalizing” or at least categorizing low-quality texts, and finally outlines the prospects for the near future.
Links
LM2010013, research and development project
Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR