RYCHLÝ, Pavel and Vít SUCHOMEL. Annotated Amharic Corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer International Publishing, 2016, p. 295-302. ISBN 978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5_34.
Basic information
Original name Annotated Amharic Corpora
Authors RYCHLÝ, Pavel (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, guarantor, belonging to the institution).
Edition Switzerland, Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings, p. 295-302, 8 pp. 2016.
Publisher Springer International Publishing
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 60200 6.2 Languages and Literature
Country of publisher Switzerland
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Full text of the result
Impact factor 0.402 in 2005
RIV identification code RIV/00216224:14330/16:00088120
Organization unit Faculty of Informatics
ISBN 978-3-319-45509-9
ISSN 0302-9743
Doi http://dx.doi.org/10.1007/978-3-319-45510-5_34
UT WoS 000389707400034
Keywords in English Amharic; text corpus; web corpus; under-resourced language; corpus annotation; morphological tagger
Tags firank_B
Tags International impact, Reviewed
Changed by RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 1/11/2017 11:02.
Abstract
Amharic is an under-resourced language. The paper presents two text corpora. The first is a substantially cleaned version of the existing morphologically annotated WIC Corpus (210,000 words). The second is the largest Amharic text corpus (17 million words). It was created from Web pages automatically crawled in 2013, 2015 and 2016, and it is part-of-speech annotated by a tagger trained and evaluated on the WIC Corpus.
Links
GA15-13277S, research and development project
Name: Hyperintensionální logika pro analýzu přirozeného jazyka (Hyperintensional logic for natural language analysis)
Investor: Czech Science Foundation
7F14047, research and development project
Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR