Annotated Amharic Corpora

RYCHLÝ, Pavel and Vít SUCHOMEL. Annotated Amharic Corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer International Publishing, 2016, p. 295-302. ISBN 978-3-319-45509-9. Available from: https://dx.doi.org/10.1007/978-3-319-45510-5_34.

Basic information
Original name	Annotated Amharic Corpora
Authors	RYCHLÝ, Pavel (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, guarantor, belonging to the institution).
Edition	Switzerland, Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings, p. 295-302, 8 pp. 2016.
Publisher	Springer International Publishing

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	60200 6.2 Languages and Literature
Country of publisher	Switzerland
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	Full text of the result
Impact factor	0.402 in 2005
RIV identification code	RIV/00216224:14330/16:00088120
Organization unit	Faculty of Informatics
ISBN	978-3-319-45509-9
ISSN	0302-9743
Doi	http://dx.doi.org/10.1007/978-3-319-45510-5_34
UT WoS	000389707400034
Keywords in English	Amharic; text corpus; web corpus; under-resourced language; corpus annotation; morphological tagger
Tags	firank_B
Tags	International impact, Reviewed
Changed by	RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 1/11/2017 11:02.

Abstract

Amharic is an under-resourced language. The paper presents two text corpora. The first is a substantially cleaned version of the existing morphologically annotated WIC Corpus (210,000 words). The second is the largest Amharic text corpus (17 million words), created from Web pages crawled automatically in 2013, 2015 and 2016. It is part-of-speech annotated by a tagger trained and evaluated on the WIC Corpus.

Links
GA15-13277S, research and development project	Name: Hyperintensional logic for natural language analysis (Hyperintensionální logika pro analýzu přirozeného jazyka)
GA15-13277S, research and development project	Investor: Czech Science Foundation
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
