SUCHOMEL, Vít and Pavel RYCHLÝ. Set of Ethiopian Web Corpora. 2016.
Basic information
Original name Set of Ethiopian Web Corpora
Authors SUCHOMEL, Vít (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition 2016.
Other information
Original language English
Type of outcome Software
Field of Study 60200 6.2 Languages and Literature
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
RIV identification code RIV/00216224:14330/16:00096851
Organization unit Faculty of Informatics
Keywords in English text corpora; Ethiopian languages
Technical parameters Amharic WIC corpus, 200 thousand tokens; amWaC16 Amharic corpus, 20 million tokens; orWaC16 Oromo corpus, 5.1 million tokens; soWaC16 Somali corpus, 80 million tokens; tiWaC16 Tigrinya corpus, 2.5 million tokens.
Changed by: doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 1/6/2017 15:52.
Abstract
A set of five corpora for four Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed version of an existing corpus with part-of-speech annotation. The released version includes cleaning (especially of numeric expressions) and unification of two versions in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).
Links
7F14047, research and development project
Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR