Set of Ethiopian Web Corpora

SUCHOMEL, Vít and Pavel RYCHLÝ. Set of Ethiopian Web Corpora. 2016.


Basic information
Original name	Set of Ethiopian Web Corpora
Authors	SUCHOMEL, Vít (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition	2016.

Other information
Original language	English
Type of outcome	Software
Field of Study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
WWW	URL
RIV identification code	RIV/00216224:14330/16:00096851
Organization unit	Faculty of Informatics
Keywords in English	text corpora; Ethiopian languages
Technical parameters	Amharic WIC corpus, 200 thousand tokens; amWaC16 Amharic corpus, 20 million tokens; orWaC16 Oromo corpus, 5.1 million tokens; soWaC16 Somali corpus, 80 million tokens; tiWaC16 Tigrinya corpus, 2.5 million tokens.
Changed by	doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 1/6/2017 15:52.

Abstract

A set of five corpora for four Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed version of an existing corpus with part-of-speech annotation. The released version has been cleaned (especially numeric expressions) and unifies two versions written in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).

Links
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
