Set of Ethiopian Web Corpora

SUCHOMEL, Vít and Pavel RYCHLÝ. Set of Ethiopian Web Corpora. Online. 2016 [cited 2024-04-24].

Basic information
Original title	Set of Ethiopian Web Corpora
Authors	SUCHOMEL, Vít (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution)
Edition	2016.

Other information
Original language	English
Type of outcome	Software
Field of study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality	not subject to a state or trade secret
WWW	URL
RIV identification code	RIV/00216224:14330/16:00096851
Organization unit	Faculty of Informatics
Keywords in English	text corpora; Ethiopian languages
Technical parameters	Amharic WIC corpus, 200 thousand tokens; amWaC16 Amharic corpus, 20 million tokens; orWaC16 Oromo corpus, 5.1 million tokens; soWaC16 Somali corpus, 80 million tokens; tiWaC16 Tigrinya corpus, 2.5 million tokens.
Changed by	doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 1 June 2017, 15:52.

Abstract

A set of five corpora for four Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed existing corpus with part-of-speech annotation. The released version includes cleaning (especially of numeric expressions) and unification of two versions in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).

Links
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Harvesting big text data for under-resourced languages
