Text Tokenisation Using unitok

SUCHOMEL, Vít, Jan MICHELFEIT and Jan POMIKÁLEK. Text Tokenisation Using unitok. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 71-75. ISSN 2336-4289.


Basic information
Original title	Text Tokenisation Using unitok
Authors	SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution), Jan MICHELFEIT (203 Czech Republic, belonging to the institution) and Jan POMIKÁLEK (203 Czech Republic, belonging to the institution).
Edition	Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71-75, 5 pp., 2014.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality	not subject to state or trade secret
Publication form	printed version "print"
RIV code	RIV/00216224:14330/14:00077514
Organizational unit	Faculty of Informatics
ISSN	2336-4289
UT WoS	000374560500009
Keywords in English	tokenisation; corpus tool
Tags	International impact
Changed by	RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 25 May 2021, 19:20.

Abstract

This paper presents unitok, a tool for tokenising text in many languages. Although a simple idea, exploiting spaces in the text to separate tokens, works well most of the time, the remaining cases are quite complicated and language dependent, and require special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some tokenisation cases specific to particular languages or to web data. The rule for deciding what to consider a token is briefly described. The tool is compared to two other tokenisers in terms of output token count and tokenising speed. unitok is publicly available under the GPL licence at http://corpus.tools.
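
The contrast the abstract draws between splitting on spaces and the harder cases can be illustrated with a minimal Python sketch. This is not unitok's code or its rule set: the pattern `TOKEN_RE` and the functions `whitespace_tokenise` and `regex_tokenise` are hypothetical names introduced here only to show why punctuation and web data such as URLs need rules beyond whitespace.

```python
import re

# Hypothetical pattern for illustration only; unitok's real rules differ
# and are language dependent.
TOKEN_RE = re.compile(
    r"""
    https?://\S*[^\s.,;:!?)]    # a URL kept whole, minus trailing punctuation
    | \w+(?:[-']\w+)*           # a word, optionally with inner hyphens/apostrophes
    | [^\w\s]                   # any other single non-space character
    """,
    re.VERBOSE,
)


def whitespace_tokenise(text):
    """Baseline: split on runs of whitespace only."""
    return text.split()


def regex_tokenise(text):
    """Sketch: separate punctuation from words, keep URLs as one token."""
    return TOKEN_RE.findall(text)


sample = "See http://corpus.tools."
print(whitespace_tokenise(sample))  # ['See', 'http://corpus.tools.']
print(regex_tokenise(sample))       # ['See', 'http://corpus.tools', '.']
```

With whitespace splitting alone, the sentence-final full stop stays attached to the URL; the sketch separates it while leaving the dot inside the URL untouched.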

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Building and operation of the Czech node of a pan-European research infrastructure
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Harvesting big text data for under-resourced languages
