Text Tokenisation Using unitok

D 2014

SUCHOMEL, Vít; Jan MICHELFEIT and Jan POMIKÁLEK

Basic information

Original title

Text Tokenisation Using unitok

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution); Jan MICHELFEIT (203 Czech Republic, belonging to the institution) and Jan POMIKÁLEK (203 Czech Republic, belonging to the institution)

Edition

Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71-75, 5 pp. 2014

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

Links

URL

RIV identification code

RIV/00216224:14330/14:00077514

Organization unit

Faculty of Informatics

ISSN

UT WoS

000374560500009

Keywords in English

tokenisation; corpus tool

Tags

International impact

Changed: 25 May 2021, 19:20, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original language

This paper presents unitok, a tool for tokenising text in many languages. Although a simple idea (exploiting spaces in the text to separate tokens) works well most of the time, the remaining cases are quite complicated and language dependent, and they require special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some tokenisation cases specific to particular languages or to web data. The rule for what to consider a token is briefly described. The tool is compared to two other tokenisers in terms of output token count and tokenising speed. unitok is publicly available under the GPL licence at http://corpus.tools.
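
The contrast the abstract draws between splitting on spaces and handling the harder cases can be illustrated with a minimal Python sketch. This is not unitok's implementation or its rule set: the names TOKEN_RE and tokenise and the three rules (URLs, words with inner hyphens or apostrophes, single punctuation marks) are assumptions chosen only to show why whitespace alone is not enough.

```python
import re

# Hypothetical, illustrative rules (not taken from unitok).
# Alternatives are tried in order, so the URL rule wins over the word rule.
TOKEN_RE = re.compile(
    r"""
    https?://[^\s<>"]+[^\s<>".,;:!?)]   # a URL, without trailing punctuation
    | \w+(?:[-']\w+)*                   # a word, allowing inner hyphens/apostrophes
    | [^\w\s]                           # any other non-space character on its own
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return the list of tokens found in text by the sketch rules above."""
    return TOKEN_RE.findall(text)


sentence = "Visit http://corpus.tools, it's free."

# Splitting on spaces leaves punctuation glued to the neighbouring tokens.
print(sentence.split())
# The rule-based sketch separates punctuation but keeps the URL in one piece.
print(tokenise(sentence))
```

On this sentence, splitting on spaces yields 'http://corpus.tools,' and 'free.' with the punctuation attached, whereas the sketch emits the URL, the comma and the full stop as separate tokens. A real multilingual tokeniser needs many more language-specific and web-specific rules than these three.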

Related projects

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Project LINDAT-Clarin - Building and operating the Czech node of the pan-European research infrastructure

7F14047, research and development project

Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Harvesting big text data for under-resourced languages
