D 2014

Text Tokenisation Using unitok

SUCHOMEL, Vít; Jan MICHELFEIT and Jan POMIKÁLEK

Basic information

Original name

Text Tokenisation Using unitok

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution); Jan MICHELFEIT (203 Czech Republic, belonging to the institution) and Jan POMIKÁLEK (203 Czech Republic, belonging to the institution)

Edition

Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 71-75, 5 pp. 2014

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

RIV identification code

RIV/00216224:14330/14:00077514

Organization unit

Faculty of Informatics

ISSN

UT WoS

000374560500009

Keywords in English

tokenisation; corpus tool

Tags

International impact
Changed: 25/5/2021 19:20, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original language

This paper presents unitok, a tool for tokenising text in many languages. Although the simple idea of splitting text at spaces works well most of the time, the remaining cases are complicated and language-dependent, and require special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some tokenisation cases specific to particular languages or to web data. The rule for deciding what to consider a token is briefly described. The tool is compared to two other tokenisers in terms of output token count and tokenisation speed. unitok is publicly available under the GPL licence at http://corpus.tools.
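To illustrate the contrast the abstract draws, the following minimal Python sketch compares plain splitting at spaces with a small rule-based tokeniser. It is not unitok's code or rule set: the pattern `TOKEN_RE` and the functions `whitespace_tokenise` and `regex_tokenise` are hypothetical names introduced here, and the rules cover only a few web-data cases (URLs, e-mail addresses) and punctuation.

```python
import re

# Hypothetical pattern for illustration only, not unitok's actual rules.
# Alternatives are tried left to right, so web-specific tokens (URLs,
# e-mail addresses) are matched before plain words and punctuation.
TOKEN_RE = re.compile(
    r"""
    https?://\S+[^\s.,;:!?)]        # URL, excluding trailing punctuation
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # e-mail address
  | \w+(?:[-']\w+)*                 # word, allowing inner hyphens and apostrophes
  | [^\w\s]                         # any other single non-space character
    """,
    re.VERBOSE,
)


def whitespace_tokenise(text):
    """Baseline: split at runs of whitespace only."""
    return text.split()


def regex_tokenise(text):
    """Rule-based sketch: return every match of TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


sample = "Visit http://corpus.tools, it's free (GPL)."
print(whitespace_tokenise(sample))
# ['Visit', 'http://corpus.tools,', "it's", 'free', '(GPL).']
print(regex_tokenise(sample))
# ['Visit', 'http://corpus.tools', ',', "it's", 'free', '(', 'GPL', ')', '.']
```

On this sample the baseline yields 5 tokens with punctuation glued to words, while the rule-based sketch yields 9 and keeps the URL intact. Differences of this kind are one reason output token counts vary between tokenisers.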

Links

LM2010013, research and development project
Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR
7F14047, research and development project
Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR