2014
Text Tokenisation Using unitok
SUCHOMEL, Vít; Jan MICHELFEIT and Jan POMIKÁLEK
Basic information
Original name
Text Tokenisation Using unitok
Authors
SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution); Jan MICHELFEIT (203 Czech Republic, belonging to the institution) and Jan POMIKÁLEK (203 Czech Republic, belonging to the institution)
Edition
Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 71-75, 5 pp. 2014
Publisher
Tribun EU
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
References:
RIV identification code
RIV/00216224:14330/14:00077514
Organization unit
Faculty of Informatics
ISSN
UT WoS
000374560500009
Keywords in English
tokenisation; corpus tool
Tags
International impact
Changed: 25/5/2021 19:20, RNDr. Vít Suchomel, Ph.D.
Abstract
In the original language
This paper presents unitok, a tool for tokenising text in many languages. Although a simple idea, exploiting spaces in the text to separate tokens, works well most of the time, the remaining cases are quite complicated and language dependent, and require special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some tokenisation cases specific to particular languages or to web data. The rule for deciding what counts as a token is briefly described. The tool is compared to two other tokenisers in terms of output token count and tokenising speed. unitok is publicly available under the GPL licence at http://corpus.tools.
Links
LM2010013, research and development project
7F14047, research and development project