D 2022

Utok: The Fast Rule-based Tokenizer

RYCHLÝ, Pavel and Samuel ŠPALEK

Basic information

Original name

Utok: The Fast Rule-based Tokenizer

Authors

RYCHLÝ, Pavel (203 Czech Republic, guarantor, belonging to the institution) and Samuel ŠPALEK (703 Slovakia, belonging to the institution)

Edition

Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 149-154, 6 pp. 2022

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/22:00127488

Organization unit

Faculty of Informatics

ISBN

978-80-263-1752-4

ISSN

Keywords in English

tokenizer; tokenization; text processing
Changed: 15/5/2024 10:07, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in keeping configuration for different languages simple, while processing text much faster.

Links

LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR