Detailed Information on Publication Record
2022
Utok: The Fast Rule-based Tokenizer
RYCHLÝ, Pavel and Samuel ŠPALEK
Basic information
Original name
Utok: The Fast Rule-based Tokenizer
Authors
RYCHLÝ, Pavel (203 Czech Republic, guarantor, belonging to the institution) and Samuel ŠPALEK (703 Slovakia, belonging to the institution)
Edition
Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022, p. 149-154, 6 pp. 2022
Publisher
Tribun EU
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10200 1.2 Computer and information sciences
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
References:
RIV identification code
RIV/00216224:14330/22:00127488
Organization unit
Faculty of Informatics
ISBN
978-80-263-1752-4
ISSN
Keywords in English
tokenizer; tokenization; text processing
Changed: 15/5/2024 10:07, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original
Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in being simple to configure for different languages and is much faster.
Links
LM2018101, research and development project