RYCHLÝ, Pavel and Samuel ŠPALEK. Utok: The Fast Rule-based Tokenizer. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022. Brno: Tribun EU, 2022, p. 149-154. ISBN 978-80-263-1752-4.
Basic information
Original name Utok: The Fast Rule-based Tokenizer
Authors RYCHLÝ, Pavel (203 Czech Republic, guarantor, belonging to the institution) and Samuel ŠPALEK (703 Slovakia, belonging to the institution).
Edition Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 149-154, 6 pp. 2022.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Full text; Workshop homepage
RIV identification code RIV/00216224:14330/22:00127488
Organization unit Faculty of Informatics
ISBN 978-80-263-1752-4
ISSN 2336-4289
Keywords in English tokenizer; tokenization; text processing
Changed by RNDr. Pavel Šmerk, Ph.D., UČO 3880. Changed: 15/5/2024 10:07.
Abstract
Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in its simplicity of configuration for different languages and is much faster.
Links
LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR