BAISA, Vít. Character-based Language Model. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, p. 3-10. ISSN 2336-4289.
Basic information
Original name Character-based Language Model
Authors BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 3-10, 8 pp. 2014.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 60200 6.2 Languages and Literature
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
RIV identification code RIV/00216224:14330/14:00077506
Organization unit Faculty of Informatics
ISSN 2336-4289
UT WoS 000374560500001
Keywords in English language model; suffix array; LCP; trie; character-based; random text generator; corpus
Tags International impact, Reviewed
Changed by Mgr. et Mgr. Vít Baisa, Ph.D., učo 139654. Changed: 27/5/2021 09:10.
Abstract
Language modelling and other natural language processing tasks are usually based on words. I present a more general yet simpler approach to language modelling that uses much smaller units of text data: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate the model using standard measures (entropy, perplexity). As a proof of concept and an extrinsic evaluation, I also present a random sentence generator based on this model.
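To illustrate the measures named in the abstract, here is a minimal sketch of a character-level model. It is not the paper's method: the keywords indicate a suffix array, LCP and trie structure, whereas this sketch uses plain fixed-order character counts with add-one smoothing. The function names, the "^" padding symbol, the `order` and `alphabet_size` parameters and the toy text are all assumptions made for the illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train(text, order=3):
    """Count which character follows each context of `order` characters (order >= 1)."""
    counts = defaultdict(Counter)
    padded = "^" * order + text  # "^" is an assumed start-of-text padding symbol
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts


def cross_entropy(counts, text, order=3, alphabet_size=256):
    """Average bits per character of `text`, using add-one smoothing (an assumption)."""
    padded = "^" * order + text
    total_bits = 0.0
    for i in range(order, len(padded)):
        followers = counts.get(padded[i - order:i], Counter())
        prob = (followers[padded[i]] + 1) / (sum(followers.values()) + alphabet_size)
        total_bits -= math.log2(prob)
    return total_bits / len(text)


def perplexity(entropy_bits):
    """Perplexity is 2 raised to the per-character entropy in bits."""
    return 2 ** entropy_bits


def generate(counts, length=40, order=3, seed=0):
    """Sample characters one at a time, each conditioned on the previous `order` characters."""
    rng = random.Random(seed)
    context = "^" * order
    out = []
    for _ in range(length):
        followers = counts.get(context)
        if not followers:
            break  # unseen context: stop rather than back off
        chars, weights = zip(*followers.items())
        ch = rng.choices(chars, weights=weights)[0]
        out.append(ch)
        context = (context + ch)[-order:]
    return "".join(out)


if __name__ == "__main__":
    model = train("abababab", order=1)
    h = cross_entropy(model, "abab", order=1, alphabet_size=2)
    print(h, perplexity(h))
    print(generate(model, length=6, order=1))
```

Lower entropy (and therefore lower perplexity) on held-out text means the model predicts the next character better. The generator follows the same counts, so its output shows informally what the model has learned.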
Links
LM2010013, research and development project. Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR