Character-based Language Model

BAISA, Vít. Character-based Language Model. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, p. 3-10. ISSN 2336-4289.


Basic information
Original name	Character-based Language Model
Authors	BAISA, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 3-10, 8 pp. 2014.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	60200 6.2 Languages and Literature
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/14:00077506
Organization unit	Faculty of Informatics
ISSN	2336-4289
UT WoS	000374560500001
Keywords in English	language model; suffix array; LCP; trie; character-based; random text generator; corpus
Tags	International impact, Reviewed
Changed by	Mgr. et Mgr. Vít Baisa, Ph.D., učo 139654. Changed: 27/5/2021 09:10.

Abstract

Language modelling and other natural language processing tasks are usually based on words. I present a more general yet simpler approach to language modelling that uses much smaller units of text: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate the model using standard measures (entropy, perplexity). As a proof of concept and an extrinsic evaluation, I also present a random sentence generator based on this model.
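
To make the evaluation measures and the random generator concrete, here is a minimal sketch of a character-level model. It is not the data structure described in the paper (the keywords point to a suffix array with LCP and a trie); a fixed-order character n-gram with add-one smoothing is swapped in purely for illustration. The function names, the "^" padding symbol, the toy corpus, and the choice of order 3 are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def train(text, order=3):
    """Count which characters follow each context of `order` characters."""
    counts = defaultdict(Counter)
    padded = "^" * order + text  # "^" is a hypothetical start-padding symbol
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts


def prob(counts, context, char, vocab_size):
    """Add-one smoothed estimate of P(char | context)."""
    followers = counts.get(context, Counter())
    return (followers[char] + 1) / (sum(followers.values()) + vocab_size)


def cross_entropy(counts, text, order, vocab_size):
    """Average negative log2 probability per character (bits per character)."""
    padded = "^" * order + text
    total = 0.0
    for i in range(order, len(padded)):
        p = prob(counts, padded[i - order:i], padded[i], vocab_size)
        total -= math.log2(p)
    return total / len(text)


def generate(counts, order, length, seed=0):
    """Sample up to `length` characters, one at a time, from the counts."""
    rng = random.Random(seed)
    out = "^" * order
    for _ in range(length):
        followers = counts.get(out[len(out) - order:])
        if not followers:  # context never seen with a follower: stop early
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out[order:]


# Toy data, assumed for illustration only.
corpus = "the cat sat on the mat. the dog sat on the log. "
vocab_size = len(set(corpus))
model = train(corpus, order=3)

entropy = cross_entropy(model, "the cat sat on the log. ", 3, vocab_size)
perplexity = 2 ** entropy  # perplexity is 2 raised to the per-character entropy
sample = generate(model, order=3, length=40)
```

Perplexity here is per character, so its values are not directly comparable with word-level perplexities reported for word-based models.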

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
