JAKUBÍČEK, Miloš and Pavel RYCHLÝ. Optimization of Regular Expression Evaluation within the Manatee Corpus Management System. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, p. 37-48. ISSN 2336-4289.
Basic information
Original name Optimization of Regular Expression Evaluation within the Manatee Corpus Management System
Authors JAKUBÍČEK, Miloš (203 Czech Republic, guarantor, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 37-48, 12 pp. 2014.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
RIV identification code RIV/00216224:14330/14:00077511
Organization unit Faculty of Informatics
ISSN 2336-4289
Keywords in English text corpus; regular expression; Manatee
Tags International impact, Reviewed
Changed by: Mgr. Lucia Kocincová, učo 374080. Changed: 28/11/2014 06:56.
Abstract
This paper is concerned with searching large text corpora, i.e. electronic collections of texts. Often these are subject to queries specified by means of regular expressions. Such queries go beyond a simple keyword search that can be quickly evaluated using an inverted index; instead, they are usually processed by third-party regular expression libraries and take significantly more time to evaluate. In this paper we present an index-based approach for optimization of regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions, and that these clues can be used to develop heuristics that prune the number of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. We show that the proposed approach can significantly speed up regular expression processing by evaluating it on a test set of queries executed on a number of billion-word text corpora.
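The general idea behind the abstract can be illustrated with a minimal Python sketch. It is not the Manatee implementation: the function names (build_ngram_index, literal_runs, search), the choice of character trigrams, whole-word matching via re.fullmatch, and the very conservative handling of regular expression syntax are all assumptions made for illustration. The sketch indexes a small lexicon by n-grams, extracts the fixed string portions of a pattern, and runs the regular expression only on the lexicon entries that contain every n-gram of those portions.

```python
import re
from collections import defaultdict


def build_ngram_index(lexicon, n=3):
    """Map each character n-gram to the set of lexicon ids containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for i in range(len(word) - n + 1):
            index[word[i:i + n]].add(word_id)
    return index


def literal_runs(pattern):
    """Return fixed string portions that every match must contain.

    Hypothetical, conservative helper: it returns [] (no pruning) for
    patterns using alternation, groups, braces or backslashes.
    """
    if any(ch in pattern for ch in "|(){}\\"):
        return []
    runs, current = [], []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Skip a simple character class; give up if it is unterminated.
            end = pattern.find("]", i + 1)
            if end == -1:
                return []
            i = end
        elif ch == "]":
            return []  # stray bracket: too complex for this sketch
        elif ch in "*?":
            # The preceding character is optional, so it is not fixed.
            if current:
                current.pop()
        elif ch not in ".^$+":
            current.append(ch)  # plain literal character
            i += 1
            continue
        # Any metacharacter ends the current literal run.
        if current:
            runs.append("".join(current))
        current = []
        i += 1
    if current:
        runs.append("".join(current))
    return runs


def search(pattern, lexicon, index, n=3):
    """Run the regex only on entries that contain all required n-grams."""
    regex = re.compile(pattern)
    candidates = None
    for run in literal_runs(pattern):
        for i in range(len(run) - n + 1):
            ids = index.get(run[i:i + n], set())
            candidates = ids if candidates is None else candidates & ids
    if candidates is None:
        # No usable fixed portion: fall back to scanning the whole lexicon.
        candidates = range(len(lexicon))
    return sorted(lexicon[i] for i in candidates if regex.fullmatch(lexicon[i]))


lexicon = ["optimization", "optimisation", "option", "evaluation", "regular", "nation"]
index = build_ngram_index(lexicon)
print(search(".*ation", lexicon, index))
print(search("optimi[sz]ation", lexicon, index))
```

For the pattern .*ation, the fixed portion "ation" yields the trigrams "ati", "tio" and "ion", so entries such as "option" and "regular" are discarded before the regular expression engine is invoked at all. Patterns with no sufficiently long fixed portion simply fall back to a full scan, so the pruning step never changes the result set.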
Links
LM2010013, research and development project
Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR