Optimization of Regular Expression Evaluation within the Manatee Corpus Management System

JAKUBÍČEK, Miloš and Pavel RYCHLÝ

Basic information

Original name

Optimization of Regular Expression Evaluation within the Manatee Corpus Management System

Authors

JAKUBÍČEK, Miloš (203 Czech Republic, guarantor, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution)

Edition

Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, p. 37-48, 12 pp. 2014

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/14:00077511

Organization unit

Faculty of Informatics

ISSN

2336-4289

Keywords in English

text corpus; regular expression; Manatee

Abstract

In the original

This paper is concerned with searching large text corpora – electronic collections of texts. These are often queried by means of regular expressions. Such queries go beyond a simple keyword search, which can be evaluated quickly using an inverted index; they are usually processed by third-party regular expression libraries and take significantly more time to evaluate. In this paper we present an index-based approach to optimizing regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions, and that these serve as clues for heuristics that prune the set of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. We show that the proposed approach can significantly speed up regular expression processing, based on an evaluation of a test set of queries executed on a number of billion-word text corpora.
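The general idea of using fixed string portions of a query to prune candidates can be illustrated with a small sketch. This is not the authors' implementation and says nothing about how Manatee stores or queries its indexes; it only assumes a toy in-memory lexicon, a trigram index over it, and a deliberately restricted regular expression syntax. All names (`build_index`, `required_literals`, `search`, the n-gram length `N = 3`) are hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict

N = 3  # n-gram length; an arbitrary choice for this sketch


def ngrams(s, n=N):
    """Return the set of all character n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(lexicon, n=N):
    """Map every n-gram to the set of lexicon positions whose word contains it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for gram in ngrams(word, n):
            index[gram].add(word_id)
    return index


def required_literals(pattern):
    """Extract literal substrings that every match must contain.

    Hypothetical, conservative helper: it gives up (returns []) on
    alternation, groups, classes, braces, escapes and anchors, and it never
    counts a character that is made optional by a following '?' or '*'.
    """
    if any(c in pattern for c in "|()[]{}\\^$"):
        return []
    literals, run = [], ""
    for i, c in enumerate(pattern):
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if c.isalnum() and nxt not in ("?", "*"):
            run += c
            if nxt == "+":  # 'x+' needs one x, but may repeat it: end the run here
                literals.append(run)
                run = ""
        else:
            if run:
                literals.append(run)
            run = ""
    if run:
        literals.append(run)
    return literals


def search(pattern, lexicon, index, n=N):
    """Return lexicon words fully matching pattern, checking only pruned candidates."""
    candidates = None
    for literal in required_literals(pattern):
        for gram in ngrams(literal, n):
            posting = index.get(gram, set())
            candidates = posting if candidates is None else candidates & posting
    if candidates is None:  # no usable clue: fall back to scanning everything
        candidates = range(len(lexicon))
    rx = re.compile(pattern)
    return [lexicon[i] for i in sorted(candidates) if rx.fullmatch(lexicon[i])]


lexicon = ["nation", "station", "national", "stationary", "relation", "rational", "cat"]
index = build_index(lexicon)
matches = search(".*ation", lexicon, index)
```

In this sketch the index lookup only narrows the candidate set; the regular expression library still makes the final decision on each surviving word, so the result is the same as a full scan whenever the extracted literals are truly required by the pattern.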

Links

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the CR

Cite

JAKUBÍČEK, Miloš and Pavel RYCHLÝ. Optimization of Regular Expression Evaluation within the Manatee Corpus Management System. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, p. 37-48. ISSN 2336-4289.

@inproceedings{1210693,
   author = {Jakubíček, Miloš and Rychlý, Pavel},
   address = {Brno},
   booktitle = {Eighth Workshop on Recent Advances in Slavonic Natural Language Processing},
   keywords = {text corpus; regular expression; Manatee},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   pages = {37-48},
   publisher = {Tribun EU},
   title = {Optimization of Regular Expression Evaluation within the Manatee Corpus Management System},
   year = {2014}
}

TY  - CONF
ID  - 1210693
AU  - Jakubíček, Miloš
AU  - Rychlý, Pavel
PY  - 2014
TI  - Optimization of Regular Expression Evaluation within the Manatee Corpus Management System
PB  - Tribun EU
CY  - Brno
KW  - text corpus
KW  - regular expression
KW  - Manatee
N2  - This paper is concerned with searching large text corpora – electronic collections of texts. These are often queried by means of regular expressions. Such queries go beyond a simple keyword search, which can be evaluated quickly using an inverted index; they are usually processed by third-party regular expression libraries and take significantly more time to evaluate. In this paper we present an index-based approach to optimizing regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions, and that these serve as clues for heuristics that prune the set of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. We show that the proposed approach can significantly speed up regular expression processing, based on an evaluation of a test set of queries executed on a number of billion-word text corpora.
ER  -

JAKUBÍČEK, Miloš and Pavel RYCHLÝ. Optimization of Regular Expression Evaluation within the Manatee Corpus Management System. In \textit{Eighth Workshop on Recent Advances in Slavonic Natural Language Processing}. Brno: Tribun EU, 2014, p.~37-48. ISSN~2336-4289.
