Optimization of Regular Expression Evaluation within the
Manatee Corpus Management System

JAKUBÍČEK, Miloš and Pavel RYCHLÝ. Optimization of Regular Expression Evaluation within the Manatee Corpus Management System. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 37-48. ISSN 2336-4289.


Basic information
Original title	Optimization of Regular Expression Evaluation within the Manatee Corpus Management System
Authors	JAKUBÍČEK, Miloš (203 Czech Republic, guarantor, internal) and Pavel RYCHLÝ (203 Czech Republic, internal).
Edition	Brno, Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 37-48, 12 pp. 2014.
Publisher	Tribun EU

Other information
Original language	English
Type of outcome	Proceedings paper
Field	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality	not subject to state or trade secrecy
Publication form	printed version "print"
RIV code	RIV/00216224:14330/14:00077511
Organizational unit	Faculty of Informatics
ISSN	2336-4289
Keywords in English	text corpus; regular expression; Manatee
Tags	International impact, Reviewed
Changed by	Mgr. Lucia Kocincová, učo 374080. Changed: 28. 11. 2014 06:56.

Abstract

This paper is concerned with searching large text corpora, that is, electronic collections of texts. These are often subject to queries specified by means of regular expressions. Such queries go beyond a simple keyword search, which can be evaluated quickly using an inverted index; they are usually processed by third-party regular expression libraries and take significantly more time to evaluate. In this paper we present an index-based approach to optimizing regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions, and that these portions can serve as clues for heuristics that prune the set of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. An evaluation on a test set of queries executed on a number of billion-word text corpora shows that the proposed approach can significantly speed up regular expression processing.
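
The general idea of pruning candidates through fixed string portions can be illustrated with a minimal Python sketch. It is not the implementation used in Manatee: the lexicon of distinct word forms, the trigram length `N`, and the helper names `build_ngram_index`, `required_literals` and `search` are all assumptions made for illustration. The literal extraction is deliberately conservative and falls back to a full scan for any pattern it cannot analyse safely.

```python
import re
from collections import defaultdict

N = 3  # n-gram length; an assumption for this sketch


def build_ngram_index(lexicon):
    """Map every character n-gram to the set of lexicon ids containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for i in range(len(word) - N + 1):
            index[word[i:i + N]].add(word_id)
    return index


def required_literals(pattern):
    """Return fixed strings every full match must contain, or None if unsure.

    Only patterns built from plain characters and the wildcards
    '.', '.*' and '.+' are analysed; anything else returns None.
    """
    if re.search(r"[\\\[\](){}|?^$]", pattern):
        return None  # escapes, classes, groups, alternation, optionality
    if re.search(r"[^.][*+]", pattern) or re.match(r"[*+]", pattern):
        return None  # a quantifier applied to something other than '.'
    return [part for part in re.split(r"\.[*+]?", pattern) if part]


def search(pattern, lexicon, index):
    """Return the sorted lexicon entries that fully match the pattern."""
    regex = re.compile(pattern)
    candidates = None
    for literal in required_literals(pattern) or []:
        for i in range(len(literal) - N + 1):
            ids = index.get(literal[i:i + N], set())
            candidates = ids if candidates is None else candidates & ids
    if candidates is None:
        # No usable n-gram was found: fall back to scanning everything.
        candidates = range(len(lexicon))
    # The regex library still verifies every surviving candidate.
    return sorted(lexicon[i] for i in candidates if regex.fullmatch(lexicon[i]))


lexicon = ["preparing", "presenting", "printing", "prepare", "ring", "spring"]
index = build_ngram_index(lexicon)
print(search("pre.*ing", lexicon, index))  # ['preparing', 'presenting']
```

For the query `pre.*ing`, the trigrams `pre` and `ing` narrow six lexicon entries down to two before the regular expression library is invoked at all. The n-gram lookup only prunes; it never decides a match by itself, so the result is the same as a full scan.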

Related projects
LM2010013, R&D project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, R&D project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, Project LINDAT-Clarin - Establishment and operation of the Czech node of the pan-European infrastructure for research
