Distributed Corpus Search

D 2018

RYCHLÝ, Pavel, Radoslav RÁBARA and Ondřej HERMAN

Basic information

Original title

Distributed Corpus Search

Authors

RYCHLÝ, Pavel (203 Czech Republic, belonging to the institution), Radoslav RÁBARA (703 Slovakia, belonging to the institution) and Ondřej HERMAN (203 Czech Republic, guarantor, belonging to the institution)

Edition

Miyazaki, Japan, 6th Workshop on the Challenges in the Management of Large Corpora, pp. 10-13, 4 pp., 2018

Publisher

European Language Resource Association

Other information

Language

English

Type of outcome

Paper in proceedings

Field of study

10200 1.2 Computer and information sciences

Country of publisher

France

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available "online"

Links

CMLC-6 proceedings

RIV code

RIV/00216224:14330/18:00101008

Organization unit

Faculty of Informatics

ISBN

979-1-09-554614-6

Keywords in English

distributed corpus search

Tags

International impact, Reviewed

Changed: 23 January 2019 13:48, doc. Mgr. Pavel Rychlý, Ph.D.

Abstract

In the original

The amount of available linguistic data is growing fast, and so are the processing requirements. The current trend is towards parallel and distributed systems, but corpus management systems have been slow to follow it. In this article, we describe a work-in-progress distributed corpus management system that uses a large cluster of commodity machines. The implementation is based on the Manatee corpus management system and is written in the Go language. Currently, the implemented features are query evaluation, concordance building, concordance sorting and frequency distribution calculation. We evaluate the performance of the distributed system on a cluster of 65 commodity computers and compare it to the old implementation of Manatee. Compared to the old system, the distributed evaluation is 2.4 to 69.2 times faster in the concordance creation task, 56 to 305 times faster in the concordance sorting task and 27 to 614 times faster in the frequency distribution calculation. The results show that the system scales very well.

Related projects

GA18-23891S, research and development project

Name: Hyperintensional reasoning over natural language texts

Investor: Czech Science Foundation, Hyperintensional reasoning over natural language texts

LM2015071, research and development project

Name: Language research infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure

MUNI/A/0854/2017, internal MU code

Name: Large-scale computational systems: models, applications and verification VII.

Investor: Masaryk University, Large-scale computational systems: models, applications and verification VII., until 2020, Category A - Specific research - Student research projects

Cite

RYCHLÝ, Pavel, Radoslav RÁBARA and Ondřej HERMAN. Distributed Corpus Search. Online. In Piotr Banski, Marc Kupietz, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Andreas Witt. 6th Workshop on the Challenges in the Management of Large Corpora. Miyazaki, Japan: European Language Resource Association, 2018, pp. 10-13. ISBN 979-1-09-554614-6.

@inproceedings{1420812,
   author = {Rychlý, Pavel and Rábara, Radoslav and Herman, Ondřej},
   address = {Miyazaki, Japan},
   booktitle = {6th Workshop on the Challenges in the Management of Large Corpora},
   editor = {Piotr Banski and Marc Kupietz and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Andreas Witt},
   keywords = {distributed corpus search},
   howpublished = {electronic version available "online"},
   language = {eng},
   location = {Miyazaki, Japan},
   isbn = {979-1-09-554614-6},
   pages = {10-13},
   publisher = {European Language Resource Association},
   title = {Distributed Corpus Search},
   url = {http://lrec-conf.org/workshops/lrec2018/W17/pdf/book_of_proceedings.pdf},
   year = {2018}
}

TY  - JOUR
ID  - 1420812
AU  - Rychlý, Pavel
AU  - Rábara, Radoslav
AU  - Herman, Ondřej
PY  - 2018
TI  - Distributed Corpus Search
PB  - European Language Resource Association
CY  - Miyazaki, Japan
SN  - 9791095546146
KW  - distributed corpus search
UR  - http://lrec-conf.org/workshops/lrec2018/W17/pdf/book_of_proceedings.pdf
L2  - http://lrec-conf.org/workshops/lrec2018/W17/pdf/book_of_proceedings.pdf
N2  - The amount of available linguistic data is growing fast, and so are the processing requirements. The current trend is towards parallel and distributed systems, but corpus management systems have been slow to follow it. In this article, we describe a work-in-progress distributed corpus management system that uses a large cluster of commodity machines. The implementation is based on the Manatee corpus management system and is written in the Go language. Currently, the implemented features are query evaluation, concordance building, concordance sorting and frequency distribution calculation. We evaluate the performance of the distributed system on a cluster of 65 commodity computers and compare it to the old implementation of Manatee. Compared to the old system, the distributed evaluation is 2.4 to 69.2 times faster in the concordance creation task, 56 to 305 times faster in the concordance sorting task and 27 to 614 times faster in the frequency distribution calculation. The results show that the system scales very well.
ER  -

RYCHLÝ, Pavel, Radoslav RÁBARA and Ondřej HERMAN. Distributed Corpus Search. Online. In Piotr Banski, Marc Kupietz, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Andreas Witt. \textit{6th Workshop on the Challenges in the Management of Large Corpora}. Miyazaki, Japan: European Language Resource Association, 2018, pp.~10-13. ISBN~979-1-09-554614-6.