D 2018

Distributed Corpus Search

RYCHLÝ, Pavel, Radoslav RÁBARA and Ondřej HERMAN

Basic information

Original name

Distributed Corpus Search

Authors

RYCHLÝ, Pavel (203 Czech Republic, belonging to the institution), Radoslav RÁBARA (703 Slovakia, belonging to the institution) and Ondřej HERMAN (203 Czech Republic, guarantor, belonging to the institution)

Edition

Miyazaki, Japan, 6th Workshop on the Challenges in the Management of Large Corpora, p. 10-13, 4 pp. 2018

Publisher

European Language Resources Association

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

France

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

RIV identification code

RIV/00216224:14330/18:00101008

Organization unit

Faculty of Informatics

ISBN

979-1-09-554614-6

Keywords in English

distributed corpus search

Tags

International impact, Reviewed
Changed: 23/1/2019 13:48, doc. Mgr. Pavel Rychlý, Ph.D.

Abstract

In the original

The amount of available linguistic data is growing fast, and so are the processing requirements. The current trend is towards parallel and distributed systems, but corpus management systems have been slow to follow it. In this article, we describe a work-in-progress distributed corpus management system that runs on a large cluster of commodity machines. The implementation is based on the Manatee corpus management system and is written in the Go language. The features implemented so far are query evaluation, concordance building, concordance sorting and frequency distribution calculation. We evaluate the performance of the distributed system on a cluster of 65 commodity computers and compare it to the old implementation of Manatee. Compared to the old system, distributed evaluation is 2.4 to 69.2 times faster in the concordance creation task, 56 to 305 times faster in the concordance sorting task and 27 to 614 times faster in the frequency distribution calculation. The results show that the system scales very well.

Links

GA18-23891S, research and development project
Name: Hyperintensional Reasoning over Natural Language Texts
Investor: Czech Science Foundation
LM2015071, research and development project
Name: Language Research Infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR
MUNI/A/0854/2017, internal MU code
Name: Large Computational Systems: Models, Applications and Verification VII.
Investor: Masaryk University, Category A