RYCHLÝ, Pavel, Radoslav RÁBARA and Ondřej HERMAN. Distributed Corpus Search. Online. In Piotr Banski, Marc Kupietz, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Andreas Witt. 6th Workshop on the Challenges in the Management of Large Corpora. Miyazaki, Japan: European Language Resource Association, 2018, p. 10-13. ISBN 979-1-09-554614-6.
Basic information
Original name Distributed Corpus Search
Authors RYCHLÝ, Pavel (203 Czech Republic, belonging to the institution), Radoslav RÁBARA (703 Slovakia, belonging to the institution) and Ondřej HERMAN (203 Czech Republic, guarantor, belonging to the institution).
Edition Miyazaki, Japan, 6th Workshop on the Challenges in the Management of Large Corpora, p. 10-13, 4 pp. 2018.
Publisher European Language Resource Association
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10200 1.2 Computer and information sciences
Country of publisher France
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
WWW CMLC-6 Proceedings
RIV identification code RIV/00216224:14330/18:00101008
Organization unit Faculty of Informatics
ISBN 979-1-09-554614-6
Keywords in English distributed corpus search
Tags International impact, Reviewed
Changed by: doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 23/1/2019 13:48.
Abstract
The amount of available linguistic data is growing fast, and so are the processing requirements. The current trend is towards parallel and distributed systems, but corpus management systems have been slow to follow it. In this article, we describe a work-in-progress distributed corpus management system that runs on a large cluster of commodity machines. The implementation is based on the Manatee corpus management system and is written in the Go language. The features implemented so far are query evaluation, concordance building, concordance sorting and frequency distribution calculation. We evaluate the performance of the distributed system on a cluster of 65 commodity computers and compare it to the old implementation of Manatee. Compared to the old system, the distributed evaluation is 2.4 to 69.2 times faster in the concordance creation task, 56 to 305 times faster in the concordance sorting task and 27 to 614 times faster in the frequency distribution calculation. The results show that the system scales very well.
Links
GA18-23891S, research and development project
Name: Hyperintensionální usuzování nad texty přirozeného jazyka (Hyperintensional Reasoning over Natural Language Texts)
Investor: Czech Science Foundation
LM2015071, research and development project
Name: Jazyková výzkumná infrastruktura v České republice (Language Research Infrastructure in the Czech Republic; Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR
MUNI/A/0854/2017, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace VII. (Large-Scale Computing Systems: Models, Applications and Verification VII)
Investor: Masaryk University, Category A