Efficiency and Scalability Issues in Metric Access Methods

C 2008

DOHNAL, Vlastislav; Claudio GENNARO and Pavel ZEZULA

Basic information

Original title

Efficiency and Scalability Issues in Metric Access Methods

Title in Czech

Otázky výkonnosti a škálovatelnosti metrických vyhledávacích metod

Authors

DOHNAL, Vlastislav; Claudio GENNARO and Pavel ZEZULA

Edition

1st ed. Berlin, Germany, Computational Intelligence in Medical Informatics, pp. 235-264, 30 pp. Studies in Computational Intelligence, vol. 85, 2008

Publisher

Springer-Verlag Berlin Heidelberg

Other information

Language

English

Type of outcome

Chapter(s) of a specialized book

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Germany

Confidentiality

is not subject to a state or trade secret

RIV identification code

RIV/00216224:14330/08:00024154

Organization unit

Faculty of Informatics

ISBN

978-3-540-75766-5

Keywords in English

similarity search; bioinformatics; scalability; centralized index structure; distributed index structure; metric space; peer-to-peer network; experimental evaluation

Tags

bioinformatics, centralized index structure, DISA, distributed index structure, experimental evaluation, Metric Space, peer-to-peer network, scalability, similarity search

Flags

International impact, Reviewed

Changed: 13 May 2009 14:40, doc. RNDr. Vlastislav Dohnal, Ph.D.

Abstract

In the original language

The metric space paradigm has recently received attention as an important model of similarity in the area of Bioinformatics. Numerous techniques have been proposed to solve similarity (range or nearest-neighbor) queries on collections of data from metric domains. Though important representatives are outlined, this chapter does not try to replace existing comprehensive surveys. The main objective is to explain and demonstrate by experiments that similarity searching is typically an expensive process which does not easily scale to very large volumes of data, so distributed architectures able to exploit parallelism must be employed. After a review of applications using the metric space approach in the field of Bioinformatics, the chapter provides an overview of methods used for creating index structures able to speed up retrieval. In the metric space approach, only pair-wise distances between objects are quantified, and they represent the level of dissimilarity. The key idea of index structures is to partition the data into subsets so that queries are evaluated without examining entire collections, minimizing both the number of distance computations and the number of I/O accesses. These objectives are achieved by exploiting the property of metric spaces called the triangle inequality, which states that if two objects are near a third object, they cannot be too distant from one another. Unfortunately, computational costs are still high, and the linear scalability of single-computer implementations prevents efficient searching in large and ever-growing data files. For these reasons, we describe very recent parallel and distributed similarity search techniques and study the performance of their implementations. Specifically, Section 12.1 presents the metric space approach and its applications in the field of Bioinformatics. Section 12.2 describes some of the most popular centralized disk-based metric indexes. Section 12.3 then concentrates on parallel and distributed access methods which can deal with data collections that for practical purposes can be arbitrarily large, which is typical for Bioinformatics workloads. An experimental evaluation of the presented distributed approaches on real-life data sets is presented in Section 12.4. The chapter concludes in Section 12.5.
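
The pruning idea mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the chapter: it assumes a single reference object (a "pivot"), edit distance on short strings as the metric, and the hypothetical names `edit_distance` and `PivotIndex`. For a query q, pivot p and stored object o, the triangle inequality gives |d(q,p) - d(o,p)| <= d(q,o), so any object whose precomputed pivot distance differs from d(q,p) by more than the query radius can be discarded without computing d(q,o).

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; it satisfies the metric postulates on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


class PivotIndex:
    """Illustrative single-pivot filter (an assumption, not the chapter's index)."""

    def __init__(self, objects: Sequence[str], pivot: str,
                 dist: Callable[[str, str], int]) -> None:
        self.dist = dist
        self.pivot = pivot
        # Distances to the pivot are computed once, at build time.
        self.entries: List[Tuple[str, int]] = [(o, dist(o, pivot)) for o in objects]

    def range_query(self, query: str, radius: int) -> Tuple[List[str], int]:
        """Return objects within `radius` of `query` and the number of
        distance computations performed at query time."""
        d_qp = self.dist(query, self.pivot)
        computed = 1  # the query-to-pivot distance
        results: List[str] = []
        for obj, d_op in self.entries:
            # Lower bound on d(query, obj) from the triangle inequality.
            if abs(d_qp - d_op) > radius:
                continue  # pruned without computing the real distance
            computed += 1
            if self.dist(query, obj) <= radius:
                results.append(obj)
        return results, computed


sequences = ["ACGT", "ACGA", "TTTTTTTT", "ACG", "GGGGGGGGGG"]
index = PivotIndex(sequences, pivot="ACGT", dist=edit_distance)
hits, computed = index.range_query("ACGG", radius=1)
```

In this toy collection the two long sequences are discarded by the pivot bound alone, so the query needs fewer distance computations than a linear scan. Real metric indexes combine many such bounds with data partitioning.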

In Czech

This book chapter deals with similarity search in biological data. We use the metric space as the model of similarity. The work summarizes the current state of knowledge in indexing for both centralized and distributed computing systems.

Links

GP201/07/P240, research and development project

Name: Distributed index structures for similarity search

Investor: Czech Science Foundation, Distributed index structures for similarity search

1ET100300419, research and development project

Name: Intelligent models, algorithms, methods and tools for building the Semantic Web

Investor: Academy of Sciences of the Czech Republic, Intelligent models, algorithms, methods and tools for building the Semantic Web

Cite

DOHNAL, Vlastislav; Claudio GENNARO and Pavel ZEZULA. Efficiency and Scalability Issues in Metric Access Methods. In Computational Intelligence in Medical Informatics. 1st ed. Berlin, Germany: Springer-Verlag Berlin Heidelberg, 2008, pp. 235-264. Studies in Computational Intelligence, vol. 85. ISBN 978-3-540-75766-5.

@inbook{756174,
   author = {Dohnal, Vlastislav and Gennaro, Claudio and Zezula, Pavel},
   address = {Berlin, Germany},
   booktitle = {Computational Intelligence in Medical Informatics},
   edition = {1},
   keywords = {similarity search; bioinformatics; scalability; centralized index structure; distributed index structure; metric space; peer-to-peer network; experimental evaluation},
   language = {eng},
   location = {Berlin, Germany},
   isbn = {978-3-540-75766-5},
   pages = {235-264},
   publisher = {Springer-Verlag Berlin Heidelberg},
   title = {Efficiency and Scalability Issues in Metric Access Methods},
   year = {2008}
}

TY  - CHAP
ID  - 756174
AU  - Dohnal, Vlastislav
AU  - Gennaro, Claudio
AU  - Zezula, Pavel
PY  - 2008
TI  - Efficiency and Scalability Issues in Metric Access Methods
VL  - Studies in Computational Intelligence, vol. 85
PB  - Springer-Verlag Berlin Heidelberg
CY  - Berlin, Germany
SN  - 9783540757665
KW  - similarity search
KW  - bioinformatics
KW  - scalability
KW  - centralized index structure
KW  - distributed index structure
KW  - metric space
KW  - peer-to-peer network
KW  - experimental evaluation
N2  - The metric space paradigm has recently received attention as an important model of similarity in the area of Bioinformatics. Numerous techniques have been proposed to solve similarity (range or nearest-neighbor) queries on collections of data from metric domains. Though important representatives are outlined, this chapter does not try to replace existing comprehensive surveys. The main objective is to explain and demonstrate by experiments that similarity searching is typically an expensive process which does not easily scale to very large volumes of data, so distributed architectures able to exploit parallelism must be employed. After a review of applications using the metric space approach in the field of Bioinformatics, the chapter provides an overview of methods used for creating index structures able to speed up retrieval. In the metric space approach, only pair-wise distances between objects are quantified, and they represent the level of dissimilarity. The key idea of index structures is to partition the data into subsets so that queries are evaluated without examining entire collections, minimizing both the number of distance computations and the number of I/O accesses. These objectives are achieved by exploiting the property of metric spaces called the triangle inequality, which states that if two objects are near a third object, they cannot be too distant from one another. Unfortunately, computational costs are still high, and the linear scalability of single-computer implementations prevents efficient searching in large and ever-growing data files. For these reasons, we describe very recent parallel and distributed similarity search techniques and study the performance of their implementations. Specifically, Section 12.1 presents the metric space approach and its applications in the field of Bioinformatics. Section 12.2 describes some of the most popular centralized disk-based metric indexes. Section 12.3 then concentrates on parallel and distributed access methods which can deal with data collections that for practical purposes can be arbitrarily large, which is typical for Bioinformatics workloads. An experimental evaluation of the presented distributed approaches on real-life data sets is presented in Section 12.4. The chapter concludes in Section 12.5.
ER  -

DOHNAL, Vlastislav; Claudio GENNARO and Pavel ZEZULA. Efficiency and Scalability Issues in Metric Access Methods. In \textit{Computational Intelligence in Medical Informatics}. 1st ed. Berlin, Germany: Springer-Verlag Berlin Heidelberg, 2008, pp.~235--264. Studies in Computational Intelligence, vol. 85. ISBN~978-3-540-75766-5.
