Metric index: an efficient and scalable solution for precise
and approximate similarity search

NOVÁK, David, Michal BATKO and Pavel ZEZULA. Metric index: an efficient and scalable solution for precise and approximate similarity search. Information Systems. Elsevier, 2011, vol. 36, No 4, p. 721-733. ISSN 0306-4379. Available from: https://dx.doi.org/10.1016/j.is.2010.10.002.

Other formats: BibTeX LaTeX RIS

TY  - JOUR
ID  - 905348
AU  - Novák, David - Batko, Michal - Zezula, Pavel
PY  - 2011
TI  - Metric index: an efficient and scalable solution for precise and approximate similarity search
JF  - Information Systems
VL  - 36
IS  - 4
SP  - 721-733
EP  - 721-733
PB  - Elsevier
SN  - 03064379
KW  - Metric space
KW  - Similarity search
KW  - Data structure
KW  - Approximation
KW  - Scalability
N2  - Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management is still a resisting research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering, thus reaching high search performance while having constant building costs per object. The heart of the M-Index is a general mapping mechanism that enables to actually store the data in established structures such as the B+ - tree or even in a distributed storage. We implemented the M-Index with the B+ - tree and performed experiments on two datasets - the first is an artificial set of vectors and the other is a real-life dataset composed of a combination of five MPEG-7 visual descriptors extracted from a database of up to several million digital images. The experiments put several M-Index variants under test and compare them with established techniques for both precise and approximate similarity search. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Further, the M-Index demonstrates excellent ability to keep similar data close in the index which makes its approximation algorithm very efficient - maintaining practically constant response times while preserving a very high recall as the dataset grows and even beating approaches designed purely for approximate search.
ER  -

Basic information
Original name	Metric index: an efficient and scalable solution for precise and approximate similarity search
Name in Czech	Metric index: efektivní a škálovatelné řešení pro přesné i aproximované podobnostní vyhledávání
Authors	NOVÁK, David (203 Czech Republic, guarantor, belonging to the institution), Michal BATKO (203 Czech Republic, belonging to the institution) and Pavel ZEZULA (203 Czech Republic, belonging to the institution).
Edition	Information Systems, Elsevier, 2011, 0306-4379.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Netherlands
Confidentiality degree	is not subject to a state or trade secret
Impact factor	Impact factor: 1.198
RIV identification code	RIV/00216224:14330/11:00073198
Organization unit	Faculty of Informatics
Doi	http://dx.doi.org/10.1016/j.is.2010.10.002
UT WoS	000289395000003
Keywords in English	Metric space; Similarity search; Data structure; Approximation; Scalability
Tags	DISA
Tags	International impact, Reviewed
Changed by	Changed by: RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 23/5/2015 11:53.

Abstract

Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management is still a resisting research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering, thus reaching high search performance while having constant building costs per object. The heart of the M-Index is a general mapping mechanism that enables to actually store the data in established structures such as the B+ - tree or even in a distributed storage. We implemented the M-Index with the B+ - tree and performed experiments on two datasets - the first is an artificial set of vectors and the other is a real-life dataset composed of a combination of five MPEG-7 visual descriptors extracted from a database of up to several million digital images. The experiments put several M-Index variants under test and compare them with established techniques for both precise and approximate similarity search. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Further, the M-Index demonstrates excellent ability to keep similar data close in the index which makes its approximation algorithm very efficient - maintaining practically constant response times while preserving a very high recall as the dataset grows and even beating approaches designed purely for approximate search.

Abstract (in Czech)

Metrický prostor je univerzální a flexibilní model podobností, kterký může být aplikován v různých oblastech zpacování informací. Představujeme nový indexační a vyhledávací mechanismus M-Index, který využívá prakticky všechny známé principy metrického dělení, prořezávání a filtrování a tak dosahuje vysoké vyhledávací účinnosti a současně má konstantní náklady na vložení jednoho objektu.

Links
GAP103/10/0886, research and development project	Name: Vizuální vyhledávání obrázků na Webu (Acronym: VisualWeb)
GAP103/10/0886, research and development project	Investor: Czech Science Foundation, Content-based Image Retrieval on the Web Scale
GPP202/10/P220, research and development project	Name: Podobnostní vyhledávání s konstantní škálovatelností (Acronym: SIM-SCALE)
GPP202/10/P220, research and development project	Investor: Czech Science Foundation
VF20102014004, research and development project	Name: Multimediální analýza (Acronym: Multimediální analýza)
VF20102014004, research and development project	Investor: Ministry of the Interior of the CR

PrintDisplayed: 14/10/2024 00:42

Metric index: an efficient and scalable solution for precise and approximate similarity search

Other applications