M-Grid: Similarity Searching in Grids

Michal Batko, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic, xbatko@fi.muni.cz
Vlastislav Dohnal, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic, dohnal@fi.muni.cz
Pavel Zezula, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic, zezula@fi.muni.cz

ABSTRACT
The problem of similarity searching is nowadays attracting a lot of attention, because upcoming applications process complex data and traditional exact-match searching is no longer sufficient. Efficient solutions exist, but they are tailored to the needs of specific data domains. General solutions, based on the metric space abstraction, are extensible, but they are designed to operate on a single computer only. Therefore, their scalability is limited and they cannot adapt to different performance requirements. In this paper, we propose a distributed access structure which is fully dynamic and exploits a Grid infrastructure. We study the properties of this structure in numerous experiments. In addition, performance tuning is analyzed with respect to user-specific requirements, which include the maximum response time and the number of queries executed concurrently.

Categories and Subject Descriptors
E.1 [Data Structures]: Distributed data structures; H.2.4 [Database Management - Systems]: Query processing, Multimedia databases; H.3.4 [Information Storage and Retrieval - Systems and Software]: Distributed systems

General Terms
Algorithms, Performance

Keywords
similarity searching, metric space, M-Grid, D-index, performance analysis

1. INTRODUCTION
In traditional databases, the problem of searching for objects that exactly meet specific criteria is very well defined and there are plenty of efficient solutions. However, as new digital-age applications have emerged lately, this concept fails to meet the new requirements. For example, searching for time series with similar development or for groups of customers with common buying patterns in the respective data collections cannot use the traditional approach. Instead, a new searching paradigm is introduced: similarity searching. In this respect, not only objects that perfectly satisfy a posed query in all aspects (which probably do not exist anyway) but also objects that meet the criteria closely enough are retrieved from the database. Methods for efficient evaluation of similarity queries over a data collection exist (see the recent surveys [21, 7]), but they are usually designed for centralized systems where all data are stored and processed on one computer. The problem of these solutions, as shown in [10], is their scalability. Since the response time of these structures grows linearly with the size of the indexed dataset, sooner or later the response time of the system becomes unacceptable for online processing.
Moreover, centralized systems cannot easily adapt to performance requirements such as the maximum query processing time or the number of users accessing the data simultaneously. Therefore, attempts to use distributed infrastructures to parallelize computations have arisen recently. Since the distributed environment is not restricted to one computer only and can employ virtually unlimited resources, the scalability challenge is shifted to another level. Besides, the performance can be tuned, since the load of individual nodes can be divided among additional nodes or the overloaded ones can be replicated. A very recent paper [3] studies the scalability of similarity searching in a peer-to-peer computing environment, and the results show that, given enough resources, the response time is practically unaffected by the volume of indexed data. However, the peer-to-peer paradigm expects the participating computer nodes (which provide the necessary pool of resources) to be independent and equal in functionality. Moreover, the computers are usually not dedicated to a specific application. Thus, we need a more reliable distributed environment where the properties can be guaranteed and the performance tuned as necessary. From this point of view, a Grid infrastructure [13] seems to satisfy our requirements. The parallel resources, both computational and storage, can be requested on demand as necessary and, once assigned, they are dedicated to the application.

In this paper, we provide a similarity searching structure designed for Grid infrastructures. Our technique is based on the D-index [11] structure, which provides a hashing-like centralized similarity index. Since the addressing schema of this method is static and must be provided by the user in advance, we discuss three dynamic variants allowing the addressing to adjust automatically as new data items are stored. Next, we study its performance properties by defining grouping and replication techniques. They help the structure distribute the computational load equally. Finally, we deal with the issue of performance tuning.

Figure 1: The bps function (a) and the combination of two bps functions (b).

2. SIMILARITY SEARCHING
A convenient way to assess similarity between two objects is to apply a metric function which expresses the closeness of objects as a distance, that is, the objects' dissimilarity. A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d, which is non-negative (d(x, y) ≥ 0 with d(x, y) = 0 iff x = y), symmetric (d(x, y) = d(y, x)), and satisfies the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ D). In general, the problem of indexing in metric spaces can be defined as follows: given a set X ⊆ D in the metric space M, preprocess or structure the elements of X so that similarity queries can be answered efficiently. For a query object q ∈ D, two fundamental similarity queries can be defined. A range query R(q, r) retrieves all elements within distance r from q, that is, the set {x ∈ X, d(q, x) ≤ r}. A nearest neighbor query NN_k(q) retrieves the k closest elements to q, that is, a set R ⊆ X such that |R| = k and ∀x ∈ R, y ∈ X − R, d(q, x) ≤ d(q, y). Due to space constraints of this article, we consider only similarity range search operations here.
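To make these definitions concrete, the following minimal Java sketch shows a metric distance interface and a naive sequential evaluation of a range query R(q, r). It is purely illustrative; the class and method names are ours and are not taken from the prototype implementation mentioned later in the paper.

    import java.util.ArrayList;
    import java.util.List;

    // A metric distance over objects of type T: non-negative, symmetric,
    // zero only for identical objects, and satisfying the triangle inequality.
    interface Metric<T> {
        double distance(T x, T y);
    }

    // Naive evaluation of a range query R(q, r): scan the whole collection
    // and keep every object within distance r of the query object q.
    class LinearScan<T> {
        private final List<T> data = new ArrayList<>();
        private final Metric<T> metric;

        LinearScan(Metric<T> metric) {
            this.metric = metric;
        }

        void insert(T object) {
            data.add(object);
        }

        List<T> rangeQuery(T q, double r) {
            List<T> result = new ArrayList<>();
            for (T x : data) {
                if (metric.distance(q, x) <= r) {
                    result.add(x);
                }
            }
            return result;
        }
    }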
In the following, we briefly outline the D-index, on whose ideas we build our approach.

2.1 D-index
The Distance Index (D-index) is designed as a centralized solution for similarity searching. It consists of two parts: an addressing structure and a storage. The addressing structure forms a multi-level structure of ρ-split functions. It addresses a set of search-separable buckets which are organized in levels and which comprise the storage. The D-index supports easy insertion and bounded search costs because at most one bucket needs to be accessed at each level for range queries up to a predefined value of the search radius ρ. At the same time, the applied pivot-based filtering significantly reduces the number of distance computations in the accessed buckets. In the following, we provide a brief overview of the D-index; more details can be found in [14] and the full specification, as well as performance evaluations, are available in [11].

The partitioning principles of the D-index are based on a multiple definition of a mapping function, called the ρ-split function. Figure 1a shows a possible implementation of a ρ-split function, called the ball partitioning split (bps), originally proposed in [20]. This function uses one pivot xv and the medium distance dm to partition a data set into three subsets. The result of the following bps function gives a unique identification of the set to which the object x belongs:

  bps(x) = 0   if d(x, xv) ≤ dm − ρ,
           1   if d(x, xv) > dm + ρ,
           −   otherwise.

The subset of objects characterized by the symbol '−' is called the exclusion set, while the subsets of objects characterized by the symbols 0 and 1 are the separable sets, because any range query with radius not larger than ρ cannot find qualifying objects in both subsets. More separable sets can be obtained as a combination of several bps functions, where the resulting exclusion set is the union of the exclusion sets of the original ρ-split functions. Furthermore, new separable sets are obtained as the intersection of all possible pairs of the separable sets of the original functions. Figure 1b gives an illustration of this idea for the case of two ρ-split functions. The separable sets and the exclusion set form the separable buckets and the exclusion bucket of one level of the D-index structure, respectively. Naturally, the more separable buckets we have, the larger the exclusion bucket is. For a large exclusion bucket, the D-index allows an additional level of splitting by applying a new set of ρ-split functions on the exclusion bucket. The exclusion bucket of the last level forms the exclusion bucket of the whole structure. The ρ-split functions of individual levels should be different but they must use the same ρ. Moreover, by using a different number of ρ-split functions (generally decreasing with the level), the D-index's addressing structure can produce a different number of buckets at individual levels.

2.2 Search Algorithms
Before giving a sketch of the search algorithm, we assume that we have an h-level distance searching index D-index(h, s_1^{m_1,ρ}, ..., s_h^{m_h,ρ}), which is determined by h independent ρ-split functions s_i^{m_i,ρ}, where m_i specifies the number of bps functions combined together. Given a range query Q = R(q, r), we define a simple search algorithm. This algorithm, however, evaluates only limited queries with r ≤ ρ.

Algorithm 1.
  Search
    for i = 1 to h
      return all objects x such that x ∈ Q ∩ B_{i, s_i^{m_i,0}(q)};
    end for
    return all objects x such that x ∈ Q ∩ E_h;

During the elaboration of the algorithm, we manipulate the value of the parameter ρ of the ρ-split functions. When we use ρ = 0, the function s_i^{m_i,0}(q) always returns the identification id of a separable set and cannot evaluate to '−', i.e., the exclusion set. Consequently, one separable bucket B_{i,id} on each level i is determined. Finally, the algorithm also accesses the exclusion bucket E_h of the whole structure. The algorithm requires h + 1 bucket accesses, which forms the upper bound on the search. This behavior is correct because, due to the mathematical properties of the ρ-split functions, precisely defined in [11], an arbitrary object belonging to a separable bucket is at distance at least 2ρ from any object of another separable bucket of the same level. With additional computational effort, a query with r > ρ can also be executed using a generalized range search algorithm [11]. By manipulating the parameter ρ of a ρ-split function again, this algorithm detects situations when the query intersects more separable buckets on the same level and accesses them. However, the number of accesses can still be reduced: i) if the query region is entirely contained in the exclusion set of a level i, no object qualifying the query can be found in the separable buckets of this level and the next level is considered directly; ii) if the query region is completely contained in a separable set of a level i, the following levels, as well as the global exclusion bucket, need not be accessed and the search is terminated. The D-index also supports nearest neighbor(s) queries.

2.3 Critical Study
A query evaluation in the D-index can be separated into two stages. First, the addressing step identifies buckets to be examined by using the addressing structure. Second, the execution step is responsible for retrieving qualifying data objects in the buckets identified during the first stage. In the buckets, a naive sequential scan with pivot-based filtering is used [21]. The execution step can be easily parallelized since the evaluation in individual buckets is totally independent. The Grid computational environment is ideal for such a structure because the addressing stage can be done on a dedicated Grid node and the execution step can be spread over other nodes. The D-index is a dynamic structure, but the dynamicity is ensured using elastic buckets which have unlimited capacity. The first disadvantage of the D-index is its fixed addressing structure that must be manually tuned by the user before data are loaded. From the data scalability point of view, when the volume of data to be organized is doubled, the average occupation of buckets is doubled as well, so there is a strictly linear correlation between the indexed volume and the execution costs in buckets, whereas the addressing costs stay constant. This is inconvenient because the constant addressing structure leads to a constant number of buckets, i.e., Grid nodes. In order to guarantee the same response time, the power of individual Grid nodes must therefore be increased and adding further nodes will not help. The experiments conducted on both synthetic and real-life datasets [11] revealed that the D-index is very efficient and outperforms the M-tree [8], the de facto standard for similarity searching in metric spaces, in nearly all situations.
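As an illustration of the principles above, the following Java sketch implements the bps ρ-split function, its combination into a 2^m-way ρ-split function, and the simple search loop of Algorithm 1. It reuses the Metric and LinearScan types sketched in Section 2; all names are ours and the code only illustrates the described scheme under our own assumptions, it is not the D-index implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Ball-partitioning split (bps): one pivot xv and a medium distance dm
    // partition the space into two separable sets (0, 1) and an exclusion set (-1).
    class BpsFunction<T> {
        final T pivot;        // xv
        final double dm;      // medium distance
        final Metric<T> metric;

        BpsFunction(T pivot, double dm, Metric<T> metric) {
            this.pivot = pivot; this.dm = dm; this.metric = metric;
        }

        int split(T x, double rho) {
            double d = metric.distance(x, pivot);
            if (d <= dm - rho) return 0;   // separable set 0
            if (d >  dm + rho) return 1;   // separable set 1
            return -1;                     // exclusion set
        }
    }

    // A rho-split function combining m bps functions into 2^m separable sets.
    class RhoSplitFunction<T> {
        final List<BpsFunction<T>> bps;

        RhoSplitFunction(List<BpsFunction<T>> bps) { this.bps = bps; }

        // Returns the m-bit identifier of the separable set, or -1 for the exclusion set.
        int split(T x, double rho) {
            int id = 0;
            for (BpsFunction<T> f : bps) {
                int bit = f.split(x, rho);
                if (bit < 0) return -1;
                id = (id << 1) | bit;
            }
            return id;
        }
    }

    // Sketch of Algorithm 1: with rho = 0 each level yields exactly one separable
    // bucket to search; the global exclusion bucket is searched last (h+1 accesses).
    class SimpleSearch<T> {
        final List<RhoSplitFunction<T>> levels;          // one split function per level
        final List<List<LinearScan<T>>> levelBuckets;    // separable buckets of each level
        final LinearScan<T> exclusionBucket;             // global exclusion bucket

        SimpleSearch(List<RhoSplitFunction<T>> levels,
                     List<List<LinearScan<T>>> levelBuckets,
                     LinearScan<T> exclusionBucket) {
            this.levels = levels;
            this.levelBuckets = levelBuckets;
            this.exclusionBucket = exclusionBucket;
        }

        List<T> rangeSearch(T q, double r) {             // correct only for r <= rho
            List<T> answer = new ArrayList<>();
            for (int i = 0; i < levels.size(); i++) {
                int id = levels.get(i).split(q, 0.0);    // never -1 when rho = 0
                answer.addAll(levelBuckets.get(i).get(id).rangeQuery(q, r));
            }
            answer.addAll(exclusionBucket.rangeQuery(q, r));
            return answer;
        }
    }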
However, these performance trials also exemplify the second disadvantage of the D-index: its partitioning principles are far from optimal and are prone to producing unbalanced buckets [9]. Unbalanced buckets would lead to a non-uniform load of Grid nodes. The static addressing structure is also unable to adapt to a radical change in the distribution of the data being inserted. A fully adaptive structure and a balanced partitioning would allow expanding or shrinking to the desired number of Grid nodes while guaranteeing the response time. In particular, we are concerned with the problem of selecting reference objects (pivots). Next, we deal with the issue of combining several ρ-split functions.

The problem of choosing pivots is important for any search technique in general metric spaces, because all such algorithms need, directly or indirectly, some "anchors" for partitioning and search pruning. It is well known that the way in which pivots are selected affects the performance of search algorithms. This has been recognized and demonstrated by several researchers [19, 4]. Recently, the problem was systematically studied in [6], and several strategies for selecting pivots have been proposed and tested. The generic conclusion is that good pivots are i) far away from the remaining objects of the metric space and ii) far away from each other. In the following, we analyze possible design strategies of ρ-split functions.

3. METRIC GRID
The Metric Grid (M-Grid) is a distributed index structure for similarity searching that exploits a Grid infrastructure. The idea behind this structure is straightforward and has already been sketched in the previous section. The D-index's schema is applied in the M-Grid. In particular, we define a special Grid node, named the master node, where the addressing structure is stored. This node is responsible for contacting the buckets to be searched. The storage is distributed over all other Grid nodes. Distributing the storage over Grid nodes requires a special mapping that assigns buckets to nodes. It is called the bucket-to-node mapping. This translation of bucket identifications to node identifications allows more buckets to be stored on the same node. It also permits buckets to be replicated, i.e., the same bucket can be kept on multiple nodes. We use a simple load-balancer of replicas which is introduced in Section 4.3. When a user poses a similarity query, the master node identifies the buckets that might contain relevant data. The bucket-to-node mapping is consulted to get the addresses of Grid nodes. The master node initiates the execution phase by forwarding the query to the respective Grid nodes where the objects qualifying the query are retrieved. The individual sub-answers are then sent back to the master node which merges them to form the final answer to the query. A schema of M-Grid and of the querying process is depicted in Figure 2.

Figure 2: The schema of M-Grid and the process of querying.

In the previous section, we pointed out a disadvantage of the D-index: its static addressing structure. The M-Grid exploits principles of linear hashing to make the addressing structure adaptable. The design of the addressing structure requires a specification of several ρ-split functions which are usually combinations of bps functions.
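The query flow just described can be summarized by the following Java sketch of the master node. It is a simplified illustration under our own naming (AddressingSchema, GridNode, MasterNode are not names taken from the paper); remote communication and the replica load-balancing of Section 4.3 are omitted.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // The addressing schema held by the master node: identifies the buckets that
    // may contain objects relevant to a range query R(q, r).
    interface AddressingSchema<T> {
        List<Integer> candidateBuckets(T q, double r);
    }

    // A Grid node storing one or more buckets; search() evaluates the query in one bucket.
    interface GridNode<T> {
        List<T> search(int bucketId, T q, double r);
    }

    // Illustrative master node: translate candidate buckets to nodes via the
    // bucket-to-node mapping, forward the query, and merge the partial answers.
    class MasterNode<T> {
        final AddressingSchema<T> schema;
        final Map<Integer, List<GridNode<T>>> bucketToNodes;  // bucket id -> nodes holding a replica

        MasterNode(AddressingSchema<T> schema, Map<Integer, List<GridNode<T>>> bucketToNodes) {
            this.schema = schema;
            this.bucketToNodes = bucketToNodes;
        }

        List<T> rangeQuery(T q, double r) {
            List<T> answer = new ArrayList<>();
            for (int bucketId : schema.candidateBuckets(q, r)) {
                // A real deployment would contact the node remotely and pick the
                // least loaded replica (Section 4.3); here we use the first replica.
                GridNode<T> node = bucketToNodes.get(bucketId).get(0);
                answer.addAll(node.search(bucketId, q, r));   // merge the sub-answer
            }
            return answer;
        }
    }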
In general, the idealized split function should produce balanced buckets, each containing nearly the same number of objects, and minimize the size of the exclusion bucket. Figure 3 presents possible strategies to combine two bps functions.

Figure 3: Different strategies of combining two bps split functions.

The first technique, depicted in (a), is utilized in the D-index (also refer to Figure 1b) and uses the pivot p1 and dm = r1 to divide the space into two separable sets. Next, these two sets are repartitioned using a different bps function which applies a different pivot p2 and dm = r2, which is, however, the same for both sets. As a result, we obtain four separable buckets. We refer to this method as the strict strategy from here on. The second strategy in (b) differs from the first one in one aspect. It makes use of the pivot p2 in the second function as well, but two different values of dm (r_2^1 and r_2^2) are applied for the left and the right set, respectively. The hypothesis behind it is that by manipulating dm, better-balanced buckets are achieved, empty buckets are diminished, and the occupation of exclusion sets is decreased. We refer to this as the variable strategy. To complete the list of split policies, the third strategy also modifies the pivot and is denoted as the loose strategy, see Figure 3c. In detail, this technique can be viewed as the application of three independent bps functions instead of two in the previous strategies. However, the major disadvantage of this approach is its higher memory requirements, e.g., for obtaining 128 separable buckets we need 127 different pivots, which is in sharp contrast to the former strategies that require only log2(128) = 7 pivots to define the same partitioning. The memory requirements are not the only issue: the more pivots we have, the more distance computations we must evaluate during search operations. Since we also optimize the costs in terms of distance computations, such behavior is not desired. The experiments in [9] show that the variable strategy, compared to the strict one, is able to eliminate all empty (or under-occupied) buckets and to reduce the number of pivots needed at the same time. This leads to a more compact structure and a higher average bucket occupation. For this reason, we will not deal with the strict strategy in the following. We compare only the variable and loose strategies. The global observation is that the query execution costs with the loose strategy are 20% to 50% lower than the query execution costs with the variable strategy, depending on the query radius. The reason is quite obvious. Following the example above, in the case of the loose strategy the pivot-based filtering in buckets can profit from using all 127 pivots, thus it is very effective. This is in sharp contrast to the seven pivots in the case of the variable (or strict) strategy. In our Grid scenario, however, the loose strategy yields a very high load on the master node. More details are given in the experimental evaluation below.

Updates in the M-Grid are handled in the following way. If no change in the addressing schema is required, an object is directly inserted into or deleted from the bucket that has been identified by the addressing schema. In the case of a bucket split, the master node updates its addressing schema and contacts the corresponding node that maintains the bucket with the necessary information about the new bucket. The contacted node then reallocates a portion of its data.
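The difference between the strategies can be made explicit with a small Java sketch of the addressing tree, again under our own, purely illustrative names and reusing the Metric type from Section 2. Each internal node holds a pivot and a medium distance dm; under the variable strategy all nodes of the same depth share one pivot (so only a logarithmic number of pivots is needed) while dm may differ per branch, whereas the loose strategy would use a distinct pivot in every node.

    // Binary tree of ball partitions; leaves carry the bucket identifier.
    abstract class SplitTree<T> {
        abstract int bucketOf(T x, Metric<T> metric);
    }

    class SplitLeaf<T> extends SplitTree<T> {
        final int bucketId;
        SplitLeaf(int bucketId) { this.bucketId = bucketId; }
        int bucketOf(T x, Metric<T> metric) { return bucketId; }
    }

    class SplitNode<T> extends SplitTree<T> {
        final T pivot;            // shared by sibling subtrees under the variable strategy
        final double dm;          // tuned per branch, e.g. to balance bucket occupation
        final SplitTree<T> left;  // objects with d(x, pivot) <= dm
        final SplitTree<T> right; // objects with d(x, pivot) > dm

        SplitNode(T pivot, double dm, SplitTree<T> left, SplitTree<T> right) {
            this.pivot = pivot; this.dm = dm; this.left = left; this.right = right;
        }

        int bucketOf(T x, Metric<T> metric) {
            return metric.distance(x, pivot) <= dm
                    ? left.bucketOf(x, metric)
                    : right.bucketOf(x, metric);
        }
    }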
The master node can also request the creation of a new replica. To keep all replicas synchronized, algorithms similar to those used in distributed databases [17] can be applied. Due to space constraints, we do not analyze this issue here.

4. PERFORMANCE TRIALS
To study the properties of the M-Grid, we have run several groups of experiments. First, we focused on different building strategies, namely the variable and loose strategies. Next, we tackled the problem of storing several buckets on a single node and the issue of replication. Finally, we provide the reader with a performance tuning analysis.

Figure 4: Distance densities for the VEC and TTL datasets.

We selected the following significantly different real-life datasets to conduct the experiments on:

VEC: 45-dimensional vectors of extracted color image features. The similarity function d for comparing the vectors is a quadratic-form distance [18]. The distance distribution of the dataset is quite uniform and such a high-dimensional data space is extremely sparse, see Figure 4a.

TTL: titles and subtitles of Czech books and periodicals collected from several academic libraries. These strings are of lengths from 3 to 200 characters and are compared by the edit distance [15] on the level of individual characters. The distance distribution of this dataset is skewed, refer to Figure 4b.

Figure 5: Probability of accessing a node for the TTL dataset: (a) managing a single bucket and (b) managing four buckets.

The stored data volume was 1,000,000 objects. The capacity of a bucket was set to 2,048 objects and was fixed for all the tested structures and both datasets. The building phase of the M-Grid required 1,017 and 1,087 data reallocations, i.e., bucket splits, for the VEC and TTL datasets, respectively. The amount of transferred objects was 1,873,767 and 1,520,359, which means that after the initial insertion, 873,767 and 520,359 objects were reallocated to another node, for the respective datasets. We have not implemented any bulk-loading procedure which might further reduce the data transfer, so these numbers represent the upper bound. All the presented performance characteristics of query processing have been taken as a sum over 100 range queries with randomly chosen query objects, if not stated otherwise. The query radii were 200, 1,000 and 2,000 for the VEC dataset, each returning on average 4, 19 and 20,000 objects, respectively. For TTL we used radii of 7, 10 and 19, each returning on average 4, 40 and 20,000 objects, respectively.

4.1 Strategy Comparison
The parameter ρ is the only parameter of the addressing structure left to be specified by the user. We studied its influence mainly on the query costs. The value of ρ also affects the building costs. In general, the higher the value of ρ, the more objects fall into the exclusion set of a ρ-split function. This leads to more levels of the addressing structure, so more distance computations are evaluated to insert an object. The reader might think the best value is zero, but this is not always true because no index structure is built without the aim of searching. From the query costs point of view, the situation is not so clear. Table 1 presents the query costs in terms of distance computations for the variable strategy.
During the addressing stage, the number of distance computations increases as ρ grows, which supports the assertion stated above. The costs in the execution stage represent the total number of distance computations made on all nodes. These numbers are mean values for one query. This characteristic has a descending tendency. It is more noticeable for greater query radii (r = 2,000 in the table) than for smaller ones (r = 200 in the table). Such a trend is obviously caused by the improved effectiveness of pivot-based filtering, since more pivots are used in the addressing structure for larger values of ρ.

Table 1: Distance computations for different values of ρ with the variable strategy and the VEC dataset.
  variable, VEC      r = 200              r = 2,000
  ρ                  Addr.    Exec.       Addr.    Exec.
  0                   10       157         10      404,911
  50                  59        97         69      374,099
  100                 88       119        116      345,489

Table 2: Distance computations for different values of ρ with the loose strategy and the VEC dataset.
  loose, VEC         r = 200              r = 2,000
  ρ                  Addr.    Exec.       Addr.    Exec.
  0                   26        55        632      321,335
  50                  53        50        751      329,362
  100                 75        47        931      314,505

Table 2 shows the same characteristics for the loose strategy. Overall, the trends are the same; however, the effectiveness of pivot-based filtering does not influence the results very much. With many pivots, the filtering effectiveness becomes quite stable and is not improved even if the number of pivots is doubled. In particular, the loose strategy used 1,000, 1,100 and 1,500 pivots in total for ρ = 0, 50 and 100, respectively. The significantly increased number of pivots in the loose strategy with respect to the variable strategy leads to a more efficient search during the execution stage. For the M-Grid, the best setting is ρ = 0, which implies the lowest costs in the addressing stage. Recall that these costs cannot be parallelized and form the total costs on the master node. In this respect, the favorable strategy for the addressing structure is the variable one because of its negligible requirements in terms of distance computations even for large query radii. The worse query costs of the variable strategy in the execution stage are not critical owing to parallelism.

4.2 Creating Groups
In this section, we focus on the probability of accessing buckets during query elaboration. Figure 5a depicts this for the TTL dataset and queries with radius r = 19, which return about 20,000 objects. The observation is that buckets are not visited very uniformly (the probability of visiting varies from 1% to 53%), which leads to an unbalanced load of Grid nodes. However, there is no long sequence of buckets that would be contacted very often or very rarely.

Figure 6: Response time in distance computations while changing the replication factor.

In our scenario, we assumed every bucket is managed by a separate Grid node. From the storage occupation point of view, one bucket per computer is probably not an overwhelming volume. The idea to make the probabilities more uniform is to group several buckets together and store them on the same Grid node. Figure 5b presents the probabilities when groups of four buckets have been created, i.e., the grouping factor is four. It is noticeable that a more uniform load of Grid nodes is achieved (the probability of visiting varies from 65% to 84%). The negative aspect of grouping is the increased probability of contacting a node. For larger grouping factors, the probability obviously reaches 100% and all nodes get contacted for any query.
As for CPU costs, the most loaded node evaluated 194,806 distance computations to complete 100 queries with r = 19 when grouping was disabled. The most loaded node organizing a group of four buckets computed the distance function 323,841 times, which is 66% more. On the other hand, we need four times fewer computers in this case. In general, this technique is applicable to any dataset. Nonetheless, the application of this technique results in increased demands on a Grid node in terms of storage and CPU costs. If CPU costs are crucial, the following technique, replication, can be applied.

4.3 Replication
Replication is a well-known technique in distributed database systems that increases the availability of resources [17, 5]. The other implication of replication is the ability to balance the load among several nodes that manage identical data, which is the objective of this section. In our environment, the contribution of replication is twofold. First, it allows a query to be completed earlier. However, this holds only if grouping is applied; a system with replication cannot answer a query more efficiently than the configuration with every bucket assigned to a separate Grid node. Second, and more importantly, additional queries can be completed at the same time. Figure 6 presents the number of distance computations needed to answer 100 queries issued in parallel while changing the replication factor from one replica to eight copies. The trends are linear, i.e., doubling the replication factor halves the number of distance computations. The master node is responsible for distributing computations over the replicas.

Figure 7: Response time in distance computations while changing the number of replicas.

We have implemented a simple load-balancer which selects the least loaded node among the replicated nodes and forwards the incoming request to it. By the load of a node, we mean the number of distance computations, since it forms the main part of all costs. Referring back to Figure 6, it seems that the grouping technique only deteriorates the performance because the configuration with the grouping factor of 4 is systematically slower. However, notice that the grouping factor of 4 implies four times fewer Grid nodes. The same results rescaled to the real number of nodes are presented in Figure 7. It reveals that the configuration with the grouping factor of 4 and two replicas (546 nodes) outperforms the simple setting (one bucket per node, i.e., 1,089 nodes). Moreover, it operates on only half of the Grid nodes. On the same number of nodes, the configuration with the grouping factor of 4 and four replicas results in only 40% of the distance computations of the simple setting. These observations are, of course, a consequence of using the grouping and replication techniques which redistribute the load more uniformly. Such results require a deeper analysis, which we call performance tuning and which is tackled in the following section.

4.4 Performance Tuning
Performance tuning of the M-Grid is very important because it has many parameters that must be properly set in advance. Apart from ρ and the bucket capacity, we have introduced two new parameters: the grouping and replication factors.
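A minimal Java sketch of how these two parameters can be combined in the bucket-to-node mapping is shown below. The class and field names are our own, and using the number of distance computations as the load indicator mirrors our simple load-balancer; other indicators would work equally well.

    import java.util.List;
    import java.util.Map;

    // Bucket-to-node mapping with grouping and replication: consecutive buckets
    // form one group (grouping factor), each group is kept on several Grid nodes
    // (replication factor), and requests are forwarded to the least loaded replica.
    class ReplicatedMapping {
        static class Replica {
            final String nodeAddress;
            long distanceComputations;   // load indicator kept by the master node

            Replica(String nodeAddress) {
                this.nodeAddress = nodeAddress;
            }
        }

        final int groupingFactor;                          // buckets per group
        final Map<Integer, List<Replica>> groupReplicas;   // group id -> replicas holding it

        ReplicatedMapping(int groupingFactor, Map<Integer, List<Replica>> groupReplicas) {
            this.groupingFactor = groupingFactor;
            this.groupReplicas = groupReplicas;
        }

        // Map a bucket to its group, pick the least loaded replica of that group
        // (assumed non-empty), and charge it with the expected cost of the request.
        Replica selectNode(int bucketId, long expectedDistanceComputations) {
            int groupId = bucketId / groupingFactor;
            Replica best = null;
            for (Replica replica : groupReplicas.get(groupId)) {
                if (best == null || replica.distanceComputations < best.distanceComputations) {
                    best = replica;
                }
            }
            best.distanceComputations += expectedDistanceComputations;
            return best;
        }
    }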
The user or administrator of a system usually has a clear notion of the maximum response time of a query and the total number of queries that will be issued simultaneously. They can also specify the maximum memory available on individual Grid nodes and the maximum number of available Grid nodes. Our experimental setting allowed at most 64 buckets per node and 4,100 Grid nodes in total. We used the VEC dataset here and queries with radius r = 2,000. Notice that such queries are quite demanding because they return about 20,000 objects. The dataset was organized in 1,022 buckets. In order to study the performance of the M-Grid, we have temporarily fixed the grouping and replication factors. Figure 8 presents the costs to answer a batch of 10 or 100 queries for different M-Grid configurations, limited to the bounds stated above. Different curves distinguish the individual grouping factors. The horizontal line in each graph constitutes the limit on the query response time. Only the configurations under it satisfy all user restrictions.

Figure 8: Response times in distance computations for different configurations of M-Grid. (a) 10 simultaneous queries and (b) 100 simultaneous queries for the variable strategy, and (c) 100 simultaneous queries for the loose strategy.

From the figures, we can observe that the most efficient configurations are the ones with the grouping factor of 64. For grouping factors greater than 64, the M-Grid has, for both the VEC and TTL datasets, a probability of accessing a node equal to 100%. Therefore, all nodes are almost equally loaded and used optimally in this respect. In Figure 8a, the user requires ten simultaneous queries to be completed within 10,000 distance computations. Only configurations running on at least 512 nodes can meet these requirements. The most economical setting uses the grouping factor of 64 and 512 nodes, which leads to 32 replicas (replicas = nodes · grouping factor / buckets, i.e., 512 · 64 / 1,022). The other configuration almost equivalent in power uses 512 nodes as well, but the grouping factor is eight, which results in four replicas. Figure 8b depicts the same situation, but the user specified a threshold of 100,000 distance evaluations and 100 simultaneous queries. The results for the various configurations are the same but shifted up. The two configurations emphasized above are both the most efficient solutions again. The latter configuration, however, needed 102,123 distance computations. The loose strategy has the disadvantage of increased costs during addressing. These costs form a sequential step in processing queries, so the results of the same configurations are worse by 62,202 distance computations than the results for the variable strategy. Due to this fact, the desired configurations operate on at least 2,048 nodes, as shown in Figure 8c. The best configuration has the replication factor of 128 and the grouping factor of 64.
The second best one uses the grouping factor of 8, but the replication factor is now 16. The trends in all these graphs have also shown that doubling the number of Grid nodes doubles the distributed power of the system, i.e., a query is answered with half the number of distance computations. From the memory consumption point of view, configurations with smaller grouping factors are preferable if nodes cannot maintain too much data for some reason. The advantage of the M-Grid's addressing schema is that it is based on the principles of linear hashing. Hence, new buckets are allocated as the data volume increases. By monitoring performance indicators, the M-Grid can increase the grouping factor or the replication factor gracefully to adjust the performance automatically.

5. CONCLUSIONS
We have proposed a dynamic distributed similarity searching structure suitable for Grid infrastructures. This technique can adapt its addressing schema as new data arrive and, more importantly, as specific performance requirements are updated. In performance trials, we have studied several design and performance tuning issues of the M-Grid. Using our prototype Java implementation, we were able to conduct numerous experiments. Even though our technique can be used in a pure distributed environment, there are specific issues such as node allocation, node migration, data replication, security and access restrictions to be solved. This is the area where a Grid infrastructure comes into play. It solves these problems effectively and transparently through middleware services for node and data management. The Grid also provides embedded authentication and authorization capabilities. Unfortunately, we have had only limited access to large Grid systems and we had to simulate the infrastructure using a distributed framework library on a large network of common workstations that are available for daily use by university students. Therefore, we did not use the actual response time as a measure, because we could not dedicate these computers to our application only. Instead, we report the number of evaluations of the distance function, which in practice forms the prevalent part of the query execution costs. However, to estimate the actual response times, we have measured that 13,000 evaluations of the quadratic-form distance take approximately one second to complete on a current computer. As for communication costs, they can be neglected, because we need only one message from the master node to the processing nodes. Thus, the costs are equal to the longest round-trip time, which depends mainly on the topology of the underlying Grid. However, even for a geographically large Grid, they can be expected to be at most hundreds of milliseconds. Our experiments revealed that the M-Grid can be tuned to meet even very tight performance conditions. In the performance tuning section, we used queries with radius r = 2,000, each returning around 20,000 objects. Such a setting is rather demanding. In practice, smaller queries returning tens of similar objects are more likely. Also, the requirement of 100 simultaneously executed queries is immoderate, implying a heavily loaded system. The response time equivalent to 100,000 distance evaluations was about 8 s in the worst case, which is reasonable. On the other hand, the experiment with the more relaxed requirement of 10 concurrent queries had to be completed within 10,000 distance computations (about 0.8 s), which is real-time processing.
According to the experiments with one million data items, such high expectations can be fulfilled with 512 Grid nodes. Assuming the more likely requirements of radius r = 1,000 and a response time of less than 2.5 s, the system would require only 30 nodes, which is a feasible environment for any institution. Even though we had to simulate the Grid infrastructure, our distributed framework provided the necessary functionality in such a way that a transition to the environment of the Czech national METACentrum Grid [1] should be easy to implement. We plan this activity for the near future. We also intend to compare the M-Grid with existing systems for similarity searching based on peer-to-peer paradigms. In particular, the GHT* [2], MCAN [12], and M-Chord [16] will be considered. The first one is based on a distributed tree, while the others exploit transformation principles to embed a metric space into a vector space.

5.1 Acknowledgements
This research has been partially funded by the national research project 1ET100300419.

6. REFERENCES
[1] METACentrum project, August 2006. http://meta.cesnet.cz/.
[2] M. Batko, C. Gennaro, and P. Zezula. Similarity grid for searching in metric spaces. In DELOS Workshop: Digital Library Architectures, volume 3664 of LNCS, pages 25–44, 2005.
[3] M. Batko, D. Novak, F. Falchi, and P. Zezula. On scalability of the similarity search in the world of peers. In Proceedings of the First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1, 2006. ACM Press, 2006.
[4] T. Bozkaya and Z. M. Özsoyoglu. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems (TODS), 24(3):361–404, 1999.
[5] S. Budrean, Y. Li, and B. C. Desai. High availability solutions for transactional database systems. In 7th International Database Engineering and Applications Symposium (IDEAS 2003), 16-18 July 2003, Hong Kong, China, pages 347–357. IEEE Computer Society, 2003.
[6] B. Bustos, G. Navarro, and E. Chávez. Pivot selection techniques for proximity searching in metric spaces. In Proceedings of the 21st Conference of the Chilean Computer Science Society (SCCC 2001), Punta Arenas, Chile, November 6-8, 2001, pages 33–40. IEEE Computer Society, 2001.
[7] E. Chávez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273–321, September 2001.
[8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), Athens, Greece, August 25-29, 1997, pages 426–435. Morgan Kaufmann, 1997.
[9] V. Dohnal. An access structure for similarity search in metric spaces. In Current Trends in Database Technology - EDBT 2004, volume 3268 of LNCS, pages 133–143. Springer, 2004.
[10] V. Dohnal. Indexing Structures for Searching in Metric Spaces. PhD thesis, Faculty of Informatics, Masaryk University in Brno, Czech Republic, May 2004. http://www.fi.muni.cz/~xdohnal/phd-thesis.pdf.
[11] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9–33, 2003.
[12] F. Falchi, C. Gennaro, and P. Zezula. A content-addressable network for similarity search in metric spaces. In Proceedings of DBISP2P, pages 126–137, 2005.
[13] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Elsevier, Amsterdam, 2005.
[14] C. Gennaro, P. Savino, and P. Zezula. Similarity search in metric databases through hashing. In Proceedings of the 3rd ACM Multimedia 2001 Workshop on Multimedia Information Retrieval (MIR 2001), Ottawa, Ontario, Canada, October 5, 2001, pages 1–5. ACM Press, 2001.
[15] V. I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8–17, 1965.
[16] D. Novak and P. Zezula. M-Chord: A scalable distributed similarity search structure. In Proceedings of the First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1, 2006. IEEE Computer Society, 2006.
[17] M. T. Özsu and P. Valduriez. Distributed and parallel database systems. ACM Computing Surveys, 28(1):125–128, 1996.
[18] T. Seidl and H.-P. Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), Athens, Greece, August 25-29, 1997, pages 506–515. Morgan Kaufmann, 1997.
[19] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the 4th Annual ACM Symposium on Discrete Algorithms (SODA 1993), Austin, Texas, USA, January 25-27, 1993, pages 311–321. ACM Press, 1993.
[20] P. N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In Proceedings of the 6th DIMACS Implementation Challenge: Near Neighbor Searches (ALENEX 1999), Baltimore, Maryland, USA, January 15-16, 1999.
[21] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach. Springer, 2005.