Secure Metric-Based Index for Similarity Cloud
Stepan Kozak, David Novak, and Pavel Zezula
Masaryk University, Brno, Czech Republic
{xkozak1,david.novak,zezula}@fi.muni.cz
Abstract. We propose a similarity index that ensures data privacy and
thus is suitable for search systems outsourced in a cloud. The proposed
solution can exploit existing eﬃcient metric indexes based on a ﬁxed set
of reference points. The method has been fully implemented as a security
extension of an existing established approach called M-Index. This
Encrypted M-Index supports evaluation of standard range and nearest
neighbors queries both in precise and approximate manner. In the ﬁrst
part of this work, we analyze various levels of privacy in existing or future
similarity search systems; the proposed solution tries to keep a reasonable
privacy level while relocating only the necessary amount of work
from server to an authorized client. The Encrypted M-Index has been
tested on three real data sets with focus on various cost components.
1 Introduction
With more and more data being collected in all kinds of scientiﬁc processes
(medicine, astronomy, etc.) or commercial applications such as social networking
and on-line marketing, searching in large data sets became one of the key tasks
performed these days. Such data often does not provide suﬃcient meta-data
description, therefore in many applications similarity search is more important
than an exact match or keyword search.
Since similarity search itself is a very resource demanding process, the trend is
to outsource such services to 3rd party cloud providers. Outsourcing to a cloud
provides many advantages such as low initial investments, low storage costs and
a very good scalability (more storage or computational power can be added on
the ﬂy, when it’s needed – so called pay-as-you-go principle).
We can see two possible scenarios of outsourcing similarity search. In the ﬁrst
case, user has their similarity search technology and wants to use the hardare
of a cloud infrastructure provider. In the second scenario, a similarity search
service provider makes the technology available for end users so they can use the
engine without an actual knowledge of the technology. We observe an increasing
trend of the latter case and we will refer to this as similarity cloud.
In both scenarios, users might not want to expose all their data which might
be sensitive (e.g. medicine data) or valuable (e.g. data collected from a scientiﬁc
research), to a 3rd party provider, which is, in general, untrusted. In these cases,
privacy of the data is of high importance. Hence the similarity cloud has to
provide mechanisms which allow applying privacy requirements of the end users.
W. Jonker and M. Petkovi´c (Eds.): SDM 2012, LNCS 7482, pp. 130–147, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Secure Metric-Based Index for Similarity Cloud 131
The general objective of outsourcing is to move all the resource-demanding
process to the cloud and allow authorized clients to run search queries and
get answers. Intuitively, involving an encryption will negatively inﬂuence the
eﬃciency of the service. The less the cloud servers know about the data, the less
eﬃcient indexing techniques can be used on the server side. A suitable balance
between suﬃcient level of privacy and performance has to be found.
The topic of secured similarity search has been studied recently [1–4]. Some
of these works focus on keyword similarity indexing [1–3] and the work of Yiu et
al. [4] studies the indexing based on general metric principles. We consider the
paper pioneering in the area that we expect to become more important and we
consider this research direction promising.
In this work, we try to specify more precisely the application scenarios and
goals of the secure similarity search; our analysis results in a taxonomy of privacy
in similarity search services. We propose a novel metric-based technique that ensures
reasonable level of privacy while relocating only the necessary amount of
work from server to an authorized client. The proposed approach can exploit
existing eﬃcient metric indexes that are based on a ﬁxed set of reference objects
(pivots). Our method has been implemented utilizing M-Index [5, 6], an established
indexing and searching approach that is currently being used in several
real applications.
The M-Index and other similar approaches treat the data space as a metric
space (D, d), where D is a domain of objects and d is a total distance function
d : D × D −→ R satisfying metric postulates (non-negativity, identity, symmetry,
and triangle inequality) [7]. The set of indexed objects X ⊆ D is typically
searched by the query-by-example paradigm, for instance by the range query
R(q, r)= {o ∈ X | d(q, o) ≤ r}, q ∈ D or by the nearest neighbors query k-NN(q)
covering the k objects from X with the smallest distances to given q ∈ D.
Contribution and Structure of This Paper
– In Section 2, we describe the goals of secure similarity search and we deﬁne
the taxonomy of privacy in similarity clouds. In Section 3, we describe some
of the existing approaches.
– We propose a mechanism for ensuring privacy in the M-Index and similar
approaches in Section 4. Further, we discuss privacy and eﬃciency of the
proposed solution and describe its prototype implementation.
– We performed experiments on three real-life datasets in order to measure
the eﬃciency degradation caused by our privacy ensuring approach; we
also analyzed individual cost components in real client-server environment
(Section 5). The paper concludes in Section 6 with suggestions for future
work.
2 Problem Deﬁnition
In this section, we establish the terminology used throughout this paper and
deﬁne the main objectives of secured similarity search in an outsourced (clientserver)
setting.
132 S. Kozak, D. Novak, and P. Zezula
2.1 Terminology
Raw Data. The original (sensitive) data objects to be indexed and searched.
For example binary ﬁles of images or any other data collection.
Metric Space Objects (MS Objects). Metric space descriptors extracted
from the raw data, each descriptor has a reference to the corresponding
raw object; they are compared by a metric function. In some cases, the raw
data and the MS objects are identical (for instance, gene sequences or other
biomedical data); in other cases, several MS objects are extracted from one
raw object.
Secret Key. Encryption key used to encrypt the raw data and/or MS objects.
Encryption key is also used for authorized querying of the data.
Data Owner. Subject outsourcing the search service, owner of the data.
Server. Server(s) of the 3rd
party similarity cloud. From the data owner’s point
of view, server is not trusted because it is not fully controlled by the data
owner (server can be attacked and data from it leaks to an attacker). Therefore
server should not have the access to the original (unencrypted) data and
should have as less as possible information about the MS objects.
Authorized Client. Client authorized by the data owner to use the search
service (i.e. client having the secret key).
Attacker. Any potential malicious user with purpose of getting the data.
2.2 Objectives
In general, the process of outsourcing is the following: In the construction phase,
the data owner creates the MS objects from the original raw data, sends these MS
objects to a similarity cloud for indexing and the raw data to a data storage.
In the search phase, any authorized client can query the similarity cloud to
obtain IDs of the relevant objects referring to original data objects, that can
be subsequently retrieved from the raw data storage. Scheme of such outsourced
similarity system is depicted in Figure 1. General objectives of outsourced secure
similarity search can be formulated as follows:
– Resource demanding process (the search itself) should be performed on the
server side as much as possible (clients that query the server might be simple
devices without big computational power).
– Communication cost between the client and the server should be as low as
possible (in optimal case, client sends only initial search request and then
receives result from the server).
– Data should be stored on the server in a secure way so that a potential
attacker can gain as little information about the data as possible.
Intuitively, the security requirement goes against the eﬃciency objective. If most
of the computations should be performed on server side, the server has to have
enough information about the data to process such task eﬃciently. Hence, the
right balance between the security and eﬃciency should be found for each speciﬁc
application setting.
Secure Metric-Based Index for Similarity Cloud 133
Fig. 1. General scheme of outsourced similarity search
2.3 Levels of Privacy
In this section, we introduce and discuss several general approaches to a (secure)
outsourced similarity search as we see it.
No Encryption. This is the fundamental setting suitable for not sensitive data
or data stored in a completely trusted environment (own servers, etc.). It is also
the most eﬃcient solution, because all the work can be done on the server and no
additional encryption/decryption processes have to be employed. All advanced
indexing techniques can be applied on the server side without loss of eﬃciency.
Raw Data Encryption. Another approach is to extract the MS objects from
the raw data and build a standard indexing structure on these MS objects; then
the raw data can be encrypted with some symmetric cryptosystem (AES or any
other) and uploaded to the cloud data storage. The similarity search itself can
be performed without any change, the whole search process can be done on the
server (the index and MS objects are kept unencrypted). After the search, the
raw data storage returns encrypted result data to the client for decryption.
This approach is good from the performance point of view, because the resource
demanding process (similarity search with expensive distance computations)
is done on the server side, while the client decrypts the ﬁnal result (which
can also be a time consuming process, but this can be hardly omitted since the
authorized client is naturally the only party having the decryption key).
From the security point of view, this approach is not applicable if the MS
objects are also sensitive or identical to the raw data. And even if not, the
134 S. Kozak, D. Novak, and P. Zezula
knowledge of the metric space (distances between objects) might be possibly
exploited to get information about the raw data. This case is considered below.
MS Objects Encryption. In this setting, both raw data and MS objects are
encrypted. However, once the indexing structure cannot fully access the MS objects,
it cannot use pruning or ﬁltering techniques which may reduce the system
eﬃciency. Therefore, additional information about the MS objects has to be provided
to the indexing structure, so that at least some ﬁltering can be done on the
server side (otherwise the server would degrade to a simple data storage).
MS Objects and Their Distribution Encryption. Encrypting individual
metric space objects does not hide all the information from a potential attacker.
Server has unencrypted index structure with distances between objects and other
distance information between nodes within the indexing structure (the exact type
and amount of information depends on speciﬁc indexing mechanism). From this,
an attacker can possibly learn some information about distribution of the data.
The aim in some scenarios might be to encrypt also this information.
Note: Since raw data can be stored (and encrypted) separately and indexing structures
can only contain MS objects with a reference to the original resource, we
further consider the raw data always encrypted and discuss only the process of
securing MS objects within the indexing structures (therefore we focus on the last
two approaches of the above list).
3 Existing Approaches
The topic of secured similarity search is being studied and there are several
results in this ﬁeld. Several techniques for the similarity (keyword) search in the
text data has been proposed and analyzed [1–3]. However, all the solutions were
designed only for speciﬁc type of data with speciﬁc distance function to measure
the similarity. There is also a recent work concerning secure multidimensional
interval queries in over outsourced attribute data [8]. On the other hand, general
metric secure similarity search, where no additional information about the data
or the distance function is known, is a relatively new research area.
Naturally, there exists a straightforward solution: The data owner can encrypt
every object and send only the encrypted objects to the server without any
additional information. During the search phase, an authorized client downloads
all the objects from the server, decrypts them and performs the search. This
solution achieves perfect security, however it is clear that it cannot be used in
real applications because it has extremely high communication cost and it loses
all the advantages of cloud environment (especially scalability).
Few advanced techniques for outsourcing similarity search (with general metric
space approach) have been proposed. In this section, we provide brief overview
of two of the techniques of Yiu et al. [4]. Both these solutions encrypt the MS
objects and hide the data distribution (see Section 2.3).
Secure Metric-Based Index for Similarity Cloud 135
3.1 Encrypted Hierarchical Index Technique
First approach is called Encrypted hierarchical index search (EHI) and it works
as follows. Eﬃcient indexing structure is build upon the MS objects, all the nodes
of the indexing structure are encrypted with an arbitrary symmetric cipher,
address of the root node is public. The search procedure logic is implemented on
the client, which requests nodes of the structure from the server, decrypts the
node, processes the objects stored in this node and requests other node until it
completes the operation.
This approach has obvious advantage in its security, because a potential
attacker cannot gain any information either about the data itself or about
the metric space, since everything is stored encrypted. Another advantage is
a straightforward implementation and robust design, because we can use many
indexing structures practically without change of its internal structure.
However, there is a cost we have to pay for this high security level: communication
costs (a lot of traﬃc is between client and the server) and eﬃciency
of the search procedure. Since all the nodes of the indexing structure are encrypted,
server cannot traverse through the structure and can only serve as a
storage, sending the client what was requested. All the time-consuming search
operations have to be implemented on the client side, and the client has to perform
a lot of encryption/decryption operations which (in general) might be very
resource demanding as well. This approach seems to be only slightly better than
the trivial solution described above.
3.2 Metric-Preserving Transformation
EHI with the drawbacks mentioned above might not be applicable in some scenarios,
because the client can be a device with limited resources (e.g. smart phone
or even simpler device with less computational resources). Yiu et al. [4] propose
Metric-Preserving Transformation (MPT) technique which uses an orderpreserving
encryption function. For details see the paper [4]. The goal of MPT
was to reduce communication cost of EHI and pass part of the search work to
the server while preserving suﬃcient privacy of the data.
However, to achieve suﬃcient security level, one has to have a representative
sample of the data collection before the indexing structure is built and sent to
the server (it is necessary for the order-preserving encryption function to work
properly). This could be a problem in dynamic data sets where the collection is
often changing.
4 Encrypted Metric-Based Index
In this section, we propose an approach that can add privacy to metric indexes
that are based on distance permutations of a ﬁxed set of reference objects
[9–11, 6]. Our approach is introduced as an extension of a structure called MIndex
[5, 6] (because it enables both precise and approximate similarity search
and has other advantages) but it can be generalized straightforwardly to any
136 S. Kozak, D. Novak, and P. Zezula
p3
p2
p4
C3,2
C2,1
C1,2
C4,3
4,1C
C3,4
1,3C
p1
p3
p4
p1
p2
C3
C2
C1
C4
(a) (b)
C1,4
C2,3C3,1
Fig. 2. Principle of recursive Voronoi partitioning: ﬁrst level (a), second level (b)
other member of this class of metric indexes. Let us ﬁrst brieﬂy introduce the
standard M-Index.
4.1 M-Index
The M-Index [5, 6] is a dynamic disk-eﬃcient metric-based index that uses a
set of reference objects (pivots) p1, . . . , pn ∈ D in order to partition the indexed
set X ⊆ D. Namely, the recursive Voronoi partitioning is used: at the ﬁrst
level, each object o ∈ X is assigned to its closest pivot pi creating Voronoi
cell Ci; at the second level, data from Ci are partitioned further using pivots
p1, . . . , pi−1, pi+1, . . . , pn forming cells Ci,j, etc. Figure 2 depicts an example of
such partitioning with four pivots to the ﬁrst and second level, respectively. This
approach can also be formalized as pivot permutations [12, 13]: For each object
o ∈ X, let (·)o be permutation of indexes {1, . . ., n} such that ∀i, j ∈ {1, . . ., n} :
(i)o < (j)o ⇔ d(p(i)o
, o) < d(p(j)o
, o) ∨ d(p(i)o
, o) = d(p(j)o
, o) ∧ i < j.
Sequence p(1)o
, . . . , p(n)o
is then ordered with respect to distances between the
pivots and o. The M-Index uses preﬁxes of this permutation to index o.
Since neither this space partitioning nor the data distribution are uniform, the
M-Index has a dynamic variant that further partitions only the cells that exceed
certain data volume limit. The M-Index then maintains a dynamic Voronoi cell
tree to keep track of actual depth for individual cells. The schema of this tree is
sketched in Figure 3.
1C
1,3C
1,2C
1,3,C n nC ,1,n−1
2C
1,3,4C1,3,2C nC ,1,2 nC ,1,3
...
... ...
......
C1,n
C
nC
n,1
nC nC,2 ,n−1
2nd level
3rd level
1st level
Fig. 3. M-Index Voronoi cell tree
Secure Metric-Based Index for Similarity Cloud 137
The M-Index can evaluate the range and k-NN queries using several space
pruning and ﬁltering mechanisms [5] – only part of the Voronoi cells has to be
accessed to ensure returning of all the required objects. Beside this precise query
evaluation strategies that returns the precise answer AP
, the M-Index can also
evaluate k-NN(q) queries in approximate manner [6] (not all objects from AP
are
always returned while the search costs can be signiﬁcantly reduced). In this case,
the algorithm heuristically speciﬁes a candidate set SC ⊆ X of the objects that
are “close” to query q; this set SC is then reﬁned by evaluating distances d(q, o),
∀o ∈ SC and selecting the k closest objects that form the approximate search
result A. The size of the candidate set SC can be parametrized. The recall is a
common measure to quantify the quality of the result A with respect to precise
answer AP
: recall(A) = |A∩AP
|
|AP | · 100%.
4.2 Encrypted M-Index
Our solution exploits the fact that only the preﬁx of the pivot permutation is
used for indexing any object o ∈ D within M-Index. No additional distances
need to be computed during insertion of o into the index. The M-Index search
algorithms also need only query-pivot distances (or their permutation) in order
to form the candidate set SC. Therefore, the set of pivots can be part of the
private information known only by the data owner (and shared with authorized
clients). The ﬁnal reﬁning has to be done on the client, but SC (created on the
server) is supposed to be signiﬁcantly smaller than the whole collection X and it
can be pre-ranked. These operations are formalized in the following paragraphs
and schematically depicted in Figures 4 and 5.
Data Insertion. In the construction phase, for each object o ∈ X, the data
owner calculates pivot permutation, encrypts the object o (using arbitrary symmetric
cipher) and sends encrypted object e along with its pivot permutation to
the server. The server-side M-Index locates the leaf node of the dynamic cell tree
that corresponds to given pivot permutation (see Figure 3); object e together
with the pivot permutation is stored in this node and, if necessary, this leaf node
is split (again, only the permutations stored with the objects are necessary for
the split). This insert procedure is sketched in Figure 4 and its client part is
formalized in Algorithm 1.
Fig. 4. Schema of the insert operation into encrypted M-Index
138 S. Kozak, D. Novak, and P. Zezula
Algorithm 1. Encrypted M-Index insert algorithm
Input: o ∈ D, secretKey containing pivots p1, . . . , pn and cipher key
1 calculate d(o, secretKey.pi), ∀i ∈ {1, . . . , n};
2 init new enc. object e; /* e := struct {distances, permutation, data} */
3 if precise strategy is used then
4 e.distances ← d(o, secretKey.p1), . . . , d(o, secretKey.pn);
5 else
6 sort the distances to ﬁnd permutation (1)o, . . . , (n)o;
7 e.permutation ← (1)o, . . . , (n)o;
8 e.data ← secretKey.encrypt(o); /* store encrypted data only */
9 server.insert(e);
Fig. 5. Schema of the search operation in encrypted M-Index
Query Evaluation. Once the index is constructed, the data owner provides
the clients with the private information (set of pivots and key for the symmetric
cipher) and any client possessing this secret key can perform the search. To
execute any search query, authorized client ﬁrst computes distances to pivots
(part of the secret key) and sends the search request to the M-Index on the
server side (the search request consists of query object pivot permutation or
distances to pivots, the query object itself is not a part of the request). Please,
follow the diagram of this procedure in Figure 5. The server determines and sends
back the candidate set SC (set of encrypted objects), the client then decrypts the
objects from SC and ﬁnalizes the result (computes the distances of decrypted
candidate objects to the query object). In this way, all basic types of search
queries mentioned above can be evaluated (precise range and k-NN as well as
the approximate k-NN). From the client point of view, the algorithm is very
similar for all these types of queries and it is formalized in Algorithm 2.
Depending on the type of the search in Algorithm 2, either the query-pivot
distances (line 5) or only the pivot permutations (line 9) are sent to the server.
In either case, the candidate set is reﬁned after decrypting the retrieved objects.
For the approximate k-NN search, the client can choose the size CandSize of the
candidate set SC (line 10); SC retrieved from the server is pre-ranked, therefore
the client can choose to decrypt and compute distances only for candidates with
the highest rank to speed up the search process.
Secure Metric-Based Index for Similarity Cloud 139
Algorithm 2. Encrypted M-Index search algorithm (client)
Input: range R(q, r) or k-NN(q), secretKey with p1, . . . , pn and cipher key
Output: Answer set A
1 calculate d(q, secretKey.pi), ∀i ∈ {1, . . . , n};
2 init new encrypted object e ; /* e := struct {distances, permutation} */
3 initialize new encrypted object e;
4 if precise range query is used then
5 e.distances ← d(q, secretKey.p1), . . . , d(q, secretKey.pn);
6 SC ← server.rangeSearch(e, r) ; /* candidate set of enc. obj. */
7 else
8 sort the distances to ﬁnd permutation (1)q, . . . , (n)q;
9 e.permutation ← (1)q, . . . , (n)q;
10 SC ← server.approxKNN (e, CandSize) /* candidate set of enc. obj. */
11 A ← ∅;
12 foreach object c in SC do
13 o ← secretKey.decrypt(c);
14 compute distance d(q, o);
15 if o satisﬁes the query constraint then
16 A ← A ∪ {o};
Algorithm 3 formalizes the range search algorithm on the server side. It uses
the precomputed query-pivot distances and the query radius r to prune the index
tree using several metric-based constraints [6] (line 3 of Alg. 3). Furthermore,
because each indexed object contains distances to the pivots, the server can use
pivot ﬁltering [7, 6] to further reduce the candidate set size (lines 5–7).
Algorithm 4 describes the server-side procedure for the approximate k-NN
search. As mentioned in Section 4.1, the candidate set of a given size can be
determined only based on query pivot permutation or query-pivot distances:
The M-Index can order the Voronoi cells by some “promise value” (line 3).
The precise k-NN search can be realized as an approximate k-NN search, that
determines distance ρk to the k-th nearest neighbor of q, and then subsequent
precise range query R(q, ρk) is executed [7].
Algorithm 3. Range query algorithm (server.rangeSearch)
Input: encrypted query object q with distances q.distances, radius r
Output: Candidate set SC ⊆ X
1 SC ← ∅;
2 foreach Voronoi cell C traversing the M-Index do
3 if C cannot be pruned using q.distances and radius r then
4 SC ← SC ∪ C;
5 foreach object o in SC do
6 if
n
max
i=1
|q.distances[i] − o.distances[i]| > r then
7 SC ← SC \ {o};
140 S. Kozak, D. Novak, and P. Zezula
Algorithm 4. Approximate k-NN algorithm (server.approxKNN )
Input: encrypted q with q.permutation, candidate set size CandSize
Output: Pre-ranked candidate set SC
1 SC ← ∅;
2 while |SC | < CandSize do
3 C ← next promising Voronoi cell from the index;
4 SC ← SC ∪ C;
5 trim SC to required size CandSize;
4.3 Security Analysis
The secret key of authorized clients consist of the set of pivots and key for
symmetric cipher used to encrypt the data. The information which is stored on
the server (and can possibly leak to an attacker) are the pivot permutations (or
object-pivot distances) and the encrypted MS objects. Since the pivots are part
of the private key and only authorized clients know them, only they can query the
server by “meaningful” queries. An attacker can query the server index using an
arbitrarily chosen pivot permutation not knowing to which query object(s) this
permutation belongs; the received candidate objects are encrypted and without
any similarity distances or other meta-data. If the server is compromised, the
attacker could learn the index structure and thus the sets of clustered MS objects
(encrypted); nevertheless, not knowing the pivots and the metric function, it
would be diﬃcult to learn speciﬁcs about the data set. Clearly, this approach
belongs to the third privacy level deﬁned in Section 2.3.
In order to even better hide information about the data set distribution, we
would like to apply certain distance transformations that would preserve the
utmost pruning and ﬁltering eﬃciency of the server-side index. The aim is to
provide the security level described in the forth paragraph of Section 2.3 and
this belongs to our future work.
4.4 Implementation
The M-Index is implemented with the aid of MESSIF, a general and robust
framework that supports the task of building prototypes of similarity search
algorithms [14]. Encrypted M-Index is implemented as an encryption layer in
MESSIF. Even though the algorithm was designed for the M-Index as the server
algorithm, the architecture of encryption layer is general and robust. Therefore
it can be used with any MESSIF-based algorithm.
Core of the encryption layer is an encryption client which can (if supplied with
a secret key) provide a communication interface (layer) for a remote server running
a similarity search algorithm (indexing structure running in a distributed
cloud environment for example). Both client and server are Java processes communicating
via TCP/IP.
Secure Metric-Based Index for Similarity Cloud 141
5 Experimental Evaluation
The main objective of the experimental evaluation is to measure the price paid
for ensuring privacy in terms of the eﬃciency loss – the computation demands
placed upon the client due to the privacy protection and the client-server communication
costs (volume of exchanged data and communication time). The
client computations consist principally of (1) data encryption/decryption and
(2) the search algorithm fragments relocated from the server (mainly computation
of the distance function). All these cost components are quantiﬁed and
their signiﬁcance is compared for diﬀerent types of data sets (small vs. large,
simple distance function vs. complex distance function). In every experiment,
the performance of the proposed secure M-Index is compared with the basic
(non-encrypted) version of M-Index in order to properly quantify the cost paid
for the security.
5.1 Data Sets and Settings
The Encrypted M-Index was tested on three real data sets, whose properties are
summarized in Table 1.
YEAST1
A gene expression data matrix obtained from a Microarray experiment
on yeast. Each entry indicates the expression level of a speciﬁc gene
(row/tuple) at a speciﬁc condition (column/attribute) [4].
HUMAN2
A gene expression data matrix obtained from a Lymphoma/Leukemia
Molecular Proﬁling Project.
CoPhIR3
Collection of one million images downloaded from Flicker photo site.
From each image, ﬁve MPEG-7 visual descriptors were extracted and the
distance combines them [15]. We test scalability of our solution on this set.
Table 1. Data sets summary
Name # of records Data type Distance function
YEAST 2,882 17-dim. num. vectors L1
HUMAN 4,026 96-dim. num. vectors L1
CoPhIR 1,000,000 280-dim num. vectors combination of Lp
We performed all the experimental evaluations on a machine with 8 CPU cores
(double quad core) with 8GB RAM, 4 high-speed disks in RAID-5. To reduce an
inﬂuence of network communication time, both encryption client and M-Index
server were running on the same machine communicating via loopback interface.
Standard symmetric cipher AES with 128 bit key was used for encryption.
1
http://arep.med.harvard.edu/biclustering/
2
http://arep.med.harvard.edu/biclustering/
3
http://cophir.isti.cnr.it/
142 S. Kozak, D. Novak, and P. Zezula
Evaluation parameters of the M-Index running on the server side for each data
set are summarized in Table 2. The pivots used were chosen at random from
within the data set. The experimental evaluation is divided into two phases: the
construction phase (data insertion) and the search phase.
Table 2. M-Index parameters
Name Bucket capacity Storage type # of pivots
YEAST 200 Memory storage 30
HUMAN 250 Memory storage 50
CoPhIR 1,000 Disk storage 100
5.2 Index Construction
For the construction phase, we used bulk insert operations to insert the data
into M-Index running as the server via encryption client. The size of each bulk
was 1,000. Measures taken for each data set are the following (they are used for
both construction and search phases):
client time the overall client computation time: data encryption/decryption,
distance computations (object-pivot distances), and processing overhead,
server time the time to store objects in the M-Index (and to build the M-Index
cell tree) or, in the search phase, to prepare the candidate set,
communication time time spent on communication between server and client,
overall time sum of client, server and communication times.
Results for the construction phase are summarized in Table 3. We can see that
the relative importance of the client, server and communication times diﬀers for
the three data sets: for the small YEAST and HUMAN data sets, the server
time is more than 50 % of the client time and the communication time is very
important, and, for CoPhIR, the server and communication times are marginal
in comparison with the time spent by client-side computations. This is caused
by the fact that the CoPhIR distance function is more demanding than the
YEAST and HUMAN and the distances are computed on client. By analogy,
the encryption time is relatively more important for YEAST and HUMAN.
Table 3. Index construction of encrypted M-Index
YEAST HUMAN CoPhIR
Client time [s] 0.208 0.324 1,584.5
Encryption time [s] 0.117 0.155 32.2
Dist. comp. time [s] 0.026 0.101 1,541.4
Server time [s] 0.116 0.188 70.2
Communication time [s] 0.182 0.288 52.9
Overall time [s] 0.506 0.800 1,707.7
Secure Metric-Based Index for Similarity Cloud 143
Let us compare these results with the basic non-encrypted M-Index. All the
settings (computational power, network, M-Index parameters, client-server architecture,
etc.) were the same, the only diﬀerence was the absence of the encryption
layer. The index construction results for non-encrypted M-Index are
summarized in Table 4. For the small YEAST and HUMAN data sets, we can
see that the overall time increase caused by the encryption (Table 3) was about
60 % (e.g. 0.315 s vs 0.506 s, for YEAST). In the case of CoPhIR, the overall time
is practically the same for both variants because the encryption time is marginal
comparing to the distance computation time.
Table 4. Index construction of the basic (non-encrypted) M-Index
YEAST HUMAN CoPhIR
Client time [s] 0.001 0.009 0.300
Server time [s] 0.144 0.216 1,563.4
Dist. comp. time [s] 0.026 0.101 1,541.4
Communication time [s] 0.170 0.265 141.4
Overall time [s] 0.315 0.490 1,705.2
5.3 Approximate Search
For the evaluation of the search eﬃciency, we used the approximate k-NN search
algorithm on one hundred query objects randomly chosen from the data set. We
varied the parameter k but the results were similar and we present only results
for k = 30. All presented results are averaged over the 100 queries. For each data
set, we varied the size of the candidate set provided by the server (see Section 9)
which inﬂuences the quality of the result and eﬃciency. Besides the measures
introduced in the previous section, the following values are presented:
decryption time time spent on deserialization and decryption of the candidate
objects received from the server,
recall quality of the result (see Section 4.1),
communication cost amount of data sent between the server and client.
The search results for the Encrypted M-Index are summarized in Table 5
(YEAST) and Table 6 (CoPhIR). Results for the HUMAN data set are not
presented – the trends do not diﬀer from YEAST (the sizes of the collections
are very similar and the character of data and distance function is the same).
For the two small data sets (YEAST and HUMAN) with a simple distance
function, server time is approximately 50% of the client time. For the CoPhIR
data set, the ratio of server/client time is approximately 1/5 due to resource
demanding distance computations on client side. The relative signiﬁcance of
decryption time (which includes also deserialization of the objects) is the same
for all the data sets (and candidate sizes) and it is relatively high; this cost
component can be hardly moved from the client, provided the system is secure
(decryption can be done only by an authorized client having a secret key).
144 S. Kozak, D. Novak, and P. Zezula
Table 5. Approximate 30-NN evaluation using the Encrypted M-Index (YEAST)
Candidate set size 150 300 600 1,500
Client time [s] 0.002 0.003 0.006 0.014
Decryption time [s] 0.001 0.002 0.004 0.010
Dist. comp. time [s] 0.001 0.001 0.002 0.003
Server time [s] 0.001 0.002 0.004 0.009
Communication time [s] 0.003 0.003 0.004 0.008
Overall time [s] 0.006 0.008 0.014 0.031
Recall [%] 59.80 82.87 91.3 91.6
Communication cost [kB] 25.805 51.643 103.308 258.314
Naturally, the decryption, distance computation and communication time values
are linearly proportional to the size of the candidate set SC, which also inﬂuences
the recall. For the YEAST data set, we can observe that the |SC| = 600
(about 20 % of the collection size) results in recall over 90 % and further increase
of the SC size does not lead to a signiﬁcant recall improvement. For CoPhIR, we
can achieve almost 90% recall with the SC size of 50,000 (5 % of the collection).
Table 6. Approximate 30-NN evaluation using the Encrypted M-Index (CoPhIR)
Candidate set size 500 1,000 5,000 10,000 20,000 50,000
Client time [s] 0.034 0.047 0.224 0.446 0.899 2.429
Decryption time [s] 0.019 0.026 0.132 0.264 0.534 1.451
Dist. comp. time [s] 0.009 0.018 0.082 0.180 0.356 0.969
Server time [s] 0.005 0.016 0.059 0.110 0.207 0.498
Communication time [s] 0.006 0.008 0.029 0.054 0.106 0.290
Overall time [s] 0.045 0.071 0.312 0.610 1.212 3.217
Recall [%] 7.619 10.952 36.667 55.238 71.905 87.143
Communication cost [kB] 460 921 4,605 9,211 18,423 46,058
For comparison, we present the results of the search procedure on the basic
(non-encrypted) M-Index – see Table 7 (YEAST) and Table 8 (CoPhIR). In this
setting, practically all the work is done on the server side. The most signiﬁcant
diﬀerence between encrypted and non-encrypted M-Index is in the communication
cost and time, because the non-encrypted variant directly returns the answer
set with 30 objects, not the candidate set (which can have up to 50,000 objects
in case of CoPhIR). The ﬁxed set of 30 objects transferred between sever and
client implies same communication time for all SC sizes. The amount of work on
the client is negligible, therefore there are no values in the search result tables.
Secure Metric-Based Index for Similarity Cloud 145
To summarize this section, the price paid for privacy in the search phase on
Encrypted M-Index is mainly the communication cost (which grows linearly with
the candidate set size) and the time of decrypting the candidate objects. This
results in approximately three times longer overall search time measured on the
encrypted variant. The distance computations, performed during the necessary
candidate set reﬁnement on the client, constitute between 1/10 and 1/3 of the
overall search time, depending on the complexity of the distance function.
Table 7. Approx. 30-NN evaluation using basic (non-encrypted) M-Index (YEAST)
Candidate set size 150 300 600 1,500
Client time [s] – – – –
Server time [s] 0.001 0.001 0.002 0.003
Dist. comp. time [s] 0.001 0.001 0.002 0.003
Communication time [s] 0.002 0.002 0.002 0.002
Overall time [s] 0.003 0.003 0.004 0.005
Communication cost [kB] 5.161 5.164 5.162 5.161
Table 8. Approx. 30-NN evaluation using basic (non-encrypted) M-Index (CoPhIR)
Candidate set size 500 1,000 5,000 10,000 20,000 50,000
Client time [s] – – – – – –
Server time [s] 0.023 0.032 0.116 0.215 0.416 1.029
Dist. comp. time [s] 0.009 0.018 0.082 0.180 0.356 0.969
Communication time [s] 0.006 0.006 0.005 0.005 0.005 0.006
Overall time [s] 0.029 0.038 0.121 0.220 0.421 1.023
Communication cost [kB] 26.421 26.174 26.403 26.325 26.196 26.254
5.4 Comparison with Other Solutions
In order to compare our solution with the referenced approaches, we evaluated
1-NN queries on the YEAST data set, which corresponds with the setting of Yiu
et al. [4]. To be more speciﬁc, we adjusted the approximate 1-NN strategy so
that the server-side M-Index was limited to access only one M-Index Voronoi cell
which then forms the candidate set (this leads to candidate sets of average size
42). Again, we ran these 1-NN queries for 100 randomly chosen query objects
(they were excluded from the indexed set). In Table 9, we report average values
of the collected measures. The recall value says how many queries (out of 100)
resulted in the actual nearest neighbor.
In comparison with results presented by Yiu et al. [4], the Encpryted M-Index
outperformed all the techniques in the communication cost. It also outperformed
FDH [4] (technique which uses also approximate search) in CPU time. On the
other hand, it takes more time to construct the index (the Encrypted M-Index
was approximately twice slower than FDH).
146 S. Kozak, D. Novak, and P. Zezula
However, it is always diﬃcult to compare the wall-clock CPU times, as they
strongly depend on speciﬁc implementation and hardware. Moreover, the referenced
paper does not specify technical details about the symmetric cipher used
(and its key length), the computational power and network setting. Also, the
referenced approaches [4] modify the distance function for indexing, therefore
they belong to the fourth privacy level whereas our approach fulﬁlls conditions
of the third level (see Section 2.3).
Table 9. Approximate 1-NN search evaluation results for the YEAST data set
Client time [ms] 0.509
Decryption time [ms] 0.160
Dist. comp. time [ms] 0.210
Server time [ms] 1.001
Communication time [ms] 1.180
Overall time [ms] 2.690
Recall [%] 94.0
Communication cost [kB] 2.368
6 Conclusions and Future Work
We proposed a method that can be used to ensure data privacy in similarity
search systems outsourced in a cloud. The proposed solution exploits existing
eﬃcient metric indexes based on a ﬁxed set of pivots. This set is part of the secret
key controlled by authorized clients, while the server itself cannot compute the
similarity distance function, nor it can access any data in an unencrypted form. A
potential attacker can only learn encrypted object data and pivot permutations.
Our approach has been implemented as a real client-server “similarity cloud”
system exploiting an existing mature implementation of the M-Index (diskeﬃcient,
parallel, potentially distributed). The performance of the system was
experimentally evaluated on several real data sets focusing on individual components
of the search time (server, communication, data decryption, client data
operations). The Encrypted M-Index has the intention of providing required privacy
while preserving the server-side eﬃciency as much as possible and relocate
only the necessary computations to the client (data decryption and computation
of query-data distances).
In the future, we would like to analyze the precise range and k-NN evaluation
strategies of Encrypted M-Index in comparison to the approximate strategy and
to other possible solutions. Further, we would like to study various types of
distance transformations (i.e. transform the distances to pivots stored on the
server for precise strategies); such transformation could better hide information
about the data set distribution and thus further restrict possible attacks.
Acknowledgments. This work was supported by national Czech Research
Foundation project GACR P103/12/G084.
Secure Metric-Based Index for Similarity Cloud 147
References
1. Park, H.A., Kim, B.H., Lee, D.H., Chung, Y.D., Zhan, J.: Secure similarity search.
In: 2007 IEEE International Conference on Granular Computing (GRC 2007), pp.
598–598. IEEE (2007)
2. Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., Lou, W.: Fuzzy keyword search
over encrypted data in cloud computing. In: Proceeding of the 29th Conference on
Information Communications, pp. 441–445 (2010)
3. Cao, N., Wang, C., Li, M., Ren, K., Lou, W.: Privacy-preserving multi-keyword
ranked search over encrypted cloud data. In: 2011 Proceedings IEEE INFOCOM,
pp. 829–837. IEEE (2011)
4. Yiu, M.L., Assent, I., Jensen, C.S., Kalnis, P.: Outsourced Similarity Search on
Metric Data Assets. IEEE Transactions on Knowledge and Data Engineering 24(2),
338–352 (2012)
5. Novak, D., Batko, M.: Metric index: an eﬃcient and scalable solution for similarity
search. In: Second International Workshop on Similarity Search and Applications
(SISAP 2009), pp. 65–73. IEEE (2009)
6. Novak, D., Batko, M., Zezula, P.: Metric Index: An Eﬃcient and Scalable Solution
for Precise and Approximate Similarity Search. Information Systems 36(4), 721–733
(2011)
7. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space
Approach. In: Advanced Database Systems, vol. 32. Springer (2006)
8. Hore, B., Mehrotra, S., Canim, M., Kantarcioglu, M.: Secure multidimensional
range queries over outsourced data. The VLDB Journal 21(3), 333–358 (2011)
9. Ch´avez, E., Figueroa, K., Navarro, G.: Eﬀective Proximity Retrieval by Ordering
Permutations. IEEE Transactions on Pattern Analalysis and Machine Intelligence
30(9), 1647–1658 (2008)
10. Amato, G., Savino, P.: Approximate similarity search in metric spaces using inverted
ﬁles. In: Proceedings of the 3rd International Conference on Scalable Information
Systems (2008)
11. Esuli, A.: PP-Index: Using permutation preﬁxes for eﬃcient and scalable approximate
similarity search. In: Proceedings of LSDS-IR 2009 (2009)
12. Ch´avez, E., Navarro, G., Baeza-Yates, R., Marroqu´ın, J.L.: Searching in metric
spaces. ACM Computing Surveys 33(3), 273–321 (2001)
13. Skala, M.: Counting distance permutations. Journal of Discrete Algorithms 7(1),
49–61 (2009)
14. Batko, M., Novak, D., Zezula, P.: MESSIF: Metric similarity search implementation
framework. Digital Libraries Research and Development 4877(102), 1–10 (2007)
15. Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., Rabitti, F.:
CoPhIR: A Test Collection for Content-Based Image Retrieval. CoRR abs/0905.4
(2009)