Metric Embedding into the Hamming Space with the n-Simplex Projection

Transformations of data objects into the Hamming space are often exploited to speed up the similarity search in metric spaces. Techniques applicable in generic metric spaces require expensive learning, e.g., selection of pivoting objects. However, when searching in the common Euclidean space, the best performance is usually achieved by transformations specifically designed for this space. We propose a novel transformation technique that provides a good trade-off between the applicability and the quality of the space approximation. It uses the n-Simplex projection to transform metric objects into a low-dimensional Euclidean space, and then transforms this space to the Hamming space. We compare our approach theoretically and experimentally with several techniques of the metric embedding into the Hamming space. We focus on the applicability, learning cost, and the quality of search space approximation.


Introduction
The metric search problem aims at finding the data objects most similar to a given query object under the assumption that there exists a metric function assessing the dissimilarity of any two objects. The broad applicability of the metric space similarity model makes the metric search a challenging task, since the distance function is the only operation that can be exploited to compare two objects. One way to speed up the metric searching is to transform the space to use a cheaper similarity function or to reduce data object sizes [4,9,14,19]. Recently, Connor et al. proposed the n-Simplex projection that transforms the metric space into a finite-dimensional Euclidean space [8,9]. Here, specialised similarity search techniques can be applied. Moreover, the Euclidean distance is more efficient to evaluate than many distance functions.
Another class of metric space transformations is formed by sketching techniques that transform data objects into short bit-strings called sketches [4,17,19]. The similarity of sketches is expressed by the Hamming distance, and sketches are exploited to prune the search space during query executions [18,19]. While some sketching techniques are applicable in generic metric spaces, others are designed for specific spaces [4]. The metric-based sketching techniques are broadly applicable, but their performance is often worse than that of the vector-based sketching approaches when dealing with vector spaces [4,17].
We propose a novel sketching technique NSP 50 that combines advantages of both approaches: wide applicability and good space approximation. It is applicable to the large class of metric spaces meeting the n-point property [3,7], and it consists of the projection of the search space into a low-dimensional Euclidean space (n-Simplex projection) and the binarization of the vectors. The NSP 50 technique is particularly advantageous for expensive metric functions, since the learning of the projection requires a low number of distance computations. The main contribution of the NSP 50 is a better trade-off between its applicability, quality of the space approximation, and the pre-processing cost.

Background and Related Work
We focus on the similarity search in domains modelled by the metric space (D, d), with the domain of objects D and the metric (distance) function d : D × D → ℝ₀⁺ [21] that expresses the dissimilarity of any two objects from D. We consider the data set S ⊆ D, and the so-called kNN queries that search for the k objects from S closest to a query object q ∈ D. Similarity queries are often evaluated in an approximate manner, since slightly imprecise results are sufficient in many real-life applications and they can be delivered significantly faster than the precise ones. Many metric space transformations have been proposed to speed up the approximate similarity searching, including those producing the Hamming space [4,5,11,18,19], Euclidean space [9,16] and Permutation space [1,6,20]. We further restrict our attention to the metric embedding into the Hamming space.

Bit String Sketches for Speeding-Up Similarity Search
Sketching techniques sk(·) transform the metric space (D, d) to the Hamming space ({0, 1}^λ, h) to approximate it with smaller objects and a more efficient distance function. We denote the produced bit strings as sketches of length λ. Many sketching techniques have been proposed; see, for instance, the survey [4]. Their main features are: (1) Quality, i.e., the ability to approximate the original metric space; (2) Applicability to various search spaces; (3) Robustness with respect to data (intrinsic) dimensionality; (4) Cost of the object-to-sketch transformation; (5) Cost of the transformation learning. In the following, we summarise concepts of three techniques that we later compare with the newly proposed NSP 50 technique. They all produce sketches with balanced bits, i.e., each bit i is set to 1 in one half of the sketches sk(o), o ∈ S. This is denoted by the suffix 50 in their notations.
The GHP 50 technique [18] uses λ pairs of reference objects (pivots) that define λ instances of the Generalized Hyperplane Partitioning (GHP) [21] of the data set S. Each GHP instance splits the data set into two parts according to the closer pivot, and these parts define the values of one bit of all sketches sk(o), o ∈ S. The pivots are selected to produce balanced and low correlated bits [18]: (1) an initial set of pivots P_sup ⊂ D is selected at random, (2) the balance of the GHP is evaluated for all pivot pairs using a sample set T of S, (3) the set P_bal is formed by the pivot pairs that divide T into parts balanced to at least 45 % to 55 %, and the corresponding sketches sk_bal are created, (4) the correlation matrix M with absolute values of the Pearson correlation coefficient is evaluated for all pairs of bits of the sketches sk_bal, and (5) a heuristic is applied to select rows and columns of M which form its sub-matrix with low values and size λ × λ. (6) Finally, the λ pivot pairs that produce the corresponding low correlated bits define the sketches sk(o), o ∈ S.
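The following minimal Python sketch illustrates how one GHP-based bit is derived, how sketches are compared by the Hamming distance, and how the balance and the absolute Pearson correlation of bits can be measured on a sample. The function names, the tie-breaking rule (an equidistant object gets bit 0) and the toy 2-dimensional data are assumptions made for illustration only; the pivot selection heuristic of [18] is not reproduced here.

```python
import math

def ghp_bit(o, p1, p2, dist):
    """One GHP bit: 1 if o is strictly closer to pivot p1 than to p2, else 0.
    The handling of ties is an assumption of this sketch."""
    return 1 if dist(o, p1) < dist(o, p2) else 0

def ghp_sketch(o, pivot_pairs, dist):
    """Sketch with one bit per pivot pair."""
    return [ghp_bit(o, p1, p2, dist) for p1, p2 in pivot_pairs]

def hamming(a, b):
    """Hamming distance of two equally long bit lists."""
    return sum(x != y for x, y in zip(a, b))

def bit_balance(sketches, i):
    """Fraction of sketches that have bit i set to 1 (0.5 is perfectly balanced)."""
    return sum(s[i] for s in sketches) / len(sketches)

def abs_pearson(sketches, i, j):
    """Absolute Pearson correlation of bits i and j over a set of sketches."""
    n = len(sketches)
    xs = [s[i] for s in sketches]
    ys = [s[j] for s in sketches]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant bit: correlation undefined, treated as 0 here
    return abs(cov) / math.sqrt(var_x * var_y)

# Toy sample set and two hypothetical pivot pairs in the Euclidean plane.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pivot_pairs = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.0), (0.0, 1.0))]
sketches = [ghp_sketch(o, pivot_pairs, math.dist) for o in sample]
print(sketches)
print(bit_balance(sketches, 0), abs_pearson(sketches, 0, 1))
```

In this toy setting both bits are perfectly balanced and uncorrelated, which is the situation the selection steps (2) to (5) aim for on real data.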
BP 50 uses the Ball Partitioning (BP) instead of the GHP [18]. BP uses one pivot and a radius to split data into two parts, which again define the values of one bit of the sketches sk(o), o ∈ S. Pivots are again selected via a random set of pivots P_sup, for which we evaluate radii dividing the sample set T into halves. The same heuristic as in the case of the GHP 50 technique is then employed to select λ pivots that produce low correlated bits.
PCA 50 is a simple sketching technique that approximates Euclidean spaces surprisingly well [4,12,13,15,17]. It uses the Principal Component Analysis (PCA) to shrink the original vectors, which are then rotated using a random matrix and binarized by thresholding. The i-th bit of the sketch sk(o) thus expresses whether the i-th value in the shortened vector is bigger than the median computed on a sample set T. If sketches longer than the original vectors are desired, we propose to apply the PCA and to rotate the transformed vectors using independent random matrices. Then we concatenate the corresponding binarized vectors.
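For the ball partitioning used by BP 50, the radius that halves a sample can be taken as the median of the distances from the pivot to the sample objects. The Python sketch below illustrates this idea only; the function names, the convention that objects inside the ball get bit 1, and the toy 1-dimensional data are assumptions, and the selection of low correlated pivots is omitted.

```python
import math
import statistics

def bp_radius(pivot, sample, dist):
    """Radius splitting the sample into halves: the median pivot-to-object distance."""
    return statistics.median([dist(pivot, o) for o in sample])

def bp_bit(o, pivot, radius, dist):
    """One BP bit: 1 if o lies inside the ball (pivot, radius), else 0.
    Which side is encoded as 1 is an assumption of this sketch."""
    return 1 if dist(o, pivot) <= radius else 0

# Toy 1-dimensional sample; the pivot is a hypothetical choice.
sample = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
pivot = (0.0,)
radius = bp_radius(pivot, sample, math.dist)
bits = [bp_bit(o, pivot, radius, math.dist) for o in sample]
print(radius, bits)
```

With an even number of sample objects at distinct distances, the median falls between the two middle distances, so exactly one half of the sample receives bit 1.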
Sketching techniques applicable to generic metric spaces, e.g., GHP 50 and BP 50, are usually of a worse quality than vector-based sketching techniques when dealing with vector spaces [4,17]. Moreover, they require an expensive learning of the transformation. We propose the sketching technique NSP 50 to provide a better trade-off between the quality of the space approximation, applicability of the sketching, and the pre-processing cost.

The n-Simplex Projection
The n-Simplex projection [9] associated with a set of n pivots P_n is a space transformation φ_Pn : (D, d) → (ℝ^n, ℓ_2) that maps the original metric space to an n-dimensional Euclidean space. It can be applied to any metric space with the n-point property, which states that any n points o_1, ..., o_n of the space can be isometrically embedded in the (n − 1)-dimensional Euclidean space. Many often used metric spaces, such as Euclidean spaces of any dimension, spaces with the Triangular or Jensen-Shannon distances, and, more generally, any Hilbert-embeddable spaces, meet the n-point property [7]. The n-Simplex projection is properly described in [9]. Here, we sketch just the main concepts.
First, the n-point property guarantees that there exists an isometric embedding of the n pivots into the (ℝ^(n−1), ℓ_2) space, i.e., it is possible to construct the vertices v_pi ∈ ℝ^(n−1) such that ℓ_2(v_pi, v_pj) = d(p_i, p_j) for all i, j ∈ {1, ..., n}. These vertices form the so-called base simplex. Second, for any other object o ∈ D, the (n + 1)-point property guarantees that there exists a vertex v_o ∈ ℝ^n such that ℓ_2(v_o, v_pi) = d(o, p_i) for all i = 1, ..., n. The n-Simplex projection assigns such v_o to o, and Connor et al. [9] provide an iterative algorithm to compute the coordinates of the vertices v_pi of the base simplex as well as the coordinates of the vector v_o associated with o ∈ D. The base simplex is computed once and reused to project all data objects o ∈ S. Moreover, the Euclidean distance between any two projected vectors v_o1, v_o2 ∈ ℝ^n is a lower bound of their actual distance, and this bound becomes tighter with an increasing number of pivots n [9].
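The construction of the base simplex and the projection of an object can be illustrated with a simple forward-substitution scheme in Python. This is a sketch under stated assumptions, not necessarily the algorithm of [9]: the first pivot is placed at the origin, the i-th vertex uses only the first i − 1 axes, and the pivots are assumed to be in general position, so that no altitude of the base simplex is zero. The names `apex`, `build_base` and `project` are hypothetical.

```python
import math

def apex(base, dists):
    """Coordinates of a point with the given distances to the base vertices.

    base[i] holds i coordinates (vertex 0 is the origin, vertex i uses only
    the first i axes); dists[i] is the required distance to base[i].
    Returns len(base) coordinates; the last one is the non-negative altitude.
    """
    d0_sq = dists[0] ** 2
    x = []
    for i in range(1, len(base)):
        v = base[i]
        # From |x - v|^2 = dists[i]^2 and |x|^2 = dists[0]^2 follows x . v = rhs.
        rhs = (d0_sq + sum(c * c for c in v) - dists[i] ** 2) / 2.0
        partial = sum(x[j] * v[j] for j in range(i - 1))
        x.append((rhs - partial) / v[i - 1])  # assumes a non-zero altitude v[i - 1]
    altitude_sq = d0_sq - sum(c * c for c in x)
    x.append(math.sqrt(max(altitude_sq, 0.0)))  # clamp tiny negative rounding errors
    return x

def build_base(pivots, dist):
    """Vertices of the base simplex, built one pivot at a time."""
    base = [[]]
    for k in range(1, len(pivots)):
        base.append(apex(base, [dist(pivots[k], pivots[i]) for i in range(k)]))
    return base

def project(o, pivots, base, dist):
    """n-dimensional vector assigned to object o, given n pivots."""
    return apex(base, [dist(o, p) for p in pivots])

# Toy example: two pivots in a 3-dimensional Euclidean space.
pivots = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
base = build_base(pivots, math.dist)
a, b = (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
va, vb = project(a, pivots, base, math.dist), project(b, pivots, base, math.dist)
print(math.dist(va, vb), "<=", math.dist(a, b))
```

In the toy example both objects are mapped to the same 2-dimensional vector, so the projected distance is a loose lower bound of the actual one; adding pivots tightens it.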

The n-Simplex Sketching: Proposal and Comparison
We propose the sketching technique NSP 50 that transforms metric spaces with the n-point property to the Hamming space. It uses the n-Simplex projection with λ pivots to project objects into λ-dimensional Euclidean space; the obtained vectors are then randomly rotated and binarized using the median values in each coordinate. These medians are evaluated on the data sample set. The random rotation is applied to distribute information equally over the vectors, as the n-Simplex projection returns vectors with decreasing values along the dimensions.
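The binarization step can be sketched in Python as follows. The example assumes that the projected vectors are already available and replaces them with synthetic vectors whose values decrease along the dimensions. The random orthogonal matrix is generated by Gram-Schmidt orthonormalization of Gaussian vectors, which is one common choice and an assumption here, since the text does not prescribe how the rotation is drawn. All function names are hypothetical.

```python
import random
import statistics

def random_rotation(dim, seed=0):
    """Random orthogonal matrix: Gram-Schmidt applied to Gaussian row vectors."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along the rows found so far
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def learn_medians(rotated_sample):
    """Per-coordinate medians evaluated on a sample of rotated vectors."""
    dim = len(rotated_sample[0])
    return [statistics.median([v[i] for v in rotated_sample]) for i in range(dim)]

def binarize(vec, medians):
    """Bit i is 1 if the i-th value exceeds the i-th median."""
    return [1 if x > m else 0 for x, m in zip(vec, medians)]

# Synthetic stand-in for projected vectors (values shrink along the dimensions).
rng = random.Random(42)
dim = 8
sample = [[rng.gauss(0.0, 1.0) / (i + 1) for i in range(dim)] for _ in range(100)]
rotation = random_rotation(dim, seed=1)
rotated = [rotate(rotation, v) for v in sample]
medians = learn_medians(rotated)
sketches = [binarize(v, medians) for v in rotated]
print(sketches[0])
```

Since the thresholds are medians of the sample, each bit is set in about one half of the sample sketches, which matches the balance denoted by the suffix 50.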
For each data set S, there exists a finite number of pivots ñ such that φ_Pñ is an isometric space embedding. The identification of the minimum ñ with this property is still an open problem. The convergence is achieved when all the projected data points have a zero value in their last component, so the NSP 50 technique as described above cannot produce meaningful sketches of length λ > ñ. We overcome this issue by a concatenation of smaller sketches obtained using different rotation matrices.
The proposed NSP 50 technique is inspired by the PCA 50 approach, but provides significantly broader applicability, as it can transform all the metric spaces with the n-point property. This includes spaces with very expensive distance functions, as mentioned in Sect. 2.2. The sketching techniques also differ significantly in the complexity of the transformation learning. We compare the novel NSP 50 technique with the GHP 50, BP 50 and PCA 50 approaches and we provide a table summarising the main features of these sketching techniques, including the costs of the learning and of the object-to-sketch transformation in terms of floating point operations and distance computations. This table is provided online, due to the paper length limitation.
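The concatenation can be sketched in Python as below. The example assumes that every smaller sketch is computed from the same projected vector, each with its own rotation matrix and its own medians learned on a sample; the text does not spell out these details, so they are assumptions, as are the function names. Plane rotations by different angles stand in for the random rotation matrices to keep the example short.

```python
import math
import statistics

def rotation_2d(angle):
    """Plane rotation matrix; a stand-in for a random rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def rotate(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def learn_transform(matrix, sample):
    """Pair a rotation with the per-coordinate medians of the rotated sample."""
    rotated = [rotate(matrix, v) for v in sample]
    medians = [statistics.median([r[i] for r in rotated]) for i in range(len(matrix))]
    return matrix, medians

def concat_sketch(vec, transforms):
    """Concatenate the short sketches produced by the individual transforms."""
    bits = []
    for matrix, medians in transforms:
        rotated = rotate(matrix, vec)
        bits.extend(1 if x > m else 0 for x, m in zip(rotated, medians))
    return bits

# Hypothetical 2-dimensional projected vectors and three different rotations.
sample = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
transforms = [learn_transform(rotation_2d(a), sample) for a in (0.3, 1.1, 2.0)]
print(concat_sketch(sample[1], transforms))  # 6 bits from 2-dimensional vectors
```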
The GHP 50 and BP 50 techniques require an expensive pivot learning. Specifically, the GHP 50 requires (1) to examine the balance of the GHPs defined by various pivot pairs to create long sketches with the balanced bits, (2) an analysis of the pairwise bit correlations made for these sketches, and (3) a selection

Experiments
We evaluate the search quality of the NSP 50 technique and we compare it with the sketching techniques PCA 50, GHP 50 and BP 50. We use three real-life data sets of visual features extracted from images:
SQFD: 1 million adaptive-binning feature histograms [2] extracted from the Profiset collection. Each signature consists of, on average, 60 cluster centroids in a 7-dimensional space. A weight is associated with each cluster, and the signatures are compared by the Signature Quadratic Form Distance [2]. Note that this metric is a cheaper alternative to the Earth Mover's Distance; nevertheless, the cost of the Signature Quadratic Form Distance evaluation is quadratic with respect to the number of cluster centroids.
DeCAF: 1 million deep features extracted from the Profiset collection using the Deep Convolutional Neural Network described in [10]. Each feature is a 4,096-dimensional vector of values from the last hidden layer (fc7) of the neural network. The deep features use the ReLU activation function and are not ℓ_2-normalised. These features are compared with the Euclidean distance.
SIFT: 1 million SIFT descriptors from the ANN data set. Each descriptor is a 128-dimensional vector. The Euclidean distance is used for the comparison.
Figure 1 shows the distance densities of the particular data sets.
We express the quality of the sketching techniques by the recall of the k-NN queries evaluated using a simple sketch-based filtering. More specifically, sketches are applied to select the candidate set CandSet(q) for each query object q ∈ D, which consists of a fixed number of the sketches most similar to the query sketch sk(q); then, the candidate set is refined by the distance d(q, o), o ∈ CandSet(q), to return the k objects o most similar to q among those with sketches in the candidate set CandSet(q). This approximate answer is compared with the precise one that consists of the k objects o ∈ S closest to q.
The candidate sets consist of 2,000 sketches in the case of DeCAF and SIFT data sets, and 1,000 sketches in the case of the SQFD data set.
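The sketch-based filtering and the recall measure described above can be sketched in Python as follows. The function names and the toy data are assumptions; sketches are represented as bit lists and the data set is scanned sequentially, without any index.

```python
import heapq

def hamming(a, b):
    """Hamming distance of two equally long bit lists."""
    return sum(x != y for x, y in zip(a, b))

def knn_via_sketches(q, sk_q, data, sketches, dist, k, cand_size):
    """Approximate kNN: filter by the Hamming distance, refine by the distance d."""
    cand_ids = heapq.nsmallest(
        cand_size, range(len(data)), key=lambda i: hamming(sk_q, sketches[i]))
    return heapq.nsmallest(k, cand_ids, key=lambda i: dist(q, data[i]))

def exact_knn(q, data, dist, k):
    """Precise kNN answer obtained by a sequential scan."""
    return heapq.nsmallest(k, range(len(data)), key=lambda i: dist(q, data[i]))

def recall(approx_ids, exact_ids):
    """Fraction of the precise answer that is present in the approximate one."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy 1-dimensional data with hand-made sketches (hypothetical values).
data = [0.0, 1.0, 2.0, 10.0]
sketches = [[0, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1]]
d = lambda x, y: abs(x - y)
q, sk_q = 0.4, [0, 0, 0]
exact = exact_knn(q, data, d, k=2)
for cand_size in (1, 2):
    approx = knn_via_sketches(q, sk_q, data, sketches, d, k=2, cand_size=cand_size)
    print(cand_size, recall(approx, exact))
```

The toy run shows the expected behaviour: a candidate set smaller than k cannot reach full recall, while a larger one can.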
We evaluate experiments using 1,000 randomly selected query objects q ∈ D, and we depict results by Tukey box plots to show distributions of the recall values for particular query objects: the lower and upper bounds of the box show the quartiles, and the lines inside the boxes depict the medians of the recall values. The ends of the whiskers represent the minimum and the maximum non-outliers, and dots show the outlying recall values. In all cases, we examine 100 nearest neighbours queries to investigate properly the variance of the recall values over particular query objects. We use sketches of lengths λ ∈ {64, 128, 192, 256}.
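The statistics behind such a box plot can be computed as in the short Python sketch below. It follows the usual Tukey convention that values farther than 1.5 times the interquartile range from the box are outliers; the quartile interpolation method and the recall values are illustrative assumptions, not data from the experiments.

```python
import statistics

def tukey_summary(values):
    """Quartiles, whisker ends (extreme non-outliers) and outliers of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inliers), "whisker_high": max(inliers),
        "outliers": outliers,
    }

# Made-up recall values of seven queries.
recalls = [0.1, 0.8, 0.82, 0.85, 0.9, 0.91, 0.95]
print(tukey_summary(recalls))
```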
Results. Figure 2a shows results for the SQFD data set. The colours of the box plots distinguish particular sketching techniques, and the suffix of the column names denotes the length of sketches. For any fixed sketch length, the proposed NSP 50 technique significantly outperforms both the GHP 50 and BP 50 techniques. The PCA 50 approach is not applicable to this data set, as the searched space is not Euclidean. The BP 50 technique performs worst and provides a median recall of just 0.67 in the case of 256-bit sketches. The NSP 50 and GHP 50 approaches achieve a solid median recall of 0.88 and 0.81, respectively, even in the case of 192-bit sketches. We also show that the results are consistent when varying the candidate set size. Figure 2b reports the recalls for the candidate set sizes c ∈ {100, 500, 1000, 2000, 3000, 4000} and sketches of length 128 bits made by the sketching techniques NSP 50 and GHP 50. This figure shows that a given recall value can be achieved by the NSP 50 technique using a smaller candidate set than in the case of the GHP 50.
The recall values for the DeCAF and SIFT data sets are depicted in Fig. 3. The BP 50 technique is less robust concerning the dimensionality of the data, so it achieves poor recalls in the case of the DeCAF descriptors, but it is still reasonable for the SIFT data set. The quality of the newly proposed NSP 50 technique is slightly better than that of the GHP 50 technique in the case of the DeCAF data set. Both are, however, outperformed by the PCA 50 technique, which is specialised for the Euclidean space. This interpretation is valid for all the sketch lengths λ we have tested. The differences between the NSP 50 and PCA 50 techniques practically vanish in the case of the SIFT data set. Both these techniques achieve significantly better recall than the BP 50 and the GHP 50 techniques.

Conclusions
We contribute to the area of the metric space embeddings into the Hamming space. We propose the NSP 50 technique that leverages the n-Simplex projection to transform metric objects into bit-string sketches. We compare the NSP 50 technique with three other state-of-the-art sketching techniques designed either for the general metric space or the Euclidean vector space. The experiments are conducted on three real-life data sets of visual features using four different sketch lengths. We show that our technique combines advantages of both metric-based and specialised vector-based techniques, as it provides a good trade-off between the quality of the space approximation, applicability, and transformation learning cost.