SIMILARITY SEARCH: The Metric Space Approach

Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Survey of Existing Approaches
1. ball partitioning methods
   1. Burkhard-Keller Tree
   2. Fixed Queries Tree
   3. Fixed Queries Array
   4. Vantage Point Tree
      - Multi-Way Vantage Point Tree
   5. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Burkhard-Keller Tree (BKT) [BK73]
- Applicable to discrete distance functions only.
- Recursively divides a given dataset X.
- Choose an arbitrary point pj ∈ X and form the subsets
  Xi = {o ∈ X, d(o,pj) = i} for each distance i ≥ 0.
- For each non-empty Xi, create a sub-tree of pj; empty subsets are ignored.
[Figure: pivot pj with distance rings X2, X3, X4 and the corresponding m-ary tree.]

BKT: Range Query
Given a query R(q,r):
- Traverse the tree starting from the root.
- In each internal node pj:
  - report pj on the output if d(q,pj) ≤ r;
  - enter a child i if max{d(q,pj) - r, 0} ≤ i ≤ d(q,pj) + r.
[Figure: a BKT with pivots p1, p2, p3 and a range query R(q,r).]
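The two rules above translate directly into code. A minimal Python sketch of BKT insertion and range search (illustrative names only; any integer-valued metric, e.g. the edit distance, can play the role of d):

```python
# Minimal BKT sketch (illustration only): an integer-valued metric is assumed,
# one database object per node, children indexed by the distance to the pivot.

class BKTNode:
    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}                  # distance i -> BKTNode

def bkt_insert(node, o, d):
    i = d(o, node.pivot)
    if i in node.children:
        bkt_insert(node.children[i], o, d)
    else:
        node.children[i] = BKTNode(o)

def bkt_range(node, q, r, d, out):
    dist = d(q, node.pivot)
    if dist <= r:
        out.append(node.pivot)              # the pivot itself qualifies
    # only children with max{dist - r, 0} <= i <= dist + r can hold results
    for i, child in node.children.items():
        if max(dist - r, 0) <= i <= dist + r:
            bkt_range(child, q, r, d, out)
```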
Fixed Queries Tree (FQT)
- A modification of BKT.
- Each level has a single pivot; all objects are stored in leaves.
- During the search, distance computations are saved:
  - usually more branches are accessed, but only one distance computation is needed per level.
[Figure: an FQT over pivots p1, p2 and a range query R(q,r).]

Fixed-Height FQT (FHFQT)
- An extension of FQT.
- All leaf nodes are at the same level:
  - increased filtering using more routing objects;
  - the extended tree depth does not typically introduce further distance computations.
[Figure: the FQT from above and its fixed-height variant FHFQT.]

Fixed Queries Array (FQA)
- Based on FHFQT.
- An h-level tree is transformed into an array of paths:
  - every leaf node is represented by the path from the root node;
  - each path is encoded as h values of distance.
- The search algorithm turns into a binary search in array intervals.
[Figure: an FHFQT and the corresponding FQA; with pivots p1, p2 the rows read
 p1: 0 2 2 3 3 4 and p2: 2 0 3 4 5 6.]

Vantage Point Tree (VPT)
- Uses ball partitioning; recursively divides a given dataset X.
- Choose a vantage point p ∈ X and compute the median m:
  - S1 = {x ∈ X - {p} | d(x,p) ≤ m}
  - S2 = {x ∈ X - {p} | d(x,p) ≥ m}
  - The equality sign in both sets ensures balancing: objects at distance exactly m can be placed on either side.

VPT (cont.)
- One or more objects can be accommodated in leaves.
- The VP tree is a balanced binary tree.
- It is a static structure.
- The pivots p1, p2 and p3 belong to the database!
- In the following, we assume just one object per leaf.
[Figure: a three-level VPT over objects o1-o12 with pivots p1, p2, p3 and medians m1, m2, m3.]

VPT: Range Search
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ r, report pi on the output;
  - if d(q,pi) - r ≤ mi, search the left sub-tree;
  - if d(q,pi) + r ≥ mi, search the right sub-tree.
Both sub-trees must be entered when the query ball overlaps the partitioning sphere.
[Figure: (a) the query ball inside the ball region, (b) the query ball overlapping both regions.]

VPT: k-NN Search
Given a query NN(q):
- Initialization: dNN = dmax, NN = nil.
- Traverse the tree starting from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ dNN, set dNN = d(q,pi) and NN = pi;
  - if d(q,pi) - dNN ≤ mi, search the left sub-tree;
  - if d(q,pi) + dNN ≥ mi, search the right sub-tree.
- k-NN search only requires the arrays dNN[k] and NN[k],
  kept ordered with respect to the distance to q.
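A minimal sketch of VPT construction and range search under the definitions above (illustration only; this code sends ties at the median to the left, whereas the slides allow them on either side to keep the tree balanced):

```python
import statistics

# Minimal VPT sketch (illustration only): one object per leaf.

class VPTNode:
    def __init__(self, pivot, median=None, left=None, right=None):
        self.pivot, self.median = pivot, median
        self.left, self.right = left, right

def vpt_build(X, d):
    if not X:
        return None
    if len(X) == 1:
        return VPTNode(X[0])                        # leaf
    p, rest = X[0], X[1:]
    m = statistics.median(d(x, p) for x in rest)    # balancing median
    S1 = [x for x in rest if d(x, p) <= m]          # inside the ball
    S2 = [x for x in rest if d(x, p) > m]           # outside the ball
    return VPTNode(p, m, vpt_build(S1, d), vpt_build(S2, d))

def vpt_range(node, q, r, d, out):
    if node is None:
        return
    dist = d(q, node.pivot)
    if dist <= r:
        out.append(node.pivot)                      # pivots are database objects
    if node.median is None:                         # leaf: nothing more to do
        return
    if dist - r <= node.median:                     # query ball reaches inside
        vpt_range(node.left, q, r, d, out)
    if dist + r >= node.median:                     # query ball reaches outside
        vpt_range(node.right, q, r, d, out)
```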
Multi-Way Vantage Point Tree
- Inherits all principles from VPT, but the partitioning is modified.
- An m-ary balanced tree that applies multi-way ball partitioning.
[Figure: pivot p1 with three quantiles m1 < m2 < m3 splitting the data into S1,1 ... S1,4.]

Vantage Point Forest (VPF)
- A forest of binary trees.
- Uses excluded middle partitioning:
  - a middle area of width 2ρ around the partitioning radius is excluded
    from the process of tree building.
[Figure: plain ball partitioning vs. excluded middle partitioning around pivot pi and radius mi.]

VPF (cont.)
- The given dataset X is recursively divided and a binary tree is built.
- The excluded middle areas are collected and used for building another binary tree.
[Figure: the first tree built over X, and a second tree built over M1 + M2 + M3,
 the union of the excluded middle regions.]

VPF: Range Search
Given a query R(q,r), with ρ the half-width of the excluded middle zone
(the slides write r for both quantities; ρ is used here to avoid confusion):
- Start with the first tree and traverse it from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ r, report pi;
  - if d(q,pi) - r ≤ mi - ρ, search the left sub-tree;
    - if also d(q,pi) + r ≥ mi - ρ, search the next tree!
  - if d(q,pi) + r ≥ mi + ρ, search the right sub-tree;
    - if also d(q,pi) - r ≤ mi + ρ, search the next tree!
  - if d(q,pi) - r ≥ mi - ρ and d(q,pi) + r ≤ mi + ρ, search only the next tree!

VPF: Range Search (cont.)
- If the query intersects all partitions:
  - search both sub-trees and the next tree.
- If the query collides only with the exclusion zone:
  - search just the next tree.
[Figure: the two situations for a query ball against the excluded middle zone of width 2ρ.]
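The node-visit logic of the VPF range search condenses into one function. A sketch under the notation above (the function name and the pure-function form are assumptions of this example):

```python
# Excluded-middle node visit for VPF (sketch): given the query-pivot distance,
# decide which sub-trees to enter and whether the next tree of the forest
# must also be searched.

def vpf_node_decision(dist_qp, m, r, rho):
    go_left  = dist_qp - r <= m - rho
    go_right = dist_qp + r >= m + rho
    # the next tree holds the excluded middle; it must be searched whenever
    # the query ball overlaps the zone [m - rho, m + rho]
    next_tree = (dist_qp + r >= m - rho) and (dist_qp - r <= m + rho)
    return go_left, go_right, next_tree

# e.g. a query ball lying entirely inside the exclusion zone:
# vpf_node_decision(5.0, 5.0, 0.3, 0.5) -> (False, False, True)
```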
Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
   1. Bisector Tree
   2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Bisector Tree (BT)
- Applies generalized hyper-plane partitioning.
- Recursively divides a given dataset X.
- Choose two arbitrary points p1, p2 ∈ X and form subsets from the remaining objects:
  - S1 = {o ∈ X, d(o,p1) ≤ d(o,p2)}
  - S2 = {o ∈ X, d(o,p1) > d(o,p2)}
- Covering radii r1c and r2c are established: each is the distance from the pivot
  to the farthest object of its subset.
  - The covering balls can intersect!
[Figure: pivots p1, p2 with covering radii r1c, r2c.]

BT: Range Query
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node with pivots pi, pj (x ∈ {i,j}):
  - report px on the output if d(q,px) ≤ r;
  - enter the child of px if d(q,px) - r ≤ rxc.
[Figure: pruning with the covering radii ric, rjc.]

Monotonous Bisector Tree (MBT)
- A variant of the Bisector Tree.
- Child nodes inherit one pivot from the parent.
[Figure: Bisector Tree vs. Monotonous Bisector Tree partitioning;
 for convenience, no covering radii are shown.]

MBT (cont.)
- Fewer pivots are used → fewer distance evaluations during query processing
  and more objects in leaves.
[Figure: the same dataset indexed by a BT (pivots p1-p6) and an MBT (pivots p1-p4).]

Voronoi Tree
- An extension of the Bisector Tree.
- Uses more pivots in each internal node, usually three.
[Figure: three pivots p1, p2, p3 with covering radii r1c, r2c, r3c.]

Generalized Hyper-plane Tree (GHT)
- Similar to Bisector Trees, but covering radii are not used.
[Figure: recursive hyper-plane partitioning with pivot pairs (p1,p2), (p3,p4), (p5,p6).]

GHT: Range Query
Pruning is based on hyper-plane partitioning.
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node with pivots pi, pj (x ∈ {i,j}):
  - report px on the output if d(q,px) ≤ r;
  - enter the left child if d(q,pi) - r ≤ d(q,pj) + r;
  - enter the right child if d(q,pi) + r ≥ d(q,pj) - r.
[Figure: queries q1, q2 against the hyper-plane between pi and pj.]
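A sketch of the GHT range search (illustration only; the node layout with two pivots and plain Python lists as leaves is an assumption of this example):

```python
# GHT range search sketch: each internal node keeps two pivots and two
# children separated by the generalized hyper-plane between them.

class GHTNode:
    def __init__(self, pi, pj, left, right):
        self.pi, self.pj = pi, pj
        self.left, self.right = left, right   # children, or lists of objects (leaves)

def ght_range(node, q, r, d, out):
    if isinstance(node, list):                # leaf: compare objects directly
        out.extend(o for o in node if d(q, o) <= r)
        return
    di, dj = d(q, node.pi), d(q, node.pj)
    if di <= r: out.append(node.pi)           # pivots are database objects
    if dj <= r: out.append(node.pj)
    # the query ball may overlap both half-spaces, hence two independent tests
    if di - r <= dj + r:
        ght_range(node.left, q, r, d, out)
    if di + r >= dj - r:
        ght_range(node.right, q, r, d, out)
```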
Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
   1. AESA
   2. Linear AESA
   3. Other Methods - Shapiro, Spaghettis
4. hybrid indexing approaches
5. approximate techniques

Exploiting Pre-computed Distances
- During the insertion of an object into a structure, some distances are evaluated.
- If they are remembered, we can employ them for filtering when processing a query.

AESA
- Approximating and Eliminating Search Algorithm.
- An n×n matrix of distances is stored; due to symmetry, only one half
  (n(n-1)/2 entries) is kept.
- Every object can play the role of a pivot.

     o1   o2   o3   o4   o5   o6
o1   0    1.6  2.0  3.5  1.6  3.6
o2   1.6  0    1.0  2.6  2.6  4.2
o3   2.0  1.0  0    1.6  2.1  3.5
o4   3.5  2.6  1.6  0    3.0  3.4
o5   1.6  2.6  2.1  3.0  0    2.0
o6   3.6  4.2  3.5  3.4  2.0  0

AESA: Range Query
Given a query R(q,r):
- Randomly pick an object and use it as pivot p.
- Compute d(q,p).
- Filter out an object o if |d(q,p) - d(p,o)| > r.

AESA: Range Query (cont.)
- From the remaining objects, select another object as pivot p.
  - To maximize pruning, select the object closest to q:
    this maximizes the lower bound |d(q,p) - d(p,o)| on the distances d(q,o).
- Filter out objects using p.

AESA: Range Query (cont.)
- This process is repeated until the number of remaining objects is small enough,
  or all objects have been used as pivots.
- Check the remaining objects directly against q: report o if d(q,o) ≤ r.
- Objects o that fulfill d(q,p) + d(p,o) ≤ r can be reported directly,
  without further checking (the triangle inequality guarantees d(q,o) ≤ r);
  e.g. o5, because it was the pivot in the previous step.
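A compact AESA sketch (illustration only). D stands for the precomputed distance matrix, accessed symmetrically, and the slide's "closest object to q" rule is approximated here by picking the candidate with the smallest current lower bound:

```python
# AESA range query sketch: D[p][o] is the precomputed distance between
# database objects p and o; d is the (expensive) metric.

def aesa_range(q, r, objects, D, d):
    candidates = set(objects)
    result = []
    lower = {o: 0.0 for o in objects}          # best lower bound on d(q,o)
    while candidates:
        # next pivot: the candidate whose lower bound is smallest
        # (initially all bounds are 0, so the first pick is arbitrary)
        p = min(candidates, key=lambda o: lower[o])
        candidates.remove(p)
        dqp = d(q, p)                           # one real distance computation
        if dqp <= r:
            result.append(p)
        # eliminate objects via |d(q,p) - d(p,o)| > r, tighten bounds otherwise
        for o in list(candidates):
            bound = abs(dqp - D[p][o])
            if bound > r:
                candidates.remove(o)
            elif bound > lower[o]:
                lower[o] = bound
    return result
```

In this sketch every surviving candidate eventually becomes a pivot, so the final direct check of the remaining objects is subsumed by the loop.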
Linear AESA (LAESA)
- AESA is quadratic in space.
- LAESA stores the distances to m pivots only.
- Pivots should be selected conveniently:
  - pivots as far away from each other as possible are chosen.

     o1   o2   o3   o4   o5   o6
o2   1.6  0    1.0  2.6  2.6  4.2
o6   3.6  4.2  3.5  3.4  2.0  0

[Figure: objects o1-o6 with o2 and o6 chosen as pivots.]

LAESA: Range Query
- Due to the limited number of pivots, the algorithm differs from AESA:
  we may not be able to select a pivot among the non-discarded objects.
  - First, all pivots are used for filtering.
  - Next, the remaining objects are directly compared to q.
[Figure: a query R(q,1.4) over objects o1-o6, filtered via the pivots o2 and o6.]

LAESA: Summary
- AESA and LAESA tend to be linear in distance computations
  for larger query radii or higher values of k.
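The two phases of the LAESA range query, as a sketch (hypothetical names; `objects` is assumed not to contain the pivots, and `Dp` holds the m precomputed pivot rows):

```python
# LAESA range query sketch: P is the list of pivots, Dp[p][o] the
# precomputed distance between pivot p and database object o.

def laesa_range(q, r, objects, P, Dp, d):
    # 1) compute the query-pivot distances (m real distance computations)
    dq = {p: d(q, p) for p in P}
    result = [p for p in P if dq[p] <= r]
    # 2) filter: o survives only if no pivot proves |d(q,p) - d(p,o)| > r
    survivors = [o for o in objects
                 if all(abs(dq[p] - Dp[p][o]) <= r for p in P)]
    # 3) verify the survivors directly against q
    result.extend(o for o in survivors if d(q, o) <= r)
    return result
```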
Shapiro's LAESA
- Very similar to LAESA.
- Database objects are sorted with respect to the first pivot.

     o2   o3   o1   o4   o5   o6
o2   0    1.0  1.6  2.6  2.6  4.2
o6   4.2  3.5  3.6  3.4  2.0  0

Shapiro's LAESA: Range Query
Given a query R(q,r):
- Compute d(q,p1).
- Start with the object oi "closest" to q, i.e., the one minimizing |d(q,p1) - d(p1,oi)|.
Example: p1 = o2, d(q,o2) = 3.2, r = 1.4 → o4 is picked, since |3.2 - 2.6| = 0.6 is minimal.

Shapiro's LAESA: Range Query (cont.)
- Next, oi is checked against all pivots:
  - discard it if |d(q,pj) - d(pj,oi)| > r for any pivot pj;
  - if not eliminated, compute d(q,oi) and report oi if d(q,oi) ≤ r.
Example: R(q,1.4) with d(q,o2) = 3.2 and d(q,o6) = 1.2.

Shapiro's LAESA: Range Query (cont.)
- The search continues with the objects oi+1, oi-1, oi+2, oi-2, …
  - until both |d(q,p1) - d(p1,oi+?)| > r and |d(q,p1) - d(p1,oi-?)| > r hold;
    the sort order guarantees that no further object can qualify beyond these bounds.
Example: p1 = o2, d(q,o2) = 3.2:
  |d(q,o2) - d(o2,o1)| = 1.6 > 1.4 → the scan stops on this side;
  |d(q,o2) - d(o2,o6)| = 1.0 ≤ 1.4 → o6 is still examined.

Spaghettis
- An improvement of LAESA.
- The m×n matrix is stored in m arrays of length n.
- Each array is sorted according to the distances in it.
- The position of an object o can vary from array to array:
  - pointers (or array permutations) with respect to the preceding array must be stored.

array of pivot o2: (o2, 0), (o3, 1.0), (o1, 1.6), (o4, 2.6), (o5, 2.6), (o6, 4.2)
array of pivot o6: 0, 2.0, 3.4, 3.5, 3.6, 4.2 (the object order differs and is tracked by the pointers)

Spaghettis: Range Query
Given a query R(q,r):
- Compute the distances to the pivots, i.e., d(q,pi).
- One interval is defined on each of the m arrays:
  [d(q,pi) - r, d(q,pi) + r] for all 1 ≤ i ≤ m.

Spaghettis: Range Query (cont.)
- Qualifying objects lie in the intersection of the intervals:
  - the pointers are followed from array to array.
- Non-discarded objects are checked against q.
Example: the response is o5, o6.

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
   1. Multi Vantage Point Tree
   2. Geometric Near-neighbor Access Tree
   3. Spatial Approximation Tree
   4. M-tree
   5. Similarity Hashing
5. approximate techniques

Hybrid Approaches: Introduction
- Structures that store pre-computed distances have high space requirements,
  but give a good performance boost during query processing.
- Hybrid approaches combine partitioning and pre-computed distances in a single system:
  - lower space requirements;
  - good query performance.

Multi Vantage Point Tree (MVPT)
- Based on the Vantage Point Tree (VPT); targeted at static collections as well.
- Tries to decrease the number of pivots, with the aim of improving performance
  in terms of distance computations.
- Stores distances to pivots in leaves:
  - these distances are evaluated during the insertion of objects.
- No object duplication:
  - objects playing the role of a pivot are stored only in internal nodes.
- Leaf nodes can contain more than one object.

MVPT: Structure
- Two pivots are used in each internal node (VPT uses just one).
- Idea: two levels of a VPT collapsed into a single node.
[Figure: a VPT over o1-o15 and the corresponding MVPT.]

MVPT: Internal Node
- Ball partitioning is applied; pivot p2 is shared by both branches.
- In general, MVPT can use k pivots in a node:
  - the number of children is 2^k!
  - multi-way partitioning can be used as well → m^k children.
[Figure: pivots p1, p2 with medians dm1, dm2, dm3 producing the partitions S1-S4.]

MVPT: Leaf Node
- A leaf node stores two "pivots" as well:
  - the first pivot is selected at random;
  - the second pivot is picked as the object furthest from the first;
  - the same selection is used in internal nodes.
- Capacity is c objects + 2 pivots.
- Along with each object, the distances to the first h pivots on the path
  from the root are stored:

     o1   o2   o3   o4   o5   o6
p1   1.6  4.1  1.0  2.6  2.6  3.3
p2   3.6  3.4  3.5  3.4  2.0  2.5

MVPT: Range Search
Given a query R(q,r):
- Initialize the array q.PATH of h distances from q to the first h pivots;
  the values are initialized to undefined.
- Start in the root node and traverse the tree (depth-first).

MVPT: Range Search (cont.)
In an internal node with pivots pi, pi+1:
- Compute the distances d(q,pi), d(q,pi+1) and store them in q.PATH
  if the pivots are within the first h pivots from the root.
- If d(q,pi) ≤ r, output pi; if d(q,pi+1) ≤ r, output pi+1.
- If d(q,pi) - r ≤ dm1 (the query ball reaches the inner region):
  - if d(q,pi+1) - r ≤ dm2, visit the first branch;
  - if d(q,pi+1) + r ≥ dm2, visit the second branch.
- If d(q,pi) + r ≥ dm1 (the query ball reaches the outer region):
  - if d(q,pi+1) - r ≤ dm3, visit the third branch;
  - if d(q,pi+1) + r ≥ dm3, visit the fourth branch.
(The slides omit the ±r terms here; they are needed for the range query
to be correct, in analogy to the VPT conditions.)

MVPT: Range Search (cont.)
In a leaf node with pivots p1, p2 and objects o1,…,oc:
- Compute d(q,p1), d(q,p2); if d(q,p1) ≤ r output p1, if d(q,p2) ≤ r output p2.
- For all objects o1,…,oc:
  - if d(q,p1) - r ≤ d(oi,p1) ≤ d(q,p1) + r
    and d(q,p2) - r ≤ d(oi,p2) ≤ d(q,p2) + r
    and ∀pj: q.PATH[j] - r ≤ oi.PATH[j] ≤ q.PATH[j] + r:
    - compute d(q,oi) and output oi if d(q,oi) ≤ r.
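The leaf-node filtering condition can be written compactly. A sketch (hypothetical field names `o.PATH` and `q_path`, following the slides' q.PATH):

```python
# MVPT leaf-node filtering sketch: o.PATH holds the precomputed distances of
# object o to the first h pivots on its root path, q_path the corresponding
# (already computed) query-pivot distances; undefined entries are None.

def mvpt_leaf_candidates(objects, q_path, r):
    """Keep only objects that no stored pivot distance can rule out."""
    survivors = []
    for o in objects:
        ok = all(qp - r <= op <= qp + r        # i.e. |q_path[j] - o.PATH[j]| <= r
                 for qp, op in zip(q_path, o.PATH)
                 if qp is not None)            # undefined entries are skipped
        if ok:
            survivors.append(o)                # d(q,o) still has to be computed
    return survivors
```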
Geometric Near-neighbor Access Tree (GNAT)
- An m-ary tree based on Voronoi-like partitioning:
  - m can vary with the level in the tree.
- A set of pivots P = {p1,…,pm} is selected from X:
  - split X into m subsets Si;
  - ∀o ∈ X-P: o ∈ Si if d(pi,o) ≤ d(pj,o) for all j = 1..m;
  - this process is repeated recursively.
[Figure: four pivots p1-p4 partitioning objects o1-o9 and the resulting tree.]

GNAT (cont.)
- Pre-computed distances are also stored:
  - an m×m table of distance ranges is kept in each internal node;
  - the minimum and maximum of the distances between each pivot pi
    and the objects of each subset Sj are stored.
[Figure: pivot pi, subset Sj, and the range [rl_ij, rh_ij].]

GNAT (cont.)
- The m×m table of distance ranges, e.g.:

        S1          S2          …   Sm-1        Sm
p1      [0.0, 2.1]  [2.3, 3.7]  …   [5.2, 6.0]  [1.0, 5.1]
p2      [3.0, 3.8]  [0.0, 1.5]  …   [6.9, 7.8]  [2.5, 6.4]
…       …           …           …   …           …
pm-1    [4.2, 7.0]  [2.8, 4.2]  …   [0.0, 0.9]  [5.9, 8.9]
pm      [2.1, 4.0]  [6.8, 8.3]  …   [8.0, 8.7]  [0.0, 4.2]

- Each range [rl_ij, rh_ij] is defined over Sj ∪ {pj}:
  rl_ij = min{d(pi,o) | o ∈ Sj ∪ {pj}},  rh_ij = max{d(pi,o) | o ∈ Sj ∪ {pj}}.
  - Notice that rl_ii = 0.

GNAT: Choosing Pivots
- For good clustering, pivots cannot be chosen randomly.
- From a sample of 3m objects, select m pivots:
  - three is an empirically derived constant;
  - the first pivot is chosen at random;
  - the second pivot is the object furthest from it;
  - the third pivot is the object furthest from the previous two,
    i.e., the minimum of the two distances is maximized;
  - … and so on, until we have m pivots.

GNAT: Range Search
Given a query R(q,r):
- Start in the root node and traverse the tree (depth-first).
- In internal nodes, employ the distance ranges to prune some branches.
- In leaf nodes, all objects are directly compared to q:
  - if d(q,o) ≤ r, report o to the output.

GNAT: Range Search (cont.)
In an internal node with pivots p1, p2,…, pm:
- Pick one pivot pi at random and compute d(q,pi).
- Gradually pick each next non-examined pivot pj:
  - if d(q,pi) - r > rh_ij or d(q,pi) + r < rl_ij,
    discard pj and its sub-tree.
- The remaining pivots pj are compared with q:
  - if d(q,pj) - r > rh_jj, discard pj and its sub-tree;
  - if d(q,pj) ≤ r, output pj;
  - the corresponding sub-tree is visited.
[Figure: queries q1, q2 against the ranges [rl_ij, rh_ij] and rh_jj.]
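A sketch of the internal-node pruning described above (illustration only; `ranges[i][j]` is assumed to hold the pair (rl_ij, rh_ij), and the single-random-pivot pass mirrors the simplified slide version):

```python
import random

# GNAT internal-node pruning sketch: rh_jj acts like a covering radius
# of subset Sj around its own pivot pj.

def gnat_node_visit(q, r, pivots, ranges, d, out):
    alive = set(range(len(pivots)))
    i = random.choice(tuple(alive))
    dqi = d(q, pivots[i])
    if dqi <= r:
        out.append(pivots[i])
    for j in tuple(alive - {i}):
        rl, rh = ranges[i][j]                  # all of Sj lies within [rl, rh] of pi
        if dqi - r > rh or dqi + r < rl:
            alive.discard(j)                   # Sj cannot intersect the query ball
    visit = []
    for j in alive - {i}:
        dqj = d(q, pivots[j])
        if dqj - r > ranges[j][j][1]:          # beyond rh_jj: subset out of reach
            continue
        if dqj <= r:
            out.append(pivots[j])
        visit.append(j)                        # sub-tree j must still be searched
    if dqi - r <= ranges[i][i][1]:             # pi's own sub-tree, pruned by rh_ii
        visit.append(i)
    return visit
```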
Spatial Approximation Tree (SAT)
- A tree based on Voronoi-like partitioning.
  - It also stores relations between partitions, i.e., an edge connects neighboring partitions.
  - For correctness in metric spaces, this would require edges between all pairs
    of objects in X; SAT therefore approximates such a graph.
- The root p is a randomly selected object from X.
  - A set N(p) of p's neighbors is defined.
  - Every object o ∈ X-N(p)-{p} is organized under the closest neighbor in N(p).
  - A covering radius is defined for every internal node (object).

SAT: Example
- Intuition of N(p):
  - each object of N(p) is closer to p than to any other object in N(p);
  - all objects in X-N(p)-{p} are closer to an object in N(p) than to p.
- The root is o1 with N(o1) = {o2,o3,o4,o5}:
  - o7 cannot be included, since it is closer to o3 than to o1;
  - the covering radius of o1 conceals all objects.
[Figure: a SAT over o1-o14 with root o1 and its covering radius rc.]

SAT: Building N(p)
- Construction of a minimal N(p) is NP-complete.
- Heuristic for creating N(p):
  - start with the pivot p, S = X-{p}, N(p) = {};
  - sort the objects in S with respect to their distances from p;
  - start adding objects to N(p): the new object oN is added
    if it is not closer to any object already in N(p) than to p.

SAT: Range Search
Given a query R(q,r):
- Start in the root node and traverse the tree.
- In internal nodes, prune some branches as described below.
- In leaf nodes, all objects are directly compared to q:
  - if d(q,o) ≤ r, report o to the output.

SAT: Range Search (cont.)
In an internal node with the pivot p and N(p):
- To prune some branches, locate the object oc ∈ N(p) ∪ {p} closest to q:
  - discard the sub-trees of od ∈ N(p) such that d(q,od) > 2r + d(q,oc);
  - the pruning effect is maximized when d(q,oc) is minimal.
[Figure: the ball of radius d(q,oc) + 2r around q; sub-trees outside it are pruned.]

SAT: Range Search (cont.)
- If we pick s2 as the closest object, pruning improves:
  the sub-tree of p2 will also be discarded.
- Therefore, select the closest object among more "neighbors":
  use p's ancestor and its neighbors as well.
[Figure: the enlarged candidate set shrinks the ball d(q,oc) + 2r and widens the pruned region.]

SAT: Range Search (cont.)
- Finally, apply the covering radii of the remaining objects:
  - discard od such that d(q,od) > rdc + r.
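The N(p) heuristic fits in a few lines; a sketch (illustration only; ties keep the candidate out of N(p) here):

```python
# Sketch of the N(p) construction heuristic: objects are considered in the
# order of their distance from p and join N(p) only if p is still their
# closest choice among p and the current neighbors.

def sat_neighbors(p, S, d):
    N = []
    for o in sorted(S, key=lambda x: d(x, p)):
        if all(d(o, p) < d(o, n) for n in N):   # closer to p than to every neighbor
            N.append(o)
    return N
```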
M-tree
- An inherently dynamic structure.
- Disk-oriented (fixed-size nodes).
- Built in a bottom-up fashion.
- Each node is constrained by a sphere-like (ball) region.
- Leaf node: data objects + their distances from a pivot kept in the parent node.
- Internal node: a pivot + the radius covering the subtree,
  and the distance from the pivot to the parent pivot.
- Filtering: covering radii + pre-computed distances.

M-tree: Extensions
- Bulk-loading algorithm:
  - considers the trade-off: dynamic properties vs. performance;
  - an M-tree building algorithm for a dataset given in advance;
  - results in a more efficient M-tree.
- Slim-tree:
  - a variant of the M-tree (dynamic);
  - reduces the fat-factor of the tree;
  - a tree with smaller overlaps between particular tree regions.
- Many variants and extensions exist - see Chapter 3.

Similarity Hashing
- A multilevel structure.
- One hash function (a ρ-split function) per level, producing several buckets.
- The first level splits the whole dataset.
- The next level partitions the exclusion zone of the previous level.
- The exclusion zone of the last level forms the exclusion bucket of the whole structure.

Similarity Hashing: Structure
[Figure: 4 separable buckets at the first level, 2 separable buckets at the second level,
 and the exclusion bucket of the whole structure.]

Similarity Hashing: ρ-Split Function
- Produces several separable buckets:
  - queries with radius up to ρ access one bucket at most;
  - if the exclusion zone is touched, the next level must be sought.
[Figure: separable buckets with exclusion zones of width 2ρ.]

Similarity Hashing: Features
- Bounded search costs for queries with radius ≤ ρ:
  - one bucket per level at maximum.
- Buckets of static files can be arranged in a way that the I/O costs
  never exceed those of a sequential scan.
- Direct insertion of objects:
  - the specific bucket is addressed directly by computing the hash functions.
- The D-index is based on similarity hashing:
  - it uses excluded middle partitioning as the hash function.

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Approximate Similarity Search
- Space transformation techniques:
  - introduced very briefly here.
- Reducing the subset of data to be examined:
  - most techniques were originally proposed for vector spaces,
    and some can also be used in metric spaces;
  - some are specific to metric spaces.

Exploiting Space Transformations
- Space transformation techniques transform the original data space into another,
  more suitable space; dimensionality reduction is one example.
- They are typically distance-preserving and satisfy the lower-bounding property:
  - distances measured in the transformed space are smaller than
    those computed in the original space.

Exploiting Space Transformations (cont.)
- Exact similarity search algorithms:
  - search in the transformed space;
  - filter out non-qualifying objects by re-measuring the distances
    of the retrieved objects in the original space.
- Approximate similarity search algorithms:
  - search in the transformed space;
  - do not perform the filtering step, so false hits may occur.

BBD Trees
- A Balanced Box-Decomposition (BBD) tree hierarchically divides the vector space
  with d-dimensional non-overlapping boxes:
  - leaf nodes of the tree contain a single object;
  - BBD trees are intended as a main-memory data structure.

BBD Trees (cont.)
- Exact k-NN(q) search proceeds as follows:
  - find the leaf containing the query object;
  - enumerate the leaves in increasing order of distance from q
    and maintain the k closest objects;
  - stop when the distance of the next leaf is greater than d(q,ok).
- Approximate k-NN(q):
  - stop when the distance of the next leaf is greater than d(q,ok)/(1+ε).
- The distances from q to the retrieved objects are at most (1+ε) times larger
  than the distance to the k-th actual nearest neighbor of q.

BBD Trees: Exact 1-NN Search
[Figure: boxes 1-10 enumerated in increasing distance from q for 1-NN(q).]

BBD Trees: Approximate 1-NN Search
- The radius d(q,oNN)/(1+ε) is used instead!
- Regions 9 and 10 are not accessed:
  - they do not intersect the circle of radius d(q,oNN)/(1+ε).
- The exact NN is missed!
[Figure: the shrunken radius excludes the regions that contain the exact NN.]

Angle Property Technique
- Observed (non-intuitive) properties in high-dimensional vector spaces:
  - objects tend to have the same distance from one another,
    so they tend to be distributed on the surface of ball regions;
  - parent and child regions have very close radii,
    so all regions intersect one another;
  - the angle formed by a query point, the centre of a ball region,
    and any data object is close to 90 degrees;
    the higher the dimensionality, the closer to 90 degrees.
- These properties can be exploited for approximate similarity search.

Angle Property Technique: Example
[Figure: objects concentrate near the region surface;
 a region is accessed only when the angle α exceeds the threshold θ.]

Clustering for Indexing (Clindex)
- Performs approximate similarity search in vector spaces exploiting clustering techniques.
- The dataset is partitioned into clusters of similar objects:
  - each cluster is represented by a separate file stored sequentially on the disk.

Clindex: Approximate Search
- Approximate similarity search:
  - seeks the cluster containing (or the cluster closest to) the query object;
  - sorts the objects in the cluster according to their distance from the query.
- The search is approximate, since qualifying objects can belong
  to other (non-accessed) clusters.
- More clusters can be accessed to improve precision.

Clindex: Clustering
- Each dimension of the d-dimensional vector space is divided into 2^n segments:
  the result is (2^n)^d cells in the data space.
- Each cell is associated with the number of objects it contains.

Clindex: Clustering (cont.)
- Clustering starts accessing the cells in decreasing order of the number of contained objects:
  - if a cell is adjacent to a cluster, it is attached to that cluster;
  - if a cell is not adjacent to any cluster, it is used as the seed of a new cluster;
  - if a cell is adjacent to more than one cluster, a heuristic decides
    whether the clusters should be merged, or which cluster the cell belongs to.

Clindex: Example
[Figure: a grid of cell counts with the resulting clusters;
 retrieved objects vs. missed objects for a query.]

Vector Quantization Index (VQ-Index)
- Also based on clustering techniques to perform approximate similarity search.
- Specifically:
  - the dataset is grouped into (not necessarily disjoint) subsets;
  - lossy compression techniques are used to reduce the size of the subsets;
  - a similarity query is processed by choosing a subset in which to search;
  - the chosen compressed dataset is searched after decompressing it.

VQ-Index: Subset Generation
- Query objects submitted by users are maintained in a history file.
- The queries in the history file are grouped into m clusters by the k-means algorithm.
- For each cluster Ci, a subset Si of the dataset is generated:
  Si is the union of the k nearest neighbors of the queries in Ci.
- An object may belong to several subsets.

VQ-Index: Subset Generation (cont.)
- The overlap of the subsets versus performance can be tuned by the choice of m and k:
  - a large k implies more objects in a subset, so more objects are recalled;
  - a large m implies more subsets, so fewer objects need to be accessed.

VQ-Index: Compression
- Subset compression with vector quantization:
  - an encoder function Enc maps every vector to an integer taken from a finite set {1,…,n};
  - a decoder function Dec maps every number from {1,…,n} to a representative vector.
- By combining Enc and Dec, every vector is represented by a representative vector:
  - several vectors might be represented by the same representative.
- Enc is used to compress the content of Si by applying it to every object in Si.

VQ-Index: Approximate Search
Given a query q:
- The cluster Ci closest to the query is located first.
- An approximation of Si is reconstructed by applying the decoder function Dec.
- The approximation of Si is searched for qualifying objects.
- Approximation occurs at two stages:
  - qualifying objects may be included in other subsets, in addition to Si;
  - the reconstructed approximation of Si may contain vectors
    which differ from the original ones.

Buoy Indexing
- The dataset is partitioned into disjoint clusters.
- A cluster is represented by a representative element - the buoy.
- Clusters are bounded by a ball region having the buoy as its center and
  the distance from the buoy to the farthest element of the cluster as its radius.
- This approach can be used in pure metric spaces.

Buoy Indexing: Similarity Search
- Given an exact k-NN query, the clusters are accessed in increasing distance
  of their buoys from q, until the current result set cannot be improved,
  i.e., until d(q,ok) + ri < d(q,pi) for the next cluster,
  where pi is the buoy and ri is the radius.
- An approximate k-NN query can be processed by stopping when
  - either the previous exact condition is true, or
  - a specified ratio f of the clusters has been accessed.
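A sketch of both stopping rules for buoy-indexed k-NN search (illustration only; the (buoy, radius, members) cluster layout and the parameter f are naming assumptions of this example, and the exact rule is applied with the break semantics the slides suggest):

```python
import heapq

# Buoy-indexing k-NN sketch: clusters are visited in increasing buoy
# distance; the heap keeps the k best objects seen so far.

def buoy_knn(q, k, clusters, d, f=1.0):
    order = sorted(((d(q, b), rad, mem) for b, rad, mem in clusters),
                   key=lambda t: t[0])
    heap = []                                    # max-heap: (-dist, id, object)
    for visited, (dqb, rad, mem) in enumerate(order, start=1):
        # exact rule from the slides: stop once d(q,ok) + ri < d(q,pi)
        if len(heap) == k and -heap[0][0] + rad < dqb:
            break
        if visited > f * len(order):             # approximate rule: ratio f reached
            break
        for o in mem:
            dist = d(q, o)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, id(o), o))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, id(o), o))
    return sorted(((-nd, o) for nd, _, o in heap), key=lambda t: t[0])
```

With f = 1.0 only the exact rule can fire, so the same function covers both the exact and the approximate variant.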
Hierarchical Decomposition of Metric Spaces
- In addition to the previous ones, there are other methods that were expressly designed to:
  - work on generic metric spaces;
  - organize large collections of data.
- They exploit the hierarchical decomposition of metric spaces.

Hierarchical Decomposition of Metric Spaces (cont.)
These will be discussed in detail later on:
- Relative error approximation:
  - the relative error on the distances of the approximate result is bounded.
- Good fraction approximation:
  - retrieves k objects from a specified fraction of the objects closest to the query.
- Small chance improvement approximation:
  - stops when the chances of improving the current result are low.
- Proximity-based approximation:
  - discards regions with a small probability of containing qualifying objects.
- PAC (Probably Approximately Correct) nearest neighbor search:
  - the relative error on distances is bounded with a specified probability.