SIMILARITY SEARCH: The Metric Space Approach

Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Survey of Existing Approaches
1. ball partitioning methods
   1. Burkhard-Keller Tree
   2. Fixed Queries Tree
   3. Fixed Queries Array
   4. Vantage Point Tree
      - Multi-Way Vantage Point Tree
   5. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Burkhard-Keller Tree (BKT) [BK73]
- Applicable to discrete distance functions only.
- Recursively divides a given dataset X.
- Choose an arbitrary point pj ∈ X and form the subsets
  Xi = {o ∈ X, d(o,pj) = i} for each distance i ≥ 0.
- For each non-empty Xi, create a sub-tree of pj; empty subsets are ignored.
[Figure: pivot pj with distance rings X2, X3, X4 and the corresponding m-ary tree.]

BKT: Range Query
Given a query R(q,r):
- Traverse the tree starting from the root.
- In each internal node pj:
  - report pj on the output if d(q,pj) ≤ r;
  - enter a child i if max{d(q,pj) - r, 0} ≤ i ≤ d(q,pj) + r.
[Figure: a BKT with pivots p1, p2, p3 and a range query R(q,r).]
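The two rules above translate directly into code. A minimal Python sketch of BKT insertion and range search (illustrative names only; any integer-valued metric, e.g. the edit distance, can play the role of d):

```python
# Minimal BKT sketch (illustration only): an integer-valued metric is assumed,
# one database object per node, children indexed by the distance to the pivot.

class BKTNode:
    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}                  # distance i -> BKTNode

def bkt_insert(node, o, d):
    i = d(o, node.pivot)
    if i in node.children:
        bkt_insert(node.children[i], o, d)
    else:
        node.children[i] = BKTNode(o)

def bkt_range(node, q, r, d, out):
    dist = d(q, node.pivot)
    if dist <= r:
        out.append(node.pivot)              # the pivot itself qualifies
    # only children with max{dist - r, 0} <= i <= dist + r can hold results
    for i, child in node.children.items():
        if max(dist - r, 0) <= i <= dist + r:
            bkt_range(child, q, r, d, out)
```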
Fixed Queries Tree (FQT)
- A modification of BKT.
- Each level has a single pivot; all objects are stored in leaves.
- During the search, distance computations are saved:
  - usually more branches are accessed, but only one distance computation is needed per level.
[Figure: an FQT over pivots p1, p2 and a range query R(q,r).]

Fixed-Height FQT (FHFQT)
- An extension of FQT.
- All leaf nodes are at the same level:
  - increased filtering using more routing objects;
  - the extended tree depth does not typically introduce further distance computations.
[Figure: the FQT from above and its fixed-height variant FHFQT.]

Fixed Queries Array (FQA)
- Based on FHFQT.
- An h-level tree is transformed into an array of paths:
  - every leaf node is represented by the path from the root node;
  - each path is encoded as h values of distance.
- The search algorithm turns into a binary search in array intervals.
[Figure: an FHFQT and the corresponding FQA; with pivots p1, p2 the rows read
 p1: 0 2 2 3 3 4 and p2: 2 0 3 4 5 6.]

Vantage Point Tree (VPT)
- Uses ball partitioning; recursively divides a given dataset X.
- Choose a vantage point p ∈ X and compute the median m:
  - S1 = {x ∈ X - {p} | d(x,p) ≤ m}
  - S2 = {x ∈ X - {p} | d(x,p) ≥ m}
  - The equality sign in both sets ensures balancing: objects at distance exactly m can be placed on either side.

VPT (cont.)
- One or more objects can be accommodated in leaves.
- The VP tree is a balanced binary tree.
- It is a static structure.
- The pivots p1, p2 and p3 belong to the database!
- In the following, we assume just one object per leaf.
[Figure: a three-level VPT over objects o1-o12 with pivots p1, p2, p3 and medians m1, m2, m3.]

VPT: Range Search
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ r, report pi on the output;
  - if d(q,pi) - r ≤ mi, search the left sub-tree;
  - if d(q,pi) + r ≥ mi, search the right sub-tree.
Both sub-trees must be entered when the query ball overlaps the partitioning sphere.
[Figure: (a) the query ball inside the ball region, (b) the query ball overlapping both regions.]

VPT: k-NN Search
Given a query NN(q):
- Initialization: dNN = dmax, NN = nil.
- Traverse the tree starting from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ dNN, set dNN = d(q,pi) and NN = pi;
  - if d(q,pi) - dNN ≤ mi, search the left sub-tree;
  - if d(q,pi) + dNN ≥ mi, search the right sub-tree.
- k-NN search only requires the arrays dNN[k] and NN[k],
  kept ordered with respect to the distance to q.
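A minimal sketch of VPT construction and range search under the definitions above (illustration only; this code sends ties at the median to the left, whereas the slides allow them on either side to keep the tree balanced):

```python
import statistics

# Minimal VPT sketch (illustration only): one object per leaf.

class VPTNode:
    def __init__(self, pivot, median=None, left=None, right=None):
        self.pivot, self.median = pivot, median
        self.left, self.right = left, right

def vpt_build(X, d):
    if not X:
        return None
    if len(X) == 1:
        return VPTNode(X[0])                        # leaf
    p, rest = X[0], X[1:]
    m = statistics.median(d(x, p) for x in rest)    # balancing median
    S1 = [x for x in rest if d(x, p) <= m]          # inside the ball
    S2 = [x for x in rest if d(x, p) > m]           # outside the ball
    return VPTNode(p, m, vpt_build(S1, d), vpt_build(S2, d))

def vpt_range(node, q, r, d, out):
    if node is None:
        return
    dist = d(q, node.pivot)
    if dist <= r:
        out.append(node.pivot)                      # pivots are database objects
    if node.median is None:                         # leaf: nothing more to do
        return
    if dist - r <= node.median:                     # query ball reaches inside
        vpt_range(node.left, q, r, d, out)
    if dist + r >= node.median:                     # query ball reaches outside
        vpt_range(node.right, q, r, d, out)
```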
Multi-Way Vantage Point Tree
- Inherits all principles from VPT, but the partitioning is modified.
- An m-ary balanced tree that applies multi-way ball partitioning.
[Figure: pivot p1 with three quantiles m1 < m2 < m3 splitting the data into S1,1 ... S1,4.]

Vantage Point Forest (VPF)
- A forest of binary trees.
- Uses excluded middle partitioning:
  - a middle area of width 2ρ around the partitioning radius is excluded
    from the process of tree building.
[Figure: plain ball partitioning vs. excluded middle partitioning around pivot pi and radius mi.]

VPF (cont.)
- The given dataset X is recursively divided and a binary tree is built.
- The excluded middle areas are collected and used for building another binary tree.
[Figure: the first tree built over X, and a second tree built over M1 + M2 + M3,
 the union of the excluded middle regions.]

VPF: Range Search
Given a query R(q,r), with ρ the half-width of the excluded middle zone
(the slides write r for both quantities; ρ is used here to avoid confusion):
- Start with the first tree and traverse it from its root.
- In each internal node (pi,mi):
  - if d(q,pi) ≤ r, report pi;
  - if d(q,pi) - r ≤ mi - ρ, search the left sub-tree;
    - if also d(q,pi) + r ≥ mi - ρ, search the next tree!
  - if d(q,pi) + r ≥ mi + ρ, search the right sub-tree;
    - if also d(q,pi) - r ≤ mi + ρ, search the next tree!
  - if d(q,pi) - r ≥ mi - ρ and d(q,pi) + r ≤ mi + ρ, search only the next tree!

VPF: Range Search (cont.)
- If the query intersects all partitions:
  - search both sub-trees and the next tree.
- If the query collides only with the exclusion zone:
  - search just the next tree.
[Figure: the two situations for a query ball against the excluded middle zone of width 2ρ.]
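The node-visit logic of the VPF range search condenses into one function. A sketch under the notation above (the function name and the pure-function form are assumptions of this example):

```python
# Excluded-middle node visit for VPF (sketch): given the query-pivot distance,
# decide which sub-trees to enter and whether the next tree of the forest
# must also be searched.

def vpf_node_decision(dist_qp, m, r, rho):
    go_left  = dist_qp - r <= m - rho
    go_right = dist_qp + r >= m + rho
    # the next tree holds the excluded middle; it must be searched whenever
    # the query ball overlaps the zone [m - rho, m + rho]
    next_tree = (dist_qp + r >= m - rho) and (dist_qp - r <= m + rho)
    return go_left, go_right, next_tree

# e.g. a query ball lying entirely inside the exclusion zone:
# vpf_node_decision(5.0, 5.0, 0.3, 0.5) -> (False, False, True)
```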
Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
   1. Bisector Tree
   2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Bisector Tree (BT)
- Applies generalized hyper-plane partitioning.
- Recursively divides a given dataset X.
- Choose two arbitrary points p1, p2 ∈ X and form subsets from the remaining objects:
  - S1 = {o ∈ X, d(o,p1) ≤ d(o,p2)}
  - S2 = {o ∈ X, d(o,p1) > d(o,p2)}
- Covering radii r1c and r2c are established: each is the distance from the pivot
  to the farthest object of its subset.
  - The covering balls can intersect!
[Figure: pivots p1, p2 with covering radii r1c, r2c.]

BT: Range Query
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node with pivots pi, pj (x ∈ {i,j}):
  - report px on the output if d(q,px) ≤ r;
  - enter the child of px if d(q,px) - r ≤ rxc.
[Figure: pruning with the covering radii ric, rjc.]

Monotonous Bisector Tree (MBT)
- A variant of the Bisector Tree.
- Child nodes inherit one pivot from the parent.
[Figure: Bisector Tree vs. Monotonous Bisector Tree partitioning;
 for convenience, no covering radii are shown.]

MBT (cont.)
- Fewer pivots are used → fewer distance evaluations during query processing
  and more objects in leaves.
[Figure: the same dataset indexed by a BT (pivots p1-p6) and an MBT (pivots p1-p4).]

Voronoi Tree
- An extension of the Bisector Tree.
- Uses more pivots in each internal node, usually three.
[Figure: three pivots p1, p2, p3 with covering radii r1c, r2c, r3c.]

Generalized Hyper-plane Tree (GHT)
- Similar to Bisector Trees, but covering radii are not used.
[Figure: recursive hyper-plane partitioning with pivot pairs (p1,p2), (p3,p4), (p5,p6).]

GHT: Range Query
Pruning is based on hyper-plane partitioning.
Given a query R(q,r):
- Traverse the tree starting from its root.
- In each internal node with pivots pi, pj (x ∈ {i,j}):
  - report px on the output if d(q,px) ≤ r;
  - enter the left child if d(q,pi) - r ≤ d(q,pj) + r;
  - enter the right child if d(q,pi) + r ≥ d(q,pj) - r.
[Figure: queries q1, q2 against the hyper-plane between pi and pj.]
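A sketch of the GHT range search (illustration only; the node layout with two pivots and plain Python lists as leaves is an assumption of this example):

```python
# GHT range search sketch: each internal node keeps two pivots and two
# children separated by the generalized hyper-plane between them.

class GHTNode:
    def __init__(self, pi, pj, left, right):
        self.pi, self.pj = pi, pj
        self.left, self.right = left, right   # children, or lists of objects (leaves)

def ght_range(node, q, r, d, out):
    if isinstance(node, list):                # leaf: compare objects directly
        out.extend(o for o in node if d(q, o) <= r)
        return
    di, dj = d(q, node.pi), d(q, node.pj)
    if di <= r: out.append(node.pi)           # pivots are database objects
    if dj <= r: out.append(node.pj)
    # the query ball may overlap both half-spaces, hence two independent tests
    if di - r <= dj + r:
        ght_range(node.left, q, r, d, out)
    if di + r >= dj - r:
        ght_range(node.right, q, r, d, out)
```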
Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
   1. AESA
   2. Linear AESA
   3. Other Methods - Shapiro, Spaghettis
4. hybrid indexing approaches
5. approximate techniques

Exploiting Pre-computed Distances
- During the insertion of an object into a structure, some distances are evaluated.
- If they are remembered, we can employ them for filtering when processing a query.

AESA
- Approximating and Eliminating Search Algorithm.
- An n×n matrix of distances is stored; due to symmetry, only one half
  (n(n-1)/2 entries) is kept.
- Every object can play the role of a pivot.

     o1   o2   o3   o4   o5   o6
o1   0    1.6  2.0  3.5  1.6  3.6
o2   1.6  0    1.0  2.6  2.6  4.2
o3   2.0  1.0  0    1.6  2.1  3.5
o4   3.5  2.6  1.6  0    3.0  3.4
o5   1.6  2.6  2.1  3.0  0    2.0
o6   3.6  4.2  3.5  3.4  2.0  0

AESA: Range Query
Given a query R(q,r):
- Randomly pick an object and use it as pivot p.
- Compute d(q,p).
- Filter out an object o if |d(q,p) - d(p,o)| > r.

AESA: Range Query (cont.)
- From the remaining objects, select another object as pivot p.
  - To maximize pruning, select the object closest to q:
    this maximizes the lower bound |d(q,p) - d(p,o)| on the distances d(q,o).
- Filter out objects using p.

AESA: Range Query (cont.)
- This process is repeated until the number of remaining objects is small enough,
  or all objects have been used as pivots.
- Check the remaining objects directly against q: report o if d(q,o) ≤ r.
- Objects o that fulfill d(q,p) + d(p,o) ≤ r can be reported directly,
  without further checking (the triangle inequality guarantees d(q,o) ≤ r);
  e.g. o5, because it was the pivot in the previous step.
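A compact AESA sketch (illustration only). D stands for the precomputed distance matrix, accessed symmetrically, and the slide's "closest object to q" rule is approximated here by picking the candidate with the smallest current lower bound:

```python
# AESA range query sketch: D[p][o] is the precomputed distance between
# database objects p and o; d is the (expensive) metric.

def aesa_range(q, r, objects, D, d):
    candidates = set(objects)
    result = []
    lower = {o: 0.0 for o in objects}          # best lower bound on d(q,o)
    while candidates:
        # next pivot: the candidate whose lower bound is smallest
        # (initially all bounds are 0, so the first pick is arbitrary)
        p = min(candidates, key=lambda o: lower[o])
        candidates.remove(p)
        dqp = d(q, p)                           # one real distance computation
        if dqp <= r:
            result.append(p)
        # eliminate objects via |d(q,p) - d(p,o)| > r, tighten bounds otherwise
        for o in list(candidates):
            bound = abs(dqp - D[p][o])
            if bound > r:
                candidates.remove(o)
            elif bound > lower[o]:
                lower[o] = bound
    return result
```

In this sketch every surviving candidate eventually becomes a pivot, so the final direct check of the remaining objects is subsumed by the loop.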
Linear AESA (LAESA)
- AESA is quadratic in space.
- LAESA stores the distances to m pivots only.
- Pivots should be selected conveniently:
  - pivots as far away from each other as possible are chosen.

     o1   o2   o3   o4   o5   o6
o2   1.6  0    1.0  2.6  2.6  4.2
o6   3.6  4.2  3.5  3.4  2.0  0

[Figure: objects o1-o6 with o2 and o6 chosen as pivots.]

LAESA: Range Query
- Due to the limited number of pivots, the algorithm differs from AESA:
  we may not be able to select a pivot among the non-discarded objects.
  - First, all pivots are used for filtering.
  - Next, the remaining objects are directly compared to q.
[Figure: a query R(q,1.4) over objects o1-o6, filtered via the pivots o2 and o6.]

LAESA: Summary
- AESA and LAESA tend to be linear in distance computations
  for larger query radii or higher values of k.
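The two phases of the LAESA range query, as a sketch (hypothetical names; `objects` is assumed not to contain the pivots, and `Dp` holds the m precomputed pivot rows):

```python
# LAESA range query sketch: P is the list of pivots, Dp[p][o] the
# precomputed distance between pivot p and database object o.

def laesa_range(q, r, objects, P, Dp, d):
    # 1) compute the query-pivot distances (m real distance computations)
    dq = {p: d(q, p) for p in P}
    result = [p for p in P if dq[p] <= r]
    # 2) filter: o survives only if no pivot proves |d(q,p) - d(p,o)| > r
    survivors = [o for o in objects
                 if all(abs(dq[p] - Dp[p][o]) <= r for p in P)]
    # 3) verify the survivors directly against q
    result.extend(o for o in survivors if d(q, o) <= r)
    return result
```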
Shapiro's LAESA
- Very similar to LAESA.
- Database objects are sorted with respect to the first pivot.

     o2   o3   o1   o4   o5   o6
o2   0    1.0  1.6  2.6  2.6  4.2
o6   4.2  3.5  3.6  3.4  2.0  0

Shapiro's LAESA: Range Query
Given a query R(q,r):
- Compute d(q,p1).
- Start with the object oi "closest" to q, i.e., the one minimizing |d(q,p1) - d(p1,oi)|.
Example: p1 = o2, d(q,o2) = 3.2, r = 1.4 → o4 is picked, since |3.2 - 2.6| = 0.6 is minimal.

Shapiro's LAESA: Range Query (cont.)
- Next, oi is checked against all pivots:
  - discard it if |d(q,pj) - d(pj,oi)| > r for any pivot pj;
  - if not eliminated, compute d(q,oi) and report oi if d(q,oi) ≤ r.
Example: R(q,1.4) with d(q,o2) = 3.2 and d(q,o6) = 1.2.

Shapiro's LAESA: Range Query (cont.)
- The search continues with the objects oi+1, oi-1, oi+2, oi-2, …
  - until both |d(q,p1) - d(p1,oi+?)| > r and |d(q,p1) - d(p1,oi-?)| > r hold;
    the sort order guarantees that no further object can qualify beyond these bounds.
Example: p1 = o2, d(q,o2) = 3.2:
  |d(q,o2) - d(o2,o1)| = 1.6 > 1.4 → the scan stops on this side;
  |d(q,o2) - d(o2,o6)| = 1.0 ≤ 1.4 → o6 is still examined.

Spaghettis
- An improvement of LAESA.
- The m×n matrix is stored in m arrays of length n.
- Each array is sorted according to the distances in it.
- The position of an object o can vary from array to array:
  - pointers (or array permutations) with respect to the preceding array must be stored.

array of pivot o2: (o2, 0), (o3, 1.0), (o1, 1.6), (o4, 2.6), (o5, 2.6), (o6, 4.2)
array of pivot o6: 0, 2.0, 3.4, 3.5, 3.6, 4.2 (the object order differs and is tracked by the pointers)

Spaghettis: Range Query
Given a query R(q,r):
- Compute the distances to the pivots, i.e., d(q,pi).
- One interval is defined on each of the m arrays:
  [d(q,pi) - r, d(q,pi) + r] for all 1 ≤ i ≤ m.

Spaghettis: Range Query (cont.)
- Qualifying objects lie in the intersection of the intervals:
  - the pointers are followed from array to array.
- Non-discarded objects are checked against q.
Example: the response is o5, o6.

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
   1. Multi Vantage Point Tree
   2. Geometric Near-neighbor Access Tree
   3. Spatial Approximation Tree
   4. M-tree
   5. Similarity Hashing
5. approximate techniques

Hybrid Approaches: Introduction
- Structures that store pre-computed distances have high space requirements,
  but give a good performance boost during query processing.
- Hybrid approaches combine partitioning and pre-computed distances in a single system:
  - lower space requirements;
  - good query performance.

Multi Vantage Point Tree (MVPT)
- Based on the Vantage Point Tree (VPT); targeted at static collections as well.
- Tries to decrease the number of pivots, with the aim of improving performance
  in terms of distance computations.
- Stores distances to pivots in leaves:
  - these distances are evaluated during the insertion of objects.
- No object duplication:
  - objects playing the role of a pivot are stored only in internal nodes.
- Leaf nodes can contain more than one object.

MVPT: Structure
- Two pivots are used in each internal node (VPT uses just one).
- Idea: two levels of a VPT collapsed into a single node.
[Figure: a VPT over o1-o15 and the corresponding MVPT.]

MVPT: Internal Node
- Ball partitioning is applied; pivot p2 is shared by both branches.
- In general, MVPT can use k pivots in a node:
  - the number of children is 2^k!
  - multi-way partitioning can be used as well → m^k children.
[Figure: pivots p1, p2 with medians dm1, dm2, dm3 producing the partitions S1-S4.]

MVPT: Leaf Node
- A leaf node stores two "pivots" as well:
  - the first pivot is selected at random;
  - the second pivot is picked as the object furthest from the first;
  - the same selection is used in internal nodes.
- Capacity is c objects + 2 pivots.
- Along with each object, the distances to the first h pivots on the path
  from the root are stored:

     o1   o2   o3   o4   o5   o6
p1   1.6  4.1  1.0  2.6  2.6  3.3
p2   3.6  3.4  3.5  3.4  2.0  2.5

MVPT: Range Search
Given a query R(q,r):
- Initialize the array q.PATH of h distances from q to the first h pivots;
  the values are initialized to undefined.
- Start in the root node and traverse the tree (depth-first).

MVPT: Range Search (cont.)
In an internal node with pivots pi, pi+1:
- Compute the distances d(q,pi), d(q,pi+1) and store them in q.PATH
  if the pivots are within the first h pivots from the root.
- If d(q,pi) ≤ r, output pi; if d(q,pi+1) ≤ r, output pi+1.
- If d(q,pi) - r ≤ dm1 (the query ball reaches the inner region):
  - if d(q,pi+1) - r ≤ dm2, visit the first branch;
  - if d(q,pi+1) + r ≥ dm2, visit the second branch.
- If d(q,pi) + r ≥ dm1 (the query ball reaches the outer region):
  - if d(q,pi+1) - r ≤ dm3, visit the third branch;
  - if d(q,pi+1) + r ≥ dm3, visit the fourth branch.
(The slides omit the ±r terms here; they are needed for the range query
to be correct, in analogy to the VPT conditions.)

MVPT: Range Search (cont.)
In a leaf node with pivots p1, p2 and objects o1,…,oc:
- Compute d(q,p1), d(q,p2); if d(q,p1) ≤ r output p1, if d(q,p2) ≤ r output p2.
- For all objects o1,…,oc:
  - if d(q,p1) - r ≤ d(oi,p1) ≤ d(q,p1) + r
    and d(q,p2) - r ≤ d(oi,p2) ≤ d(q,p2) + r
    and ∀pj: q.PATH[j] - r ≤ oi.PATH[j] ≤ q.PATH[j] + r:
    - compute d(q,oi) and output oi if d(q,oi) ≤ r.
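The leaf-node filtering condition can be written compactly. A sketch (hypothetical field names `o.PATH` and `q_path`, following the slides' q.PATH):

```python
# MVPT leaf-node filtering sketch: o.PATH holds the precomputed distances of
# object o to the first h pivots on its root path, q_path the corresponding
# (already computed) query-pivot distances; undefined entries are None.

def mvpt_leaf_candidates(objects, q_path, r):
    """Keep only objects that no stored pivot distance can rule out."""
    survivors = []
    for o in objects:
        ok = all(qp - r <= op <= qp + r        # i.e. |q_path[j] - o.PATH[j]| <= r
                 for qp, op in zip(q_path, o.PATH)
                 if qp is not None)            # undefined entries are skipped
        if ok:
            survivors.append(o)                # d(q,o) still has to be computed
    return survivors
```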
Geometric Near-neighbor Access Tree (GNAT)
- An m-ary tree based on Voronoi-like partitioning:
  - m can vary with the level in the tree.
- A set of pivots P = {p1,…,pm} is selected from X:
  - split X into m subsets Si;
  - ∀o ∈ X-P: o ∈ Si if d(pi,o) ≤ d(pj,o) for all j = 1..m;
  - this process is repeated recursively.
[Figure: four pivots p1-p4 partitioning objects o1-o9 and the resulting tree.]

GNAT (cont.)
- Pre-computed distances are also stored:
  - an m×m table of distance ranges is kept in each internal node;
  - the minimum and maximum of the distances between each pivot pi
    and the objects of each subset Sj are stored.
[Figure: pivot pi, subset Sj, and the range [rl_ij, rh_ij].]

GNAT (cont.)
- The m×m table of distance ranges, e.g.:

        S1          S2          …   Sm-1        Sm
p1      [0.0, 2.1]  [2.3, 3.7]  …   [5.2, 6.0]  [1.0, 5.1]
p2      [3.0, 3.8]  [0.0, 1.5]  …   [6.9, 7.8]  [2.5, 6.4]
…       …           …           …   …           …
pm-1    [4.2, 7.0]  [2.8, 4.2]  …   [0.0, 0.9]  [5.9, 8.9]
pm      [2.1, 4.0]  [6.8, 8.3]  …   [8.0, 8.7]  [0.0, 4.2]

- Each range [rl_ij, rh_ij] is defined over Sj ∪ {pj}:
  rl_ij = min{d(pi,o) | o ∈ Sj ∪ {pj}},  rh_ij = max{d(pi,o) | o ∈ Sj ∪ {pj}}.
  - Notice that rl_ii = 0.

GNAT: Choosing Pivots
- For good clustering, pivots cannot be chosen randomly.
- From a sample of 3m objects, select m pivots:
  - three is an empirically derived constant;
  - the first pivot is chosen at random;
  - the second pivot is the object furthest from it;
  - the third pivot is the object furthest from the previous two,
    i.e., the minimum of the two distances is maximized;
  - … and so on, until we have m pivots.

GNAT: Range Search
Given a query R(q,r):
- Start in the root node and traverse the tree (depth-first).
- In internal nodes, employ the distance ranges to prune some branches.
- In leaf nodes, all objects are directly compared to q:
  - if d(q,o) ≤ r, report o to the output.

GNAT: Range Search (cont.)
In an internal node with pivots p1, p2,…, pm:
- Pick one pivot pi at random and compute d(q,pi).
- Gradually pick each next non-examined pivot pj:
  - if d(q,pi) - r > rh_ij or d(q,pi) + r < rl_ij,
    discard pj and its sub-tree.
- The remaining pivots pj are compared with q:
  - if d(q,pj) - r > rh_jj, discard pj and its sub-tree;
  - if d(q,pj) ≤ r, output pj;
  - the corresponding sub-tree is visited.
[Figure: queries q1, q2 against the ranges [rl_ij, rh_ij] and rh_jj.]
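A sketch of the internal-node pruning described above (illustration only; `ranges[i][j]` is assumed to hold the pair (rl_ij, rh_ij), and the single-random-pivot pass mirrors the simplified slide version):

```python
import random

# GNAT internal-node pruning sketch: rh_jj acts like a covering radius
# of subset Sj around its own pivot pj.

def gnat_node_visit(q, r, pivots, ranges, d, out):
    alive = set(range(len(pivots)))
    i = random.choice(tuple(alive))
    dqi = d(q, pivots[i])
    if dqi <= r:
        out.append(pivots[i])
    for j in tuple(alive - {i}):
        rl, rh = ranges[i][j]                  # all of Sj lies within [rl, rh] of pi
        if dqi - r > rh or dqi + r < rl:
            alive.discard(j)                   # Sj cannot intersect the query ball
    visit = []
    for j in alive - {i}:
        dqj = d(q, pivots[j])
        if dqj - r > ranges[j][j][1]:          # beyond rh_jj: subset out of reach
            continue
        if dqj <= r:
            out.append(pivots[j])
        visit.append(j)                        # sub-tree j must still be searched
    if dqi - r <= ranges[i][i][1]:             # pi's own sub-tree, pruned by rh_ii
        visit.append(i)
    return visit
```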
Spatial Approximation Tree (SAT)
- A tree based on Voronoi-like partitioning.
  - It also stores relations between partitions, i.e., an edge connects neighboring partitions.
  - For correctness in metric spaces, this would require edges between all pairs
    of objects in X; SAT therefore approximates such a graph.
- The root p is a randomly selected object from X.
  - A set N(p) of p's neighbors is defined.
  - Every object o ∈ X-N(p)-{p} is organized under the closest neighbor in N(p).
  - A covering radius is defined for every internal node (object).

SAT: Example
- Intuition of N(p):
  - each object of N(p) is closer to p than to any other object in N(p);
  - all objects in X-N(p)-{p} are closer to an object in N(p) than to p.
- The root is o1 with N(o1) = {o2,o3,o4,o5}:
  - o7 cannot be included, since it is closer to o3 than to o1;
  - the covering radius of o1 conceals all objects.
[Figure: a SAT over o1-o14 with root o1 and its covering radius rc.]

SAT: Building N(p)
- Construction of a minimal N(p) is NP-complete.
- Heuristic for creating N(p):
  - start with the pivot p, S = X-{p}, N(p) = {};
  - sort the objects in S with respect to their distances from p;
  - start adding objects to N(p): the new object oN is added
    if it is not closer to any object already in N(p) than to p.

SAT: Range Search
Given a query R(q,r):
- Start in the root node and traverse the tree.
- In internal nodes, prune some branches as described below.
- In leaf nodes, all objects are directly compared to q:
  - if d(q,o) ≤ r, report o to the output.

SAT: Range Search (cont.)
In an internal node with the pivot p and N(p):
- To prune some branches, locate the object oc ∈ N(p) ∪ {p} closest to q:
  - discard the sub-trees of od ∈ N(p) such that d(q,od) > 2r + d(q,oc);
  - the pruning effect is maximized when d(q,oc) is minimal.
[Figure: the ball of radius d(q,oc) + 2r around q; sub-trees outside it are pruned.]

SAT: Range Search (cont.)
- If we pick s2 as the closest object, pruning improves:
  the sub-tree of p2 will also be discarded.
- Therefore, select the closest object among more "neighbors":
  use p's ancestor and its neighbors as well.
[Figure: the enlarged candidate set shrinks the ball d(q,oc) + 2r and widens the pruned region.]

SAT: Range Search (cont.)
- Finally, apply the covering radii of the remaining objects:
  - discard od such that d(q,od) > rdc + r.
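The N(p) heuristic fits in a few lines; a sketch (illustration only; ties keep the candidate out of N(p) here):

```python
# Sketch of the N(p) construction heuristic: objects are considered in the
# order of their distance from p and join N(p) only if p is still their
# closest choice among p and the current neighbors.

def sat_neighbors(p, S, d):
    N = []
    for o in sorted(S, key=lambda x: d(x, p)):
        if all(d(o, p) < d(o, n) for n in N):   # closer to p than to every neighbor
            N.append(o)
    return N
```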
M-tree
- An inherently dynamic structure.
- Disk-oriented (fixed-size nodes).
- Built in a bottom-up fashion.
- Each node is constrained by a sphere-like (ball) region.
- Leaf node: data objects + their distances from a pivot kept in the parent node.
- Internal node: a pivot + the radius covering the subtree,
  and the distance from the pivot to the parent pivot.
- Filtering: covering radii + pre-computed distances.

M-tree: Extensions
- Bulk-loading algorithm:
  - considers the trade-off: dynamic properties vs. performance;
  - an M-tree building algorithm for a dataset given in advance;
  - results in a more efficient M-tree.
- Slim-tree:
  - a variant of the M-tree (dynamic);
  - reduces the fat-factor of the tree;
  - a tree with smaller overlaps between particular tree regions.
- Many variants and extensions exist - see Chapter 3.

Similarity Hashing
- A multilevel structure.
- One hash function (a ρ-split function) per level, producing several buckets.
- The first level splits the whole dataset.
- The next level partitions the exclusion zone of the previous level.
- The exclusion zone of the last level forms the exclusion bucket of the whole structure.

Similarity Hashing: Structure
[Figure: 4 separable buckets at the first level, 2 separable buckets at the second level,
 and the exclusion bucket of the whole structure.]

Similarity Hashing: ρ-Split Function
- Produces several separable buckets:
  - queries with radius up to ρ access one bucket at most;
  - if the exclusion zone is touched, the next level must be sought.
[Figure: separable buckets with exclusion zones of width 2ρ.]

Similarity Hashing: Features
- Bounded search costs for queries with radius ≤ ρ:
  - one bucket per level at maximum.
- Buckets of static files can be arranged in a way that the I/O costs
  never exceed those of a sequential scan.
- Direct insertion of objects:
  - the specific bucket is addressed directly by computing the hash functions.
- The D-index is based on similarity hashing:
  - it uses excluded middle partitioning as the hash function.

Survey of Existing Approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximate techniques

Approximate Similarity Search
- Space transformation techniques:
  - introduced very briefly here.
- Reducing the subset of data to be examined:
  - most techniques were originally proposed for vector spaces,
    and some can also be used in metric spaces;
  - some are specific to metric spaces.

Exploiting Space Transformations
- Space transformation techniques transform the original data space into another,
  more suitable space; dimensionality reduction is one example.
- They are typically distance-preserving and satisfy the lower-bounding property:
  - distances measured in the transformed space are smaller than
    those computed in the original space.

Exploiting Space Transformations (cont.)
- Exact similarity search algorithms:
  - search in the transformed space;
  - filter out non-qualifying objects by re-measuring the distances
    of the retrieved objects in the original space.
- Approximate similarity search algorithms:
  - search in the transformed space;
  - do not perform the filtering step, so false hits may occur.

BBD Trees
- A Balanced Box-Decomposition (BBD) tree hierarchically divides the vector space
  with d-dimensional non-overlapping boxes:
  - leaf nodes of the tree contain a single object;
  - BBD trees are intended as a main-memory data structure.

BBD Trees (cont.)
- Exact k-NN(q) search proceeds as follows:
  - find the leaf containing the query object;
  - enumerate the leaves in increasing order of distance from q
    and maintain the k closest objects;
  - stop when the distance of the next leaf is greater than d(q,ok).
- Approximate k-NN(q):
  - stop when the distance of the next leaf is greater than d(q,ok)/(1+ε).
- The distances from q to the retrieved objects are at most (1+ε) times larger
  than the distance to the k-th actual nearest neighbor of q.

BBD Trees: Exact 1-NN Search
[Figure: boxes 1-10 enumerated in increasing distance from q for 1-NN(q).]

BBD Trees: Approximate 1-NN Search
- The radius d(q,oNN)/(1+ε) is used instead!
- Regions 9 and 10 are not accessed:
  - they do not intersect the circle of radius d(q,oNN)/(1+ε).
- The exact NN is missed!
[Figure: the shrunken radius excludes the regions that contain the exact NN.]

Angle Property Technique
- Observed (non-intuitive) properties in high-dimensional vector spaces:
  - objects tend to have the same distance from one another,
    so they tend to be distributed on the surface of ball regions;
  - parent and child regions have very close radii,
    so all regions intersect one another;
  - the angle formed by a query point, the centre of a ball region,
    and any data object is close to 90 degrees;
    the higher the dimensionality, the closer to 90 degrees.
- These properties can be exploited for approximate similarity search.

Angle Property Technique: Example
[Figure: objects concentrate near the region surface;
 a region is accessed only when the angle α exceeds the threshold θ.]

Clustering for Indexing (Clindex)
- Performs approximate similarity search in vector spaces exploiting clustering techniques.
- The dataset is partitioned into clusters of similar objects:
  - each cluster is represented by a separate file stored sequentially on the disk.

Clindex: Approximate Search
- Approximate similarity search:
  - seeks the cluster containing (or the cluster closest to) the query object;
  - sorts the objects in the cluster according to their distance from the query.
- The search is approximate, since qualifying objects can belong
  to other (non-accessed) clusters.
- More clusters can be accessed to improve precision.

Clindex: Clustering
- Each dimension of the d-dimensional vector space is divided into 2^n segments:
  the result is (2^n)^d cells in the data space.
- Each cell is associated with the number of objects it contains.

Clindex: Clustering (cont.)
- Clustering starts accessing the cells in decreasing order of the number of contained objects:
  - if a cell is adjacent to a cluster, it is attached to that cluster;
  - if a cell is not adjacent to any cluster, it is used as the seed of a new cluster;
  - if a cell is adjacent to more than one cluster, a heuristic decides
    whether the clusters should be merged, or which cluster the cell belongs to.

Clindex: Example
[Figure: a grid of cell counts with the resulting clusters;
 retrieved objects vs. missed objects for a query.]

Vector Quantization Index (VQ-Index)
- Also based on clustering techniques to perform approximate similarity search.
- Specifically:
  - the dataset is grouped into (not necessarily disjoint) subsets;
  - lossy compression techniques are used to reduce the size of the subsets;
  - a similarity query is processed by choosing a subset in which to search;
  - the chosen compressed dataset is searched after decompressing it.

VQ-Index: Subset Generation
- Query objects submitted by users are maintained in a history file.
- The queries in the history file are grouped into m clusters by the k-means algorithm.
- For each cluster Ci, a subset Si of the dataset is generated:
  Si is the union of the k nearest neighbors of the queries in Ci.
- An object may belong to several subsets.

VQ-Index: Subset Generation (cont.)
- The overlap of the subsets versus performance can be tuned by the choice of m and k:
  - a large k implies more objects in a subset, so more objects are recalled;
  - a large m implies more subsets, so fewer objects need to be accessed.

VQ-Index: Compression
- Subset compression with vector quantization:
  - an encoder function Enc maps every vector to an integer taken from a finite set {1,…,n};
  - a decoder function Dec maps every number from {1,…,n} to a representative vector.
- By combining Enc and Dec, every vector is represented by a representative vector:
  - several vectors might be represented by the same representative.
- Enc is used to compress the content of Si by applying it to every object in Si.

VQ-Index: Approximate Search
Given a query q:
- The cluster Ci closest to the query is located first.
- An approximation of Si is reconstructed by applying the decoder function Dec.
- The approximation of Si is searched for qualifying objects.
- Approximation occurs at two stages:
  - qualifying objects may be included in other subsets, in addition to Si;
  - the reconstructed approximation of Si may contain vectors
    which differ from the original ones.

Buoy Indexing
- The dataset is partitioned into disjoint clusters.
- A cluster is represented by a representative element - the buoy.
- Clusters are bounded by a ball region having the buoy as its center and
  the distance from the buoy to the farthest element of the cluster as its radius.
- This approach can be used in pure metric spaces.

Buoy Indexing: Similarity Search
- Given an exact k-NN query, the clusters are accessed in increasing distance
  of their buoys from q, until the current result set cannot be improved,
  i.e., until d(q,ok) + ri < d(q,pi) for the next cluster,
  where pi is the buoy and ri is the radius.
- An approximate k-NN query can be processed by stopping when
  - either the previous exact condition is true, or
  - a specified ratio f of the clusters has been accessed.
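A sketch of both stopping rules for buoy-indexed k-NN search (illustration only; the (buoy, radius, members) cluster layout and the parameter f are naming assumptions of this example, and the exact rule is applied with the break semantics the slides suggest):

```python
import heapq

# Buoy-indexing k-NN sketch: clusters are visited in increasing buoy
# distance; the heap keeps the k best objects seen so far.

def buoy_knn(q, k, clusters, d, f=1.0):
    order = sorted(((d(q, b), rad, mem) for b, rad, mem in clusters),
                   key=lambda t: t[0])
    heap = []                                    # max-heap: (-dist, id, object)
    for visited, (dqb, rad, mem) in enumerate(order, start=1):
        # exact rule from the slides: stop once d(q,ok) + ri < d(q,pi)
        if len(heap) == k and -heap[0][0] + rad < dqb:
            break
        if visited > f * len(order):             # approximate rule: ratio f reached
            break
        for o in mem:
            dist = d(q, o)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, id(o), o))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, id(o), o))
    return sorted(((-nd, o) for nd, _, o in heap), key=lambda t: t[0])
```

With f = 1.0 only the exact rule can fire, so the same function covers both the exact and the approximate variant.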
Hierarchical Decomposition of Metric Spaces
- In addition to the previous ones, there are other methods that were expressly designed to:
  - work on generic metric spaces;
  - organize large collections of data.
- They exploit the hierarchical decomposition of metric spaces.

Hierarchical Decomposition of Metric Spaces (cont.)
These will be discussed in detail later on:
- Relative error approximation:
  - the relative error on the distances of the approximate result is bounded.
- Good fraction approximation:
  - retrieves k objects from a specified fraction of the objects closest to the query.
- Small chance improvement approximation:
  - stops when the chances of improving the current result are low.
- Proximity-based approximation:
  - discards regions with a small probability of containing qualifying objects.
- PAC (Probably Approximately Correct) nearest neighbor search:
  - the relative error on distances is bounded with a specified probability.