SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko
Similarity Search: Part II, Chapter 3

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Features of "Good" Index Structures
- Dynamicity
  - support insertions and deletions and minimize their costs
- Disk storage
  - for dealing with large collections of data
- CPU & I/O optimization
  - support distance measures with very different CPU requirements, e.g., L2 vs. the quadratic-form distance
- Extensibility
  - support different similarity queries, i.e., range queries and k-nearest-neighbor queries

Centralized Index Structures for Large Databases
1. M-tree family
2. Hash-based metric indexing
3. Performance trials

M-tree Family
- The M-tree
- Bulk-Loading Algorithm
- Multi-Way Insertion Algorithm
- The Slim Tree
- Slim-Down Algorithm
  - Generalized Slim-Down Algorithm
- Pivoting M-tree
- The M+-tree
- The M2-tree

The M-tree
- Inherently dynamic structure
- Disk-oriented (fixed-size nodes)
- Built in a bottom-up fashion
  - inspired by R-trees and B-trees
- All data in leaf nodes
- Internal nodes: pointers to subtrees plus additional information
- Similar to GNAT, but objects are stored in leaves

M-tree: Internal Node
- An internal node consists of an entry for each of its subtrees
- Each entry consists of:
  - a pivot: p
  - the covering radius of the subtree: rc
  - the distance from p to its parent pivot pp: d(p,pp)
  - a pointer to the subtree: ptr
- All objects in the subtree ptr are within the distance rc from p.
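The internal-node entry described above can be modeled as a small record. A minimal sketch in Python; the names `Node` and `InternalEntry` are illustrative, not taken from the book:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An M-tree node holding either routing entries or data entries."""
    entries: list = field(default_factory=list)
    is_leaf: bool = False

@dataclass
class InternalEntry:
    """One routing entry of an internal node: <p, rc, d(p,pp), ptr>."""
    pivot: object          # the pivot p
    radius: float          # the covering radius rc of the subtree
    dist_to_parent: float  # d(p, pp), distance to the parent pivot
    subtree: Node          # the pointer ptr to the covered subtree
```

The invariant from the slide, d(o, pivot) ≤ radius for every object o in the subtree, is what the search algorithms later rely on.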
M-tree: Leaf Node
- A leaf node contains data entries
- Each entry consists of the pair:
  - an object (or its identifier): o
  - the distance between o and its parent pivot: d(o,op)

M-tree: Example
[Figure: an M-tree over objects o1..o11. Each internal entry stores a pivot, its covering radius, and its distance to the parent pivot; each leaf entry stores an object and its distance to the parent pivot.]

M-tree: Insert
- Insert a new object oN:
- Recursively descend the tree to locate the most suitable leaf for oN
- In each step, enter the subtree with the pivot p for which:
  - no enlargement of the covering radius rc is needed, i.e., d(oN,p) ≤ rc
    - in case of ties, choose the entry whose pivot p is nearest to oN
  - otherwise, the enlargement of rc is minimized

M-tree: Insert (cont.)
- When a leaf node N is reached:
  - if N is not full, store oN in N
  - else Split(N,oN).
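The subtree-selection rule above can be sketched as a small function. This is an illustrative fragment, assuming entry tuples and Euclidean distance, not the book's code:

```python
import math

def choose_subtree(entries, o, d=math.dist):
    """Select the routing entry for inserting o.

    entries: a list of (pivot, covering_radius, subtree) tuples
    (an illustrative layout, not the M-tree's on-disk format)."""
    dists = [d(o, pivot) for pivot, _, _ in entries]
    # Entries whose covering radius already contains o need no enlargement.
    inside = [i for i, (_, rc, _) in enumerate(entries) if dists[i] <= rc]
    if inside:
        best = min(inside, key=lambda i: dists[i])      # nearest pivot wins ties
    else:
        # Otherwise minimize the required enlargement d(o, p) - rc.
        best = min(range(len(entries)), key=lambda i: dists[i] - entries[i][1])
    return entries[best][2], dists[best]
```

When no entry covers oN, the covering radius of the chosen entry would subsequently be enlarged to d(oN, p).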
M-tree: Split
Split(N,oN):
- Let S be the set containing all entries of N plus oN
- Select the pivots p1 and p2 from S
- Partition S into S1 and S2 according to p1 and p2
- Store S1 in N and S2 in a newly allocated node N'
- If N is the root:
  - allocate a new root and store the entries for p1 and p2 in it
- Else (let Np and pp be the parent node and the parent pivot of N):
  - replace the entry of pp with the entry of p1
  - if Np is full, then Split(Np,p2)
  - else store p2 in the node Np

M-tree: Pivot Selection
- Several pivot-selection policies:
  - RANDOM – select the pivots p1, p2 at random
  - m_RAD – select p1, p2 with the minimum (r1c + r2c)
  - mM_RAD – select p1, p2 with the minimum max(r1c, r2c)
  - M_LB_DIST – let p1 = pp and p2 = oi | maxi { d(oi,pp) }
    - uses only the pre-computed distances
- Two versions of most policies:
  - confirmed – reuse the original pivot pp and select only one new pivot
  - unconfirmed – select two new pivots (notation: RANDOM_2)
- In the following, the mM_RAD_2 policy is used.

M-tree: Split Policy
Partition S into S1 and S2 according to p1 and p2:
- Unbalanced
  - generalized hyperplane partitioning
- Balanced
  - larger covering radii
  - performs worse than the unbalanced policy
[Figure: the same set of objects split around p1 and p2 by the unbalanced and the balanced policy; the balanced split yields larger covering radii.]

M-tree: Range Search
Given R(q,r):
- Traverse the tree in a depth-first manner
- In an internal node, for each entry ⟨p,rc,d(p,pp),ptr⟩:
  - prune the subtree if |d(q,pp) – d(p,pp)| – rc > r
  - an application of the pivot-pivot constraint
[Figure: two situations in which the subtree around p can be pruned using only d(q,pp) and d(p,pp).]

M-tree: Range Search (cont.)
- If not discarded, compute d(q,p) and:
  - prune the subtree if d(q,p) – rc > r
  - an application of the range-pivot constraint
- All non-pruned entries are searched recursively.
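The two internal-node pruning tests can be written as tiny predicates; a minimal sketch with illustrative function names:

```python
def prune_by_parent(d_q_pp, d_p_pp, rc, r):
    """Pivot-pivot constraint: discard the subtree without computing d(q,p)
    when |d(q,pp) - d(p,pp)| - rc > r."""
    return abs(d_q_pp - d_p_pp) - rc > r

def prune_by_pivot(d_q_p, rc, r):
    """Range-pivot constraint: discard when d(q,p) - rc > r."""
    return d_q_p - rc > r
```

The first test is free (both distances are already stored or known); the second costs one distance computation and is applied only to entries that survive the first.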
[Figure: the query ball R(q,r) and a ball region that cannot be pruned, so d(q,p) must be computed.]

M-tree: Range Search in Leaf Nodes
- In a leaf node, for each entry ⟨o,d(o,op)⟩:
  - ignore the entry if |d(q,op) – d(o,op)| > r
  - else compute d(q,o) and check d(q,o) ≤ r
  - an application of the object-pivot constraint

M-tree: k-NN Search
Given k-NN(q):
- Based on a priority queue and the pruning mechanisms applied in the range search.
- Priority queue:
  - stores pointers to subtrees where qualifying objects can be found
  - for an entry E=⟨p,rc,d(p,pp),ptr⟩, the pair ⟨ptr,dmin(E)⟩ is stored, where
    dmin(E) = max { d(p,q) – rc, 0 }
- Range pruning: instead of a fixed radius r, use the distance to the current k-th nearest neighbor.

Bulk-Loading Algorithm
- The first extension of the M-tree
- An improved tree-building (insert) algorithm
- Requires the dataset to be given in advance
- Notation:
  - dataset X={o1,…,on}
  - number of entries per node: m
- The bulk-loading algorithm:
  - first phase: build the M-tree
  - second phase: refine the unbalanced tree

Bulk-Loading: First Phase
- Randomly select l pivots P={p1,…,pl} from X
  - usually l=m
- Assign each object of X to its nearest pivot, producing l subsets P1,…,Pl
- Recursively apply the bulk-loading algorithm to the subsets, obtaining l subtrees T1,…,Tl
  - leaf nodes contain at most l objects
- Create the root node and connect all the subtrees to it.
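The priority-queue search just described can be sketched end to end. The dict-based node layout and the Euclidean distance are illustrative assumptions, not the book's representation:

```python
import heapq, math

def knn(root, q, k, d=math.dist):
    """Best-first k-NN over a toy M-tree; nodes are dicts (illustrative):
    leaf:     {"leaf": True,  "entries": [(o, d_to_parent), ...]}
    internal: {"leaf": False, "entries": [(p, rc, child), ...]}"""
    heap = [(0.0, 0, root)]       # (dmin, tiebreak, node)
    result = []                   # max-heap of (-dist, obj): current k best
    counter = 1
    while heap:
        dmin, _, node = heapq.heappop(heap)
        if len(result) == k and dmin > -result[0][0]:
            break                 # no subtree can improve the k-th neighbor
        if node["leaf"]:
            for o, _ in node["entries"]:
                dist = d(q, o)
                if len(result) < k:
                    heapq.heappush(result, (-dist, o))
                elif dist < -result[0][0]:
                    heapq.heapreplace(result, (-dist, o))
        else:
            for p, rc, child in node["entries"]:
                child_dmin = max(d(q, p) - rc, 0.0)   # dmin(E) from the slides
                if len(result) < k or child_dmin <= -result[0][0]:
                    counter += 1
                    heapq.heappush(heap, (child_dmin, counter, child))
    return sorted((-nd, o) for nd, o in result)
```

The distance to the current k-th nearest neighbor, `-result[0][0]`, plays the role of the shrinking search radius.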
Bulk-Loading: Example (1)
[Figure: the dataset o1..o9 clustered around the pivots o1, o2, o3, and the resulting tree: a root with entries o1, o2, o3 over subtrees rooted in o'1, o'3, o"3, and o7.]

Bulk-Loading: Discussion
- The problem of choosing the pivots P={p1,…,pl}:
  - a sparse region → a shallow subtree
    - far objects are assigned to other pivots
  - a dense region → a deep subtree
- Observe this phenomenon in the example.

Bulk-Loading: Second Phase
- Refinement of the unbalanced M-tree
- Two techniques adjust the set of pivots P={p1,…,pl}:
  - under-filled nodes – reassign their objects to other pivots and delete the corresponding pivots from P
  - deeper subtrees – split them into shallower ones and add the obtained pivots to P

Bulk-Loading: Example (2)
- Under-filled nodes in the example: o'1, o9
[Figure: the under-filled nodes o'1 and o9 are dissolved and their objects reassigned to the remaining pivots.]

Bulk-Loading: Example (3)
- After the elimination of under-filled nodes.
[Figure: the tree after eliminating the under-filled nodes.]

Bulk-Loading: Example (4)
- The subtrees rooted in o4 and o3 are deeper than the others
- Split them into new subtrees rooted in o'4, o5, o"3, o8, o6, o7
- Add these pivots to P and remove o4, o3
- Build the super-tree (two levels) over the final set of pivots P={o2,o'4,o5,o"3,o8,o6,o7} from Example (3)

Bulk-Loading: Example (5) – Final
[Figure: the final balanced M-tree over the dataset o1..o9.]

Bulk-Loading: Optimization
- Reduce the number of distance computations in the recursive calls of the algorithm:
  - after the initial phase, we know the distances d(pj,oi) for all objects X={o1,…,on} and all pivots P={p1,…,pl}
  - assume the recursive processing of P1
  - a new set of pivots {p1,1,…,p1,l'} is picked
  - during clustering, every object o∈P1 is assigned to its nearest pivot
  - the distance d(p1,j,o) can be lower-bounded:
    |d(p1,o) – d(p1,p1,j)| ≤ d(p1,j,o)

Bulk-Loading: Optimization (cont.)
- If this lower bound is greater than the distance to the closest pivot p1,N found so far, i.e.,
  |d(p1,o) – d(p1,p1,j)| > d(p1,N,o),
  then the evaluation of d(p1,j,o) can be avoided.
- This cuts costs by 11% when pre-computed distances to a single pivot are used,
  and by about 20% when pre-computed distances to multiple pivots are used.
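The lower-bound pruning above can be sketched as a nearest-pivot assignment that skips hopeless candidates. The function name and data layout are illustrative assumptions:

```python
import math

def assign_with_lower_bound(objects, pivots, d_to_parent, pivot_to_parent,
                            d=math.dist):
    """Assign each object to its nearest pivot, skipping the evaluation of
    d(pivot, o) whenever the lower bound |d(p1,o) - d(p1,pivot)| already
    exceeds the best distance found so far.

    d_to_parent[o]:     pre-computed d(p1, o)
    pivot_to_parent[j]: pre-computed d(p1, pivots[j])"""
    assignment, computed = {}, 0
    for o in objects:
        best_j, best_d = None, float("inf")
        for j, p in enumerate(pivots):
            if abs(d_to_parent[o] - pivot_to_parent[j]) > best_d:
                continue                      # lower bound exceeds best: skip
            dist = d(o, p)
            computed += 1
            if dist < best_d:
                best_j, best_d = j, dist
        assignment[o] = best_j
    return assignment, computed
```

With distances to several pivots pre-computed, the same test would be repeated per pivot, which is where the stronger 20% saving comes from.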
Multi-Way Insertion Algorithm
- Another extension of the M-tree insertion algorithm
- Objective: build more compact trees
  - reduce the search costs (both I/O and CPU)
- Intended for dynamic datasets (not necessarily given in advance)
- Increases the insertion costs slightly
- The original single-way insertion visits exactly one root-leaf branch
  - the leaf with no, or the minimum, increase of the covering radius
  - not necessarily the most convenient one

Multi-Way Insertion: Principle
- When inserting an object oN:
- Run the point query R(oN,0)
- For all visited leaves (they can store oN without enlarging their radii), compute the distance between oN and the leaf's pivot
- Choose the leaf with the closest pivot
- If no leaf was visited, run the single-way insertion

Multi-Way Insertion: Analysis
- Insertion costs:
  - 25% higher I/O costs (more nodes examined)
  - higher CPU costs (more distances computed)
- Search costs:
  - 15% fewer disk accesses
  - almost the same CPU costs for range queries
  - 10% fewer distance computations for k-NN queries

The Slim Tree
- An extension of the M-tree – the same structure
  - speeds up insertion and node splitting
  - improves storage utilization
- A new node-selection heuristic for insertion
- A new node-splitting algorithm
- A special post-processing procedure
  - makes the resulting trees more compact.
Slim Tree: Insertion
- Starting at the root node, in each step:
- Find a node that covers the incoming object
- If there is none, select the node whose pivot is nearest
  - the M-tree would select the node whose covering radius requires the smallest expansion
- If several nodes qualify, select the one that occupies the minimum space
  - the M-tree would choose the node with the closest pivot

Slim Tree: Insertion Analysis
- Insufficiently occupied nodes are filled first
  - this defers splitting, boosts node utilization, and cuts the tree size
- Experimental results (with the same mM_RAD_2 splitting policy) show:
  - lower I/O costs
  - nearly the same number of distance computations
  - this holds for both the tree-building procedure and the query execution

Slim Tree: Node Split
- Splitting the overfilled nodes incurs high costs
- The mM_RAD_2 strategy is considered the best so far
  - complexity O(n^3), using O(n^2) distance computations
- The Slim Tree splitting is based on the minimum spanning tree (MST)
  - complexity O(n^2 log n), using O(n^2) distance computations
- The MST algorithm assumes a complete graph:
  - n objects as vertices
  - n(n-1) edges – the distances between objects

Slim Tree: Node Split (cont.)
- The splitting policy based on the MST:
  1. build the minimum spanning tree on the complete graph
  2. delete the longest edge
  3. the two resulting sub-graphs form the new nodes
  4. for each node, choose as its pivot the object whose maximum distance to the other objects in the group is the shortest

Slim Tree: Node Split – Example
[Figure: (a) the original Slim Tree node with objects o1..o7 and oN, (b) its minimum spanning tree, (c) the two new nodes obtained by deleting the longest edge.]

Slim Tree: Node Split – Discussion
- Does not guarantee a balanced split
- A possible variant (more balanced splits):
  - choose the most appropriate edge from among the longer edges in the MST
  - if no such edge is found (e.g., for a star-shaped dataset), accept the original unbalanced split
- Experiments show that:
  - tree building using the MST algorithm is at least forty times faster than the mM_RAD_2 policy
  - query execution time is not significantly better

Slim-Down Algorithm
- A post-processing procedure
- Reduces the fat-factor of the tree
  - basic idea: reduce the overlap between nodes on one level
  - minimize the number of nodes visited by a point query, e.g., R(o3,0)
[Figure: moving o3 from node N to node M lets the covering radius of N shrink, so R(o3,0) visits fewer nodes.]

Slim-Down Algorithm: The Principle
For each node N at the leaf level:
1. Find the object o furthest from the pivot of N.
2. Search for a sibling node M that also covers o. If such a node exists and is not fully occupied, move o from N to M and update the covering radius of N.
- Steps 1 and 2 are applied to all nodes at the given level. If any object was relocated during a complete loop, the entire algorithm is run again.
- Observe the move of o3 from N to M on the previous slide.

Slim-Down Algorithm: Discussion
- Infinite loops must be prevented
  - e.g., cyclic moving of the objects o4, o5, o6
- Limit the number of algorithm cycles
[Figure: a configuration in which the objects o4, o5, o6 could be moved between nodes forever.]
- Trials showed a reduction of I/O costs of at least 10%
- The idea of dynamic object relocation can also be applied to defer splitting:
  - move distant objects out of a node instead of splitting it.

Generalized Slim-Down Algorithm
- A generalization of the Slim-Down algorithm to the non-leaf tree levels
- The covering radii rc must be taken into account before a non-leaf entry is moved
- The generalized Slim-Down starts from the leaf level
  - it follows the original Slim-Down algorithm for leaves
- It then ascends the tree, terminating at the root

Generalized Slim-Down: The Principle
For each entry E=⟨p,rc,…⟩ of a node N at the given non-leaf level:
- pose the range query R(p,rc);
- the query determines the set of nodes whose regions entirely contain the query region;
- from this set, choose the node M whose pivot is closer to p than p's current parent pivot pp;
- if such an M exists, move the entry E from N to M;
- if possible, shrink the covering radius of N.
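The leaf-level relocation loop of the original Slim-Down algorithm can be sketched as follows; the dict-based leaf layout, the capacity, and the round limit are illustrative assumptions:

```python
import math

def slim_down(leaves, d=math.dist, capacity=4, max_rounds=10):
    """One simplified leaf-level Slim-Down pass. Each leaf is a dict
    {"pivot": p, "objects": [...]}; the covering radius is taken implicitly
    as max over d(pivot, o). The furthest object of a leaf is moved to a
    sibling whose region already covers it, shrinking the source radius."""
    def radius(leaf):
        return max((d(leaf["pivot"], o) for o in leaf["objects"]), default=0.0)

    for _ in range(max_rounds):          # bounded to prevent cyclic moves
        moved = False
        for n in leaves:
            if not n["objects"]:
                continue
            far = max(n["objects"], key=lambda o: d(n["pivot"], o))
            for m in leaves:
                if m is n or len(m["objects"]) >= capacity:
                    continue
                if d(m["pivot"], far) <= radius(m):   # m already covers far
                    n["objects"].remove(far)
                    m["objects"].append(far)
                    moved = True
                    break
        if not moved:
            break
    return leaves
```

The explicit `max_rounds` bound corresponds to the slide's advice to limit the number of algorithm cycles.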
Generalized Slim-Down: Example
- At the leaf level:
  - move two objects from the leaves of o3 and o4 to the leaf of o1 – the radii of o3 and o4 shrink
- At the upper level:
  - originally, node M contains o1, o4 and node N contains o2, o3
  - swap the node membership of o3 and o4
[Figure: nodes M and N before and after the generalized Slim-Down pass.]

Pivoting M-tree
- An upgrade of the standard M-tree
- Bounds the region covered by a node more tightly
  - additional ring regions restrict the ball regions
  - a ring region is given by a pivot p and two radii rmin, rmax
  - it contains the objects o such that: rmin ≤ d(o,p) ≤ rmax
- Basic idea:
  - select additional pivots
  - every pivot defines two boundary values between which all of a node's objects lie
  - the boundary values for each pivot are stored in every node.
(see a motivation example on the next slide)

PM-tree: Motivation Example
[Figure: in the original M-tree, the range query R(q,r) intersects the node's ball region; in the PM-tree with two pivots p1, p2, the ring regions exclude the node, so it is not visited for R(q,r).]

PM-tree: Structure
- Select an additional set of pivots, |P|=np
- A leaf node entry: ⟨o,d(o,op),PD⟩
  - PD – an array of npd pivot distances: PD[i]=d(pi,o)
  - parameter npd < np
- An internal node entry: ⟨p,rc,d(p,pp),ptr,HR⟩
  - HR – an array of nhr intervals defining ring regions
  - HR[i].min and HR[i].max delimit the distances d(pi,o) over all objects o in the subtree ptr
  - parameter nhr < np

PM-tree: Insertion
- Insertion of an object oN:
- The HR arrays of the nodes visited during the insertion must be updated with the values d(oN,pi) for all i ≤ nhr
- In the target leaf node:
  - create the array PD and fill it with the values d(oN,pj), ∀j ≤ npd
- The values d(oN,pj) are computed only once and used several times – max(nhr,npd) distance computations
- Insertions may force node splits

PM-tree: Node Split
- Node splits require some maintenance
- A leaf split:
  - set the HR arrays of the two new internal entries
  - set HR[i].min and HR[i].max as the min/max over the PD[i] values of the entries
  - compute the additional distances d(pj,o), ∀j (npd < j ≤ nhr), and take them into account
  - can be expensive if nhr >> npd
- An internal node split:
  - create two internal node entries with HR arrays
  - set these HR arrays as the union over all HR arrays of the respective entries

PM-tree: Range Query
Given R(q,r):
- Evaluate the distances d(q,pi), ∀i (i ≤ max(nhr,npd))
- Traverse the tree; an internal node entry ⟨p,rc,d(p,pp),ptr,HR⟩ is visited only if both of the following hold:
  - d(q,p) – rc ≤ r, and
  - for all i ≤ nhr: d(q,pi) – r ≤ HR[i].max and d(q,pi) + r ≥ HR[i].min
- Leaf node entry test: |d(q,pi) – PD[i]| ≤ r for all i ≤ npd, before d(q,o) is computed
- The M-tree applies the first condition only

PM-tree: Parameter Setting
- General statements:
  - the PD arrays in the leaves reduce the number of distance computations but increase the I/O costs
  - the HR arrays reduce both the CPU and the I/O costs
- Experiments show that:
  - npd=0 decreases the I/O costs by 15% to 35% compared to the M-tree (for various values of nhr)
  - the CPU costs are reduced by about 30%
  - npd=nhr/4 leads to the same I/O costs as for the M-tree
  - with this setting, searching is up to 10 times faster
- The particular parameter setting depends on the application

The M+-tree
- A modification of the M-tree
- Restricts the application to Lp metrics (vector spaces)
- Based on the concept of a key dimension
- Each node is partitioned into two twin-nodes
  - partitioned according to a selected key dimension

M+-tree: Principles
- Works in an n-dimensional vector space
- The key dimension for a set of objects is the dimension along which the data objects are most spread
- For any dimension Dkey and vectors (x1,…,xn), (y1,…,yn):
  |xDkey – yDkey| ≤ L2((x1,…,xn),(y1,…,yn))
- This holds also for the other Lp metrics
- This fact is applied to prune the search space

M+-tree: Structure
- An internal node is divided into two subsets
  - according to a selected dimension
  - leaving a gap between the two subsets
  - the greater the gap, the better the filtering
- An internal node entry: ⟨Dkey, ptrleft, dlmax, drmin, ptrright⟩
  - Dkey – the number of the key dimension
  - ptrleft, ptrright – pointers to the left and right twin-nodes
  - dlmax – the maximal key-dimension value in the left twin
  - drmin – the minimal key-dimension value in the right twin

M+-tree: Example
- Splitting an overfilled node:
  - the objects of both twins are considered as a single set
  - the standard mM_RAD_2 strategy is applied
- The key dimension is selected for each node separately
[Figure: an overfilled node with the object oN split into two twin-node pairs along the selected key dimensions.]
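The key-dimension filtering that the M+-tree applies during search can be sketched for a single twin pair. The dict layout and names are illustrative assumptions:

```python
import math

def search_twins(node, q, r, d=math.dist):
    """Range query over one M+-tree twin pair (illustrative structure):
    node = {"dkey": k, "dlmax": ..., "drmin": ..., "left": [...], "right": [...]}.
    Since |q[k] - o[k]| <= d(q, o) for any Lp metric, a twin whose
    key-dimension interval lies further than r from q[k] cannot qualify."""
    k, hits = node["dkey"], []
    if q[k] - r <= node["dlmax"]:        # the left twin may contain answers
        hits += [o for o in node["left"] if d(q, o) <= r]
    if q[k] + r >= node["drmin"]:        # the right twin may contain answers
        hits += [o for o in node["right"] if d(q, o) <= r]
    return hits
```

The wider the gap between dlmax and drmin, the more often one of the two scans is skipped, which is exactly the filtering argument made above.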
M+-tree: Performance
- Slightly more efficient than the M-tree
- Better filtering for range queries with small radii
- Practically the same for larger radii
- Nearest-neighbor queries:
  - a shorter priority queue – only one of the twin-nodes is enqueued
  - saves some time for queue maintenance
- Moderate performance improvements
- The application is restricted to vector datasets with Lp metrics

The M2-tree
- A generalization of the M-tree
- Able to process complex similarity queries
  - combined queries on several metrics at the same time
  - for instance: an image database with keyword-annotated objects and color histograms
  - query: find images that contain a lion and whose surrounding scenery is like this one
- Qualifying objects are identified by a scoring function df
  - it combines the particular distances (according to the several different measures)

M2-tree: Structure
- Each object is characterized by several features
  - e.g., o[1], o[2]
  - the respective distance measures may differ: d1, d2
- Leaf node: M-tree vs. M2-tree
- Internal node: M-tree vs. M2-tree
[Figures: the M2-tree node entries generalize the M-tree entries by storing one distance value, and one covering radius, per feature.]

M2-tree: Example
- The space transformation according to the particular features can be seen as an n-dimensional space
- The subtree region forms a hypercube
[Figure: the objects o1, o2, o4, o5 in the two-feature distance space with a rectangular subtree region.]

M2-tree: Range Search
Given R(q,r):
- The M-tree prunes a subtree if |d(q,pp) – d(p,pp)| – rc > r
- The M2-tree computes the analogous lower bound for every feature i:
  |di(q[i],pp[i]) – di(p[i],pp[i])| – rc[i]
- These bounds are combined using the scoring function df
- Only the entries for which the result is ≤ r are visited
- An analogous strategy applies to nearest-neighbor queries

M2-tree: Performance
- Running k-NN queries
- On the image database mentioned in the example
- The M2-tree compared with the sequential scan:
  - the same I/O costs
  - a reduced number of distance computations
- The M2-tree compared with Fagin's A0 (two M-trees):
  - the M2-tree saves about 30% of the I/Os
  - and about 20% of the distance computations
  - A0 has higher I/O costs than the sequential scan

Centralized Index Structures for Large Databases
1. M-tree family
2. Hash-based metric indexing
   - Distance Index (D-index)
   - Extended D-index (eD-index)
3. Performance trials

Distance Index (D-index)
- A hybrid structure
  - combines pivot filtering and partitioning
- A multilevel structure based on hashing
  - one ρ-split function per level
- The first level splits the whole dataset
- The next level partitions the exclusion zone of the previous level
- The exclusion zone of the last level forms the exclusion bucket of the whole structure

D-index: Structure
[Figure: a three-level D-index with 4 separable buckets at the first level, 2 separable buckets at the second level, and the exclusion bucket of the whole structure.]

D-index: Partitioning
- Based on the excluded middle partitioning
  - the ball-partitioning variant is used
q q qbps1,r(x)= • • • • • • • • • • • • • •0 if d(x,p) ≤ dm - r •1 if d(x,p) > dm + r •− otherwise • • •dm • • •2r • •p •Separable set 1 •Separable set 0 •Exclusion set •Similarity Search: •Part II, Chapter 3 •75 D-index: Binary r-Split Function nBinary mapping: bps1,r: D → {0,1,−} qr-split function, r ≥ 0 qalso called the first order r-split function n nSeparable property (up to 2r ): q "x,y Î D, bps1,r(x) = 0 and bps1,r(y) = 1 Þ d(x,y) > 2r qNo objects closer than 2r can be found in both the separable sets. nSymmetry property: "x,y Î D, r2 ≥ r1, n bps1,r2(x) ¹ −, bps1,r1(y) = − Þ d(x,y) > r2 - r1 q •Similarity Search: •Part II, Chapter 3 •76 • •2r • D-index: Symmetry Property nEnsures that the exclusion set “shrinks” in a symmetric way as r decreases. nWe want to test whether a query intersects the exclusion set or not. • •2(r+r) •q2 • •r • •q1 • •r • •Similarity Search: •Part II, Chapter 3 •77 • •dm1 •2r D-index: General r-Split Function nCombination of several binary r-split functions qtwo in the example n • • •dm2 •2r Čárový popisek 2 (se zvýrazněním): Separable set 1 •Separable set 1 Čárový popisek 2 (se zvýrazněním): Separable set 0 •Separable set 0 Čárový popisek 2 (se zvýrazněním): Exclusion set •Exclusion set Čárový popisek 2 (se zvýrazněním): Separable set 3 •Separable set 3 Čárový popisek 2 (se zvýrazněním): Separable set 2 •Separable set 2 •Similarity Search: •Part II, Chapter 3 •78 D-index: General r-Split Function nA combination of n first order r-split functions: qbpsn,r: D → {0..2n-1, −} q qbpsn,r(x) = q nSeparable & symmetry properties hold qresulting sets are also separable up to 2r. •− if $i, bpsi1,r(x) = − •b all bpsi1,r(x) form a binary number b • •Similarity Search: •Part II, Chapter 3 •79 D-index: Insertion • • • • • • • • • • • • • • • • • • • • • • • • •Similarity Search: •Part II, Chapter 3 •80 D-index: Insertion Algorithm nDindexr(X, m1, m2, …, mh) qh – number of levels, qmi – number of binary functions combined on level i. 
- The algorithm to insert an object oN:

  for i = 1 to h do
    if bpsmi,ρ(oN) ≠ − then
      store oN in the bucket with the index bpsmi,ρ(oN) on level i
      exit
    end if
  end do
  store oN in the global exclusion bucket.

D-index: Insertion Algorithm (cont.)
- The new object is inserted with one bucket access.
- Requires ∑i=1..j mi distance computations,
  - assuming oN was inserted into a bucket on the level j.

D-index: Range Query
- Dindexρ(X, m1, m2, …, mh)
  - h – the number of levels
  - mi – the number of binary functions combined on level i.
- Given a query R(q,r) with r ≤ ρ:

  for i = 1 to h do
    search in the bucket with the index bpsmi,0(q)
  end do
  search in the global exclusion bucket.

- The objects o with d(q,o) ≤ r are reported on the output.

D-index: Range Search (cont.)
[Figure: the query ball R(q,r) on successive levels: inside a separable set, overlapping the exclusion zone, and contained in the exclusion zone.]

D-index: Range Query (cont.)
- The call bpsmi,0(q) always returns a value between 0 and 2^mi − 1, never −.
- Exactly one bucket per level is accessed if r ≤ ρ
  - h+1 bucket accesses in total.
- Reducing the number of bucket accesses:
  - the query region lies entirely in the exclusion set ⇒ proceed to the next level directly,
  - the query region lies entirely in a separable set ⇒ terminate the search.

D-index: Advanced Range Query

  for i = 1 to h
    if bpsmi,ρ+r(q) ≠ − then (the query is exclusively in a separable bucket)
      search in the bucket with the index bpsmi,ρ+r(q)
      exit (the search terminates)
    end if
    if r ≤ ρ then (the search radius up to ρ)
      if bpsmi,ρ−r(q) ≠ − then (the query is not exclusively in the exclusion zone)
        search in the bucket with the index bpsmi,ρ−r(q)
      end if
    else (the search radius greater than ρ)
      let {i1,…,in} = G(bpsmi,ρ−r(q))
      search in the buckets with the indexes i1,…,in
    end if
  end for
  search in the global exclusion bucket.

D-index: Advanced Range Query (cont.)
- The advanced algorithm is not limited to r ≤ ρ.
- All the tests for avoiding some bucket accesses are based on manipulating the parameters of the split functions (i.e., ρ).
- The function G() returns a set of bucket indexes:
  - every − in the split function's result is substituted by all combinations of ones and zeros,
  - e.g., for bps3,ρ(q) = '1−−':
    G(bps3,ρ(q)) = {100, 101, 110, 111}

D-index: Features
- Supports disk storage
- An insertion needs one bucket access
  - the distance computations vary from m1 up to ∑i=1..h mi
- At most h+1 bucket accesses
  - for all queries such that the qualifying objects are within ρ
- Exact match (R(q,0)):
  - successful – one bucket access
  - unsuccessful – typically no bucket is accessed

Similarity Join Query
- The similarity join of two sets X and Y can be evaluated by a simple algorithm which computes the distances between all pairs of objects:
  |X||Y| = NM distance computations.
[Figure: the sets X and Y with all pairs of objects connected.]

Similarity Self Join Query
- The similarity self join examines all pairs of objects of a set X, which takes |X||X| distance computations.
- Due to the symmetry property d(x,y) = d(y,x), the costs can be reduced to
  |X|(|X|−1)/2 distance computations.
- This is called the nested loops algorithm (NL); it forms the baseline for all the other algorithms.

Similarity Self Join Query (cont.)
- Specialized algorithms are
  - usually built on top of a commercial DB system, or
  - tailored to the specific needs of an application.
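The index expansion performed by the function G() described above can be sketched as:

```python
from itertools import product

def g_expand(pattern):
    """Expand a split-function result such as '1--' into the set of bucket
    indexes obtained by substituting every '-' with both 0 and 1."""
    slots = [("0", "1") if c == "-" else (c,) for c in pattern]
    return {"".join(bits) for bits in product(*slots)}
```

For the slide's example, `g_expand("1--")` yields the four indexes {100, 101, 110, 111}.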
- The D-index provides a very efficient algorithm for range queries, so a self-join query can be evaluated using the

  Range Join Algorithm (RJ):
    for each o in the dataset X do
      range_query(o, μ)
    end do

Extended D-index (eD-index)
- A variant of the D-index which provides a specialized algorithm for similarity joins.
- Application independent – a general solution.
- The split functions manage replication.
- The D-index's algorithms for range and k-NN queries are only slightly modified.

eD-index: Similarity Self Join Query
- The similarity self join is elaborated independently in each bucket.
- The result set is the union of the answers of all the sub-queries.
[Figure: a pair of objects within distance μ straddling separable set 0 and the exclusion set is lost, because its members fall into different buckets.]

eD-index: Overloading Principle
- Lost pairs are handled by replication
  - the areas of width ε adjacent to the exclusion set are replicated into the exclusion set
- Requires μ ≤ ε
[Figure: the objects within distance ε of the exclusion set are duplicated into it, so the formerly lost pair is found there.]

eD-index: ρ-Split Function Modification
- The modification of the ρ-split function is implemented in the insertion algorithm by varying the parameter ρ
  - the original stop condition of the D-index's algorithm is changed.
[Figure: the exclusion zone of width 2ρ extended by the overloading areas to the width 2(ρ+ε).]

eD-index: Insertion Algorithm
- eDindexρ,ε(X, m1, m2, …, mh)
- The algorithm to insert an object oN:

  for i = 1 to h do
    if bpsmi,ρ(oN) ≠ − then
      store oN in the bucket with the index bpsmi,ρ(oN)
q if bpsmi,r+e(oN) ¹ ‘-’ then (not in the overloading area) q exit q end if q end if q end do q oN ® global exclusion bucket. •Similarity Search: •Part II, Chapter 3 •96 l •Bucket of 1st level • •Bucket of 2nd level eD-index: Handling Duplicates • • • • • • • • • • • •e • • • • • • •3rd level • • • • • •2nd level • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •1st level • • • •brown •green •blue •brown •green •The duplicates received brown & green colors. •Similarity Search: •Part II, Chapter 3 •97 eD-index: Overloading Join Algorithm nGiven similarity self-join query SJ(m): nExecute the query in every separable bucket on every level qand in the global exclusion bucket. nIn the bucket, apply sliding window algorithm. nThe query’s result is formed by concatenation of all sub-results. •Similarity Search: •Part II, Chapter 3 •98 nUse the triangle inequality qto avoid checking all pairs of objects in the bucket. nOrder all objects on distances to one pivot. nThe sliding window is then moved over all objects. qonly pairs of objects in the window are examined. n n • •m eD-index: Sliding Window • • • • • • • • • • • • • • • • • • • • • • • • • • • • nDue to the triangle inequality, the pair of objects outside the window cannot qualify: qd(x,y) ³ d(x,p) - d(y,p) > m n •p •Similarity Search: •Part II, Chapter 3 •99 eD-index: Sliding Window (cont.) nThe algorithm also employs qthe pivot filtering and qthe eD-index’s coloring technique. q nGiven a pair of objects o1,o2: qif a color is shared, this pair must have been reported on the level having this color – the pair is ignored without distance computation, else qif d(o1,o2)≤m , it is an original qualifying pair. •Similarity Search: •Part II, Chapter 3 •100 eD-index: Limitations nSimilarity self-join queries only qthe query selectivity must satisfy: m ≤ e. qit is not very restrictive since we usually look for close pairs. nThe parameters r and e depend on each other. 
  - ε ≤ 2ρ
  - if ε > 2ρ, the overloading zone is wider than the exclusion zone:
    - because we do not replicate objects between separable sets – only between a separable set and the exclusion zone –
    - some qualifying pairs might be missed.

Centralized Index Structures for Large Databases
1. M-tree family
2. hash-based metric indexing
3. performance trials

Performance Trials
- experiments on M-tree and D-index
- three sets of experiments:
  1. comparison of M-tree (a tree-based approach) vs. D-index (a hash-based approach)
  2. processing of different types of queries
  3. scalability of the centralized indexes – growing the size of the indexed dataset

Datasets and Distance Measures
- trials performed on three datasets:
  - VEC: 45-dimensional vectors of image color features compared by the quadratic-form distance
  - URL: sets of URL addresses; the distance measure is based on the similarity of sets (Jaccard's coefficient)
  - STR: sentences of a Czech language corpus compared using the edit distance

Datasets: Distance Distribution
- distribution of distances within the datasets:
  - VEC: practically normal distance distribution
  - URL: discrete distribution
  - STR: skewed distribution

Trials: Measurements & Settings
- CPU costs: number of distance computations
- I/O costs: number of block reads
  - the same size of disk blocks for all structures
- query objects follow the dataset distribution
- average values over 50 queries:
  - different query objects
  - the same selectivity: radius or number of nearest neighbors

Comparison of Indexes
- comparing the performance of:
  - M-tree – a tree-based approach
  - D-index – a hash-based approach
  - sequential scan (baseline)
- dataset of 11,100 objects
- range queries with increasing radius
  - maximal selectivity about 20% of the dataset
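Two of the three distance measures used in the trials are easy to sketch; the following minimal implementations of Jaccard's distance (URL dataset) and the edit distance (STR dataset) are illustrative, not the trial code itself:

```python
def jaccard_distance(a, b):
    """Set distance for the URL dataset: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0          # two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance for the STR dataset, via dynamic programming."""
    prev = list(range(len(t) + 1))          # distances from '' to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ''
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```

Both satisfy the metric postulates, which is what lets the same M-tree and D-index structures serve all three datasets unchanged.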
Comparison: CPU Costs
- generally, D-index outperforms M-tree for smaller radii
  - D-index's pivot-based filtering depends on the data distribution and the query size
- M-tree outperforms D-index for the discrete distribution (URL)
  - pivot selection is more difficult for discrete distributions

Comparison: I/O Costs
- M-tree needs twice the disk space of SEQ to store the data
  - inefficient if the distance function is easy to compute
- D-index is more efficient
  - a query with r = 0: D-index accesses only one page (important, e.g., for deletion)

Different Query Types
- comparing the processing performance of different types of queries:
  - range query
  - nearest neighbor query
  - similarity self join
- M-tree, D-index, sequential scan

Range vs. k-NN: CPU Costs
- nearest neighbor query:
  - similar trends for M-tree and D-index
  - the D-index advantage of small-radius processing decreases
  - expensive even for small k – similar costs for both k = 1 and k = 100
  - D-index still twice as fast as M-tree

Range vs. k-NN: I/O Costs
- nearest neighbor query:
  - similar trends for I/O costs as for CPU costs
  - D-index four times faster than M-tree

Similarity Self Join: Settings
- J(X,X,μ) – a very demanding operation
- three algorithms to compare:
  - NL: nested loops – the naive approach
  - RJ: range join – based on the D-index
  - OJ: overloading join – based on the eD-index
- for μ: 2μ ≤ ρ, i.e., μ ≤ 600 for vectors
- datasets of about 11,000 objects
- selectivity – retrieving up to 1,000,000 pairs (for high values of μ)

Similarity Self Join: Complexity
- quadratic complexity
  - prohibitive for a large DB
  - example: 50,000 sentences
    - a range query: a sequential scan takes about 16 seconds
    - a self join query: the nested loops algorithm takes 25,000 times more – about 4 days and 15 hours!

Similarity Join: Results
- RJ and OJ costs increase rapidly (logarithmic scale)
- OJ outperforms RJ twice (STR) and 7 times (VEC):
  - high distances between VEC objects
  - high pruning effectiveness of pivot-based filtering for smaller μ

Scalability: CPU Costs
- labels: radius or k + D (D-index), M (M-tree), SEQ
- data: from 100,000 to 600,000 objects
- M-tree and D-index are faster (D-index slightly better)
- linear trends
  - range query: r = 1,000; 2,000
  - k-NN query: k = 1; 100

Scalability: I/O Costs
- the same trends as for CPU costs
- D-index more efficient than M-tree
- exact-match contrast:
  - M-tree: 6,000 block reads + 20,000 distance computations for 600,000 objects
  - D-index: 1 block read + 18 distance computations regardless of the data size

Scalability: Similarity Self Join
- We use the speedup s as the performance measure:

  s = N(N−1) / (2n)

  where N(N−1)/2 is the number of distance computations of the nested loops algorithm and n is the number of distance computations of the measured algorithm.
- The speedup measures how many times a specific algorithm is faster than NL.
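The arithmetic behind the 50,000-sentence example and the speedup measure s = N(N−1)/(2n) can be verified with a short script (the 16-second scan time is the figure quoted on the complexity slide):

```python
N = 50_000
scan_cost = N                     # a range query: one distance per object
nl_cost = N * (N - 1) // 2        # nested loops: all unordered pairs

ratio = nl_cost / scan_cost       # how much more work the self join does
print(ratio)                      # 24999.5, i.e. about 25,000 times

seconds = ratio * 16              # 16 s per sequential scan (from the slide)
print(seconds / 86_400)           # about 4.63 days, i.e. 4 days and 15 hours

def speedup(n_alg, N=N):
    """Speedup s of an algorithm doing n_alg distance computations vs NL."""
    return N * (N - 1) / (2 * n_alg)
```

By construction, NL itself has speedup 1; any algorithm that prunes pairs pushes s above 1.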
Scalability: Similarity Self Join (cont.)
- STR dataset: from 50,000 to 250,000 sentences
- constant speedup, i.e., the quadratic trend remains:
  - e.g., a join query on 100,000 objects takes 10 minutes,
  - the same join query on 200,000 objects takes 40 minutes.
- OJ is at least twice as fast as RJ
  - RJ: range join
  - OJ: overloading join

Scalability Experiments: Conclusions
- similarity search is expensive
- the scalability of centralized indexes is linear
- they cannot be applied to huge data archives
  - they become inefficient after a certain point
- possible solutions:
  - sacrifice some precision: approximate techniques
  - use more storage & computational power: distributed data structures

eD-index: ρ-Split Function Modification
[Figure: around pivot p, separable sets 0 and 1 and the exclusion set; the zone of width 2ρ at distance dm is widened to 2(ρ+ε)]

  bsp^{1,ρ}(x) = '0'         if d(x,p) ≤ dm − ρ − ε
               = '0', copy   if dm − ρ − ε < d(x,p) ≤ dm − ρ
               = '1'         if d(x,p) > dm + ρ + ε
               = '1', copy   if dm + ρ < d(x,p) ≤ dm + ρ + ε
               = '−'         otherwise

(This slide is not part of the presentation; it is kept only for potential future use.)
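The modified first-order split function above can be transcribed almost directly; in this sketch the boolean "copy" flag marks objects that are also replicated into the exclusion set (the function name and signature are illustrative):

```python
def bsp(d_xp, dm, rho, eps):
    """Modified rho-split function of the eD-index (one pivot).
    d_xp is the precomputed distance d(x, p); returns the separable-set
    symbol plus copy=True for objects replicated into the exclusion set."""
    if d_xp <= dm - rho - eps:
        return ('0', False)                 # deep inside separable set 0
    if d_xp <= dm - rho:
        return ('0', True)                  # set 0, copied to exclusion set
    if d_xp > dm + rho + eps:
        return ('1', False)                 # deep inside separable set 1
    if d_xp > dm + rho:
        return ('1', True)                  # set 1, copied to exclusion set
    return ('-', False)                     # exclusion set
```

Note that the widened band of width ε lies just inside each separable set, so every pair of objects within distance μ ≤ ε of each other meets in at least one bucket.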